Universal Data Analytics as Semantic Spacetime

Part 5. Graph spacetime, association types, and causal semantics

Data analysis inevitably involves relationships between sources and the signals they give rise to — with, of course, their associated semantics. So far we looked at only simple tabular associations (so-called documents), typical of what’s usually stored in databases. But we can do much better than this when complex causation is involved. Charting relationships qualitatively and quantitatively, as a graph (the mathematical name for a network), may offer a more natural and complete way of representing processes. The graph format has the additional benefit of visually mapping to the trajectories of unfolding processes (such as fluid motion, state transitions, or even logical flow charts).

Types are not enough…

I share this story because we sometimes imagine that modelling choices are easy, and that analysis will be more or less the same no matter what static data model we choose. Any model is surely as good as any other — it’s just a matter of style, right? But this is not so. Sometimes we set ourselves up for success or failure by the choices we make at the start. That’s also true about the choice of tools. Happily, Go with ArangoDB, have proven to be the clearest and most efficient version of these ideas I’ve been able to produce so far!

In this fifth post of the series, I want to show how we can represent completely general semantic relationships, between events in a process, using the semantic spacetime tools and a graph representation. Using Go and ArangoDB, this has never been easier, but you might have to tune your thinking to a different frequency to avoid some pitfalls.

In computer coding, the concept of data types is by now quite ubiquitous and is used with a specific purpose in mind: to constrain programmers, by being intentionally inflexible! Data types, like double-entry book-keeping in accounts or categories, are designed to highlight errors of reasoning by forcing a discipline onto users. Types have to match in all transactions, else syntax error!

Graph spacetime

          A  B  C  D  E  F  G  H
A 0 - - - - - - -
^ B - 0 - 3 - - - -
| C - - - - - - - -
| D - 1 - 0 - - - -
row E - - - - 0 - - -
| F - - - - - 0 - -
| G - - - - - - 0 -
v H - - - - - - - 0
<--- column ---> matrix(r,c) = value
matrix(2,4) = 3
matrix(4,2) = 1

Now there is a reason to store information separately from the nodes, because some of it belongs to both, or the connection between them. Shared information belongs to the directed path between the end-points.

This is also a key-value structure here, of course. The labels are still the keys — but now two dimensional key-pairs (k1, k2), which act as (y,x) coordinates formed from the nodes. The values belonging to the paths between them are the numbers in the table itself. This matrix is completely equivalent to a graph or network. The values within the table, at these coordinates, represent links or edges from row node to column node. If the process edges have no arrow head or directionality, the matrix must be symmetrical, because A connected with B implies B connected with A, but that doesn’t have to be the case. A one way street connects A to B but not vice versa. Also, the diagonal values are often zero, because we don’t usually imagine needing to connect a node to itself. Every node is next to itself, right? But this is also a failure of imagination, perpetuated by our prejudices about space. If we think of the arrows as processes, not distances, then any node can talk to itself or change itself. Indeed, this kind of “self-pumping” turns out to be very important in directed processes like recurrent networks for keeping nodes alive.

In the table above, we see that B influences D with a value 3, but D only influences B with a value of 1. Perhaps B called D 3times, but D only called B once. Clearly, we can use as as a kind of histogram too, by interpreting the numbers as frequencies e.g. conversation frequency. To get the one-dimensional summary, we just have to sum over the columns of the “ledger”. This is how a statistician might look at a graph — but it’s far from the whole story.

We’re conditioned to think about space as Euclidean space, i.e. a world of a few things or locations that’s full of inert padding. Graphs are a more direct representation of processes, without all the “wasted” space in between. We might draw graphs embedded in Euclidean space (as a picture) in order to satisfy our senses, but we don’t have to. The matrix shows that the empty space of the page has no meaning — all the information is in the matrix. The empty space is not represented. Classically, scientists have focused on just putting numbers in the matrix, but in the modern era, we need much more detailed symbolic information to make sense of processes. Graphs and their fraught relationship to Euclidean space are at the heart of machines learning and statistics.

Graphs are good at representing highly inhomogeneous processes (data with types, which includes ontologies as well as control information). In a Euclidean space, every point or place is just the same as the next, and we can only compare relative locations. This assumption of translational invariance in data is a big topic in mathematics and physics (group theory), but it’s founded on a prejudice that the world is ultimately homogeneous at its root. This isn’t a useful point of view in a semantic spacetime, e.g. in biology, technology, sociology, etc. Euclidean space was invented to model the homogeneity and isotropy of spacetimes — the fact that in mostly empty space nothing much happens except at special places, which is an assumption about the representation, not the process. Historically, we then called those special places “bodies” or “particles” (matter) to distinguish “nothing happening” from “something happening”. Graphs work differently: nodes provide a kind of material location at every vertex, so the idea of matter is redundant. This means graphs can have different properties at every single node or point.

It’s hard to imagine different properties at different absolute locations in Euclidean space, partly because there is an assumption of its emptiness (featurelessness or typelessness), and there isn’t a natural way to describe boundaries on an arbitrary scale in Euclidean space, so we tend to treat it as featureless. But this is a mistake, if you don’t assign meaning to the quadrants on a page (e.g. upper right is good, lower left is bad, as in marketing graphics), no conclusion had any intrinsic meaning, only relative meaning. But we know this isn’t natural in the world around us: absolute properties model context, or boundary information. If you’re in New York, Times Square is obviously different from Central Park not only in size, but in its features. Our inability to escape Euclidean ideas is tied to our sensory and cognitive model of distance. This is the semantic trap that the field of statistics easily falls into: letting representation dictate interpretation. But in physics, we’ve rediscovered graph importance especially in quantum physics (path integrals, lattices, etc.) — events form trajectories, where transitions between states are the important feature, not distances. Trajectories were once treated as secondary patterns in physics, but in fact they are the fundamental description of processes. The empty spaces we introduce in between steps is just scaffolding that we end up integrating away.

As we start out modelling phenomena with graphs, all these issues of pride and prejudice will get in the way all the time. Is a graph a spacetime? Is Euclidean more fundamental or is a graph more fundamental? Semantic spacetime is a model which allows us to see these issues clearly. And Machine Learning is an area where the tension between Euclidean and graph representations reaches a crescendo — so, as you expect, I’ll be discussing tis more in the rest of the series.

Graphs: associations, links, or “edges” vs types

X must always contain a Y (internally at nodes)

(an obligation imposed on the modeller), to less rigid but potentially more messy assertions:

X seems to be associated with Y to some degree Z (on the path between nodes).

The latter is a promise about the journey, possibly inferred from observation.

In the past, semantic tools like RDF, Topic Maps, and OWL, were attempted to try to model expressions like these, but — although popular with a small nice of users — these were largely unsuccessful outside the realm of academic papers. The reason was that logical models quickly incur significant (and sometimes unresolvable) complexity, due to the rigidity of their type models — this is what we saw in the OO example cited at the start of this post. It’s a “square peg and round hole” kind of issue. To get beyond these pitfalls, my own research led me to the interpretation of graph links or “edges” as collective sets, using the Four Basic Types of Semantic Spacetime.

Let’s look at some edge cases (pun intended!). Glancing through some books on graph databases, I found the following relationships used to argue the virtues of graph database models:

  1. A emailed B
  2. A is the child of B
  3. A employs B
  4. A is an alias for B
  5. A belongs to B

These are not the kinds of association that I (personally) would have chosen to expound the virtues of graphs, but let’s start here anyway. The first thing we can note is that these are directed relationships (see figure 1) between different kinds of entity. Each relationship is an arrow going in one direction. The expressions can’t be reversed without changing their interpretation. They impose an order of events onto the As and the Bs. One state FOLLOWS another, as ordered events. If we try to reverse the relations, we need a different linguistic expression, e.g.

A emailed B => B was emailed by A

see also figure 1, below.

Figure 1: A directed, labelled “edge” or link in a graph between objects A and B.

The direction of a relationship is an expression of its underlying causality, i.e. the stream of changes that make up a world line or trajectory in the “spacetime” being represented. In the case of emailing someone, this link is a pretty minor association, likely to be only a tiny part of a larger process, which is probably what we’re interested in. To see the bigger picture, we need to know and combine all the steps in that larger process. Indeed, it’s in the combination and recombination of links into process chains that the difficulties arise for semantic graphs. The rules for combining allowed types in paths may be quite different every time! e.g. A emails B to go to the shop to buy a C that will repair a D, etc. The non-generalizability of these association names — and composition between association meanings — is a big challenge to modelling. It’s what eventually breaks logical approaches to semantics, life RDF and OWL, because type-mismatches make it impossible to formulate general manageable rules of inference.

Nodes, “vertices”, agents — and their types

It’s not hard to see how we could get into trouble trying to reason about graphs in this way. There’s a hundred ways in which A could influence B. We don’t want to throw away the specific description, but it may not matter for all intents and purposes. We keep it because we almost certainly want to read it back as part of a story interpretation, but we usually don’t need it for searching. We don’t want to end up in the situation of having to enumerate every detail, like this:

IF A (emailed OR spoke OR shouted OR texted OR tweeted OR *) TO B

Using types is no guarantee of not getting in a muddle. That’s one reason why people resort to numerical statistics and probabilities that seem to replace meaning with a threshold value — creating a layer of obfuscation. Semantics add overhead and formality, but they reveal crucial information about processes.

Rethinking associative link semantics

  • A “CONTAINS” B (a kind of spatial relationship),
  • A “FOLLOWS” B (a kind of temporal relationship),
  • A “EXPRESSES” B (a local property) and
  • A “is NEAR to” B.

These relationships apply just as well in either real and virtual spacetimes. If we classify all relations into these intentional types first, and use the more specific labels (emailing etc) as an supplementary attribute, then we can simplify all the possible ways that A could influence B, leading to simple rules. We just need to decide whether what A did led to what B did (FOLLOWS), or whether A simply declared something local (EXPRESS), and so on.

The key difference in this scheme lies in the notion of causal transitivity. If A emails B and B emails C, can we infer that A (effectively) emails C? Maybe, but probably not. It actually depends on the circumstances or context, the agents, and the process! This is what we need to understand when modelling graphs effectively.

I’ll be returning to discuss this matter further in future posts, but for now let’s dive into the details and look at how to represent these annotated meta-relationships using the Arangolang tools.

Modelling sort-of graphs in Go

var child_of = make(map[string]string)child_of[“A”] = B

The interpretation of this association, i.e. its semantics, is encoded in the name of the array itself (“child of”). If we copy the data into an array with a different name, its meaning is lost.

Another way to code this would be to write it as a quasi-link, which makes contact with graph theoretic terms:

type VectorPair struct {
From string
To string
}
var employs = make(map[Pair]bool)employs[VectorPair{From: “A”, To: “B”}] = true

The code for these examples is here.

Associations in Arango

Using different collections allows us to optimize searching by type. By separating vertices (nodes) and edges (links) according to their semantics, we can avoid having to browse candidates that we know to be impossible matches during searching. It’s like separating files into different directories (recall this discussion in part 3). If we want Mark’s files, then there’s no need to look in Sarah’s home directory. We could put all files into one directory and filter by username, but then we’re just making trouble for ourselves.

Figure 2: A graph representation of some “entities”. It’s a natural way to think about relationships in pictorial terms — but the difference between salvation and damnation lies in the proper labelling of the arrows.

The basic template for making these collections was shown, without explanation, in part 2, with a code example. We can improve on this template, both to hide the bureaucracy of it and to define a reusable model for all new cases. But first we need to think about what that model needs to contain.

Associations with causal semantics

Forward: “Contains”,Backward: “Is part of”,Negative-forward: “does not contain”Negative-backward: “is not part of”

The last two are for the exclusion of nodes — see the passport example in the next post. Having these texts defined in a single place, and keeping them logically together is a useful trick. In the semantic spacetime library, I’ve defined an ASSOCIATION structure to help do this. We can use a key-value approach to index or alias all the variants as members of a single group, with a single alias to refer to the group.

Figure 3: separating link semantics from the link using a look-up table (map)Figure 3: separating link semantics from the link using a look-up table (map).

This is almost enough to finish the model, but there’s one more detail: the other look-up table, i.e. the semantic spacetime type of the association (see figure 3). Recall that there are four basic types, so we should add which of those types this particular link interpretation is. We defined those types as a key-value table in parts 3 and 4 of the series. We can add it as an integer value to an association group to make the semantics explicit. This is going to help us to select algorithms easily later.

Making an SST toolkit

ASSOCIATIONS[“groupalias”] -> structured edge type values

The right hand side of this will be composed of the related forms of an edge group relation. For each type of directed association, there are four possible variants we might need in reasoning about the graph:

  1. The forward explanation text.
  2. The reverse direction text.
  3. The forward negative.
  4. The reverse negative.

Each link has a spacetime type too. Recall the four fundamental type classes from part 4 of the series: CONTAINS, FOLLOWS, EXPRESSES, NEAR. This is the STtype. We can now represent the link in figure 2 as a tuple

(A,B,ASSOCIATION[“alias”],”+”)

where the “+” means the forward direction is being used.

For the Go representation of the right hand side, we use a struct data type. For the Arango representation, we can use a document. Documents map directly to structs in Go, and we can simply use Go’s ability to map its internal structures to a JSON file format. This piece of black magic isn’t exactly well documented, but is extremely useful. It looks like this:

type Association struct {
Key string `json:”_key”`
STtype int `json:”STType”`
Fwd string `json:”Fwd”`
Bwd string `json:”Bwd”`
NFwd string `json:”NFwd”`
NBwd string `json:”NBwd”`
}

This amount of bureaucracy might seem like overkill for a simple model, but building up this basic library will save much difficulty later. This is always the case in programming. Time invested in planning pays off in the end. Quick fixes take a while to convert into solid solutions — so borrow my advice, having made all those mistakes. In Part 10, we’ll use it all to look at a big graph.

ArangoDB’s interior representation of documents follows a JSON model, which is pretty convenient. In Golang, the representation is by struct. We can map structs to JSON names in Go, see the quoted extras on each line of the code box above. Notice also that the primary key used is called “_key” in JSON. This is expected by ArangoDB. Every record must have a field with this name. If you don’t add it in exactly this way, Arango will add its own and use that instead. Sometimes, we have to follow these rules.

In Go, we can now define a map and start to populate it with data:

var ASSOCIATIONS = make(map[string]Association)ASSOCIATIONS[“CONTAINS”] = Association{ “CONTAINS”, GR_CONTAINS, ”contains”, ”belongs to or is part of”, ”does not contain”, ”is not part of” }

The semantic spacetime metatype of the edge name is one of the four ST types, which we used in the key value example of parts 3 and 4 (see figure 4).

var STTYPES []IntKeyValueconst GR_NEAR int = 1      // approx like
const GR_FOLLOWS int = 2 // i.e. influenced by
const GR_CONTAINS int = 3 // inside/outside
const GR_EXPRESSES int = 4 // represents, etc

The big payoff in setting up this apparently frustrating layer of bureaucracy lies no so much in storing as in reading back the data later.

Figure 4: the four elementary spacetime relationship types.

Associating with edges in Arango

CreateLink(node1, ”group-alias”, node2, weight)CreateLink("Norwegian", "IS_LIKE", "Swedish", 60/100)

We also want to keep the different kinds of link separate, as different collections, by STtype, because these are orthogonal relations, so they would never be searched at the same time. I’ll keep this for the next post so that we can complete a basic library of methods for future use. For now, you can try defining links and browsing the interface to see what they look like in the browser (see figure 5).

Figure 5: browsing the links collection in Arango, we see the document structure of links.

Notice how the key names for source and destination nodes in the link data must also use the underscored style, i.e. “_from” and “_to”, and they must use a fully qualified collection name “Nodes/node1” not just “node1”. These are the rules of ArangoDB, and it’s nice to be able to forget about them by absorbing these into the SST library layer in Go.

There’s one technical issue to solve here concerning the key names for edges or links. Arango allows you to keep adding links, even if they have the same attributes, unless you deliberately fix the key name based on a unique set of parameters. If we want to apply the principle of convergence to graphs, we’ll need to ensure convergent and idempotent link management. We can use hashing to do that. With fully differentiated collections, a database will start to look something like this (figure 6):

Figure 6: a fully formed semantic spacetime model may have several node and edge collections

Note: In computer algorithms, it’s normal to search for data by looking at the objects (nodes) first. But in a graph representation, the links between nodes automatically include their endpoints and searching links first can save a lot of time by searching the links. This makes searches O(N) instead of O(N²), and any nodes that don’t have links can be ignored, since they play no role in the process concerned. Since we’ve separated these into different “directories” or collections, we don’t need any if-then-else logic to determine this.

Summary: building on the SST package

In the next installment, I’ll go further to show how we can make a “standard model” for Semantic Spacetime, building on what we’ve discussed so far — and then learn how to modify it. Then we can easily import the templates as an de-facto library, and modify them for any project going forwards.

Read more in this popular science book:

Science, research, technology, author - see Https://markburgess.org Https://chitek-i.org