Universal Data Analytics as Semantic Spacetime
Part 3: The basics of space and time, and the most familiar multi-model databases you never knew…
In case you hadn’t noticed, there’s a very important multi-model database hidden in plain sight, right under our noses. We might never think about it in those terms, but we rely on it every day for everything we do. It’s the very space around us! We keep stuff on shelves, organize it in drawers, or leave it lying around. In IT, we created another shared space for our stuff, using another multi-model database — the good old hierarchical file system, consisting of all the directories and files on our disks. From there, we build other structures including key-lookup databases. At risk of pointing out the obvious, it’s worth a quick post to reflect on the similarities between these data sharing models, and see the practicalities surrounding multi-model database design — and the features we’ll use to describe semantic spacetime.
This post is elementary for more experienced programmers, but may be of interest to general readers.
Data with coordinates!
Databases are basically storage systems, warehouses, or parking lots, for data or even real goods, depending on how you look at it. Our world is less of an oyster than it is a database.
The contents of RAM or the surface of a disk are covered in blocks which are numbered sequentially with addresses, like parking spaces, and are used to park data for later retrieval. Dividing space into fixed size coordinate blocks helps to make it predictable, regular, and reusable. We build all kinds of abstractions on top of this basic model to arrange the parked information into virtual views, which look like files, directories, hierarchies, ledgers, or graphs.
So, at a low level, memory of any kind is like a cargo ship, piled high with containers, all the same size, but containing different data. Without some kind of coordinates, map, or index of what’s in which container, finding anything in the pile would be a long and laborious process of systematic searching. Databases offer these maps in a wide variety, as well as query languages (of which SQL is the most famous). Let’s compare some of these views and how we approach them with our tools, Go and ArangoDB.
Containment structures and data values
ArangoDB’s approach to storage uses a hierarchy of container structures, with a few layers, to slice and dice data:
- Database — a named data repository (like a disk volume).
- Collection — a set of related records, under a common role name (like a directory or folder). Collections don’t normally represent different types of entity (as in SQL tables) but rather documents that play a similar role in a computation.
- Document — a single structured record, like a programming “struct”, a form, or a file within a folder.
- Values / data — the contents of a document in a variety of data formats: numbers, pictures, text, etc.
Data types are combinations of the primitive data types: integers, floating point numbers, strings, and byte arrays. In older languages, using single-byte encodings like ASCII, strings and byte arrays were the same thing. In modern Unicode encodings, such as UTF-8, a single character may not fit into a byte. This means that strings and byte arrays have to be interconverted sometimes.
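To make the difference between strings and byte arrays concrete, here is a minimal Go sketch. The helper name byteAndRuneCounts is made up for illustration; the standard library calls are real. Go strings hold UTF-8 bytes, so the byte count and the character (rune) count can differ.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// byteAndRuneCounts (a hypothetical helper) returns the number of bytes
// and the number of Unicode code points (runes) in a string.
func byteAndRuneCounts(s string) (int, int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	s := "na\u00efve" // "naïve": the ï needs two bytes in UTF-8

	b := []byte(s)    // string -> byte array
	back := string(b) // byte array -> string

	nbytes, nrunes := byteAndRuneCounts(s)
	fmt.Println(nbytes, nrunes, back == s) // prints: 6 5 true
}
```

The conversion in both directions is lossless here, but indexing into the byte array no longer lines up with counting characters.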
Let’s not forget why we want to keep data together under a single banner. This often has to do with context or the provenance of data. If data are measured under a single experiment, under the same conditions, they would belong together somehow. On the other hand, for data that are measured under different conditions, it’s important to highlight that so that we can distinguish them. It’s about chain of custody, the back-story of how results were found, and therefore what they mean in broader terms.
Data semantics and “schemas”
The shadow side of data is how we interpret the values we obtain. This has two aspects:
- Naming of data types to distinguish interpretations
- Matching methods and their interpretations to data types
In programming, semantics are basically associated with a unique name.
Data: [Type|Category|name] → Attributes | Behaviours
There are ways of naming types differently to make semantics explicit, but this adds a layer of bureaucracy that many researchers would find arduous.
In Go programming, related data values can be collected under a “container type” using structured types. The syntax is similar to the old C syntax, except for the order of type and name.
type xyz struct {
	Member1 int
	Member2 string
	Member3 float64
}
A minor stumbling block in Go is that members can only be referred to outside their package if they begin with a capital letter (facepalm).
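The capital letter rule matters in practice as soon as a struct is turned into a document. The sketch below uses the standard encoding/json package; the Reading type and its field names are invented for the example. The lower case member is silently left out of the JSON, because the encoder lives in another package and can't see it.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Reading is a hypothetical sensor record used only for illustration.
type Reading struct {
	Sensor string  `json:"sensor"` // exported: appears in the JSON
	Value  float64 `json:"value"`  // exported: appears in the JSON
	note   string  // lower case: invisible to encoding/json
}

// toJSON marshals a Reading, returning an empty string on failure.
func toJSON(r Reading) string {
	b, err := json.Marshal(r)
	if err != nil {
		return ""
	}
	return string(b)
}

func main() {
	r := Reading{Sensor: "thermo1", Value: 21.5, note: "dropped"}
	fmt.Println(toJSON(r)) // prints: {"sensor":"thermo1","value":21.5}
}
```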
The term “schema” was usually associated with a rigid template for data, from the Relational (SQL) database model. Modern “document oriented databases” allow templates to be ad hoc and extensible. This freedom is powerful, but can also lead to inconsistency without discipline. A completely general multi-model database, like ArangoDB, offers great freedom, and thus hands a certain responsibility to the user to maintain self-discipline.
How we choose to store data is a technical issue — a matter of templating and naming — basically a kind of bureaucracy. Some major models include:
- Data as files, archived by category, for related sets of numbers, text, films, etc.
- Spreadsheets / tables / ledgers of related transactions or data points, usually used for numbers in experiments or financial transactions.
- Taxonomy or hierarchy of inheritance for characteristics
For each of these, there is a database technology for the choice.
If we think about an application like the Internet of Things, the “things” are of many different kinds: devices, sensors, phones, and of course humans(!) They have different characteristics, which we capture as different attribute types. We could represent them as different “forms” to fill out, which translate into different “document types” or different “schema” in a database sense.
The filesystem: files and directories
A filesystem, the software that manages data on our disks, is a rather simple database for all kinds of structured and unstructured data. We can think of it in a number of ways. As a key-value store, keys are filenames and the values are the file contents, or we can think of it as a document store, which is quite obvious. We can also think of the files as cells in a spreadsheet, table, or ledger, with rows and columns, collected and labelled as different directories.
Thought of as a graph database, a file system is mostly treelike and hierarchical (except for a few symbolic links) — it could be called an irregular Cayley Tree. As a tabular structure, it’s a list of files. As a hierarchy it’s a collection of containers (directories) for items (files), just as collections are containers for documents.
From the perspective of a filesystem, a database is something like a disk or logical volume. A collection is something like a directory, and a document is something like a file. Just as we would put files into a directory that has a special significance (to keep related things together and to keep them separate from unrelated things) so we would use collections to organize and categorize things in a database like Arango. What makes all of these structures useful for semantics is that they can be given names.
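As a rough illustration of the filesystem viewed as a key-value store, the Go sketch below walks a directory tree and loads it into a map, with relative paths as keys and file contents as values. The function names loadTree and demo, and the sample file, are assumptions made for the example; it uses only the standard library and builds its own throwaway directory so it doesn't touch real data.

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// loadTree reads every non-directory entry under root into a map:
// key = path relative to root, value = file contents.
func loadTree(root string) (map[string]string, error) {
	store := make(map[string]string)
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if d.IsDir() {
			return nil // directories are containers, not values
		}
		data, err := os.ReadFile(path)
		if err != nil {
			return err
		}
		rel, err := filepath.Rel(root, path)
		if err != nil {
			return err
		}
		store[filepath.ToSlash(rel)] = string(data)
		return nil
	})
	return store, err
}

// demo builds a small temporary tree and loads it.
func demo() (map[string]string, error) {
	root, err := os.MkdirTemp("", "sst-demo")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(root)

	if err := os.MkdirAll(filepath.Join(root, "money"), 0o755); err != nil {
		return nil, err
	}
	file := filepath.Join(root, "money", "budget.txt")
	if err := os.WriteFile(file, []byte("rent: 1000"), 0o644); err != nil {
		return nil, err
	}
	return loadTree(root)
}

func main() {
	store, err := demo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(store["money/budget.txt"]) // direct key lookup, no searching
}
```

The directory name survives as a prefix of the key, which is one simple way that a document inherits the name of its container.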
We could, of course, put every document in a single collection and use internal labels to separate nodes by filtering out certain search criteria, but that would be computationally and cognitively more expensive, like opening every file on your hard disk to see if it contains the word “money” rather than keeping money-related matters in a separate search space, like a hash. In a graph, this choice depends on how common criteria emerge. Linking millions of members to a single hub to represent type is more cumbersome than labelling each node with a type in its document, but equivalent. On the other hand, it makes their clustering by type explicit to see. Conversely, if nodes become linked by hubs, we can use this to infer new types!
DATABASE | FILESYSTEM
Open database | Mount filesystem
Create or open collection | Make directory or change to directory
Query collection | Read directory contents
Read document | Read file
Primary key | Filename / URI
Graph edge | Hyperlink (“see also” reference)
The relationships between file objects, in a container hierarchy, are straightforward (see figure 2 and the table above). Files are inside directories, and directories may be inside other directories. In other words, they inherit the name and significance of their container too. The code for exploring these concepts is here.
If we want to think of this as a graph, with labelled “semantics”, then we could represent the hierarchy in any number of ways. For example, “DIRECTORY contains SUBDIRECTORY” or “DIRECTORY is the parent of SUBDIRECTORY”. See figure 3 below. How we choose to model these structures will play an important role in making analytics easy or cumbersome down the line. It’s worth clearing your mind of preconceived ideas, and allowing yourself to be challenged by alternative representations, to make the best of each case.
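One way to keep such alternative readings side by side is to store the label on the link itself. The sketch below is only an illustration in plain Go: the Link type and the helpers describe and targets are invented names, and an in-memory slice stands in for a real edge collection.

```go
package main

import "fmt"

// Link is a labelled, directed relationship between two named things.
type Link struct {
	From, Label, To string
}

// describe records a parent/child fact under a chosen interpretation.
func describe(parent, label, child string) Link {
	return Link{From: parent, Label: label, To: child}
}

// targets returns everything that node points to through links
// carrying the given label.
func targets(links []Link, node, label string) []string {
	var out []string
	for _, l := range links {
		if l.From == node && l.Label == label {
			out = append(out, l.To)
		}
	}
	return out
}

func main() {
	links := []Link{
		describe("/home", "contains", "/home/mark"),
		describe("/home", "contains", "/home/guest"),
		describe("/home", "is the parent of", "/home/mark"),
	}
	// The same structure answers different questions depending on the label.
	fmt.Println(targets(links, "/home", "contains"))
	fmt.Println(targets(links, "/home", "is the parent of"))
}
```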
A casual user might not care much about the distinction between the two alternative interpretations, but we shouldn’t dismiss apparently small differences too easily. Descriptions may have quite large differences of meaning, in the right context. That’s why we need a way to make semantic representations and explanations systematic and consistent, from the beginning, without squeezing the life out of them with formal logic.
The ledger, spreadsheet, or row/column model
An analogy between data sets and the rows and columns in a two dimensional table is the most common one made by relational (SQL) databases. For these, a single template data type is considered to be a labelled column in a table (i.e. a collection of different values related by a common label or interpretation). See figure 4 below. The different instances of that template are considered to be the rows of the column (different documents or values in the Arango model). We conventionally list comparable things downwards in a spreadsheet column.
Are two objects comparable or not? In order for two things to be considered comparable, they have to have the same shape — or data type. That means they need to share the same data model, or have the same “struct” template. This spreadsheet analogy, pitted alongside concerns about the consistency of data input by hand, led to the definition of “normalization laws” for data schemas, or the so-called Three Normal Forms. The relevance of the normal forms has changed somewhat as databases have been generalized and automated data collection has replaced entry by hand. Modern structures like data warehouses and data lakes explicitly break those rules, repeating values for efficiency of retrieval.
Inheritance of semantics and attributes
With structured data types and later Object Orientation came the notion of inheritance — where we can nest data types inside others. It’s a lot like a constrained version of keeping subdirectories inside other directories, and it has the same kind of limitations around multiple-inheritance.
This ability to build abstractions with these tools is a double edged sword, containing double edged swords, with double edged turtles all the way down! It seems powerful, but it also gets you quickly out of control unless care is taken. If intent is important to you in data modelling, which is usually the case in data science, in the form of an intended model or hypothesis, then too much nesting of concepts may undermine your ability to keep control of that model.
Layer upon inscrutable layer of data typing may be ok for machinery that works without interpretation, but when interpretation matters it’s the human interpretation that we care about. That’s one reason why there are controversies around machine learning applications such as facial recognition, etc. When a machine decision doesn’t match the human decision, we question the reliability of the technology. Actually, the reliability might be ok, but the design might be flawed because of too much of the wrong abstraction. These are the subtleties of automation. We’ll need to come back to this issue. It’s all about semantics.
Collections in Arango can’t be nested, as far as I can see — though they can be sharded. The use of collections as “subdirectories” is limited to a single layer of containers — but then one can use edges to link different collections as long as a program discipline is maintained. The lack of nesting isn’t necessarily a bad thing, as excessive recursion is very hard for humans to understand. Even programmers overdo recursion and make program code harder to understand for themselves, if not for the computer’s somewhat deeper stack.
Graphs as indexing structures
So much for key value pairs and their infinite generalizations. What about graphical relationships? Graphs add a level of generalization to the kinds of structure one can build with nested Russian dolls. In a graph, the connectivity expresses both containment and relational semantics, through four basic types of link that I’ll describe in the next post.
Directory containers can be represented as graph hubs. A hub is like a network router that interconnects Local Area Networks, without the need for prefix routing numbers. Hubs have the role of binding together a number of members through a single representative, which acts as an exchange. We sometimes call the sum of a hub and its members a supernode.
As you work with graphs, you’ll want to use the idea of spanning trees to parse data, and so you’ll end up using the same kinds of paradigms for addressing and parsing graph structures as you would use for file searches: i.e.
- Go to a hub
- List the nearest neighbours of the hub
- If any of the neighbours are themselves hubs, descend recursively
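The three steps above can be sketched in a few lines of Go. This is a minimal illustration, not a database query: the Graph type and the walk function are assumed names, and the graph is a plain map from each hub to its nearest neighbours. A set of already seen nodes guards against loops, such as a symbolic link pointing back up the tree.

```go
package main

import "fmt"

// Graph maps a hub name to the names of its nearest neighbours.
// A name that appears as a key is treated as a hub.
type Graph map[string][]string

// walk goes to a hub, lists its neighbours, and descends recursively
// into any neighbour that is itself a hub.
func walk(g Graph, hub string, seen map[string]bool) []string {
	seen[hub] = true
	order := []string{hub}
	for _, n := range g[hub] {
		if seen[n] {
			continue // already visited: avoids endless loops
		}
		if _, isHub := g[n]; isHub {
			order = append(order, walk(g, n, seen)...)
		} else {
			seen[n] = true
			order = append(order, n)
		}
	}
	return order
}

func main() {
	g := Graph{
		"/":          {"/etc", "/home", "/README"},
		"/etc":       {},
		"/home":      {"/home/mark"},
		"/home/mark": {"/home/mark/notes.txt", "/"}, // "/" acts like a symlink back to the root
	}
	for _, name := range walk(g, "/", map[string]bool{}) {
		fmt.Println(name)
	}
}
```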
As far as semantic types are concerned, graph databases make it easy to label links by type and therefore code clean distinctions. Vannevar Bush (one of the doctoral advisors to Claude Shannon) is often credited with inventing the concept of hypertext in his forward-looking essay As We May Think. It discussed how we may organize information in a way that would be intuitive to humans. It spawned later technologies for indexing by librarians, known as Topic Maps (see figure 6), which were one of the first semantic graph technologies invented as a way of browsing webs of embedded concepts — not just as archival records (documents) but as semantically significant “data type instances”. Today, Topic Maps are all but superseded by the simpler technology of web and hypertext. Attempts to bring back some of the type-oriented semantics through the Semantic Web project (and its data format RDF) have been basically unsuccessful, as I’ll explain later.
Graphs as learning value stores
A second way that graphs can be used is as “yet another kind of key-value store”, with all the attendant learning possibilities. Both the nodes and the links can now act as keys, and the values can be attached to these as attributes of those structures. In ArangoDB, this is very simple: both nodes and links (vertices and edges, as they’re officially called) are simply documents. This means we can add as many details, quantitative or qualitative, as we like. We typically want to explain the interpretation of the node or link, and perhaps its relative importance according to the processes we are modelling.
Looking around at common usage for graphs, including some of the major players, the data designers miss great opportunities by failing to use the degrees of freedom available to label edges. Typically, one finds all the information in nodes, and only a “yes” or “no” for a single type of link (typically node1 and node2 happened together or were co-activated). But if we don’t use the learning capacity of links too, then we lose the relative frequencies or importance of the nodes updated over time. From a semantic spacetime perspective, capturing the processes involves traces of both space and time — best not to throw away valuable signal.
What does this mean practically? A node in an Arango graph has a mandatory key in its JSON representation, called “_key:”. A link has mandatory fields “_from:” and “_to:”, as well as “_key:”. Normally the key of a link will be a combination of the names of its from and to nodes, as we want each connection between a given pair of nodes to be unique. This means we can address entities (nodes) and relationships (links) and document their attributes using direct key lookup. There’s no need to search around in a graphical structure as long as we name items well. For example, we can use a graph to learn the strength of a relationship, by updating a “weight:” or “strength:” value as a number each time two nodes talk to one another by some process in the real world. I’ll be returning to this in future posts. This is a powerful generalization of the histogram concept. Now every strand in the web of relationships encodes type and value information for more detailed and sophisticated modelling.
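Here is a hedged sketch of that idea in Go. It does not talk to ArangoDB at all: an in-memory map stands in for an edge collection, and the names Edge, EdgeStore, Observe, and edgeKey, as well as the "from_to" key convention and the collection name "Nodes", are assumptions for illustration. Only the field names _key, _from, _to come from the text, along with the fact that _from and _to refer to documents as collection/key.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Edge mirrors the shape of a link document: mandatory _key, _from, _to,
// plus a learned weight.
type Edge struct {
	Key    string  `json:"_key"`
	From   string  `json:"_from"`
	To     string  `json:"_to"`
	Weight float64 `json:"weight"`
}

// edgeKey combines the endpoint names so that each connection between a
// given pair is unique (a hypothetical naming convention).
func edgeKey(from, to string) string {
	return from + "_" + to
}

// EdgeStore is an in-memory stand-in for an edge collection.
type EdgeStore map[string]*Edge

// Observe records one interaction between two nodes, creating the link on
// first sight and strengthening it on every later sighting.
func (s EdgeStore) Observe(coll, from, to string) *Edge {
	k := edgeKey(from, to)
	e, ok := s[k] // direct key lookup: no graph search needed
	if !ok {
		e = &Edge{Key: k, From: coll + "/" + from, To: coll + "/" + to}
		s[k] = e
	}
	e.Weight++
	return e
}

func main() {
	store := EdgeStore{}
	store.Observe("Nodes", "alice", "bob")
	e := store.Observe("Nodes", "alice", "bob")

	doc, err := json.Marshal(e)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(string(doc))
}
```

Each repeated observation lands on the same key, so the weights build up like the bins of a histogram spread over the links.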
Note: In computer programming, it’s normal to search for data by looking at the objects first, using loops to search first one way and then another. But in a graph representation, searching the links between nodes is a better strategy, as this automatically includes their endpoints, and there are no combinatorics to think about. Searching links first can save a lot of time, making searches O(N) instead of O(N-squared).
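A toy comparison makes the point about counting. The sketch below is plain Go with invented names (Link, byNodes, byLinks); it counts how many checks each strategy makes on a small chain of four nodes. The link-first cost here is really proportional to the number of links, which is close to the number of nodes only when the graph is sparse.

```go
package main

import "fmt"

// Link is a directed connection between two named nodes.
type Link struct{ From, To string }

// byNodes looks at the objects first: it tests every ordered pair of
// nodes to see whether they are linked.
func byNodes(nodes []string, linked map[Link]bool) (found []Link, checks int) {
	for _, a := range nodes {
		for _, b := range nodes {
			checks++
			if linked[Link{a, b}] {
				found = append(found, Link{a, b})
			}
		}
	}
	return found, checks
}

// byLinks walks the list of links once; each link already carries
// both of its endpoints.
func byLinks(links []Link) (found []Link, checks int) {
	for _, l := range links {
		checks++
		found = append(found, l)
	}
	return found, checks
}

func main() {
	nodes := []string{"a", "b", "c", "d"}
	links := []Link{{"a", "b"}, {"b", "c"}, {"c", "d"}}

	linked := map[Link]bool{}
	for _, l := range links {
		linked[l] = true
	}

	_, pairChecks := byNodes(nodes, linked)
	_, linkChecks := byLinks(links)
	fmt.Println(pairChecks, linkChecks) // prints: 16 3
}
```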
Final note: scratchpads or archive?
In programming we’re used to the idea that a data structure (i.e. a representation) is the enabler of an algorithm. On the other hand, we tend to think of a database as a more permanent storage option. This is a false dichotomy. When you visit an airport or a shopping mall, there is short term parking and long term parking, with different trade-offs about proximity in space and journey time. The trade-offs are about how parking interacts with other processes that need to share the space and the time. In other words, they are about the local semantics of space and time.
The role of any storage can be classified by timescale (the characteristic speed of change versus the speed of reading and writing). In the modern age, there is no clear separation of speed between what happens in RAM and what happens via a third-party software service. The value of a database is not necessarily only as a permanent storage model, but rather as an ad hoc temporary overlay that assists in a specific computation. Just think about the processor caching hierarchy in IT. Everything we do in organizing resources is about building a hierarchy of storage for efficient retrieval from space (near or far) over a timescale which is decided by the purpose of retrieving the data.
In the next installment, we’ll begin the process of understanding semantics by looking into simple key value stores, which are the universal scratchpad technology for IT. I’ll show how they relate to associative arrays, or maps as they are now called in Go, and how they generalize to the most complex graphs.