Universal Data Analytics as Semantic Spacetime

Mark Burgess
12 min read · Aug 26, 2021

Part 4. Keys, patterns, maps, and learning — the beginnings of representational semantics in data

Imagine what happens when you walk into a smart building, wielding your biometrics and wearing digital electronic devices — recording your environment with embedded sensors, smart glasses or pads. A torrent of information interactions and registrations is unleashed, leading to a virtual “wiring up” of devices with services for identification and authentication. It’s a lot of information, which could be represented by flowcharts, websites, data circuitry, supply chains, user logs, cartography, etc — in all cases, complex interactions between key-value look-ups, document archives, scratch-pads used by algorithms, and graphical maps of events play a basic role in both what we experience and what we create. In this post, I’ll explain (hands-on) the mechanics of representing spacetime processes as associative memory, both in Go code and in the ArangoDB database.

Space and time underlie all data

Pattern is the foundation of all information, and patterns need space. A pattern can be a simple bit (on or off), a complex string, one or more documents with structured fields, or a complex web of nodes expressing intricate relationships — and all changing in time. Without some kind of notion of space, there couldn’t be information. Space is what we use for memory. It’s the fabric of state.

We probably tend to think of space as the kind of open space we learn in school (Euclidean space), but there are other kinds too. If we ignore reserved mathematical terminology, then space is just a collection of distinct locations, without any assumed structure: a single bit (a light bulb), a row of buckets, a spider’s web, or an infinite chess board, are all versions of space. They contain degrees of freedom that can hold and express information.

Time, on the other hand, is the phenomenon of state changing, expressed at these locations. Changes to that state are known as a process’s proper time. So space and time are not independent things. Together, they describe processes (or spacetime phenomena) — and we strive to observe, understand, and even intentionally manipulate these processes for a purpose. This purpose is what we mean by semantics.

The options for representing space in IT begin with the smallest atoms: bits; from there, words — and we are already into the realm of key-value stores. From associating names with values, we can form arrays of them, more structured packages like “documents” or “tables” (vectors and tuples), and finally graphs or networks with completely ad hoc structures. I’ll work through these in the languages of the chosen tools, starting with the basic atomic building blocks.

Key-Value pairs

Space is memory, and memory means “variables”. Key-value associations were called associative arrays in the heyday of Perl. Now they are more pretentiously called “maps” in Go, and in programming libraries. Famous cloud-embedded databases like Redis, Zookeeper, Consul, and etcd, specialize in key-values, but in fact every kind of memory is some form of key value store.

A key is a “data address” in the broadest sense (just a variable name), and a value is something we want to remember (variable contents). The keys and the values might have multiple lines, representing several “dimensions”. Local key stores like the Windows Registry, BerkeleyDB, SQLite, TokyoCabinet, and others, pioneered the use of structured data to store distributed configurations, logs, and data records locally on every host. ArangoDB offers this as a simple option too.

A key-value pair is also the simplest kind of table or ledger (see figure 1): a two-column list, with labels (names or numbers) as keys, and values (any kind of data) corresponding to each key. It has no other structure. It could be dates and times, names and phone numbers, or whatever basic association we like. Key-value lists are extremely useful for counting or jotting down values. If we treat them not as static lookup tables, but rather as dynamical living values, then we can also use them to update moving averages, current positions, scores, balances, and so on. A key-value list like figure 1 is basically a histogram.

Figure 1: Key-Value pairs are natural histograms. As we update the values, we learn more about the profile generated by the keys. It’s all about the interpretation.
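To make the histogram idea concrete, here is a minimal Go sketch (the event names are invented for illustration) that accumulates counts in a map as new observations arrive:

```go
package main

import "fmt"

func main() {
	// A key-value map used as a histogram: each update
	// accumulates knowledge about the frequency profile of the keys.
	histogram := make(map[string]int)

	// A hypothetical stream of observed events
	events := []string{"login", "logout", "login", "error", "login"}

	for _, e := range events {
		histogram[e]++ // missing keys start at zero in Go, so this is safe
	}

	fmt.Println(histogram["login"], histogram["error"]) // prints: 3 1
}
```

The same pattern works for moving averages or balances: the key names the phenomenon, and the value is a living summary of it.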

We might think of a key-value database as an itemized list. It can be ordered somehow, e.g. by numbering the keys, like in a hotel, with rooms labelled by room number and floor coordinates (room,floor). Yes, a hotel is a key-value store for humans. However, in other examples, keys have no particular order (they could be just names), so to find the location of the data, we need to search through the data. If the hotel itself is a key-value store for humans by room number, the check-in ledger is a key-value store for rooms by name.

Key: Mr Burgess
Value: Room 123, Suite with sea view, jacuzzi, and butler.

Searching linearly through a list is inefficient when the list is long, so one uses a hash table, in which a non-integer key is turned into the integer number of some reserved data slot — something like booking a hotel room and being assigned a room number. Hashes are usually integer numbers (say replacing “Mr Burgess” with 1836537), and expensive hashes are often used in cryptography (so they may be called cryptohashes). They aim to assign a unique number to any key string of data, mapping a non-ordered name onto an ordered memory structure. There may be occasional accidental collisions between hashes (as when two people are assigned the same room in a hotel), so we have to be careful and handle those cases in large tables.
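As a small illustration of the idea (using Go’s standard hash/fnv package, a fast non-cryptographic hash; the function name and slot count are invented for this example), we can turn a guest’s name into a “room number”:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// bucketFor hashes a key string into one of n reserved data slots.
// Different keys may collide on the same slot, which a real
// hash table has to detect and handle.
func bucketFor(key string, n uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte(key)) // writing to a hash never fails
	return h.Sum32() % n // fold the hash into n slots
}

func main() {
	// The same name always lands in the same slot (deterministic),
	// like a guest always being shown to the same room.
	fmt.Println(bucketFor("Mr Burgess", 1000))
	fmt.Println(bucketFor("Dr Who", 1000))
}
```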

Key-Value Maps in Go

A map in Go is a tool for using key-value stores. It’s what we used to call associative arrays in Perl. It’s what many technologists use Redis for. All databases are some variation on key-value stores.

Golang maps are associative arrays. One of the quirks of Go is that, somewhat inconsistently, we have to use the “make(data type)” memory allocation function to set up structures that contain multiple key-value pairs. It’s one of those many ad hoc things that you trip over when using different languages, but it’s quickly learned, so not a show stopper. The following is how you could define associative arrays mapping integers to strings, and strings to strings.

var mytable = make(map[int]string)

mytable[4] = "coffee table"

var association = make(map[string]string)

association["number"] = "an idea"

fmt.Println("test", association["number"])

Maps work much as databases do for storing data, so they make an ideal pairing.

See full example code.

As scientific users, who are not dedicated developers, we don’t really want the hassle of fussy programming APIs, especially for a database, as these are largely concerned with the details of exception handling. Robustness can be secondary to simplicity in a friendly environment. We’d like to use some simple layer of abstractions that we can trust to take care of these details in normal circumstances. So, let’s begin by building a layer on top of the flexible but fragile Go APIs for mapping and for ArangoDB to create robust and repeatable read/write functions.

Key-value “document” structures in Go

Before maps or associative arrays were common in programming languages, we had to use alternative paired lists. To declare a key value pair as a structured list, in Go, associating a name with an integer number, we could use the following pattern. First define an associated pair as a data type under a common name IntKeyValue, with members K and V for key and value respectively. We could then make an array of these, which is an ordered collection of them, rather than the ad hoc map described above.

// declare a data type IntKeyValue

type IntKeyValue struct {
	K string
	V int
}

// declare a variable ARRAY of this type

var kv []IntKeyValue

// static initialization, with type templates

kv = []IntKeyValue{

	IntKeyValue{
		K: "NEAR",
		V: 1,
	},
	IntKeyValue{
		K: "FOLLOWS",
		V: 2,
	},
	IntKeyValue{
		K: "CONTAINS",
		V: 3,
	},
	IntKeyValue{
		K: "EXPRESSES",
		V: 4,
	},
}

The full code to this example is at kv.go

This is slightly different from using a map, and we’ll have use for both representations.
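Since both representations will be useful, a small helper (the name ToMap is my own invention here) can convert the ordered list into an unordered map for fast lookup:

```go
package main

import "fmt"

// IntKeyValue pairs a string key with an integer value,
// as defined in the article.
type IntKeyValue struct {
	K string
	V int
}

// ToMap converts the ordered array of pairs into a Go map.
// The ordering is lost, but lookup by key becomes direct.
func ToMap(kv []IntKeyValue) map[string]int {
	m := make(map[string]int)
	for _, pair := range kv {
		m[pair.K] = pair.V
	}
	return m
}

func main() {
	kv := []IntKeyValue{
		{K: "NEAR", V: 1},
		{K: "FOLLOWS", V: 2},
	}
	m := ToMap(kv)
	fmt.Println(m["FOLLOWS"]) // prints: 2
}
```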

Key-values in ArangoDB

The straightforward relationship between key-value databases and maps in Go (and other languages) means we can use databases for caching maps and sharing them between different applications. It’s easy to save and load entire key-value maps like notebooks, so we can exchange memory simply between code and database. Databases are (after all) a shared memory construct, by design. Things we’d like to be able to do:

  • Assign pairs as a map
  • Store them in a database
  • Load a database into a map
  • Update the map directly without loading it
  • etc

By analogy to maps, we would like to be able to work directly with the database to simply say (without getting an error if we do it twice):

AddKeyValue(name,value)

Alas, it’s not yet quite that simple for a database service, because we have to relate the implementation of data types in program code to the equivalent (but independently implemented) data types in the database. Because of the general model of containment, described in the previous post, there’s some bureaucracy involved in reaching this level of simplicity. Let’s show how to implement that simple layer of usability to avoid future headaches.
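The essential property we want from that layer is idempotence: calling the operation twice should be harmless. As a minimal sketch of the idea — with an in-memory map standing in for the database collection, not the real ArangoDB-backed version — the wrapper looks like this:

```go
package main

import "fmt"

// collection stands in for a database collection in this sketch.
var collection = make(map[string]string)

// AddKeyValue creates or overwrites a key without complaint:
// calling it twice with the same arguments is not an error.
// The real version would hide the database bureaucracy behind
// the same simple signature.
func AddKeyValue(name, value string) {
	collection[name] = value
}

func main() {
	AddKeyValue("coffee", "table")
	AddKeyValue("coffee", "table") // repeating is harmless
	fmt.Println(collection["coffee"]) // prints: table
}
```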

These code examples will all end up in a library SST.go that you can modify and use as a template for other cases.

To store a key, we have to open a database, and open the container that the keys and values will be kept in (see previous post). To open a database, which runs as a service rather than a local library capability, we have to establish a connection to the service on a network port. This all feels like a big headache to the casual user, so we want these details buried as quickly as possible. Even just opening a database involves some ceremony: the first time, we have to create it, which is represented differently in the API from opening it the second time, and thereafter. There are valid reasons for this, from a programming perspective, but these aren’t really the concerns of a data scientist. We want to be able to say:

  • create_or_open_database(name) once, then
  • read/write to that database as we please

The first step is to connect and assign keys and values to a named container. Let’s make all that go away.

Service protocol: sessions and transactions

A database runs as a service, something like an ongoing dialogue. Think of a phone conversation. You have two options for sending instructions:

  • A text message in a bottle (single transaction) — data drop, asynchronous and simple.
  • Establish and keep a conversation going — a synchronous handshake, with on-going obligations on both sides.

Text messages seem nice because you’re not aware of an open line, i.e. the ongoing connection, to the other party. It feels as though you can write and read whenever you like, asynchronously and without obligation. In IT, a text message is called an “unreliable” or asynchronous message. An on-going conversation is called connection-oriented, “reliable” (with handshaking), or synchronous, because the handshake keeps you in constant touch with the other party. It feels like more work, and you have to remember to close the connection to avoid paying unnecessarily.

The transactional nature of text messages is illusory. In order to be able to send these one-off transactions, your phone maintains the on-going connection to the phone company network for you, via cell towers. If you lose the signal, you can’t read or write. This is what we want for the database service too. We don’t want to be opening and closing connections to the service all the time, as it’s very time consuming to set up the handshake (like turning your phone on and off).

The steps we need are, by analogy:

  1. Open a client connection or “handle” to a service, and keep it open. This involves a username and a password.
  2. Open a specific database by name (it can be the same one for everything if you like)
  3. Decide on a “directory” or collection for your data, give that a name, and make it “current”, like changing to that directory
  4. Finally begin to read and write.

We can boil all this down to a straightforward standard approach, which can be used in every program in future:

var dbname string = "SemanticSpacetime"
var service_url string = "http://localhost:8529"
var user string = "root"
var pwd string = "mark"

db := OpenDatabase(dbname, service_url, user, pwd)

The database connection “db” is basically a global variable, pointed to by a URL and with a username and password that we set up in part 2. You’ll need to pass this database handle “db” to all subsequent functions.

The code for this function is shown in full in kv.go.

Freely exchanging data from code to database

From here, we want to be able to add keys and values to ArangoDB directly from the Go structures we defined above. We want to be able to write something like this:

// Add a single value

AddIntKV("ST_Types", db, IntKeyValue{"somekey", 99})

// Store/Add entire map to DB

SaveIntKVMap("ST_Types", db, kv)

// Retrieve map from DB

PrintIntKV(db, "ST_Types")

// Import constant lookup table from DB

var mymap = make(map[string]int)

LoadIntKV2Map(db, "ST_Types", mymap)

// Increment the value for a key, e.g. counting a histogram

IncrementIntKV(db, "ST_Types_Map", "EXPRESSES")

The full code for a program to implement this is in Part4/kv.go. This is not the last simplification we can make, but it shows the approach that we can use as the basis for a library.

Example: time-series patterns

As a final note, let’s think about the way we choose the names of keys we use in a key value store. This is where we need to be aware of space and time, because that’s our model for finding stuff. The naming of objects is basically where their semantics are labelled. So we should choose keys carefully.

Time series are a kind of key-value store which forms the basis for logging and continuous measurements from sensors. The keys are some kind of timestamp. We have various options for counting time:

  • A simple sequence number.
  • The system clock time.
  • The NTP network time.
  • Any pattern for an incremental process.

Of course, there are clocks. Wall clocks wrap around after 12 or 24 hours and start again. That’s actually useful when we want to describe the template of a day, because days are a repeated phenomenon. The problem with Unix clock time is that it represents an infinite number of possible keys. If one isn’t careful, data collected and labelled with an infinity of keys may grow without limit. But, in science, we build on repetition in order to confirm evidence, so we often want keys to represent classes of events, not separate unique events.

Take the example of the human working week. We can extend the clock idea here too, with a finite number of times: Monday-Sunday, 0–23 hours, etc. Every week, we plan and schedule events by calendar (day of the week, time (hour, minute)-slots). We don’t typically resolve time down to the second, because the processes we undertake last minutes to hours. So we can make a simple repeating time-key with just the right resolution, no more and no less:

Mon:Hr09:Min40_45

The semantics of this key are those of a repeating 5-minute time slot throughout the working week. The result is a sharply defined periodogram, like weather or seismograph data recorded on a cylindrical drum. As we update data values for this key (say, counting events that happen during the slot), data will accumulate over many weeks, and we learn the weekly patterns.

Figure 2: Server traffic is a weekly periodic phenomenon. The times are “fixed points” or idempotent times. Every new time maps to one key on the time axis — into the circle that represents the working week. Imagine this graph glued to a cylinder, so that the end and the beginning join. By accumulating values for these keys, we see a strong learning pattern that represents human activity during our working week. The peaks represent Monday, Tuesday, etc, ending with Sunday. Using this mapping we immediately see the significance of the data without expensive autocorrelation analyses.

This simple kind of machine learning has been the basis for CFEngine’s learning of computer behaviours in datacentres since the 1990s. It gives us an expectation value based on semantically labelled key-value samples. The complete code to generate this type of key is shown in Part4/here_and_now.go. Spacetime location is the natural key for unlocking situational data.

Key naming and semantics

Naming things well is one of the hardest problems in knowledge representation — and we’ll return to this topic again. Names and semantics go hand in hand.

Remarkably, most computer system designers don’t seem to think of this divergence in the number of data keys as a problem. However, it is a problem for two reasons:

  1. It’s obviously unsustainable to keep an infinite amount of data even if every item has its own unique key.
  2. Once data are unlimited in number, their interpretation is no longer clear and keys become useless.

Software engineers often find it convenient to imagine infinite resources, but power users like scientists want to manage the amount as well as the result. An infinite number of names isn’t helpful. Numbers, by themselves, don’t carry meaning. We need to know what the numbers mean. The tools for naming and numbering are all straightforward in our tool set.

Summary

Key-value pairs are the simplest kind of spatial point object that can carry semantics: a named location with memory. What we’ve seen in this post, through a few different lenses, is the importance of capturing the semantics that model a process. For understanding patterns, which are repeated intentionally or at least deterministically, the naming and differentiation of phenomena according to their process of origin is how we represent these patterns as semantic knowledge.

In the next installment, I want to dive deeper into this hidden encoding of meaning by looking at extending key-value pattern to more complicated spacetime geometries: directed relationships or graphs and their causality.


Mark Burgess

@markburgess_osl on Twitter and Instagram. Science, research, technology advisor and author - see http://markburgess.org and https://chitek-i.org