Part 1: New Horizons for Sensory Learning
Every volume of space, each snapshot in time, embodies a microcosm of information bursting with meaning. What’s it all for? In our modern world of smart technology, information can be offered to us in the form of services engineered for human consumption. In the world of (say) biology, by contrast, the rich information forms an ad hoc catalogue of diseases, cures, foodstuffs, and more: diverse possibilities that we observe and measure as they present themselves. Joined together, such snapshots recorded in space and time form a larger map of the world around us — a map of spacetime with an interpreted meaning known as semantics.
Neck deep in observations and interactions, we humans wade through a river of information as we go about our business. Some information is intentional — and can be harnessed to govern, even enhance our lives. Sometimes it appears to be just unintentional background noise that we mine for secrets about the universe — encoded rules and principles about phenomena that we strive to decrypt. The difference? Interpretation.
We model information in order to find a way to use it. But modelling isn’t easy: it has to find hidden simplicity in what might at first appear ad hoc.
In this post (the first of a series on the subject of data analysis and the Semantic Spacetime model), I’ll try to explain a little about modelling using some practical tools. The environment may be overflowing with data in the form of numbers, images, and text, but it also expresses complex, context-dependent semantics. We don’t always pay as close attention to this meaning as to raw numbers or descriptions, perhaps because the need to interpret implies that the world isn’t straightforwardly objective, which has traditionally been something of a taboo notion.
During the last decade, technology has begun to improve in its support of capturing and modelling the many meanings of data, as well as the raw numbers and strings of observations themselves. We’ve gone from “focused data” to “big data” to “machine learning” and “AI”. All this implies different kinds of reasoning. New kinds of data models have been developed, along with new kinds of databases, with the potential for making realtime learning of richly annotated data a reality. We can make use of these.
A lot has been written on a theoretical level, but I’ll do my best to illustrate some of these developments from a practical and pedestrian viewpoint, showing how we can approach data analytics, including semantics, in a simple way: one that’s based on universal principles. It’s a way to cut through the complexity in these fields and adopt a surprisingly simple picture of the world that can save a lot of time and effort. I’ll use a practical version of what I’ve already discussed in my book Smart Spacetime (SST), as well as in a number of papers belonging to the Semantic Spacetime Project.
Smart homes, cities, ecosystems — all patterns in spacetime!
In bygone days, it was only scientists who collected or cared about data. The intrepid scientist would perform highly constrained and isolated experiments, controlling the conditions for repeatability. He or she would then collect numbers into columns (by quill into a trusty lab notebook, or today by smart device into data stores), compute averages and error functions, and finally present a Gaussian average picture of the phenomenon in a paper for publication. It’s an approach that has carried us for centuries, but which is less central today. The ultimate goal of such an isolated experiment was to look for patterns, as Alfred North Whitehead put it:
“to see the general in the particular and the eternal in the transitory”
Today, fewer problems are about such isolated experiments, or even about discovering natural phenomena. We’re using data more widely and more casually: building intentional systems that use data for feedback and control in realtime, as well as for documenting histories and processes that we already understand in principle, but perhaps can’t fully comprehend without assistance. We look for diagnostics, causal patterns, and forensic traces amongst implied relationships. Observations are no longer just about scientific repeatability in experiments (i.e. statistically invariant patterns); they are also about human-cybernetic system control. We might be searching for a single customer record in a purchase history, or for a path through a log of supply chain transactions or financial dealings, hunting isolated anomalies, or simply trying to calm an inflamed relationship. Anomalous phenomena, customer relations, supply chain delivery, forensic analysis, biological mutation, social justice — these are all aspects of one giant bio-information network that governs the world around us.
Today, data are (or “is” if you prefer — take your grammatical pick) more often collected directly and automatically, by soft or hard sensors, or even by filling in online forms, as part of some semi-controlled environment. They are then fed into files or some kind of structured storage, with varying degrees of sophistication. The “measurements” yield representations that are numerical, symbolic, image-based, matrix-valued, and so on. And each cluster, document, or “tuple” of data has its own significance, which may be lost unless it can be captured and explained as part of a model that preserves it. This is what we refer to as data semantics.
Databases — actually database systems — are the tools we use for managing data: not just storing it, but also filtering and interpreting it. There are a few basic model types to choose from, and they influence our usage significantly. Conceptually, databases aren’t as different as their designers and vendors like to claim (for market differentiation), but the devilry is in the implementation and in the usage. Some common cases:
- Key-value data: these are the simplest scratch-pad representations of tables or ledgers. All databases are basically key-value stores of some kind, with more or less complicated “values”. The “rows” are labelled by keys, which are names of some kind and can be generalized to URIs to add more structure. Even simple disk storage is a key-value store with a structured index on top. In programming, key-value stores appear as associative arrays or maps. Key-value stores can hold tabular data, like time-series and event logs (where the keys are timestamps), or ledgers (where keys are entities and values are financial transactions). They’re used for configuration data, or for keeping “settings”, as in the Windows Registry or etcd. KV stores are very convenient for machine learning of scalar, vector, and matrix data, like the monitoring of signal traces and the accumulated counting of histograms, measurement by measurement (a name/number key-value pair is exactly a histogram). Keys represent a symbolic interface to the world, and to avoid explosive growth, they should be used with convergent learning models.
- Relational, column, or tabular data (SQL): this is the kind of database that most people are now taught in college. It’s the simplest generalization of a key-value store, based on “tables”, which are essentially templates for data types or schemas. Each element in a table is a named “column” (as in a spreadsheet) and each instance is a row, as in a ledger. The use of rigid schemas is associated with the three normal forms, and the transactional properties of updates during reading and writing are a huge point of difference amongst implementations, as indeed they are for every kind of database.
- Document representations: like relational or tabular data, these are another generalization of key-value stores. They’re often called noSQL databases, since they were the first notable deviation from the relational model. The values in a key-value store benefit from stronger and more detailed semantic elaboration, like putting a folder or document in an archive rather than just a number on a ledger. Because documents are harder to search than simple values, we also build key-value indexes to organize them by different criteria. For instance, in collecting data for personnel files or digital twins, it’s helpful to encapsulate related aspects of a person or device under a single reference.
- Graph representations: these are a relatively new trend, used for expressing structural relationships — data of a different kind, especially those that are non-linear rather than just serial. For example, an “org-chart” would be represented as a graph. A map of services could be a network. Functional relationships between components and processes, such as circuit diagrams and flowcharts, can all be described by graphs. Ad hoc networks, like mobile communications during emergency operations, form real-time changing graph connections between nearby mobile devices. The locations of nodes within a Euclidean or spherical coordinate embedding can be used to represent geo-spatial data. The hub-spoke pattern (see figure 2) is a common configuration that links all the data models in this list: a hub acts as a primary key for referring to its satellite nodes as data (like network address prefixes), it binds independent items into a single entity (a document), and it acts as a cluster focal point for anchoring related items in space or time as a graph.
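The remark above that a name/number key-value pair is exactly a histogram can be made concrete with a toy sketch. Here a plain Python dict stands in for the key-value store; the event names are invented for illustration:

```python
# Frequency learning with a key-value store: each key is a symbolic
# event name, each value an accumulated count -- i.e. a histogram,
# built up measurement by measurement.
from collections import defaultdict

histogram = defaultdict(int)  # the "database": key -> count

def learn(event_key):
    """Accumulate one observation under a symbolic key."""
    histogram[event_key] += 1

# A stream of observations arriving one at a time
for event in ["cpu_high", "disk_full", "cpu_high", "cpu_high", "net_slow"]:
    learn(event)

print(dict(histogram))  # {'cpu_high': 3, 'disk_full': 1, 'net_slow': 1}
```

Note that the set of keys stays bounded as long as the naming scheme is bounded; that's the sense in which the learning is convergent rather than explosive.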
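The triple role of the hub in the hub-spoke pattern can also be sketched in a few lines. The record below is a hypothetical device hub; all field names and values are invented for illustration:

```python
# Hub-spoke: one hub record plays three roles at once --
# primary key (key-value view), container (document view),
# and cluster anchor (graph view).
hub = {
    "key": "device42",                   # key-value role: primary key
    "satellites": {                      # document role: encapsulated aspects
        "cpu":  {"temp_c": 61.5},
        "disk": {"free_gb": 120},
    },
    "links": ["rack7", "service_web"],   # graph role: edges to related hubs
}

# The same record read three ways:
assert hub["key"] == "device42"          # ...as a key-value entry
assert "cpu" in hub["satellites"]        # ...as a document of satellites
assert "rack7" in hub["links"]           # ...as a graph node with edges
```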
There are always choices to be made: should we use a time-series of documents or a document of time-series? Data modelling is an art, and it’s important to get it right, because it forms the foundation for algorithmic efficiency. In the Semantic Spacetime (SST) model, we try to make use of all of these data formats at the right moment.
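The choice posed above is easy to see with the same toy sensor data held in both shapes (timestamps and field names are invented for illustration):

```python
# Shape 1: a time-series of documents -- one document per timestamp.
series_of_docs = {
    "2024-01-01T00:00": {"temp": 20.1, "humidity": 55},
    "2024-01-01T00:05": {"temp": 20.3, "humidity": 54},
}

# Shape 2: a document of time-series -- one series per field.
doc_of_series = {
    "temp":     {"2024-01-01T00:00": 20.1, "2024-01-01T00:05": 20.3},
    "humidity": {"2024-01-01T00:00": 55,   "2024-01-01T00:05": 54},
}

# Shape 1 answers "what was the full state at time t?" in one lookup;
# Shape 2 answers "how did one field evolve over time?" in one lookup.
state_at_t   = series_of_docs["2024-01-01T00:05"]   # {'temp': 20.3, 'humidity': 54}
temp_history = doc_of_series["temp"]                # both temp readings
```

Neither shape is wrong; the efficient one depends on which question the application asks most often.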
In cognitive systems (sensory driven data), the keys that coordinatize data are combinations of the values which sensors record. These combinations express “contexts”. In archival systems, the keys might be names or addresses, serial numbers, or IP locations. These are all just the “names” of “things” in a general labelled spacetime. The way we name things is of critical importance in modelling, because — although we might have endless stories to tell from amassed data storage — unless we know how to begin a story, how to begin searching for an answer, all that content will remain lost.
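The idea that keys can be built from combinations of sensed values can be sketched as follows. The sensors, thresholds, and naming scheme here are all invented for illustration; the point is only that the key is derived from the data, not assigned from outside:

```python
def context_key(temp_c, load):
    """Coordinatize a sample by discretizing sensor values into
    a symbolic context name, which then serves as the storage key."""
    t = "hot"  if temp_c > 30.0 else "cool"
    w = "busy" if load > 0.8    else "idle"
    return f"{t}:{w}"

# Count how often each context occurs, using the derived keys
counts = {}
for temp_c, load in [(35.0, 0.90), (35.5, 0.95), (20.0, 0.10)]:
    key = context_key(temp_c, load)
    counts[key] = counts.get(key, 0) + 1

print(counts)  # {'hot:busy': 2, 'cool:idle': 1}
```

Because the key alphabet is fixed by the discretization, the store stays small no matter how many samples arrive, which is the convergent behaviour mentioned earlier.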
These are some of the challenges of data analysis in the 21st century. The breadth and generality of this story is vast in its implications, and I hope this hints at the issues that will unfold gradually throughout this series.
Why Semantic Spacetime?
As our series unfolds, I’ll try to show that one way to prevent an explosion of complexity in data context and semantics is to go back to first principles and find out what data actually mean, on a deeper level. This is the goal of the Semantic Spacetime model, which is based on Promise Theory. It’s a way to mitigate the potential explosion of complexity, and therefore the need for brute force data processing on a massive scale. The underlying hypothesis is that spacetime relationships underpin everything there is, and while there are different ways to represent data relationships (as key-value pairs, documents, and graphs) the bottom line is that what we are describing is process, i.e. data in motion, which is what spacetime is.
There’s a simple argument for why all of our advanced human concepts must ultimately be modelled on increasingly obfuscated ideas about space and time (for details, see the book Smart Spacetime). Basically it goes like this. Events in spacetime make up everything there is in our environment: something happens here or there, left or right, up or down, it triggers this sensor or that sensor, before or after, etc. It can be argued that biological senses could only have evolved to distinguish these differences, because that’s all there was in the beginning, and so all of our ideas are therefore just different layers of abstraction for those spacetime relationships. No one knew about cars or politics in the primordial soup. The research behind this can be found on my project page.
There’s a lot to cover in this series, but it will be worth taking the time to develop a few of these ideas, because they unlock universal techniques based on the simplest of considerations.
This series is aimed at anyone who is interested, but probably more for scientists than developers, i.e. those who didn’t grow up suckling from Kafka, reading children’s stories in Python, or being taken for rides on Kubernetes by their parents. It’s for researchers and technologists who want to focus on method, on a manageable scale, rather than on taming an ever-growing pile of technology and churning through as much data as possible. I won’t be talking in detail about TensorFlow or Pregel, or indeed any Google-scale tools, because that’s the long tail of the distribution — there’s still a lot we can do in data processing without going there.
To join my audience you just need a laptop, a database, and a straightforward programming language, all of which I’m going to choose for you! In the next part, I’ll look at the practical issues of setting up the tools. You can find the code to the examples in the series here.