Semantic Spacetime and Data Analytics

Appendix 1: Exploring the digital barrier reef with CAIDA and ArangoDB

We probably think we know the Internet, but how many can really say they understand how it works? What do you think of when you hear the word? Your favourite websites? Social media? Perhaps the WiFi router supplied by your service provider. Understanding all its moving parts is not for the faint-hearted, yet a few researchers do this for a living (if you call that living!). Like all research, there are both ambiguities and controversies involved, so we're always looking for a new approach. A Semantic Spacetime approach presents new opportunities to unravel those issues, enabled by the new hybrid database technologies for document-graph representation. In this Appendix to the main series, I want to showcase an application of SST to Internet data analytics.

The Internet isn’t so much a web as it is a “living network” — a busy ecosystem of ongoing processes, all of which are connected together to form a large cooperative graph. The graph might appear straightforward, as any snapshot of something in time seems to be, but we shouldn’t mistake a snapshot for the real thing.

Figure 1: A local sample of Internet structure from CAIDA data.

Understanding the Internet takes a lot more than collecting and mapping out the names of websites, domains, IP addresses, or even seeing the physical connections between them. There’s a physical layer, to be sure, but there are also multiple virtual layers in which the actual links between one user and another fluctuate and multiply wildly in real time, much faster than a human could comprehend. There is virtual motion, and also semantic circuitry to expose. The resulting ambiguity is a bit like quantum mechanics: the best picture we can describe is a probabilistic one, an equilibration of underlying flows, summarized by an average "field" that each observer has to probe for themselves to form a current picture. There are slowly moving trends too: addresses change hands, companies come and go, names change, technology is replaced. The Internet is a patchwork of overlapping stories occurring across a multitude of timescales.

CAIDA in Semantic Spacetime

The Center for Applied Internet Data Analysis (CAIDA for short) is a small research wing, located in a pleasant forest grove, deep within the La Jolla campus of the University of California San Diego (UCSD). I’ve visited it only once in person, some 20 years ago, and it felt like a holiday resort with sea views and forest walks! Yet, from this idyllic vantage point, for the past 25 years CAIDA has been researching, mapping, and visualizing the Internet, in spite of the odds being stacked against them. You may have seen their beautiful orchid-like maps, as well as their deep insights into the murky processes that drive the Internet. These are some of the women and men, led by Kimberly (kc) Claffy, who try to answer the big questions. Getting a grip on data about the Internet is fraught with ambiguity and difficulty. Today the Internet is controlled largely by corporate America’s tech giants, who are not too forthcoming when it comes to independent research. So much for the rumours of no central control. Ingenuity is needed to get around the obstacles.

Figure 2: All socio-technical networks are multi-scale graphs of complex overlapping processes.

On one level, the Internet is a graph — indeed, it’s several, depending on how you look at it — but how should one capture its nuances and model it? That’s a long term project with enormous scope, so we have to begin with the elementary questions — to capture its main structural and semantic relationships.

Mapping out the Internet is a bit like taking satellite imagery of the Earth: we can see some things in plain sight, but we can’t see the insides of buildings or underground tunnels. Similarly, the Internet hides some information, both for security and privacy reasons, as well as for reasons of scale. Some information has to be based on inference. Probing the data requires sophisticated machine learning — not the headline grabbing Deep Learning kind, but rather the more widespread learning about changing signals, by automated sampling and reasoning, used to run the biggest investigative and predictive endeavours of our time, from particle physics to supply chain management.

CAIDA publishes a portion of their results as monthly public sets that capture a snapshot tomography of the inferred structure based on complex and heuristically-guided algorithms. This is the ITDK project (for Internet Topology Data Kit). It’s a small part of the whole picture, but one we can use as an example of semantic spacetime modelling, because it offers a recognizable picture of the Internet — something roughly analogous to the wavefunction for the Internet.

Semantic Spacetime Again

One answer to organizing network information, qualitatively and quantitatively, is to use Semantic Spacetime — as highlighted in the main body of this series. Semantic Spacetime models processes as graphs (something like scaffolding built from Feynman diagrams, but generalized to capture more aspects of a process). Semantic Spacetime retains special labels that distinguish the meaning of locations and their relationships, with respect to different flows and processes, mapping between different causal influences.

Figure 3: Probing the local structure of the Internet using traceroute (see part 7 of this series).

In Part 7 of the main series about Semantic Spacetime and Data Analytics, I showed how network probing, using traceroute, could reveal the structure of a network, thread by thread and step by step (see figure 4). It’s a laborious undertaking that CAIDA has specialized in — sewing together endless probes into a kind of snapshot, accumulated from an ensemble of millions of independent processes. ITDK is summarized methodically as a published history of effective snapshots for the entire Internet.

Figure 4: Building up a traceroute as radial slices taken over multi-slit maps forms a superposition of possible boundary conditioned paths.

Where are space and time in this picture? The Internet’s nodes and organizations are different flavours of space: the succession of summaries describes one kind of (adiabatic equilibrium) time, while the individual probes map out transitions that form the proper times of millions of independent observer processes. As usual, it's spacetime all the way down.

The result is effectively a giant non-equilibrium field of fluctuating causal influence–quite similar to a quantum mechanical wavefunction in a number of ways (see the discussion in part 7). The main difference is that the individual causal pathways remain in pinpoint graphical form, rather than being smeared out as a functional Euclidean embedding, or a field. But, while embedding functions are a popular approach for approximating inferences in modern machine learning, the Internet is too complex an ecosystem to be represented as a purely quantitative field. Its richness of semantics is what holds the keys to understanding it on multiple levels.

Isn’t it just a graph database?

As a database, a richly semantic graph is unlike a relational type model, which computer science has championed for the past three decades. Relational models rely on indexing to find items by random access lookup, using key semantics alone to label or find features. But a graph is not a random access structure: a directed graph is the embodiment of processes or causal flows. It has both qualitative and quantitative meaning at each point–something like the solid state physics of protein chains or granular alloys. The major classes of flow are framed as directed link relationships, and they reveal a deeper underlying sense beneath the highly specific details of connections and role semantics. Only the most recent hybrid graph databases can represent these structures efficiently in all their detail. Here, I used ArangoDB for its simplicity.

The major edge or link types of a semantic spacetime fall into four basic classes, which — on a deep level — boil down to differences in location and time in physical and virtual spaces. Summarizing snapshots over all space is straightforward, but the deeper causal time structure, on a microscopic level, remains a big challenge–because its independent dimensionality is huge.

The Semantic Spacetime idea was developed from underlying Promise Theory, as a description of graphs and their functions. It addresses the key causal distinctions: granular containment of patterns within larger patterns (sub-regions within regions), the ordering of events by sequence, the capturing and expressing of local properties, and notions of similarity. These can all be stored easily within partitioned "collections" for efficient searching.
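As a minimal sketch of this partitioning idea (the link-type names follow the skeletal model later in this appendix; the mapping itself is illustrative, not a prescribed schema), the four classes can be used to route each named link type into its own collection:

```python
# A minimal sketch (illustrative, not CAIDA's actual schema) of the four
# Semantic Spacetime link classes and how named link types map onto them.

SST_CLASSES = ("CONTAINS", "FOLLOWS", "EXPRESSES", "NEAR")

# Each concrete link type belongs to exactly one class. FOLLOWS (event
# ordering) has no members here, since a timeless snapshot can't supply it.
LINK_TYPE_CLASS = {
    "DEVICE_IN":     "CONTAINS",   # granular containment: device within region
    "REGION_IN":     "CONTAINS",   # sub-region within region
    "PART_OF":       "CONTAINS",   # device within a BGP Autonomous System
    "HAS_INTERFACE": "EXPRESSES",  # local property: device expresses an IP
    "HAS_ADDR":      "EXPRESSES",  # DNS domain expresses an address
    "ADJ_NODE":      "NEAR",       # adjacency between devices
    "ADJ_UNKNOWN":   "NEAR",       # adjacency via an anonymous intermediary
}

def collection_for(link_type: str) -> str:
    """Partition link types into edge collections by SST class, so a
    search within one class never has to filter out unrelated semantics."""
    return LINK_TYPE_CLASS[link_type].capitalize()  # e.g. "Near", "Contains"
```

This is why the queries later in this appendix can scan a collection like Near or Contains directly, rather than one undifferentiated pile of edges.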

There are plenty of advantages to putting data into a database. With any large amount of data, the challenge is how to organize it and access it to answer questions. But until recently, databases have been optimized for warehousing or spreadsheet archiving — they weren’t suited to asking the kind of questions scientists need. That’s changing today, as machine learning and data science take hold in business too. New databases, like the ArangoDB used in this series, enable efficient semantic structuring of data that goes beyond the connected spreadsheet model — to something more like a connected “brain”.

In one sense, then, the Internet could be trivialized — it’s just the largest man-made graph database we never really noticed. But that’s too simplistic. Unlike most infrastructure networks, e.g. the power grid or the sewage network, the Internet is highly inhomogeneous. It has enormous variety, expressing rich semantics, from sophisticated services to the functional roles of all the moving parts–more like bioinformatics than the electrical power grid. Nodes (graph vertices) and links (graph edges) are not all the same, because they are home to very different identities. The structures are every bit as complex and as targeted as bioinformatic networks (effectively artificial life), yet they change more quickly. The Internet is functional, not just connected. It’s both a Yellow Pages (by function) and a White Pages (by name) directory of places and services. Each node has very different behaviours, i.e. different characteristics or different semantics, as well as different roles within processes across a variety of scales. This is where a hybrid database like ArangoDB enters the picture.

Hybrid graphs combine local and global indices

Getting data is half the battle. Analyzing it craves a structural understanding that mirrors both its dynamics and its semantics. Little of what we hear today about graph analytics emphasizes this dynamical or “living” nature of the nodes and links. How do we capture that aspect in a database?

Databases have traditionally been thought of as lifeless archives: “data lakes” or “data warehouses”, not living digital ecosystems with exploratory, active tentacles. Finding something in a classical database is like searching for a tomb in a graveyard — you have to read all the labels on all the tombstones until you find the one you're looking for. If you’re lucky there’s a smaller index map telling you where to go (figure 5). But this is where graphs come into the picture–every node in a graph has an index pointing to its close neighbours, and can optimize for searches along paths in a way that global indices can’t.
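The contrast can be sketched in a few lines of Python (toy data, invented node names): both methods find the same neighbours, but one reads every tombstone while the other follows the clues stored at the node itself.

```python
# A toy contrast between a global "graveyard" scan and a graph's per-node
# neighbour index. Both answer "who is adjacent to N2?".

edges = [  # flat edge table, as a classical database would store it
    ("N1", "N2"), ("N2", "N3"), ("N2", "N4"), ("N5", "N6"),
]

def neighbours_by_scan(node):
    # Brute force: read every record in the table
    return sorted({b for a, b in edges if a == node} |
                  {a for a, b in edges if b == node})

# Build the local index once: each node points to its own neighbours
adjacency = {}
for a, b in edges:
    adjacency.setdefault(a, set()).add(b)
    adjacency.setdefault(b, set()).add(a)

def neighbours_by_index(node):
    # Treasure hunt: follow the links stored at the node itself
    return sorted(adjacency.get(node, set()))

assert neighbours_by_scan("N2") == neighbours_by_index("N2") == ["N1", "N3", "N4"]
```

For a path query of length k, the scan repeats its full pass k times, while the local index touches only the nodes along the path; this is the asymmetry that graph databases exploit.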

Graphs are the embodiment of processes — more like a treasure hunt than a brute force directory search. Graphs arise from processes, they represent processes (flows and contractual bindings), and we interact with them through later processes of parsing and traversal of the data. Traditional representations of graphs were rooted in ideas about logic, like the flat graphs of the semantic web (RDF, OWL, etc). These have proven to be rigid and inflexible for complex applications. The complexity of “living things” isn’t very logical when it comes down to it, and being too rigid in one's ideals for modelling works against you.

Figure 5: In a traditional data warehouse a tombstone or legend marks each location, but we need an index to find each body. A graph is like a treasure hunt, each node pointing with clues to the next.

Understanding the Internet in a dynamical way is a more subtle challenge that spans several virtual layers: locations have addresses, different paths and organizations have borders, subnet boundaries and so forth, but there are also domain names and BGP AS numbers, all on different scales. The boundaries of the cellular structure are complex (see figure 7). Inventory data may form data lakes and warehouses, but graphs are ecosystems not inventory. All this has to deal with partial information and uncertainty — you know, the stuff of science.

Internet tomography with ArangoDB

Creating a data model inside a database (no matter which of the SQL or noSQL options you choose) isn’t necessarily hard, but creating a model of an Internet-scale graph — one that can be searched and updated effectively to answer relevant questions — is a challenge that can be computationally expensive. Today we have a new generation of databases that can address these modelling challenges, as long as we get the modelling right. This is where the choice of ArangoDB for modelling Semantic Spacetime comes into play.

What’s wrong with relational SQL? That’s probably the wrong question to ask. The trap that many fall into when modelling multi-key relationships is to think in terms of a type model (a schema or object-class model) in which things (labelled by some primary key) are the central focus. It’s a global-entity model view, which SQL was designed for. In a graph, the preferred approach is to look at the local processes in a model. There is context in every position. The connections between every node in a graph behave like a local index or service directory, with labelled edges offering both yellow pages and white pages directories to relevant leads. In entity modelling, entity attributes are associated with a primary key, whose significance is global. With a document graph, both global attributes and local contextualized relationships become the keys to unlocking insights.

ArangoDB was chosen for its ability to partition characteristics by their roles as well as their attributes into a “collection” model. The Semantic Spacetime principles tell you how. There’s no need to formally lock down data schemas, when the ecosystem needs to express diversity. Instead, we can let data expand organically, and we’re always thinking about how we would want to search the information (as a process).

Figure 6 below shows the skeletal model for CAIDA’s ITDK time-independent snapshot data sets. Unlike a classical entity data model, it separates globally indexable keys from their local index relationships based on the kinds of question one would expect to ask — i.e. based on process semantics rather than presumed data type or schema regularity. Like all graphs, it’s mainly based on triplets (rigid and opinionated triplet models were tried before with technologies like OWL and RDF, but often got into trouble). It’s not a pure triplet model, however, because labels can be hierarchically structured documents. Annotated document attributes avoid many of the pitfalls of pure RDF models, as long as you use them sensibly. In a Semantic Spacetime approach, we stay out of trouble by ensuring that graph relations always represent a causal model of space or time — for some abstract or real space.

Figure 6: A semantic spacetime model of relation semantics for the elements of the ITDK model. This is a timeless representation, so it fails to capture the full dynamics — yet it is a step towards integrating different observations from traceroute, BGP, and DNS.

Although the arrow relationships fall into four key categories (containment, precedence, expression, and nearness) of Semantic Spacetime, the precisely named meaning still has to be added to clarify the full semantics. In a functional system, semantics are of primary importance to understanding behaviour. The core skeleton of the model consists of these bones:

THE BASIC MODEL:

* <Devices>: locations known to be single bits of machinery. A device may have multiple IP addresses. These are labelled N1, N2, …

    <device> HAS_INTERFACE (+expresses) <ip>
    <device> ADJ_NODE (+near) <device>
    <device> ADJ_UNKNOWN (+near) <unknown>

* <Unknown>: an unknown intermediary involved in tunnelling from one device to another, e.g. a switch, repeater, or MPLS mesh. This has an anonymous IP address, so we have to make up a name for it.

    <device> ADJ_UNKNOWN (+near) <unknown>

* <IPv4/v6>: addresses expressed by <devices>

    <device> DEVICE_IN (-contains) <country>
    <region> REGION_IN (-contains) <country>

* <AS>: a BGP “Autonomous System” entity; consists of many devices, may span multiple regions, but non-overlapping

    <device> PART_OF (-contains) <AS>

* <DNSdomain>: a BIND entity; consists of many IP addresses, may overlap

    <DNSdomain> HAS_ADDR (+expresses) <ip>
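To make the skeleton concrete, here is a minimal, self-contained Python sketch of documents in the shape ArangoDB uses (with _from, _to, and a semantics label on each edge). The data are invented for illustration, not CAIDA's actual records; only the key format "Devices/N1" follows the examples in the text.

```python
# Node documents, keyed by collection/key as in ArangoDB (invented data)
devices = {"Devices/N1": {"kind": "device"}, "Devices/N2": {"kind": "device"}}
ips     = {"IP/192_168_1_1": {"addr": "192.168.1.1"}}

# Edge collections: each edge carries its semantics label, so queries
# can filter on meaning rather than guessing from structure.
contains = [
    {"_from": "Devices/N1", "_to": "IP/192_168_1_1", "semantics": "HAS_INTERFACE"},
    {"_from": "Devices/N1", "_to": "Region/la_jolla", "semantics": "DEVICE_IN"},
]
near = [
    {"_from": "Devices/N1", "_to": "Devices/N2", "semantics": "ADJ_NODE"},
]

def interfaces_of(device_key):
    """Pure-Python mirror of the AQL pattern:
    FOR ip IN Contains FILTER ip.semantics == 'HAS_INTERFACE' ..."""
    return [e["_to"] for e in contains
            if e["semantics"] == "HAS_INTERFACE" and e["_from"] == device_key]

assert interfaces_of("Devices/N1") == ["IP/192_168_1_1"]
```

In the real database the filtering is done by indexed collections rather than list comprehensions, but the document shapes are the same.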

From ITDK snapshot data, which are timeless within each sampling epoch, we can’t infer process orderings or events, so only three of the four relations are captured by the snapshot data.

Adjacency is a measure of nearness or proximity in space — a kind of spatial similarity (NEAR). It’s not a property of one location, so it belongs to both nodes. Nodes can be adjacent in different senses however. We use different kinds of labels for links between different kinds of entities, because these would be searched differently, and we benefit from not having to filter out unwanted connections.

Containment of regions within other regions is a straightforward hierarchy relation (CONTAINS/IS_CONTAINED_BY). Attribute expressions, HAS_INTERFACE and HAS_ADDRESS, are expressions of properties local to a node. We could also represent these as document attributes and arrange for an index to find them quickly. However, by making HAS_INTERFACE a link relation, we can search by both the owner and the property, and allow multiple nodes to express the same property without having to discover the non-uniqueness by brute force searching.
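A small sketch shows the payoff of the link representation (invented data): with an edge collection we can query from the property end, so two devices expressing the same address show up immediately rather than by scanning every node document.

```python
# HAS_INTERFACE modelled as edges (invented example data), so the
# property end of the link is directly searchable.
contains = [
    {"_from": "Devices/N1", "_to": "IP/10_0_0_1", "semantics": "HAS_INTERFACE"},
    {"_from": "Devices/N2", "_to": "IP/10_0_0_1", "semantics": "HAS_INTERFACE"},
    {"_from": "Devices/N2", "_to": "IP/10_0_0_2", "semantics": "HAS_INTERFACE"},
]

def owners_of(ip_key):
    # Reverse lookup: which devices express this address?
    return sorted(e["_from"] for e in contains
                  if e["semantics"] == "HAS_INTERFACE" and e["_to"] == ip_key)

# Non-uniqueness (two devices claiming one address) falls out for free:
assert owners_of("IP/10_0_0_1") == ["Devices/N1", "Devices/N2"]
```

Had the addresses been buried as document attributes, the same question would need either a full scan or a separately maintained global index.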

Modelling is always about answering particular questions rather than trying to capture an absolute representation of reality as we believe it to be; any model is built from a subjective point of view.

Figure 7: The aggregation of nodes into cellular grains offers a view based on scaled spatial regions. These have socio-technical significance with respect to different processes and applications.

The core associations in the model appear superficially as triplets (node,link,node), as one might expect in a semantic web model (e.g. RDF), but there are hidden benefits to a document graph: links in an internet connection are promise-theoretic bindings from sender address to receiver address, so we need to label each intentional binding with extra delivery-chain data too. The full detail of the Internet supply chain can’t easily be captured by something like RDF without making a mess. Luckily, the flexibility of a document graph approach makes light work of it. It serves as a great example for the future of analytical modelling.
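For illustration, a single document-graph edge might look like the dictionary below. Every field except _from, _to, and semantics is an invented example of the kind of delivery-chain annotation an RDF triple has no room for:

```python
# One annotated edge document (field names beyond the ArangoDB basics
# are invented for illustration). An RDF-style triple could only say
# (N1, ADJ_NODE, N2); the document edge keeps the binding's context.
link = {
    "_from": "Devices/N1",
    "_to": "Devices/N2",
    "semantics": "ADJ_NODE",      # the SST relation label
    # Promise-theoretic binding detail, attached directly to the link:
    "sender_addr": "10.0.0.1",
    "receiver_addr": "10.0.0.2",
    "observed_hops": 3,
}

assert link["semantics"] == "ADJ_NODE" and link["observed_hops"] == 3
```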

Quick wins

Big data present non-trivial issues. It took 5 days to upload the full Internet map fragments into ArangoDB with the help of CAIDA’s team, but less than an hour to extract some science — in this case, to confirm the well known power law scaling (see figure 8), believed to arise from preferential attachment.

Figure 8: A log-log plot of internet node Near adjacency degrees (log k, log N(k)) showing the approximate power law for preferential attachment by counting of inferred adjacencies.
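The counting behind a plot like figure 8 is straightforward once adjacency lives in its own collection. A minimal sketch, on synthetic data (a real run would stream the Near collection instead):

```python
# Tally node degrees from adjacency links, then bucket them into N(k),
# the quantity plotted on the log-log axes of figure 8. Synthetic edges.
from collections import Counter

edges = [("N1", "N2"), ("N1", "N3"), ("N1", "N4"), ("N2", "N3"), ("N4", "N5")]

degree = Counter()
for a, b in edges:          # each undirected adjacency adds to both ends
    degree[a] += 1
    degree[b] += 1

# N(k): how many nodes have degree k
N_of_k = Counter(degree.values())

assert degree["N1"] == 3
assert N_of_k == {3: 1, 2: 3, 1: 1}
```

On real ITDK data, plotting log k against log N(k) and fitting the slope gives the power law exponent; the histogram step above is all the database has to support efficiently.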

This is certainly old news, but it paves the way to answering more complex questions: which paths from A to B pass through a list of sovereignties without an acceptable data policy, or close to the geo-region of a natural disaster? Are there any paths along which one cannot guarantee data integrity, due to black box firewalling? These are questions that rely on semantics, not on connectivity alone, and they can be combined with quantitative measures such as eigenvector centrality scores.

Once data are organized along spacetime principles and stored in an efficient database, these questions become straightforward to answer. And this is only the beginning.

Big Data present challenges for the representation, presentation, querying, and computational methods that we use to find meanings across a range of scales. Scaling is nearly always the underestimated and least well understood part of computation. Graph methods and technologies have yet to embrace optimizations like the Monte Carlo approaches developed decades ago, but these will no doubt influence the next phase in the development of graph database platforms, a subject I've already written about privately in consulting work.

Some other kinds of queries:

// Show IP addresses for a Device N1
FOR ip IN Contains FILTER ip.semantics == "HAS_INTERFACE" && ip._from == "Devices/N1" RETURN ip._to

// Show Regions within 40,000 km of x,y
FOR reg IN Region FILTER GEO_DISTANCE([4.60971, -74.08175], reg.coordinates) < 40000000 FOR dev IN Contains FILTER dev.semantics == "DEV_IN" FILTER dev._to == reg._id RETURN dev._to

// Show Devices within 40,000 km of x,y = 4.60971, -74.08175
FOR reg IN Region FILTER GEO_DISTANCE([4.60971, -74.08175], reg.coordinates) < 40000000 FOR dev IN Contains FILTER dev.semantics == "DEV_IN" FILTER dev._to == reg._id RETURN dev._from

// Show IP addresses within 40,000 km of x,y
FOR reg IN Region FILTER GEO_DISTANCE([4.60971, -74.08175], reg.coordinates) < 40000000 FOR dev IN Contains FILTER dev.semantics == "DEV_IN" && dev._to == reg._id FOR aif IN Contains FILTER aif.semantics == "HAS_INTERFACE" && aif._from == dev._from RETURN aif._to

// Show neighbours of N2
FOR a IN Near FILTER a.semantics == "ADJ_NODE" && a._from == "Devices/N2" RETURN a._to

// Shortest path from N2 to N3
FOR v, e IN ANY SHORTEST_PATH 'Devices/N2' TO 'Devices/N3' Near RETURN [v._key, e._key]

// Show geo-locations for N2
FOR p1 IN Contains FILTER p1._from == "Devices/N2" FOR p2 IN Contains FILTER p2._from == p1._to RETURN { source: p1._to, region: p2._to, country: p2._from }

// Show IP addresses for Device N40957
FOR p1 IN Contains FILTER p1._from == "Devices/N40957" && p1.semantics == "HAS_INTERFACE" RETURN { source: p1._to }

// Show DNS domains in N40957
FOR p1 IN Contains FILTER p1._from == "Devices/N40957" && p1.semantics == "HAS_INTERFACE" FOR p2 IN Expresses FILTER p2._to == p1._to RETURN { source: p1._to, region: p2._to, country: p2._from }

// Show Devices for domain DNS/example_com
FOR p1 IN Expresses FILTER p1._from == "DNS/example_com" FOR p2 IN Contains FILTER p1._to == p2._to RETURN p2._from

// Show the hyperlinks/wormholes (unknown routing hubs/fabrics)
FOR p1 IN Unknown RETURN p1

// Show the Device addresses that are linked by Unknown tunnels
FOR unkn IN Contains FILTER unkn.semantics == "ADJ_UNKNOWN" RETURN unkn._from
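For readers who want to see what the SHORTEST_PATH query above actually does, here is a plain-Python analogue: a breadth-first search over Near edges in either direction (ANY), from Devices/N2 to Devices/N3. The edge data are invented; in practice the database performs this traversal internally.

```python
# Breadth-first shortest path over Near edge documents (invented data),
# mirroring: FOR v, e IN ANY SHORTEST_PATH 'Devices/N2' TO 'Devices/N3' Near
from collections import deque

near = [
    {"_from": "Devices/N2", "_to": "Devices/N7"},
    {"_from": "Devices/N7", "_to": "Devices/N3"},
    {"_from": "Devices/N2", "_to": "Devices/N8"},
]

adj = {}
for e in near:  # ANY direction: index both ends
    adj.setdefault(e["_from"], []).append(e["_to"])
    adj.setdefault(e["_to"], []).append(e["_from"])

def shortest_path(src, dst):
    prev, queue = {src: None}, deque([src])
    while queue:
        v = queue.popleft()
        if v == dst:  # unwind the predecessor chain into a path
            path = []
            while v is not None:
                path.append(v)
                v = prev[v]
            return path[::-1]
        for w in adj.get(v, []):
            if w not in prev:
                prev[w] = v
                queue.append(w)
    return None  # unreachable

assert shortest_path("Devices/N2", "Devices/N3") == ["Devices/N2", "Devices/N7", "Devices/N3"]
```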

Next steps, new horizons

This example is the merest of beginnings, illustrating how graph spacetime techniques are simple enablers. There’s a long way to go to find a representation conducive to capturing all the processes at play in the Internet. Like all complex issues, it’s a matter of scales, in both space and time. How we renormalize detail into coarse grained regions or approximations is how we identify roles. Integrating time into the picture will be a key challenge. A decade ago, it might have been sufficient to look at the state of the network summarized by the year. Today, the pace of change makes that view useless for studying dynamical events, such as major outages or security infiltrations. The good news is that technology is finally catching up with the need for this type of analysis: combining a hybrid database with a formal model like semantic spacetime grants new purchase on the previously intractable.

Of course, there isn’t a unique way to model anything, but Semantic Spacetime combined with document graphs is a strong set of principles to steer by. It’s all about what you want to do with data. That’s both a benefit and a hazard. It’s a hazard because we are profoundly relativistic in our beliefs and we end up seeing what we want to see — dealing with semantics is always a subjective issue between signal and receiver. But it’s a benefit because we now have the tools to explore all those different views without starting from scratch. Data collection shouldn’t be mere inventory or stamp collecting — it should be more like forensic inquiry, building causal inferences with statistical meaning.

The challenge of using a database like ArangoDB to navigate CAIDA topology data is interesting from a few perspectives. First, it shows how something apparently as simple as data circuitry can be a lot more complicated than expected, once we consider what processes are going on inside it. It also reveals that how we model with the right tools can be critical to either help or hinder progress. The skeletal model of the Internet reveals just a few core concepts, but also reveals several independent “spaces” that coexist within the network (representing IP, DNS, BGP, geospatial data), each with its own semantics and dynamics.

This overlap of conceptual spaces, concepts, and contexts will only grow as mobile edge computing (think 5G and beyond) develops. Technologists have so far bragged mainly about the speed of 5G, but its promise goes far beyond speed, to semantics and services too. Having semantic principles that sew together document information with process graphs is a glimpse of the future of data analytics. And exploratory technologies like ArangoDB are pushing the envelope here — think of supply chains, or economic and monetary networks, with a million shadow economies built on loyalty points and memberships. The security and effectiveness of the service economy hides in plain sight, waiting for CAIDA’s foundational work to probe it.

As a test case for graph modelling with a database like ArangoDB, CAIDA’s topology data illustrates just one of many network models with non-trivial semantics. Arango’s internal “collection model” facilitates this modelling, where some of the first generation graph databases made matters difficult for the modeller with their purely type-based approaches. If we were to add e-commerce into the picture, imagine what things might look like then!

Today the industry is only awakening to the possibilities of graph modelling, preferring to focus on the more publicized number crunching techniques, like those embraced by Google with its PageRank algorithm. Yet number crunching is just the tip of a huge iceberg, addressing only one aspect of the network of data. Getting data is not easy, so we can thank CAIDA for the gift of already rich public data sets. Still missing from the ITDK data is the crucial aspect of time on all its scales. In terms of the physics of information, what we've found here is a steady state picture of thermal states, or something like a system wavefunction, but we're still missing the low level equations of motion for the underlying processes.

Capturing a network’s true nature(s) is more like mapping a flourishing ocean reef, spanning subject fields from biology, ecology, technology, financial data, and more. The field of semantic data representation is swimming into new territory.

Science, research, technology, author - see https://markburgess.org and https://chitek-i.org