Semantic Spacetime and Data Analytics

Figure 1: A local sample of Internet structure from CAIDA data.

CAIDA in Semantic Spacetime

The Center for Applied Internet Data Analysis (CAIDA for short) is a small research wing, located in a pleasant forest grove, deep within the La Jolla campus of the University of California San Diego (UCSD). I’ve visited it only once in person, some 20 years ago, and it felt like a holiday resort with sea views and forest walks! Yet, from this idyllic vantage point, CAIDA has spent the past 25 years researching, mapping, and visualizing the Internet, in spite of the odds being stacked against them. You may have seen their beautiful orchid-like maps, as well as their deep insights into the murky processes that drive the network. These are some of the women and men, led by Kimberly (kc) Claffy, who try to answer the big questions. Getting a grip on data about the Internet is fraught with ambiguity and difficulty. Today the Internet is controlled largely by corporate America’s tech giants, who are not too forthcoming when it comes to independent research. So much for the rumours of no central control. Ingenuity is needed to get around the obstacles.

Figure 2: All socio-technical networks are multi-scale graphs of complex overlapping processes.

Semantic Spacetime Again

One answer to organizing network information, qualitatively and quantitatively, is to use Semantic Spacetime — as highlighted in the main body of this series. Semantic Spacetime models processes as graphs (something like scaffolding built from Feynman diagrams, but generalized to capture more aspects of a process). Semantic Spacetime retains special labels that distinguish the meaning of locations and their relationships, with respect to different flows and processes, mapping between different causal influences.

Figure 3: Probing the local structure of the Internet using traceroute (see part 7 of this series).
Figure 4: Building up a traceroute as radial slices taken over multi-slit maps forms a superposition of possible boundary conditioned paths.
Watch a video introduction to the method for this article
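
As a rough illustration of how separate probes build up one map, the sketch below merges several traceroute hop lists into a single adjacency structure. It is a toy under stated assumptions, not the method used for the article: the function name merge_traces and the hop data are invented, the addresses come from the reserved documentation ranges, and a hop that did not answer is simply given a made-up placeholder name.

```python
from collections import defaultdict

def merge_traces(traces):
    """Merge hop lists into an undirected adjacency map.

    Each trace is a list of hop addresses in path order; None marks a hop
    that did not answer, which is given a made-up placeholder name.
    """
    adjacency = defaultdict(set)
    unknown_count = 0
    for trace in traces:
        named = []
        for hop in trace:
            if hop is None:
                unknown_count += 1
                hop = f"unknown_{unknown_count}"
            named.append(hop)
        # Consecutive hops on a path are treated as adjacent.
        for a, b in zip(named, named[1:]):
            adjacency[a].add(b)
            adjacency[b].add(a)
    return dict(adjacency)

# Hypothetical probes from one vantage point.
traces = [
    ["192.0.2.1", "198.51.100.1", "203.0.113.5"],
    ["192.0.2.1", "198.51.100.1", None, "203.0.113.9"],
]
graph = merge_traces(traces)
```

Every additional trace adds or reinforces links, so the map is only ever as complete as the set of paths that happened to be probed.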

Isn’t it just a graph database?

As a database, a richly semantic graph is unlike the relational model, which computer science has championed for the past three decades. Relational models rely on indexing to find items by random access lookup, using key semantics alone to label or find features. But a graph is not a random access structure: a directed graph is the embodiment of processes or causal flows. It has both qualitative and quantitative meaning at each point, something like the solid state physics of protein chains or granular alloys. The major classes of flow are framed as directed link relationships, and they reveal a deeper underlying sense beneath the highly specific details of connections and role semantics. Only the most recent hybrid graph databases can represent these structures efficiently in all their detail. Here, I used ArangoDB for its simplicity.

Hybrid graphs combine local and global indices

Getting data is half the battle. Analyzing it craves a structural understanding that mirrors both its dynamics and its semantics. Little of what we hear today about graph analytics emphasizes this dynamical or “living” nature of the nodes and links. How do we capture that aspect in a database?

Figure 5: In a traditional data warehouse a tombstone or legend marks each location, but we need an index to find each body. A graph is like a treasure hunt, each node pointing with clues to the next.
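
The difference between the two access patterns can be shown with a toy example in plain Python dictionaries, with no database API involved. A key lookup jumps straight to a record, whereas a graph query starts somewhere and follows directed links, clue by clue. All names here are illustrative.

```python
# Random access: a key index maps a name directly to a record.
records = {
    "N1": {"kind": "device"},
    "N2": {"kind": "device"},
    "N3": {"kind": "device"},
}

# Directed links: each node only knows what it points to next.
links = {"N1": ["N2"], "N2": ["N3"], "N3": []}

def lookup(key):
    """Find a record by its key, independent of any other record."""
    return records[key]

def follow(start, max_hops):
    """Walk the first outgoing link repeatedly, recording the path taken."""
    path = [start]
    node = start
    for _ in range(max_hops):
        if not links[node]:
            break
        node = links[node][0]
        path.append(node)
    return path
```

A hybrid database needs both: the index to find a starting point, and the links to follow the process from there.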

Internet tomography with ArangoDB

Creating a data model inside a database (no matter which of the SQL or noSQL options you choose) isn’t necessarily hard, but creating a model of an Internet-scale graph — one that can be searched and updated effectively to answer relevant questions — is a challenge that can be computationally expensive. Today we have a new generation of databases that can address these modelling challenges, as long as we get the modelling right. This is where the choice of ArangoDB for modelling Semantic Spacetime comes into play.

Figure 6: A semantic spacetime model of the relation semantics for the elements of the ITDK model. This is a timeless representation, so it fails to capture the full dynamics, yet it is a step towards integrating different observations from traceroute, BGP, and DNS.
THE BASIC MODEL:

* <Devices>: locations known to be single bits of machinery. A device may have multiple IP addresses. These are labelled N1, N2, …
    <device> HAS_INTERFACE (+expresses) <ip>
    <device> ADJ_NODE (+near) <device>
    <device> ADJ_UNKNOWN (+near) <unknown>
* <Unknown>: an unknown intermediary involved in tunnelling from one device to another, e.g. a switch, repeater, or MPLS mesh. This has an anonymous IP address, so we have to make up a name for it.
    <device> ADJ_UNKNOWN (+near) <unknown>
* <IPv4/v6>: addresses expressed by <devices>
    <device> DEVICE_IN (-contains) <country>
    <region> REGION_IN (-contains) <country>
* <AS>: A BGP “Autonomous System” entity consists of many devices, may span multiple regions, but non-overlapping
    <device> PART_OF (-contains) <AS>
* <DNSdomain>: BIND entity consists of many IP addresses, may overlap
    <DNSdomain> HAS_ADDR (+expresses) <ip>
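
The relation list above can be read as a small table of typed edges. The following sketch, in plain Python with no database involved, tags each specific relation with the generic class given for it in the list (contains, near, expresses). The +/- direction signs are left out for brevity, and the node identifiers, the IPv4/ and AS/ prefixes, and the helper names are made up for illustration.

```python
from collections import namedtuple

# One edge: source node, specific relation name, generic class, target node.
Edge = namedtuple("Edge", "src semantics st_class dst")

# Relation names and their generic classes, as listed in the basic model.
ST_CLASS = {
    "HAS_INTERFACE": "expresses",
    "ADJ_NODE": "near",
    "ADJ_UNKNOWN": "near",
    "DEVICE_IN": "contains",
    "REGION_IN": "contains",
    "PART_OF": "contains",
    "HAS_ADDR": "expresses",
}

def make_edge(src, semantics, dst):
    """Build an edge, deriving its generic class from the relation name."""
    return Edge(src, semantics, ST_CLASS[semantics], dst)

def neighbours(edges, node, st_class):
    """Targets reachable from node over edges of one generic class."""
    return [e.dst for e in edges if e.src == node and e.st_class == st_class]

# Hypothetical sample data.
edges = [
    make_edge("Devices/N1", "HAS_INTERFACE", "IPv4/192.0.2.1"),
    make_edge("Devices/N1", "ADJ_NODE", "Devices/N2"),
    make_edge("Devices/N1", "PART_OF", "AS/64500"),
]
```

Keeping both the specific name and the generic class on each edge means a query can ask either a narrow question (which interfaces?) or a broad one (what is near?).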

Modelling is always based on answering particular questions, rather than on trying to capture an absolute representation of reality as we believe it to be (our idea of reality is itself based on a subjective point of view).

Figure 7: The aggregation of nodes into cellular grains offers a view based on scaled spatial regions. These have socio-technical significance with respect to different processes and applications.
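
One simple way to picture this aggregation is to collapse device-level adjacencies into weighted links between grains. The sketch below assumes each device belongs to exactly one grain (a region or an AS, say); the function name coarse_grain and the sample data are hypothetical.

```python
from collections import Counter

def coarse_grain(adjacencies, membership):
    """Collapse device-level links into weighted links between grains.

    adjacencies: iterable of (device_a, device_b) pairs.
    membership: dict mapping each device to its grain (e.g. a region or AS).
    Links inside a single grain are dropped; links between grains are counted.
    """
    weights = Counter()
    for a, b in adjacencies:
        ga, gb = membership[a], membership[b]
        if ga != gb:
            # Sort the pair so that (A, B) and (B, A) count as the same link.
            weights[tuple(sorted((ga, gb)))] += 1
    return dict(weights)

# Hypothetical devices and grains.
membership = {"N1": "AS_A", "N2": "AS_A", "N3": "AS_B", "N4": "AS_B"}
adjacencies = [("N1", "N2"), ("N2", "N3"), ("N1", "N4"), ("N3", "N4")]
grains = coarse_grain(adjacencies, membership)
```

Changing the membership map changes the scale of the view without touching the underlying data.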

Quick wins

Big data present non-trivial issues. It took 5 days to upload the full Internet map fragments into ArangoDB with the help of CAIDA’s team, but less than an hour to extract some science: in this case, to confirm the well-known power law scaling (see figure 8), commonly explained by the hypothesis of preferential attachment.
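
The counting behind such a check is simple once the adjacencies are in hand. The sketch below tallies N(k), the number of nodes with k neighbours, and fits a straight line to log N(k) against log k. The least-squares fit is a crude estimator chosen here for brevity, not necessarily what was used for the figure, and the function names are assumptions.

```python
import math
from collections import Counter

def degree_histogram(adjacency):
    """Return N(k): how many nodes have exactly k neighbours."""
    return Counter(len(nbrs) for nbrs in adjacency.values())

def loglog_slope(histogram):
    """Least-squares slope of log N(k) against log k (k > 0 only)."""
    points = [(math.log(k), math.log(n)) for k, n in histogram.items() if k > 0]
    if len(points) < 2:
        raise ValueError("need at least two distinct degrees")
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sxy / sxx

# Toy input: one hub connected to four leaves.
star = {"hub": {"a", "b", "c", "d"},
        "a": {"hub"}, "b": {"hub"}, "c": {"hub"}, "d": {"hub"}}
hist = degree_histogram(star)
slope = loglog_slope(hist)
```

A roughly constant negative slope over a wide range of k is what the approximate power law looks like on a log-log plot.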

Figure 8: A log-log plot of internet node Near adjacency degrees (log k, log N(k)) showing the approximate power law for preferential attachment by counting of inferred adjacencies.
// Show IP addresses for a Device N1
FOR ip IN Contains
  FILTER ip.semantics == "HAS_INTERFACE" && ip._from == "Devices/N1"
  RETURN ip._to

// Show Regions within 40,000 km of x,y (GEO_DISTANCE returns metres)
FOR reg IN Region
  FILTER GEO_DISTANCE([4.60971, -74.08175], reg.coordinates) < 40000000
  FOR dev IN Contains
    FILTER dev.semantics == "DEV_IN"
    FILTER dev._to == reg._id
    RETURN dev._to

// Show Devices within 40,000 km of x,y = 4.60971, -74.08175
FOR reg IN Region
  FILTER GEO_DISTANCE([4.60971, -74.08175], reg.coordinates) < 40000000
  FOR dev IN Contains
    FILTER dev.semantics == "DEV_IN"
    FILTER dev._to == reg._id
    RETURN dev._from

// Show IP addresses within 40,000 km of x,y
FOR reg IN Region
  FILTER GEO_DISTANCE([4.60971, -74.08175], reg.coordinates) < 40000000
  FOR dev IN Contains
    FILTER dev.semantics == "DEV_IN" && dev._to == reg._id
    FOR aif IN Contains
      FILTER aif.semantics == "HAS_INTERFACE" && aif._from == dev._from
      RETURN aif._to

// Show neighbours of N2
FOR a IN Near
  FILTER a.semantics == "ADJ_NODE" && a._from == "Devices/N2"
  RETURN a._to

// Shortest path from N2 to N3
FOR v, e IN ANY SHORTEST_PATH 'Devices/N2' TO 'Devices/N3' Near
  RETURN [v._key, e._key]

// Show geo-locations for N2
FOR p1 IN Contains
  FILTER p1._from == "Devices/N2"
  FOR p2 IN Contains
    FILTER p2._from == p1._to
    RETURN { source: p1._to, region: p2._to, country: p2._from }

// Show IP addresses for Device N40957
FOR p1 IN Contains
  FILTER p1._from == "Devices/N40957" && p1.semantics == "HAS_INTERFACE"
  RETURN { source: p1._to }

// Show DNS domains in N40957
FOR p1 IN Contains
  FILTER p1._from == "Devices/N40957" && p1.semantics == "HAS_INTERFACE"
  FOR p2 IN Expresses
    FILTER p2._to == p1._to
    RETURN { address: p1._to, domain: p2._from }

// Show Devices for domain DNS/example_com
FOR p1 IN Expresses
  FILTER p1._from == "DNS/example_com"
  FOR p2 IN Contains
    FILTER p1._to == p2._to
    RETURN p2._from

// Show me the hyperlinks/wormholes (unknown routing hubs/fabrics)
FOR p1 IN Unknown
  RETURN p1

// Show me the Device addresses that are linked by Unknown tunnels...
FOR unkn IN Contains
  FILTER unkn.semantics == "ADJ_UNKNOWN"
  RETURN unkn._from
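
For readers without a database to hand, the question asked by the shortest path query can be mimicked in memory. The sketch below runs a breadth-first search over adjacency pairs, following links in either direction as the ANY keyword does. It illustrates the idea only and says nothing about how ArangoDB evaluates the query internally; the sample edges are invented.

```python
from collections import deque

def shortest_path(edges, start, goal):
    """Breadth-first search over links followed in either direction.

    Returns the list of nodes from start to goal, or None if unreachable.
    """
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, []).append(b)
        neighbours.setdefault(b, []).append(a)
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk the back-pointers to reconstruct the path.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for nxt in neighbours.get(node, []):
            if nxt not in previous:
                previous[nxt] = node
                queue.append(nxt)
    return None

# Hypothetical adjacencies between devices.
edges = [("N2", "N5"), ("N5", "N3"), ("N2", "N7"), ("N7", "N8"), ("N8", "N3")]
path = shortest_path(edges, "N2", "N3")
```

On an Internet-scale graph this kind of search is exactly what benefits from being run inside the database, close to the data.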

Next steps, new horizons

This example is the merest of beginnings, illustrating how graph spacetime techniques are simple enablers. There’s a long way to go to find a representation which is conducive to capturing all the processes at play in the Internet. Like all complex issues, it’s a matter of scales, in both space and time. How we renormalize detail into coarse-grained regions or approximations is how we identify roles. Integrating time into the picture will be a key challenge. A decade ago, it might have been sufficient to look at the state of the network summarized by the year. Today, the pace of change makes that view useless for studying dynamical events, such as major outages or security infiltrations. The good news is that technology is finally catching up with the need for this type of analysis: combining a hybrid database with a formal model like semantic spacetime grants new purchase on the previously intractable.

Mark Burgess

@markburgess_osl on Twitter and Instagram. Science, research, technology advisor and author - see Http://markburgess.org and Https://chitek-i.org