Semantic Spacetime and Data Analytics

Figure 1: A local sample of Internet structure from CAIDA data.

CAIDA in Semantic Spacetime

Figure 2: All socio-technical networks are multi-scale graphs of complex overlapping processes.

Mapping out the Internet is a bit like taking satellite imagery of the Earth: we can see some things in plain sight, but we can’t see the insides of buildings or underground tunnels. Similarly, the Internet hides some information, partly for security and privacy reasons and partly for reasons of scale, so some of it has to be inferred. Probing the data requires sophisticated machine learning: not the headline-grabbing Deep Learning kind, but the more widespread kind that learns about changing signals by automated sampling and reasoning, and that is used to run the biggest investigative and predictive endeavours of our time, from particle physics to supply chain management.

Semantic Spacetime Again

Figure 3: Probing the local structure of the Internet using traceroute (see part 7 of this series).
Figure 4: Building up a traceroute as radial slices taken over multi-slit maps forms a superposition of possible boundary conditioned paths.
Watch a video introduction to the method for this article

Isn’t it just a graph database?

Hybrid graphs combine local and global indices

Figure 5: In a traditional data warehouse a tombstone or legend marks each location, but we need an index to find each body. A graph is like a treasure hunt, each node pointing with clues to the next.

Internet tomography with ArangoDB

What’s wrong with relational SQL? That’s probably the wrong question to ask. The trap that many fall into when modelling multi-key relationships is to think in terms of a type model (a schema or object-class model) in which things (labelled by some primary key) are the central focus. It’s a global-entity model view, which SQL was designed for. In a graph, the preferred approach is to look at the local processes in a model. There is context in every position. The connections between every node in a graph behave like a local index or service directory, with labelled edges offering both yellow pages and white pages directories to relevant leads. In entity modelling, entity attributes are associated with a primary key, whose significance is global. With a document graph, both global attributes and local contextualized relationships become the keys to unlocking insights.
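To make the contrast between a global key and a local directory concrete, here is a minimal sketch in Python. It is not taken from the ITDK data or from ArangoDB: the node names, the address, and the AS number are invented for illustration, and the helper names `lookup_global` and `follow` are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy data: every identifier below is invented for illustration.
documents = {
    "Devices/N1": {"kind": "device"},
    "Devices/N2": {"kind": "device"},
    "IPv4/192.0.2.1": {"kind": "ip"},
    "AS/64500": {"kind": "as"},
}

# Labelled edges as (source, label, target) triples.
edges = [
    ("Devices/N1", "HAS_INTERFACE", "IPv4/192.0.2.1"),
    ("Devices/N1", "ADJ_NODE", "Devices/N2"),
    ("Devices/N1", "PART_OF", "AS/64500"),
]

# Local index: each node carries its own directory of leads, grouped by label.
local_index = defaultdict(lambda: defaultdict(list))
for source, label, target in edges:
    local_index[source][label].append(target)


def lookup_global(key):
    """Global view: find a document by its primary key."""
    return documents[key]


def follow(node, label):
    """Local view: follow the edges with a given label out of one node."""
    return list(local_index.get(node, {}).get(label, []))


print(lookup_global("Devices/N1"))
print(follow("Devices/N1", "ADJ_NODE"))
```

The global dictionary answers "what is this thing?", while the per-node directory answers "where can I go from here, and why?". A document graph keeps both kinds of lookup available.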

Figure 6: A semantic spacetime model of the relation semantics for the elements of the ITDK model. This is a timeless representation, so it fails to capture the full dynamics, yet it is a step towards integrating different observations from traceroute, BGP, and DNS.
THE BASIC MODEL:

* <Devices>: locations known to be single bits of machinery. A device may have multiple IP addresses. These are labelled N1, N2, …
    <device> HAS_INTERFACE (+expresses) <ip>
    <device> ADJ_NODE (+near) <device>
    <device> ADJ_UNKNOWN (+near) <unknown>
* <Unknown>: an unknown intermediary involved in tunnelling from one device to another, e.g. a switch, repeater, or MPLS mesh. This has an anonymous IP address, so we have to make up a name for it.
    <device> ADJ_UNKNOWN (+near) <unknown>
* <IPv4/v6>: addresses expressed by <devices>
    <device> DEVICE_IN (-contains) <country>
    <region> REGION_IN (-contains) <country>
* <AS>: A BGP “Autonomous System” entity consists of many devices, may span multiple regions, but non-overlapping
    <device> PART_OF (-contains) <AS>
* <DNSdomain>: BIND entity consists of many IP addresses, may overlap
    <DNSdomain> HAS_ADDR (+expresses) <ip>
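As a rough sketch of how the relations in the basic model could be turned into edge documents, the Python below maps each relation name to its association type and sign. The fields `_from`, `_to`, and `semantics` follow the queries used in this article; the fields `sttype` and `sign`, the helper names, and the sample identifiers are assumptions made for illustration. The grouping also follows the model as listed, which may differ from the collection layout of the actual database.

```python
# Relation name -> (association type, sign), transcribed from the basic model.
RELATIONS = {
    "HAS_INTERFACE": ("expresses", "+"),
    "ADJ_NODE": ("near", "+"),
    "ADJ_UNKNOWN": ("near", "+"),
    "DEVICE_IN": ("contains", "-"),
    "REGION_IN": ("contains", "-"),
    "PART_OF": ("contains", "-"),
    "HAS_ADDR": ("expresses", "+"),
}


def make_edge(source, relation, target):
    """Build one edge document; raises KeyError for an unknown relation."""
    sttype, sign = RELATIONS[relation]
    return {
        "_from": source,
        "_to": target,
        "semantics": relation,
        "sttype": sttype,  # hypothetical field name
        "sign": sign,      # hypothetical field name
    }


def group_by_type(edges):
    """Group edge documents by association type (near, contains, expresses)."""
    groups = {}
    for edge in edges:
        groups.setdefault(edge["sttype"], []).append(edge)
    return groups


# Invented sample identifiers, for illustration only.
sample_edges = [
    make_edge("Devices/N1", "HAS_INTERFACE", "IPv4/192.0.2.1"),
    make_edge("Devices/N1", "ADJ_NODE", "Devices/N2"),
    make_edge("Devices/N1", "PART_OF", "AS/64500"),
    make_edge("DNS/example_com", "HAS_ADDR", "IPv4/192.0.2.1"),
]

for sttype, members in sorted(group_by_type(sample_edges).items()):
    print(sttype, [m["semantics"] for m in members])
```

Keeping the specific relation name alongside the broader association type means a query can filter on either the precise meaning or the general kind of link.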

Modelling always starts from particular questions that we want to answer, rather than from an attempt to capture an absolute representation of reality as we believe it to be (our idea of reality is, after all, based on a subjective point of view).

Figure 7: The aggregation of nodes into cellular grains offers a view based on scaled spatial regions. These have socio-technical significance with respect to different processes and applications.

Quick wins

Figure 8: A log-log plot of Internet node Near adjacency degrees (log k, log N(k)), showing the approximate power law for preferential attachment, obtained by counting inferred adjacencies.
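A plot like this can be reproduced in outline by counting how many nodes have each degree k and fitting a straight line to (log k, log N(k)). The following Python is a minimal sketch under stated assumptions: the edge list is a tiny invented example rather than the CAIDA data, the function names are hypothetical, and an ordinary least-squares fit on log-log points is only a crude estimate of a power-law exponent.

```python
import math
from collections import Counter


def degree_counts(edges):
    """Return N(k): the number of nodes with degree k, for undirected pairs."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return Counter(degree.values())


def loglog_slope(n_of_k):
    """Least-squares slope of log N(k) against log k."""
    points = [(math.log(k), math.log(n)) for k, n in n_of_k.items()]
    if len(points) < 2:
        raise ValueError("need at least two distinct degrees to fit a line")
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx


# Invented toy graph: one hub with four leaves.
star = [("hub", "a"), ("hub", "b"), ("hub", "c"), ("hub", "d")]
counts = degree_counts(star)
print(dict(counts))
print(loglog_slope(counts))
```

For a power law N(k) ~ k^(-γ), the fitted slope approximates -γ. On real data the tail is noisy, so binning or a maximum-likelihood estimate would usually be preferred over this simple fit.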

Big Data present challenges for representation, presentation, querying, and the computational methods that we use to find meaning across a range of scales. Scaling is nearly always the most underestimated and least well understood part of computation. Graph methods and technologies have yet to embrace optimizations like Monte Carlo approaches, which were developed decades ago, but these will no doubt influence the next phase in the development of graph database platforms. I have already written on this subject privately, in consulting work.
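To give a flavour of what a Monte Carlo approach can mean for a graph, the sketch below estimates the mean node degree from a random sample of nodes instead of visiting every node. This is only one simple form of sampling, chosen for illustration; it is not a method described in this article, and the function names and toy graph are hypothetical.

```python
import random


def build_adjacency(edges):
    """Build an undirected adjacency map from (a, b) pairs."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def estimate_mean_degree(adjacency, samples, seed=0):
    """Estimate the mean degree from uniformly sampled nodes (with replacement)."""
    nodes = sorted(adjacency)
    if not nodes or samples <= 0:
        raise ValueError("need a non-empty graph and a positive sample count")
    rng = random.Random(seed)  # fixed seed so that runs are repeatable
    total = sum(len(adjacency[rng.choice(nodes)]) for _ in range(samples))
    return total / samples


# Invented toy graph: a ring of six nodes, so every node has degree 2.
ring = [(f"N{i}", f"N{(i + 1) % 6}") for i in range(6)]
print(estimate_mean_degree(build_adjacency(ring), samples=50))
```

The cost depends on the number of samples rather than on the size of the graph, which is the appeal at scale. The price is a statistical error that shrinks only slowly as the sample grows, and that can be large for heavy-tailed degree distributions.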

// Note: GEO_DISTANCE() returns metres and reads coordinate arrays as [longitude, latitude],
// so 40,000 km is written as 40000000.

// Show IP addresses for a Device N1
FOR ip IN Contains
  FILTER ip.semantics == "HAS_INTERFACE" && ip._from == "Devices/N1"
  RETURN ip._to

// Show Regions within 40,000 km of x,y
FOR reg IN Region
  FILTER GEO_DISTANCE([4.60971, -74.08175], reg.coordinates) < 40000000
  FOR dev IN Contains
    FILTER dev.semantics == "DEV_IN"
    FILTER dev._to == reg._id
    RETURN dev._to

// Show Devices within 40,000 km of x,y = 4.60971, -74.08175
FOR reg IN Region
  FILTER GEO_DISTANCE([4.60971, -74.08175], reg.coordinates) < 40000000
  FOR dev IN Contains
    FILTER dev.semantics == "DEV_IN"
    FILTER dev._to == reg._id
    RETURN dev._from

// Show IP addresses within 40,000 km of x,y
FOR reg IN Region
  FILTER GEO_DISTANCE([4.60971, -74.08175], reg.coordinates) < 40000000
  FOR dev IN Contains
    FILTER dev.semantics == "DEV_IN" && dev._to == reg._id
    FOR aif IN Contains
      FILTER aif.semantics == "HAS_INTERFACE" && aif._from == dev._from
      RETURN aif._to

// Show neighbours of N2
FOR a IN Near
  FILTER a.semantics == "ADJ_NODE" && a._from == "Devices/N2"
  RETURN a._to

// Shortest path from N2 to N3
FOR v, e IN ANY SHORTEST_PATH 'Devices/N2' TO 'Devices/N3' Near
  RETURN [v._key, e._key]

// Show geo-locations for N2
FOR p1 IN Contains
  FILTER p1._from == "Devices/N2"
  FOR p2 IN Contains
    FILTER p2._from == p1._to
    RETURN { source: p1._to, region: p2._to, country: p2._from }

// Show IP addresses for Device N40957
FOR p1 IN Contains
  FILTER p1._from == "Devices/N40957" && p1.semantics == "HAS_INTERFACE"
  RETURN { source: p1._to }

// Show DNS domains in N40957
FOR p1 IN Contains
  FILTER p1._from == "Devices/N40957" && p1.semantics == "HAS_INTERFACE"
  FOR p2 IN Expresses
    FILTER p2._to == p1._to
    RETURN { source: p1._to, region: p2._to, country: p2._from }

// Show Devices for domain DNS/example_com
FOR p1 IN Expresses
  FILTER p1._from == "DNS/example_com"
  FOR p2 IN Contains
    FILTER p1._to == p2._to
    RETURN p2._from

// Show me the hyperlinks/wormholes (unknown routing hubs/fabrics)
FOR p1 IN Unknown
  RETURN p1

// Show me the Devices addresses that are linked by Unknown tunnels...
FOR unkn IN Contains
  FILTER unkn.semantics == "ADJ_UNKNOWN"
  RETURN unkn._from
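For readers without a database to hand, the idea behind the ANY SHORTEST_PATH query can be sketched in plain Python as a breadth-first search that ignores edge direction. This is an illustration of the general technique, not a description of how ArangoDB implements it. The edge documents reuse the `_from` and `_to` fields from the queries above, while the function name and the sample node names are invented.

```python
from collections import defaultdict, deque


def shortest_path(edges, start, goal):
    """Breadth-first search over edge documents, treating edges as undirected.

    Returns the list of node ids from start to goal, or None if unreachable.
    """
    neighbours = defaultdict(set)
    for edge in edges:
        neighbours[edge["_from"]].add(edge["_to"])
        neighbours[edge["_to"]].add(edge["_from"])

    previous = {start: None}  # doubles as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for nxt in sorted(neighbours[node]):
            if nxt not in previous:
                previous[nxt] = node
                queue.append(nxt)
    return None


# Invented sample of Near edges, for illustration only.
near = [
    {"_from": "Devices/N2", "_to": "Devices/N5"},
    {"_from": "Devices/N3", "_to": "Devices/N5"},
    {"_from": "Devices/N7", "_to": "Devices/N8"},
]

print(shortest_path(near, "Devices/N2", "Devices/N3"))
print(shortest_path(near, "Devices/N2", "Devices/N8"))
```

Because the second edge points from N3 to N5, a strictly outbound search from N2 would not reach N3. Ignoring direction, as ANY does, is what lets the path pass through N5.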

Next steps, new horizons

This overlap of conceptual spaces, concepts, and contexts will only grow as mobile edge computing (think 5G and beyond) begins to develop. Technologists have so far only bragged about the speed of 5G, but its promise goes far beyond speed, to semantics and service too. Having semantic principles that sew together document information with process graphs offers a glimpse of the future of data analytics. Exploratory technologies like ArangoDB are pushing the envelope here: think supply chains, and economic and monetary networks with a million shadow economies via loyalty points and memberships. The security and effectiveness of the service economy hide in plain sight, for CAIDA’s foundational work to probe.
