Universal Data Analytics as Semantic Spacetime
Part 10: Big data, coarse graining, and “effective” approximations
In the previous post, I looked at how large scale properties of graphs could be calculated from smaller scales, and how the graph itself reveals the computation! Before we end this series of posts, we need to finish addressing the elephant in the room, which is scale, in all its interpretations. We already saw in part 8 what happens when we zoom into a graph of even moderate size, when it has dense interconnectivity: we lose our way, and find it hard to track causality. For some, the solution is to use one of several inscrutable Artificial Neural Network techniques being developed in the ML research community, which mistakenly (in my opinion) try to replace graphs with Euclidean embeddings. But there's a lot we can do without throwing away causal explanation. In this post, I want to show how the concept of semantic spacetime lets us stay in control with a more traditional and causal alternative: we can apply classic methods from statistical mechanics to integrate out degrees of freedom, leaving a controllable approximation governed by a scale parameter.
Big data — why do we want it?
The fact is that accumulating Big Data mainly became popular when disk prices fell below a certain threshold of affordability in the tech industry. Supply drove demand more than direction did. We can blame tech marketing for some gratuitous over-use of storage in those days, but, even if bulk data collection was advocated indiscriminately in the past, today there's a higher level of sophistication in how large caches of data are used. Perhaps the same can't be said of CPU time.
Let’s recall the two main reasons for using data:
- To seek answers about empirically gathered data, i.e. to determine (repeatable) causal relationships.
- To combine data through models in order to generate composite outcomes (e.g. for modern “AI” syntheses) like an expert system.
The former is a traditional view of scientific inquiry — high fidelity signal analysis, as "objective" as can be. In the latter, there's a kind of artistic license like making photo-fit mug shots from body parts — we don't have full causal integrity in the story between input and output, but we apply superpositions over possible "worlds" or "alternative states" to compose an outcome to be judged on the merits of later applicability. Not every data set (large or small) will contain all the information we need to make it meaningful by itself. Semantics are not always apparent from observable dynamics. Nonetheless, we might still be able to combine one source with another, to "imagine" possible ideas, as we do in "AI" methods — again, a bit like a photo-fit face pieced together to identify a criminal suspect.
Methodological Challenges
Scale is about more than just the amount of data, or the number of computers needed to crunch it, though that’s its conventional usage in Computer Science. In Natural Sciences, scale can also refer to standard intervals on a measuring stick or other meter. I’ve discussed this before in my books In Search of Certainty and Smart Spacetime. The two definitions are closely related, because we can’t solve the problem of crunching bulk data without understanding how the number of countable intervals between relative measures that data represent affects the questions we’re trying to answer.
Anyone who has had to deal with "large data" will know that comprehending more than a certain amount of it is impossible, even unhelpful, because we can only tell a decent story about a few variables or patterns at a time. Approximations are more useful than high dimensional precision for storytelling. So, we look for ways to project out certain "viewpoints" within a problem, perhaps summing up small contributions and replacing many things with one effective thing, or taking the different angles one at a time. As we do this, it's important to understand the integrity of the "chain of evidence", i.e. how the final answer emerged from the data, so that we understand its claim to being an honest story. If we lose that, then it's hard to trust the result.
Chain of evidence is something graphs do well, but not something Euclidean space representations (tuple spaces) do well — yet Euclidean methods are more familiar to us, and there's a tendency to want to switch to a Euclidean picture at the first available opportunity. This is one of the challenges facing modern Machine Learning methods.
The causal path of a calculation may or may not be apparent in the results we see. Answers may be independent of the particular way something was calculated. This happens for so-called memoryless processes. Ironically, this makes it easier to trace them, because they have to expose all their information externally. Physicists traditionally prefer memoryless processes, because with less contextual dependency they feel more "objective", but they are relatively rare in systems of greater complexity. All cognitive (learning) processes are based on cumulative path-dependent learning. The context dependence is an important discriminator or memory key.
With many calculation steps and many variables in play, one can easily lose track of whether our techniques are answering the question we intended. Physicists have studied this problem in detail, employing detailed analyses of approximation techniques known as the renormalization group. Techniques like neural networks (e.g. Deep Learning) have a "coolness" factor, but it can be hard to understand how they calculate. Why should we care? Because the two use-cases above require very different measures of integrity.
The tools I’ve chosen to adopt, in this series, are just those of plain old procedural computer programming and data representation in ArangoDB’s multi-model database. They don’t imply a particular approach, so we need to find one — but that’s not too hard if we trust the structure and meaning inherent in graphs. Sticking with these tools may not always be the most efficient way to approach large data problems today, but I believe we should begin with understanding cart horses rather than jumping at passing zebras.
Let’s look at how the classical methods of approximating high dimensional problems with statistical mechanics can be used together with semantic spacetime to simplify computations, without losing sight of their intent. For any physicists out there, this is very like what’s known in the trade as Effective Field Theory.
Controlling scale with tunable parameters
To try to make a story concrete, let’s examine a public data set, derived from a “real world problem” — one that has enough size and detail to think about the issues. The Amazon Product relational map is available from The Open Graph Benchmark project at Stanford University. The data set isn’t fully raw data, with full insight into the sales processes; it’s something of a hybrid, used for validating machine learning algorithms. But that, in itself, is useful to the extent that it forces us to think about the problem and to find compromises.
The Amazon data set consists of files, line by line, representing products purchased by online users, with only indirect supplemental information about the types of product — nothing explicit. We find the following (sketched as Go data types after the list):
- For each node, a coarse category label to one of 50 or so product categories, e.g. 1) Books, 2) Beauty Products, etc. In the four SST relations, this is a CONTAINS type.
- For each node, a feature vector which is the output of a machine-learned pre-classification of product descriptions into a 100 dimensional Euclideanized “word space”, where similar points are about similar topics. In the four SST relations, this is a NEAR=”IS LIKE” kind, with implicit links of a graph. However, the data are not presented as a graph, but rather as a set of tuples or points in a 100 dimensional vector space.
- For each pair of products co-purchased (on any occasion), an idempotent edge or link is present, forming a graph of all such time-integrated events. The links are unweighted edges, so the relative probabilities or frequencies are not recorded over the integration. In the four semantics, this is a NEAR=”COACTIVATED” kind of relationship. This is like a bioinformatic network of co-present proteins in some process, hence making the connection to bioinformatics noted in the previous post.
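Concretely, the ingredients above might be captured by Go data types along these lines. This is only a sketch for orientation: the names are illustrative, not taken from the dataset files or from the repository code.

package amazon

// Product is one node in the co-purchase graph: one line of the node files.
type Product struct {
    ID       int          // line number, used as the node key
    Category int          // index into the ~50 coarse category labels (CONTAINS)
    Features [100]float64 // NLP feature vector; NEAR = "IS LIKE" by similarity
}

// CoPurchase is one unweighted, time-integrated edge between two products
// (NEAR = "COACTIVATED" in the four SST relation types).
type CoPurchase struct {
    From, To int // product IDs; no weight and no timestamp are recorded
}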
We might try to tell a story about the features. The feature vector data represent a second data set: a kind of "hidden" graph of "similarity links" between similar feature vectors, alongside the dense web of co-purchases. But it's not explicit. We don't need to expose this secondary graph explicitly, although had the data been cached and represented as a graph in the first place, the problem would have been simpler.
Suppose we want to make recommendations based on a product: we either want alternative product recommendations, i.e. similar products from the same category (books by Tolkien or Lewis), or products that complement one another, from any category (chocolates with roses). In the absence of explicit information about dependencies (batteries not included), we have to infer dependencies or complements by looking at what people purchased together. This is a long shot, but worth a try, and of course it's the use-case for this test.
The graph edges have limitations. They are an aggregate, flattened projection of the co-purchased products averaged over the timeline of the dataset. Time (a sequence, or some kind of FOLLOWS relation) is therefore absent from the data. There is an imposed assumption buried in this, which may lead to a false and fictitious result: co-purchasing is not a transitive relation. If A and B were purchased together on one occasion, and B and C were purchased together on another, it does not imply that A and C were ever purchased together. Moreover, this is not evidence of future purchasing needs: if I buy a toy, I need to buy a battery. If I buy a battery, I don't need to buy a toy to go with it. If someone bought the toy and a torch with batteries on different occasions, it doesn't mean it's of any interest to buy the torch if you buy the toy — or vice versa. Many users of graphs don't make use of this directedness, because it makes some computations harder! They should!
Suppose now we want to answer the question of what products to recommend given a first choice; then analysis of the whole graph is of no interest, because that's an entirely local process. Having a general idea of trends isn't too revealing either: we could probably guess the answer. Similarly, if we're interested in which products, by name, are considered similar, the whole graph is a distraction and there is no need to render it as a plot.
With these aggregations, all we can say is that certain product categories are likely to be bought together with others. That's the information which is inherent in the graph edges. What the Amazon strategy does to try to work around this is to use the description data to coarse grain and reduce the product basis at the outset. So there are only really two questions we can answer: possible likelihoods for co-purchasing different kinds of product (without necessarily understanding what they are in detail), and a meta-question about the efficiency of representing data as a graph or as Euclidean vectors.
Implementing analysis with Arangolangs
The repository on Github for this post contains the following programs, which you should execute in the following order.
- amazon-load-data.go — load the graph into the ArangoDB SST model.
- amazon-category-level-graph.go — count link nodes between type hubs.
- amazon-coarse-product-nodes.go — count link nodes using knowledge of products inside which hubs.
- amazon-coarse-product-nodes-v2.go — count link nodes based on correlating products by feature vector, then compare to type hubs.
The data can be loaded with the program amazon-load-data.go. The data files are line by line, so we use the line numbers as node labels.
% go run amazon-load-data.go
Reading nodes
5000 …
10000 …
15000 …
20000 …
25000 …
30000 …
35000 …
40000 …
45000 …
We use the different collections in the Semantic Spacetime toolkit to read the products into Nodes, then we use Hubs for the 50 or so product categories. We can immediately connect product nodes to their intended categories (see figure 2, left hand side).
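In outline, loading amounts to reading the files line by line, creating a product node per line, and linking it to its category hub. The following is a minimal sketch of that step only; AddNode and AddLink are hypothetical stand-ins for the SST toolkit's upsert calls, and the file name is illustrative rather than the dataset's actual file layout.

package main

import (
    "bufio"
    "fmt"
    "os"
    "strings"
)

// Hypothetical stand-ins for the SST toolkit's node/link upsert operations.
func AddNode(key string)            {}
func AddLink(kind, from, to string) {}

func main() {
    // One category label per line; the line number doubles as the product key.
    f, err := os.Open("node-label.csv") // illustrative path
    if err != nil {
        panic(err)
    }
    defer f.Close()

    line := 0
    scanner := bufio.NewScanner(f)
    for scanner.Scan() {
        category := strings.TrimSpace(scanner.Text())
        node := fmt.Sprintf("Nodes/n_%d", line)
        hub := fmt.Sprintf("Hubs/h_%s", category)
        AddNode(node)
        AddNode(hub)
        AddLink("CONTAINS", hub, node) // the hub CONTAINS the product node
        line++
        if line%5000 == 0 {
            fmt.Println(line, "…")
        }
    }
}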
The SST model uses:
- Nodes as purchases, i.e. products.
- Hubs as product categories (CONTAINment of a Node in a category)
- Fragments as coarse grained product category inferences (summarizing feature vector NEARness)
As graphs, different purchase events could also simply have been hubs uniting several products — but we don't have that information, just like the event trails in the passport example or the traceroute map.
What could we do with the data in this SST graph? We might:
- Use the COACTIVATION graph to read off and suggest the linkage of products used together (referring later to the independent feature graph), but without ranking.
- Use the CONTAINS graph to find intentional product classifications, added by hand.
- Use the feature vector with a Euclidean distance function to renormalize the embedding of the graph in its imaginary Euclidean space of intentional product descriptions, in order to find other products that are similar to a given product. We can then check whether the product categories inferred by distance correlate with the product descriptions (this is more than 50%).
- Combining all these, we can see trends and patterns of co-purchasing. However, because the edge weights are missing, we have to rely on summation over different products, rather than different purchasing events, together with a renormalization for extracting significance.
The format of the data presents challenges. There are two data sets, both tied to nodes. In one set, we use the nodes to form a point to point graph. In the other, we use the nodes as random samples in a Euclidean continuum. This is a common approach in a staged machine learning pipeline; however, it leads to subtle trouble for the analysis.
A scale transformation — exposing a story
Let’s consider the product co-purchase graph first. The density of connections in the co-purchase graph is such that it effectively wipes out the sense of scale from a high level view (see figures 3, 6, 7, and 8).
The trick to finding quick answers, indeed to solving any equational system, is to find the right parameterization. How we establish linkage from one layer (nodes) to another (hubs) is how we expose features on one scale from features at another. Figure 3 shows how the full picture looks as the data grow: an inscrutable black hole. It’s not useful for making inferences or for telling a story. When graphical interactions are only sparse in number, we can get away with thinking about graphs as simple networks — but when it all blurs into a kind of background noise (as in figure 6 of part 8, or figure 7) then we need other strategies to deal with this kind of detail. Those principles are well known from physics. They go by a variety of names: continuum approximations, field theories, effective field theories, with techniques of renormalization, regularization, and more.
Spacetime thinking is important here, not only in sourcing and interpretation, but in the processes of computation too — precisely because spacetime trajectories are stories — it’s about the structure of storytelling. Every process is spacetime in one form of story or another, and we have to learn to recognize and use that advantageously. But we shouldn’t assume that spacetime means a Euclidean or even Riemannian continuum! Graphs are the minimal spanning basis of process. Embedding processes in Euclidean space is intuitive but introduces a raft of new problems to solve.
One reason to reduce the dimension of a complex graph of data is to be able to comprehend something about it in order to simplify the story on a higher level. Visualization is a hopeless task for large amounts of data (figure 3). Scatter plots and heat maps assume that we can visualize a process as a continuum. It’s trying to plug data directly into our brain’s ability to render distances in space.
A different algorithm can separate nodes into local clusters, retaining some local structure, but it throws away certain kinds of pathway instead in order to fit on the page (figure 4). The blue nodes are aggregated products. The pink nodes are product categories (intentionally encoded). The mesh of overlaps is extensive and random. The visual doesn’t represent link weights, so we can’t see repeated entries. This is all very unsatisfactory. Finding an answer we like versus a realistic answer to a question of practical utility are very different issues.
Limiting the amount of data helps us to tell a story about a particular scale. We can do that in different ways: i) we could focus on a small microscopic region and throw away everything outside a certain radius (see figure 5). Now the blue nodes are gone entirely! If we don't know how to measure the scales which bring out the key characteristics, then no ad hoc procedure will necessarily give a representative impression.
The alternative, ii) is to try to tell a story on a larger scale by coarse-graining the data into groups with similar semantics — these then act as supernodes (coarse grains). We do this using CONTAINS links between different layers in the graph. The trick then is to make such supernodes tell the same story that we wanted to extract on a local level. That means we need to pay attention to semantics of the aggregation, and make sure we only aggregate things we wish to project out as being similar things.
For example, if you are analyzing a cake for its ingredients, you could coarse grain the cake by well-mixed local regions, and this could tell you that the large scale structure of the cake is a uniform bubbly mass of cake particles. Or, you could aggregate the grains by semantic type and discover that there are grains of flour, sugar, butter, etc. The separation of concerns can be managed in different ways.
Figures 5 and 6 show how just reading more data (worrying about "bigness") doesn't answer the right questions unless we know how to coarse grain in a controlled fashion. Semantics have to be associated with quantitative features — but if none are distinguishable, we're in trouble. What the figures do show is characteristic of local regions that are loosely connected to other regions — a kind of Small Worlds structure. That's certainly a valid insight, but only on a local level.
To summarize this lengthy point, throwing data at a problem, without a causal model or an understanding of the relative scales for features, is a game of hit and miss. We might get lucky, but it’s gambling. What we need is a more controlled, deterministic approach to data modelling. This is where semantic spacetime helps us to apply tried and tested methods of statistical mechanics to the problems. This, in fact, is also how bioinformatics has developed over the past decades, and we can learn a thing or two from the field of complex biological information systems.
Integrating out superfluous degrees of freedom — effective models
I’ve mentioned renormalization and effective representations as a method for coarse graining. Let’s focus on that now, and ask what this looks like for data in the form of a graph. If we emphasize the link structure of a graph, even with a hopeless rendering, it’s easy to see that the density of criss-cross lines almost blurs into a continuum of background noise (see figure 7). Where is the signal? The green doesn’t play much of a role, except as a density.
We can use this smudginess to our advantage by simply “integrating it out”, and folding the summation results out of sight — like applying a visual filter. Rather than throwing away data, we sum and average their impact. If we can preserve the causal integrity of the processes represented by the green in figure 7, using successive levels of effective supernode approximations, then we’ll be able to emphasize the key remaining features and make uniform noise disappear. Figure 9 shows how the rescaling works: we rethink the graph as “dressed hubs” or “kernels”.
A total graph has the interesting property of being a generating function for all possible interaction sub-graphs. We want to integrate out features that we don’t need to distinguish, and keep the resulting effective nodes. This is essentially what we do in a path integral (effective action) in quantum theory or statistical mechanics. This isn’t the only thing we do in semantic data transformations however. Dimensional reduction is an important tool — another is transforming variables to capture the right semantics.
We know the Amazon data contain two completely independent graphs: one explicit one made of co-purchases, and an implicit Euclideanized one, made of product similarity vectors. If we are looking to discern which products belong together in a category, there seems to be no a priori causal reason for co-purchases to lead to products in the same category. I can buy socks with chocolate if I want to. Moreover, since the links have no weights, I only have to do this once to make those categories indelibly linked, with equal importance to millions of people who buy chocolate with flowers (unified by the customary process of buying celebratory gifts). On the other hand, two products are much more likely to be similar if their descriptions are similar, because there is (hopefully) a causal connection between the nature of the things and their product descriptions. So, if we are going to integrate over some variables to reduce the degrees of freedom, it makes sense to coarse grain the Euclideanized description graph — which already has redundant dimensionality.
Regularization and renormalization
With that in mind, let’s return to the study of the product nodes. We were handed the final answer in the form of product category assignments. But perhaps that was too large a leap — too coarse an approximation for such a diverse ecosystem as the Amazon sales platform. What about the clusters of similar products from different manufacturers? Surely that is also a scale on which we would expect to find structure to tell a more interesting story? We can explore this, without computations blowing up in time, by regularizing over small subsets of the data while deciding how to renormalize the data in a way that exposes the story we’re trying to tell, then once optimized we can relax to the full data set if necessary.
This is where the feature vector, based on text product descriptions, comes into play. The feature vector is a set of 100 numbers which are the cached result of prior processing by a Natural Language Processing run over the text of product descriptions. It comes from Amazon’s BlazingText service. In short, it looks at texts and tries to place similar texts close together in a 100 dimensional vector space, so that we can tell whether two products are similar (in terms of their descriptions) by using a simple Euclidean distance (it does more than this, but we can’t use the other properties here). This is already an extensive machine learned algorithmic data reduction. The 100 dimensional vector is listed for most nodes (a few are just zeros) as a key-value association. So, let’s use that to explore the implicit network of associations by description.
When we use the distance between nodes to infer the existence of a shadow NEARness graph (figure 9), not from the explicit graph that was given as an adjacency map and loaded into the graph database, but from the vectors in its documents, there is an implicit weight on the links formed by this distance function. If the Pythagorean distance between two products is small, then their product descriptions are considered to be similar, and they can be considered semantic alternatives. This is nice, but it introduces a new problem: we can calculate the distance between ANY two nodes, so it’s an N-squared complete graph over a potentially infinite number of graduations — much too large to handle! By converting into a Euclidean representation we’ve made the problem even more computationally intensive than if we had represented the relationships directly as a NEARness graph. What we would like to know is which associations dominate, if any.
As a continuum, this problem is ripe for threshold renormalization. We can keep just pairs of nodes whose inter-node distance is less than a cut-off, and ignore others. This is a classic cutoff regularization and renormalization scale, in a statistical mechanics sense. We need it for the same reason we need it in field theory: because the physical picture has been over-parameterized over a continuum for algebraic convenience. If a recommendation points to a coarse region, we still want to know what’s inside it. Since that information is not in the data, we can’t use the data for recommendations without the original sources.
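A minimal sketch of that cutoff step, assuming the feature vectors have already been read into memory; the function and variable names are illustrative, not the repository's.

package near

import "math"

// distance is the plain Euclidean (Pythagorean) metric on the 100-dimensional
// feature vectors attached to each product node.
func distance(a, b []float64) float64 {
    sum := 0.0
    for i := range a {
        d := a[i] - b[i]
        sum += d * d
    }
    return math.Sqrt(sum)
}

// NearPairs keeps only those pairs of nodes whose separation lies inside the
// cutoff radius (the regularization scale); everything beyond it is dropped.
func NearPairs(features map[string][]float64, cutoff float64) [][2]string {
    keys := make([]string, 0, len(features))
    for k := range features {
        keys = append(keys, k)
    }
    var pairs [][2]string
    for i := 0; i < len(keys); i++ {
        for j := i + 1; j < len(keys); j++ {
            if distance(features[keys[i]], features[keys[j]]) < cutoff {
                pairs = append(pairs, [2]string{keys[i], keys[j]})
            }
        }
    }
    return pairs
}

The double loop makes the N-squared cost explicit: the cutoff tames the number of surviving links, but not the cost of the search itself.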
In order to coarse grain the data already loaded from the Amazon data sets, we can run the following program, which looks for links between product nodes and transfers links between nodes belonging to different hubs to the hub graph itself, increasing the weight count of the hub links according to how many node links effectively run between one hub cluster and another. The algorithm is implemented in amazon-category-level-graph.go.
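In outline, the lifting step might look like the following minimal sketch (illustrative names, not the repository code): for every co-purchase edge, look up the category hub of each endpoint and, when the hubs differ, add one unit to the weight of the corresponding hub-to-hub link.

package lift

// Edge is one co-purchase link between two product node keys.
type Edge struct{ From, To string }

// LiftToHubs integrates out the product nodes: every node-to-node link that
// crosses between two different category hubs adds one unit of weight to the
// effective hub-to-hub link. hubOf maps each product key to its hub key,
// as recorded by the CONTAINS links loaded earlier.
func LiftToHubs(edges []Edge, hubOf map[string]string) map[[2]string]int {
    weights := make(map[[2]string]int)
    for _, e := range edges {
        h1, ok1 := hubOf[e.From]
        h2, ok2 := hubOf[e.To]
        if !ok1 || !ok2 || h1 == h2 {
            continue // links inside a single hub are summed away
        }
        weights[[2]string{h1, h2}]++
    }
    return weights
}

// TopLinks applies the arbitrary cutoff discussed below: only hub links whose
// accumulated weight reaches the threshold survive the renormalization.
func TopLinks(weights map[[2]string]int, cutoff int) map[[2]string]int {
    kept := make(map[[2]string]int)
    for pair, w := range weights {
        if w >= cutoff {
            kept[pair] = w
        }
    }
    return kept
}

TopLinks is the cutoff step discussed just below; the hope is that the surviving picture is roughly independent of where the threshold is set.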
% go run amazon-category-level-graph.go
Cross-link the product category hubs based on member linkage…
first attempt
Cluster [Hubs/h_1 Hubs/h_2 Hubs/h_17 Hubs/h_3 Hubs/h_0 Hubs/h_4 Hubs/h_5 Hubs/h_6 Hubs/h_23 Hubs/h_8 Hubs/h_13 Hubs/h_10 Hubs/h_21 Hubs/h_11 Hubs/h_19 Hubs/h_9 Hubs/h_15 Hubs/h_12 Hubs/h_7 Hubs/h_16 Hubs/h_14 Hubs/h_18 Hubs/h_22 Hubs/h_20 Hubs/h_28 Hubs/h_24 Hubs/h_26 Hubs/h_31]
0 : Hubs/h_1 = Health & Personal Care
1 : Hubs/h_2 = Beauty
2 : Hubs/h_17 = Pet Supplies
3 : Hubs/h_3 = Sports & Outdoors
4 : Hubs/h_0 = Home & Kitchen
5 : Hubs/h_4 = Books
It would be nice if this magically resulted in an answer about how likely it is for consumers to purchase products from different categories together. Alas, what we see is the limitation of the data in the Amazon data sets: namely that the links between products have no weights. So it’s just as likely that one would buy chocolate and socks in the same purchase as buying chocolates and roses, as long as just one customer did so. The effect is that every hub is joined to every other in a giant cluster.
All we can do to improve the picture is to count the number of direct connections and sum them (integrate out the Node links and use them to calculate the effective weights of the Hub links). Then we can introduce an arbitrary cut-off in the number of links so that only the highest weighted connections are shown, hoping that the outcome is approximately independent of the cut-off. This is a subtle issue of ergodicity or entropy in the data, and we don’t have space in the margin to discuss that here. The approximate result is shown in figure 10 and the tabulated frequencies below.
Hubs/h_2   Hubs/h_1   39   Beauty — Health
Hubs/h_6   Hubs/h_4   36   Toys/Games — Books
Hubs/h_10  Hubs/h_6   36   Art/Craft — Toys/Games
Hubs/h_1   Hubs/h_2   32   Health — Beauty
Hubs/h_1   Hubs/h_9   31   Health — Gourmet/Food
Hubs/h_3   Hubs/h_1   25   Sports — Health
Hubs/h_19  Hubs/h_1   25   Industrial/scientific — Health
Hubs/h_1   Hubs/h_0   21   Health — Home/Kitchen
This method is the graphical version of what’s known as the Background Field Method in field theory. We separate variables into slow moving parts (hubs) and fast moving fluctuations (product nodes), and integrate out the fast variables to discern the effective behaviour of the hubs. It’s a simple idea, but not without pitfalls. A possible error in integrating contributions when coarse-graining lies in over-counting edges. Over-counting can happen in lots of different ways when we parameterize a process with redundant variables. Using the idempotent methods, explained earlier in the series, is a simple way of avoiding this problem, without needing to compensate with “ghost” contributions. A few links in the data are indeed exposed as duplicates, and we see these pop up as notifications of updates.
Note: from a machine learning perspective, using the given product category in this way might be seen as “cheating”, since the purpose of ML is often to try to guess this categorization based on other relationships (see the careful article by Sachin Sharma) but that’s an artificial problem if you have the data we have. We’ll see how that issue works below for the product node coarse graining (into the Fragments collection). However — no sense in inventing fictitious problems. If the answer is in the data, then we should use it. This is the shortest path to outcome. And there isn’t much more to discern from this data set, in spite of its bloated size.
With the data available we can take two approaches to coarse graining:
- i) we use the intentional information about product categories to coarse grain “equivalent” products within those clusters and look for cross connections to understand co-purchasing of unrelated products, or
- ii) we don’t use the category pre-knowledge, and can test how well the feature space clustering mimics the product categories, without bias. In either case, we expect the size of the grains to play some kind of role: just how similar should two feature vectors be before we treat their products as equivalent in the graph? (A sketch of this ball-radius grouping follows the list.)
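A minimal sketch of that grouping step, using a greedy "ball" assignment in feature space with the radius as the tunable scale; the names are illustrative, not the repository's.

package coarse

import "math"

// distance is the Euclidean metric on the product feature vectors.
func distance(a, b []float64) float64 {
    sum := 0.0
    for i := range a {
        d := a[i] - b[i]
        sum += d * d
    }
    return math.Sqrt(sum)
}

// GroupByRadius greedily assigns each product to the first existing grain
// whose seed vector lies within the chosen radius; otherwise the product
// seeds a new grain (supernode). A larger radius means fewer, coarser grains.
func GroupByRadius(products map[string][]float64, radius float64) map[string][]string {
    type grain struct {
        key  string
        seed []float64
    }
    var grains []grain
    members := make(map[string][]string) // grain seed key -> member product keys

    for key, vec := range products {
        assigned := false
        for _, g := range grains {
            if distance(vec, g.seed) < radius {
                members[g.key] = append(members[g.key], key)
                assigned = true
                break
            }
        }
        if !assigned {
            grains = append(grains, grain{key: key, seed: vec})
            members[key] = []string{key}
        }
    }
    return members
}

Note that this greedy pass depends on iteration order (Go randomizes map traversal), so it is only a rough stand-in for whatever clustering the repository code performs; the point is the role of the radius as the scale parameter.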
In the program amazon-coarse-product-nodes.go we do i), product category by intentional product category, and use the feature vectors to identify products that are similar to one another within the categories, in order to join them into supernodes in the Fragments collection. This allows us to work with a smaller, i.e. dimensionally reduced, set of nodes for visualization. Ideally, we’d like to do this kind of computation inside the database, but for conceptual clarity I pull out the data into a client program. The queries make use of the separated collections to make the search quite efficient, without complex filters or joins. Running the code amazon-coarse-product-nodes.go, with some ad hoc scale choices:
% go run amazon-coarse-product-nodes.go
Within each category hub, coarse grain similar nodes to reduce dimension
Trying to aggregation category: Home & Kitchen
Reduced 2856 to 98 supernodes
Trying to aggregation category: Health & Personal Care
Reduced 3085 to 84 supernodes
Trying to aggregation category: Beauty
Reduced 2735 to 35 supernodes
Trying to aggregation category: Sports & Outdoors
Reduced 3336 to 241 supernodes
Trying to aggregation category: Books
Reduced 15433 to 202 supernodes
Trying to aggregation category: Patio
Reduced 1007 to 23 supernodes
Trying to aggregation category: Toys & Games
Reduced 3878 to 83 supernodes
Trying to aggregation category: CDs & Vinyl
Reduced 4274 to 14 supernodes
Trying to aggregation category: Cell Phones & Accessories
Reduced 2828 to 204 supernodes
. . . .
Taking just the first 50,000 lines of the total file and a threshold of 20, for demonstration purposes, this took several hours to run on my modest laptop. Increasing lines or threshold will increase the runtime and the level of aggregation. The Books category is the largest and dominates the computation. A factor of N reduction in nodes may lead to N-squared better performance. However, bear in mind that the example code uses only simple-minded linear algorithms that can’t be easily parallelized without rethinking the graph querying. Using this kind of database-client interaction isn’t the best way to approach this, but it illustrates the method. The goal here is to explain the spacetime aspects in a pedagogical way, so I won’t go into how we could optimize. However, there is plenty one could do to parallelize the method to increase performance.
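The database interaction is the hard part to parallelize, but the in-memory pairwise comparisons themselves are embarrassingly parallel. A worker pool of goroutines could stride over the outer loop, as in this sketch (not what the example code does):

package parallel

import (
    "runtime"
    "sync"
)

// AllPairs strides the outer loop of an all-pairs comparison across one
// goroutine per CPU. The compare callback (e.g. a feature-space distance test
// against the cutoff radius) must be safe to call concurrently. The simple
// stride is not load-balanced, but it needs no coordination between workers.
func AllPairs(n int, compare func(i, j int)) {
    workers := runtime.NumCPU()
    var wg sync.WaitGroup
    for w := 0; w < workers; w++ {
        wg.Add(1)
        go func(offset int) {
            defer wg.Done()
            for i := offset; i < n; i += workers {
                for j := i + 1; j < n; j++ {
                    compare(i, j)
                }
            }
        }(w)
    }
    wg.Wait()
}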
Note how the use of different database collections allows us, once again, to use a database efficiently as a tool for computation, without affecting the source data. Since most of the computed data values are static (invariant), the values can be cached and reused without the need to repeatedly pass them through layers of computation — in essentially the same way as the feature vectors are cached and presented “as is”, rather than recomputing from scratch in the raw data. Clearly there are ways of separating spacetime processes from one another and using the spacetime itself as the means of computation.
Total product links between different hubs 7088
Total product links between coarse different product fragments 5171
The coarse graining i) reduces the regularized sample of 50,000 nodes into 137 product regions, which cluster around the 50 or so hubs. Of the 50,000 nodes, 7088 co-purchases were between different product categories, and (after coarse graining) 5171 were between different product clusters. There’s a reduction in complexity: of the hundreds of thousands of co-purchases in the sample, only around 5000 were between different product categories, and most of those pairings amounted to just a handful of instances, very small in number. Interestingly, they do reveal some co-purchases that didn’t cross the threshold of importance at the hub level: books with arts and crafts, groceries with home and kitchen, groceries with beauty products, etc., at the level of tens of instances. If any of these ad hoc events come as a surprise, we might wonder about the job suitability of the analyst. But, for 5 hours of computation and many Watts of electricity, we might hope for deeper insight. Clearly, with any analysis, it’s about understanding the data in the first place. There is no magic.
It seems that co-purchase clusters are largely focused around product categories. Whether this was engineered so that the data could test some ML algorithm, or whether it’s an actual phenomenon, we’ll never know. But, the scale transformation is consistent through two levels of renormalization.
To check this, we can calculate whether the cluster fragments contain any overlap between different hubs. If that’s significant, it indicates an imperfect relationship between intentional classification and feature vector analysis. In v2 of the coarse graining code, we remove the dependence on hub category to search across hubs too. The results are mixed, and dependent on the ball radius we choose in feature space. If the ball is small, we don’t get an effective dimensional reduction, but we may link together categories that are only weakly connected. The larger the radius, the fewer grains, but the longer and more expensive the computation. As a sample illustration:
Radius 20: Reduced 50001 to 3840 supernodes
Average hub composition of unbiased fragments = 0.4875
real 75m56.805s

Radius 50: Reduced 50001 to 1269 supernodes
Average hub composition of unbiased fragments = 0.452
real 456m54.528s
The results show that about half of product co-purchases are between different product categories, independent of the cutoff radius. So looking for strong linkage between categories isn’t hopeless, and the interior and exterior links between categories are basically separable, and could be found much more cheaply by the method in figure 11. The arduous computation in method ii) only offers evidence that the cheaper approach is good enough. Had we not had method i) available, the computation would have benefited significantly from parallelization. This is why both GPU methods and quantum computing are potentially of interest.
Discussion — Science or Art?
I’ve spent some time on this particular model because it illustrates something important about spacetime representations. We have two kinds of space here: a discrete graph and a Euclideanized summary of data, from which we try to infer different graphs of relationships. Why do I say that we infer a graph from the Euclidean data? Because we never use the concept of Euclidean distance in practice: we only use it as a clumsy data compression to compare places by sampling point to point, then projecting out a graph. The similarities of product descriptions have been summarized by a 100 dimensional vector space, artificially introducing “in-between” states that don’t actually exist in the data, when a NEARness graph would have sufficed. We pay a heavy price for factoring this out again. For data science, this is misleading. For creative composition it’s more desirable, but the cost of searching is then high.
The representation of the feature vectors (as a vector space) alongside a graph of co-purchases is odd. Why don’t we represent both as graphs, or both as Euclideanized vectors? There is no obvious reason for this, and the choice leads to an inefficient calculation here. The two data sources, co-purchase and product description, are based on causally independent processes (consuming and manufacturing) and have very different metrics for NEARness, so we would not expect to see any objective correlation between them. Moreover, these two representations require very different computational methods and reasoning, which are not very compatible and lead to a lot of work for a straightforward answer. Had someone paid attention to the spacetime structure of the underlying processes in the first place, time could have been saved. But, of course, it all depends on what we’re trying to do. Art or science?
The two representations of data have very different metrics for NEARness. In a vector space, we have to look at an open ball surrounding each point, with a cutoff which is arbitrary since a continuum has no natural scale. To find a scale, inherited from the source data, we have to search (at great computational expense — see below) and the results will always be somewhat ambiguous. In a graph, we’ve already decided the point to point causality of interactions (based on some earlier process). It involves a similar cutoff threshold, but at a much earlier stage. If we’re going to cache information about that decision, why not do it in a way that avoids repetitive calculation later? In the case of this data set, it’s unclear what the motivation was — but the point is clear to see.
The odd thing about the feature vectors is that they are overlaid on top of nodes in a graph that has nothing to do with the vector space the features belong to, betraying the origins and perhaps the misguidedness of the representation. We end up trying to summarize a graph by looking at the Euclidean distance between theoretical nodes. The search leads to a mixing of two different definitions of “distance” or discriminant criterion. The distance we get by summing up unweighted boolean links in a graph between product co-purchases is not a proper Euclidean distance, but something more like a DNA frequency analysis of fragments (a histogram overlaid onto a graph). It’s like looking for familial connections. I used this approach in Natural Language Processing, comparing it to bioinformatic analysis earlier. The distance we get from reducing product information into a 100 dimensional vector space is an infinite over-parameterization with redundancy that has to be removed again later — just like in Quantum Field Theory.
I hope all this helps to illustrate (at least to some of you) just how important the representation of spacetime is to processes and to data representation. For all the flashiness of an Artificial Neural Network in the Cloud, you could probably get away with using ArangoDB and Go on your laptop.
SIDEBAR: Earlier, I suggested that no one wants to use database query languages. Indeed, they can be painful. However, these more complicated operations where we want to transform one data representation into another is a place where a good query language becomes a friend rather than an enemy. This is an intensive search, but luckily it’s optimized inside the database.
FOR coactive IN Near
    FILTER coactive.semantics == "COACTIV"
    FOR frag1 IN Contains
        FILTER frag1._to == coactive._from && frag1._from LIKE "Fragments/%"
        FOR frag2 IN Contains
            FILTER frag2._to == coactive._to && frag2._from LIKE "Fragments/%"
            RETURN { _from: frag1._from, _to: frag2._from }
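For completeness, the same query could be issued from a Go client with the ArangoDB Go driver, cursoring the projected pairs back into the program. This is a sketch, assuming an already opened driver.Database handle rather than showing the connection setup.

package sidebar

import (
    "context"

    driver "github.com/arangodb/go-driver"
)

// FragmentPair is one effective link projected out by the query above.
type FragmentPair struct {
    From string `json:"_from"`
    To   string `json:"_to"`
}

// CrossFragmentLinks runs the sidebar query against an open database handle
// and streams the fragment-to-fragment pairs back to the client.
func CrossFragmentLinks(ctx context.Context, db driver.Database) ([]FragmentPair, error) {
    aql := `
      FOR coactive IN Near
        FILTER coactive.semantics == "COACTIV"
        FOR frag1 IN Contains
          FILTER frag1._to == coactive._from && frag1._from LIKE "Fragments/%"
          FOR frag2 IN Contains
            FILTER frag2._to == coactive._to && frag2._from LIKE "Fragments/%"
            RETURN { _from: frag1._from, _to: frag2._from }`

    cursor, err := db.Query(ctx, aql, nil)
    if err != nil {
        return nil, err
    }
    defer cursor.Close()

    var pairs []FragmentPair
    for {
        var p FragmentPair
        _, err := cursor.ReadDocument(ctx, &p)
        if driver.IsNoMoreDocuments(err) {
            break
        } else if err != nil {
            return nil, err
        }
        pairs = append(pairs, p)
    }
    return pairs, nil
}

The join itself runs inside the database, so the client only pays for the cursor traffic, which is exactly the point of the sidebar.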
Respecting the graph directly
The notion of semantic similarity and its use in coarse graining approximation is tricky. For the purpose of answering some (but not all) questions, it should suffice to lump together some of the nodes to yield supernodes. A standard approach to this, as a graph algorithm, goes as follows:
- Nodes that are associated by clustering can be marked for convenience by a single “hub” node (in a different collection, see the traceroute example in part 7, for instance). We link all the members to the hub with a “containment” association (from the original data collection to a new approximate data collection). Then we have an automatic coarse grain reference. For example, Nodes → Hubs, in the Semantic Spacetime code.
- We can add aggregate information about each grain to the hubs, e.g. counts of members, averages, etc. This facilitates future computation (see the sketch after this list).
- Links of other kinds in the original data, which cross from one supernode to another, can be transferred to link the supernodes instead in the effective layer. This will lead to repetition, which can be summed and represented as a weight.
- When computing quantities, using the effective hub layer, e.g. to answer the question “what sorts of products did people buy together?”, we can now answer more quickly (but less specifically) using the new aggregate approximation, using the effective node weights and link weights to calculate.
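As a sketch of the aggregation step above, each supernode can cache its aggregate information (member count, a feature centroid, effective link weights) so that later questions are answered from the effective layer alone. The field names here are illustrative, not the repository's schema.

package effective

// Supernode caches the aggregates for one coarse grain, so that later
// questions can be answered from the effective layer without revisiting
// the member nodes.
type Supernode struct {
    Key         string
    Members     int            // number of original nodes contained
    MeanFeature []float64      // centroid of the members' feature vectors
    LinkWeight  map[string]int // effective link weights to other supernodes
}

// Aggregate computes the member count and feature centroid for one grain.
func Aggregate(key string, features [][]float64) Supernode {
    s := Supernode{Key: key, Members: len(features), LinkWeight: map[string]int{}}
    if len(features) == 0 {
        return s
    }
    dim := len(features[0])
    s.MeanFeature = make([]float64, dim)
    for _, f := range features {
        for i := 0; i < dim; i++ {
            s.MeanFeature[i] += f[i]
        }
    }
    for i := 0; i < dim; i++ {
        s.MeanFeature[i] /= float64(len(features))
    }
    return s
}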
When finding nodes close together, we could add these relationships to the graph as NEAR relationships (A “is like” B), but unless we intend to use this extensively, this is just filling the graph with junk. The information is already in the vectors and summarized in a separable way by the supernode hubs.
ArangoDB’s ability to use multiple separable collections is important here, in keeping these effective supernodes separate from the original data. We don’t want to be using computational filters, adding to the CPU time. The computational expense of coarse graining is high — it’s related to the problem of ultraviolet divergence in quantum theory. Using a database interaction isn’t the smartest way to do this, but it illustrates the process cleanly.
Another strategy for regularization of data is to randomly sample nodes (a Monte Carlo approach). The problem there is that we don’t know that the order in which nodes will be parsed is random in any fair sense. We could also parallelize the computation. The best computational solution would probably be to distribute the computation in the first place, with systematic caching of aggregate information. By using the four spacetime relations at the time of original data sampling, we have a way of caching aggregates by CONTAIN-ment and NEARness, with linear search times and without losing track of other properties. Of course, the best solution overall is for the manufacturer to select a classification intentionally. This is why we are asked to add hash-tags and other labels to things we publish or add to collections. This is the power of intentional qualitative semantics over quantitative toil. Science is often presented with a prejudice against qualitative information, but this is wrong. We use all the tools available.
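If we do sample, a one-pass reservoir sample gives a uniformly random subset regardless of the order in which nodes are parsed, which addresses the fairness worry above. A minimal sketch:

package sample

import "math/rand"

// Reservoir keeps a uniformly random sample of up to k keys from a stream of
// unknown length, independent of the order in which the keys are parsed.
func Reservoir(keys <-chan string, k int) []string {
    sample := make([]string, 0, k)
    seen := 0
    for key := range keys {
        seen++
        if len(sample) < k {
            sample = append(sample, key)
            continue
        }
        if j := rand.Intn(seen); j < k {
            sample[j] = key // replace with probability k/seen
        }
    }
    return sample
}

Feeding the node keys through a channel keeps the sample independent of whatever order the database chooses for its scan.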
Summary: everyone’s problem is different
What we see here is that size isn’t everything. Data don’t have to be big to be meaningful. For all the bulk of the Amazon public data set, the conclusions it contains could be represented with a relatively tiny amount of data. We just need conclusions to be stable representations of data input, so if a huge amount of data doesn’t basically settle into a clear picture fairly quickly, we’re already in trouble. Of course “quickly” is relative. The ratio between input data and output data (dimensionality of output semantics) is the decider. No one should need terabytes of data to answer a yes-no question — more than once anyway. Had we worked with the entire end to end process, the computation of the result could have been executed more efficiently and in real time.
Don’t let anyone tell you how to approach analytics. Particularly, don’t let trends in Machine Learning trick you into thinking there is only one way to solve problems, with some free magic. It’s actually pretty expensive magic, and not really magic at all!
The Ferryman’s price is generally fixed — you’ll pay it one way or the other in the end. Throwing brute force at the problem is not just wasteful in terms of time and energy; it also makes us a culpable part of the ongoing climate catastrophe.
Ultimately, we need to be able to tell a defensible story about what we’re doing. Keeping track of causal processing matters if you are looking for fidelity of the signal, i.e. true data analytics. This is not what people necessarily do in Machine Learning, where there is not so much learning as “imagining”. How shall we control that? Coarse graining is the basic and undersold tool for scaling data.
In the next and final post in this series, I round off the data analytics story and talk about some of its implications.