Federated Intent and Causality in Distributed Systems
Why data consistency is a problem of promising semantics alongside dynamics.
So, you bit the bullet. You’ve broken up your software or organizational monolith into a number of departments or services. Let’s call them microservices. Your seat reservation system performs a reservation in four steps: i) access a user profile, ii) select a seat, iii) make a payment, and iv) add the reservation back to the user profile. In your wisdom (or because contemporary developer politics strong-armed you), you’ve separated these into three services: users, seats, and payments. Now you have to be sure to visit them all in the right order and keep the state of your order in mind, remembering to complete all stages to ensure a successful outcome.
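To make the coordination burden concrete, here’s a minimal sketch of the happy path in Python. The three services are hypothetical in-memory stand-ins (none of the names or methods refer to a real API); the point is simply that the caller has to remember the order of the steps and carry the intermediate state between them.

```python
# A deliberately naive "conveyor belt": the caller must visit the services in
# the right order and carry the intermediate state from one step to the next.
# All three services are hypothetical in-memory stand-ins, not a real API.

class Users:
    def __init__(self):
        self.profiles = {"alice": {"card": "4111-0000", "reservations": []}}
    def get_profile(self, uid):
        return self.profiles[uid]
    def attach(self, uid, reservation):
        self.profiles[uid]["reservations"].append(reservation)

class Seats:
    def __init__(self):
        self.free = {"12A", "12B"}
    def hold(self, seat):
        self.free.remove(seat)                 # raises KeyError if already taken
        return {"seat": seat, "price": 30}

class Payments:
    def charge(self, card, amount):
        return {"card": card, "amount": amount, "ok": True}

def reserve(users, seats, payments, uid, seat):
    profile = users.get_profile(uid)                            # i)   access the user profile
    hold = seats.hold(seat)                                     # ii)  select a seat
    receipt = payments.charge(profile["card"], hold["price"])   # iii) make a payment
    users.attach(uid, {"hold": hold, "receipt": receipt})       # iv)  add the reservation back
    return receipt

print(reserve(Users(), Seats(), Payments(), "alice", "12A"))
```

If any of these calls fails part way through, the caller is left holding a half-finished reservation, which is where the rest of this article picks up.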
In the real world, we would probably abhor exposing this kind of compartmentalism in the service of users. In Oslo, for instance, I still have to go to explicitly separate offices to visit a doctor, then across town for an X-ray, then somewhere else to handle blood work, before going back to the doctor again to get the results. This imposed outsourcing is a ripe source of colourful language in many parts of my home town–but in software, we’ve come to tolerate and even deify the eccentricity as a way to rationalize resource decisions and shift complexity away from a service back into the user’s workflow.
The scenario described above is the stuff business models are made of. You’d expect that someone in their right mind would step in and arrange the whole thing for you, for a small fee–like a package holiday. While separation of concerns behind the scenes brings benefits, exposing it directly adds risk for unruly users.
Such a user-centric concern might call into question the whole raison d’être of designing with microservices in the first place, but today developers often have the power to feather their own nests, and no one really questions the wisdom of building reusable components. Compromises have to be made somehow. However we look at it, the issue of distributed causality is hard to sweep under any ideological rug. Even if we put a simple interface in front of compartmentalized services, integration and smooth process flow in the presence of unreliability is the central challenge. In the business world, companies buy and vertically integrate the smaller companies they depend on precisely to avoid this complexity.
What’s the big deal? The trickier problems start to occur when several independent services have to depend on the same unique information, but choose to cache their own copy for reasons of autonomy. Suppose each silo needs a copy of identity and perhaps medical records. Should they keep their own copy, or pass one instance from place to place? Before computer networks, you might either have to carry your records with you, or wait for a postal service to deliver updates to your doctor. If someone sent them twice by mistake, would you be charged for two? What happens if you change your doctor? Would a mistyped record lead to an erroneous outcome? Many questions come to mind!
Databases and consistency
In the modern world, we use databases to store the results of these kinds of short term workflow hand-offs as well as to archive long term records. It’s databases all the way down. In particular the relational database model (SQL) dominates. One tool, multiple purposes. We rarely question whether the tool is fit for purpose. But this doesn’t eliminate trouble: relational databases were designed to be immutable (or at least slowly varying) archives; they weren’t designed to capture dynamical processes (we now have things like graph databases for that).
Once you put changes into a non-ordered database, their process order is generally lost. The origins of a change are forgotten and levelled into a shapeless heap of files. The technical term for this is entropy. It’s why version control systems were created. We can try to slap a timestamp on the record updates, but timestamps don’t keep history, and can even tell the wrong times for a number of reasons. It only requires one part of a parallel update to take slightly longer than the others to make the timestamps tell the wrong story! (see figure 1).
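Here’s a toy illustration of that last point, assuming nothing beyond Python’s standard library: update B is issued after update A and represents the newer intent, but A takes longer to commit, so a "latest timestamp wins" reading of the record keeps the stale value.

```python
import threading
import time

record = {}                 # field -> (value, commit timestamp)
lock = threading.Lock()

def write(value, start_delay, work_seconds):
    """Issue an update after start_delay, then spend work_seconds before committing."""
    time.sleep(start_delay + work_seconds)
    with lock:
        record["address"] = (value, time.time())

# Update A is issued first but is slow to commit; B is issued later and commits quickly.
a = threading.Thread(target=write, args=("A: old address", 0.0, 0.3))
b = threading.Thread(target=write, args=("B: new address", 0.1, 0.0))
a.start(); b.start(); a.join(); b.join()

# The record now holds A's stale value with the *later* timestamp, even though
# B's update was issued afterwards: the timestamp records commit time, not intent.
print(record["address"])
```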
These examples intentionally point out how mundane the issues of data consistency are, yet we often bury them in technicalities that suggest there is some magic involved. There are few subjects more theoretical than data consistency (consensus, quorum, partitioning, asynchronous updates, CAP conjecture, FLP theorem, algorithms etc). Today, microservice architectures have brought us to the situation in which each service may keep independent records for its own operational state, and even for its archival logs. Some of the records overlap between the services, meaning there is no single source of truth.
In old fashioned database language, the total effective database (for the whole system) is not in 3NF, or 3rd normal form. If there is no connecting process between the non-duplicate "duplicates", then it’s wide open to inconsistency. We don’t know if there is a causal order, or if one record is an imposter, and we don’t know how to detect and rectify inconsistencies either.
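A caricature of the situation, as a sketch: two hypothetical service stores each hold what is notionally the same fact, and nothing connects the copies.

```python
# Two services cache their own copy of the "same" value. With no connecting
# process, nothing can say which copy is the imposter once they diverge.

users_service    = {"alice": {"email": "alice@example.org"}}
payments_service = {"alice": {"email": "alice@example.org"}}   # semantic duplicate

# Alice changes her email via the users service only.
users_service["alice"]["email"] = "alice@new-example.org"

# The system as a whole is now inconsistent, and there is no causal record
# from which to detect or rectify the disagreement.
for name, store in (("users", users_service), ("payments", payments_service)):
    print(name, store["alice"]["email"])
```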
Subsystem sovereignty isn’t really compatible with cross-border cooperation.
Tomato tomato tomato
In spite of efforts to coordinate the kind of shared data exemplified above, the microservice approach has become popular in recent years, because it favours the sovereignty of developer teams over the source origin and pushes algorithmic complexity upstream from program state into someone else’s workflow. Always nice to outsource problems! In doing so, however, we may impose an implicit workflow onto clients, instead of encapsulating the service as a black box. Workflow and data flow become entangled concerns as well as entangled scales of activity. For the service developer, concerns are separated to make the component view more elemental and easier to maintain, but for another developer integrating the services, the coordination problem is exposed as a new concern.
Microservice designers get the autonomy they crave, but the larger system built from these components still has to work and resolve errors on the scale of the whole process. In practice, the interests of data sovereignty and operational control mean that each service typically has its own database too. From a data modelling perspective, that means a model of the whole system potentially violates the Normal Forms of database modelling.
If semantic duplicates of data arise in different databases, the issue is then to keep these in sync. Microservices must be able to promise the same interpretation of a data value as each other through their shared function, else the combination of components becomes a semantic nonsense.
Componentization of design implies a new complication in software semantics, due to the lack of any invariant semantics across components. In electronics, if you connect a resistor to one circuit, it plays the same role as a resistor in another circuit. But if you connect a bank account number to an input in one program (say a payment credentials module), it will not have the same meaning as in a completely different program (say a spreadsheet adder). Each service component exists for a larger purpose than itself. That larger purpose cannot be held hostage to an inability to cope with conflicts, e.g. one service claims to-may-toe (X) and another claims to-mah-toe (Y). If, for instance, two people (X and Y) both show up to the same appointment for an X-ray, or the records get swapped in the fray, there is a collision of semantics. The services failed by making inconsistent promises.
For technologists, the idea of creating a straightforward conveyor belt process to join up a workflow between individual specialist services seems obvious and compelling. This is sometimes called orchestration or data choreography. This kind of flow diagram thinking is how we are taught to think about processes in school, but it isn’t as easy as it sounds.
A command flow proceeds as a sequence of impositions. Each part assumes the implicit sovereign right to try to tell the next what its answer is. But this is the opposite of the assumption we make when granting autonomy to design microservices! Worse, when there’s more than one component feeding into another, and telling the workflow what it should be thinking, the multi-stage process is then more of a nightmare than a freedom, as it obliges all the parties to work together (see figure 1b).
Sequential dependencies in a workflow add to the problems. If a series of necessary changes fails in the middle, a job is only half done. We need some sequences to be treated as a single atomic change. What does that mean? A passenger cannot half-buy a seat, and certainly won’t be interested in paying for such an idea.
Imagine that a user’s credit card fails to pay for their seat: the seat reservation has to be undone, and the resources reset for others to share. If someone gets the same X-ray appointment as you, you might have to go back and apply for a refund, then go back to the doctor to change the plan. Database folks call this attempt to undo a "rollback". It’s a much misunderstood and abused technical name, from database kernels in which one can assume complete isolation. Most people know in their hearts, if not in technology, that rollback is generally impossible in an open parallel environment:
- The order of events cannot be guaranteed or preserved for a single process (a small part of the universe) so it can’t be reversed. There’s interference between competing processes in general.
- Other processes may have already depended on the state that emerged along the way. The hospital has already used your money to buy drugs, so they can’t afford to refund your fee.
Undoing isn’t feasible when services don’t have exclusivity. If only we could isolate each client process, as if everything was all in one place. The serial “monolith” has certain advantages, but there are lots of ways to bring about the same effect.
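Here’s a sketch of what "rollback" really amounts to in this setting: a series of best-effort compensating actions, run forwards, not a rewind. The services below are hypothetical stubs, and the payment is rigged to fail so that the compensation path runs.

```python
# "Rollback" as compensation: each completed step is undone by a new forward
# action, not by rewinding time. The services are hypothetical stubs.

class Seats:
    def __init__(self): self.free = {"12A"}
    def hold(self, seat): self.free.remove(seat); return {"seat": seat, "price": 30}
    def release(self, seat): self.free.add(seat)

class Payments:
    def charge(self, card, amount): raise RuntimeError("card declined")
    def refund(self, receipt): pass

def reserve(seats, payments, card, seat):
    compensations = []
    try:
        hold = seats.hold(seat)
        compensations.append(lambda: seats.release(seat))        # best-effort undo
        receipt = payments.charge(card, hold["price"])
        compensations.append(lambda: payments.refund(receipt))   # best-effort undo
        return receipt
    except Exception:
        # The "rollback" is just more forward actions, run in reverse order. By
        # now other processes may already have seen or depended on the held seat,
        # and any compensation can itself fail, leaving the job half done.
        for undo in reversed(compensations):
            try:
                undo()
            except Exception:
                pass
        raise

try:
    reserve(Seats(), Payments(), "4111-0000", "12A")
except RuntimeError as error:
    print("reservation failed and was compensated:", error)
```

Note that anything which observed the held seat in the meantime has already acted on a state that officially "never happened".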
Where should we fix this? Who is responsible?
The Downstream Principle
Promise Theory sorts this out. It makes short work of explaining why the imposition view cannot be consistent in a distributed collaboration. The Downstream Principle explains that, in a workflow, those who are more downstream in a process, i.e. those who are closer to the desired end state, have both a greater opportunity and a greater responsibility for delivering the final intended outcome than those at the start of the causal chain. This is the opposite of the mentality of blame that comes from command and control thinking.
Years ago, in a Google talk, I used the analogy of language to make the point. Imagine British and American parents trying to teach their children how to say tomato "properly". Mum says it’s "tom-ah-toe"; Dad says it should be "tom-ay-toe". Each imposes their correct intention onto a child who can’t comply with both commands. Unable to choose between the two, the child has the ultimate responsibility to choose its own path for a successful outcome. It could choose both in different circumstances or just one. The concept of righteous rules is a nonsense. But there’s an alternative: promises are expressions of intent by best effort. Mum and Dad can promise that their way is what they say. The child can promise to accept one or the other, both, or even none. Each agent can only promise its own behaviour, but it still has to deal with whatever might come from elsewhere. It can’t impose its own standards without simply isolating itself from collaboration. That’s sort of the whole idea of microservices too.
In data systems, this same issue crops up with infallible regularity. When clients push changes into a shared database, they expect their transactions to be respected as commands to be obeyed. But they may not realize that they are not alone in wanting this. More often than not, they are in competition with others who would be The One to determine the outcome. The result is a race to get there either first or last, depending on your point of view. Solutions like Paxos, Raft, Zookeeper, CRDTs, and more have been proposed to improve the rules by which updates are equilibrated.
Promise Theory tells us that (as long as all agents consent) we have two choices for keeping things in sync or single-valued (see figure 4). Either we base everything on a short causal path from a single source of truth (a), or we wait for all the parties to equilibrate by talking to one another (b).
The second is slow and harder to be certain about, as well as over-constrained, so it’s normally used only within a cluster, to select a "leader" amongst a group of redundant agents that will act as the source of truth for keeping failover backups. But this is not the only case of interest: for inequivalent microservices, the agents are not redundant. Sometimes developers will use a special key-value store to fudge consistency, adding back some of the complexity they were hoping to avoid.
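For concreteness, the equilibration option is the territory of things like the CRDTs mentioned above. Here’s a minimal, state-based grow-only counter as a sketch (not production code, and not tied to any particular library): each agent only changes its own slot and converges with the others by merging, in any gossip order.

```python
# Option (b) in miniature: agents equilibrate by exchanging and merging state.
# A state-based grow-only counter (G-counter CRDT): each agent increments only
# its own slot, and merging takes the element-wise maximum, so any two agents
# that gossip long enough converge on the same value.

class GCounter:
    def __init__(self, agent_id, agents):
        self.agent_id = agent_id
        self.counts = {a: 0 for a in agents}

    def increment(self):
        self.counts[self.agent_id] += 1      # an agent only promises its own behaviour

    def merge(self, other):
        for a in self.counts:
            self.counts[a] = max(self.counts[a], other.counts[a])

    def value(self):
        return sum(self.counts.values())

agents = ["users", "seats", "payments"]
a, b, c = (GCounter(name, agents) for name in agents)
a.increment(); b.increment(); b.increment()

# Gossip in an arbitrary order; the result is the same regardless of who talks first.
c.merge(a); c.merge(b); a.merge(c); b.merge(c)
print(a.value(), b.value(), c.value())   # -> 3 3 3
```

The catch is that only operations which merge cleanly (like incrementing) can be expressed this way, which is part of why the problem doesn’t simply disappear.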
Having a single source of update doesn’t necessarily solve the problem of inconsistency, if the recipients notice the changes at different times relative to a client process. An update (like that in figure 3a) may arrive at different times compared to a process that is consulting the two recipients in quick succession. We can’t predict which update will win the race to read or write unless we use locks and blocks to limit access.
In a stream of changes, who wins a race is not really the appropriate question. There is no universal principle of correctness for order, based on “the survival of the fastest” (FCFS) or the “survival of the slowest” (last word wins the argument). Yet nearly all database systems treat that final word as the single correct answer, even as that changes over time. Developers and system designers need to question their own thinking here. Why is change occurring? What semantics do we attribute to change?
The problem is often that there are unexpectedly many changes to make sense of in the first place, especially as systems grow and scale in intensity. That, in turn, is a sign that the dynamical system is not well represented by a static data model. The only possible answer is for downstream clients to have a policy about how they will promise the correctness of the outcome.
Consistency across overlapping process shards, even in calm seas
Even when a system is in a state of relative calm, there may still be a problem. In microservices, data and process are effectively "sharded" by being encapsulated — but, counter to the normal assumption, the shards in non-normalized data models are actually designed to overlap partially! They may pretend local sovereignty over data, but they are really appropriating common property, common dependencies. The greatest risk then is that semantics become multi-valued, i.e. that data mean different things in different contexts!
Again, Promise Theory reveals the essence of the issue in a simple way. If we reverse the roles of a) and b) in figure 1 above, then b) shows the sources of intent formally distributed across multiple senders, and (from the perspective of those senders) the intended order has no clear definition. We’ve started from two-dimensional sequences instead of from one unique sequence. Our goal is to transmute this two-dimensional message into several one-dimensional parts, but the dovetailing has no prescribed order. Either we need to supply one ad hoc and stick to it, or some order has to be encoded through dependencies. Reversibility of these processes is essential in storing and retrieving process meaning.
Laminar versus turbulent flow
There’s an analogue for these issues in the physics of flow. Laminar flow is what happens when parallel processes don’t interact. You might have seen demonstrations of laminar flow showing perfect rollback in a fluid. But this is a very special case. Most processes do interact, even if only weakly.
We should not depend on something that’s unstable. Services need to promise not to serve up data when its values are depended on at a larger scale! We can’t escape the fact that promises on a larger scale demand processes on a larger scale to keep them. Those independent services can’t be completely independent after all. Let’s be clear that this is not something a technology can fix. If a designer insists on building in an earthquake zone, (s)he’d better build that assumption into the model.
Even my Android phone sometimes has issues updating its display in application sharing, presumably due to power-saving constraints around cache updates. This has become increasingly annoying as phones vie to make better competitive promises about battery life.
Is there a fair solution as a service?
These issues haunt computing from the chip level (cache consistency) to the cloud level (where atomic clocks have been used to try to keep certain promises about data synchronization).
We’ve already established that scaled promises appear monolithic on the outside, but that microservice platforms can lead to the opposite of that kind of harmony, because each component craves sovereignty over its own control. That’s a collision between politics and engineering, but there is a reasonable technological solution to the dependency issue, which preserves individual autonomy at the expense of outsourcing the task. We want someone to take care of the coordination, but we also want to manage a service without our hands being tied by someone else. There are two main levels to this.
- A single application can run its own platform instance on the scale of cooperation to manage consistency. This is relatively non-invasive.
- A global service (a cloud service) with individual project registrations could be shared between many, even all applications as a meta-service, just like any other cloud service.
A promise-enforcing platform is a simple answer. We’ve described the bare bones of that in a paper, Continuous Integration of Data Histories into Consistent Namespaces. Such a platform requires a few long term changes (fairly insignificant) in the way developers work, but we can also approximate some of its goals in the short term.
Making these updates globally consistent can only be done from a single source of truth, which may change over time, but the results can be communicated globally as long as we preserve the chain of causal dependencies in a way that respects the fundamental separation of components. Semantics can’t be assumed or imposed from above if we are to preserve the meaning of a system; they have to be promised explicitly, from the bottom up, based on what each component promises: mutually and bi-directionally by the agents in the system, so that it keeps the same promises on a wider scale that it keeps on a smaller scale.
One answer lies in the notion of a publish-subscribe pattern. First, changes are published from the point of action, where they are determined. Then they can be rolled out like a series of versions in a version control system.
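As a minimal sketch of that idea (illustrative only, not the mechanism in the paper): the point of action appends each change to a versioned log, and subscribers apply the log in order, much as they would pull commits from version control.

```python
# The point of action publishes each change with a monotonically increasing
# version; subscribers pull and apply the log in that order. A real system
# would persist the log and handle retries; this is only the shape of the idea.

class ChangeLog:
    def __init__(self):
        self.entries = []                       # the single, ordered source of truth

    def publish(self, change):
        version = len(self.entries) + 1
        self.entries.append((version, change))
        return version

    def since(self, version):
        return [entry for entry in self.entries if entry[0] > version]

class Subscriber:
    def __init__(self):
        self.version = 0
        self.state = {}

    def pull(self, log):
        for version, (key, value) in log.since(self.version):
            self.state[key] = value             # apply changes in causal order
            self.version = version

log = ChangeLog()
log.publish(("seat/12A", "held"))
log.publish(("seat/12A", "paid"))

s1, s2 = Subscriber(), Subscriber()
s1.pull(log)
log.publish(("seat/12A", "ticketed"))
s2.pull(log)
print(s1.version, s1.state)   # behind, but consistent with a prefix of the history
print(s2.version, s2.state)   # further along the same history
```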
The first stage may simply be to federate some SQL tables amongst a number of cooperating components, allowing several points of access but preserving promises about causal order with an interleaving clock (again, see the paper Continuous Integration of Data Histories into Consistent Namespaces). This can be handled by intercepting and indirecting over a layer between client and server instances. We have a prototype for this built on JDBC and Kafka. It’s already helpful to enterprises, but it’s hardly a forward looking solution.
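The interleaving clock itself belongs to the paper, but the general flavour can be suggested with a widely known relative, a Lamport-style logical clock: each site stamps its updates with a counter that it advances past any counter it has seen, so causally related updates can be ordered without trusting wall-clock time (truly concurrent updates still tie, and need a policy to break the tie). This is a sketch of that device, not of our prototype.

```python
# A Lamport-style logical clock: counters, not wall clocks, carry causal order.

class Site:
    def __init__(self, name):
        self.name = name
        self.clock = 0
        self.log = []                                     # (clock, site, change)

    def local_update(self, change):
        self.clock += 1
        entry = (self.clock, self.name, change)
        self.log.append(entry)
        return entry

    def receive(self, entry):
        remote_clock, _, _ = entry
        self.clock = max(self.clock, remote_clock) + 1    # advance past anything seen
        self.log.append(entry)

users, payments = Site("users"), Site("payments")
e1 = users.local_update("email -> alice@new-example.org")
payments.receive(e1)
e2 = payments.local_update("billing email refreshed")     # causally after e1

# Sorting by (clock, site) respects the causal chain e1 -> e2.
print(sorted(set(users.log + payments.log)))
```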
The real problem lies in software engineering. The more we replicate, the greater the constraints on the developers locally. However, there is an alternative. At a platform level, communities of developers could meet in the cloud, where a single super agent retained causal control of all data at scale and virtualized links between private database duplicates, as if in multiple namespaces with common dependencies (something like symbolic links for shared data). Each individual microservice or application team would then have their own credentials and would set up a contract (a set of mutually beneficial promises) that outlined the specific compromises made around the sovereignty of certain data values. The platform would then enforce these connections on their behalf, and handle updates as a set of globally synchronized updates, optimized according to the direction and nature of the coordination promises.
Unlike push-based updates (Paxos, Raft, etc) these would be entirely a matter of voluntary cooperation — thus preserving the autonomy craved by developers and the consistency of a causal process for change. Some issues are inevitably tied to specific kinds of database. SQL semantics are standardized (for better or for worse), but with some forward looking design, it could be made transparent for all models — at the expense of some configuration.
Microservices will probably be subsumed by compilers for distributed systems in the future. To realize that goal, we need the kind of cooperative platform described here.
Obviously, the greater the scope of synchronization needed between components, the slower the clock of consistent ticks can run. In that regard, keeping service models small is an advantage of the microservice philosophy. We can’t speed up data transport. So we could instead start with key-value pairs (this is what key stores like Zookeeper, etcd, and Consul do), or we can do the same for SQL tables, if there is model consistency on a table level. We can go on, up the scales, until we reach full database replication. The traditional consensus protocols will not scale to these levels, however. A new approach is needed.
For well known reasons, a distributed update cannot be made with temporal consistency by a push method. Only if there is a single versioned source of truth, in a single location, can a distributed change be interpreted as an instantaneous coordinated update when pulled on demand. The effects of that change will not be instantaneously visible uniformly everywhere–that would be impossible. The information still needs to travel at a finite speed over various distances, but that might not matter as long as no one needs to look. With a broadcast replication scheme for published data, like a Content Delivery Network (CDN), we can arrange for the information to be close to everyone for a finite size network. In fact, this idea is not new. Even for general communication, Named Data Networking (NDN) has been proposed for more than a decade.
Postscript
Data consistency is potentially violated in cooperative distributed systems, like microservices. Ultimately, this is a design issue, but we can envisage technical assistance to work around the grey areas. In the paper referred to above, a method to scale causality is described. It means that we can have scalable version control on general data as long as we design processes to observe a basic temporal separation of scales. Throwing data randomly at a catch-all data lake from different sources will never work.
All this begs the question: what does data consistency really mean to us, as users and as developers? Are we aware of it? If so, what do we mean by it and what value do we place on it? Are we asking computers to do something we don’t even ask of ourselves? More technically: have we been using databases in a way that is now unfit for purpose?
For handling workflows and process state, we don’t need archival data models — we need ledgers, logs, and graphs. When we treat locally ephemeral data as globally persistent facts, in flawed reasoning models (and vice versa), it’s no wonder we end up in a mess. Computer Science loves flows and stateless Markov processes, but in fact the world is more about stateful memory processes than we care to admit.
It’s surely time for a change in our thinking, if only we could all do it in sync…