Why can’t “configuration” be made simple?
A (rather long) view on the configuration-compliance-coherence trust problem in IT
Three central C’s: configuration, compliance, coherence. No matter how many new technologies we invent for configuring infrastructure and data in IT systems, they all eventually fall from favour. Sometimes it’s because infrastructure moves up a level (as with cloud) and sometimes it’s because the older methods overwhelm new users by expecting knowledge they don’t have. The rapid rate of change in infrastructure leaves people confused. More spuriously, technology has become a fashion accessory rather than an engineering decision: in the 21st century, engineers tend to choose products based on brand rather than technical analysis, because they want to be part of a tribe. Some technologies come from industry, some from academia. Some disappear more quickly than others. Most of them are forced to endure critical abuse from their replacements. Weary engineers argue for a return to the ways of olde: let’s just use shell scripts!
Although all these issues originated in the configuration of basic infrastructure (hardware and software), today the lines between infrastructure and software are blurred by virtualization and service architectures. As IT penetrates every digit of society, it becomes a clear and present safety issue. We increasingly insist on accurate and timely services that can “promise a stable and consistent experience”–this is how trust is built. Thus regulation is coming, from the highest levels, to ensure this is not left to the whim of companies with other agendas. Resistance is futile.
The hard problem of configuration?
For many, configuration means software settings and preferences–something you hope to decide once and for all when you start using some software, and can then forget about. So what’s the big deal? Infrastructure engineers know it as something broader than just personal choices: configuration is the arrangement and fine tuning of all aspects of service delivery. It concerns everything from the design layout to the security of a working system. It includes how the parts are positioned within a particular realm of interest (say the local network, a single computer, or a hosting platform) and how those various services talk to each other. It’s a circuit diagram of every detail.
The IT community can't even agree on what configuration means. Pages about it on Wikipedia studiously refuse to acknowledge one another's work, leading to a mess of conflicting terminology and a distorted historical record.
As time goes by, most of us realise that the techniques for managing personal settings turn directly into this more general problem, particularly when we deal with scale. Positioning and plugging together resources creates patterns, as in all programming, and we reuse these patterns to create just as much uniformity and consistency of experience as we need. Hopefully, we don’t simply impose it on those who don’t want it. Some standards are regulated, some are policy, some are convenience.
Engineers don’t always make the right choices for users. Recently, I moved into a new apartment. It has a nice kitchen but the tap over the sink is mounted too high so that water splashes everywhere when you turn it on. If you put a cup or a bowl in the sink and turn on the tap, it never goes into the cup or the bowl because the position is aligned with the drain rather than the possible contents of the sink. It’s a similar story in the bathroom. This too is a configuration problem–a rather simple one, in fact–and yet the solution still wasn’t fit for purpose. Half the problem (at least from my perspective) was neglected. Yes, the tap delivers water and the water runs to the drain, and there is space for things in between. But what about the relationship between water flow and the things to be washed or filled? The designers only solved a static problem instead of the dynamic one it really is. The flaw is easily solved by making the tap more flexible. Bendable nozzles and extensions are easily fitted but are not standard. Too often, with infrastructure, we imagine a fixture rather than a part of an adaptive process. The same is true in IT.
Configuration is thus both a risk and safety issue and a usability advantage. A functioning system must have both correct information and correct behaviour, but one does not imply (or guarantee) the other. If data are incorrect, our intentions will not be captured correctly and everything we do will be misconceived and irrelevant. If behaviours are incorrect, then every time we alter or move data they will become corrupted–either for some or for all.
Perhaps the most common example of this concerns databases, where a major concern is the “consistency” of distributed data across replicas and sites. Consistency means that data must be true to a common standard when passed around. It sounds easy enough, but it’s a difficult and contested topic. Somewhat bizarrely, we use loose and fluffy ideas like quorum (borrowed from human management) to decide what correctness means for critical outcomes (i.e. a majority vote by possibly inconsistent opinion holders). One might think we could do better than asking a committee to vote on correctness where matters of human safety could be at stake, but (for better or worse) this remains the industry norm.
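To make the quorum idea concrete, here is a minimal sketch (in Python, not tied to any particular database product) of the majority-quorum rule, in which any read quorum overlaps with any write quorum whenever R + W > N:

```python
# Minimal sketch of majority-quorum replication: a write and a read must each
# touch enough replicas that they are guaranteed to overlap (R + W > N).
# Illustrative only; real systems add version conflicts, retries, and repair.

def quorum_size(n_replicas: int) -> int:
    """Smallest strict majority of N replicas."""
    return n_replicas // 2 + 1

def write_value(replicas, value, version, w):
    """Pretend the first w replicas acknowledge the write."""
    for rep in replicas[:w]:
        rep["value"], rep["version"] = value, version

def read_value(replicas, r):
    """Read r replicas and trust the highest version seen."""
    answers = replicas[-r:]   # deliberately a *different* subset of replicas
    newest = max(answers, key=lambda rep: rep["version"])
    return newest["value"]

replicas = [{"value": None, "version": 0} for _ in range(5)]
w = r = quorum_size(len(replicas))        # W = R = 3, so W + R = 6 > 5
write_value(replicas, "max_connections=100", version=1, w=w)
print(read_value(replicas, r))            # the overlap guarantees we see version 1
```

The rule only guarantees that the quorums overlap: the “truth” is still whatever the majority of opinion holders happens to report.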
As engineers, we don’t want too much fuss. We start out hoping for maximum simplicity: an economy of effort over the lifecycle of software. This includes ongoing maintenance, which not everyone plans for with much care. Thus product managers and developers tend to strip away as much sophistication and adaptability from methods as they can, until that minimalism backfires and results in an incident report. After rounds of tweaking and patching to try to refloat a simple model, it’s time to start again. It’s not that we couldn’t solve the problems elegantly at the outset; I suspect it’s rather that many in IT find these details of management boring or trivial next to what they would really like to be doing. A common solution is to impose an artificial simplicity onto software and users as a condition of use, in order to get on with more interesting aspects of the code.
From prior research, we know that decision selection is a formally hard problem, yet technologists increasingly reject the science and pursue their own priorities.
Change management as code?
In IT, most approaches to configuration deal only with static patterns: data fixed once, for all, and forever at the beginning of an installation. Once set, we don’t expect to touch these values again. But this ignores drift, erosion, and other maintenance issues like garbage collection that unintentionally change the conditions of the system. Change doesn’t require a human hand, yet we nearly always assume that all changes are intentional. In the cloud era, where programs are not expected to last very long, the modern narrative is to view change as the evolution of intentional “software releases”, and we neglect the hidden resources that the software implicitly depends on.
Terms like “infrastructure as code”, DevOps, GitOps, etc. have been coined to try to persuade engineers to adopt a similar level of discipline around infrastructure. After all, even these campaigns for better management of software (based on versioning) were only recently championed by Continuous Delivery methods. Such housekeeping was taken for granted by previous generations, but because the methodology was never taught in colleges, later generations had to learn it through community campaigns.
The problem of managing real-time change in “complete computer systems” is still woefully neglected, even after 30 years of work. The result is that every year some new emergency configuration language or system gets invented to “solve” the immediate problems caused by whatever happens to be the fashionable negligence of the day. Modern digital regulatory laws will soon render such negligence illegal. Most noticeably, the trend has been to refactor automation and abstraction into cloud platforms, and go back to occasional hands-on intervention by humans to manage the rest.
Perhaps the most important lesson we learn as we get older is that ideology is rarely useful in the face of the hard facts of what can be achieved in practice.
After 30 years of configuration research, it’s time to review these trends in the light of a deeper understanding about how other fields deal with configurations.
Other fields…
I started life as a physicist. As we learn in Newtonian physics, a configuration is a snapshot of a “system of interest” described by the coordinate positions of all bodies at a certain time. We can calculate past and future trajectories of bodies, to some extent, because the “software” of physics is fixed and known in the form of Newton’s laws. Yet we can only predict configurations accurately for two bodies: the three-body problem is already too complex to be solved reliably in general. So, in physics, we don’t expect even initially simple configurations to remain simple over time, yet in IT we routinely do.
We call the (x,t) realm of coordinates, in which bodies are arranged, configuration space to distinguish it from other parameters that describe changes to their interior states–though to be fair, one could equally call the realm of internal quantum numbers a configuration too (in the IT sense). The distinction is more a point of view than a reality.
We summarise behavioural “policy” by equations of motion (or their solutions) in physics. Together with the initial settings or boundary conditions for externally decided fixtures, these allow us to predict the state of bodies over time–not just at the moment of process initiation.
In physics, what we would call configurable data in IT are the parameters used to adapt the general equations to specific cases. We separate initial (or boundary) conditions from dynamic evolution in this way. Why is there such a separation? It’s really a reflection of a hierarchy of scales. Nothing is truly fixed, but some variables change quickly and some change only very slowly. The slow ones are what we choose to abstract as constants, invariants, or fixed parameters. In a pendulum, for instance, the length of the string and the strength of gravity are treated as fixed parameters, while the angle is the variable that evolves.
In IT, most people think that configuration only refers to the initial conditions of a trajectory. Unlike the idealised equations of physics, we can’t completely isolate trajectories from external forces, because there isn’t such a convenient separation of scales into fast and slow change. This is why flight trajectories need constant corrections (I tried to change this perception in the 1990s). When Windows began to dominate computing, a course correction meant a complete reboot of a system: killing off the pilot and passengers and trying the journey again in a new (improved) vessel. In the cloud era, we are basically back to that method of restarting with disposable cloud models.
In this sense, sticking to the name “configuration” is misleading. Although physics has learned to solve many issues, what it doesn’t describe is semantic purpose–the intent of a physical mechanism–so it’s not a complete analogy to IT. In physics, structures don’t have a purpose: they simply are. In IT, they serve the aims of users. To add the dimension of intent to a system, we need to think beyond pure data and expand our idea of what space and time represent for IT.
Semantic Spacetime is where your configurations actually live
We can go much further in formulating a solid abstraction than Newtonian mechanics did for physics, by building on fundamental truths about pattern and causality, and moving them into a more realistic representation of space than Euclidean coordinates. Locations in IT are discrete, networked process locations–we can just call them “agents”. All these ideas have been brought together in a general description of intent and state, beginning with Promise Theory, that I call Semantic Spacetime. With it, it’s possible to predict the challenges that IT is going to face over the coming years. Indeed, many have already been predicted and even solved.
What we call an agent in IT depends on the scale of what we’re doing and the isolation of process that implies. The focus has shifted over the decades, due to the economics of shared computing (see figure below).
When planning services, we tend to think from the top-down (from the outside-in). However, the core behaviours are actually governed from the bottom-up (from the inside-out) of the devices involved.
A conductor in an orchestra can’t play every instrument in parallel from the top down, e.g. by remote command, but he or she can shape and coordinate the performance as long as every player has detailed instructions and listens for signals. Coordination is a top-down problem, because it requires a central standard for calibration (like a conductor). Playing is bottom-up, because that’s where the levers of change are. This points us to a basic model of agents as atoms, which shows how physics and IT become a single point of view. That model became Promise Theory.
Atomic specifications and desired states
The key ingredients for engineering a desired outcome are:
- A static representation (call it a specification) of an intended state at some reference time, usually the start or the end (e.g. the musical score or computer preferences), and
- The process by which that intent is realised or evolved (the playing or execution algorithm).
Together, these express what to do and how to do it through the stages of a configuration life cycle, from start to possible end. Together they also form a language. The complexity (expressivity) of the language can be debated, but experience shows that intent demands a lot more than pure data models and schemas can represent, unless there are carefully specified semantics implicit in their interpretation–and that’s exactly what we mean by language.
On a technical level, a language is nothing more than a pattern fed as input to one or more processes. Languages are associated with process outcomes, which in turn are associated with intent. Languages can be diagrammatic (like IKEA “destructions”), dials or switches (like an autopilot compass heading), iconic symbols (like Chinese characters or desktop icons), or alphabetic combinations (like English). The arrangement of patterns encodes information by combining basic symbols, and this gets interpreted in context to provide meaning.
Intent as language
To describe intent, we make specifications about position and direction in our semanticised spacetime, consisting of data (initial conditions) and behaviours (equations of motion or policy). Either we declare the desired state (with an implicit algorithm to be solved), or we fix the initial state and script the precise algorithm or recipe to follow from there. During the ideological disputes of the noughties, the former came to be called convergence, while the latter was called congruence. As long as everything is deterministic, these two are entirely equivalent. However, if there are unknowns, they are not. The disputes around configuration basically boiled down to whether one believed in determinism or not. For the record, it’s easy to show that determinism is unachievable by design.
- A declarative language attempts to describe an intended outcome as a reference for later audit.
- An imperative language is a way to describe a precise set of actions to be followed without improvisation or interpretation.
When you buy your furniture pack from IKEA, both imperative instructions and a picture of the desired outcome are provided for reference. If we were practised in executing the plan and every build were completely identical, we could eliminate one or the other, because the patterns could be assumed to be “well known”. The natural one to eliminate would be the instructions, since we would still like to compare the outcome of the process with a reference picture. This is also the case for many changes in a computer system, which is what led to the declarative languages, starting with CFEngine. The changes themselves are not hard, but figuring out what to do, where, and when is harder and requires knowledge-based reasoning.
People disagree about which of these two descriptions is better. Should we fix a method or an outcome? By training, we are used to thinking about the steps involved in making something happen rather than the final state. It’s tempting to think that when everything is implicit for a well known procedure, all that remains to specify is simple data, but that’s only true if the language implies the reasoning steps by convention. Without expressive language, you can describe change exhaustively, brick by brick, atom by atom, but you can’t compress patterns symbolically or simplify a procedure intelligently into just a few words.
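As a toy illustration (the path, owner, and setting here are invented, and this is not any particular tool’s syntax), the same intent can be written both ways:

```python
# Toy contrast between declarative and imperative descriptions of the same intent.
# The path, owner, and setting are invented for illustration.

# Declarative: only the desired end state is stated. An interpreter works out
# the actions, and the same description serves as an audit reference afterwards.
desired = {
    "path": "/etc/demo/app.conf",
    "owner": "root",
    "mode": "0644",
    "contains_line": "max_connections = 100",
}

# Imperative: a precise sequence of actions, with no statement of the end state.
steps = [
    "mkdir -p /etc/demo",
    "touch /etc/demo/app.conf",
    "chown root /etc/demo/app.conf",
    "chmod 0644 /etc/demo/app.conf",
    "echo 'max_connections = 100' >> /etc/demo/app.conf",  # appends again on every run
]
```

Naively re-running the imperative recipe appends the line a second time, whereas the declarative description can be compared against reality as often as we like without changing its meaning; that repeated comparison is what convergence means in practice.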
Lingua pranka
Languages are not just for giving instructions (or destructions!); they are also used for reasoning and analysis, in advance of and in review of an event.
Given the present revival of interest in AI, this could be assumed obvious, but people often have difficulty in transferring ideas and knowledge from one area to another. Configuration language is not trying to express poetry in many colours, but it needs at least one way to express whatever we intend to create, in every context.
Compliance with stipulated regulations is a common requirement in many industries, so expressing guidelines in a language that can promise continuous configuration and maintenance was a desirable goal for auditing. With newer regulations like DORA, European businesses are now expected by law to be able to explain this as part of their risk plans. This suggests that the language of configuration should be readable by non-expert auditors, so a language has to play two roles: a precise specification for experts and a comprehensible pedagogy for inspectors.
Anyone who’s worked in IT knows that it’s easy enough to construct bits of software and infrastructure by invoking instructions–even when the process involves mindlessly long specifications of millions of issues. However, for someone to understand what you actually created afterwards requires something else entirely. If you have to serve these two masters, the rush to oversimplify for the engineers’ benefit is not your friend.
Promises, promises
The arrival of Promise Theory showed us how we could blend these two goals, and it proved that any configuration could be expressed without conflicting intentions if and only if the promises were evaluated and kept from the bottom up. The implication then was that a model of the overlapping concerns would have to be an exercise in knowledge management or pedagogy.
In a promise decomposition, every independent resource would be represented by a distinct agent, and every possible variation would be a promise. A computer is an agent, in one sense, but it is also made of many smaller agents including files with static or variable content, static or variable software, and static or variable running tasks. These promises might change over time in order to adapt to context and policy, and the same promises might be used by several agents of the same kind at different locations for scaled expansion, so a reusable pattern language was clearly needed. The figure below shows the basic elements of syntax for CFEngine 3. Superficially, it’s not that different in substance from YAML, but what one doesn’t see is how it gets evaluated, with variable substitution and iteration.
The basic contention in Promise Theory is that each agent can only do what it can promise to do. No amount of imposed obligation from the top can make an agent do or represent something it’s incapable of or is unwilling to do. To configure the collection of agents that we are made of, and hence control, we either autonomously declare what we intend to be, or we declare our intent to listen to instructions from someone else (and then, assuming the instructions overlap with what we can promise autonomously, we align with that intention).
Bottom-up Promise Theory thus tells us how to harness parallelism and autonomy to scale promise-keeping–not just once at the outset of deployment, but continuously thereafter and entirely from within. Bottom-up implementation doesn’t imply a loss of central control; rather, it explains what the necessary and sufficient conditions are to achieve it.
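That first principle can be captured in a few lines of code. This is a sketch only, with class and method names of my own invention rather than any standard Promise Theory notation:

```python
# Sketch of the Promise Theory constraint that an agent can only promise
# its own behaviour; the names and structure here are illustrative only.

class Agent:
    def __init__(self, name, capabilities):
        self.name = name
        self.capabilities = set(capabilities)   # what this agent is able and willing to do
        self.promises = []

    def promise(self, body, promisee):
        """Make a promise about our own behaviour, if we are capable of keeping it."""
        if body not in self.capabilities:
            return None                          # imposed obligations change nothing
        self.promises.append((body, promisee))
        return (self.name, body, promisee)

db = Agent("database", {"accept queries on :5432"})
web = Agent("webserver", {"serve http on :80", "use the database's query service"})

offer = db.promise("accept queries on :5432", promisee="webserver")           # an offer (+)
usage = web.promise("use the database's query service", promisee="database")  # acceptance (-)

# Cooperation exists only when both the offer and its acceptance hold;
# no third party can conjure either promise on the agents' behalf.
print(offer is not None and usage is not None)   # True
```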
Today, it’s clear that there are still ways to reduce the linguistic overheads involved in precise modelling and yet further improve the usability of the language. Changing the tyres might be better than reinventing a poorer wheel. Consider the excerpt below.
It still feels cumbersome, yet it is scalable and precise. It leads to accurate documentation of compliance with promises in real time. Just what the auditors need for SOX, NIST, GDPR, DORA, etc.
Implicit in this language is that every promise is compared to reality by an inbuilt monitoring of state, so it represents a best-effort promise kept. So-called context “classes” and scoped namespaces provided highly sophisticated if-then-else decision-making, while expanding variable substitution of scalars and lists provided so-called “Turing complete” expressivity. A reader doesn’t need to see these algorithms; they are to be trusted.
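For the curious, the evaluation model can still be sketched very loosely like this (a Python illustration of the idea, not CFEngine’s actual implementation; the classes and promises are invented):

```python
# Loose sketch of convergent evaluation gated by context "classes".
# Not CFEngine's implementation; the classes and promises here are invented.

import platform

def discover_classes():
    """Context classes describe where and when an agent finds itself running."""
    return {"any", platform.system().lower()}       # e.g. {"any", "linux"}

def converge(promises, classes):
    """Compare each promise with reality; repair only what is not already kept."""
    for p in promises:
        if p["context"] not in classes:
            continue                                # class undefined: promise not relevant here
        if p["check"]():
            continue                                # promise already kept: do nothing
        p["repair"]()                               # best-effort attempt to keep it

promises = [
    {"context": "any",
     "check": lambda: False,                        # pretend the state has drifted
     "repair": lambda: print("repairing: restore promised file permissions")},
    {"context": "windows",
     "check": lambda: True,
     "repair": lambda: print("never reached on this host")},
]

converge(promises, discover_classes())
```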
Without variables you can’t compress patterns into programmatic structures. So formats like YAML that have become temporarily dominant in recent years can only list repetitive data exhaustively. Conversely, a programming language, say Python, can generate any set of outcomes by Turing’s theorem, but it describes the method rather than the intention itself.
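A trivial example of the compression that variables and iteration buy (the host names and mount point are invented): one parameterised statement of intent covers every case, where a pure data format must spell each one out, and spell it out again when a new host arrives.

```python
# Pattern compression through variables and iteration, versus exhaustive listing.
# The host names and mount point are invented for illustration.

hosts = ["web01", "web02", "web03", "web04"]

# One parameterised statement covers all hosts, present and future...
intent = [{"host": h, "mount": "/var/log", "options": "noexec"} for h in hosts]

# ...whereas a pure data format (YAML, JSON) would repeat the same stanza
# once per host, and need editing again when another host is added.
for item in intent:
    print(f"{item['host']}: ensure {item['mount']} is mounted ({item['options']})")
```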
Promise Theory has also been applied beyond configuration, to another major challenge of distributed service platforms: the highly dynamical patterns of consistent SQL data replication.
The lessons of CFEngine 3
CFEngine 3, in my mind, came closer than anything else to the ultimate configuration balance between powerful representational language and verifiable, scalable reliability. Yet, even with all the fundamentals in place, the task of modelling clearly exposes too much complexity. There’s an overhead for scaling abstraction that makes it cumbersome for small jobs. It forces the designer to think on several levels.
Thinking on several levels is absolutely necessary at scale, because it gives us the flexibility to express our intent without forcing a particular hierarchical order on us, as some data-based configuration schemas do. Modelling quickly becomes complex, because there are many details to consider and many ideas and intentions to represent. This led to the microservice movement in software, which simplified individual models but exposed cross-model inconsistency instead.
What sabotages simplicity is not the use of a language per se, but rather the vast number of order-dependent degrees of freedom involved in making changes to data representations like text files. That said, a weak language turns the problem into one of brute force rather than intelligent pattern. A strong language will span many issues in a few statements. Theory tells us about the necessary complexity of the language.
For physicist-IT engineers, we could say that Markov processes have simple regular pattern languages, but they are basically disordered. Order dependencies require memory-based, contextual languages that can encode and self-referentially describe state. We need the latter to be able to describe configurations that express meaning over non-trivial scales.
Because our human world is not deterministic and rigorously machine-like, we need to allow for ambiguity. For instance, should a language of intent care whether the lines in a file have different white space from the pattern we promise, given that white space is a symmetry (i.e. it plays no role; only the non-white-space characters matter)? Similarly, should the order of the lines matter, or is that also a symmetry? The password file doesn’t care, because it’s really a random-access database. If we only care that a file is not writable by random strangers, what’s our policy on not overriding the other parts of the file’s permissions?
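As a rough sketch of what such a policy choice looks like in practice (the decision to ignore white space and ordering is mine, purely for illustration):

```python
# Sketch of comparing a file against a promised pattern while treating
# white space and line order as symmetries, i.e. differences we choose to ignore.

def normalise(lines):
    """Collapse runs of white space, drop blank lines, ignore ordering."""
    return {" ".join(line.split()) for line in lines if line.strip()}

promised = [
    "alice:x:1001:1001::/home/alice:/bin/bash",
    "bob:x:1002:1002::/home/bob:/bin/bash",
]

actual = [
    "bob:x:1002:1002::/home/bob:/bin/bash",         # different order...
    "  alice:x:1001:1001::/home/alice:/bin/bash",   # ...and stray white space
]

# Under this policy the two files keep the same promise; a stricter policy
# that treats order or spacing as significant would call this a violation.
print(normalise(actual) == normalise(promised))     # True
```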
There are safety issues: by changing permissions of shared resources randomly, we could accidentally obliterate a working system in unforeseen ways. So as long as multiple tasks or users have to coexist, we either need to reach consensus with them, or keep our fingers out of it!
To eliminate these ambiguities and freedoms, one has to rely on fault tolerance by design. What surprises developers is that this adds back, in the form of later maintenance cost, whatever complexity you thought you had saved by simplifying your methodology.
There’s a level of management overhead needed that you can’t escape. For developers, it often feels like less responsibility to rely on disposable computing, as I predicted for biological models of apoptosis. We can separate virtual workloads in the cloud, and we hope that the underlying layers of hosting will be reliable, but we just end up moving the responsibility around without telling anyone.
The Age of Rehumanization
The backlash against using expressive languages to describe configuration, which began around a decade ago, saw a return to pure data list formats like YAML and JSON where semantics were entirely implicit, and parameterized patterns were removed. People argued that all the target files for software were just data, so why not make the specification pure data too? For microservices, these were relatively small, so manageable by a two pizza team.
The great IT expansion created a shift of human values. We stopped putting the promise of holistic quality at the apex of our goals and began to underline the importance of human work satisfaction instead. Conferences in IT shifted from “how to do tech” to “how to cope and get along with other people in tech”. For the new generation of workers, raised on computer games, these were their weakest skills.
At the same time, Google began to dominate tech. If an idea didn’t come from Google, it wasn’t trusted. Everyone wanted to be like Google. SRE took over from system administration, with a focus on team interaction. Agile and DevOps took over to solve the problems caused by self-centred thinking amongst the people involved. The new messages were all about teams, not about solutions. When I reviewed and edited the SRE books for Google, it was clear that the authors were making up the technical stuff as they went along. The story was about how SRE was infiltrating and patching company processes.
By breaking up processes into independent agents, microservices follow sound Promise Theory reasoning, but stop short of explaining how the consistency of the whole can be assured. There is no language of cooperation except the socialisation of DevOps to keep that promise. So, systems are vulnerable again as they wait for the next reorganization.
Custom spray jobs are cheap
When holistic configuration languages were passed over for a new generation of remote control package management tools, progress in knowledge management and scalability was lost. Configuration is once again treated like a paint job, or a cinematic screen projection, rather than a healing operation. You can try to project an image of perfection onto a few objects, but the cost and brute force required in mass production would be prohibitive (see the video below).
Coherence is dead! Long live coherence!
Are we coming full circle with configuration? There have been a few attempts to equip versions of YAML with more powerful introspective features lately, and improved compression for JSON in protocol messages. It might seem that communicating intent is “just about uploading a pattern”, but it’s also about coherence of design at all scales, especially when it serves society at large.
There have never been more companies trying to reinvent existing tech than we see today. Many of them fail because they ignore the work that was done in the past and try to enforce models that cannot be enforced. We don’t seem to be able to learn from the past. Stubborn pride?
The main point of departure for Promise Theory was that just insisting on something, or trying to spray paint it, isn’t good enough. You can ask a dog to become a cat, but the amount of reconfiguration needed would be unreasonable, if even possible at all. Intentions matter to us, but “dynamics trump semantics” every time. If the dynamics can’t deliver it, it can’t be done.
Our narrative in computing is to keep everything simple, rather than accurate or sustainable. Einstein famously said: everything should be made as simple as possible, but no simpler! This is where we go wrong in IT. We oversimplify out of laziness and the race to get to more exciting issues. Thus we hope to impose simplicity as an obligation, rather than detail and cope with complexity.
- IT languages fail to describe intentions well.
- IT languages fail to describe change well. They are designed for imposition, not repair.
Thinking back to the separation of problems in physics into equations and boundary values, we need to understand the corresponding separation of scales in IT–when we can and when we can’t separate scales. Thought experiments and isolated systems are designed for conceptual simplicity. They belong in textbooks, but the real world is seldom so cleanly defined. Real world concerns are a separate “applied science”. The goal is not to oversimplify, but rather to seek stability or robustly predictable behaviour.
For some, the answer to everything is AI. For the time being, AI is a non-issue. AI can’t tell us what we want. Or rather, once we begin to allow it to, we’ve already lost ourselves.
As I pointed out in my 1998 position piece Computer Immunology, when a biological system fails to perform its function because changes get out of balance, the entity voluntarily dies and is recycled (a process called apoptosis). This is the model we have begun to adopt for cloud computing. It makes sense to kill off problems rather than fix them when there’s sufficient redundancy to make the disappearance of a part an insignificant event. If making a new one is cheaper than repairing the old, we use garbage collection to eliminate faults. This too is an overhead, at both design and execution time.
After 30 years of research, the answers to questions of configuration, language, data consistency (for files and databases), compliance, reliability of promises, and behavioural specification are all quite well known. If we want to make permanent progress, it’s up to the present generation to swallow their pride and embrace knowledge.
It’s the only way to build trusted systems.
Some Links and references
These are some links to projects that I've worked on personally, so my convictions colour my choices. Buyer beware :)
- CFEngine 3 configuration agent and commercial website
- The Omniledger data consistency paper and website
- The implications of EU DORA compliance