Just a quick list of thoughts I have about system stability. Alas, I have little data supporting these conjectures; all I have is distilled experience.

Systems have inertia.

A stable system tends to remain stable until destabilized by a change. Once destabilized, a system may be more unstable for a period of time.

Some systems have more inertia than others

Simple systems are more robust to change than complex ones. In a sufficiently complex production environment (one involving a plethora of legacy systems), you might stumble upon some old process happily chugging along on a legacy server.

Such systems survive because of a combination of:

  1. The job is important (to someone) but not well known.
  2. The system depends on a small number of other systems.
  3. The system is at equilibrium with its environment and nothing is perturbing it.

Some systems have less inertia than others

Such systems tend to be:

  1. Complex.¹
  2. Not business critical.

Perhaps unsurprisingly, such systems fail all the time. When they fail, they likely remain in a failed state until somebody notices.

The more frequently a system fails, the less severe the failure².

The time since last failure is correlated with the duration of the next outage

When a system fails frequently, its failure is well documented, easy to recognize, and its remedies are well known. When a system fails infrequently, in contrast, the failure may be difficult to recognize (or silent), and the remedies, even if easily identified, may take time to implement (the best time to write the cleanup job is ten years ago; the second-best time is exactly after it’s needed).

Frequent, repeatable failures are bad (and expensive), leading organizations to expend considerable effort to exorcise them. This leads to…

Selection against the routine leaves just the absurd

Once you’ve exorcised most of the common demons in your organization, all you’re left with are, well, the rare ones. Rare failures may look something like complex multi-system cascades, but they might also look like events that take increasingly long timescales to manifest.

Multi-system cascades seem rarer in poorly designed systems. It might be that throwing resources at poorly understood problems is a common solution for poorly designed systems, and that slack absorbs failures before they can propagate. Once a system is well characterized and frequent problems eliminated, it becomes possible to reduce operational costs by removing that slack. This enables multi-system cascades.

Similarly, long timescale failures seem rarer in poorly designed systems, but seem more common than multi-system cascades. Common solutions to poorly understood problems rhyme with “restart the system”, and restarting the system often pushes off long timescale failures.

Certain categories of long timescale failures that occur in both poorly and better-designed systems often involve “code expiration.” Commonly, this might look like a certificate expiring. Uncommonly, this might look like the equivalent of “somebody writing code that only holds until some future date.”
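As a hypothetical sketch of that last shape, here is what “code that only holds until some future date” can look like: a sanity check whose hardcoded horizon was generous when written, but which starts rejecting valid input once the calendar catches up (the constant and function below are invented for illustration).

```python
from datetime import date

# Hypothetical example of a long timescale failure waiting to manifest:
# the "far future" cutoff was a reasonable assumption at the time of writing.
LATEST_PLAUSIBLE_YEAR = 2030

def validate_invoice_date(invoice_date: date) -> None:
    # Works fine for years, then starts rejecting perfectly valid invoices
    # once real dates pass the hardcoded horizon.
    if invoice_date.year > LATEST_PLAUSIBLE_YEAR:
        raise ValueError(f"implausible invoice date: {invoice_date}")
```

An expiring certificate is the same shape: a validity window chosen at issue time, with the failure scheduled for whenever it lapses.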

Systems have velocities

It’s interesting that some systems provide clear feedback about their stability (if the system reports no errors within 30 minutes of a release, the release was likely a success), while others provide much more ambiguous feedback. The rate of feedback might be usefully called a system’s velocity.
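As a minimal sketch of that 30-minute heuristic (the window, the error tolerance, and the assumption that you can pull per-minute error counts from your monitoring are all placeholders):

```python
# Sketch of the "no errors within 30 minutes of a release" heuristic.
# errors_per_minute holds per-minute error counts observed since the release.
def release_looks_healthy(errors_per_minute: list[int],
                          window_minutes: int = 30,
                          tolerated_errors: int = 0) -> bool:
    if len(errors_per_minute) < window_minutes:
        return False  # not enough feedback yet; keep waiting
    return sum(errors_per_minute[:window_minutes]) <= tolerated_errors
```

A low-velocity system is one where no such short window gives you a trustworthy answer.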

A different way to think about a system’s velocity is the speed at which it returns to equilibrium (alternatively, the amount of time it takes for a system to return to equilibrium). As a system visits more and more varied inputs (possible states), it becomes increasingly clear that the system is behaving as expected.

Worker services have high velocities

Many worker services have high velocities. They typically perform one task, load is relatively uniform, and load can be managed via a queueing mechanism.

Worker services also have relatively easy-to-define “health” metrics for detecting failure. Just as one can scale workers based on the number of messages in a queue (or some other metric), one can identify failures in worker services from an increase in the number of messages in a dead-letter queue, an increase in the number of retries, or abnormally long processing times.
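A minimal sketch of what that might look like, assuming you can read queue depth, dead-letter depth, retries, and processing times from your queueing system (the field names and thresholds below are invented):

```python
from dataclasses import dataclass

@dataclass
class WorkerMetrics:
    queue_depth: int               # messages waiting to be processed
    dead_letter_depth: int         # messages that failed permanently
    retries_last_hour: int         # messages that failed transiently
    p95_processing_seconds: float  # how long processing is taking

def is_unhealthy(m: WorkerMetrics) -> bool:
    # Placeholder thresholds; tune to the workload.
    return (m.dead_letter_depth > 0
            or m.retries_last_hour > 100
            or m.p95_processing_seconds > 60)

def desired_worker_count(m: WorkerMetrics, throughput_per_worker: int = 50) -> int:
    # Scale on backlog: more queued messages, more workers (ceiling division).
    return max(1, -(-m.queue_depth // throughput_per_worker))
```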

APIs have moderate velocities

APIs have moderate velocities due to variability in load across time and variation in usage patterns. APIs often expose many different routes; these routes may vary over orders of magnitude in response times, and within a route, response times might vary within an order of magnitude.

APIs, just like worker services, experience variable load, but the higher variance in response times tends to increase instability. This also makes scaling APIs based on metrics more difficult.
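A toy illustration (routes and durations invented): the aggregate latency mostly reflects the traffic mix, while the per-route picture can look entirely different.

```python
import statistics
from collections import defaultdict

# Invented request samples: (route, response time in seconds).
samples = [
    ("/health", 0.002), ("/health", 0.002), ("/health", 0.003),
    ("/search", 0.400), ("/search", 0.900),
    ("/report", 8.000),
]

by_route: dict[str, list[float]] = defaultdict(list)
for route, seconds in samples:
    by_route[route].append(seconds)

# The overall median is dominated by whichever routes happened to get traffic,
# not by whether any individual route regressed.
print(f"overall median: {statistics.median(s for _, s in samples):.3f}s")
for route, durations in sorted(by_route.items()):
    print(f"{route}: median {statistics.median(durations):.3f}s")
```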

Scheduled processes have low velocities

Scheduled processes converge to equilibrium very slowly. Attempts to assess a scheduled process “off hours” provide a weak feedback signal.

Interconnected services have low velocities

The more interconnected the service, the more possibilities for interactions with other services. As a consequence, to be relatively confident a system is stable, one needs to measure its health over a much longer period of time.


  1. What makes a system complex? In short, a complex system has multiple dependencies. ↩︎

  2. If it wasn’t obvious, this is certainly not universally true. We can imagine a system which is just broken. ↩︎