Running List of Conjectures about System Stability

Published June 21, 2024

Just a quick list of thoughts I have about system stability. Alas, I have little data supporting these conjectures; all I have is distilled experience.

Systems have inertia

A stable system tends to remain stable until destabilized by a change. Once destabilized, a system may remain more unstable than usual for some time.

Some systems have more inertia than others

Simple systems are more robust to change than complex ones. In a sufficiently complex production environment (one involving a plethora of legacy systems), one might stumble upon some old process happily chugging along on a legacy server.

Such systems survive because of a combination of:

1. The job is important (to someone) but not well known.
2. The system depends on a small number of other systems.
3. The system is at equilibrium with its environment and nothing is perturbing it.

Some systems have less inertia than others

Such systems tend to be a combination of:

1. Complex.¹
2. Not business critical.

Perhaps unsurprisingly, such systems fail all the time. When they fail, they likely remain in a failed state until somebody notices.

Systems have velocities

It’s interesting that some systems provide clear feedback about their stability (if a system reports no errors within 30 minutes of a release, the release was likely a success), while others provide much more ambiguous feedback. The rate of feedback might usefully be called a system’s velocity.
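As a toy illustration, here is a minimal sketch of that 30-minute heuristic. The function and its inputs are hypothetical; a real check would query your monitoring system:

```python
import time

SOAK_SECONDS = 30 * 60  # the 30-minute "no errors" window from above

def release_looks_stable(release_time: float, error_timestamps: list[float]) -> bool:
    """Return True once the soak window has elapsed with no errors observed.

    A high-velocity system resolves this check quickly; a low-velocity
    system leaves you waiting (or guessing) much longer.
    """
    window_end = release_time + SOAK_SECONDS
    if time.time() < window_end:
        return False  # not enough feedback yet
    return not any(release_time <= t <= window_end for t in error_timestamps)
```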

A different way to think about a system’s velocity is the speed at which it returns to equilibrium (alternatively, the amount of time it takes for a system to return to equilibrium). As a system visits more and more varied inputs (possible states), it becomes increasingly clear that the system’s behavior is as expected.²

Worker services have high velocities

Many worker services have high velocities: they typically perform one task, their load is relatively uniform, and that load can be managed via a queueing mechanism.

Worker services also have relatively easy-to-define “health” metrics. Just as one can easily scale workers based on the number of messages in a queue (or some other metric), one can identify failures in a worker service by an increase in the number of messages in a dead-letter queue, an increase in the number of retries, or abnormally long processing times.
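A minimal sketch of such a health check, with hypothetical metric names and illustrative thresholds (in practice these values would come from your queue broker or metrics system):

```python
from dataclasses import dataclass

@dataclass
class WorkerStats:
    dlq_depth: int              # messages in the dead-letter queue
    retries: int                # retries observed in the sample window
    p95_processing_secs: float  # 95th-percentile processing time

def worker_is_healthy(stats: WorkerStats,
                      max_dlq: int = 0,
                      max_retries: int = 10,
                      max_p95_secs: float = 5.0) -> bool:
    """Flag failure on the three signals above: dead-letter growth,
    a spike in retries, or abnormally long processing times.
    Thresholds are illustrative and workload-specific."""
    return (stats.dlq_depth <= max_dlq
            and stats.retries <= max_retries
            and stats.p95_processing_secs <= max_p95_secs)
```

One can then page, scale, or shed load based on this single signal rather than watching each metric in isolation.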

APIs have moderate velocities

APIs, due to variability in load over time and variation in usage patterns, have moderate velocities. One reason is that APIs often expose many different routes: these routes may vary by orders of magnitude in response time, and within a single route response times might vary by an order of magnitude.

Like worker services, APIs experience variable load, but the higher variance in API response times tends to increase instability. It also makes scaling APIs based on metrics more difficult.
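To see why, here is a small sketch (with a hypothetical data shape) that computes per-route p95 latency; scaling or alerting on a single aggregate would blur together routes whose baselines differ by orders of magnitude:

```python
import statistics
from collections import defaultdict

def p95_by_route(samples: list[tuple[str, float]]) -> dict[str, float]:
    """Compute a per-route 95th-percentile response time from
    (route, seconds) samples. Routes with too few samples are skipped,
    since a p95 over a handful of points is mostly noise."""
    by_route: dict[str, list[float]] = defaultdict(list)
    for route, secs in samples:
        by_route[route].append(secs)
    return {
        route: statistics.quantiles(times, n=20)[18]  # 19 cut points; index 18 is p95
        for route, times in by_route.items()
        if len(times) >= 20
    }
```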

Scheduled processes have low velocities

Scheduled processes converge to equilibrium very slowly: a job that runs once a day yields only one feedback data point per day. Attempts to assess a scheduled process “off hours” provide a weak feedback signal.

Interconnected services have low velocities

The more interconnected a service, the more possibilities for interactions with other services. As a consequence, to be relatively confident such a system is stable, one needs to measure its health over a much longer period of time.

Footnotes

  1. What makes a system complex? In short, a complex system has multiple dependencies.↩︎

  2. If it wasn’t obvious, this is certainly not universally true. We can imagine a system which is just broken.↩︎