Saturday, March 28, 2020

Day 8: Corona, Brittleness and Efficiency

This article is part of The 100 Days Offensive.

Recent years have been focused on making the economy more efficient. The corona crisis now shows us the downside of that efficiency: our economic systems have in turn become more brittle.

Wikipedia says: "A material is brittle if, when subjected to stress, it breaks with little elastic deformation and without significant plastic deformation." This definition also points to a property on the opposite side: elasticity. And since we all love elasticity in our systems, how can those systems be brittle? Let me explain.

Elasticity is not just a checkbox we click "on" when we design a system. It comes at a cost. For example:
  • Services are elastic when they have unused resources that can be used to satisfy a higher demand.
  • Organisations are elastic when people have time they can invest in managing a crisis.
  • Projects are elastic when there are buffers (in terms of time & material) to deal with unexpected problems.
Elasticity always implies not using 100% of the available resources. That is not perfectly efficient. So in order to increase our efficiency, we eliminate non-allocated time of people, CPU cycles or delivery windows. But in turn our systems become more brittle.

Elasticity also means a system or process bends under stress: its functionality becomes limited or restricted. For example, it is slower to react or may show unfamiliar reactions. This is not a desired property either.

So how elastic do we design something? We want it to handle the normal stress without breaking. But what about unusual stress? How much stress is it still economical to plan for?

We found very interesting ways to approach this question:
  • We outsource it: We have a provider who shall provide us with the elasticity on demand to cope with unusually high stress. "Our contracts will protect us!"
  • We accept the risk: We assign a cost to the (unlikely) case, multiply it by the probability and create reserves to deal with the fallout. "Our bank account will save us!"
  • We create plans for a breaking system: We plan for the case of disaster and for how to continue or restart the system after a major breakage. "We know how to get out of the jam!"
  • And, one of the most common ones, we ignore it: As we have no way of dealing with the risk without breaking our business case, we prefer to ignore the risk. "It is extremely unlikely and therefore will not happen!"
Usually not a single approach is taken, but a combination of several. For example: we have a cloud provider who promises to increase our resources in case we need them, we assign a small probability to the scenario of the cloud provider failing and build reserves for that case, we have an emergency plan to move our business to another provider quickly, and we ignore some scenarios.
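The "accept the risk" approach boils down to simple arithmetic: cost times probability. Here is a minimal sketch of that calculation. The `Risk` class, the scenario names and all figures are made up for illustration; they are not taken from any real business.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    name: str
    cost: float         # estimated damage if the scenario occurs (hypothetical figure)
    probability: float  # estimated chance per year, between 0 and 1 (hypothetical)


def expected_loss(risk: Risk) -> float:
    """Cost multiplied by probability: the reserve assigned to one risk."""
    return risk.cost * risk.probability


def total_reserve(risks: list) -> float:
    """Sum of the expected losses of all accepted risks."""
    return sum(expected_loss(r) for r in risks)


# Illustrative scenarios only.
risks = [
    Risk("cloud provider outage", cost=200_000, probability=0.02),
    Risk("one month production stop", cost=1_000_000, probability=0.001),
]
reserve = total_reserve(risks)
print(f"reserve: {reserve:.0f}")
print(f"smallest single loss: {min(r.cost for r in risks):.0f}")
```

Note that a reserve built this way is much smaller than the damage of any single scenario. It works on average over many years, not in the one year in which the unlikely case actually happens.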

This is not just true for the small web shop next door that sells handcrafted items, but also for whole national economies. In the competition between nations for wealth, our economies have become more and more efficient by using the same means as described above. We just increased the scale by several orders of magnitude while maintaining the same coping mechanisms.

Those mechanisms are not as stupid as they sound here. They have been working quite well for some time and have gotten us through quite a few earthquakes, hurricanes and a lot of other disasters. So we used them to become even more efficient. And we have become huge fans of this efficiency. It is nothing short of a miracle that with the click of a mouse we can get nearly every item, even from China, for a dime within a week. After two years that feels like a universal law.

But the permanent drive for efficiency has also produced additional side effects:
  • Cluster risks: Similar systems can operate more efficiently if they are placed physically close to each other.
  • Cascades: In case of a failure in one system the load is redirected to a similar system.
  • Inter-dependencies: Every system depends on several others to work. This allows us to specialize (and therefore increase efficiency) but creates the danger of (unseen) circular dependencies.
Examples are:
  • Pharmaceutical companies have similar demands on raw materials and trained people. Though they are competing with each other, they tend to install themselves in close physical proximity.
  • Cloud services of customers are often intended to be cloud agnostic (with limited success). If a cloud provider fails, they shall be quickly deployed to the next one.
  • Mines require complex machinery which is in turn fabricated by using the same ores.
Overall, all systems have become more and more efficient at handling the usual stress.
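Circular dependencies like the mine example stay unseen until somebody writes the dependencies down and searches them. The following is a rough sketch using a plain depth-first search. The dependency map, including the intermediate "steel" step, is invented for illustration and is not a model of any real supply chain.

```python
from typing import Dict, List, Optional


def find_cycle(deps: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of names, or None if there is none."""
    UNVISITED, IN_PROGRESS, DONE = 0, 1, 2
    state = {node: UNVISITED for node in deps}
    path: List[str] = []  # the chain of systems currently being followed

    def visit(node: str) -> Optional[List[str]]:
        state[node] = IN_PROGRESS
        path.append(node)
        for dep in deps.get(node, []):
            dep_state = state.get(dep, UNVISITED)
            if dep_state == IN_PROGRESS:
                # We came back to a system that is already on the current chain.
                return path[path.index(dep):] + [dep]
            if dep_state == UNVISITED:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        state[node] = DONE
        return None

    for node in deps:
        if state[node] == UNVISITED:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


# Hypothetical map: "X": [things X needs in order to work]
deps = {
    "mine": ["machinery"],
    "machinery": ["steel"],
    "steel": ["ore"],
    "ore": ["mine"],
    "web shop": ["cloud provider"],
    "cloud provider": [],
}
cycle = find_cycle(deps)
print(" -> ".join(cycle) if cycle else "no cycle")
```

A cycle found this way is exactly the kind of loop that makes a restart painful: every system on it waits for another one on it.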

The corona epidemic is far from being just "usual stress". We discover that our usual mitigation strategies do not work on this scale:
  • Contracts cannot save you from a government ordered shutdown of your provider.
  • The financial reserves may be calculated for one week or even a month of production outage, but a longer duration may cause the company to close.
  • Disaster recovery plans are designed for one's own company experiencing a crisis alone, not in concert with several others that compete for the same resources required for a recovery.
  • Business scenarios considered unlikely or even impossible are occurring quite frequently right now.
  • All competing companies producing some specific goods are located close to one another and are affected by the same regional outbreak/shutdown.
  • Failing companies shed their customer base, which turns to the competitors, overloads their infrastructure and causes a chain of failures.
One property of brittleness is that the work-piece or system at the point of too much stress does not bend but breaks. This is called a "catastrophic failure", meaning the affected piece (or in our case: system) has to be replaced or rebuilt before it can be used again.
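The chain of failures in the last point can be illustrated with a toy simulation. This is only a sketch under strong assumptions: the customers of a failed provider are split evenly among the survivors, and a provider fails as soon as its load exceeds its capacity. Provider names and numbers are made up.

```python
def simulate_cascade(capacity, load, first_failure):
    """Return the set of providers that have failed once the cascade stops."""
    load = dict(load)  # work on a copy, keep the caller's data intact
    failed = set()
    to_fail = [first_failure]
    while to_fail:
        name = to_fail.pop()
        if name in failed:
            continue
        failed.add(name)
        survivors = [p for p in capacity if p not in failed]
        if not survivors:
            break
        # Assumption: the load of the failed provider is split evenly.
        share = load[name] / len(survivors)
        load[name] = 0.0
        for p in survivors:
            load[p] += share
            if load[p] > capacity[p]:
                to_fail.append(p)  # overloaded, so it breaks next
    return failed


capacity = {"A": 100, "B": 100, "C": 100}

# "Efficient" setup: everybody runs at 90% of capacity.
efficient = simulate_cascade(capacity, {"A": 90, "B": 90, "C": 90}, "A")

# "Elastic" setup: everybody keeps 40% in reserve.
elastic = simulate_cascade(capacity, {"A": 60, "B": 60, "C": 60}, "A")

print("efficient:", sorted(efficient))
print("elastic:", sorted(elastic))
```

In the efficient setup the failure of A takes B and C down with it, while in the elastic setup the two survivors absorb the extra load. The unused 40% is the price of that.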

Lacking sufficient storage (another inefficiency we got rid of via "just in time delivery"), this requires other systems to work and provide the resources for a restart. The circular dependencies already mentioned may cause overlay systems to fail and make restarts/rebuilds really painful and slow.

And this is why the current crisis worries me. That is also the reason why I am very much in favor of the government intervening and trying to prevent additional failures for financial reasons.
