I’ve been watching the outstanding HBO series Chernobyl, which details the worst nuclear reactor meltdown in human history: an event that released roughly 400 times more radioactive material than the atomic bomb dropped on Hiroshima.

What occurred to me, and what I discovered over the course of the series, was that this was a catastrophic failure destined to happen: a slow drift into failure from the very beginning, only accelerated by the people executing the experiment their rules-based culture required of them. But why? Is failure just a matter of when?

Chernobyl Disaster

The Mystery Is How Anything Ever Works At All

In the pursuit of success in our dynamic, ever-changing and complex business environment, with limited resources and many conflicting desired outcomes, a succession of tiny decisions can eventually produce breakdowns on a tremendous scale: a domino effect of latent failures.

From news feeds to newspapers, daily debacles highlight how the systems we design with positive intent can create unintended consequences and negative effects for the very society they are meant to support.

From Facebook hacks to the Boeing 737 MAX accidents, from algorithmic autonomous-bot arguments to legacy top-down management structures, we struggle to cope with the information flows and decision-making our context demands, to the point that it’s amazing anything ever works as intended at all.

When problems occur, we hunt for a single root cause: that one broken piece or person to hold accountable. Our analyses of complex system breakdowns remain linear, componential and reductive. In short, they are inhumane.

The growth of complexity in society has outpaced our understanding of how complex systems succeed and fail. Or, as the human factors and safety author Sidney Dekker put it, “Our technologies have gotten ahead of our theories.”

Modeling The Drift Into Failure

Another pioneering safety researcher, Jens Rasmussen, identified this failure-mode phenomenon, which he called “drift to danger”: the “systemic migration of organizational behavior toward accident under the influence of pressure toward cost-effectiveness in an aggressive, competing environment.”

Rasmussen illustrated the competing priorities and constraints that affect sociotechnical systems, as shown above.

Any major initiative is subjected to multiple pressures. Our responsibility is to operate within the space of possibilities formed by economic, workload and safety constraints, and to navigate towards the desired outcomes we hope to achieve at a given time.

Yet our capitalist landscape encourages decision-makers to favor short-term incentives, financial success and survival over long-term criteria such as safety, security and scalability. Workers must become ever more productive to stay ahead: “cheaper, faster, better”. Customer expectations rise with each compounding cycle of innovation. These pressures push teams towards the limits of acceptable (safe) performance, and accidents occur when the system’s activity crosses the boundary into unacceptable safety conditions.

Rasmussen’s model helps us map and navigate complexity towards the properties we wish to optimize for. If we want to optimize for safety, for example, we need to understand where the safety boundary in his model sits for our work. Optimizing for safety is the primary, explicit outcome of Chaos Engineering.
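To make that boundary concrete, here is a minimal sketch (in Python, with hypothetical metric names and thresholds, not drawn from any particular tool) of a chaos experiment that declares its steady-state hypothesis and abort criteria before any fault is injected:

```python
# A minimal, hypothetical sketch of a chaos experiment that declares its
# safety boundary (steady-state hypothesis and abort criteria) up front.
from dataclasses import dataclass


@dataclass
class SafetyBoundary:
    max_error_rate: float      # abort if the error rate exceeds this fraction
    max_p99_latency_ms: float  # abort if p99 latency exceeds this many milliseconds


def within_boundary(metrics: dict, boundary: SafetyBoundary) -> bool:
    """True while the system stays inside the declared safe envelope."""
    return (metrics["error_rate"] <= boundary.max_error_rate
            and metrics["p99_latency_ms"] <= boundary.max_p99_latency_ms)


def run_experiment(inject_fault, rollback, observe_metrics, boundary):
    """Inject a fault only while the steady-state hypothesis holds; abort otherwise."""
    if not within_boundary(observe_metrics(), boundary):
        return "aborted: system was not in steady state before injection"
    inject_fault()
    try:
        if not within_boundary(observe_metrics(), boundary):
            return "aborted: safety boundary crossed during the fault"
        return "completed: hypothesis held under the injected fault"
    finally:
        rollback()  # always undo the fault, whatever the outcome
```

The specific thresholds matter less than the fact that the abort criteria exist, are agreed in advance, and are checked automatically, which is precisely what the Chernobyl test lacked.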

Spoiler Alerts (Or The Lack Thereof)

The crew at Chernobyl were performing a low-power test to understand whether residual turbine spin could generate enough electric power to keep the cooling system running while the reactor shut down. The planning standard for such an experiment should have been detailed and conservative. It was not. It was a poorly designed, “see what happens” experiment, with no criteria established ahead of time for when to abort. The test had also failed multiple times previously.

Design engineering assistance was not requested, so the crew proceeded without safety precautions and without properly coordinating or communicating the procedure with safety personnel. Chernobyl was also an award-winning, top-performing reactor site in the Soviet Union.

The experiment went out of control.

In order to keep the reactor from shutting down completely, the crew shut down several safety systems. Then, when the remaining alarm sounded, they ignored it for 20 seconds.

While the behavior of the team was questionable, there was also a deeper latent failure lying unbeknown in wait. The reactor at Chernobyl had a unique engineering flaw that caused it to overheat during the test, one which had been obscured from the scientists by the government’s policy of keeping state secrets. The graphite components selected for the reactor design were believed to offer similar safety at a cheaper cost. They did not.

Under standard operating conditions, reactor No. 4’s maximum power output was 3,200 MWt (megawatts thermal). During the power surge that followed, the reactor’s output spiked to over 320,000 MWt, roughly 100 times its rated output. This caused the reactor housing to rupture, resulting in a massive steam explosion and fire that demolished the reactor building and released large amounts of radiation into the atmosphere.

The first official explanation of the Chernobyl accident was published quickly, in August 1986, three months after the accident. It effectively placed the blame on the power plant operators, noting that the catastrophe was caused by gross violations of operating rules and regulations, and attributing the operator error to a lack of knowledge of nuclear reactor physics and engineering, as well as a lack of experience and training. The hunt for the single root cause and the individuals in error was complete: case closed.

It wasn’t until later, with the International Atomic Energy Agency’s revised analysis in 1993, that the reactor’s design itself was called into question.

One reason there are such contradictory viewpoints and debate about the causes of the Chernobyl accident is that the primary data covering the disaster, as registered by the instruments and sensors, were never completely published in official sources.

Much of the low-level information about how the plant was designed was also kept from operators due to secrecy and censorship by the Soviet government.

The four Chernobyl reactors were pressurized water reactors of the Soviet RBMK design, very different from standard commercial designs, employing a unique combination of a graphite moderator and water coolant. This combination makes the RBMK design very unstable at low power levels and prone to suddenly increasing energy production to a dangerous level. This behavior is counter-intuitive, and it was unknown to the operating crew.

Additionally, the Chernobyl plant did not have the fortified containment structure common to most nuclear power plants elsewhere in the world. Without this protection, radioactive material escaped into the environment.

Contributing Factors To Consider

KPIs drive behavior

  • The crew at Chernobyl were required to complete the test in order to conform to standard operating rules and ‘be safe’
  • They had a narrow focus on what mattered in terms of safety, i.e. completing the test versus operating the plant safely
  • They did not define boundaries, success and failure criteria for the experiment in advance of performing it
  • They didn’t have all the information to set themselves up for success
  • They disregarded other indicators flagged by the system as anomalies and pushed ahead to complete the experiment within the timeline they were assigned

Flow of information

  • The quality of your decisions depends on the quality of your information, supported by a good decision-making process. The operators were following a bad process with missing information
  • Often we find that people setting policy are not the people doing the actual work, and this causes breakdowns between work-as-expected and work-as-done, such as the graphite design of the reactor
  • Often policy violations are chosen by workers because they are in a double-bind position, and so they choose to optimize for one value (like timeliness, or efficiency) at the expense of another (like verification)

Values guide behavior 

  • What the company tells you are the behaviors that lead to success and should be followed
  • What behaviors do you believe lead to success?
  • What would you do when following the rules goes against your values?

Limited resources put pressure on behavior

  • KPIs and behavioral targets can be set in such a way that pressure puts behavior under stress and creates unsafe systems
  • The team that was fully prepared to run the test ultimately had to be replaced by a night-shift crew when the test was delayed, and that night shift had far less preparation
  • The chief engineer was thus put under further pressure to complete the test (and avoid further delay), ultimately clouding his judgment around the inherent risk of using an alternate crew

How To Drift Into Value (Over Failure)

Performance variability may introduce drift into your situation. However, we can drift towards success rather than failure by creating experiences and social structures that let people safely learn how to handle uncertainty and navigate towards the desired outcomes we hope to achieve at a given time.

Here’s a set of principles and practices to consider:

  • Try to encourage sharing of high quality information as frequently and liberally as possible
  • Look into the layers of your organization—where are decisions made?
  • How can you move authority to where the information is richest, the context most current, and the employees closest to customers or the situation at hand?
  • Proactively review ‘Things That Went Right’ (e.g. positive investigations such as the Thai soccer team diving rescue) and examine the ‘near misses’ instead of waiting for situations that ‘went wrong’
  • Be aware of conflicting KPIs, compare them with the organization’s values and the behaviors they might drive
  • Have explicit, communicated boundaries for economic, workload and safety constraints. In Rasmussen’s model these boundaries exist implicitly whether you acknowledge them or not, so make sure the people doing the work understand where all three actually sit in your context
  • Explore how values are documented, sense-checked and evolved, just as your organization’s values shift over time
  • Proactively seek out which processes are being used, sidetracked or shortcut. Could you enable the shortcut processes that work safely? Could you turn the expedited process into the process?
  • Rule-guided cultures, unlike rule-bound ones, give people flexibility. Can you do it better? Why not share it?
  • Have fast feedback mechanisms in place to tell you when you’re hitting your pre-defined risk and experiment boundaries (see the sketch after this list)
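One hypothetical way to wire up that last feedback loop is sketched below: a check that compares live metrics against explicitly declared economic, workload and safety boundaries (echoing Rasmussen’s model) and warns as soon as the operating point approaches or crosses one of them. The metric names and limits are invented for illustration; substitute whatever your teams actually measure.

```python
# A hypothetical sketch of a fast feedback check against the three boundaries
# in Rasmussen's model. Metric names and limits are invented for illustration.

BOUNDARIES = {
    "economic": {"metric": "cost_per_transaction_usd",   "limit": 0.05, "direction": "max"},
    "workload": {"metric": "team_utilization_pct",       "limit": 85.0, "direction": "max"},
    "safety":   {"metric": "error_budget_remaining_pct", "limit": 20.0, "direction": "min"},
}

WARNING_MARGIN = 0.10  # warn when within 10% of a boundary, not only once it is crossed


def check_boundaries(current_metrics: dict) -> list:
    """Return human-readable warnings for any boundary we are close to or past."""
    warnings = []
    for name, b in BOUNDARIES.items():
        value, limit = current_metrics[b["metric"]], b["limit"]
        if b["direction"] == "max":
            crossed = value > limit
            close = value > limit * (1 - WARNING_MARGIN)
        else:
            crossed = value < limit
            close = value < limit * (1 + WARNING_MARGIN)
        if crossed:
            warnings.append(f"{name} boundary crossed: {b['metric']}={value} (limit {limit})")
        elif close:
            warnings.append(f"approaching {name} boundary: {b['metric']}={value} (limit {limit})")
    return warnings


if __name__ == "__main__":
    # Example reading: close to the economic boundary, past the workload boundary.
    print(check_boundaries({
        "cost_per_transaction_usd": 0.047,
        "team_utilization_pct": 91.0,
        "error_budget_remaining_pct": 35.0,
    }))
```

Warning on approach, not only on breach, is the design choice that buys you time to steer back before a boundary is crossed.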

Conclusion

Complex systems have emergent properties, which means explaining accidents by working backwards from the particular part of the system which has failed will never provide a full explanation of what went wrong.

There will rarely be a single source of failure, but rather many tiny acts and decisions along the way that eventually unearth the latent failures in your system. This is why it’s important to remember that our work doesn’t only progress through time; it progresses through new information, understanding and knowledge.

The question is: what systems do you have in place to safely control your problem domain (not your people)?

Chernobyl, while the world’s worst nuclear accident, did lead to major changes in safety culture and in industry cooperation, particularly between East and West before the end of the Soviet Union. Former President Gorbachev said that the Chernobyl accident was a more important factor in the fall of the Soviet Union than Perestroika, his program of liberal reform.

What mental models, theories and methods are you using that are not driving the outcomes you’re seeking, and which of them must be unlearned?

References

Special thanks to Mark Da Silva, Qiu Yi Khut, Erinn Collier and Casey Rosenthal for their thoughtful reviews

This post was also inspired by The Lund University Learning Lab on Resilience Engineering I attended at Slack, led by David Woods, John Allspaw, Laura Maguire, Nora Jones and Richard I. Cook.

Resilience Engineering

Ideas worth exploring (curated by Lorin Hochstein):

INSAG-7 report

Chernobyl disaster (Wikipedia)