Casey Rosenthal is the co-founder and CEO of Verica and the former Engineering Manager of the Chaos Engineering team at Netflix. He is an author and thought leader in chaos engineering, “a discipline of experimenting with software systems in production in order to build confidence in the system’s capability to withstand unexpected and turbulent conditions.” Casey was an early engineer and champion of chaos engineering, bringing together people from companies like Netflix, Google, Facebook, and Amazon to explore the field. In this week’s Unlearn Podcast, Casey and Barry O’Reilly talk about the chaos engineering domain and how to apply its principles to build high-performance teams and businesses.

Origins of Chaos Engineering

Netflix’s migration to the cloud, and in particular the sudden outages and service disruptions that came with it, spurred the creation of a program they called Chaos Monkey. “So Chaos Monkey would for each service inside Netflix, every day it would randomly choose an instance and turn it off,” Casey tells Barry. The underlying principle was that once engineers knew a problem existed, they would fix it. “It changed their behavior by aligning the organization around the business problem that needed to be solved,” he remarks. He describes how the early Chaos Community Days brought colleagues together from leading tech companies to build the discipline they would call chaos engineering. [Listen from 4:00]
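To make the idea concrete, here is a minimal sketch in Python of the behavior Casey describes: once a day, for each service, pick one running instance at random and turn it off. It is an illustration only, not Netflix’s actual implementation. The function names (pick_victims, run_daily_round), the shape of the services mapping, and the terminate callback are all assumptions made for this example.

```python
import random


def pick_victims(services, rng=None):
    """For each service, choose one running instance at random.

    `services` is a hypothetical mapping of service name -> list of
    instance IDs. Services with no running instances are skipped.
    """
    rng = rng or random.Random()
    victims = {}
    for service, instances in services.items():
        if instances:
            victims[service] = rng.choice(instances)
    return victims


def run_daily_round(services, terminate, rng=None):
    """Run one day's round: terminate one random instance per service.

    `terminate` is a caller-supplied callback (an assumption here);
    in a real system it would call the infrastructure provider.
    """
    victims = pick_victims(services, rng)
    for service, instance in victims.items():
        terminate(service, instance)
    return victims


if __name__ == "__main__":
    demo = {"api": ["api-1", "api-2", "api-3"], "web": ["web-1", "web-2"]}
    run_daily_round(demo, lambda s, i: print(f"terminating {i} in {s}"))
```

Passing the terminate step in as a callback keeps the selection logic easy to try out safely before pointing it at real infrastructure.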


Navigating Complexity

Chaos engineering assumes that you already have complexity in your system. “This is engineering to navigate it, or to surface it so that you’re aware of it,” Casey explains. Once you’re aware a problem exists, you can take steps to fix it. It’s a proactive approach to improving availability and security, which improves your system overall. He shares an example of how UnitedHealth Group was able to discover a system vulnerability they didn’t know they had, and allocate appropriate resources to strengthen their position. Barry comments, “The thing that’s very contrary with this is that it’s not about people trying to predict the future, it’s about them having the data to understand how the systems are performing and then taking action based on that.” [Listen from 11:05]

Experimentation is a foundational principle of chaos engineering: it can force you to abandon your hypothesis and go in a different direction, thereby generating new knowledge. “Tests don’t generate new knowledge, experiments do,” Casey points out. Barry remarks that you have to be humble enough to realize that complex systems will not always operate in the way you anticipate. Intentionally injecting failure into the system to discover those unanticipated weaknesses gives useful feedback. The key is not to remove complexity but to navigate it, Casey agrees: “That’s where the gems are for learning better ways of going forward… Complexity and success track together.” An engineer’s job is to add complexity, he posits. He and Barry discuss the economic pillars of complexity, and the value of improving reversibility in systems architecture. [Listen from 17:00]

Experimenting With Human Systems

Human systems would benefit from the same experimental approach, Barry says, “yet in the human world so few people apply that rigor to that sort of human change.” It starts with identifying your end goal, Casey agrees. He tells Barry that his own hypothesis about high-performance teams is that they “have been able to extricate themselves from the bureaucracy that most of the software engineering industry still wallows in.” Bureaucracy sees the manager as the expert and the one who tells the team what tasks they need to accomplish. However, he points out, “what you’re actually doing is removing the tools from the people who need them most to improvise in the situations that are most critical.” Barry comments on companies “who are intentionally designing themselves to allow the conditions of success.” These companies understand the value of small cross-functional teams: “the knowledge can live and reside for a complex system within a cross-functional group of people who can make changes and understand the impact of those changes,” he points out. [Listen from 27:55]

Relearning Leadership

“Unlearning management is relearning leadership,” Barry says. He asks Casey to share lessons he has learned that he is bringing to his new company. Managers are creatures of habit, and that holds them back, Casey responds. “Most of us think we’re making decisions when we’re not; we’re just following habit.” He tries to formulate his own management principles and strategies in his company, instead of following traditional ideas. He believes a manager’s job is to ensure their team has the context they need to make the right decisions. His litmus test is this: if your employees can explain why what they’re working on is the most important thing they could be working on for the company right now, then you are a successful manager. [Listen from 35:35]

Removing Bureaucracy

Casey says, “Chaos engineering tools – by virtue of what they’re doing – they help take the bureaucracy out of the organizations where they’re being practiced, because it’s generating context at the sharp end and it’s giving them the explicit ability to experiment, which means they’re generating new context that then they have to trickle up… If you adopt chaos engineering then you are taking steps to removing bureaucracy from your organization.” Barry agrees that this approach moves accountability and authority to where the information is. It’s giving people the tools and the context they need to experiment safely. He asks Casey to share some unlearning moments as he championed chaos engineering. Casey’s insights include:

  • Adding redundancy often makes things fail;
  • Teams that take more risks tend to be better at handling unsafe situations;
  • Training is not a fix. It’s usually a signal that leadership doesn’t understand the real problem.

[Listen from 51:55]