Continuous Chaos but in a Good Way
We always enjoy a good talk from Amazon Web Services’ resident cloud guru Adrian Cockcroft, and his keynote at MayaData’s Chaos Carnival this week did not disappoint.
He offered a useful way of thinking about system resilience. When a system goes down, we tend to look for a single root culprit: a failed router, a faulty configuration script. But that is the wrong way of looking at the problem, he advised.
Think of a rope that breaks. “The last strand breaking isn’t the cause of the failure. The real cause is that the rope got too frayed,” he said. When building out systems, he advised, we should instead understand our margin for failure — how much extra capacity a system has. We should understand how frayed we can let our rope get.
To this end, Cockcroft recommended Sidney Dekker’s book “Drift into Failure.” Dekker asserts that “even if everyone does everything correctly, at every step along the way, you can still get a catastrophic failure, because people are optimizing locally, rather than optimizing for the big picture outcome,” Cockcroft said. “If you never have a failure, you start believing it can’t happen.”
Before Cockcroft worked at AWS, he was an architect at Netflix, where the concept of chaos testing was pioneered. In chaos testing, resources are randomly taken offline to test the system’s resilience. It seemed like a radical idea a few years back, but over time companies like Gremlin productized the tools for knocking resources offline and recording the response, and chaos testing has increasingly become standard practice for site reliability engineers (SREs). Now Cockcroft wants to take your systems to the next level of resilience, with continuous chaos testing.
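The core idea is simple enough to sketch in a few lines. The toy simulation below is not Gremlin’s API or Netflix’s tooling — the `cluster` model, `inject_failure`, and `system_healthy` are invented names for illustration. It randomly kills replicas until some service has none left, which is exactly the question Cockcroft raises: how many strands can the rope lose before it snaps?

```python
import random

# Hypothetical in-memory model of a cluster: service name -> healthy replica count.
# In a real chaos experiment these would be actual instances or containers.
cluster = {"api": 3, "checkout": 3, "search": 3}

def inject_failure(cluster, rng):
    """Randomly take one replica of one service offline (the 'chaos' step)."""
    service = rng.choice(sorted(cluster))
    cluster[service] -= 1
    return service

def system_healthy(cluster, min_replicas=1):
    """The system survives while every service keeps at least one replica."""
    return all(count >= min_replicas for count in cluster.values())

rng = random.Random()  # in practice, failures arrive unseeded
rounds = 0
while system_healthy(cluster):
    victim = inject_failure(cluster, rng)
    rounds += 1
    print(f"round {rounds}: killed a replica of {victim} -> {cluster}")

print(f"system degraded after {rounds} failures")
```

Running this repeatedly gives a feel for the margin: with three services at three replicas each, the system can absorb anywhere from two to six failures before the next one breaks it — the distribution of that number is the “fraying” Cockcroft wants teams to measure continuously rather than once a year.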
“What this is going to do is it's going to harden the patterns,” he said. “We're taking what has been traditionally a pretty scary annual experience, that's a big pain in the neck to do, to something that's automated, continuous.”