Chaos Engineering

Origins

Chaos Engineering as a named practice originated at Netflix around 2010. Engineers there needed a way to verify that the platform could survive instance failures in AWS, not in theory but in practice. The answer was Chaos Monkey [1], a tool that randomly terminated running instances during business hours, forcing engineers to build systems that could handle that failure mode without notice.

The practice expanded as Netflix's "Simian Army" grew: Chaos Gorilla simulated availability zone outages, Latency Monkey introduced artificial slowness, and Chaos Kong simulated full region failures. Nora Jones, Casey Rosenthal, and others at Netflix helped formalize the discipline and later documented it in Chaos Engineering: System Resiliency in Practice [2]. The field's small body of principles has since become widely adopted.

The Premise

Distributed systems fail in ways their builders did not anticipate. A team can write all the unit tests, integration tests, and acceptance tests it wants and still ship a system that falls apart the first time a real production failure cascades through it — because production has failure modes that test environments don't reproduce.

The premise of Chaos Engineering is simple: if you cannot avoid these failures, you can practice them. By deliberately injecting failures in controlled ways, the team learns what actually happens, fixes what doesn't behave well, and builds the muscle of responding to incidents. The system gets stronger because its weak points have been made visible and addressed before they cause real customer harm.

The Five Principles

The Principles of Chaos Engineering [3]:

  1. Build a hypothesis around steady-state behavior: define what "the system is working" looks like in measurable terms.
  2. Vary real-world events: inject the kinds of failures that actually happen — instance death, network latency, DNS errors, regional outages.
  3. Run experiments in production: staging won't reveal what production reveals.
  4. Automate experiments to run continuously: one-off chaos days are theater; routine experiments are practice.
  5. Minimize blast radius: scope each experiment so a worst-case outcome is contained.

What a Chaos Experiment Looks Like

A well-run chaos experiment has five elements:

  • Steady-state hypothesis: "Our checkout completion rate is normally 95%+. If we kill 10% of the service instances, completion rate will stay above 90%."
  • Defined scope: what subset of the system is affected, for how long.
  • Blast radius limit: a kill switch, a maximum exposure percentage, automatic abort criteria.
  • Observability: the metrics that will reveal whether the hypothesis held.
  • Rollback plan: how the experiment ends, whether it succeeds or fails.
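The five elements above can be sketched as a small Python structure. This is a minimal illustration under stated assumptions, not any tool's API: the `ChaosExperiment` class, its field names, and the 80% abort threshold are hypothetical, and the metric readings are simulated rather than read from a real monitoring system.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ChaosExperiment:
    """Hypothetical container for the five elements of an experiment."""
    hypothesis: str                    # steady-state hypothesis, in words
    scope: str                         # what is affected, and for how long
    pass_threshold: float              # hypothesis holds if the metric never drops below this
    abort_threshold: float             # blast radius limit: stop at once below this
    inject: Callable[[], None]         # starts the failure
    rollback: Callable[[], None]       # ends the experiment, pass or fail
    read_metric: Callable[[], float]   # observability: current steady-state metric
    observed: List[float] = field(default_factory=list)

    def run(self, samples: int) -> str:
        if samples < 1:
            raise ValueError("need at least one sample")
        self.inject()
        try:
            for _ in range(samples):
                value = self.read_metric()
                self.observed.append(value)
                if value < self.abort_threshold:
                    return "aborted"
            if min(self.observed) < self.pass_threshold:
                return "hypothesis rejected"
            return "hypothesis held"
        finally:
            # The rollback runs on every path, including abort and exceptions.
            self.rollback()


# Simulated run: completion rate dips but stays above 90%.
events = []
readings = iter([0.94, 0.93, 0.92])
experiment = ChaosExperiment(
    hypothesis="Killing 10% of instances keeps checkout completion above 90%",
    scope="checkout service, 10% of instances, 3 samples",
    pass_threshold=0.90,
    abort_threshold=0.80,  # assumed value, not from the text
    inject=lambda: events.append("inject"),
    rollback=lambda: events.append("rollback"),
    read_metric=lambda: next(readings),
)
result = experiment.run(samples=3)
print(result, events)  # prints: hypothesis held ['inject', 'rollback']
```

Putting the rollback in a `finally` clause is the design choice worth copying: the experiment ends the same way whether the hypothesis held, was rejected, or the run was aborted.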

Starting Without Tools

Chaos Engineering does not require Chaos Monkey or any specific tool. The discipline can start with very simple experiments:

  • Kill one instance in your service during low-traffic hours. Does the system notice? Does it recover?
  • Block traffic to a downstream dependency for 30 seconds. Does your service degrade gracefully or cascade?
  • Introduce 200ms latency to a database call. Do timeouts trigger? Does the user see acceptable behavior?
  • Fill the disk on a server. Does logging fail loudly or silently?
  • Run the team's incident response drill with a simulated outage. Does the runbook actually work?

Even these simple experiments reveal surprises. Most teams running their first chaos experiment discover that something they assumed was resilient is not.
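The latency experiment in the list above can be tried in-process before touching any real infrastructure. The sketch below is one way to do it, with assumptions labeled: `fetch_price` is a hypothetical stand-in for a database call, and `with_latency` and `call_with_timeout` are illustrative helpers, not part of any chaos tool.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def with_latency(func, delay_s):
    """Wrap func so every call sleeps delay_s seconds first (the injected fault)."""
    def wrapped(*args, **kwargs):
        time.sleep(delay_s)
        return func(*args, **kwargs)
    return wrapped


def fetch_price(item_id):
    # Hypothetical stand-in for a database call.
    return {"item": item_id, "price": 42}


def call_with_timeout(func, timeout_s, fallback, *args):
    """Run func in a worker thread; return fallback if it takes longer than timeout_s."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(func, *args)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        return fallback
    finally:
        # Do not block the caller waiting for the slow call to finish.
        pool.shutdown(wait=False)


slow_fetch = with_latency(fetch_price, 0.2)  # inject 200ms of latency

# With a 50ms timeout the caller degrades to the fallback value.
degraded = call_with_timeout(slow_fetch, 0.05, {"item": "sku-1", "price": None}, "sku-1")

# With a 1s timeout the slow call still succeeds.
ok = call_with_timeout(slow_fetch, 1.0, None, "sku-1")
print(degraded, ok)
```

The two calls answer the questions from the list: whether the timeout triggers at all, and what the caller sees when it does.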

Gameday vs. Continuous Chaos

Two common modes:

  • Gamedays: scheduled, facilitated chaos sessions where the team deliberately injects failures together, watches what happens, and debriefs. Good for learning, team building, and validating high-risk hypotheses.
  • Continuous chaos: automated experiments running on production (or production-like) systems on an ongoing basis. Good for steady-state verification that the system's resilience properties haven't drifted.

Most mature programs combine both — gamedays for human learning and complex scenarios, continuous chaos for unattended assurance.

Common Pitfalls

  • Skipping the hypothesis: "let's break things and see what happens" produces excitement but rarely useful learning. The hypothesis disciplines the experiment.
  • Production aversion: running chaos only in staging. Most of the interesting failures are production-only.
  • Unbounded blast radius: an experiment that takes the whole system down teaches nothing new about resilience. It has simply caused an incident.
  • No follow-up on findings: chaos surfaces problems; if the team doesn't fix them, the practice produces awareness without resilience.
  • Tool-first thinking: buying Gremlin or Litmus without changing the team's behavior. The tools amplify a practice; they don't substitute for it.
  • Wrong team maturity: a team that can't yet handle a real incident shouldn't be deliberately causing incidents. Build the incident response muscle before adding chaos load.

Coaching Tips

Start Small

Begin with the smallest experiment that would produce learning. Kill one non-critical instance during low traffic before designing a regional failover drill.

Write the Hypothesis

Before injecting anything, write: "If we do X, we expect Y. We will measure by Z." No hypothesis, no experiment — just chaos.

Bound the Blast Radius

Every experiment needs a kill switch, a maximum exposure, and automatic abort criteria. "We'll just be careful" is not a blast radius limit.
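As a rough sketch of how those three limits might be enforced in code rather than by care alone, the class below caps how much of a fleet an experiment may touch and latches a kill switch when an abort criterion is crossed. The `BlastRadiusGuard` name, the error-rate criterion, and the numbers are assumptions chosen for illustration.

```python
import random


class BlastRadiusGuard:
    """Hypothetical guard: maximum exposure, abort criterion, kill switch."""

    def __init__(self, max_exposure, abort_error_rate):
        self.max_exposure = max_exposure          # fraction of the fleet, e.g. 0.10
        self.abort_error_rate = abort_error_rate  # abort when the error rate exceeds this
        self.kill_switch = False

    def pick_targets(self, instances, rng=random):
        """Choose at most max_exposure of the fleet; nothing once the kill switch is on."""
        if self.kill_switch:
            return []
        fleet = list(instances)
        limit = int(len(fleet) * self.max_exposure)
        return rng.sample(fleet, limit)

    def should_abort(self, error_rate):
        """Latch the kill switch once the abort criterion is crossed."""
        if error_rate > self.abort_error_rate:
            self.kill_switch = True
        return self.kill_switch


fleet = [f"instance-{n}" for n in range(20)]
guard = BlastRadiusGuard(max_exposure=0.10, abort_error_rate=0.05)

targets = guard.pick_targets(fleet, random.Random(7))
print(len(targets))               # prints: 2
print(guard.should_abort(0.02))   # prints: False
print(guard.should_abort(0.08))   # prints: True
print(guard.pick_targets(fleet))  # prints: []
```

The kill switch latches on purpose: once an abort criterion fires, later healthy readings do not re-enable the experiment until a person resets it.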

Fix What You Find

Chaos surfaces problems. If the team doesn't fix them, you've added work without adding resilience. Track findings to closure.

Build Incident Response First

If the team can't respond well to a real incident, don't add manufactured ones. Earn the right to chaos by getting the foundation in place.

Run Gamedays as Learning

Facilitate the first few sessions explicitly as learning events. The point is not just resilience — it's the team's relationship with failure.

Summary

Chaos Engineering reframes incidents from accidents to be feared into practices to be invested in. Systems that have never failed in controlled ways will fail in uncontrolled ones; the discipline ensures the failures happen on the team's terms, with the team watching, with the ability to learn and improve. Done well, it makes systems more reliable than testing alone could.

The practice requires real maturity. A team that cannot handle a routine incident should not be causing incidents deliberately. But a team that has the basic reliability muscle and wants to grow it further has no better tool than thoughtful, scoped, hypothesis-driven failure injection.

Footnotes
  1. Netflix Technology Blog. (2011). The Netflix Simian Army. Netflix Tech Blog.
  2. Rosenthal, C., & Jones, N. (2020). Chaos Engineering: System Resiliency in Practice. O'Reilly.
  3. The Principles of Chaos Engineering. principlesofchaos.org.