Quick thoughts on the Infinite Hows

Quick thoughts from reading The Infinite Hows

If you are trying to establish a healthy postmortem culture at your company, then you should lean on an established framework. The Five Whys, despite the attempted takedown from the Infinite Hows, holds up when combined with healthy doses of blameless culture and openness to real-world complexity.

A coworker recently recommended The Infinite Hows as a superior alternative to the Five Whys. The Infinite Hows claims that the Five Whys is flawed in two main ways:

  1. The Five Whys, incorrectly, suggests that there is a single logical chain of cause and effect that eventually reaches a root cause. This is a gross over-simplification of reality, which cannot be expressed as a single logical chain. The idea of a single logical chain is satisfying to the human mind’s desire for a cut and dried answer to a difficult question. But it papers over omissions and unexplored logical branches.
  2. The Five Whys leads us down a path of human blame instead of system analysis. This human blame, rather than helping us learn from our mistakes, injects fear and friction into incident response.

These are both salient points, but they don’t seem intrinsic to the question of Why versus How. If we internalize that a question’s “Why” is a one-to-many relationship and that we should blame systems, not people, then Why seems like a fine question to ask.

In all of my software jobs, companies have employed a blameless postmortem analysis that roughly matches the Five Whys, without issue. Perhaps the difference comes from the fact that the Infinite Hows was written in 2014, and my first postmortem exposures were in 2015. By 2015, the industry may have matured to a more effective (i.e. blameless) postmortem culture.

On the other hand, the first point about the fallacy of a single logical chain seems more salient. I have certainly experienced incidents with multiple chains of problems. I have seen incidents that are severe enough that engineers are split into different groups to explore different mitigations. For example, one group may explore whether a rollback of a given service can mitigate the incident, while the second group may explore whether some combination of feature flag changes can mitigate. Now there are two branches of events. Each branch will have its own sub-sequence of steps, from which we can probably learn something.

Suppose the incident can be mitigated by a rollback, but the second group has a lot of trouble determining whether there was a given culprit flag. Even if rollback resolves the incident, the abandoned feature-flag investigation shouldn’t vanish. The entire feature flag tangent is no longer relevant to the root cause, but the pain of finding a culprit flag will likely emerge in a future incident.

And besides forgetting about unexplored branches, the Infinite Hows mentions the very real problem of omission. Engineers often exclude certain parts of the story because they feel adjacent to blame or uncomfortable to share. Some examples:

  • An oncall may take a long time to respond to a page
  • An engineer may message a subject matter expert instead of paging them, out of fear of disturbing the expert
  • Engineers may spend a lot of time trying to work through an incident via message thread instead of creating a video call

These are all process issues that could be explored for improvement. Does everyone have their pager app set up to override Do Not Disturb? Does the company have a written policy about when to page others or when to create a video call? By creating policies for these situations, we can remove the mental overhead of decision making, and engineers can simply follow the process instead of fearing judgment for a decision.

Ultimately, blaming people instead of systems is counterproductive for improving engineering safety, and omitting incident “sub-branches” forgoes some learning benefits. Neither of these problems feels like it warrants inventing an entirely new approach, though.

Perhaps the best approach would be an incident analysis framework that accommodates multiple, weighted root causes and logic branches. This weighted, causal event tree would acknowledge multiple contributing causes, surface one or two “primary” drivers for focus, and preserve sub-branches for future improvement. This would help to articulate the idea that many causes can contribute to a single effect, without abandoning the idea of a primary root cause. At a certain point, we do have to pick a limited set of causes and cut off the depth of the logical tree.
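
To make the idea a bit more concrete, here is a minimal sketch of what such a tree might look like in Python. Everything in it is hypothetical: the Cause class, the weight and explored fields, the helper functions, and the example incident are made up for illustration, and the weights are judgment calls rather than measurements. It is one possible shape for the idea, not an established framework.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Cause:
    """One node in a hypothetical weighted causal tree."""
    description: str
    weight: float = 1.0    # assumed share of the parent effect (a judgment call)
    explored: bool = True  # False marks a branch that responders abandoned
    children: List["Cause"] = field(default_factory=list)

    def add(self, child: "Cause") -> "Cause":
        """Attach a contributing cause and return it so calls can be chained."""
        self.children.append(child)
        return child


def leaf_contributions(node: Cause, inherited: float = 1.0) -> Iterator[Tuple[Cause, float]]:
    """Yield (leaf, score), where score is the product of weights along the path."""
    score = inherited * node.weight
    if not node.children:
        yield node, score
        return
    for child in node.children:
        yield from leaf_contributions(child, score)


def primary_drivers(root: Cause, limit: int = 2) -> List[str]:
    """Return the highest-scoring leaf causes, cutting the list off at `limit`."""
    ranked = sorted(leaf_contributions(root), key=lambda pair: pair[1], reverse=True)
    return [leaf.description for leaf, _ in ranked[:limit]]


def unexplored(node: Cause) -> List[str]:
    """Collect every branch that was abandoned, so it is kept for follow-up."""
    found = [] if node.explored else [node.description]
    for child in node.children:
        found.extend(unexplored(child))
    return found


# A made-up incident with two branches: a rollback and a feature flag hunt.
incident = Cause("Checkout requests failing")
deploy = incident.add(Cause("Bad deploy of a service", weight=0.8))
deploy.add(Cause("Missing canary stage", weight=0.6))
deploy.add(Cause("Config change not reviewed", weight=0.4))
flags = incident.add(Cause("Possible culprit feature flag", weight=0.2, explored=False))
flags.add(Cause("No audit log of recent flag changes", explored=False))

print(primary_drivers(incident))
print(unexplored(incident))
```

In this sketch, primary_drivers picks the one or two causes to focus on, while unexplored keeps the abandoned feature flag branch on record instead of letting it vanish once the rollback works. The limit parameter is where we cut off the set of causes.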

If you’re looking for a good incident analysis framework, then you can rely on either the trusty Five Whys or the more nuanced and reflective Infinite Hows.