Mistakes happen. Here at Brandwatch we try our hardest to make sure they don't, but occasionally things don't quite go to plan. This is a common problem in engineering. Just ask NASA, who lost a $125-million spacecraft because a contractor used imperial instead of metric units, or Apple, who failed to spot that the iPhone 4 would suffer a significant signal drop when held in a human hand. Clearly, if slip-ups of this magnitude can happen to some of the largest, best-funded organisations imaginable, they can happen to anyone, and we are no exception.
This was (and presumably still is) a well known fact at Toyota. Back at the turn of the twentieth century, Sakichi Toyoda developed the "5 Whys" technique, designed to uncover the “root cause” of problems. Since then, 5 Whys has been incorporated into the Toyota Production System, a key part of the company’s success over the last century or so. The method has subsequently moved beyond manufacturing and is now used in other fields, including software development.
We have found it useful for a 5 Whys meeting to be triggered by a specific, significant event that can then be the focus of the investigation. An example might be a period of unplanned downtime or a critical bug making its way to the live system, but there is no limit on when it can be used. Deciding when 5 Whys is appropriate is a matter of judgement, but major, unusual or unexpected failures are generally good candidates.
The process is very straightforward. You begin with the problem and simply ask "Why?" over and over again, building up a chain of causes and effects as you go, until you reach a root cause (or perhaps several root causes). You then study the root causes and identify ways to address them. It's important to think about long-term solutions rather than quick fixes. The aim is to prevent similar failures from recurring by tackling the root cause of the issue rather than its symptoms.
Let's look at a simple example, adapted from one given by Taiichi Ohno of Toyota:
"Why did the robot stop?"
The circuit has overloaded, causing a fuse to blow.
"Why is the circuit overloaded?"
There was insufficient lubrication on the bearings, so they locked up.
"Why was there insufficient lubrication on the bearings?"
The oil pump on the robot is not circulating sufficient oil.
"Why is the pump not circulating sufficient oil?"
The pump intake is clogged with metal shavings.
"Why is the intake clogged with metal shavings?"
Because there is no filter on the pump.
In this example, it would be simple to fix the apparent cause of the fault (a blown fuse) and move on, but this would not prevent the problem from recurring (in fact, it may recur quite quickly!).
Similarly, a worker could clear the metal shavings from the pump. This might keep the robot running for longer, but the intake would eventually clog again and the issue would recur.
Even fitting a filter to the pump may not be sufficient. In this example it may be desirable to keep asking "Why?" to find out which part of the process allowed a pump to be installed without a filter. It could be a process problem, a training problem, a monitoring problem or something else.
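For readers who like to see things written down as data, the chain above can be sketched in a few lines of Python. This is only an illustration, not a tool we use: the `Why` class and the `deepest_cause` and `symptoms` helpers are hypothetical names invented for this example.

```python
from dataclasses import dataclass


@dataclass
class Why:
    """One question-and-answer step in a 5 Whys chain (hypothetical type)."""
    question: str
    answer: str


# The robot example, recorded in the order the questions were asked.
chain = [
    Why("Why did the robot stop?",
        "The circuit has overloaded, causing a fuse to blow."),
    Why("Why is the circuit overloaded?",
        "There was insufficient lubrication on the bearings, so they locked up."),
    Why("Why was there insufficient lubrication on the bearings?",
        "The oil pump on the robot is not circulating sufficient oil."),
    Why("Why is the pump not circulating sufficient oil?",
        "The pump intake is clogged with metal shavings."),
    Why("Why is the intake clogged with metal shavings?",
        "Because there is no filter on the pump."),
]


def deepest_cause(steps):
    """Return the answer to the last 'why' asked, i.e. the deepest cause found so far."""
    return steps[-1].answer


def symptoms(steps):
    """Return every answer above the deepest cause; fixing only these invites a repeat."""
    return [step.answer for step in steps[:-1]]


print("Deepest cause:", deepest_cause(chain))
for symptom in symptoms(chain):
    print("Symptom:", symptom)
```

Asking one more "Why?" simply appends another step to the list, at which point the missing filter itself becomes a symptom of something deeper.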
How we do 5 Whys
We perform 5 Whys meetings as follows.
Once a candidate for 5 Whys has been identified, a meeting is arranged for everyone involved in the issue. This is typically the entire development team, plus any key people from outside the team who played a part. It's important that everyone attends so that their point of view can be represented. Otherwise, you may find that the right person to answer a question is not in the room and the investigation grinds to a halt or, worse, that the absentee is blamed for the failure without being there to provide a crucial point of view.
The meeting is scheduled for about 2 hours but may be shorter. It is split into two parts. In the first half, the 5 Whys technique is carried out and a tree of all the causes and effects is mapped out on a whiteboard. In the second half, the root causes identified are discussed and solutions are proposed. These are also noted down on the whiteboard. By the end of the meeting, a consensus should be reached on which root causes will be tackled.
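The whiteboard tree can be pictured as a simple data structure in which each cause may have several causes of its own, and the root causes are the nodes where nobody has asked (or can answer) a further "Why?". The sketch below is purely illustrative: the `Cause` class, the `root_causes` function and the incident itself are all made up for the example and do not describe a real investigation.

```python
from dataclasses import dataclass, field


@dataclass
class Cause:
    """A node on the whiteboard: an effect plus the causes found by asking 'Why?'."""
    description: str
    because: list["Cause"] = field(default_factory=list)

    def why(self, description):
        """Record an answer to 'Why?' for this node and return the new child node."""
        child = Cause(description)
        self.because.append(child)
        return child


def root_causes(node):
    """Collect the leaves of the tree, i.e. causes with no further 'because'."""
    if not node.because:
        return [node.description]
    found = []
    for child in node.because:
        found.extend(root_causes(child))
    return found


# A hypothetical incident with two branches of investigation.
problem = Cause("A critical bug reached the live system")
untested = problem.why("The change had no automated test")
untested.why("Testing time was not included in the estimate")
review = problem.why("The code review missed the bug")
review.why("The reviewer was unfamiliar with that part of the code")

for cause in root_causes(problem):
    print("Root cause:", cause)
```

Treating the leaves as the root causes mirrors what happens at the whiteboard: a branch ends when the group agrees it has reached something worth fixing, and the second half of the meeting then works through that list.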
Some seemingly important causes may be difficult to fix, so a compromise must be made: a tradeoff between the cost of the fix and the severity of the problem. In practice, this is not normally difficult to achieve. In his book The Lean Startup (where I first learned of the technique), Eric Ries proposes that the investment should be proportional, that is, smaller when the problem is minor and larger when the symptom is more painful [The Lean Startup p.232].
People who are unfamiliar with the technique are often reluctant to take part, perhaps feeling that the purpose of the meeting is to find someone to be at fault and that they may be blamed for any failures. It's important when facilitating a meeting like this that everyone understands the intended outcome is some kind of process improvement and that there is never a desire to attribute blame to any person.
Making it useful
It's important that the output of the 5 Whys meeting is shared with the appropriate people. We think there is value in everyone in the department (and even beyond) learning from the improvements identified, so as a rule we share a brief write-up of the outcome with the entire department, the Product team and any other parties who may have been involved. It’s important to us to foster a culture where we can be open and honest about the circumstances surrounding failure, so that we can learn from it. At Spotify, there is a physical "Fail Wall" where failures are posted and celebrated. This may seem strange, but it is important to realise that every failure is an opportunity to learn, and by distributing the results of the investigation as widely as possible we hope to do just that.