Friday, July 03, 2009

It's Management's Fault

Article in Forbes by Kenneth G. Brill on 1 July 2009

The author claims that 10 years ago he "developed Brill's Law of Catastrophic Failure: Catastrophic failure is never the result of a single event or interaction, such failures are the result of at least five, and as many as 10, interactions. Any single interaction or several in combination might be bad, but not catastrophic. But when the right combination of interactions occurs (typically seven are required), they will produce a domino effect and devastation will occur 100% of the time."

I think the idea of multiple causes of accidents was already well established by then, but the article does give some good examples of how situations develop. They include:

* Falling off a ladder while doing DIY - (self) management failures of working alone, rushing to finish before dark, and carrying more on each trip to reduce the number of trips up and down the ladder. Human nature leads to more risk taking when an activity is repeated without incident. These combine with events such as strong winds at the top of the ladder. The result is a loss of balance while the hands are full, leading to the catastrophic failure of a fall.
* Failure of Manhattan air traffic control - management failures included agreeing to run on local electrical generators at times of high demand on the grid in return for a financial incentive from the electricity supplier, sending technicians on an offsite course, leaving them unable to monitor, failing to ensure an alarm remained visible to staff when a control centre was moved, and failing to provide a properly redundant backup. The failure was not immediate because the battery backup worked for several hours. Luckily, manual procedures worked and a catastrophe was averted.

The author has applied his analytical approach to other events and concludes that "The results are always the same. It doesn't matter what the underlying physical, mechanical or electrical portions of the event are, management error or inaction contributes up to half of the interactions resulting in catastrophic failure."

Based on data from The Uptime Institute for data centres, one third of failures are caused by equipment. Of the remaining two-thirds of availability failures, more than 70% are caused by intentional management decisions or by management inaction. That leaves less than 30% of the two-thirds, which means at most about 20% of all data centre failures are caused by human error.
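The 20% figure is not stated directly in the percentages quoted, so it is worth checking the arithmetic. The short Python sketch below is my own working, not anything from the article. It assumes that "more than 70%" is treated as exactly 70% (the lower bound) and that the three categories (equipment, management, human error) cover all failures; the variable names are made up for illustration.

```python
from fractions import Fraction

# Shares quoted in the post (Uptime Institute figures as reported there).
equipment = Fraction(1, 3)        # one third of failures are caused by equipment
non_equipment = 1 - equipment     # the remaining two-thirds

# Assumption: "more than 70%" is taken as exactly 70%, its lower bound.
management_share_of_rest = Fraction(70, 100)

# Convert the shares of the remaining two-thirds into shares of ALL failures.
management = non_equipment * management_share_of_rest
human_error = non_equipment * (1 - management_share_of_rest)

for label, share in [
    ("equipment", equipment),
    ("management", management),
    ("human error", human_error),
]:
    print(f"{label}: {float(share):.1%}")
```

On these assumptions, management accounts for roughly 47% of all failures and human error for 20%. Because the 70% is a lower bound, the management share is really a minimum and the human error share a maximum.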

He claims that "Systemically addressing management issues is the quickest and ultimately cheapest path to reducing catastrophic failure."
