Thursday, October 05, 2006

IT reliability

Article by Borris Sadacca on 3 October 2006 available here

The article is mostly concerned with datacentres and their reliance on reliable equipment and a reliable power supply, including Uninterruptible Power Supplies (UPS). "It is clear that to achieve high availability in the datacentre, IT directors need to look not only at the applications and server infrastructure and service level agreements associated with the IT, but also at the non-IT infrastructure - the mechanical, electrical and plumbing systems that keep the datacentre operational."

It points out that systems designed to be highly reliable are often brought down by human error. Examples quoted include:
* Staff may be required to work after hours, when they are tired.
* Maintenance staff commonly fail to follow procedures step by step, especially when they are well versed in the task.
* System components are replaced even though they show no signs of wear or failure. This creates an opportunity to introduce other failures.
* Invasive checks that require the removal of other components can introduce problems.

"So while technology and multiple levels of redundancy can limit the effect of failure, much of what keeps a datacentre going is down to the people. Many problems can be avoided simply by operating a two-person maintenance team."

Andy Brazier
