British Airways is blaming its massive IT outage last weekend on human error. A contractor at its data centre had inadvertently switched off a power supply, leading to a catastrophic IT failure that stranded some 75,000 travellers. But the true failure must surely lie somewhere higher up.
It wasn’t clear at first what kind of power supply British Airways was referring to. Some reports pointed to a power supply unit, or PSU, typically a component inside an individual server, having been wrongly switched off. It would have been completely unbelievable for a single PSU shutdown to cause such a disastrous IT disruption.
However, it was later reported that the failure concerned an Uninterruptible Power Supply (UPS), a system designed to deliver a continuous flow of electricity to servers, with batteries that automatically take over in the event of a utility power failure. A maintenance worker had switched off a UPS that was working normally. Shutting down a UPS would have affected a portion of the data centre.
I’ve been running data centres for many years, so stories like these intrigue me. It is common design for data centres to run two parallel power distributions, with independent incoming utility supplies, independent UPS systems, and independent backups. Server equipment would have two or more PSUs too. There is no way a single UPS, even one big enough to matter to data centre operations, should have taken out British Airways’ IT system.
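To make the point concrete, here is a minimal sketch, in Python, of the kind of 2N power topology described above. This is not British Airways’ actual architecture; the class and field names are hypothetical, and the model is deliberately simplified (a UPS either runs or it doesn’t, and a maintenance bypass can wrap around it). What it illustrates is that a dual-corded server should stay up when any single UPS is switched off.

```python
# Hypothetical sketch of a 2N power design (not BA's actual architecture):
# each server draws from two independent feeds, so it only loses power
# if BOTH feeds are dead at the same time.

from dataclasses import dataclass


@dataclass
class PowerFeed:
    """One independent path: utility supply -> UPS (with maintenance bypass) -> servers."""
    name: str
    utility_ok: bool = True    # incoming mains supply
    ups_on: bool = True        # UPS running: passes utility through, or rides on battery
    on_bypass: bool = False    # maintenance bypass routes utility around a switched-off UPS

    def live(self) -> bool:
        if self.ups_on:
            return True        # a running UPS covers even a utility outage (until batteries drain)
        return self.on_bypass and self.utility_ok


@dataclass
class Server:
    """Dual-corded server: one PSU drawing from each independent feed."""
    feed_a: PowerFeed
    feed_b: PowerFeed

    def powered(self) -> bool:
        return self.feed_a.live() or self.feed_b.live()


if __name__ == "__main__":
    a, b = PowerFeed("A"), PowerFeed("B")
    srv = Server(feed_a=a, feed_b=b)

    a.ups_on = False           # contractor switches off feed A's UPS, no bypass engaged
    print(a.live())            # False -- feed A is dead
    print(srv.powered())       # True  -- the dual-corded server rides on feed B
```

In a design like this, losing one UPS is an inconvenience on one feed, not an outage; it takes a second, independent failure on the other feed before any dual-corded server goes dark.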
A human error started the problem by switching off an otherwise healthy UPS system. Perhaps more human errors followed, such as the panicked maintenance worker attempting to restart the UPS, resulting in the power surge that British Airways reported, which could have damaged PSU components. Yet there are supposed to be multiple levels of protection for this sort of event, precisely to prevent a catastrophic meltdown.
But perhaps British Airways was simply unlucky, and this human error, or series of errors, took out the entire data centre. Even so, you begin to wonder: if a single person could trigger such a catastrophe, why were there no protections in place to prevent, contain, or survive such a hazard?
Did British Airways not have a Disaster Recovery (DR) site to resume operations if the primary data centre is taken out of service? Or was there one, but no one knew how to bring operations up at the DR site? Was there a DR plan, was it ever tested, and were all staff familiar with it?
Why did British Airways have an IT system that could not seamlessly continue operations despite a single data centre catastrophe?
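What “seamless continuation” usually requires, at a minimum, is something like the following sketch: a health check against the primary site and a rehearsed switch of traffic to the DR site once the primary stops answering. The endpoints, thresholds, and the route_traffic_to() hook here are all hypothetical, and real failover also involves DNS changes, database replication, and session state; this is only an outline of the idea, not how British Airways’ systems work.

```python
# Hedged sketch of site failover: poll the primary site's health endpoint
# and direct traffic to the DR site once the primary misses several checks.
# URLs, thresholds, and route_traffic_to() are made up for illustration.

import time
import urllib.request

PRIMARY_HEALTH = "https://primary.example.com/health"   # hypothetical endpoint
SITES = {"primary": "primary.example.com", "dr": "dr.example.com"}
FAILURE_THRESHOLD = 3     # consecutive failed checks before failing over
CHECK_INTERVAL_S = 10


def primary_healthy(timeout: float = 5.0) -> bool:
    try:
        with urllib.request.urlopen(PRIMARY_HEALTH, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False


def route_traffic_to(site: str) -> None:
    # Placeholder: in reality this would update DNS, a load balancer, or
    # similar -- and that step itself must be documented and rehearsed.
    print(f"routing traffic to {SITES[site]}")


def monitor() -> None:
    failures = 0
    active = "primary"
    while True:
        if primary_healthy():
            failures = 0
        else:
            failures += 1
        if active == "primary" and failures >= FAILURE_THRESHOLD:
            active = "dr"
            route_traffic_to("dr")   # the moment a DR plan is actually exercised
        time.sleep(CHECK_INTERVAL_S)


if __name__ == "__main__":
    monitor()
```

The mechanics are not the hard part; the hard part is that the failover path, the data replication behind it, and the people expected to trigger it all have to be tested regularly, which is exactly what the questions above are asking.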
The unfortunate maintenance worker may have sparked the incident. I’m sure he or she must be feeling terrible, perhaps already fired, or with their future career affected. However, the true failure surely lies with the design, management, and operation of British Airways’ IT system.
Even if a UPS is shut down, it should not bring down the entire system. The design intent is that the UPS kicks in when utility power fails, and that a UPS can be taken offline for maintenance while utility power continues to carry the load. There was no report of a utility power outage, was there?