Zit Seng's Blog

Hard Disk Failure Or SGX Failure

Now that we have the scoop on the root cause of the SGX outage last Thursday, one wonders why a simple matter of a hard disk failure could spiral into such a disruption. More interesting, yet, is why a simple hard disk failure could not be explained until, let’s see, five days later?

I suppose we should credit SGX for not trying to obfuscate the matter. Or maybe they did try to see how they could do some damage control, but ultimately gave up and decided they should just come clean and state the simple problem as it actually is.

A hard disk had failed.

One wonders whether they were using some sort of RAID storage. If they were, why did a disk failure impair the application? Furthermore, why did that application need to detect a disk failure?

Yes, someone needs to be notified about the disk failure so that the disk can be replaced. This is usually done by some server management software. Instead, it sounded like SGX said the trading application needed to detect the hardware failure. Sure, it is a nice thing to have, but isn’t it more important that some fault monitoring software detected and reported the fault for a human to take remedial action?
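To make the point about fault monitoring concrete, here is a minimal sketch of the kind of check such software performs. It assumes Linux software RAID (md) and the text format of /proc/mdstat, purely for illustration; nothing here reflects what SGX actually runs, and the function name and sample text are hypothetical.

```python
import re

# Matches the first line of an array entry, e.g. "md0 : active raid1 ..."
ARRAY_RE = re.compile(r"^(md\w+)\s*:")
# Matches the member status, e.g. "[2/1] [_U]" (2 expected, 1 active)
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")


def find_degraded_arrays(mdstat_text):
    """Return {array_name: (expected, active)} for arrays missing a member."""
    degraded = {}
    current = None
    for line in mdstat_text.splitlines():
        header = ARRAY_RE.match(line)
        if header:
            current = header.group(1)
            continue
        status = STATUS_RE.search(line)
        if status and current:
            expected, active = int(status.group(1)), int(status.group(2))
            # An underscore marks a slot whose disk has failed or is absent.
            if active < expected or "_" in status.group(3):
                degraded[current] = (expected, active)
            current = None
    return degraded


# Hypothetical sample: md0 has lost a disk, md1 is healthy.
SAMPLE = """Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0](F)
      1048512 blocks [2/1] [_U]

md1 : active raid1 sdb2[1] sda2[0]
      2096128 blocks [2/2] [UU]

unused devices: <none>
"""

if __name__ == "__main__":
    for name, (expected, active) in find_degraded_arrays(SAMPLE).items():
        print(f"ALERT: {name} is degraded ({active} of {expected} disks active)")
```

In a real setup the text would be read from /proc/mdstat on a schedule and the alert sent to whoever is on call. The array keeps serving the application while degraded, which is exactly why a human needs to be told.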

I can only imagine that, assuming they did have some sort of RAID or other kind of redundancy in place, error and alarm messages must have gone unnoticed. Or, they were noticed, but weren’t acted upon quickly enough.

Worse, yet, is the possibility that SGX doesn’t know enough about RAID, HA, and the sort of redundancy and failover you’d expect a critical IT system to have. It is a “lower level” problem that has been outsourced away.

It’s like how, after their 2014 power outage, we heard that SGX did not have the expertise to design, construct or operate data centers from the facilities perspective. Perhaps there’s a lot more expertise that they’re lacking.

The cause of last week’s outage might have been a hard disk failure. I want to ask, what is really the underlying cause of the outage? Is it just a hard disk failure, or their inability to run their critical infrastructure?
