Today is sort of the equivalent of the Mayan apocalypse for M1. No voice, no text, and no data, for a pretty big chunk of their customers, for a good part of the day. This is for their 3G mobile service. 2G was still running, but of course, smartphones are so ubiquitous nowadays that most people are on 3G.

M1 reported in a press release that a power incident in their network centre triggered the discharge of their gas suppression and water sprinkler system. This led to the outage of one of their mobile network switches.

Their press release suggests that the outage was confined to one mobile network switch. This raises a few questions.

  1. Why is there no redundancy from the mobile network edge (presumably the mobile base stations) to some alternate backup switch? Did M1’s network engineering allow for such a risky single point of failure? How could a single switch failure cause such a large-scale problem?
  2. Why is it so hard to restore this single switch? If the hardware had been toasted, surely there must be some standby equipment available? What about backend service agreements with equipment vendors?

M1 is lucky that the gas and water discharge only led to the outage of one piece of equipment (at least that’s what we’re hearing). What if the damage had been more serious? Think not just about more equipment being damaged, but about the destruction of the entire network centre. Wouldn’t that be so devastating that their mobile services would be out for a very prolonged period of time? Perhaps so bad that M1 would go out of business?

Actually, I find something fishy about the information provided by M1. I mean, the above questions already shake my confidence in M1 (not that I had plenty of it in the first place). But I happen to know a little about data centre operations, and a few things seem amiss.

  1. M1 says a power incident led to the activation of their gas suppression and water sprinkler system. Water sprinklers only discharge water when there is sufficient heat to “burst” the bulb (at the nozzle). So there must have been heat. Heat, as in from a fire.
  2. Besides, such systems, particularly in “data centre” environments, are designed to be very reliable. You require a detection alarm, and a confirmation alarm, before the system activates. When you have both a gas suppression system and water sprinkler system, they ought to have run independently. To have both go off at the same time is quite unusual, unless there had really been a valid detection condition.
  3. In data centre environments, fire detection is usually based on sensing smoke rather than heat. But the water sprinklers “burst” from heat. Which suggests that there had been both smoke and heat.
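
To make the reasoning in the three points above concrete, here is a rough sketch of the release logic as I have described it. This is only an illustration, not how M1's systems are actually wired: the names `RoomState`, `gas_release`, `sprinkler_release` and `discharged_systems` are made up, and the 68 °C bulb rating is an assumed, commonly quoted figure rather than anything from M1's press release.

```python
from dataclasses import dataclass

# Assumed bulb rating for illustration; real installations vary.
SPRINKLER_BULB_RATING_C = 68.0


@dataclass
class RoomState:
    smoke_detector_a: bool   # first (detection) alarm
    smoke_detector_b: bool   # second (confirmation) alarm
    ceiling_temp_c: float    # temperature at the sprinkler nozzle


def gas_release(state: RoomState) -> bool:
    # Gas suppression needs both a detection and a confirmation alarm.
    return state.smoke_detector_a and state.smoke_detector_b


def sprinkler_release(state: RoomState) -> bool:
    # The sprinkler is independent of the smoke alarms: only heat bursts the bulb.
    return state.ceiling_temp_c >= SPRINKLER_BULB_RATING_C


def discharged_systems(state: RoomState) -> list[str]:
    # Report which of the two independent systems would discharge.
    systems = []
    if gas_release(state):
        systems.append("gas")
    if sprinkler_release(state):
        systems.append("water")
    return systems


if __name__ == "__main__":
    print(discharged_systems(RoomState(True, False, 25.0)))  # one smoke alarm only
    print(discharged_systems(RoomState(True, True, 25.0)))   # confirmed smoke, no heat
    print(discharged_systems(RoomState(True, True, 80.0)))   # confirmed smoke and heat
```

Under these assumptions, the only state in which both gas and water discharge is the one with confirmed smoke and real heat, which is why the "power incident" explanation sounds incomplete to me.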

I would love to hear more details about what actually happened in M1’s “network centre”. It is just so unfathomable that a single switch outage could cause a disruption on the scale it did, and for so long!

