M1’s Idea of Network Resiliency

M1 has come out with more information surrounding its colossal 3G service outage from 15 to 17 January. Great, I was indeed hoping that M1 would share more. As I posted previously, their earlier press release raised many questions: what type and scale of disaster did they actually experience, and just what kind of network resiliency and disaster recovery plans did they have?

First off, I was quite right that things were a lot more complicated than the “power incident” they originally described. There was smoke. Smoke would have triggered the FM200 gas suppression system, and, had they used a dry sprinkler system instead of a wet one, it would also have charged the dry pipes with water.

The new information from M1 says the gas discharge resulted in one of the water sprinklers activating (i.e. releasing water). This is still fishy to me. It is possible that the gas discharge would have armed a dry sprinkler system (i.e. charged the pipes with water). But whichever kind of water sprinkler system was in use, heat would still have been required to burst the bulb at the sprinkler head (the nozzle) before any water was released. Was there a fire?

FM200 gas suppression systems are extremely effective. Within about 10 seconds, the protected area is completely flooded with HFC-227ea gas, which suppresses combustion almost instantaneously. It is extremely unlikely that any fire could have continued burning long enough to produce the heat needed to burst a water sprinkler bulb. That’s the whole point of investing in an expensive fire suppression system like FM200: you do not want the water sprinklers to release water, because the cleanup afterwards is expensive.

So here’s the curious question. Was the FM200 system actually working properly?

M1 explained about two action plans they took to restore service after their mobile network switch was destroyed:

  • Relink the radio network controller from a mobile network switch to an alternative mobile network switch.
  • Reconfigure the base stations to use an alternative mobile network switch, presumably through another radio network controller, hence independent of the work above.

It took almost three days. Clearly, a lot of work needed to be done: all that reconfiguration, running around, doing this and that. I appreciate that it is a lot of work.

However, this is not what we would call a high-availability network. This is not a network with redundant components designed to be resilient against a single-site outage.

It seems that M1 is quite satisfied with themselves that their contingency plans enabled them to restore service in the shortest time possible. Otherwise, it would have taken them 12 to 16 weeks. If they had been out of action for 12 to 16 weeks, they would probably be out of action permanently, i.e. out of business.

Is three days something to be reassured about? Come on, M1 claimed just one mobile network switch was out of action. One small component like that put them out of action for three days.

What if that incident site had burned down?

I am not an expert in 3G mobile infrastructure. But from an IT point of view, our systems are designed to be much more resilient. If one switch is down, another takes over immediately (well, give it a couple of seconds to a minute or two, depending on the technology in use). It’s alright for an entire IT rack to be destroyed: there’s another one. It’s okay for an entire data centre to burn down: there’s another data centre.
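To make the idea concrete, here is a minimal sketch of the automatic failover behaviour I mean: a client holds an ordered list of redundant endpoints and moves to the next one as soon as the active endpoint stops responding to a health probe. The names (`switch-A`, `switch-B`) and the health-check interface are purely illustrative assumptions, not M1’s actual design.

```python
# Hypothetical failover sketch: pick the first healthy endpoint from an
# ordered list of redundant switches. Names and probes are illustrative.
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Endpoint:
    name: str
    healthy: Callable[[], bool]  # health probe, e.g. a heartbeat check


def select_active(endpoints: List[Endpoint]) -> Optional[Endpoint]:
    """Return the first healthy endpoint, or None if all are down."""
    for ep in endpoints:
        if ep.healthy():
            return ep
    return None


# Usage: the primary switch fails, and traffic moves to the standby
# automatically, with no manual reconfiguration or "running around".
primary = Endpoint("switch-A", healthy=lambda: False)  # destroyed site
standby = Endpoint("switch-B", healthy=lambda: True)   # alternative site
active = select_active([primary, standby])             # selects switch-B
```

The point of the sketch is the design, not the code: because the standby is already configured and continuously health-checked, failover is a matter of seconds, not days of manual relinking.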

If you’re as big as Google, it’s alright for an entire geographic area to be nuked out. There’s always something else somewhere else.

Of course, how much resiliency you design, and how much you spend, depends on your risk analysis and business continuity requirements.

Frankly, I think M1 has under-provisioned their network resiliency. I had thought perhaps a whole data centre burned down. But no.

Just one switch. Three days of outage.

Comments

  1. I think this is a clear case of a non-performing automatic failover (sneaker-net failover in this instance doesn’t count) that sabotaged the overall resiliency of the entire system design. I would have thought something as critical as a base station would have a simple redundant uplink configuration.

  2. […] I still cannot believe the incident that happened just a little over a year ago, where a single data centre accident led to a catastrophic crippling of much of M1’s […]

  3. […] I still cannot believe the incident that happened just a little over a year ago, where a single data centre accident led to a catastrophic crippling of much of M1′s […]