We have not had such intense firefighting excitement at work in a long time. For IT folks, firefighting usually refers to the emergency management of IT issues. This time, there was literally a real fire in the big picture. We’re not firemen, so no, of course we weren’t directly involved in fighting that fire. And although the fire did not happen inside our premises, our IT operations were severely impacted by it, and we would be spending significant time recovering services.
In the early morning of Friday 13th, a fire broke out on the 2nd level of block S14 at the NUS Kent Ridge Campus. We didn’t know about that initially, of course. The fire wasn’t at our premises. Although the fire supposedly started at about 3am, our first hint of trouble came later.
Our morning started with a NOC alarm at 4:17am about two link failures. The two fibre links had lost connectivity. It wasn’t a severe enough problem to cause any real service disruption. The entire network was still running perfectly normally.
Then at 5:23am, suddenly, we completely lost all connectivity from our main premises to a remote site. Hmm, that was extremely unusual. Theoretically, even in such scenarios, network services should not be significantly impacted, but unfortunately in this instance, we experienced some other problems.
Soon after, I heard from a colleague who manages our Unix server operations that even their SAN connectivity had been lost.
This seemed like a very severe problem. The most logical explanation for the observed symptoms was that there must have been some sort of catastrophic event that had completely damaged all our fibre optic cables. It was not a scenario we liked to imagine. Recovering from such an event would be very difficult. But it was the simplest explanation.
Then, some news started to trickle in. Two fire trucks and a police car were sighted at S15/S14. Sounds like a fire. We have fibre optic cables running through those buildings. Perhaps a fire had catastrophically damaged our cables.
So I spent the next few hours of the morning ascertaining exactly what had happened, and exactly where it had happened. We met with many people, mobilized our own staff, and drafted several courses of action depending on how the situation panned out. Bear in mind that at that early point in the morning, we were still unsure of many things.
As more details were confirmed through the morning, we were able to firm up our recovery plans. I think the most significant “breakthrough” was when we were able to assess the nature and extent of the damage. The site was so completely gutted that we figured there was no way we could carry out any repair of the cables at the damage site itself. The most reasonably workable solution was to cut, splice and reroute our fibre optic cables around the entire damage zone. It would be a costly affair. Emergency repairs are always going to be costly.
We are still working on service recovery today.