I attended a course on crisis management earlier this week. It’s great to know how much planning has gone into crisis and emergency management at my workplace. A big organisation such as the one I work in would come under intense scrutiny and criticism if it were anything less than perfect in handling an emergency or crisis situation.
Being an IT infrastructure person, my scope of emergency preparedness is largely focused on ensuring the continued delivery of IT services. This is a lot more complicated than many people would imagine at first. Not surprisingly, many non-IT people don’t understand the complexities of continuity management in an IT organisation.
First, like anything else, there are always the People, Process, and Technology parts to consider. (Some people might know it as People, Process and Product.) But most people seem to remember only Technology. As a result, emergency management in IT is often thought to be about UPS, redundant server components, redundant networks, backup storage, and alternate data centres.
People in IT will probably (or at least should) appreciate that the technology aspect by itself can already be exceedingly complicated. It is easy to understand the provisioning of backup power and cooling. (Although, in my experience, most people don’t appreciate the complexities of cooling in a data centre.) The same can be said about redundant server and network components. These are the building blocks. Where it gets complicated is when we have to manage the inter-dependencies between these building blocks to build a resilient system that can ensure application availability.
Alright, if I just lost you, let me take a step back and use storage as an example. Storage is really not as simple as it seems. You know about RAID1 and RAID5. But these do not protect you from a facility-wide outage. For that, you need remote data replication. That is easy to say, but remote data replication raises many more questions. Is the data replicated synchronously? How is failover handled? How is failback handled? Is there active-active operation?
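To make the "synchronously?" question a little more concrete, here is a toy sketch in Python. It is not how any real storage product works. The names `Site`, `ReplicatedVolume`, `ship_pending` and `fail_over` are all made up for illustration, and the "outage" is just a method call. It only shows why the answer matters: with asynchronous replication, whatever has not yet been shipped to the remote site is gone when the primary facility is lost.

```python
from dataclasses import dataclass, field


@dataclass
class Site:
    """A hypothetical storage site holding an ordered log of writes."""
    name: str
    log: list = field(default_factory=list)


class ReplicatedVolume:
    """Toy model of a volume replicated from a primary to a remote site."""

    def __init__(self, synchronous: bool):
        self.synchronous = synchronous
        self.primary = Site("primary")
        self.remote = Site("remote")
        self.pending = []  # writes accepted but not yet shipped to remote

    def write(self, record: str) -> None:
        self.primary.log.append(record)
        if self.synchronous:
            # Synchronous: the write counts as done only once the
            # remote copy also has it.
            self.remote.log.append(record)
        else:
            # Asynchronous: acknowledge now, ship the record later.
            self.pending.append(record)

    def ship_pending(self) -> None:
        """Copy all queued writes to the remote site."""
        self.remote.log.extend(self.pending)
        self.pending.clear()

    def fail_over(self) -> list:
        """Simulate losing the primary facility; return the writes lost."""
        lost = list(self.pending)
        self.pending.clear()
        return lost


def lost_writes_after_outage(synchronous: bool) -> list:
    volume = ReplicatedVolume(synchronous)
    volume.write("order-1")
    volume.write("order-2")
    volume.ship_pending()
    volume.write("order-3")  # the outage hits before the next shipment
    return volume.fail_over()


if __name__ == "__main__":
    print("synchronous, lost:", lost_writes_after_outage(True))
    print("asynchronous, lost:", lost_writes_after_outage(False))
```

The sketch leaves out everything that makes this hard in practice, such as the latency cost of waiting for the remote site on every write, and how you bring the primary back afterwards.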
Then, just when you think you’ve more or less understood the Technology part, there are still the People and Process parts to deal with. You need processes in place so that everyone knows what they are supposed to do, and when they are supposed to do it. Does everyone know what to do when an emergency situation arises? Are there guidelines to help people recognise an emergency situation? Are there people and processes in place before the emergency to ensure that the IT organisation is ready when it arises? How do we know that those preparations really work?
There are so many aspects to talk about just in the area of business continuity management in IT. But let me share a couple of things about communication in an emergency situation. Many organisations forget about the need for well-managed communication. Here’s a list of things that happen when IT goes down:
- IT staff respond to the situation.
- Customers, at the same time, are already affected, and they want to know how they are affected and when things will be fixed.
- But of course, IT has only just responded. They don’t know what is wrong. They are still trying to figure that out.
- Customers are impatiently hounding IT. This incessant pestering might actually prevent IT from getting useful work done.
- Meanwhile, IT has no useful information to share. Or they are so busy they have no time to share what they think is meaningless information. So customers get no updates.
- Customers get angrier.
This degenerative cycle also happens in other areas. Think about the lack of communication from SMRT during the initial major breakdowns. We learn from those mistakes, and subsequently, we get better at communicating with customers. IT may still not know what has gone wrong at the onset of an incident, but they know how to convey their lack of information in a manner that satisfies their stakeholders.
Business continuity for an entire organisation, of course, is a whole lot more in-depth and comprehensive than what the IT organisation within the business has to deal with.