On Saturday, 14th Oct 2023, the aircon in a somewhat innocuous building in an unpretentious locality in Singapore broke down. It shouldn’t have been a big deal. After all, it was a non-working day. On such days there are probably few people, if any, in most commercial buildings. But this was different.
This aircon caused a blip in our digital economy.
DBS suffered the 3rd outage of its digital banking services this year. The first incident happened on 29th March 2023. There was another incident on 5th May 2023, where capacity issues caused problems with internet and mobile banking, electronic payments and ATM transactions. The disruption yesterday also took out some ATMs.
But this disruption went beyond DBS. At the same time, Citibank also suffered an outage of its digital banking services. It subsequently emerged that several other businesses suffered some form of outage to their services.
The common denominator in all of these disruptions is one thing: they were all impacted by a technical issue at an Equinix data centre. There was a cooling issue at this data centre, and it (probably) caused some systems to shut down, thus leading to service outages.
Some companies, like Imperva, a cybersecurity software and services provider, were able to work around the problem by redirecting services to other data centres. As a result, their customers were not affected.
We’d think that’s how the cloud works. There are redundant systems so that problems can be trivially worked around without service impact. Even when we consider facility-wide outages (e.g. an entire data centre somehow going offline), services can simply fail over to another redundant data centre.
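To make the idea concrete, here is a minimal sketch in Python of what failing over between data centres means. It is an illustration only, not how any bank or provider actually does it: the `DataCentre` class, the `pick_active` function, and the site names are all hypothetical, and the `healthy` flag stands in for whatever real health check would be used.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DataCentre:
    """Hypothetical stand-in for one site hosting a service."""
    name: str
    healthy: bool  # assumption: a real system would derive this from health checks


def pick_active(sites: List[DataCentre]) -> Optional[DataCentre]:
    """Return the first healthy site in priority order, or None if all are down."""
    for site in sites:
        if site.healthy:
            return site
    return None


if __name__ == "__main__":
    sites = [DataCentre("primary", True), DataCentre("secondary", True)]
    print(pick_active(sites).name)  # traffic goes to the primary site

    sites[0].healthy = False        # e.g. the cooling fails at the primary
    print(pick_active(sites).name)  # traffic moves to the secondary site
```

The point is simply that a second, independent site takes over when the first one goes down, ideally without customers noticing.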
This was apparently not the case with DBS or Citibank.
Ten years ago, I would have expected some robustness in such critical systems, but perhaps accepted that some banks are just not very good. However, it’s 2023 now, and after so many incidents and reprimands from MAS, I’m surprised that DBS still can’t get such basic things right.
An outage in one data centre completely obliterated their digital banking services. The same happened with Citibank, because they were unfortunate enough to be colocated in the same facility.
Maybe, there are more banks that are no better, but they escaped this time by having chosen to colocate in a different facility.
This one single data centre caused the outage of both DBS’s and Citibank’s online banking services. In the case of DBS, even some physical ATMs were down. I’ve heard that some people who were travelling at the time found their credit card transactions declined. Some others were trapped in carparks because their NETS top-up from DBS failed.
So, let’s be clear. The issue is not just about not being able to log in to the banking app. There are other interdependent digital services that got affected, causing various levels of inconvenience to customers everywhere, not necessarily in Singapore. This is not any sort of national-level catastrophic event, but surely at the individual level, there would have been some very frustrated customers.
We need to be very concerned, however. If just an aircon problem in one data centre can bring about such wide-ranging impact, is our digital economy robust enough to withstand not-so-trivial types of infrastructure failures?
Back in 2013, M1 suffered a 3-day mobile service outage because, can you imagine, their entire network had a dependency on one specific switch. The failure of one switch broke their entire network. Isn’t that just completely insane? It’s not like we didn’t know about redundancy, high availability, and failover, even before the 2000s. Yet, we have a telco that got caught with such a ridiculous network design.
2013 was certainly a bad year, because there was another incident. A fire broke out at the Bukit Panjang telephone exchange. It caused a massive network disruption to many users. While I can appreciate that home broadband users and even less-critical commercial services wouldn’t have backups in place, it seems ridiculous that 18 DBS branches, 2 UOB branches, and 100 mobile base stations were out of commission.
As Singapore tries to push towards digitising everything, we have to wonder if our digital world is robust enough to withstand some non-trivial attack. When that time comes, will we also be sufficiently prepared, with the required contingencies in place, so that we can still conduct our lives with some adjustments?
What if, for example, there is an outage of our payment network? No PayLah, no PayNow, no Visa, no Mastercard. How would we transact for daily necessities? What if this outage lasted a week?
Some people will thus say, cash is king. That may be so, yet using cash may become increasingly difficult when our transactions are virtual, and there may not even be the face-to-face opportunity to exchange cash.
Our vulnerability is that our infrastructure may not be robust and resilient enough to withstand an organised attack.
Here’s an example to think about. Network professionals will know quite well that our connectivity to the outside world basically comes down to our submarine cable landing stations. How many are there? (I leave this as an exercise for you to find out.)
These are big problems that we need to solve. Before that, however, it seems we have yet to overcome more basic challenges. Two banks got taken offline because an aircon broke.