Zit Seng's Blog

A Singaporean's technology and lifestyle blog

The Art Of Troubleshooting With Data

The mystery of the Circle Line signal interference that has caused significant disruption to train service on that line has been put to rest. Thanks to data scientists from GovTech, the cause of the disruptions has been found. The story of how they solved the problem quite intrigued me, because that’s what techies like me do all the time.

Now, I don’t want to steal any limelight from GovTech. They certainly did an excellent job and deserve all the accolades. Just think about it for a moment. They aren’t train operators, and they don’t have any special experience with designing and building trains. The mystery had apparently even baffled the people who built the Circle Line train system. Yet, here we have data scientists at GovTech who managed to conquer a problem outside their domain of expertise.

GovTech used data to get to the bottom of the Circle Line mystery. You can get the details from their blog post.

The manner in which GovTech dealt with the problem isn’t unfamiliar to me. Again, I don’t want to downplay their achievement, but what they did is what every competent engineer ought to have been doing all the time. This is how you troubleshoot problems. You collect data, make sense of it, develop a theory, test your hypothesis, collect more data, rinse and repeat.

Troubleshooting, and fixing problems, has to be predicated on data. If you don’t have data, what is there to fix?

In my line of work, I often hear vague complaints like “the Internet is down”. With that one statement, users expect whatever problem it is that they’re experiencing to be resolved. Well, you know what, if the Internet is working fine for me, then it doesn’t seem like there’s any problem to be fixed?

So I tell technical support staff that they need to collect data, enough so that they can properly and accurately explain the problem, enough so that someone else can reproduce the issue. If the problem can be independently reproduced, troubleshooting becomes so much easier.

The problem with the Circle Line wasn’t so straightforward. It seemed random. The problem wasn’t something they could reproduce on demand. Still, that’s not an excuse not to collect data and work on the basis of that data.

I’m afraid this culture of working with data is, perhaps, getting uncommon. People are troubleshooting and fixing problems through random actions, or even fixed actions that pay no attention to the real problem. These new techniques, surprisingly, work well often enough that everyone’s happy to leave things as they are, however illogical the methodology might be. But hey, if it works, why complain?

Some of these new methods of troubleshooting are brought about by Microsoft Windows. If something doesn’t work, just reboot the PC. If your broadband router isn’t working, try rebooting it. If your TV is mucking up, try turning it off and turning it back on again. It even happens with cars. If the Check Engine light turns on, some people would advise turning off the engine, waiting 20 minutes, then trying to start the car again.

No one actually tries to troubleshoot the problems more deeply. It’s usually just reboot. Turn off and turn on. Unplug and plug back in. Remove and re-insert.

For consumer devices, perhaps this is just fine. You don’t expect the consumer to have the in-depth technical know-how. The reboot regime works most of the time, no need to dig too deep.

This, however, isn’t acceptable in enterprise IT. I get somewhat riled up when someone just randomly reboots a switch, a server, or whatever equipment without beginning to understand what the issue is about. I have seen people running tcpdump in random places, without knowing what they want to look out for, or knowing if they are even doing it in the right place at the right time. Often, I hear people say “try this”, without knowing why, what they expect to see, and what to do next when they do or don’t get what they expect.

Not too long ago, I saw a brand-name vendor’s professional services team randomly unplug and rearrange network cable connections, because the network didn’t work. They were thinking if one arrangement doesn’t work, perhaps just try another one, and see if they have better luck. I gave up and left them alone to sort out their troubles.

I just mentioned luck. For some people, fixing problems is about luck. Just try something, perhaps they’ll get lucky and it’ll work. If it really worked, who’s going to ask more questions?

People like me. As annoyed as I might be if something doesn’t work when it should, I’d be equally frustrated if the problem suddenly gets resolved when nothing has changed. To me, it is equally a problem that something works when it ought not to.

I wanted to talk about data, so let me get to it. What GovTech did with the Circle Line was an excellent example of using data. It’s not something revolutionary. We use data all the time to troubleshoot weird, strange, random stuff. Let me give some examples.

Once upon a time, a long time ago… I had a network that occasionally and seemingly randomly melted down from a traffic storm. The problem would start so suddenly we could not troubleshoot effectively, and then it would just disappear so suddenly that we weren’t left with anything to work on. Still, we collected all sorts of data. Much like what GovTech did, we were trying to correlate events and data from other sources. Things like day of week, time of day. Eventually, we correlated the problem occurrences with lull periods in our computer labs. We asked what happens during those periods, and we soon found our culprit.
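To give a flavour of that kind of correlation, here is a minimal sketch in Python. It is not what we actually ran back then, and the incident timestamps are made up purely for illustration. The idea is simply to bucket each occurrence by day of week and hour of day, then see whether any bucket stands out.

```python
from collections import Counter
from datetime import datetime

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

def bucket_incidents(timestamps):
    """Count incidents per (day of week, hour of day)."""
    counts = Counter()
    for ts in timestamps:
        t = datetime.fromisoformat(ts)
        counts[(DAYS[t.weekday()], t.hour)] += 1
    return counts

def count_by_hour(counts):
    """Collapse the (day, hour) counts into per-hour totals."""
    by_hour = Counter()
    for (_day, hour), n in counts.items():
        by_hour[hour] += n
    return by_hour

# Hypothetical incident log: these timestamps are invented for the example.
incidents = [
    "2016-11-07T13:05:00",
    "2016-11-08T13:20:00",
    "2016-11-09T13:10:00",
    "2016-11-09T18:45:00",
    "2016-11-10T13:02:00",
]

counts = bucket_incidents(incidents)
by_hour = count_by_hour(counts)

# List the hours with the most incidents first.
for hour, n in by_hour.most_common(3):
    print(f"{hour:02d}:00 - {n} incident(s)")
```

If one hour keeps coming up across different days, that is your cue to ask what else happens at that time. A cluster like this doesn't prove anything on its own, but it tells you where to look next.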

In another case, we had complaints about “slow network” speeds from time to time. But it was never slow when we went to investigate. Yet, because there was a cluster of such complaints, we could not ignore them. The complaints were all around a certain physical locality, so that prompted us to test our equipment, but we found no fault. There was new structured network cabling installed in those areas, but all of it tested fine. In the end, we found the fault to be with a batch of patch cords. We had begun to zoom in on the patch cords because they were a common denominator. Although casual testing found the cables to be working fine, and they even passed proper cable testing, we ultimately found that the internal twisting of the individual wires in the cables was wrong.

I know these cases are not as sexy sounding as GovTech’s Circle Line problem, but the commonality in them is that the problems are relatively random and difficult to catch. The common top-down, bottom-up or divide-and-conquer methods to approach engineering problems don’t apply in these cases.

Now, you’d wonder why SMRT and LTA couldn’t solve their train mystery on their own.
