Zit Seng's Blog

A Singaporean's technology and lifestyle blog

An IT Operations Mindset

I suppose I’ve been doing this for too long. What comes as second nature and common sense to me seems to be exceedingly complicated for some other people to understand. Or, maybe, as I sometimes think, they just cannot be bothered. They’re in the wrong job, because they really don’t like to do this.

My beef is about the frequent lack of planning and design, or haphazard planning and design that has little reasonable chance of working. Some people make a project timeline that is one in name only, there to satisfy some management that demands to see it, but appear to have no intent of following the plan at all. Others agree to anything, just for the meeting or call to be over, and then subsequently act as if the meeting or call never even took place.

These days, people get on Google and then think they are cleverer than their doctor. Not surprisingly, even more people will think IT is so simple. Everything you wanted to know, from installing Linux, learning Python, to developing a website or web app, you can learn it all from YouTube. Yes, you can find tutorials on YouTube and elsewhere, and it’s true these things aren’t really that complicated. You will get your website and web app. Now, whether that is well made, will work in production, can scale, is manageable, is maintainable, robust, and resilient… those are all different things altogether.

Unfortunately, these people don’t know that they don’t know, but think they are quite the experts now, and will not listen to people who actually know more. It is like a patient who Googles about their medical condition and thinks they know it better than their doctor. I’m not saying doctors are infallible, but surely their years of medical studies and training count for something much more than 10 minutes of Googling.

This rant is about IT operations. But let me first start by saying that at times, the problems with some people’s management of IT operations aren’t even IT-specific. They are common-sense planning, design, and management matters that would also apply to other kinds of work.

For example, the lack of a project timeline. Even if it is for a simple task, it doesn’t hurt to have some inkling of a timeline. When is something needed? Who is available to do it? What skills, resources, and materials are required? Are there any potential challenges? How will we address those challenges? Are there third parties or other dependencies we have to work with?

If something needs to be done today, does it make sense to only start thinking about those questions today? (To be clear, “done today” means it should be completed, and the desired outcomes are achieved, today.)

It’s unbelievable (to me) that sometimes there is no plan at all. If a task is due today, the thinking is that, today, we will take a look at it and figure it out. Well, today arrives, and then it comes to light that there are other pre-requisite activities required. A task that was to be done today might then end up still not seeing any sign of completion even one month later.

Let me give an example about renewing the road tax for one’s car. If the road tax expires next week, you probably ought to have looked into renewing it two or three weeks ago. But never mind, you think you can just do it “late”, perhaps like one week before it expires. So today, you decide to look into the renewal, and you realise, oh dear, you need the insurance to be renewed. Maybe, also, you need to get your car inspected. While road tax and insurance can be done online nowadays (though the insurance will need some lead time to be updated into LTA’s system), getting your car inspected requires that: you have time to get it done; the inspection centre is open; it is not a peak time, if you don’t have lots of time to spend waiting in the queue; and if there are any issues, you can get them resolved at a car workshop and still have time to get the car re-inspected.

You tell me, is that not common sense?

Then there are times when a task involves iterations of a certain step. Let’s say a certain step needs to be repeated many times. Each iteration takes 5 mins, and if 100 iterations are required, simple arithmetic tells you that 500 mins are needed. That’s assuming you make no errors, and you waste no time. Yet, someone can make a plan that says this task will be done in half an hour (knowing, in this case, that various constraints make parallelism, overlap, and other optimisations impossible).
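To spell out that arithmetic, here is a minimal sketch in Python. The function names and the optional error margin are my own illustration and not part of any real planning tool; the only figures taken from the example above are 100 iterations, 5 mins each, and the half-hour plan.

```python
def serial_duration_minutes(iterations, minutes_per_iteration, error_margin=0.0):
    """Time needed when iterations must run strictly one after another.

    error_margin is a hypothetical allowance for mistakes and wasted time,
    e.g. 0.2 adds 20% on top of the best case.
    """
    return iterations * minutes_per_iteration * (1 + error_margin)


def plan_is_feasible(planned_minutes, iterations, minutes_per_iteration):
    """A plan shorter than the best-case serial time cannot possibly work."""
    return planned_minutes >= serial_duration_minutes(iterations, minutes_per_iteration)


best_case = serial_duration_minutes(100, 5)
print(f"Best case: {best_case:.0f} mins, about {best_case / 60:.1f} hours")
print("Half-hour plan feasible?", plan_is_feasible(30, 100, 5))
```

The best case alone is more than eight hours, so a half-hour plan is off by a factor of more than sixteen before any mistakes are even counted.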

I have to teach people basic arithmetic and the concept of time, topics that are taught in primary school. You understand, here, the problem isn’t about IT. It is the lack of common sense, people who cannot count, and cannot keep time.

With respect to IT, one should also know that you cannot always simply add resources to make things go faster. E.g., putting twice as many people on a task doesn’t really halve the task time. This could be different in other kinds of tasks. E.g., if it takes 1 painter 10 hours to paint a house, reasonably, 2 painters will likely need about 5 hours to do the same thing. This logic may not work with IT. In fact, not only does it sometimes not scale the same way, adding people could possibly make things worse. IT people should jolly well understand this.
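One way to picture this is a toy model, sketched below in Python. It is purely my own illustration and not a measured result: it assumes the work divides evenly among people, and that every pair of people costs a fixed amount of coordination time. The function name and the 1 hour per pair figure are made up for the example.

```python
def elapsed_hours(total_work_hours, people, overhead_hours_per_pair=0.0):
    """Toy model: evenly divided work plus a coordination cost per pair of people.

    overhead_hours_per_pair is a hypothetical figure. Zero models the painters,
    who barely need to coordinate; a positive value models an IT task where
    everyone has to stay in sync with everyone else.
    """
    if people < 1:
        raise ValueError("need at least one person")
    pairs = people * (people - 1) // 2  # number of distinct pairs of people
    return total_work_hours / people + overhead_hours_per_pair * pairs


for n in range(1, 6):
    painting = elapsed_hours(10, n)
    it_task = elapsed_hours(10, n, overhead_hours_per_pair=1.0)
    print(f"{n} people: painting {painting:.1f} h, IT task {it_task:.1f} h")
```

With these made-up numbers, the painters keep getting faster, while the IT task stops improving after the second person and, by the fifth, takes longer than one person working alone.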

True story: There was a time I broke some code in production. I needed to fix it immediately, and I wasn’t fixing it fast enough. There were very talented people around me who could help, but they didn’t. They provided the best support they could, and that was in spirit: keeping silent, and dealing with other matters. Because their trying to help would certainly have made things worse, and they knew it.

I can go on and on about time planning, but let’s move on to contingency planning. This seems to be a common sense thing also. Like, if you plan an outdoor event, you know you have to consider a wet-weather plan.

So here’s what’s wrong. People make optimistic plans, plans that have many moving parts with many things that can go wrong, but they are optimistic that everything will just fall into place perfectly. Then, this is their whole plan, no backup, no contingency, no plan B, and they’re going to go with this as if it will 100% work perfectly.

Did you know, for the National Day Parade flag flypast, the RSAF prepares and flies out multiple Chinooks to carry the flag, but ultimately only one appears at the event area? This is a do-or-die attitude that compels these people to make sure their job is performed to perfection.

In IT, we often talk about availability, redundancy, and contingency planning. Our designs need to deal with the fact that things break and things go wrong. We need to do something, whether that is to prevent, to mitigate, or to address the issues head-on. Given these are foreseeable incidents, we don’t want to wait until after the incident happens before we start to digest the issues and then figure out what to do.

Back in 2013, M1’s network broke for three days. An incident occurred in their data centre. One switch broke, and you got three days of outage. M1 assumed that switch would never break. There was no planning for what happens when that switch breaks. I know that there is a backstory and some underlying issues that broke that switch, but fundamentally, the problem here is that M1 didn’t have a sensible plan in place to deal with the incident. I don’t mean to single out M1; it’s just an example of things that happen.
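To put a rough number on why a single point of failure matters, here is a small Python sketch. The 99% availability figure is hypothetical and has nothing to do with M1’s actual equipment, and the formula assumes the redundant units fail independently of each other, which is generous because real failures are often correlated.

```python
def combined_availability(single_availability, redundant_units):
    """Availability of N redundant units, where the service is up if any one is up.

    Assumes independent failures, which is an optimistic simplification.
    """
    return 1 - (1 - single_availability) ** redundant_units


def downtime_hours_per_year(availability):
    """Expected downtime over a 365-day year for a given availability."""
    return (1 - availability) * 365 * 24


for units in (1, 2):
    availability = combined_availability(0.99, units)  # 0.99 is a made-up figure
    hours = downtime_hours_per_year(availability)
    print(f"{units} unit(s): availability {availability:.4%}, about {hours:.1f} h down per year")
```

Even under that optimistic assumption, the point is only that a second unit changes the picture drastically. It says nothing about having a tested plan for the day both break, which is still needed.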

So I know in real life, there are constraints like budget, time, manpower, and various other resources leading to compromises and suboptimal designs. But there are instances when I point out serious deficiencies which are trivial to fix, and people still don’t want to fix them. They think my point is simply unrealistic and will never happen, and then the problem happens. I want to say, “I told you so”, but mostly I’m just plain annoyed because it ends up causing more problems for me.

Just imagine this. I tell someone they need to do A, because if not, problem B will eventually happen. They don’t do A. Problem B does happen. So I tell them, see, I told you so, therefore, please get A done, because now you know problem B will happen. They still don’t do it. Problem B happens yet again. Why?! Getting A done doesn’t even cost money, nor does it consume any other resources that they don’t already have, except that the task may take a few tens of minutes.

Telling people to work with a do-or-die kind of mindset doesn’t get very far. They know they won’t die. They won’t even get fired. In fact, they will make clever excuses about how it is not their fault. They continue their don’t-care attitude, put in the bare minimum effort but talk as if they’ve made such great achievements. It’s all a shell: looks alright on the outside, hollow on the inside, waiting to burst and collapse anytime. Their mindset is, leave the mess to someone else to clean up, because they’ll probably have moved on to the next job by then.

All that common sense is just scratching the surface. I haven’t even begun to talk about things like scalability, manageability, maintainability, etc., that I mentioned in passing earlier. These topics are more IT-oriented, though arguably one can find simple parallels in everyday life to explain them. I don’t want to get into them here, except to say, if those people cannot even handle the common sense above, none of these other things are going to be on their minds.

To be fair, there are inexperienced people who are keen to learn and take heed of these best practices. These are probably in the minority. I appreciate that some people also learn better by experiencing their own failures.

Sadly, most won’t even learn from their own failures, oblivious and happy to just keep repeating the same mistakes over and over again. Doing things right is too hard for them. Much easier to leave the mess to others to handle.
