I recently found myself babysitting a web application project as it went live. It was beset by performance problems as it scaled up to meet real-life usage. This is a pretty common problem. I deliver a guest lecture from time to time on building scalable web application infrastructure, and although the topic is familiar to me, I can see that many people are bewildered by the complexities of real-world production web application infrastructure.
The challenge of building scalable web application infrastructure is that you need domain knowledge that cuts through all layers of the OSI model (or the TCP/IP model if you prefer). Most of the time, application developers are the main drivers of web application projects, but they don’t know much about operating systems, server hardware, storage, and networking. It is usually not their job to know all those things, because there are other people to take care of them.
However, there is a big divide between the people who operate above the operating system level, and the people who work with the operating system or below. The divide often means the two sides don’t communicate on the same terms with each other, and they don’t appreciate the problems experienced by the other.
What caught my interest this time around was syscalls. A syscall (or system call) is a call to a function in the operating system kernel. It sits below the user libraries and the common system libraries, which still run in “userland”. A syscall is more expensive than an ordinary function call because it involves a privileged transfer of control from user code into operating system code.
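To get a rough feel for that extra cost, here is a minimal Python sketch that times an ordinary function call against a call that enters the kernel. The helper names `noop` and `per_call_ns` are mine, and I am assuming `os.getppid()` issues one real syscall per call on your platform. Interpreter overhead blurs the gap considerably, so treat the numbers as indicative only.

```python
import os
import timeit


def noop():
    """An ordinary userland function call: no kernel involvement."""
    return 0


def per_call_ns(func, number=200_000):
    """Average wall-clock cost of one call to func, in nanoseconds."""
    total_seconds = timeit.timeit(func, number=number)
    return total_seconds / number * 1e9


if __name__ == "__main__":
    # Absolute figures vary by machine, kernel, and Python version.
    print(f"plain function call: {per_call_ns(noop):8.1f} ns")
    print(f"os.getppid() call:   {per_call_ns(os.getppid):8.1f} ns")
```

The same comparison written in C would show the difference more starkly, since there is no interpreter sitting in between.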
I was looking into this Drupal application to find out why it was sucking up so much CPU. Web applications are usually not CPU-bound, so this problem was rather unusual. I started to strace the httpd processes, and found that this Drupal application made over 46K syscalls to service a single web request. Sounds horrendous.
I knew 46K was too many syscalls, but I had forgotten what a “right number” looks like. I have a site of my own running on Drupal, so I tested that. This site made over 6K syscalls. That’s almost an order of magnitude fewer. But even then, I thought 6K was still plenty.
For many years, we have been developing our own network management portal. It has grown into a mammoth application, and it has also gotten very old and dated. I tested its mod_perl-based web front end. Nice… just over 500 syscalls to service a single web request. As I looked through the 500 syscalls, I realized that many “wrong things” have crept into our code, but never mind, 500 syscalls is already an order of magnitude fewer than our better Drupal site.
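If you want to try this kind of counting yourself, one way is to save the trace to a file (for example with something like `strace -f -p <pid> -o trace.txt`; the exact invocation is my assumption here, not a record of what I ran) and tally the lines afterwards. The sketch below is a rough Python counter. The names `SYSCALL_LINE` and `count_syscalls` are made up for illustration, and it assumes strace's default line format without timestamp options.

```python
import re
from collections import Counter

# Matches lines such as:
#   open("/etc/passwd", O_RDONLY) = 3
#   1234 read(3, "...", 4096) = 10          (pid prefix when using -f -o)
#   [pid 1234] close(3) = 0                 (pid prefix when using -f)
# Signal lines ("--- SIGCHLD ... ---"), exit lines ("+++ exited ... +++")
# and "<... read resumed>" continuations do not match, so a call that was
# split into unfinished/resumed halves is counted once.
SYSCALL_LINE = re.compile(r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?(\w+)\(")


def count_syscalls(trace_text):
    """Return a Counter mapping syscall name to number of calls."""
    counts = Counter()
    for line in trace_text.splitlines():
        match = SYSCALL_LINE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    # "trace.txt" is a hypothetical file name for the saved trace.
    with open("trace.txt") as trace_file:
        counts = count_syscalls(trace_file.read())
    print("total syscalls:", sum(counts.values()))
    for name, number in counts.most_common(10):
        print(f"{number:8d}  {name}")
```

To get a per-request figure, you would need to start and stop the trace around a single request, or attach to a process you know is serving only that request.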
I next tested a WordPress website I have. It’s actually the old instance of this blog. Nice, 700+ syscalls per request. I’m suddenly not liking Drupal, and liking WordPress a lot more. WordPress does deserve its CMS-of-the-year award.
Just to put things in perspective, the number of syscalls isn’t anywhere near the most important metric to look at. Not all syscalls are equal either. Syscalls that result in disk I/O are generally going to be more expensive (time-wise) than ones that, say, fetch the system time. Still, the syscall count is something worth looking at when you’re trying to squeeze the last drop of performance out of the app.
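Since count alone doesn't tell the whole story, a natural next step is to look at where the time goes. strace's `-T` option appends the time spent inside each syscall to the end of its line, in the form `<0.000123>`. The Python sketch below adds those figures up per syscall name. The names `TIMED_LINE`, `time_by_syscall`, and `slowest` are hypothetical, and for simplicity it skips calls that strace split into "unfinished" and "resumed" halves, so treat the totals as approximate.

```python
import re
from collections import defaultdict

# Matches a complete syscall line that ends with strace -T's duration, e.g.
#   1234 read(3, "...", 4096) = 10 <0.004000>
TIMED_LINE = re.compile(
    r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?(\w+)\(.*<(\d+\.\d+)>\s*$"
)


def time_by_syscall(trace_text):
    """Return a dict mapping syscall name to total seconds spent in it."""
    totals = defaultdict(float)
    for line in trace_text.splitlines():
        match = TIMED_LINE.match(line)
        if match:
            totals[match.group(1)] += float(match.group(2))
    return dict(totals)


def slowest(trace_text, top=5):
    """Return the `top` syscalls by total time, most expensive first."""
    totals = time_by_syscall(trace_text)
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top]
```

Run against a trace captured with `-T`, `slowest()` gives a quick view of whether the time is going into I/O-heavy calls or is spread thinly across thousands of cheap ones.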