Zit Seng's Blog

A Singaporean's technology and lifestyle blog

Scaling Out CodeCrunch

I recently found myself leading an application system project: CodeCrunch. This is an online system for automated assessment of programming tasks. It is designed to help students learn computer programming by providing a web-based system to retrieve programming tasks, submit program solutions, perform automatic assessment, and obtain feedback on testing results. CodeCrunch began as a student Final Year Project; we inherited the system in early 2010 and went live with it in July 2010.

We had pushed out CodeCrunch with the intention that it would one day replace an older automated programming assessment system we’re running. Its two most important advantages are a web-based user interface, and an architecture designed to scale out.

Everyone’s familiar with using web browsers, so clearly any web-based system is going to be much preferred over one that requires client-side software installation. No need to elaborate further here.

The other advantage, scalability, is rather fascinating. One of the limitations of the other system (and possibly many other similar systems used elsewhere) is that it cannot easily scale to accommodate more demanding workloads. Think about a programming lab exam where there might be 500 students submitting programs for automatic assessment. Sure, you can deploy more powerful hardware, faster CPUs, more RAM, etc. But in our past experience, the Achilles heel is working with Java programs. How powerful a system do you need to support, possibly, 500 concurrent Java compilations and testing sequences? What if each JVM required a few hundred MB of memory? At 200–300 MB apiece, 500 concurrent JVMs would need on the order of 100–150 GB of RAM on a single box.

The only solution, really, is for the system to scale out to a distributed architecture. Better yet if the workload can be scheduled asynchronously. This is, in my opinion, the biggest strong point of CodeCrunch’s system architecture.
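
To make the asynchronous scheduling idea concrete, here is a minimal sketch in Python. It is purely illustrative and not CodeCrunch’s actual code: the web tier only records a submission and returns immediately, while a fixed pool of workers pulls jobs off a queue at its own pace, so a burst of submissions lengthens the queue instead of overwhelming the compute nodes.

    # Illustrative sketch only -- not CodeCrunch's implementation.
    # The web tier enqueues submissions; worker nodes dequeue and assess them.
    import queue
    import threading
    import time

    job_queue = queue.Queue()  # stands in for the central scheduler

    def submit(student_id, source_file):
        """Called by the web tier: record the job and return immediately."""
        job_queue.put({"student": student_id, "source": source_file})
        print(f"queued submission from {student_id}")

    def worker(worker_id):
        """Runs on a compute node: compile, test, and record the result."""
        while True:
            job = job_queue.get()
            time.sleep(0.5)  # placeholder for the real compile/run/assess steps
            print(f"worker {worker_id} assessed {job['student']}")
            job_queue.task_done()

    # Six workers, mirroring the six compute nodes described below.
    for i in range(6):
        threading.Thread(target=worker, args=(i,), daemon=True).start()

    for s in range(20):
        submit(f"student{s:02d}", f"solution{s:02d}.java")

    job_queue.join()  # wait until every queued submission has been assessed

The point is that the queue decouples how fast submissions arrive from how fast they are graded; adding capacity is simply a matter of starting more workers.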

Today, the CodeCrunch cluster comprises:

  • One cluster head node: Database, scheduler, and central storage.
  • 6x worker nodes: These are the compute nodes doing the grunt work of program compilation, testing, and assessment.
  • 2x web server nodes: The web servers have very light load, but we run two simply for redundancy purposes.

The cluster is fronted by a pair of load balancers, which also serve to terminate and accelerate SSL connections. On top of the failover/high-availability provided by the load balancers, the SSL termination is particularly helpful, because it relieves the web servers of encryption/decryption overhead. (Our load balancers also perform on-the-fly content compression.)

The distributed design allows the system to scale out to additional nodes rather trivially when workload demands it. The nature of asynchronous job scheduling also means that the system will not work itself to death if, for any reason, it cannot keep up with the workload thrown at it.

An interesting phenomenon of web application systems is that, if not properly designed, they often run the danger of killing themselves as they near peak capacity. Here’s how it typically happens. As the rate of web requests increases, responses get slower because of CPU bottlenecks, storage I/O bottlenecks, database lock contention, etc. As responses get slower and slower, at some point users get “fed up” and click reload in their web browsers. The web server doesn’t necessarily know that the previous request has been “abandoned” and simply sees the reloads as additional web requests, thus further worsening the situation. If not properly managed, the problem simply escalates until the site kills itself.

The problem is not actually as easy to solve as simply adding more nodes to the distributed system: at some point, your distributed system will also get overloaded. The challenge is really to get the system to “fail” gracefully.
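
One way to fail gracefully, sketched below with made-up names and limits rather than anything taken from CodeCrunch itself, is to cap the amount of pending work the system will accept: once the queue is full, new submissions get an honest “try again later” response instead of an ever-slower page that tempts the user to hit reload.

    # Illustrative sketch only -- the capacity limit and messages are invented.
    import queue

    MAX_PENDING = 1000                       # assumed queue capacity
    job_queue = queue.Queue(maxsize=MAX_PENDING)

    def submit(job):
        """Return an HTTP-style (status, message) pair for the web tier."""
        try:
            job_queue.put_nowait(job)
            return 202, "Submission accepted; results will appear shortly."
        except queue.Full:
            # Shed load early: a prompt, honest rejection beats a slow page
            # that invites another reload and makes the overload worse.
            return 503, "The graders are busy; please try again in a minute."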

CodeCrunch’s nodes are, actually, not real physical hosts. They are virtual machines running on blade servers (like the one pictured above). This allows us to easily move VM instances from a busier physical host to a less busy one when needed. It also allows us to easily manage hardware failures. Better yet, we can even spawn additional VM instances during peak usage, or shut down some VM instances and consolidate others onto fewer physical hosts during lull periods. Oh yes, that’s what some vendors are selling as being “green”.

We use MySQL for our database, and like most MySQL users, we have not gone down the path of MySQL Cluster. This means that the database is both a potential performance bottleneck and a single point of failure. Not too urgent at this time, but at some point we’ll need to look into this.

Another concern is storage. There is currently a single central storage volume. We can easily increase the volume size without too much trouble; in fact, I just did that this morning (JADDOG’s guide was helpful). But this central storage volume is also a single point of failure.

Not all problems need to be fixed, of course. MySQL and storage are concerns, but it may turn out that solving them is not worthwhile for our purposes.
