15 Disks Dropped Out

We run some storage pods built on the Backblaze design. Our latest build is a chassis holding 45 3TB drives, giving us a total raw disk capacity of 135TB. It’s a nice system that provides plenty of storage on the cheap, and the mirror service we run at http://mirror.nus.edu.sg/ uses one of these pods as the backend for its mirror contents. It’s all very nice when everything works.

Then something breaks. Single disk failures are not uncommon, and sometimes more than one disk fails. With RAID, small failures like these typically don’t disrupt the storage service. But when 15 disks drop out at once, we have a problem. One of the RAID controller cards failed, taking 15 disks with it. In case you’re wondering, that is the root cause of our mirror service going down today.

Fortunately we had another pod to cannibalise, so we were able to replace the RAID controller card. The ZFS pool will still need some time to be scrubbed, though.
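To see why a controller failure is so different from a few scattered disk failures, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the split of 45 disks across three controllers of 15 disks each, both pool layouts, and the helper names `controller_of` and `survives_controller_loss`. It is not how our pool is actually laid out.

```python
# Assumed for illustration: 45 disks on 3 controllers, 15 disks per controller.
DISKS_PER_CONTROLLER = 15
CONTROLLERS = 3


def controller_of(disk: int) -> int:
    """Map a disk index (0..44) to the controller it hangs off (hypothetical)."""
    return disk // DISKS_PER_CONTROLLER


def survives_controller_loss(vdevs, parity, failed_controller):
    """Return True if every vdev loses no more disks than its parity allows.

    vdevs: list of lists of disk indices, one inner list per redundancy group.
    parity: number of disks each group can lose without losing data.
    """
    for vdev in vdevs:
        lost = sum(1 for d in vdev if controller_of(d) == failed_controller)
        if lost > parity:
            return False
    return True


# Layout A (hypothetical): three 15-disk groups, each entirely on one controller.
per_controller = [
    list(range(c * DISKS_PER_CONTROLLER, (c + 1) * DISKS_PER_CONTROLLER))
    for c in range(CONTROLLERS)
]

# Layout B (hypothetical): fifteen 3-disk groups, one disk from each controller.
spread = [[i, i + 15, i + 30] for i in range(DISKS_PER_CONTROLLER)]

# Double parity does not help when a whole group sits behind the dead card.
print(survives_controller_loss(per_controller, parity=2, failed_controller=0))  # False

# Single parity is enough when each group loses only one disk.
print(survives_controller_loss(spread, parity=1, failed_controller=0))  # True
```

The point is only that redundancy counts failures per group, so what matters is how many disks in each group share a single point of failure, not how many disks fail in total. Layout B buys that tolerance at the cost of a third of the raw capacity.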

Some day we should build a large federated storage system that easily tolerates much more than a pod-level failure, perhaps even a site-level outage. Easier said than done, of course. Then there’s the question of whether it is worthwhile at all. It’s easy to say we want to provide a certain level of HA/resiliency for all our data. But honestly, not all our data requires the same level of HA/resiliency.

It’s much like how many enterprises build Tier-3 or Tier-4 data centres and then throw all their servers in there. However, not all servers, or the services they host, require that level of resiliency. In fact, no matter how good your site is, you always have to plan for a site-level outage, which means designing your application or service to tolerate one. And if you have built your application or service that way, you no longer need a Tier-3 or Tier-4 data centre.

Now, what would you do with your “personal data”? I’m referring to the stuff you keep on your own computer: your photos, your documents, your stuff. What sort of scalability and backup plans do you have?

I use the cloud, but I’m not one to trust the cloud. The cloud can die. AWS has had data centres go offline several times, albeit not for very long. AWS has also lost customer data before. So you see why I hold the view that we are each ultimately responsible for our own data. Use the cloud, but don’t depend on the cloud.