One would reasonably expect a server operating system to be stable and robust (and secure too, among other things). When you apply patches or upgrade system software components, you expect things to “just work”. Here, I have a Linux system, running CentOS. I picked a server distribution of Linux because, well, I want the system to be relatively pain-free to maintain. I’ve no need for cutting-edge software.
So I was surprised that, after a routine yum update run, which included a kernel update, my server failed to come back online. To make a long story short, my server has an Intel 82574L NIC, and I have this bug:
(The photo above is just a file photo, not one of the actual server in question.)
The solution is either to set pcie_aspm=off in the kernel boot options, or to apply an EEPROM patch to the NIC.
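For anyone who wants to check a machine before rebooting into a new kernel, here is a minimal sketch in Python of the check I would have liked to run. It assumes Python 3.7 or later and that the lspci tool is installed. The function names are my own, made up for illustration, and matching on the string "82574L" in the lspci output is an assumption about how the NIC is listed. It only reports whether the workaround appears to be needed; it does not change any boot options.

```python
import subprocess
from pathlib import Path


def has_82574l(lspci_output: str) -> bool:
    """Return True if any line of lspci output mentions an 82574L controller."""
    return any("82574L" in line for line in lspci_output.splitlines())


def aspm_disabled(cmdline: str) -> bool:
    """Return True if pcie_aspm=off is one of the kernel boot parameters."""
    return "pcie_aspm=off" in cmdline.split()


def needs_workaround(lspci_output: str, cmdline: str) -> bool:
    """True when the NIC is present and the running kernel was booted without pcie_aspm=off."""
    return has_82574l(lspci_output) and not aspm_disabled(cmdline)


if __name__ == "__main__":
    try:
        # /proc/cmdline holds the boot options of the currently running kernel.
        cmdline = Path("/proc/cmdline").read_text()
        lspci = subprocess.run(
            ["lspci"], capture_output=True, text=True, check=True
        ).stdout
    except (OSError, subprocess.CalledProcessError) as exc:
        print(f"Could not inspect this system: {exc}")
    else:
        if needs_workaround(lspci, cmdline):
            print("Intel 82574L found and pcie_aspm=off is not set: watch out.")
        else:
            print("No 82574L found, or pcie_aspm=off is already set.")
```

Note that this looks at the options the current kernel was booted with. The boot loader entry for a newly installed kernel still has to carry pcie_aspm=off for the workaround to survive the next reboot.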
I’m a little miffed at the time lost. Luckily, the server in question isn’t really a production one. But this could easily have happened with a server that was a lot more important. Then again, one should test updates and patches before rolling them out to production servers. However, many of us don’t have the luxury of testbeds with test servers matching every type and configuration that we run in production.
That’s why we go with server distributions, hoping things are a lot more stable and better tested. It’s a shortcut to help reduce our workload. This bug, apparently, has been around for a couple of months. The CentOS bug report was filed on 3 Dec 2013. The EEPROM fix mentioned above is from April 2012. Yes, a pretty long time ago. Of course, it seems the problem is with the NIC. But perhaps the kernel and/or driver could have been helpful enough to either auto-fix the problem, or at least warn about it via syslog.
Anyhow, now you know. If you have an Intel 82574L, use CentOS, and are upgrading the kernel… watch out for this.