Some of us have been using computers for a long time. If you’ve been through the era of, say, Windows 95, you’d think computer crashes are just part and parcel of using a computer. However, crashes are far less frequent these days. On the odd occasion when one happens, there is usually an accompanying, though possibly useless, error message or explanation.
Nowadays, it is not so common for a computer to lock up without rhyme or reason. But if you have the misfortune of experiencing seemingly random crashes or lock ups with no traceable cause, then it is time to look at possible hardware faults.
I still experience these random lock ups on my Linux computer at home. Considering that the Linux box is not used very much, and further that the Linux operating system ought to be rather robust, such lock ups are really unusual. I will come home now and then to find that the computer isn’t reachable on the network. There is no display, and there is no response to keyboard and mouse. The only thing that works is the hardware reset button. Upon boot up, I’d find absolutely nothing useful in the logs.
In the old days, one of the common reasons for these lock ups was poor or flaky electrical connections between various components of the computer. For example, RAM modules could have worked themselves loose. The remedy is simple. Remove RAM modules, cards, cables, etc. Clean the connectors and contacts. Reinsert them and make sure they are seated properly. This simple task often fixes lots of computers.
That didn’t work for me. My Linux box still locks up now and then. I removed an old hard disk drive that had been failing before. It didn’t help.
The time came to do something I had been reluctant to do: test for bad RAM. Testing for bad RAM means I have to shut down my Linux box. I run it as a server, hence I hadn’t been very eager to plunge into RAM testing. Besides, RAM tests often take a long time to run. Furthermore, what are the odds that I really have bad RAM, when my Linux box could still run pretty much fine most of the time, sometimes for a whole month without incident?
Well, I shut down Linux, and rebooted into Memtest86+. Thankfully Ubuntu, which is what I run at home, included Memtest86+ in its installation, and it was an easy selection from the GRUB menu.
Lo and behold. I really had bad RAM. I retested twice just to be sure.
Fortunately, GRUB has a built-in feature for excluding bad RAM. You tell GRUB about the bad RAM, and GRUB filters those regions out of the memory map it hands to the Linux kernel, so the kernel never allocates from them.
Since it was an Ubuntu system I was fixing, here’s what to do in Ubuntu.
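To make the idea concrete: bad regions are written as pairs of hexadecimal numbers, an address followed by a mask. As I understand the format, a physical address counts as bad when it agrees with the pattern address on every bit that is set in the mask. Here is a minimal sketch of that rule in shell. The function name and the sample values are made up for illustration; they are not from my own test results, and this is not how GRUB itself is implemented.

```shell
#!/bin/sh
# Sketch of the BadRAM matching rule: an address is considered bad when it
# agrees with the pattern address on every bit that is set in the mask.
# is_bad_address is a hypothetical helper, not part of GRUB or Memtest86+.
# Usage: is_bad_address ADDRESS PATTERN_ADDRESS MASK
is_bad_address() {
    [ $(( $1 & $3 )) -eq $(( $2 & $3 )) ]
}

# Made-up pattern: 0x12345000 with mask 0xfffff000 covers one 4 KiB page.
if is_bad_address 0x12345abc 0x12345000 0xfffff000; then
    echo "0x12345abc is inside the bad page"
fi
if ! is_bad_address 0x12346000 0x12345000 0xfffff000; then
    echo "0x12346000 is outside the bad page"
fi
```

A mask with more zero bits covers a larger region, so a single pair can describe many failing addresses at once.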
- Run Memtest86+.
- Press c for Configuration. Then 4 for Error Report Mode, then 3 for BadRAM Patterns. This mode makes it easy to use in Linux boot parameters.
- Memtest86+ will report errors in a line format suitable for you to give to Grub below.
- You may want to rerun the test two or three times to be sure the error is always at the same address(es).
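- Boot into Ubuntu. Edit /etc/default/grub, and look for the GRUB_BADRAM line, which is normally commented out. Uncomment it and replace its value with whatever Memtest86+ printed after badram= in its output.
- Run: sudo update-grub, then reboot so that the new setting takes effect.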
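The last two steps can be sketched as a small shell helper. This is only an illustration under a few assumptions: the function name set_badram is my own, the pattern shown is a made-up placeholder rather than real Memtest86+ output, and the sketch works on a scratch copy so that nothing is changed until you have looked at the result.

```shell
#!/bin/sh
# set_badram FILE PATTERN: set GRUB_BADRAM in FILE, replacing an existing
# (possibly commented-out) line, or appending one if none is present.
# Hypothetical helper for illustration only.
set_badram() {
    file=$1
    pattern=$2
    if grep -q '^#* *GRUB_BADRAM=' "$file"; then
        # Rewrite through a temporary file, then copy the contents back.
        sed "s|^#* *GRUB_BADRAM=.*|GRUB_BADRAM=\"$pattern\"|" "$file" > "$file.new" &&
            cat "$file.new" > "$file" &&
            rm -f "$file.new"
    else
        printf 'GRUB_BADRAM="%s"\n' "$pattern" >> "$file"
    fi
}

# Work on a scratch copy first; the pattern below is a placeholder.
scratch=$(mktemp)
if [ -r /etc/default/grub ]; then
    cp /etc/default/grub "$scratch"
fi
set_badram "$scratch" "0x12345000,0xfffff000"
grep '^GRUB_BADRAM=' "$scratch"

# Once the result looks right, copy it into place and regenerate the menu:
#   sudo cp "$scratch" /etc/default/grub
#   sudo update-grub
```

Keeping the real file untouched until the end means a typo in the pattern cannot leave the machine with a broken GRUB configuration.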
So far my Linux box is working okay. I really do hope the random lock ups are gone for good.