Hardware Troubleshooting
For the most part you will probably spend your time troubleshooting host or network issues. After all, hardware is usually pretty obvious when it fails. A hard drive will completely crash; a CPU will likely take the entire system down. There are, however, a few circumstances when hardware doesn’t completely fail and as a result causes random strange behavior. Here I will describe how to test a few hardware components for errors.
Network Card Errors
When a network card starts to fail, it can be rather unnerving as you will try all sorts of network troubleshooting steps to no real avail. Often when a network card or some other network component to which your host is connected starts to fail, you can see it in packet errors on your system. The ifconfig command we used for network troubleshooting before can also tell you about TX (transmit) or RX (receive) errors for a card. Here’s an example from a healthy card:
$ sudo ifconfig eth0
eth0      Link encap:Ethernet  HWaddr 00:17:42:1f:18:be
          inet addr:10.1.1.7  Bcast:10.1.1.255  Mask:255.255.255.0
          inet6 addr: fe80::217:42ff:fe1f:18be/64 Scope:Link
          UP BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:1 errors:0 dropped:0 overruns:0 frame:0
          TX packets:11 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:229 (229.0 B)  TX bytes:2178 (2.1 KB)
          Interrupt:10
The lines you are most interested in are
RX packets:1 errors:0 dropped:0 overruns:0 frame:0
TX packets:11 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
These lines will tell you about any errors on the device. If you start to see lots of errors here, then it’s worth troubleshooting your physical network components. It’s possible a network card, cable, or switch port is going bad.
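If you would rather not read the counters by eye, the check can be scripted. The following is a minimal Python sketch, not part of any standard tool: the function names are hypothetical, and it assumes the older ifconfig layout shown above, with one "RX packets:" line and one "TX packets:" line per interface. Newer versions of ifconfig print the counters differently, so the patterns would need adjusting there.

```python
import re

# Matches the counter lines of the older ifconfig layout, for example:
#   RX packets:1 errors:0 dropped:0 overruns:0 frame:0
COUNTER_LINE = re.compile(r"\b(RX|TX) packets:(\d+)\s+(.*)")
FIELD = re.compile(r"(\w+):(\d+)")


def parse_ifconfig_counters(text):
    """Return {"RX": {...}, "TX": {...}} with integer counters per direction."""
    counters = {}
    for line in text.splitlines():
        match = COUNTER_LINE.search(line)
        if not match:
            continue  # not a packet counter line (e.g. "RX bytes:...")
        direction, packets, rest = match.groups()
        fields = {"packets": int(packets)}
        for name, value in FIELD.findall(rest):
            fields[name] = int(value)
        counters[direction] = fields
    return counters


def nonzero_error_counters(counters):
    """Return every counter other than the packet count that is above zero."""
    problems = {}
    for direction, fields in counters.items():
        for name, value in fields.items():
            if name != "packets" and value > 0:
                problems[f"{direction} {name}"] = value
    return problems
```

To use it, pass in the text printed by ifconfig eth0, for example the stdout of subprocess.run(["ifconfig", "eth0"], capture_output=True, text=True). An empty result from nonzero_error_counters means the card reported no errors. Since the counters are cumulative, what matters in practice is whether they keep climbing between two runs.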
Test Hard Drives
Of all of the hardware on your system, your hard drives are the components most likely to fail. Most hard drives these days support SMART, a system that can predict when a hard drive failure is imminent. To test your drives, first install the smartmontools package (sudo apt-get install smartmontools). Next, to test a particular drive’s health, pass the smartctl tool the -H option along with the device to scan. Here’s an example from a healthy drive:
$ sudo smartctl -H /dev/sda
smartctl version 5.37 [i686-pc-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

SMART Health Status: OK
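The same check can be wrapped in a small script, for instance to loop over several drives. This is a rough Python sketch under stated assumptions: the function names are hypothetical, and it recognizes only two result lines, the "SMART Health Status" line shown above and the "SMART overall-health self-assessment test result" line that smartctl prints for ATA drives. The exact wording can vary by drive type and smartctl version, so treat a result of None as "could not tell" rather than as a failure.

```python
import re
import subprocess

# Result lines this sketch knows about; other wordings are reported as None.
HEALTH_LINE = re.compile(
    r"^SMART (?:Health Status|overall-health self-assessment test result):\s*(.+)$",
    re.MULTILINE,
)
HEALTHY_VALUES = {"OK", "PASSED"}


def parse_smart_health(output):
    """Return True if healthy, False if not, None if no health line is found."""
    match = HEALTH_LINE.search(output)
    if match is None:
        return None
    return match.group(1).strip() in HEALTHY_VALUES


def check_drive(device):
    """Run smartctl -H on the device (requires root) and interpret the output."""
    result = subprocess.run(
        ["smartctl", "-H", device], capture_output=True, text=True
    )
    return parse_smart_health(result.stdout)
```

Run as root, check_drive("/dev/sda") would return True for the healthy drive in the example output.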
Checking a drive by hand can be useful when a particular drive is suspect, but generally speaking, it is better to have your drives’ health monitored constantly, with any problems reported to you. The smartmontools package is already set up for this purpose. All you need to do is open the /etc/default/smartmontools file in a text editor and uncomment the line that says
#start_smartd=yes
so that it looks like
start_smartd=yes
Then the next time the system reboots, smartd will launch automatically. Any errors will be e-mailed to the root user on the system. If you want to manually start the service, you can type sudo service smartmontools start or sudo /etc/init.d/smartmontools start.
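If you manage several servers, you may prefer to script the edit instead of opening an editor on each one. Here is a minimal Python sketch of the uncommenting step. The enable_smartd function is a hypothetical helper, and it assumes the file contains the commented line exactly as shown above. It works on text only, so reading and writing /etc/default/smartmontools (which requires root) is left to the caller.

```python
import re


def enable_smartd(config_text):
    """Uncomment the start_smartd=yes line and leave every other line untouched."""
    return re.sub(
        r"^[ \t]*#[ \t]*(start_smartd=yes)[ \t]*$",
        r"\1",
        config_text,
        flags=re.MULTILINE,
    )
```

Text in which the line is already uncommented passes through unchanged, so running the helper twice is harmless.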
Test RAM
Some of the most irritating types of errors to troubleshoot are those caused by bad RAM. Often errors in RAM cause random mayhem on your machine with programs crashing for no good reason, or even random kernel panics. Ubuntu ships with an easy-to-use RAM testing tool called Memtest86+ that is not only installed by default, it’s ready as a boot option. At boot time, hit the Esc key to see the full boot menu. One of the options in the GRUB menu will be labeled Memtest86+. Select that option and Memtest86+ will immediately launch and start scanning your RAM, as shown in Figure 11-1.
Figure 11-1 Memtest86+ RAM scan
Memtest86+ runs through a number of exhaustive tests that can identify different types of RAM errors. On the top right-hand side you can see which test is currently being run along with its progress, and in the Pass field you can see how far along you are with the complete test. A thorough memory test can take hours to run, and I know some administrators with questionable RAM who let the test run overnight or over multiple days if necessary to get more than one complete test through. If Memtest86+ does find any errors, they will be reported in the results output at the bottom of the screen.