- General Troubleshooting Philosophy
- Localhost Troubleshooting
- Network Troubleshooting
- Hardware Troubleshooting
Localhost Troubleshooting
While I would say that a majority of problems you will find on a server have some basis in networking, there is still a class of issues that involves only the localhost. What makes this tricky is that some local and networking problems often create the same set of symptoms, and in fact local problems can create network problems and vice versa. In this section I will cover problems that occur specifically on a host and leave issues that impact the network to the next section.
Host Is Sluggish or Unresponsive
Probably one of the most common problems you will find on a host is that it is sluggish or completely unresponsive. Often this can be caused by network issues, but here I will discuss some local troubleshooting tools you can use to tell the difference between a loaded network and a loaded machine.
When a machine is sluggish, it is often because you have consumed all of a particular resource on the system. The main resources are CPU, RAM, disk I/O, and network (which I will leave to the next section). Overuse of any of these resources can cause a system to bog down to the point that often the only recourse is your last resort—a reboot. If you can log in to the system, however, there are a number of tools you can use to identify the cause.
System Load
System load average is probably the fundamental metric you start from when troubleshooting a sluggish system. One of the first commands I run when I’m troubleshooting a slow system is uptime:
$ uptime
 13:35:03 up 103 days, 8 min, 5 users, load average: 2.03, 20.17, 15.09
The three numbers after load average, 2.03, 20.17, and 15.09, represent the 1-, 5-, and 15-minute load averages on the machine, respectively. A system load average is the average number of processes in a runnable or uninterruptible state. Runnable processes are either currently using the CPU or waiting to do so, and uninterruptible processes are waiting for I/O. A single-CPU system with a load average of 1 means the single CPU is under constant load. If that single-CPU system has a load average of 4, it has four times as much work as it can handle, so three out of every four runnable processes are waiting for resources. The load average is not scaled by the number of CPUs in the system, so if you have a two-CPU system with a load average of 1, one of your two CPUs is busy at all times; in other words, you are 50% loaded. Put another way, a load of 1 on a single-CPU system uses the same share of available resources as a load of 4 on a four-CPU system.
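To interpret a load average you first need to know how many CPUs the machine has. Two quick ways to check (a minimal sketch; nproc is part of coreutils and may not exist on very old releases, while /proc/cpuinfo is always available):

$ nproc
$ grep -c ^processor /proc/cpuinfo

Both commands report the number of CPUs the kernel sees, so a load average at or below that number generally means the processors are keeping up with the work.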
The 1-, 5-, and 15-minute load averages describe the average load over that respective period of time and are valuable when you try to determine the current state of a system. The 1-minute load average gives you a good sense of what is happening on the system right now. In the previous example you can see that the load over the last minute was about 2, while the average over the last 5 minutes was 20 and the average over the last 15 minutes was 15. This tells me that the machine has been under high load for at least 15 minutes, that the load climbed further around 5 minutes ago, and that it has subsided within the last minute. Let's compare this with a completely different load average:
$ uptime
 05:11:52 up 20 days, 55 min, 2 users, load average: 17.29, 0.12, 0.01
In this case both the 5- and 15-minute load averages are low, but the 1-minute load average is high, so I know that this spike in load is relatively recent. Often in this circumstance I will run uptime multiple times in a row (or use a tool like top, which I will discuss in a moment) to see whether the load is continuing to climb or is on its way back down.
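If you would rather not retype uptime by hand, the watch utility (part of the procps package on Ubuntu) can rerun it for you at a fixed interval; this is just one convenient way to follow the trend:

$ watch -n 5 uptime

This reruns uptime every 5 seconds until you press Ctrl-C, so you can see at a glance whether the 1-minute average is climbing or falling.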
What Is a High Load Average?
A fair question to ask is what load average you consider to be high. The short answer is “It depends on what is causing it.” Since the load describes the average number of active processes that are using resources, a spike in load could mean a few things. What is important to determine is whether the load is CPU-bound (processes waiting on CPU resources), RAM-bound (specifically, high RAM usage that has moved into swap), or I/O-bound (processes fighting for disk or network I/O).
For instance, if you run an application that generates a high number of simultaneous threads at different points, and all of those threads are launched at once, you might see your load spike to 20, 40, or higher as they all compete for system resources. As they complete, the load might come right back down. In my experience systems seem to be more responsive when under CPU-bound load than when under I/O-bound load. I’ve seen systems with loads in the hundreds that were CPU-bound, and I could run diagnostic tools on those systems with pretty good response times. On the other hand, I’ve seen systems with relatively low I/O-bound loads on which just logging in took a minute, since the disk I/O was completely saturated. A system that runs out of RAM resources often appears to have I/O-bound load, since once the system starts using swap storage on the disk, it can consume disk resources and cause a downward spiral as processes slow to a halt.
top
One of the first tools I turn to when I need to diagnose high load is top. I have discussed the basics of how to use the top command in Chapter 2, so here I will focus more on how to use its output to diagnose load. The basic steps are to examine the top output to identify what resources you are running out of (CPU, RAM, disk I/O). Once you have figured that out, you can try to identify what processes are consuming those resources the most. First let’s examine some standard top output from a system:
top - 14:08:25 up 38 days, 8:02, 1 user, load average: 1.70, 1.77, 1.68
Tasks: 107 total, 3 running, 104 sleeping, 0 stopped, 0 zombie
Cpu(s): 11.4%us, 29.6%sy, 0.0%ni, 58.3%id, 0.7%wa, 0.0%hi, 0.0%si, 0.0%st
Mem:   1024176k total,   997408k used,    26768k free,    85520k buffers
Swap:  1004052k total,     4360k used,   999692k free,   286040k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 9463 mysql     16   0  686m 111m 3328 S   53  5.5 569:17.64 mysqld
18749 nagios    16   0  140m 134m 1868 S   12  6.6 1345:01   nagios2db_status
24636 nagios    17   0 34660  10m  712 S    8  0.5 1195:15   nagios
22442 nagios    24   0  6048 2024 1452 S    8  0.1   0:00.04 check_time.pl
The first line of output is the same as you would see from the uptime command. As you can see in this case, the machine isn’t too heavily loaded for a four-CPU machine:
top - 14:08:25 up 38 days, 8:02, 1 user, load average: 1.70, 1.77, 1.68
top provides you with extra metrics beyond standard system load, though. For instance, the Cpu(s) line gives you information about what the CPUs are currently doing:
Cpu(s): 11.4%us, 29.6%sy, 0.0%ni, 58.3%id, 0.7%wa, 0.0%hi, 0.0%si, 0.0%st
These abbreviations may not mean much if you don’t know what they stand for, so I will break down each of them below.
- us: user CPU time. This is the percentage of CPU time spent running users' processes that aren't niced (nicing a process allows you to change its priority in relation to other processes).
- sy: system CPU time. This is the percentage of CPU time spent running the kernel and kernel processes.
- ni: nice CPU time. If you have user processes that have been niced, this metric will tell you the percentage of CPU time spent running them.
- id: CPU idle time. This is one of the metrics that you want to be high. It represents the percentage of CPU time that is spent idle. If you have a sluggish system but this number is high, you know the cause isn't high CPU load.
- wa: I/O wait. This is the percentage of CPU time that is spent waiting for I/O. It is a particularly valuable metric when you are tracking down the cause of a sluggish system, because if this value is low, you can pretty safely rule out disk or network I/O as the cause.
- hi: hardware interrupts. This is the percentage of CPU time spent servicing hardware interrupts.
- si: software interrupts. This is the percentage of CPU time spent servicing software interrupts.
- st: steal time. If you are running virtual machines, this metric tells you the percentage of CPU time that was stolen from you for other tasks.
In my previous example you can see that the system is over 50% idle, which matches a load of 1.70 on a four-CPU system. When I diagnose a slow system, one of the first values I look at is I/O wait so I can rule out disk I/O. If I/O wait is low, then I can look at the idle percentage. If I/O wait is high, then the next step is to diagnose what is causing high disk I/O, which I will cover below. If I/O wait and idle time are low, then you will likely see a high user time percentage, so you will need to diagnose what is causing high user time. If the I/O wait is low and the idle percentage is high, you then know any sluggishness is not because of CPU resources and will have to start troubleshooting elsewhere. This might mean looking for network problems, or in the case of a Web server looking at slow queries to MySQL, for instance.
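If you want to capture this CPU breakdown non-interactively, for instance to save it to a file or compare it before and after a change, top's batch mode is one way to do it (a small sketch using the standard -b and -n options of the procps top shipped with Ubuntu):

$ top -bn1 | head -5

This prints a single snapshot of the load, task, CPU, and memory summary lines and then exits rather than running interactively.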
Diagnose High User Time
A common and relatively simple problem to diagnose is high load due to a high percentage of user CPU time. This is common since the services on your server are likely to take the bulk of the system load and they are user processes. If you see high user CPU time but low I/O wait times, you simply need to identify which processes on the system are consuming the most CPU. By default, top will sort all of the processes by their CPU usage:
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 9463 mysql     16   0  686m 111m 3328 S   53  5.5 569:17.64 mysqld
18749 nagios    16   0  140m 134m 1868 S   12  6.6 1345:01   nagios2db_status
24636 nagios    17   0 34660  10m  712 S    8  0.5 1195:15   nagios
22442 nagios    24   0  6048 2024 1452 S    8  0.1   0:00.04 check_time.pl
In this example the mysqld process is consuming 53% of the CPU and the nagios2db_status process is consuming 12%. Note that this is the percentage of a single CPU, so if you have a four-CPU machine you could possibly see more than one process consuming 99% CPU.
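If you prefer a one-shot listing outside of top, ps can produce a similar view; here is a minimal sketch using standard ps format and sort options:

$ ps -eo pid,user,pcpu,comm --sort=-pcpu | head

This lists processes with the heaviest CPU consumers at the top. Keep in mind that ps reports CPU usage averaged over the lifetime of each process, not the instantaneous value that top shows.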
The most common high-CPU-load situations you will see involve either one or two processes consuming all of the CPU, or a large number of processes each taking a share. Either case is easy to identify: in the first case, the top process or two will show a very high CPU percentage and the rest will be relatively low. To solve the issue you could simply kill the offending process from within top (press k and then type in the PID of the process).
In the case of multiple processes, you might simply have a case of one system doing too many things. You might, for instance, have a large number of Apache processes running on a Web server along with some log parsing scripts that run from cron. All of these processes might be consuming more or less the same amount of CPU. The solution to problems like this can be trickier for the long term, as in the Web server example you do need all of those Apache processes to run, yet you might need the log parsing programs as well. In the short term you can kill (or possibly postpone) some processes until the load comes down, but in the long term you might need to consider increasing the resources on the machine or splitting some of the functions across more than one server.
Diagnose Out-of-Memory Issues
The next two lines in the top output provide valuable information about RAM usage. Before diagnosing specific system problems, it’s important to be able to rule out memory issues.
Mem:   1024176k total,   997408k used,    26768k free,    85520k buffers
Swap:  1004052k total,     4360k used,   999692k free,   286040k cached
The first line tells me the total amount of physical RAM along with how much is used, free, and devoted to buffers. The second line gives similar information about swap usage, along with how much RAM is used by the Linux file cache. At first glance it might look as if the system is almost out of RAM, since it reports only 26,768k free. A number of beginner sysadmins are misled by the used and free figures because of the Linux file cache. Once Linux loads a file into RAM, it doesn't necessarily remove it from RAM when a program is done with it. If there is RAM available, Linux keeps the file cached so that the next program to access it can do so much more quickly. If the system does need RAM for active processes, it simply won't cache as many files.
To find out how much RAM is really being used by processes, you must subtract the file cache from the used RAM. In the example above, out of the 997,408k RAM that is used, 286,040k is being used by the Linux file cache, so that means that only 711,368k is actually being used.
In my example the system still has plenty of available memory and is barely using any swap at all. Even if you do see some swap being used, it is not necessarily an indicator of a problem. If a process becomes idle, Linux will often page its memory to swap to free up RAM for other processes. A good way to tell whether you are running out of RAM is to look at the file cache. If your actual used memory minus the file cache is high, and the swap usage is also high, you probably do have a memory problem.
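The free command will do this arithmetic for you. Depending on your version of procps, it either prints a "-/+ buffers/cache" line or an "available" column, so your output may look slightly different from mine:

$ free -m

Look at the used value on the "-/+ buffers/cache" line (or the available column on newer versions); that is the number to compare against total RAM when deciding whether you are genuinely low on memory.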
If you do find you have a memory problem, the next step is to identify which processes are consuming RAM. top sorts processes by their CPU usage by default, so you will want to change this to sort by RAM usage instead. To do this, keep top open and hit the M key on your keyboard. This will cause top to sort all of the processes on the page by their RAM usage:
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
18749 nagios    16   0  140m 134m 1868 S   12  6.6 1345:01   nagios2db_status
 9463 mysql     16   0  686m 111m 3328 S   53  5.5 569:17.64 mysqld
24636 nagios    17   0 34660  10m  712 S    8  0.5 1195:15   nagios
22442 nagios    24   0  6048 2024 1452 S    8  0.1   0:00.04 check_time.pl
Look at the %MEM column and see if the top processes are consuming a majority of the RAM. If you do find the processes that are causing high RAM usage, you can decide to kill them, or, depending on the program, you might need to perform specific troubleshooting to find out what is making that process use so much RAM.
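As with CPU, you can also get a one-shot list of the biggest memory consumers from the command line; a minimal sketch using standard ps options:

$ ps -eo pid,user,pmem,rss,comm --sort=-rss | head

The rss column shows the resident memory of each process in kilobytes, so the processes at the top of this list are the ones holding the most RAM.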
OOM Killer
The Linux kernel also has an out-of-memory (OOM) killer that can kick in if the system runs dangerously low on RAM. When a system is almost out of RAM, the OOM killer will start killing processes. In some cases this might be the process that is consuming all of the RAM, but this isn’t guaranteed. I’ve seen the OOM killer end up killing programs like sshd or other processes instead of the real culprit. In many cases the system is unstable enough after one of these events that you find you have to reboot it to ensure that all of the system processes are running. If the OOM killer does kick in, you will see lines like the following in your /var/log/syslog:
1228419127.32453_1704.hostname:2,S:Out of Memory: Killed process 21389 (java).
1228419127.32453_1710.hostname:2,S:Out of Memory: Killed process 21389 (java).
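If you suspect the OOM killer has fired but aren't sure, you can search the system log and the kernel ring buffer for its messages; the exact wording varies between kernel versions, so treat these patterns as a starting point rather than the definitive strings:

$ sudo grep -i "out of memory" /var/log/syslog
$ dmesg | grep -i "killed process"

Either command should show which processes were killed and when, which helps you decide what needs to be restarted.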
Diagnose High I/O Wait
When I see high I/O wait, one of the first things I check is whether the machine is using a lot of swap. Since a hard drive is much slower than RAM, when a system runs out of RAM and starts using swap, the performance of almost any machine suffers. Anything that wants to access the disk has to compete with swap for disk I/O. So first diagnose whether you are out of memory and, if so, manage the problem there. If you do have plenty of RAM, you will need to figure out which program is consuming the most I/O.
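A quick way to tell whether the system is actively swapping (as opposed to just having a few idle pages parked in swap) is vmstat, which is part of the procps package installed by default; a minimal example:

$ vmstat 5 5

Watch the si and so columns, which report how much memory is swapped in from and out to disk each second. If they stay at or near zero, swap is probably not the source of your I/O wait; sustained nonzero values mean the system is actively paging.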
It can sometimes be difficult to figure out exactly which process is using the I/O, but if you have multiple partitions on your system, you can narrow it down by figuring out which partition most of the I/O is on. To do this you will need the iostat program, which is provided by the sysstat Ubuntu package, so type
$ sudo apt-get install sysstat
Preferably you will have this program installed before you need to diagnose an issue. Once it is installed, you can run iostat without any arguments to get an overall picture of disk I/O on your system:
$ sudo iostat
Linux 2.6.24-19-server (hostname)  01/31/2009

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           5.73    0.07    2.03    0.53    0.00   91.64

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda               9.82       417.96        27.53   30227262    1990625
sda1              6.55       219.10         7.12   15845129     515216
sda2              0.04         0.74         3.31      53506     239328
sda3              3.24       198.12        17.09   14328323    1236081
The first bit of output gives CPU information similar to what you would see in top. Below it are I/O stats on all of the disk devices on the system as well as their individual partitions. Here is what each of the columns represents:
- tps: This lists the transfers per second to the device. "Transfers" is another way to say I/O requests sent to the device.
- Blk_read/s: This is the number of blocks read from the device per second.
- Blk_wrtn/s: This is the number of blocks written to the device per second.
- Blk_read: This is the total number of blocks read from the device.
- Blk_wrtn: This is the total number of blocks written to the device.
When you have a system under heavy I/O load, the first step is to look at each of the partitions and identify which partition is getting the heaviest I/O load. Say, for instance, that I have a database server and the database itself is stored on /dev/sda3. If I see that the bulk of the I/O is coming from there, I have a good clue that the database is likely consuming the I/O. Once you figure that out, the next step is to identify whether the I/O is mostly from reads or writes. Let’s say that I suspect that a backup job is causing the increase in I/O. Since the backup job is mostly concerned with reading files from the file system and writing them over the network to the backup server, I could possibly rule that out if I see that the bulk of the I/O is due to writes, not reads.
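To watch the per-partition numbers change in real time, you can give iostat an interval, and the same sysstat package provides pidstat, which on recent versions can break I/O down per process (the -d option relies on per-process I/O accounting in the kernel, so treat it as a best-effort check):

$ sudo iostat 5
$ sudo pidstat -d 5

The first command reprints the device statistics every 5 seconds so you can see whose Blk_read/s or Blk_wrtn/s climbs under load; the second lists read and write rates per process, which can point you straight at the culprit.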
Out of Disk Space
Another common problem system administrators run into is a system that has run out of free disk space. If your monitoring is set up to catch such a thing, you might already know which file system is out of space, but if not, then you can use the df tool to check:
$ sudo df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda1             7.9G  541M  7.0G   8% /
varrun                189M   40K  189M   1% /var/run
varlock               189M     0  189M   0% /var/lock
udev                  189M   44K  189M   1% /dev
devshm                189M     0  189M   0% /dev/shm
/dev/sda3              20G   15G  5.9G  71% /home
The df command will let you know how much space is used by each file system, but after you know that, you still need to figure out what is consuming all of that disk space. The similarly named du command is invaluable for this purpose. This command with the right arguments can scan through a file system and report how much disk space is consumed by each directory. If you pipe it to a sort command, you can then easily see which directories consume the most disk space. What I like to do is save the results in /tmp (if there’s enough free space, that is) so I can refer to the output multiple times and not have to rerun du. I affectionately call this the “duck command”:
$ cd /
$ sudo du -ckx | sort -n > /tmp/duck-root
This command won’t output anything to the screen but instead will create a sorted list of what directories consume the most space and output it into /tmp/duck-root. If I then use tail on that file, I can see the top ten directories that use space:
$ sudo tail /tmp/duck-root
67872   /lib/modules/2.6.24-19-server
67876   /lib/modules
69092   /var/cache/apt
69448   /var/cache
76924   /usr/share
82832   /lib
124164  /usr
404168  /
404168  total
In this case I can see that /usr takes up the most space, followed by /lib, /usr/share, and then /var/cache. Note that the output separates out /var/cache/apt and /var/cache so I can tell that /var/cache/apt is the subdirectory that consumes the most space under /var/cache. Of course, I might have to open the duck-root file with a tool like less or a text editor so I can see more than the last ten directories.
So what can you do with this output? In some cases the directory that takes up the most space can’t be touched (as with /usr), but often when the free space disappears quickly it is because of log files growing out of control. If you do see /var/log consuming a large percentage of your disk, you could then go to the directory and type sudo ls -lS to list all of the files sorted by their size. At that point you could truncate (basically erase the contents of) a particular file:
$ sudo sh -c "> /var/log/messages"
Alternatively, if one of the large files has already been rotated (it ends in something like .1 or .2), you could either gzip it if it isn’t already gzipped, or you could simply delete it if you don’t need the log anymore.
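For example, to compress a rotated log in place (the file name here is only a placeholder for whatever large log you find):

$ sudo gzip /var/log/myapp.log.1

Text logs usually compress dramatically, and you can still read the result later with zless or zcat if you need it.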
Out of Inodes
Another less common but tricky situation in which you might find yourself is the case of a file system that claims it is full, yet when you run df you see that there is more than enough space. If this ever happens to you, the first thing you should check is whether you have run out of inodes. When you format a file system, the mkfs tool decides at that point the maximum number of inodes to use as a function of the size of the partition. Each new file that is created on that file system gets its own unique inode, and once you run out of inodes, no new files can be created. Generally speaking, you never get close to that maximum; however, certain servers store millions of files on a particular file system, and in those cases you might hit the upper limit. The df -i command will give you information on your inode usage:
$ df -i
Filesystem     Inodes   IUsed   IFree IUse% Mounted on
/dev/sda       520192   17539  502653    4% /
In this example my root partition has 520,192 total inodes but only 17,539 are used, so I can create another 502,653 files on that file system. If 100% of your inodes are used, you have only a few options: identify a large number of files you can delete or move to another file system, archive a group of files into a tar archive to free up their inodes, or back up the files on the current file system, reformat it with more inodes, and copy the files back.
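When you do run out of inodes, the hard part is usually finding where all of the files are. Since every file consumes an inode, counting files per directory is a reasonable way to narrow it down; this is one rough approach using standard tools (adjust the starting directory to match the full file system):

$ sudo find /var -xdev -type f | cut -d/ -f1-3 | sort | uniq -c | sort -n | tail

This counts the files under each second-level directory beneath /var and prints the largest counts last, which usually points at the directory full of tiny files; mail spools, session directories, and caches are common culprits.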