- Understanding a Running System
- Design Rules of Thumb
- Analyzing an Existing System
- Designing for RAS
- A Logical Design Specification
Analyzing an Existing System
Often, the purpose of designing a new system is to replace an existing system in your infrastructure. If so, you can benefit from analyzing your existing system because this analysis will give you a better idea of what problems you are facing. This analysis is also useful if you are trying to upgrade a Sun Fire server. A proper analysis will ensure that you are upgrading the right parts of the system to address the issues.
Before you go any further, you should revisit your design goals discussed and developed in Chapter 2. Doing so will help you properly formulate your statement of the performance problems you are encountering. A good problem statement is:
When many users are logged in, NFS performance is very slow.
A bad problem statement is:
There is a large number in the w column of the vmstat output.
Always start with the perceived problems and requirements. An improvement in these areas is the only way you can tell if your design is a success. You can only make use of statistics if you know what you are looking for.
The easiest way to analyze a system is by using the stat commands that ship with the Solaris OE, and which can be used to monitor performance of a running system. You can get a full list of the available commands by typing the following command in a shell prompt:
# ls /usr/bin/*stat
This command will display a series of commands, such as vmstat, iostat, netstat, and so on.
You should never use the uptime command to analyze a system. You can use it to show how long your system has been up, but the notion of a load is very outdated and fairly useless in the Solaris OE. Most notably, load varies widely from system to system; a load of 10 may indicate a lack of activity on one machine, but extreme activity on another. We recommend you get in the habit of using vmstat 5 instead of uptime when a machine seems sluggish.
Some stat commands are more useful than others, so the following sections focus on the useful commands (TABLE 3-2).
TABLE 3-2 Useful Stat Commands
Command | Description
/usr/bin/vmstat | Virtual memory/paging statistics with CPU/process summaries
/usr/bin/mpstat | Extensive per-processor statistics
/usr/bin/iostat | I/O and NFS statistics
/usr/bin/netstat | Network statistics
/usr/bin/prstat | Summary of active processes, very similar to the top utility
Collecting and understanding the output from these commands should give you a good idea of what problems your current system is having, and how to improve upon these problem areas in the design of your new server.
The following sections review each command in turn, along with how to properly use each one, so you can gather the best statistics possible. It is important to note that not all options of a given stat command produce useful, or even trustworthy, output in all situations. The focus is on the specific parts of the output of each command that are the most important.
How and When
How and when you monitor a system is just as important as what commands you use and why you use them to collect statistics. You should make sure that you are monitoring the system when it is doing what you want it to do.
In some situations, this is relatively straightforward, such as on a multiuser interactive system. In this case, you want to run your stats during the day, when everyone is doing their normal work. Conversely, if you have a system that serves mainly as a database server, and the load gets very heavy at night when batch jobs are running, you should gather your stats overnight.
When collecting stats unattended (such as overnight), use a simple shell script that writes to a log file in /var/tmp with periodic timestamps. You can use something like the script in CODE EXAMPLE 3-1 to run the stat commands mentioned previously:
CODE EXAMPLE 3-1 nightstats Script for Unattended Stat Collection
#!/bin/sh
# nightstats - Script for unattended stat collection

if [ $# -lt 2 ]; then
    echo "Usage: $0 stat args ..." >&2
    exit 1
fi

# some basic vars holding date, etc.
stat=$1
shift
date=`date +%Y%m%d`
logfile=/var/tmp/$stat.$LOGNAME.$date

# run the stat, writing output to our logfile
exec 1>$logfile
echo "Running '$stat $@' as '$LOGNAME'"
while true
do
    date
    $stat "$@"
done
The way this script works, you get a timestamp at the start of each cycle of the interval and count you specify. So, if you run:
# nightstats vmstat 5 12
you get a timestamp every 12 repetitions, that is, every 60 seconds. Note that the nightstats script loops indefinitely, so you must kill it manually when you want it to stop. You can use this script to kick off stats in the background, either using cron or before you go home. For example:
# nohup nightstats vmstat 5 300 2>/dev/null &
# nohup nightstats iostat -xcnz 5 300 2>/dev/null &
Then, when you come to work the next day, you will have a log file in /var/tmp for each stat, with a timestamp at each cycle (every 25 minutes for the preceding examples). Each file is named with the name of the stat command, your user name, and the date ($LOGNAME is automatically set to your user name by the shell). This allows you to collect stats during the times when your system is under the type of load you care about.
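If you prefer to have cron start and stop the collection rather than launching it by hand, a pair of crontab entries along these lines will do it. The 22:00 start, the 08:00 stop, and the /usr/local/bin install path are only assumptions for this sketch; pkill -f matches the script name anywhere in the argument list, and any stat command already in flight simply finishes its current cycle on its own.
# start unattended vmstat collection at 22:00, Monday through Friday
0 22 * * 1-5 /usr/local/bin/nightstats vmstat 5 300 >/dev/null 2>&1
# stop it again at 08:00 (nightstats loops until killed)
0 8 * * 1-5 /usr/bin/pkill -f nightstats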
NOTE
You also want to collect some stats when your system is not busy, which you can then use as a baseline for comparison. Otherwise, you will not be able to tell what stats change when the load increases.
Simulating Loads
Trying to simulate loads is not very useful. In general, trying to simulate a load gives you a poor, if not misleading, picture of what the system is trying to do. For example, the common practice of using dd to write to disks is usually a misrepresentative measure of I/O load. While dd reads and writes sequentially, most real-world disk access is random, and is an unpredictable combination of reads and writes. Thus, your configuration could look good on paper, and work well when running dd, but work poorly in a real-world application.
To get an accurate picture of your requirements, you should monitor a system that is running what you want it to be running. If you need to simulate this, the best way is to create a test environment that mirrors what you want to design as closely as possible. If you cannot do so, then we recommend that you use the design rules of thumb, and avoid analyzing a dissimilar system as this can cause you to make poor decisions.
What and Why
Now that you understand how and when to measure your system, the following sections examine each of the different stat commands and what they tell you.
prstat Command
When looking at your stats, the first thing you should know is what your system is doing. Seeing a large amount of disk activity by itself does not tell you anything other than the system is undergoing a large amount of disk activity. This is where the prstat command comes in. It shows you what processes are active on the system, along with how much CPU time they are using, what processor they are bound to, their size in memory, their priority, and more. If you have used the freeware tool top, the output should look very familiar.
Unlike all of the other stat commands, to use prstat you just type the command with no arguments:
# prstat
The display fills the terminal window and refreshes every 5 seconds. You should launch prstat in a separate window and keep it going as you use each of the following stat commands. That way, you can correlate the performance of your system with what is actively running on it.
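If the full display is more than you want, prstat also accepts a few options that narrow it down; the following invocations are just a convenient subset, and the oracle user name is only an example:
# ten busiest processes, sorted by CPU usage, refreshing every five seconds
prstat -s cpu -n 10 5
# only the processes owned by the oracle user
prstat -u oracle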
vmstat Command
Memory is always the first place to start. If you have a memory bottleneck, then all of your other stats are going to be unreliable, since the system will be introducing extra delays trying to manage memory. Often memory problems are misdiagnosed as I/O or CPU problems, since disk access or applications seem slow to the user. In reality, these operations are slow because the system is paging or even swapping.
So remember: Always start by looking at memory. Repeat that over and over as a kind of mantra whenever you are analyzing or designing a system.
The virtual memory management algorithms in the Solaris OE are complex. Basically everything is seen as a page of memory, including files. While this is a benefit as far as the system is concerned, it makes analysis more difficult. Therefore, properly analyzing memory takes several steps.
The simplest way to look at memory is by specifying a time interval to the vmstat command, and letting it run until you press Ctrl-C to interrupt it. The following vmstat command monitors the system in five-second intervals:
CODE EXAMPLE 3-2 How to Use the vmstat Command
# vmstat 5
 procs memory page disk faults cpu
 r b w swap free re mf pi po fr de sr s0 -- -- -- in sy cs us sy id
 0 0 20 1461688 510080 37 185 30 1 2 0 0 1 0 0 0 667 650 292 4 2 94
 0 0 64 1468888 197976 8 43 0 0 0 0 0 0 0 0 0 638 571 269 1 1 98
 0 0 64 1469320 198528 0 0 1 0 0 0 0 0 0 0 0 642 467 256 0 1 98
The first line of the vmstat output is a summary.
NOTE
Always ignore the first line of any stat command. It does not provide any useful information because it is a summary for as long as the system has been up. Summaries span too long a period of time, and they give you no indication as to the use of the system during that time.
When looking at the output from vmstat (CODE EXAMPLE 3-2), you will notice a lot of columns. You should ignore all the fields about disks and device interrupts, as there are better tools for monitoring these stats, which we will describe in subsequent sections. In fact, only some of these columns (TABLE 3-3) are really useful.
TABLE 3-3 Important vmstat Command Output Columns
Column Heading | Meaning
r | Number of runnable processes (waiting for CPU time)
b | Number of blocked processes (waiting for I/O, paging, and so on)
w | Number of runnable but swapped-out processes (normally 0)
re | Page reclaims (memory pages taken from other processes)
mf | Minor page faults
pi | Kilobytes paged in (including process startup and file access)
po | Kilobytes paged out (should be close to 0)
sr | Pages scanned by page-out scanner (also close to 0)
us | Percentage of CPU time spent in user mode
sy | Percentage of CPU time spent in system mode
id | Percentage of CPU time spent idle
First, look at the procs headings. Normally, the r, b, and w columns are fairly low numbers, if not 0. This is because, generally, these columns only become nonzero if a process is waiting for something, either a CPU (r), I/O (b), or enough memory (w). Large numbers in these columns are usually bad.
One caveat is that you may occasionally see a steady, unchanging number in the w column. This means that the Solaris software has decided these processes have been idle so long they should be swapped out to make room for other things. Do not be concerned about this.
The cpu columns give you a good system-at-a-glance snapshot of what the system is doing, averaged across all processors. In general, non-idle time should be spent in roughly a 2-to-1 ratio in usr-to-sys modes. Also, if idle time (id) is close to zero consistently, you probably need some additional CPUs, especially if the r column is a large number. Beyond this, to get a good view of your CPUs you should use the mpstat command, as explained in "mpstat Command" on page 18.
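If you want to check that ratio against an overnight vmstat log from the nightstats script, a short awk pass is enough. This is only a sketch: it relies on the fact that us, sy, and id are always the last three columns of a vmstat line, so it keeps working regardless of how many disk columns your system shows.
# sum the us and sy columns of a vmstat log and print the usr-to-sys ratio
awk '/^ *[0-9]/ && NF > 15 { usr += $(NF-2); sys += $(NF-1) }
     END { if (sys > 0) printf "usr:sys ratio = %.1f\n", usr / sys }' /var/tmp/vmstat.$LOGNAME.*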
On to memory. First, note that the free column should be completely ignored, as it does not in any way correspond to what is thought of as free memory. Because of the way the Solaris software manages memory, the free list does not properly count multiple processes sharing the same pages, or unused pages that have yet to be reclaimed. In addition, the file cache grows to consume most of free memory to improve performance.
Consequently, the free list tends to decrease steadily over the uptime of a system, when in fact the system is efficiently reclaiming and reusing memory.
If you want a better picture of available virtual memory, you can use the swap command:
# swap -l
swapfile dev swaplo blocks free
/dev/dsk/c0t1d0s0 227,6 16 4093712 4093712
# swap -s
total: 494360k bytes allocated + 35568k reserved = 529928k used, 25137440k available
If both the free column from the first command, and the available column from the second command are nonzero, the system is all right. Beyond that, you can ignore the concept of free memory.
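Note that swap -s prints a single snapshot and exits, so unlike vmstat it needs its own loop if you want to watch available swap over time. A minimal sketch, with the field position assuming the "... used, NNNNNNk available" format shown above:
# log available swap once a minute
while true
do
    date
    swap -s | awk '{ print "swap available:", $(NF-1) }'
    sleep 60
done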
Instead, the most important column of vmstat is the scan rate (sr). This column shows the number of pages scanned in an attempt to free unused memory. The page-out scanner starts running only when free memory goes below the kernel parameter lotsfree, which is a small percentage of physical memory. When you see an increase in the scan rate, you should also see a jump in the page-outs (po), indicating that pages are being moved from physical memory to swap space. If you see this consistently, it is evidence of a memory shortage: the system needs more memory. If this only happens occasionally, then you should explore whether better job scheduling or /etc/system tuning could help. If not, you need more memory.
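If you are curious what lotsfree actually is on a given machine, the system_pages kstat reports it (along with the current free memory) in pages; this assumes the Solaris 8 OE or later, where the kstat command is available:
# lotsfree and current free memory, in pages
kstat -p unix:0:system_pages:lotsfree
kstat -p unix:0:system_pages:freemem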
NOTE
A high number in the page-ins (pi) column is not necessarily significant. This is because when a new process starts, its executable image and data must be read into memory. Also, file system access appears in the pi column too. A large number in the pi column is only relevant if the po column is large too.
Here is an example of a system that is undergoing heavy paging because it is reading in a large file.
CODE EXAMPLE 3-3 vmstat 5 Command Output Reading a Large File
# vmstat 5
 procs memory page disk faults cpu
 r b w swap free re mf pi po fr de sr f0 s0 s6 s7 in sy cs us sy id
 0 0 0 2406032 431280 8 72 2 0 0 0 0 0 1 0 1 121 87 202 1 5 94
 0 0 24 2489472 643792 0 0 1 0 0 0 0 0 0 0 0 328 86 108 0 2 98
 0 0 24 2489472 643784 61 252 483 0 0 0 0 0 5 0 9 466 718 260 3 8 90
 0 0 24 2452936 605616 1396 1753 10950 0 0 0 0 0 9 0 77 1266 2363 801 50 45 5
 0 0 24 2383216 531176 1484 1860 11822 0 0 0 0 0 53 0 40 790 1897 357 55 33 12
 0 0 24 2309576 458256 1435 1773 11475 0 0 0 0 0 69 0 23 697 1791 247 51 30 19
 0 0 24 2236608 391168 1374 1761 11008 0 0 0 0 0 52 0 40 775 1613 235 49 35 17
 0 0 24 2165824 324224 1411 1700 11291 0 0 0 0 0 75 0 16 751 1652 239 47 32 21
 0 0 24 2097680 253816 1378 1720 11012 0 0 0 0 0 0 0 87 746 1687 246 47 33 20
 0 0 24 2028800 184168 1330 2020 10614 0 0 0 0 0 73 0 11 719 1608 239 52 33 16
 0 0 24 1948016 110880 1350 1649 10790 0 0 0 0 0 56 0 37 764 1605 246 49 32 19
 0 0 24 1886176 48208 1282 1666 10187 8 8 0 13 0 1 0 89 793 1934 312 44 37 19
 0 0 24 1835416 7280 688 836 5598 5328 5529 0 6586 0 94 0 47 1088 889 238 24 30 46
 0 0 24 1803768 6680 353 675 3052 6657 6808 0 6749 0 80 0 80 1287 478 435 16 26 58
 0 1 24 1790704 15856 236 393 832 4470 4481 0 1665 0 68 0 107 2579 792 1416 11 41 48
 0 1 24 1784160 18152 35 388 812 3136 3144 0 839 0 60 0 92 1473 724 800 10 29 62
 0 1 24 1777192 18488 29 317 988 2536 2540 0 634 0 66 0 97 988 446 422 7 15 78
 0 1 24 1770768 18664 20 326 942 2334 2345 0 616 0 77 0 77 953 518 409 7 18 75
 procs memory page disk faults cpu
 r b w swap free re mf pi po fr de sr f0 s0 s6 s7 in sy cs us sy id
 0 0 24 1764704 18528 37 339 820 2636 2648 0 699 0 105 0 48 961 509 343 8 20 71
 0 1 24 1757544 18640 30 264 1051 2206 2214 0 602 0 124 0 43 963 331 398 5 15 80
 0 1 24 1753544 18248 19 255 1081 2048 2056 0 880 0 97 0 70 960 323 412 5 13 81
 0 1 24 1749440 18664 20 258 1046 2048 2062 0 632 0 99 0 63 974 491 443 8 14 77
 0 1 24 1744720 18720 17 255 1009 2152 2153 0 552 0 102 0 58 1012 344 449 6 15 79
 0 1 24 1739992 18920 16 256 1008 1974 1982 0 529 0 101 0 55 929 324 379 6 16 78
 0 1 24 1735416 18800 16 261 998 2048 2052 0 536 0 107 0 55 966 315 379 5 15 80
 0 0 24 1729704 18768 54 268 833 2177 2179 0 546 0 83 0 55 862 352 338 18 13 69
 0 1 24 1728480 18816 105 403 1140 1971 1974 0 552 0 110 0 62 1027 569 492 7 11 83
 0 1 24 1728600 18888 48 196 1118 1484 1489 0 470 0 110 0 53 1014 261 484 5 10 85
 0 1 24 1728496 18832 42 191 1304 1536 1544 0 525 0 123 0 51 1000 160 455 2 8 89
 1 0 24 1728344 37712 372 143 946 1176 1178 0 318 0 103 0 43 789 84 335 3 21 76
 0 0 24 2489048 652144 3 78 32 0 0 0 0 0 6 0 5 427 310 168 1 4 95
 0 0 24 2488840 651872 0 1 11 0 0 0 0 0 1 0 1 351 134 138 0 2 98
 0 0 24 2488792 651776 0 0 3 0 0 0 0 0 0 0 0 349 114 128 0 2 98
Notice that for the first half of the output, there is a large number of pi but no po, due to the file system activity of reading the file. As it progresses, though, notice the abrupt jump in po as well as in sr. Also notice how much the pi and user time (us) drop. The system is spending an inordinate amount of time managing memory, slowing down how quickly it can read in the file.
As with all stats, brief periods of paging are not important. The purpose of having virtual memory is to allow you to temporarily exceed your available physical memory. You just want to make sure the system is not paging continuously for extended periods of time.
By now you should have a rough idea of what your system is doing. To really understand what is going on, though, you must be able to differentiate between file system pages, executable pages, and so on. To do this you can use the vmstat -p option.
vmstat Command -p Option
Using the vmstat -p option fundamentally changes the type of data reported by the command. The -p option replaces the columns on processes, CPUs, disks, and interrupts with extended statistics on memory and paging, showing pi, po, and pf broken out for executable, anonymous, and file system pages.
Examine each of the three types of pages shown by the vmstat -p option:
TABLE 3-4 Page Types Shown by vmstat -p Option
Page type | Meaning
executable | Images of executable programs and their data
anonymous | Used for a process's heap space, stack, and private pages
filesystem | Files mapped into the address space through the mmap(2) system call
Under each page type heading are the following fields, where ? is replaced with the first letter of the page type:
TABLE 3-5 Page Stats Shown by vmstat -p Option
Column Heading | Meaning
?pi | Kilobytes paged in
?po | Kilobytes paged out
?pf | Page faults
As with the vmstat output, the key field is still sr, showing the scan rate. The benefit you get with -p is that you can now see what types of pages need the space, allowing you to better understand what the system is doing.
Look again at the system that is reading in a large file, only this time with the vmstat -p option.
CODE EXAMPLE 3-4 vmstat -p 5 Command Output Reading a Large File
# vmstat -p 5
 memory page executable anonymous filesystem
 swap free re mf fr de sr epi epo epf api apo apf fpi fpo fpf
 2406040 431296 8 72 0 0 0 0 0 0 0 0 0 2 0 0
 2489992 630792 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 2480080 620344 785 1021 1 0 0 6 0 0 67 0 0 6174 1 1
 2417296 557472 1514 2830 0 0 0 0 0 0 0 0 0 10777 0 0
 2349520 493576 1330 2515 0 0 0 0 0 0 0 0 0 9523 0 0
 2293456 459296 1295 2684 0 0 0 0 0 0 0 0 0 9088 0 0
 2230072 399424 1256 1751 0 0 0 0 0 0 0 0 0 9881 0 0
 2164832 334864 1403 1700 0 0 0 0 0 0 0 0 0 11212 0 0
 2097432 267288 1415 1716 0 0 0 0 0 0 0 0 0 11212 0 0
 2021736 192344 1330 2024 0 0 0 0 0 0 0 0 0 10638 0 0
 1947168 122688 1330 1604 0 0 0 0 0 0 0 0 0 10558 0 0
 1883288 59216 1324 1658 0 0 0 0 0 0 0 0 0 10568 0 0
 1832784 12056 836 863 3936 0 5059 1 0 76 1 3548 3846 6808 4 12
 1798648 8656 207 654 6531 0 6519 0 0 72 353 6374 6446 1502 1 12
 1787016 17864 49 461 4094 0 927 6 0 6 646 4076 4084 12 3 3
 memory page executable anonymous filesystem
 swap free re mf fr de sr epi epo epf api apo apf fpi fpo fpf
 1776040 18448 38 530 3036 0 678 8 0 6 774 3020 3028 3 0 1
 1766800 18488 32 319 2592 0 625 4 0 3 952 2585 2585 1 1 3
 1761080 18696 32 309 2465 0 549 0 0 1 963 2457 2460 1 0 3
 1754696 18600 31 302 2420 0 534 0 0 8 937 2406 2412 1 0 0
 1748608 18640 30 308 2488 0 534 3 0 3 945 2483 2484 1 0 0
 1742504 18784 23 285 2318 0 508 3 0 6 968 2304 2307 3 1 4
 1736960 18784 21 291 2268 0 491 3 0 8 979 2252 2259 3 1 1
 1731008 18584 94 291 2369 0 535 0 0 9 811 2355 2358 3 0 1
 1729800 18744 75 214 1697 0 497 0 0 4 1112 1689 1692 1 0 0
 1729840 18664 57 202 1601 0 538 0 0 4 1156 1587 1595 0 1 1
 1881984 149608 470 122 984 0 366 30 0 6 728 972 976 0 0 1
 2490440 672488 0 0 0 0 0 0 0 0 0 0 0 4 0 0
 2490744 672672 10 168 0 0 0 8 0 0 6 0 0 16 0 0
 2490768 672512 0 2 0 0 0 0 0 0 16 0 0 0 0 0
As you can see, this makes what is happening to the system much clearer. The system starts by paging in the file very effectively, until it hits the lotsfree limit and the page-out scanner starts. At this point, there is a big jump in the sr column. Also notice the abrupt shift from file system page-ins (fpi) to anonymous pi, po, and pf. This means that pages are being taken from other processes to make room for the file in memory. Thus, if you see a lot of activity in the apo and sr columns, you need more memory.
While memory analysis can be complicated, if you pay attention solely to the sr and po columns, you should be able to tell if your system needs additional memory.
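If you collect vmstat -p output overnight with the nightstats script, a quick awk pass can flag the samples that matter. The column positions assume the header layout in CODE EXAMPLE 3-4 (sr is the seventh column, apo the twelfth) and a log that contains only vmstat -p output, so treat this as a sketch to adapt rather than a finished tool:
# print any vmstat -p samples showing anonymous page-outs
awk '$12 ~ /^[0-9]+$/ && $12 > 0 { print "sr =", $7, "apo =", $12 }' /var/tmp/vmstat.$LOGNAME.*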
mpstat Command
The Sun Fire system is designed to be a multiprocessor system, as evidenced by the fact that you cannot even buy a system with only one CPU. You are looking at CPUs second because being processor-bound is the least likely cause of bad performance; if anything, you are exploring CPUs so that you can double-check this assumption and rule it out as a possible factor. CPUs usually only become a factor in heavily loaded systems that are doing lots of interactive or transactional processing. In most other cases, if you buy enough system boards to hold all your memory, the CPUs that are included are usually sufficient.
As mentioned previously, the cpu columns of the vmstat output are a good place to start. Generally, a large percentage of idle time indicates that your processing power is sufficient. However, measuring idle time across a lot of processors can mask situations such as one processor getting swamped with interrupts while the rest do nothing. So, it is important to look at your CPUs in detail to make sure you are not missing anything.
Like vmstat, just launch mpstat with a time interval and let it run:
CODE EXAMPLE 3-5 How to Use the mpstat Command
# mpstat 5
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
  0 372 2 836 447 300 393 47 25 26 1 918 23 10 0 67
  1 370 2 622 543 523 301 40 23 35 0 932 24 11 0 65
  2 376 2 527 151 100 396 48 25 26 0 926 24 10 0 66
  3 372 2 531 151 100 397 48 25 26 0 921 23 10 0 67
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
  0 229 0 546 400 300 458 0 12 13 1 563 2 9 0 89
  1 132 0 2018 585 585 111 0 9 16 0 621 4 8 1 88
  2 265 0 199 100 100 354 1 9 15 0 770 21 9 1 68
  3 363 0 491 101 100 671 1 14 18 0 1339 22 11 0 67
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
  0 155 0 445 400 300 495 0 12 7 0 398 1 6 0 92
  1 99 0 145 348 347 134 1 10 10 0 487 13 4 0 83
  2 154 0 401 101 100 255 1 8 4 0 723 21 5 0 73
  3 307 0 227 100 100 178 0 11 9 0 989 23 8 1 69
This command produces a lot of columns, only some of which you care about:
TABLE 3-6 mpstat Command Output Columns
Column Heading | Meaning
xcal | Interprocessor cross-calls
intr | Interrupts
csw | Context switches
icsw | Involuntary context switches
smtx | Spins on mutex locks
usr | Percent user time
sys | Percent system time
wt | Percent wait time
idl | Percent idle time
A cross-call (xcal) is a call used by a processor to tell other processors to do something. Cross-calls are used for a variety of things, such as delivering a signal to another processor or ensuring virtual memory consistency. This latter use is very common, as it happens during file system activity. Heavy file system activity (such as NFS) can result in a lot of cross-calls. Also, it is not unusual for the boot processor to show thousands of xcals, as it maintains lots of information about the other processors.
An interrupt (intr) is the mechanism that a device uses to signal to the kernel that it needs attention, and some immediate processing is required on its behalf. I/O is the major contributor of interrupts, although there are also "special" interrupts such as the system-wide clock thread that occurs regularly. Interrupts, unlike everything else, are not distributed across all CPUs. Instead, the Solaris OE binds each source of interrupts to a specific CPU.
The term context switch (csw) refers to the process of moving a thread on and off a CPU. Context switches are a normal but somewhat expensive occurrence because switching context involves certain overhead, such as populating the stack. Normally, a context switch occurs when a process is done with the CPU and another process is given a chance to run. Thus, a steady number of context switches is insignificant.
Involuntary context switches (icsw), on the other hand, are much less favorable. When a process is given access to the CPU, it has a limited time window in which to run, depending on how many other processes are running, what their priority is, and so on. This is the nature of scheduling. An involuntary context switch means that the process was forcibly stopped by the scheduler before it was finished; the time allotted was too short for the process to finish in, or a higher-priority thread preempted it. A few of these are nothing to be concerned about, but getting a large number of these regularly indicates that the system does not have enough processing power to handle all of the things that need to run. You need additional CPUs.
Finally, a spin on a mutex lock (smtx) happens when a thread cannot access a section of the kernel that it needs on the first try. The term mutex is short for a mutual exclusion lock, and is used in multithreaded operating systems like the Solaris OE to allow multiple threads to run concurrently in system mode. When a thread enters system mode, it locks the part of the kernel it is using by acquiring the mutex lock for that section. Once the thread is finished, it releases the lock so other threads can have access. A spin happens when two threads want the same section of the kernel, and one of them has to wait for the other to finish. A few of these are perfectly normal, but a large number means there is contention for the kernel resources.
The difficult thing about CPU performance analysis is that it is nearly impossible to provide hard-and-fast rules of thumb. For example, a high number of interrupts is not necessarily a negative. If you have a lot of I/O, you are going to have a lot of interrupts. Instead, it is the interaction of several statistics that can tell you if you are operating efficiently or having serious problems.
When looking at the mpstat command output, you always want to take into account the last four columns: the amount of CPU time spent in the different modes. The only real rule of thumb is:
The system should always have some idle time.
Having consistently low idle time means that your system is getting pushed to its limits in one way or another. Even if your system is properly tuned and running at peak efficiency, if you ran anything else on it you would be out of processing capacity.
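Checking that rule against an overnight mpstat log is straightforward, because idl is always the last column of every per-CPU line. A sketch, assuming a log produced by nightstats mpstat 5 300:
# average idle time across every CPU sample in an mpstat log
awk '/^ *[0-9]/ { idle += $NF; n++ }
     END { if (n > 0) printf "average idle: %.1f%%\n", idle / n }' /var/tmp/mpstat.$LOGNAME.*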
However, it is also possible for the system to have a significant amount of idle time and still need more CPUs. Why? Because the amount of time a CPU spends handling device interrupts cannot be seen by the operating system, as device interrupts have a higher priority than the clock interrupt. This means the time it takes a CPU to handle network and I/O interrupts is reported as idle time. The intr column only lists disk-based interrupts. (This is different from wait time, which accrues after an interrupt has been handled, while the CPU is waiting for a response from the device.)
Because CPU analysis is so complicated, TABLE 3-7 can help you decipher the different stats. The terms "high" and "low" are used because the exact numbers are very system dependent.
TABLE 3-7 Analyzing the mpstat Command Output
If you see this... | It probably means...
High intr, high idl, low usr | Your system is very busy handling I/O interrupts.
High intr, high sys, low usr | If the system is an NFS server, this is perfectly normal. Otherwise, the system is very busy handling I/O interrupts.
High intr, high wt | I/O requests are taking a long time to fulfill. This is likely an I/O performance problem, not a CPU resource problem. Check iostat.
High smtx, high sys or idl | Contention for kernel system resources exists.
High icsw | Contention for basic CPU resources exists.
High csw or xcal, high sys, low usr | If this happens consistently, you may require more CPUs, depending on your applications. If you are not noticing any slowness in applications or system problems, however, ignore it.
High sys, low usr, all other stats low | Your system is spending too much time managing resources. Check vmstat first.
One nice thing is that the solution to all of these problems is the same. The system needs more and/or faster CPUs. Once again though, the importance of having enough memory is emphasized here. When you add a CPU, you incur additional overhead in the form of more kernel space needed to manage that CPU, and space for that CPU to do its own work. Therefore, the rule of thumb is:
Whenever you add additional CPUs, you should also add memory.
Doing so will help prevent accidental memory shortages, which can actually make your system run slower as you add more CPUs.
iostat Command
Proper I/O layout is complicated; it is almost never done right the first time. Part of the reason for this is that usage patterns and requirements change over time. Also, where you add memory and CPUs is somewhat predetermined. Where you add disk devices and controller cards, though, has a big impact on the system. Therefore, it is important to make sure that the I/O layout is flexible enough to handle future changes and expansion.
Fortunately, I/O analysis is very straightforward. There is only one version of the iostat command to run, iostat -zxcn.
CODE EXAMPLE 3-6 How to Use the iostat Command
# iostat -zxcn 5
<summary omitted>
     cpu
 us sy wt id
  0  1  5 93
                 extended device statistics
 r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
 0.0 0.2 0.0 1.6 0.0 0.0 0.0 8.2 0 0 c0t0d0
 0.0 0.2 0.0 1.6 0.0 0.0 0.0 10.1 0 0 c5t0d0
 0.0 34.6 0.0 2201.0 0.0 1.0 0.0 27.9 0 97 c12t1d0
 0.0 0.2 0.0 1.6 0.0 0.0 0.0 12.2 0 0 c20t122d0
 0.0 0.2 0.0 1.6 0.0 0.0 0.0 14.1 0 0 c20t98d0
 0.0 14.2 0.0 113.6 0.0 0.1 0.0 5.5 0 8 c20t101d0
 0.0 0.2 0.0 0.5 0.0 0.0 0.0 30.1 0 1 c10t1d0
 0.0 58.4 0.0 135.4 0.0 0.3 0.0 5.6 0 30 c2t17d0
 1.0 12.8 8.0 135.0 0.0 0.2 0.0 11.6 0 13 c2t16d0
 0.0 3.4 0.0 19.2 0.0 0.1 0.0 17.9 0 4 c2t9d0
 0.0 0.4 0.0 0.8 0.0 0.0 0.0 5.3 0 0 c2t21d0
 0.0 1.8 0.0 1.4 0.0 0.0 0.0 4.2 0 1 c27t42d0
 0.4 9.2 3.2 155.9 0.0 0.1 0.0 10.9 0 8 c28t69d0
 0.0 9.0 0.0 157.5 0.0 0.1 0.0 9.0 0 6 c28t68d0
 0.0 1.8 0.0 1.4 0.0 0.0 0.0 4.8 0 1 c29t1d0
 0.0 9.0 0.0 157.5 0.0 0.1 0.0 9.5 0 7 c30t35d0
 0.0 0.4 0.0 0.8 0.0 0.0 0.0 5.1 0 0 c30t52d0
 0.0 9.2 0.0 155.9 0.0 0.1 0.0 10.3 0 7 c30t36d0
 0.0 58.4 0.0 135.4 0.0 0.4 0.0 6.5 0 35 c31t66d0
 0.4 12.8 3.2 135.0 0.0 0.2 0.0 12.1 0 12 c31t64d0
 0.0 3.4 0.0 19.2 0.0 0.1 0.0 17.7 0 4 c31t90d0
 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 tomax:/export/mirrors/pkg.eng/export/pkg
 0.0 0.2 0.0 0.4 0.0 0.0 0.1 1.0 0 0 twinsun-n1:/export/workspace/d0/nwiger
For this version of the iostat command, the output shows extended statistics for only those disk devices with nonzero activity, by physical device path instead of the logical kernel disk name (that is, c0t0d0 instead of sd0). If you are using individual disk partitions, you may also want to use the -p option. However, most production environments manage their disks with some type of volume manager package, so in practice this option is not that useful.
As with the other stat commands, there are only a few columns you care about (TABLE 3-8).
TABLE 3-8 Important iostat Command Columns
Column Heading | Meaning
kr/s | Kilobytes read per second
kw/s | Kilobytes written per second
wait | Number of transactions waiting for service
wsvc_t | Average service time in wait queue, in milliseconds
asvc_t | Average service time for active transactions, in milliseconds
You can ignore two commonly used columns, %w and %b, which are supposedly the percentage of time spent waiting and busy, respectively. Because of the complexity of modern disks and controllers, these calculations are very inaccurate. Often the two will total more than 100 percent, which should be impossible. Besides, these columns do not tell you anything that you cannot find out by looking at wsvc_t or asvc_t.
Analogous to the mpstat command, when looking at iostat you should always watch the first two columns listed (kr/s and kw/s) to see how much activity the disks are undergoing. Then, basically, the last three columns should be as close to zero as possible. This indicates that the system has very fast disks, and that the I/O is laid out correctly to avoid controller bottlenecks.
In practice, asvc_t will be nonzero for any disks undergoing activity, since it always takes some amount of time for a disk to fulfill a request. As with any stat, you will only be able to tell if the system is particularly busy after establishing a baseline. However, several facts are true:
Service times across equally active disks should be fairly even.
You should not see huge peaks and valleys under normal conditions.
You should rarely, if ever, see a nonzero number in wait or wsvc_t.
You may, occasionally, see a temporary jump in service times (asvc_t) even though there is nothing apparently going on (that is, kr/s and kw/s are almost 0). This is due to a somewhat strange behavior of fsflush, the daemon responsible for flushing disk buffers. Periodically, it will generate a long, random series of writes in a short time period. This results in a queue forming, which bumps up the service time, even though there is no real apparent activity on the disk. If you see this, ignore it.
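A similar awk pass over an overnight iostat log will tell you whether the wait column was ever nonzero. The field positions assume the -xcn output format shown in CODE EXAMPLE 3-6, where wait is the fifth column and the device name is the last, so again this is only a sketch:
# report any device that ever had transactions sitting in the wait queue
awk '$1 ~ /^[0-9.]+$/ && $5 > 0 { print $NF, "wait =", $5 }' /var/tmp/iostat.$LOGNAME.*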
Now look at some output from a small NFS server undergoing a fairly heavy load from several different concurrent requests to read and write several files:
CODE EXAMPLE 3-7 iostat Command Output On an NFS Server
# iostat -zxcn 5
     cpu
 us sy wt id
  2 72 13 13
                 extended device statistics
 r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
 0.2 0.2 1.6 1.6 0.0 0.0 0.0 11.9 0 0 c0t0d0
 0.0 0.2 0.0 1.6 0.0 0.0 0.0 13.0 0 0 c0t8d0
 24.6 42.4 390.9 675.5 0.0 0.9 0.0 13.9 0 34 c4t17d0
 25.6 42.6 409.4 681.3 0.0 0.9 0.0 13.6 0 34 c4t20d0
 25.4 43.6 403.4 694.5 0.0 0.9 0.0 13.0 0 35 c4t22d0
 24.6 43.0 393.5 687.7 0.0 0.9 0.0 13.9 0 34 c4t4d0
 25.0 42.4 397.8 676.1 0.0 1.0 0.0 14.3 0 36 c4t18d0
 24.2 42.8 385.5 684.5 0.0 0.9 0.0 13.4 0 33 c4t16d0
 24.8 43.4 393.9 691.3 0.0 0.9 0.0 13.8 0 34 c4t3d0
 25.2 43.8 403.0 700.5 0.0 1.0 0.0 13.9 0 36 c4t2d0
 25.6 43.4 409.4 694.1 0.0 0.9 0.0 12.9 0 32 c4t21d0
 0.0 132.9 0.0 7936.8 0.0 5.5 0.0 41.5 0 84 c4t6d0
 25.2 43.8 403.0 700.5 0.0 1.0 0.0 13.8 0 34 c4t1d0
 25.2 42.0 403.0 671.7 0.0 0.8 0.0 12.6 0 31 c4t19d0
     cpu
 us sy wt id
  1 63 18 18
                 extended device statistics
 r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
 0.0 5.8 0.0 44.2 0.0 0.4 0.0 63.3 0 3 c0t0d0
 0.0 5.8 0.0 44.2 0.0 0.3 0.0 58.5 0 3 c0t8d0
 25.6 42.8 407.7 685.0 0.0 0.9 0.0 13.7 0 34 c4t17d0
 25.2 43.8 400.5 698.3 0.0 0.9 0.0 13.4 0 35 c4t20d0
 25.4 43.0 403.7 685.4 0.0 0.9 0.0 13.8 0 34 c4t22d0
 25.2 43.0 403.3 688.2 0.0 1.0 0.0 14.0 0 35 c4t4d0
 25.6 43.6 405.3 693.5 0.0 1.0 0.0 14.3 0 37 c4t18d0
 25.4 42.4 404.9 678.6 0.0 0.9 0.0 13.5 0 35 c4t16d0
 25.2 43.2 398.5 688.7 0.0 1.0 0.0 14.5 0 36 c4t3d0
 25.2 43.6 400.3 694.9 0.0 1.0 0.0 14.0 0 36 c4t2d0
 25.2 43.4 403.3 694.7 0.0 0.9 0.0 12.6 0 32 c4t21d0
 0.0 134.8 0.0 7922.1 0.0 5.6 0.0 41.7 0 86 c4t6d0
 25.2 43.4 400.5 691.9 0.0 1.0 0.0 14.5 0 37 c4t1d0
 25.4 43.2 402.0 687.0 0.0 0.9 0.0 13.2 0 34 c4t19d0
If you look at kr/s and kw/s, the disks on c4 are moving a lot of data in reads and writes. On this server, these are laid out in a striped/mirrored logical volume mounted as /export. From the output, it appears that the volume manager software is doing a good job making sure that the load is spread evenly. The one hot spot is on c4t6d0, which happens to be the volume log. This disk is getting hit heavily because all of the transactions must be logged. While much higher than the others, this disk's performance is still well within acceptable numbers because the asvc_t is not even over 100. This means that the requests are being fulfilled in a very reasonable amount of time.
Notice that in the second set of output, there is a jump in the asvc_t on c0, even though there is no real activity (the disks on c0 are set up as a mirrored root volume). This is likely due to the peculiar behavior of fsflush mentioned earlier, so you can ignore it. Note that this asvc_t is even higher than that of the disk doing 7922 kw/s.
Note that the majority of CPU time was spent in system mode (sy), based on the CPU snapshot (provided by the -c option). Since this is an NFS server, this activity is nothing to be concerned about. If this were a system for local users doing interactive work, however, you might want to rerun vmstat and mpstat to make sure there are no memory or processor bottlenecks.
While finding I/O problems is fairly easy, solving I/O problems can be quite difficult. Requirements for I/O vary widely, so a trial-and-error approach is the only reliable way to get good performance from your I/O. Even within a single data center, different servers may undergo vastly different usage patterns and encounter different types of problems.
Do not try to micromanage hot spots on disks. By the time you get everything tuned correctly, the usage patterns will change. This means you will then have a worse configuration than you would have otherwise, since it is tuned to match your now-inaccurate requirements. Instead, use your time making sure that:
The volumes are spread across as many controllers as possible.
You use the proper type of volume for your requirements.
You use the proper stripe unit size to match your needs.
These simple steps are often missed, but will solve virtually any disk I/O performance problems. Just changing the stripe unit size from its default can result in huge performance gains.
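As an illustration of how little work that last step is, here is what setting an explicit stripe unit size looks like with Solstice DiskSuite. The metadevice name, the four slices (deliberately spread across different controllers), and the 64-Kbyte interlace are assumptions for the example, not recommendations:
# four-way stripe with a 64-Kbyte interlace, one slice per controller
metainit d10 1 4 c1t0d0s0 c2t0d0s0 c3t0d0s0 c4t0d0s0 -i 64k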
The final caveat on I/O layout is fulfilling the RAS requirements of your system, including making sure it supports dynamic reconfiguration (DR) if you require it. Since these are both major concerns on their own, Chapter 4 discusses this issue and I/O design in detail, taking all of these factors into consideration.
netstat Command
Network analysis can be difficult only because the Solaris software does not currently have a solid network utility that really tells you everything you want to know. While you can get a general idea of number of packets, you cannot see things like octets, TCP/UDP throughput rates, or retransmissions. Fortunately, improving the network performance of a system usually amounts to installing an additional network interface card for more bandwidth, even if it is a bit of a "black box" approach.
Despite its limitations, you can tell several things from the netstat command output. Unlike the other stats, you must run the netstat command separately for each interface you have configured by specifying the -I option along with the interface name.
CODE EXAMPLE 3-8 How to Use the netstat Command
# netstat -I ge0 5
    input   hme0      output          input  (Total)    output
 packets errs packets errs colls  packets errs packets errs colls
 909076714 0 837319344 0 0  918674892 0 846917522 0 0
 667 0 681 0 0  673 0 687 0 0
 426 0 402 0 0  428 0 404 0 0
 1886 0 3684 0 0  1886 0 3684 0 0
 1878 0 3117 0 0  1882 0 3121 0 0
 411 0 391 0 0  411 0 391 0 0
You can tell two things from this display:
Total number of packets received (input) and transmitted (output) during that interval, both for that interface (left set of columns) and for all interfaces (right set of columns). This is not an average per second, but a total count.
Number of errors and collisions, which should always be low or zero.
Network capacity is very difficult to gauge with this limited information. Without the sizes of each packet, it is impossible to know if you are anywhere near the throughput limits for the interface you are analyzing. Given this information, if the network seems slow, and you are seeing thousands and thousands of packets each second, try adding another network interface card to see if it helps. If not, you should examine your network as a whole to see if you have more widespread issues.
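If you do need octet counts without installing anything, the kstat command can often supply them. This assumes the Solaris 8 OE or later and an interface driver that exports the standard 64-bit byte counters (hme and ge generally do, but check kstat -m for your driver before relying on it):
# cumulative bytes received and transmitted on hme0 (instance 0 of the hme driver)
kstat -p -m hme -i 0 -s rbytes64
kstat -p -m hme -i 0 -s obytes64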
Many available freeware tools, such as the SE Toolkit and Multi Router Traffic Grapher (MRTG), provide better network analysis than netstat. You can use tools such as these to more properly gauge the bandwidth being used by each interface. MRTG is especially useful, as it graphs utilization over time so you can easily see when your network interfaces are getting busy, as well as how much bandwidth they are pushing.
Analysis Reveals...
By this point, you should have a good idea about where the system is weak. Make sure you have good notes, as you need this information in the next chapter when you design your new system.
Giving performance tuning a full treatment is beyond the scope of this book. True performance tuning gets exponentially harder; it is much more difficult to get the last 10 percent out of a system than the first 90 percent. If you are interested in high-end performance tuning, read Sun Performance and Tuning: Java and the Internet, 2nd Edition by Adrian Cockcroft and Richard Pettit (ISBN 0-13-095249-4) and "Application Performance Optimization" by Börje Lindh (Sun Microsystems AB, Sweden), Sun BluePrints OnLine, March 2002.