- 4.1 Chapter Objectives
- 4.2 Tools That Report System Configuration
- 4.3 Tools That Report Current System Status
- 4.4 Process- and Processor-Specific Tools
- 4.5 Information about Applications
4.4 Process- and Processor-Specific Tools
4.4.1 Introduction
This section covers tools that report the status of a particular process, or the events encountered by a particular processor.
4.4.2 Timing Process Execution (time, timex, and ptime)
The commands time, timex, and ptime all report the amount of time that a process uses. They share the same basic syntax, as shown in Example 4.30. All three tools report the time a process spends in user code and in system code, as well as the elapsed time, or wall time, for the process; the wall time is the time between when the process started and when it completed. A multithreaded process running on more than one processor will typically report a combined user and system time that is greater than the wall time. The tools differ in output format, and timex can be passed additional options that cause it to report more information.
Example 4.30. Syntax of the time Command
% time <app> <params>
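For reference, the output has roughly the following form. This is an illustrative sketch in the style of ptime with made-up figures for a hypothetical application; the exact labels and precision differ between the three tools. In this sketch the combined user and system time exceeds the wall time, which is what a multithreaded process running on more than one processor would report.

% ptime ./myapp params
real       12.4
user       20.6
sys         1.0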
4.4.3 Reporting System-Wide Hardware Counter Activity (cpustat)
cpustat first shipped with Solaris 8. It is a tool for inspecting the hardware performance counters on all the processors in a system. The performance counters count events that occur on the processor; for example, one counter may be incremented every time data is fetched from memory, and another may be incremented every time an instruction is executed. The events that can be counted, and the number of events that can be counted simultaneously, are hardware-dependent. Opteron processors can typically count four different event types at the same time, whereas UltraSPARC processors can typically count only two. We will discuss hardware performance counters in greater depth in Chapter 10.
cpustat reports the number of performance counter events on a system-wide basis, hence it requires superuser permissions to run. Because the counts cover everything on the system, the reported values represent the events encountered by all running programs. If the system is running a mix of workloads, this information may not be of great value, but if the system is performing a single job, this level of aggregation can still provide useful information.
Assume that the system is dedicated to a single task—the program of interest—and the program is in some kind of steady state (e.g., it is a Web server that is dealing with many incoming requests). The command line for cpustat, shown in Example 4.31, is appropriate for an UltraSPARC IIICu-based system. The output is a way of determining which performance counters are worth further investigation.
Example 4.31. Sample Command Line for cpustat to Collect System-Wide Stats
$ cpustat -c pic0=Dispatch0_IC_miss,pic1=Dispatch0_mispred,sys \
          -c pic0=Rstall_storeQ,pic1=Re_DC_miss,sys \
          -c pic0=EC_rd_miss,pic1=Re_EC_miss,sys \
          -c pic0=Rstall_IU_use,pic1=Rstall_FP_use,sys \
          -c pic0=Cycle_cnt,pic1=Re_PC_miss,sys \
          -c pic0=Instr_cnt,pic1=DTLB_miss,sys \
          -c pic0=Cycle_cnt,pic1=Re_RAW_miss,sys
The -c flag to cpustat (and to cputrack) specifies a pair of counters on which to collect; these are referred to as pic0 and pic1. More than 60 event types are available to select from on the UltraSPARC IIICu processor, and two can be selected at once. Some of the event types are available on only one of the two counters, so not every pairing is possible. The ,sys appended to each pair of counter descriptions indicates that the counters should also be collected during system time. When multiple -c options are given, the counter pairs are collected in rotation, each for a short period of time; the default interval is five seconds.
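If rotation is not wanted, a single counter pair can instead be sampled at a finer granularity. The sketch below assumes cpustat's optional trailing interval and count arguments; it samples cycle counts and data-cache misses every second, 30 times, instead of rotating through pairs every five seconds.

$ cpustat -c pic0=Cycle_cnt,pic1=Re_DC_miss,sys 1 30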
If the program is not in a steady state (suppose it reads some data from memory and then spends the next few seconds in intensive floating-point operations), it is quite possible that this coarse rotation will miss the interesting points (e.g., looking for cache misses during the floating-point-intensive code, and looking for floating-point operations while the data is being fetched from memory). Example 4.32 shows the command line for cpustat rotating through a selection of performance counters, together with partial output from the command.
Example 4.32. Example of cpustat Output
$ cpustat -c pic0=Rstall_storeQ,pic1=Re_DC_miss,sys \
>         -c pic0=EC_rd_miss,pic1=Re_EC_miss,sys \
>         -c pic0=Rstall_IU_use,pic1=Rstall_FP_use,sys \
>         -c pic0=Cycle_cnt,pic1=Re_PC_miss,sys \
>         -c pic0=Instr_cnt,pic1=DTLB_miss,sys \
>         -c pic0=Cycle_cnt,pic1=Re_RAW_miss,sys
   time cpu event      pic0      pic1
  5.005   0  tick    294199   1036736  # pic0=Rstall_storeQ,pic1=Re_DC_miss,sys
  5.005   1  tick    163596  12604317  # pic0=Rstall_storeQ,pic1=Re_DC_miss,sys
 10.005   0  tick      5485    965974  # pic0=EC_rd_miss,pic1=Re_EC_miss,sys
 10.005   1  tick     76669  11598139  # pic0=EC_rd_miss,pic1=Re_EC_miss,sys
...
The columns of cpustat output shown in Example 4.32 are as follows.
- The first column reports the time of the sample. In this example, the samples are being taken every five seconds.
- The next column lists the CPU identifier. The samples are taken and reported for each CPU.
- The next column lists the type of event. For cpustat, the type of event is only going to be a tick.
- The next two columns list the counts for performance counters pic0 and pic1 since the last tick event.
- Finally, if cpustat is rotating through counters, the names of the counters are reported after the # sign.
4.4.4 Reporting Hardware Performance Counter Activity for a Single Process (cputrack)
cputrack first shipped with Solaris 8. It is another tool that reports the number of performance counter events. However, cputrack has the advantages of collecting events only for the process of interest and reporting the total number of such events at the end of the run. This makes it very useful for situations in which the application starts, does something, and then exits.
The script in Example 4.33 shows one way that cputrack might be invoked on a process.
Example 4.33. Script for Invoking cputrack on an Application
$ cputrack -c pic0=Dispatch0_IC_miss,pic1=Dispatch0_mispred,sys \
           -c pic0=Rstall_storeQ,pic1=Re_DC_miss,sys \
           -c pic0=EC_rd_miss,pic1=Re_EC_miss,sys \
           -c pic0=Rstall_IU_use,pic1=Rstall_FP_use,sys \
           -c pic0=Cycle_cnt,pic1=Re_PC_miss,sys \
           -c pic0=Instr_cnt,pic1=DTLB_miss,sys \
           -c pic0=Cycle_cnt,pic1=Re_RAW_miss,sys \
           -o <desired location of results file> \
           <application>
The script in Example 4.33 demonstrates how to use cputrack to rotate through the counters and capture data about the run of an application. The same caveat applies as for cpustat: rotating through counters may miss the events of interest. An alternative is to give cputrack just a single pair of counters, as Example 4.34 shows.
Example 4.34. Example of cputrack on a Single Pair of Counters
$ cputrack -c pic0=Cycle_cnt,pic1=Re_DC_miss testcode
   time lwp      event        pic0       pic1
  1.118   1       tick   663243149   14353162
  2.128   1       tick   899742583    9706444
  3.118   1       tick   885525398    7786122
  3.440   1       exit  2735203660   33964190
The output in Example 4.34 shows a short program that runs for about three and a half seconds. cputrack counts the number of processor cycles consumed by the application on counter 0, and the number of data-cache miss events on counter 1. The counts are reported for each roughly one-second sampling interval, except for the line marked "exit," which contains the total counts over the entire run. The columns in the output are as follows.
- time: The time at which the sample was taken. In this case, the samples were taken at one-second intervals.
- lwp: The LWP that is being sampled. If the -f option is passed to cputrack, it will follow child processes. In this mode, data from other LWPs will be interleaved.
- event: The event type, such as ticks or the exit line. Each tick event is the number of events since the last tick. The exit line occurs when a process exits, and it reports the total number of events that occurred over the duration of the run. The event line also reports data at points where the process forks or joins.
- pic0 and pic1: The last two columns report the number of events for the two performance counters. If cputrack were rotating through performance counters, the names of the performance counters would be reported after a # sign.
It is also possible to attach cputrack to a running process. The option for this is -p <pid>, and cputrack will report the events for that process.
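For example, attaching to a hypothetical process with ID 12345, using the same UltraSPARC IIICu counter names as the earlier examples:

$ cputrack -c pic0=Cycle_cnt,pic1=Re_DC_miss -p 12345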
4.4.5 Reporting Bus Activity (busstat)
The busstat tool reports performance counter events for the system bus. The available performance counters are system-dependent. The -l option lists the devices that have performance counter statistics available, and the -e option lists the events available on a particular device.
The currently set performance counters can be read using the -r option. Selecting particular performance counters requires the -w option, which in turn requires superuser privileges. An example of using busstat to measure memory activity on an UltraSPARC T1-based system is shown in Example 4.35.
Example 4.35. Using busstat to Query Memory Activity on an UltraSPARC T1
# busstat -l
Busstat Device(s): dram0 dram1 dram2 dram3 jbus0
# busstat -e dram0
pic0
mem_reads
mem_writes
....
pic1
mem_reads
mem_writes
...
# busstat -w dram0,pic0=mem_reads,pic1=mem_writes
time dev    event0       pic0   event1        pic1
1    dram0  mem_reads   45697   mem_writes    8775
2    dram0  mem_reads   37827   mem_writes    3767
4.4.6 Reporting on Trap Activity (trapstat)
trapstat is a SPARC-only tool that first shipped with Solaris 9 and enables you to look at the number of traps the kernel is handling. It counts the number of traps on a system-wide basis, hence it requires superuser privileges. For example, it is a very useful tool for looking at TLB misses on the UltraSPARC II. Example 4.36 shows output from trapstat.
Example 4.36. Sample Output from trapstat
vct name                |     cpu0
------------------------+---------
 20 fp-disabled         |        6
 24 cleanwin            |       31
 41 level-1             |        4
 44 level-4             |        0
 46 level-6             |        2
 49 level-9             |        1
 4a level-10            |      100
 4e level-14            |      101
 60 int-vec             |        3
 64 itlb-miss           |        3
 68 dtlb-miss           |   144621
 6c dtlb-prot           |        0
 84 spill-user-32       |        0
 8c spill-user-32-cln   |        0
 98 spill-kern-64       |      612
 a4 spill-asuser-32     |        0
 ac spill-asuser-32-cln |      199
 c4 fill-user-32        |        0
 cc fill-user-32-cln    |       70
 d8 fill-kern-64        |      604
108 syscall-32          |       26
In the output shown in Example 4.36, a number of data TLB traps are occurring. This is the number per second, and 144,000 per second is not an unusually high number. Each trap takes perhaps 100 ns, so this rate corresponds to only about 14 ms out of each second of real time. We discussed TLB misses in greater detail in Section 4.2.6.
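Rather than estimating the cost this way, trapstat can report it directly. The sketch below assumes the -t option of the Solaris trapstat, which breaks out TLB miss activity in more detail, including the percentage of time spent handling the misses.

# trapstat -t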
A high rate of any trap is a cause for concern. The traps that most often have high rates are TLB misses (either instruction [itlb-miss] or data [dtlb-miss]), and spill and fill traps.
Spill and fill traps indicate that the code is making many calls to routines, and each call may make further subcalls before returning (think of the program having to run up and then down a flight of stairs for each function call and its corresponding return). Each time the processor makes a call, it needs a fresh register window. When no register window is available, the processor will trap so that the operating system can provide one. I discussed register windows in Section 2.3.3 of Chapter 2. It may be possible to avoid this by either compiling with crossfile optimizations enabled (as discussed in Section 5.7.2 of Chapter 5), or restructuring the code so that each call will do more work.
It is possible to make trapstat run on a single process. The command line for this is shown in Example 4.37.
Example 4.37. Command Line to Run trapstat on a Single Program
# trapstat <program> <args>
At the end of the run, trapstat reports the number of traps that the process caused. By default, the figures are reported as the number of traps per second.
4.4.7 Reporting Virtual Memory Mapping Information for a Process (pmap)
The pmap utility returns information about the address space of a given process. Possibly its most useful output is the page size of each mapping, reported under the -s option.
Example 4.38 shows a sample of output from the pmap utility with the -s option. The output shows where the code, data, and libraries are located in memory. For each memory range, a page size is listed in the Pgsz column. In this case, all the memory has been allocated on 8KB pages (I discussed page sizes in Section 1.9.2 of Chapter 1). Output from pmap is the best way to determine whether an application has successfully obtained large pages.
Example 4.38. pmap -s Output
% pmap -s 7962
7962:   ./myapp params
 Address   Kbytes Pgsz Mode   Mapped File
00010000     272K   8K r-x--  /export/home/myapp
00054000      80K    - r-x--  /export/home/myapp
00068000      32K   8K r-x--  /export/home/myapp
0007E000      48K   8K rwx--  /export/home/myapp
...
000D2000    2952K   8K rwx--    [ heap ]
...
004D4000    1984K   8K rwx--    [ heap ]
006C4000       8K    - rwx--    [ heap ]
006C6000   50944K   8K rwx--    [ heap ]
...
FF210000       8K   8K r-x--  /usr/platform/sun4u-us3/lib/libc_psr.so.1
FF220000      32K   8K r-x--  /opt/SUNWspro/prod/usr/lib/libCrun.so.1
...
The page size for an application can be controlled at runtime (see Section 4.2.6) or at compile time (see Section 5.8.6 of Chapter 5).
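For illustration, one runtime mechanism on Solaris is the ppgsz utility, which requests a preferred page size when launching a program. This is a hedged sketch: the 4MB heap page size shown is hypothetical and must be one the hardware supports.

$ ppgsz -o heap=4M ./myapp params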
4.4.8 Examining Command-Line Arguments Passed to a Process (pargs)
The pargs command reports the arguments passed to a process. The command takes either a pid or a core file. An example of this command is shown in Example 4.39.
Example 4.39. Example of pargs
$ sleep 60&
[1] 18267
$ pargs 18267
18267:  sleep 60
argv[0]: sleep
argv[1]: 60
4.4.9 Reporting the Files Held Open by a Process (pfiles)
The pfiles utility reports the files that a given process currently has open, together with information about their attributes. An example of the output from this command is shown in Example 4.40.
Example 4.40. Output from pfiles
% pfiles 7093
7093:   -csh
  Current rlimit: 256 file descriptors
   0: S_IFCHR mode:0666 dev:118,32 ino:3422 uid:0 gid:3 rdev:13,2
      O_RDONLY|O_LARGEFILE
   1: S_IFCHR mode:0666 dev:118,32 ino:3422 uid:0 gid:3 rdev:13,2
      O_RDONLY|O_LARGEFILE
   2: S_IFCHR mode:0666 dev:118,32 ino:3422 uid:0 gid:3 rdev:13,2
      O_RDONLY|O_LARGEFILE
...
4.4.10 Examining the Current Stack of a Process (pstack)
The pstack tool prints out the stack dump from either a running process or a core file. An example of using this tool to print the stack of the sleep command is shown in Example 4.41. This tool is often useful in determining whether an application is still doing useful computation or whether it has hit a point where it is making no further progress.
Example 4.41. Output from pstack
% sleep 10 &
[1] 4556
% pstack 4556
4556:   sleep 10
 ff31f448 sigsuspend (ffbffaa8)
 00010a28 main     (2, ffbffbdc, ffbffbe8, 20c00, 0, 0) + 120
 000108f0 _start   (0, 0, 0, 0, 0, 0) + 108
4.4.11 Tracing Application Execution (truss)
truss is a useful utility for looking at the calls from an application to the operating system, calls to libraries, or even calls within an application. Example 4.42 shows an example of running the application ls under the truss command.
Example 4.42. Output of the truss Command Running on ls
$ truss ls
execve("/usr/bin/ls", 0xFFBFFBE4, 0xFFBFFBEC)  argc = 1
mmap(0x00000000, 8192, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_ANON, -1, 0) = 0xFF3B0000
resolvepath("/usr/bin/ls", "/usr/bin/ls", 1023)   = 11
resolvepath("/usr/lib/ld.so.1", "/usr/lib/ld.so.1", 1023) = 16
stat("/usr/bin/ls", 0xFFBFF9C8)                   = 0
open("/var/ld/ld.config", O_RDONLY)               Err#2 ENOENT
open("/usr/lib/libc.so.1", O_RDONLY)              = 3
fstat(3, 0xFFBFF304)                              = 0
....
When an application is run under truss, the tool reports every call the application makes to the operating system. This can be very useful when trying to determine what an application is doing. The -f flag will cause truss to follow forked processes.
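For example, the following traces a program together with all of the processes it forks, writing the trace to a file rather than to the terminal using the -o flag; the application name and output file name here are hypothetical.

$ truss -f -o /tmp/app.truss ./myapp params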
When truss is run with the -c flag, it provides a count of the number of calls made, as well as the total time accounted for by these calls. Example 4.43 shows the same ls command, but this time only the call counts are collected.
Example 4.43. Call Count for the ls Command Using truss
$ truss -c ls
syscall               seconds   calls  errors
_exit                    .000       1
write                    .000      35
open                     .000       7       3
close                    .000       4
time                     .000       1
brk                      .000      10
stat                     .000       1
fstat                    .000       3
ioctl                    .000       3
execve                   .000       1
fcntl                    .000       1
mmap                     .000       7
munmap                   .000       1
memcntl                  .000       1
resolvepath              .000       5
getdents64               .000       3
lstat64                  .000       1
fstat64                  .000       2
                     --------  ------   ----
sys totals:              .002      87      3
usr time:                .001
elapsed:                 .020
It is also possible to attach truss to an existing process. This generates the same style of output as running the process under truss from the start, and it is useful for checking whether a process is still doing something. Example 4.44 shows the command line to do this.
Example 4.44. Attaching truss to an Existing Process
$ truss -p <pid>
It is also possible to use truss to print out the calls within an application. The -u flag takes the names of the libraries of interest, or a.out to represent the application. The tool will then report calls made by these modules, as shown in Example 4.45.
Example 4.45. truss Used to Show Calls Made within an Application
% truss -u a.out bzip2
execve("bzip2", 0xFFBFFB84, 0xFFBFFB8C)  argc = 1
 -> atexit(0x2be04, 0x44800, 0x0, 0x0)
  -> mutex_lock(0x456e8, 0x0, 0x0, 0x0)
   -> _return_zero(0x456e8, 0x0, 0x0, 0x0)
  <- mutex_lock() = 0
  -> mutex_unlock(0x456e8, 0x1, 0x0, 0x0)
   -> _return_zero(0x456e8, 0x1, 0x0, 0x0)
  <- mutex_unlock() = 0
 <- atexit() = 0
 -> _init(0x0, 0x44800, 0x0, 0x0)
  -> _check_threaded(0x44a68, 0x0, 0x0, 0x0)
   -> thr_main(0x0, 0x0, 0x0, 0x0)
    -> _return_negone(0x0, 0x0, 0x0, 0x0)
   <- thr_main() = -1
  <- _check_threaded() = 0x44a68
 <- _init() = 0
 -> main(0x1, 0xffbffb84, 0xffbffb8c, 0x44800)
  -> signal(0x2, 0x1724c, 0x0, 0x0)
....
4.4.12 Exploring User Code and Kernel Activity with dtrace
The dtrace utility is part of Solaris 10, and it offers an unprecedented view into the behavior of user code and system code. Using dtrace, it is possible to count the number of times a routine gets called, time how long a routine takes, examine the parameters that are passed into a routine, and even find out what a routine was doing when a given event happened. All of this makes it a very powerful tool, but to describe it in sufficient detail is beyond the scope of this book. A simple example of the use of dtrace is shown in Example 4.46. This script counts the number of calls to malloc, and for each call it records the size of the memory requested.
Example 4.46. dtrace Script to Count the Calls to malloc
#!/usr/sbin/dtrace -s

pid$target:libc.so:malloc:entry
{
  @proc[probefunc]=count();
  @array[probefunc]=quantize(arg0);
}

END
{
  printa(@proc);
  printa(@array);
}
The output from this script, when run on the command ls, is shown in Example 4.47. The output shows that malloc is called 15 times, and displays a histogram of the requested memory sizes.
Example 4.47. Number and Size of Calls to malloc by ls
# ./malloc.d -c ls
dtrace: script './malloc.d' matched 2 probes
...
dtrace: pid 3645 has exited
CPU     ID                    FUNCTION:NAME
 16      2                             :END
  malloc                                                         15

  malloc
          value  ------------- Distribution ------------- count
              4 |                                         0
              8 |@@@@@                                    2
             16 |@@@@@@@@@@@@@@@@                         6
             32 |@@@                                      1
             64 |@@@@@@@@                                 3
            128 |                                         0
            256 |@@@@@                                    2
            512 |                                         0
           1024 |                                         0
           2048 |                                         0
           4096 |                                         0
           8192 |                                         0
          16384 |                                         0
          32768 |@@@                                      1
          65536 |                                         0
Once the sizes of the memory requests have been determined, it may also be interesting to find out where the memory requests are being made. The script shown in Example 4.48 captures the call stack for the calls to malloc.
Example 4.48. Capturing the User Call Stack for Calls to malloc
#!/usr/sbin/dtrace -s

pid$target:libc.so:malloc:entry
{
  @stack[ustack(5)]=count();
  @array[probefunc]=quantize(arg0);
}

END
{
  printa(@stack);
  printa(@array);
}
An example of running the script to capture the call stack of calls to malloc is shown in Example 4.49.
Example 4.49. Call Stack for malloc Requests Made by the Compiler
# ./malloc_stack.d -c "cc -O an.c"
dtrace: script './malloc_stack.d' matched 2 probes
dtrace: pid 4063 has exited
CPU     ID                    FUNCTION:NAME
  6      2                             :END
...
              cc`malloc
              libc.so.1`calloc+0x58
              cc`stralloc+0x8
              cc`str_newcopy+0xc
              cc`addopt+0x88
              154

  malloc
          value  ------------- Distribution ------------- count
              0 |                                         0
              1 |                                         2
              2 |@@@                                      21
              4 |@@@@@@                                   52
              8 |@@@@@@@@@@@                              88
             16 |@@@@@@@@                                 65
             32 |@@@@@@                                   52
             64 |@@@@@                                    44
            128 |@                                        5
            256 |                                         2
            512 |                                         1
           1024 |                                         0