3.3 I/O Assessment and Analysis Tools
The best way to assess I/O behavior and performance is with the system tools on the hosts that run the applications. An examination using system tools provides a top-down view of the I/O subsystem from the host system's perspective. A higher-level view of the I/O behaviors can sometimes be extracted from an application, such as a relational database management system (RDBMS), but not all applications can report this data, and even for those that do, there is no consistent way to extract it.
Because of this inconsistency and because system tools tend to be more consistent in their availability and data set measurement, it is best to start with the system tools themselves. The system tools provide a distilled version of the application I/O behavior at the device level. Any additional application-level device abstractions are lost, but the raw I/O behaviors will still show through in the analysis.
It is possible to perform an analysis of the I/O system from the storage device point of view in a bottom-up fashion. This method does not have the problems of an application-level analysis because of the common availability of useful statistics on almost all intelligent storage devices. Information gathering takes place with device-specific methods because standards for the contents of the data set and the data extraction method are not quite complete.1 New storage device management standards will make data gathering from storage devices more complete and consistent, so that all devices can provide the same basic utilization and performance data. Implementation is in various stages depending on the hardware and software vendors, the products in use, and the chosen device management method.
In general, put off device analysis until the host system analysis is complete. The storage device analysis has greater depth and narrower scope, and it requires more effort to perform. Delaying it allows a more focused look at the storage devices, whose large volume of storage-specific I/O data can easily swamp the investigator.
A few simple scripts written in Perl or a shell language can quickly examine UNIX hosts that have the sar utility. sar is available on almost all UNIX operating system variants, and its data set and output are quite consistent from one UNIX to another. The data available from the Windows NT perfmon command can also be processed fairly easily from its logged format.
A quick look at the sar man page on your UNIX host system will provide details on the timing and amount of data gathered. On most UNIX host systems, the data covers the past week's worth of system activity. A simple spreadsheet analysis of the data can provide information on maximum system bandwidth and IOPS, and it can also show patterns of usage throughout a day or over several days. Once the script is run on each host system, the collected data can be examined and, if necessary, combined with data from other host systems to provide a complete snapshot of the workload.
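Before running a full collection script, it can help to spot-check a single interval by hand and confirm that the sar archives exist and contain disk-activity data. The archive name sa15 and the time window below are only illustrative, and this assumes the system's sar/sadc configuration is collecting disk statistics:

# Show disk activity for one 20-minute window from the archive for the 15th
sar -d -f /var/adm/sa/sa15 -s 10:00:00 -e 10:20:30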
The get_io.sh script in Example 3.1 performs two functions:
- It gathers bandwidth and IOPS data from a host system.
- It outputs data files, built from the sar input data, for analysis in a spreadsheet.
The analysis of the data set gathered by the script is performed by putting the comma-separated-value output files of each data type (bandwidth or IOPS) for each day assessed into a spreadsheet. The data can then be graphed versus time to visualize the I/O behavior of the host system under evaluation in terms of bandwidth, IOPS, and I/O size. This visualization reveals significant I/O parameters for the SAN design, such as maximum bandwidth utilization, maximum IOPS utilization, workload windows, workload consistency, and characteristic I/O sizes. Additional mathematical analysis may be of use if the visualization provides poor insight into the I/O behaviors of the analyzed host system, but usually this is not required.
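If a spreadsheet is not immediately available, a rough summary can be pulled straight from the output files. The sketch below, which assumes the /usr/tmp/bw_1.csv file written by the get_io.sh script in Example 3.1, reports the peak and average bandwidth for one day of data:

# Peak and average bandwidth (MBps) for day 1 of the collected data
awk '{ if ($1 > max) max = $1; sum += $1; n++ }
     END { if (n > 0) print "max =", max, "MBps  avg =", sum / n, "MBps" }' /usr/tmp/bw_1.csv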
The fairly simple script in Example 3.1 takes data collected by the sar utility and creates twenty-minute aggregated data points of bandwidth and IOPS from the host system perspective on all I/O channels combined. See Figure 3.3 (on page 61, top) for an example of the output of the get_io.sh script. The two sets of output files from the script can also be combined to find out the typical I/O size of the application being examined over these intervals.
EXAMPLE 3.1. The get_io.sh shell script
#!/bin/sh
# get_io.sh
# Gather aggregate bandwidth and IOPS data from a host's sar data files

# Gather bandwidth data from sar archives
day=1
for sarfile in `ls /var/adm/sa/sa[0-2]*`
do
    shour=0
    ehour=0
    min=0
    while [ $shour -le 23 ]
    do
        ehour=`expr $shour + 1`
        interval=0
        # Divide each hour into 3 parts because the data is in 20-minute
        # intervals
        while [ $interval -le 2 ]
        do
            case "$interval" in
            0)
                blocks=0
                sum=0
                # Extract the data from a sar archive file and
                # sum the blks/s column
                for blocks in `sar -d -f $sarfile -s $shour:00:00 -e $shour:20:30 | egrep -v "IRIX|sun4|HP-UX|AIX|,|^[0-2]" | awk '{print $5}'`
                do
                    sum=`expr $sum + $blocks`
                done
                # Clean up any old temp files, then compute bandwidth
                rm -f /usr/tmp/bcfile
                echo $sum " / 2 / 1024" >> /usr/tmp/bcfile
                echo quit >> /usr/tmp/bcfile
                bw=`bc -l /usr/tmp/bcfile`
                # Store the bandwidth result in a csv file
                echo $bw >> /usr/tmp/bw_$day.csv
                # Report the bandwidth result
                echo "Bandwidth is" $bw "MBps"
                ;;
            1)
                blocks=0
                sum=0
                for blocks in `sar -d -f $sarfile -s $shour:20:00 -e $shour:40:30 | egrep -v "IRIX|sun4|HP-UX|AIX|,|^[0-2]" | awk '{print $5}'`
                do
                    sum=`expr $sum + $blocks`
                done
                rm -f /usr/tmp/bcfile
                echo $sum " / 2 / 1024" >> /usr/tmp/bcfile
                echo quit >> /usr/tmp/bcfile
                bw=`bc -l /usr/tmp/bcfile`
                echo $bw >> /usr/tmp/bw_$day.csv
                echo "Bandwidth is" $bw "MBps"
                ;;
            2)
                if [ $shour -eq 23 ]
                then
                    break
                fi
                blocks=0
                sum=0
                for blocks in `sar -d -f $sarfile -s $shour:40:00 -e $ehour:00:30 | egrep -v "IRIX|sun4|HP-UX|AIX|,|^[0-2]" | awk '{print $5}'`
                do
                    sum=`expr $sum + $blocks`
                done
                rm -f /usr/tmp/bcfile
                echo $sum " / 2 / 1024" >> /usr/tmp/bcfile
                echo quit >> /usr/tmp/bcfile
                bw=`bc -l /usr/tmp/bcfile`
                echo $bw >> /usr/tmp/bw_$day.csv
                echo "Bandwidth is" $bw "MBps"
                ;;
            esac
            interval=`expr $interval + 1`
        done
        shour=`expr $shour + 1`
    done
    day=`expr $day + 1`
done

# Gather IOPS data from sar archives
day=1
rm -f /usr/tmp/bcfile
for sarfile in `ls /var/adm/sa/sa[0-2]*`
do
    shour=0
    ehour=0
    min=0
    while [ $shour -le 23 ]
    do
        ehour=`expr $shour + 1`
        interval=0
        while [ $interval -le 2 ]
        do
            case "$interval" in
            0)
                ios=0
                sum=0
                # Extract the data from a sar archive file and
                # sum the r+w/s column
                for ios in `sar -d -f $sarfile -s $shour:00:00 -e $shour:20:30 | egrep -v "IRIX|sun4|HP-UX|AIX|,|^[0-2]" | awk '{print $4}'`
                do
                    echo $ios "+ \\" >> /usr/tmp/bcfile
                done
                echo 0 >> /usr/tmp/bcfile
                echo quit >> /usr/tmp/bcfile
                # Compute the IOPS
                iops=`bc -l /usr/tmp/bcfile`
                # Store the result in a csv file
                echo $iops >> /usr/tmp/ios_$day.csv
                # Report the result
                echo "IOPS are" $iops
                # Clean up any old temp files
                rm -f /usr/tmp/bcfile
                ;;
            1)
                ios=0
                sum=0
                for ios in `sar -d -f $sarfile -s $shour:20:00 -e $shour:40:30 | egrep -v "IRIX|sun4|HP-UX|AIX|,|^[0-2]" | awk '{print $4}'`
                do
                    echo $ios "+ \\" >> /usr/tmp/bcfile
                done
                echo 0 >> /usr/tmp/bcfile
                echo quit >> /usr/tmp/bcfile
                iops=`bc -l /usr/tmp/bcfile`
                echo $iops >> /usr/tmp/ios_$day.csv
                echo "IOPS are" $iops
                rm -f /usr/tmp/bcfile
                ;;
            2)
                if [ $shour -eq 23 ]
                then
                    break
                fi
                ios=0
                sum=0
                for ios in `sar -d -f $sarfile -s $shour:40:00 -e $ehour:00:30 | egrep -v "IRIX|sun4|HP-UX|AIX|,|^[0-2]" | awk '{print $4}'`
                do
                    echo $ios "+ \\" >> /usr/tmp/bcfile
                done
                echo 0 >> /usr/tmp/bcfile
                echo quit >> /usr/tmp/bcfile
                iops=`bc -l /usr/tmp/bcfile`
                echo $iops >> /usr/tmp/ios_$day.csv
                echo "IOPS are" $iops
                rm -f /usr/tmp/bcfile
                ;;
            esac
            interval=`expr $interval + 1`
        done
        shour=`expr $shour + 1`
    done
    day=`expr $day + 1`
done
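The collector is run in place on the host under assessment. A minimal invocation, assuming the sar archives are in the /var/adm/sa directory the script expects and that /usr/tmp is writable, looks like this:

# Gather the aggregated data, then list the per-day output files
sh get_io.sh
ls /usr/tmp/bw_*.csv /usr/tmp/ios_*.csv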
The get_iosize.pl script in Example 3.2 takes pairs of bandwidth and IOPS output files from the script in Example 3.1 and uses the simple equation
I/O size = Bandwidth (KB/s) / IOPS
to generate the typical I/O size over the same intervals.
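For example, an interval averaging 8 MBps (8192 KB/s) at 1024 IOPS works out to 8192 / 1024 = 8 KB per I/O, which points to a small-block, transaction-style workload rather than a large-block, sequential one.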
The output of this script will add a bit more detail to the analysis of the application and host system. See Figure 3.3 (on page 61, bottom) for an example of the output from the get_iosize.pl script. The graphic analysis of the data shows patterns and anomalies. The more regular the patterns look in the graphical analysis in terms of IOPS, bandwidth, and I/O size, the more likely it is that the conclusions drawn from the patterns will be useful.
EXAMPLE 3.2. The get_iosize.pl Perl script
#!/usr/local/bin/perl
#
# get_iosize.pl
# Find the characteristic I/O size from the output of get_io.sh script

$i=1;
while ( $i <= 7 ) {
    # Open the result file for output from this script
    open (OUTFH, ">>/usr/tmp/iosize_$i") || die "Can't open file, $!\n";

    # Open and read the bandwidth and IOPS csv file pair written by get_io.sh
    open (BWFH, "/usr/tmp/bw_$i.csv") || die "Can't open file, $!\n";
    @bwinfo=<BWFH>;
    close (BWFH);
    open (IOPSFH, "/usr/tmp/ios_$i.csv") || die "Can't open file, $!\n";
    @iopinfo=<IOPSFH>;
    close (IOPSFH);

    # Make sure the number of data collection intervals
    # in each file matches or quit
    if ( $#bwinfo != $#iopinfo ) {
        printf "The files for day $i don't match. Exiting\n";
        exit;
    }

    $j=0;
    # Divide the bandwidth in KBytes by the number of IOPS
    # to get the I/O size
    while ( $j <= $#bwinfo ) {
        if ( $iopinfo[$j] != 0 ) {
            $iosize = $bwinfo[$j] * 1024 / $iopinfo[$j];
        }
        else {
            $iosize = 0;
        }
        # Report the I/O size result and record it in an output file.
        printf "Typical IO size is $iosize\n";
        printf OUTFH "$iosize\n";
        $j++;
    }
    close (OUTFH);
    $i++;
}
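Once the per-day bandwidth and IOPS files are in place, the script can be run directly. A minimal sketch, assuming the script sits in the current directory and perl is on the path:

# Compute per-interval I/O sizes for each day and list the result files
perl get_iosize.pl
ls /usr/tmp/iosize_*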
Less consistent graphs indicate more variable system usage, making the sizing task more difficult. Pattern uncertainties can lead to overconfiguration and waste of resources in the SAN design.