System or Kernel CPU Usage
Chapter 2, "Operating System Performance Monitoring," suggests one of the statistics to monitor is system or kernel CPU utilization. If CPU clock cycles are spent executing operating system or kernel code, those are CPU clock cycles that cannot be used to execute your application. Hence, a strategy to improve the performance of an application is to reduce the amount of time it spends consuming system or kernel CPU clock cycles. However, this strategy is not applicable in applications that spend little time executing system or kernel code. Monitoring the operating system for system or kernel CPU utilization provides the data as to whether it makes sense to employ this strategy.
The Oracle Solaris Performance Analyzer collects system or kernel CPU statistics as part of an application profile. This is done by selecting the View > Set Data Presentation menu in Performance Analyzer, choosing the Metrics tab, and setting the options to present system CPU utilization statistics, both inclusive or exclusive. Recall that inclusive metrics include not only the time spent in a given method, but also the time spent in methods it calls. In contrast, exclusive metrics report only the amount of time spent in a given method.
Figure 6-1 shows the Performance Analyzer's Set Data Presentation form with options selected to present both inclusive and exclusive System CPU metrics. Also notice the options selected report both the raw time value and the percentage of System CPU time.
Figure 6-1 Set system CPU data presentation
After clicking on the OK button, the Performance Analyzer displays the profile's System CPU inclusive and exclusive metrics in descending order. The arrow in the metric column header indicates how the data is presented and sorted. In Figure 6-2, the System CPU data is ordered by the exclusive metric (notice the arrow in the exclusive metric header and the icon indicating an exclusive metric).
Figure 6-2 Exclusive system CPU
Figure 6-2 shows a profile from an application that exhibits high system or kernel CPU utilization. You can see this application consumed about 33.5 seconds of System CPU in the java.io.FileOutputStream.write(int) method and about 11.6 seconds in a method called __write(), or about 65% and 22.5%, respectively. You can also get a sense of how significant the improvement can be realized by reducing the System CPU utilization of this application. The ideal situation for an application is to have 0% System CPU utilization. But for some applications that goal is difficult to achieve, especially if there is I/O involved, since I/O operations require a system call. In applications that require I/O, the goal is to reduce the frequency of making a system call. One approach to reduce the call frequency of an I/O system call is buffer the data so that larger chunks of data are read or written during I/O operations.
In the example shown in Figure 6-2, you can see the file write (output) operations are consuming a large amount of time as illustrated by the java.io.FileOutputStream.write(int) and __write() entries. To identify whether the write operations are buffered, you can use the Callers-Callees tab to walk up the call stack to see what methods are calling the FileOutputStream.write(int) method and the __write method. You walk up the call stack by selecting one of the callees from the upper panel and clicking the Set Center button. Figure 6-3 shows the Callers-Callees of the FileOutputStream.write(int) method.
Figure 6-3 FileOutputStream.write(int) callers and callees
The callers of FileOutputStream.write(int) are ExtOutputStream.write(int) and OutImpl.outc(int). 85.18% of the System CPU attributed to FileOutputStream.write(int) comes from its use in ExtOutputStream.write(int) and 14.82% of it from OutImpl.outc(int). A look at the implementation of ExtOutputStream.write(int) shows:
public void write(int b) throws IOException { super.write(b); writer.write((byte)b); }
A look at the implementation of super.write(b) shows it is not a call to FileOutputStream.write(int):
Public void write(int b) throws IOException { crc = crc * 33 + b; }
But the writer field in ExtOutputStream is declared as a FileOutputStream:
private FileOutputStream writer;
And it is initialized without any type of buffering:
writer = new FileOutputStream(currentFileName);
currentFileName is a field declared as a String:
private String currentFileName;
Hence, an optimization to be applied here is to buffer the data being written to FileOutputStream in ExtOutputStream using a BufferedOutputStream. This is done rather quickly and easily by chaining or wrapping the FileOutputStream in a BufferedOutputStream in an ExtOutputStream. Here is a quick listing of the changes required:
// Change FileOutputStream writer to a BufferedOutputStream // private FileOutputStream writer; private BufferedOutputStream writer;
Then chain a BufferedOutputStream and FileOutputStream at initialization time:
// Initialize BufferedOutputStream // writer = new FileOutputStream(currentFileName); writer = new BufferedOutputStream( new FileOutputStream(currentFileName));
Writing to the BufferedOutputStream, instead of the FileOutputStream, in ExtOutputStream.write(int b) does not require any update since BufferOutputStream has a write() method that buffers bytes written to it. This ExtOutputStream.write(int b) method is shown here:
public void write(int b) throws IOException { super.write(b); // No update required here, // automatically uses BufferedOutputStream.write() writer.write((byte)b); }
The other uses of the writer field must be inspected to ensure the use of BufferedOutputStream operates as expected. In ExtStreamOutput, there are two additional uses of the writer field, one in a method called reset() and another in checkResult(). These two methods are as follows:
public void reset() { super.reset(); try { if (diffOutputStream != null) { diffOutputStream.flush(); diffOutputStream.close(); diffOutputStream = null; } if (writer != null) { writer.close(); } } catch (IOException e) { e.printStackTrace(); } } public void checkResult(int loopNumber) { try { writer.flush(); writer.close(); } catch (IOException e) { e.printStackTrace(); } check(validiationProperties.getProperty(propertyName)); outProperties.put(propertyName, "" + getCRC()); reset(); }
The uses of writer as a BufferedOutputStream works as expected. It should be noted that the API specification for BufferedOutputStream.close() indicates it calls the BufferedOutputStream.flush() method and then calls the close() method of its underlying output stream, in this case the FileOutputStream.close() method. As a result, the FileOutputStream is not required to be explicitly closed, nor is the flush() method in ExtOutputStream.checkResult(int) required. A couple of additional enhancements worth consideration are
- A BufferedOutputStream can also be allocated with an optional buffered size. The default buffer size, as of Java 6, is 8192. If the application you are profiling is writing a large number of bytes, you might consider specifying an explicit size larger than 8192. If you specify an explicit size, consider a size that is a multiple of the operating systems page size since operating systems efficiently fetch memory that are multiples of the operating system page size. On Oracle Solaris, the pagesize command with no arguments reports the default page size. On Linux, the default page size can be obtained using the getconf PAGESIZE command. Windows on x86 and x64 platforms default to a 4K (4096) page size.
- Change the ExtOutputStream.writer field from an explicit BufferedOutputStream type to an OutputStream type, that is, OutputStream writer = new BufferedOutputStream(), instead of BufferedOutputStream writer = new BufferedOutputStream(). This allows for additional flexibility in type of OutputStream, for example, ByteArrayOutputStream, DataOutputStream, FilterOutputStream, FileOutputStream, or BufferedOutputStream.
Looking back at Figure 6-3, a second method calls FileOutputStream.write(int) called org.w3c.tidy.OutImpl.outc(int), which is a method from a third-party library used in the profiled application. To reduce the amount of system CPU utilization used in a third-party supplied method, the best approach is to file a bug or enhancement request with the third-party library provider and include the information from the profile. If the source is accessible via an open source license and has acceptable license terms, you may consider further investigating and including additional information in the bug or enhancement request report.
After applying the changes identified in ExtOutputStream, using the BufferedOutputStream and its default constructor (not including the two additional enhancements just mentioned), and collecting a profile, the amount of system CPU utilization drops substantially. Comparing the profiles in Figure 6-4 to those in Figure 6-2, you can see the amount of inclusive system CPU time spent in java.io.FileOutputStream has dropped from 45.182 seconds to 6.655 seconds (exclusive system CPU time is the second column).
Figure 6-4 Reduced system CPU utilization
Executing this application workload outside the profiler in a performance testing environment prior to making the modifications reports it took this application 427 seconds to run to completion. In constrast, the modified version of the application workload that uses the BufferOutputStream in the same performance testing environment reports it runs to completion in 383 seconds. In other words, this application realized about a 10% improvement in its run to completion execution.
In addition, looking at the Callers-Callees tab for java.io.FileOutputStream.write(int), only the call to org.w3c.tidy.OutImpl.outc(int) remains as a significant consumer of the FileOutputStream.write(int) method. The Callers-Callees of FileOutputStream.write(int) are shown in Figure 6-5.
Figure 6-5 Callers-Callees after changes
Comparing the Callers-Callees in Figure 6-5, after the changes to ExtStream Output, with the Callers-Callees in Figure 6-3, prior to the changes, you can see the amount of attributable time spent in org.w3c.tidy.OutImpl.outc(int) stays close to the same. This should not be a surprise since the changes made to ExtStreamOutput now use BufferedOutputStream. But recall that the BufferedOutputStream invokes a FileOutputStream method when any of the underlying buffer in the BufferedOutputStream becomes full, the BufferedOutputStream.flush() method is called, or when the BufferedOutputSteam.close() method is called. If you look back at Figure 6-4 you see a FileOutputStream.writeBytes(byte[], int, int) method. This is the method that the BufferedOutputStream calls from ExtStreamOutput. Figure 6-6 shows the Callers-Callees tab for the FileOutputStream.writeBytes(byte[], int, int).
Figure 6-6 Callers-Callees of FileOutputStream.writeBytes(byte[],int,int)
Selecting java.io.FileOutputStream.write(byte[], int, int) method from the upper Callee panel and clicking the Set Center button illustrates that BufferedOutputStream.flushBuffer() is its callee; see Figure 6-7.
Figure 6-7 Callers-Callees of FileOutputStream.writeBytes(byte[], int, int)
Selecting the BufferedOutputStream.flushBuffer() method in the upper Callee panel and clicking the Set Center button shows the callee of java.io.BufferedOutputStream.flushBuffer() is BufferedOutputStream.write(int). The Callers-Callees of BufferedOutputStream.flushBuffer() are shown in Figure 6-8.
Figure 6-8 Callers-Callees of BufferedOutputStream.flushBuffer()
Selecting the BufferedOutputStream.write(int) method in the upper Callee panel and clicking the Set Center button shows the callee of java.io.BufferedOutputStream.write(int) is ExtOutputStream.write(int), the method that has been modified. The Callers-Callees of BufferedOutputStream.write(int) are shown in Figure 6-9.
Figure 6-9 Callers-Callees of BufferedOutputStream.write(int)
As mentioned earlier, the next step in reducing System CPU utilization for this application requires a modification to a third-party library, a library that holds the implementation of org.w3c.tidy.OutImpl.outc(int). It may be possible for the maintainers of the third-party library to implement a similar modification to OutImpl.outc(int) as just described and implemented for ExtOutputStream.write(int). However, the performance improvement realized will likely not be as significant since the profile suggests there is more System CPU utilization attributed to the call path of ExtOutputStream.write(int) than to OutImpl.outc(int); refer to Figure 6-3 for attributable System CPU utilization on callers of FileInputStream.write(int). In addition, looking at the amount of System CPU utilization consumed in OutImpl.outc(int), about 6.6 seconds, compared to the total application runtime of 383 seconds is rather small, about 1.5%. Hence, a modification to reduce the amount of System CPU utilization spent in OutImpl.outc(int) would likely not yield more than 1% to 2% improvement.
An additional strategy to reduce system CPU utilization for applications performing large amounts of network I/O is utilizing Java NIO nonblocking data structures. Java NIO was introduced in Java 1.4.2 with many runtime performance improvements added in Java 5 and Java 6. Java NIO nonblocking data structures allow for the ability to read or write as much data as possible in a single call to a network I/O (read or write) operation. Remember that every network I/O call eventually results in the invocation of an operating system's system call, which consumes system CPU utilization. The challenge with using Java NIO nonblocking data structures is it is more difficult to program than using blocking Java NIO or the older, more traditional Java SE blocking data structures such as java.net.Socket. In a Java NIO nonblocking output operation, you can write as many bytes as the operating system allows to be written. But you have to check the return value of the output operation to determine whether all the bytes you asked to be written have indeed been written. In a Java NIO nonblocking input operation, where you read as many bytes as are available, you have to check how many bytes have been read. You also have to implement some complex programming logic to deal with partially read protocol data units, or multiple protocol data units. That is, you may not be able to read enough bytes in a single read operation to construct a meaningful protocol data unit or message. In the case of blocking I/O, you simply wait until you generally read the specified number of bytes that constitute a full protocol data unit or message. Whether to migrate an application to utilize nonblocking network I/O operations should be decided upon by the application's performance needs. If you want to take advantage of the additional performance promised by using nonblocking Java NIO, you should consider using a general Java NIO framework to make the migration easier. Several popular Java NIO frameworks are available such as Project Grizzly (https://grizzly.dev.java.net) and Apache MINA (http://mina.apache.org).
Another area where high System CPU utilization may show up is in applications experiencing heavy lock contention. Identifying lock contention in a profile and approaches to reduce lock contention are discussed in the next section.