23.8 Cookbook Approach to Problem Resolution
When performance problems occur on WebSphere V5 for z/OS, there are many places to look. The best approach we have found is to divide and conquer, running the least intrusive tests first (steps 1 through 5). The more intrusive tests generally require a test system and a load generator.
23.8.1 Nonintrusive Procedures
These steps can be performed on a production system and do not involve modifying the code. They can be executed by either administrative or development personnel. An additional benefit is that they are run under a full workload (production or generated stress), so the observed behavior includes all concurrency issues as well as the issues within an individual request.
Step 1: Look for the obvious
- Are there traces on (in WebSphere or on the system) that impact performance?
- Are the container configurations consistent with best practices?
- Have the system tuning items (covered in Operations and Administration) been done?
- Are the performance goals reasonable, based on the application and the resources?
- Review the verbose GC output and look for memory leaks (see the note after this list).
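As a reminder of how verbose garbage collection output is typically enabled (the exact mechanism for your configuration may differ, so treat this as an illustration), the generic JVM argument is:

-verbose:gc

Under a steady workload, the amount of live heap remaining after each collection should level off; if it climbs steadily from one collection cycle to the next, that is the classic signature of a memory leak.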
Step 2: Simple controlled test
Run a simple, controlled (and easily repeatable) test and review the RMF data.
If response time or throughput goals are not being met and the CPU is not heavily utilized, there are likely external delays. Look for GRS contention and other delays in the RMF reports. Look at the workload analysis for the enclaves to see what may be delaying them on the system. Compare the APPL% to the response time during high- and low-usage periods.
If the CPU is heavily utilized, determine the CPU seconds/transaction (discussed earlier) versus the transaction volume. If the CPU seconds per transaction are higher than expected, then application tuning may be in order (Java tracing, Jinsight, and the other actions discussed in the remainder of this section).
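As a hedged illustration of that calculation (assuming, as discussed earlier, that APPL% in the RMF workload activity data expresses CPU consumption as a percentage of one processor over the report interval; all of the figures below are invented), a small Java sketch might look like:

// Illustrative sketch only: derives CPU seconds per transaction from
// RMF-style inputs. The variable values are assumptions for this example,
// not output from any actual RMF interface.
public class CpuPerTran {
    public static void main(String[] args) {
        double applPercent = 45.0;      // APPL% from the RMF workload activity report (assumed)
        double intervalSeconds = 900.0; // length of the RMF interval, e.g. 15 minutes
        double transactions = 18000.0;  // transactions ended in the interval (assumed)

        // Treating APPL% as a percentage of one processor,
        // CPU seconds consumed = APPL% / 100 * interval length.
        double cpuSeconds = (applPercent / 100.0) * intervalSeconds;
        double cpuSecondsPerTran = cpuSeconds / transactions;

        System.out.println("CPU seconds/tran = " + cpuSecondsPerTran);
        // If this number is noticeably higher than expected for the
        // application, application-level tuning (Java tracing, profiling)
        // is the next step, as described in the text.
    }
}

With the invented numbers above, 45% of one engine over a 900-second interval is 405 CPU seconds, or 0.0225 CPU seconds per transaction across 18,000 transactions.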
Step 3: Dump
Taking one or more dumps (snapshots) during a period of high volume can provide insight into where the threads are spending their time. Use the method of displaying the trace-backs discussed earlier. Focus on call-stacks that contain the SR_ExecutionRoutine CSECT. Look for patterns at the tops of the call-stacks, especially those leading into monitor locks. This can often detect bottlenecks in areas such as:
- Calls to back-end systems that are taking too long
- Calls to DB2
- Synchronized sections of an application or third-party product in use
- Excessive logging
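To make the third bullet concrete, the following is a minimal, hypothetical Java sketch (the class and method names are invented for illustration, not taken from any real product) of the kind of synchronized section that shows up in the dump trace-backs as many threads queued on the same monitor:

// Hypothetical example of a serialization point. Under load, a dump will
// show many SR_ExecutionRoutine call-stacks parked on the monitor below.
public class AuditLogger {
    private static final Object LOCK = new Object();

    public static void log(String message) {
        // Every request thread funnels through this one monitor, so the
        // time spent inside the block becomes a system-wide bottleneck.
        synchronized (LOCK) {
            writeToSharedFile(message);   // slow, single-threaded work
        }
    }

    private static void writeToSharedFile(String message) {
        // placeholder for an expensive operation such as file or socket I/O
    }
}

In a dump taken under load, most request threads would show this monitor entry near the top of their call-stacks, which is exactly the pattern described above.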
Step 4: Container tracing
Starting in WebSphere V4 (and improving in V5), container tracing can be turned on and off dynamically. This is a great way to help find the delays in your workload. Lightweight traces can be turned on with the modify commands described in chapter 3 of Operations and Administration. The best details on the tracing can be found in Appendix A of Assembling J2EE Applications. A useful trace option is to turn on the basic trace with the console command:
/f <ServerProc>,tracebasic=(3,4,5,6)
Then create or allow some workload to run, and afterward issue:
/f <ServerProc>,traceinit
Step 5: Java Tracing
WebSphere V5 provides dynamic Java tracing, which aids tremendously in isolating problems. Via the Administrative console or the commands in chapter 3 of Operations and Administration, very detailed Java tracing can be turned on and off dynamically.
23.8.2 Intrusive Procedures
These tests involve modifying code or stopping and starting servers. Thus, they are best done on test or performance systems, and they are generally not run with stress on the system.
Step 6: WSAD profiling or Jinsight
By the time of publication, the Jinsight functionality had been superseded by WSAD profiling, and it is recommended that WSAD profiling be used instead.
Jinsight is an IBM tool that allows all method calls in a JVM to be captured and timed. Information on Jinsight (including download instructions) can be found at http://www106.ibm.com/developerworks/java/library/j-jinsight/.
WSAD V5 incorporates much of the functionality of Jinsight. By using the profiling features in WSAD, developers should be able to get a feel for the cost of each item in the call path.
Jinsight (which is not a supported product) has the additional benefit of running the profiling on the z/OS system itself, which may behave differently from the development system. Jinsight tracing can be done with the generally available Jinsight 2. In reviewing Jinsight output, we have found the following procedure to work best:
1. Have an IBM person use Jinsight Live to capture multiple separate traces, one for each key request and/or data path.
2. For each result, load the file into Jinsight: start Jinsight, select File > Open a Trace File, select the appropriate file, and then press the Load button.
3. Select Views > Execution; you will be presented with the work on each thread. Use the mouse to highlight (turn yellow) most of the work on one or more threads, starting near the left side of the work graphic but not all the way to the left.
4. Select Selected > Drill down from selected items > Call Tree. This provides an excellent view of the amount of time taken by each method of the call. Focus more on the percentages than on the actual contribution column.
Step 7: Footprinting the application
If none of the steps already discussed have isolated the problem to the point where it can be fixed, then it is time to begin footprinting the code. This can be done elegantly with Log4J or JRAS, or in a more homegrown fashion with System.out.println() or something similar. This should help isolate the methods that are causing the problem.
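As a sketch of the homegrown variant (the class, methods, and messages are invented purely for illustration; Log4J or JRAS calls could be substituted for the println() calls), elapsed-time footprints around suspect calls can narrow a delay down to a single method:

// Illustration only: simple elapsed-time footprints written with
// System.out.println().
public class OrderService {

    public void placeOrder(String orderId) {
        long start = System.currentTimeMillis();
        System.out.println("FOOTPRINT placeOrder start " + orderId);

        validate(orderId);
        System.out.println("FOOTPRINT validate done after "
                + (System.currentTimeMillis() - start) + " ms");

        persist(orderId);
        System.out.println("FOOTPRINT persist done after "
                + (System.currentTimeMillis() - start) + " ms");
    }

    private void validate(String orderId) { /* application logic */ }

    private void persist(String orderId)  { /* application logic */ }
}

Comparing the footprint timestamps under load shows which call accounts for the bulk of the response time, which is usually enough to focus further tuning or tracing on the offending method.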