23.6 Administration and Monitoring
General administration of WebSphere V5 on z/OS is the same as on distributed platforms and is not the focus of this chapter, except where it impacts performance.
23.6.1 First, Read the Manual
The place to begin creating a WebSphere environment that performs well is the Operations and Administration manual at the following URL: ftp://ftp.software.ibm.com/software/webserver/appserv/zos_os390/v5/bos5b1001.pdf. Chapters 9 and 10 in this manual provide excellent detail on setting up the containers to interact efficiently with the various z/OS subsystems.
23.6.2 RMF and WLM Overview
Using RMF, you can gain great insight into how your container and application are performing, and RMF generally has very low overhead. If WLM (which was covered in an earlier chapter) is configured with reporting and service classes, then RMF can provide an accurate and timely view of how your servers and workloads are performing. This may lead to refinement of the WLM policy. The Operations and Administration manual discusses many of the WLM details, and they are summarized from a WebSphere perspective in chapter 19 of this book; here we add some special tips and examples.
23.6.3 WebSphere and WLM Classification
23.6.3.1 Basic WLM Considerations
Dynamic application environments simplify the WLM configuration such that you do not need to create new Application Environments for each server created.
Be sure that System address spaces (Controller, Naming daemon, and Node Agent) run with a high priority (i.e., SYSSTC). These are generally short-running tasks, and they should have a high velocity goal.
Servant address spaces need to be prioritized highly as well, although slightly lower than the Controller. These handle dispatching and nonenclave work (Garbage Collection, user threads, timers, etc.). Garbage collection is CPU-intensive, but, since it often stops all threads, a delay can impact all work in the server. User-created threads in the JVM (not something that is recommended, but it is sometimes necessary) will also run under the priority of the address space.
Application environments for work running under the Servants (discussed later in this chapter) should be prioritized with a response-time goal. Keep in mind that the goal should reflect all work. If this application environment is serving static Web content as well as complex requests, the static content could skew the numbers.
Classifying enclave and system work allows for granular RMF reporting to identify the actual "cost" of the various workloads.
TIP
It is best to serve static content out of a Web server instead of the Application Server. This avoids overhead for statics and allows the Application Server to spend its cycles on business-critical application work.
23.6.3.2 Classifying Work into Enclaves
All of the business logic in the application occurs in the enclave created for the work. Enclaves are discussed in the IBM Redbook OS/390 Workload Manager Implementation and Exploitation which can be found at this URL: http://www.redbooks.ibm.com/redbooks/pdfs/sg245326.pdf.
The application enclave runs under the goal associated with its service class. By defining WebSphere work into meaningful service classes, proper goals can be set both for monitoring and for allowing z/OS to focus resources as needed to meet defined goals. The service class for an enclave is generally set in one of two ways:
Default service class for the WLM subsystem. CB is a predefined subsystem in WLM (which has its roots in the product called Component Broker). You can define a default service class for the CB subsystem. If all WebSphere requests can run under one service class, this is the simplest and most straightforward way to define it.
Definition of a transaction class file in WebSphere to route URLs to transaction classes, then classification rules in WLM to use the WebSphere transaction class and classify the work into the appropriate service class. This is discussed in the next section.
Routing WebSphere URLs to Specific Service Classes
Transaction Class Mapping allows for a mapping of URLs in a Servant to specific transaction classes. These transaction classes can then be mapped with WLM classification rules to WLM service classes and reporting classes. Performance data for these WLM entities will then be reported in the RMF Workload Activity reports. The mapping is done with a text file that you specify for the server via the admin console under Web Container > Advanced Settings. The syntax of the file is
TranClassMap <Server:Port> <url> <TranClass>
For example:
TranClassMap *:* /trade/* TRADTCLS
This means that all work for the Trade application coming in to this server (or any server that uses this file since the server is wild-carded) will have a WLM transaction class of TRADTCLS. This can then be used in a WLM classification rule to route Trade work to a specific Service Class and Reporting class. An example of doing this is shown in Figure 23.3.
Figure 23.3 WLM Classification rule to classify WebSphere URLs.
Note that this classification rule is within the CB subsystem, and it is type TC (transaction class). This will cause the work to run under the TRADSCLS service class, and RMF will report on it in the TRADRPT reporting class.
The RMF section later in this chapter will provide a sample of the report output and how this can then be used to determine information about the performance of the application in production.
23.6.3.3 Classifying Nonenclave Work
To get a full picture of how WebSphere is running on the system, it is recommended that you also isolate the performance of the Controller and Servant Address Spaces. The total cost of a WebSphere workload includes its portion of time in the Controller and Servant Address spaces, added to the time in the enclave (and while DB2 time is included in the enclave, CICS Transaction Gateway, CICS, IMS, or any back-end systems may also need to be factored in the equation). Figure 23.4 shows an example of the WLM classification rules within the STC subsystem to classify address spaces with the pattern MCV5S%% into a special Controller service class and to route MCV5S%%S into a Servant service class. These, along with the enclaves that will be tied to these address spaces, will report into the MCRPT1 reporting class. By having Controller, Servant, and enclave time reporting to the same reporting class, RMF does most of the work for us.
Figure 23.4 WLM Classification rules to classify WebSphere Nonenclave work.
Handling the Servant Address Spaces and Threads
In WebSphere, a server is defined as one Controller and one or more Servants. This section discusses how to control the number of Servant address spaces and the number of threads per address space. In general, it is best to let WLM manage the number of address spaces, though there are controls in case they are needed. We suggest first not setting these controls and allowing WLM to manage it. Changing these settings can have unforeseen effects on system resources and on servers that process requests for multiple service classes.
Figure 23.5 shows an Administrative Console window under Server > Server Instance with a check box for handling of multiple Servants. If it is checked, then a minimum and maximum number of Servant address spaces can be selected. There are numerous factors to be considered in determining if and how to use this functionality, including:
Is there a problem with how WLM is managing it?
How many threads are in each Servant (covered in the next section)?
How many service classes are being serviced by this server (refer back to TranClass Mapping)? A Servant address space can serve one service class at a time (due to storage isolation issues that are beyond the scope of this book). If work arrives for more service classes than are defined with the maximum setting, then a serious queuing problem could occur.
Figure 23.5 Server Instance control over Minimum and Maximum Servants.
BBOO_WORKLOAD_PROFILE is a setting for a server within ORB Service > Advanced Settings, used to determine the number of actual worker threads that will be started in each Servant. Avoid assuming that more threads are always better. While more threads can help when an application makes calls to external servers, applications that are CPU-intensive within the Servant address space will not benefit from additional threads and may actually see a slight degradation in performance if too many worker threads exist in the address space (and remember that, unless instructed otherwise, WLM can create another Servant address space if more threads are needed). The settings and the resulting number of worker threads are
Table 23.2 WLM Settings and Worker Threads
Setting  | Worker Threads
NORMAL   | Either 1 or 3, depending on isolation level
CPUBOUND | max((num_of_CPU - 1), 3)
IOBOUND  | min(30, max(5, num_of_CPU * 3))
LONGWAIT | 40
TIP
Keep your z/OS LPAR size and your application behavior in mind when deciding on how many threads or Servants to use. Threads are less expensive relative to system resources, but a well-tuned server can use six threads to keep four or five engines very busy. If the server makes calls to external servers (Soap, CICS, etc.), then additional threads may be helpful since threads can be in a wait state.
23.6.3.4 RMF Considerations
Once your work is properly classified by WLM, RMF can provide a wealth of information. While anything in RMF could be relevant to performance, the initial areas to focus on are discussed below. Some important RMF/SMF considerations are
Areas that highlight basic system performance (DASD response times, enqueues, the Coupling Facilities, Crypto hardware, HFS performance, ...) should be reviewed regularly by your z/OS System Performance personnel.
The RMF interval should be shorter during performance testing so that performance can be seen at a more granular level (less smoothing of the spikes). This is a dynamic change, and it can be undone dynamically.
Workload Analysis can provide a great overall view of the application response time, application path-length, and resource consumption.
Partition data, CPU information, and memory information provide a great deal of information about the overall system and how it is running.
The manual Resource Management Facility Report Analysis (http://publibz.boulder.ibm.com/epubs/pdf/erbzra23.pdf) provides an excellent resource for understanding the detail of all of the reports described here. You should review one set of reports for a time period of low WebSphere workload volume and one set of reports for a time period of high WebSphere workload volume. Collect the RMF data beginning just after the heavy workload starts and ending just before the heavy workload ends (avoid ramp up or initialization periods unless those are of specific interest). Like many other z/OS components, RMF creates SMF records within a preassigned range. The SMF records of most interest to WebSphere applications are
RMF records: record types 70-79
System Logger: record type 88
USS/HFS: record type 92
HTTP: record type 103
DB2: record types 100-102
CICS: record type 110
MQ: record types 115-116
TCP: record types 118-119
WebSphere: record type 120
23.6.3.5 RMF Reports
There are many different RMF reports. We are going to focus on those that report most directly on WebSphere related information. Most examples are excerpts as the reports tend to be wider than can be nicely viewed in this format.
Summary Report
Begin by checking here since the summary report gives an overview of how the overall z/OS system is performing. More detailed and specific reports need only be reviewed if there is an indication of trouble in this report. Listing 23.4 is an excerpt of a typical RMF Summary Report.
Listing 23.4 RMF Summary Report.
1          R M F  S U M M A R Y  R E P O R T                                 PAGE 001
    z/OS V1R3        SYSTEM ID XXX1         START 06/16/2003-09.29.00  INTERVAL 00.14.59
                     RPT VERSION V1R2 RMF   END   06/16/2003-10.29.00  CYCLE 1.000 SECONDS
0   NUMBER OF INTERVALS 4     TOTAL LENGTH OF INTERVALS 00.59.58
-   DATE    TIME      INT    CPU   DASD  DASD    OMVS  OMVS  SWAP  DEMAND
    MM/DD   HH.MM.SS  MM.SS  BUSY  RESP  RATE    MAX   AVE   RATE  PAGING
0   06/16   09.29.00  15.00  47.4   4.6  1299      1     1   0.00    0.11
    06/16   09.44.00  14.59  44.6   4.1  1079      1     1   0.00    0.00
    06/16   09.59.00  14.59  41.6   4.8  976.4     1     1   0.00    0.00
    06/16   10.14.00  15.00  38.6   5.1  668.3     2     1   0.00    0.09
-   TOTAL/AVERAGE            43.0   4.6  1006      2     1   0.00    0.05
This displays the obvious Date, Time, z/OS release, and Interval information. The important information is listed for each interval. The specific items to look for in this report are explained in Table 23.3.
Table 23.3 Key Data from RMF Summary Report
Term | Definition
Interval | Length of the interval for RMF reporting. This should be reduced during performance testing to produce a more granular report.
CPU Busy | CPU utilization during the interval. Since at any instant an engine is either fully utilized by this LPAR or not utilized by this LPAR, this is actually the percentage of time that the engines were busy on behalf of this LPAR. The number is relative to the processors allocated to this LPAR and is affected by the LPAR's weight.
Swap Rate | Number of swaps per second during this interval. If this is < 2, then the Paging Report need not be reviewed.
Demand Paging | Page fault rate. Ideally, this will be 0. If it rises above 1 or 2, then adding memory can have a very positive effect. The old song says, "There's no paging like no paging."
DASD Resp | Average milliseconds required to complete an I/O request. This should be < 10 milliseconds. If not, check the DASD reports to see whether a volume has caching disabled or whether the spread of data is an issue. One exception is the volume containing the JES2 checkpoint data set, which tends to have higher utilization and longer response times than other DASD volumes.
DASD Rate | Tied to DASD Resp. This is DASD activity per second across all devices. The higher this number, the more important the DASD response time.
Partition Data Report
This report shows a great deal of information about the entire system. It puts CPU utilization in context with LPAR weights, and it shows the overall utilization of all LPARs on the machine. Comparing each LPAR's utilization with its associated weight helps determine which LPARs are consuming the most resources. Listing 23.5 is an excerpt from an example:
Listing 23.5 RMF Partition Data Report.
1                     P A R T I T I O N  D A T A  R E P O R T                  PAGE 2
    z/OS V1R3        SYSTEM ID MVS1         DATE 06/16/2003   INTERVAL 15.00.715
                     RPT VERSION V1R2 RMF   TIME 09.29.00     CYCLE 1.000 SECONDS
-   MVS PARTITION NAME              MVS1    IMAGE CAPACITY                392
    NUMBER OF CONFIGURED PARTITIONS    4
    NUMBER OF PHYSICAL PROCESSORS     10    CP 10   ICF 0
    WAIT COMPLETION                   NO
    DISPATCH INTERVAL            DYNAMIC
-   ------- PARTITION DATA -------        AVERAGE PROCESSOR UTILIZATION PERCENTAGES
0             -CAPPING  PROCESSOR-   LOGICAL PROCESSORS    -- PHYSICAL PROCESSORS --
    NAME   S WGT DEF WLM%  NUM TYPE  EFFECTIVE TOTAL  LPAR MGMT  EFFECTIVE  TOTAL
    LPAR1  A  40 NO   0.0    6  CP     71.36   71.55    0.11       42.82    42.93
    LPAR2  A   3 NO   0.0    2  CP      0.00    0.00    0.00        0.00     0.00
    LPAR3  A  30 NO   0.0    5  CP     65.16   65.23    0.03       32.58    32.61
    MVS1   A  27 NO   0.0    5  CP     47.05   47.37    0.16       23.53    23.68
    *PHYSICAL*                                          0.24        0.24
                                                       -----      -----    -----
    TOTAL                                               0.55      98.93    99.47
Key Data from this Report (Listing 23.5) is highlighted in Table 23.4.
Table 23.4 Key Data from RMF Partition Data Report
Term | Definition
System ID | z/OS image on which this report was run.
Number of Physical Processors | CPUs enabled on the system.
CP | Of those processors, how many are enabled as general-purpose processors.
ICF | Engines dedicated as Integrated Coupling Facility CPUs; IFLs (Integrated Facility for Linux) engines also fall under this category.
The remaining rows repeat for each LPAR:
S | Status: A = Activated; D = Deactivated.
WGT | Weight of the LPAR. This is critical, as it determines how much resource the z/OS image in this LPAR actually gets. The number should be divided by the total weight (for all nondedicated LPARs) to determine the relative weight. If the weight is DED, then the engines are dedicated to this LPAR (this is rare). Note: LPAR weights are relative; to interpret the weight of one LPAR, compare it to the weights of the other LPARs on this processor.
Capping | This should always be NO, or one of the benefits of z/OS (sharing resources) is compromised. Capping is sometimes used to protect a production LPAR, but it can leave some LPARs CPU-starved while cycles go unused. You can protect an LPAR simply by giving it adequate weight.
Processor Num | Number of logical processors assigned to the LPAR. This affects CPU utilization reporting and potentially allows for greater concurrency. The total of Processor Num across LPARs should be <= 3 * CP (from the heading); that is, 3:1 is the maximum useful ratio of logical to physical processors.
Avg Util Logical Processors Total | This should equal the LPAR busy percentage from the Summary Report. It includes LPAR overhead and is the percentage of time that the processors allocated to this LPAR were doing work for this LPAR. It is affected by many of the other columns as well.
Avg Util Physical Processors Total | The LPAR's utilization spread across all engines in the machine. Compare this to the weight of the LPAR: if it is at or above the weight, then the LPAR was CPU-constrained during this interval; if performance is subpar and this is below the weight, then there is some other bottleneck.
TIP
The highlighted sections in the report show that LPAR 1 is weighted at 40 (which in this case means 40 percent based on a total weight of 100). Note that it is actually consuming 42.93 percent of the entire z/OS machine during this interval. z/OS can balance this but, if this is a regular occurrence, then you likely have a CPU shortage here. If this is largely caused by WebSphere (which we can determine by the Workload report discussed later in this section), then either the workload volume is growing, or the application is not performing as well as when it was sized.
CPU Activity
As shown in Listing 23.6, this report provides CPU model information, which factors into the available MIPS. It also shows the CPU utilization information for this LPAR.
Listing 23.6 RMF CPU Activity report.
1                         C P U  A C T I V I T Y                              PAGE 1
    z/OS V1R3        SYSTEM ID XXX1         DATE 06/16/2003   INTERVAL 15.00.715
                     RPT VERSION V1R2 RMF   TIME 09.29.00     CYCLE 1.000 SECONDS
-CPU 2064  MODEL 210
0CPU     ONLINE TIME  LPAR BUSY  MVS BUSY   CPU SERIAL  I/O TOTAL       %I/O INTERRUPTS
 NUMBER  PERCENTAGE   TIME PERC  TIME PERC  NUMBER      INTERRUPT RATE  HANDLED VIA TPI
  0      100.00       53.20      67.44      049823      694.2           3.08
  1      100.00       50.54      64.23      149823      645.9           2.83
  2      100.00       46.69      60.27      249823      560.3           2.57
  3      100.00       43.77      57.01      349823      564.2           3.26
  4      100.00       42.65      55.76      449823      651.2           5.01
 TOTAL/AVERAGE        47.37      60.94                  3116            3.37

 SYSTEM ADDRESS SPACE ANALYSIS    SAMPLES = 900
-             NUMBER OF ASIDS     DISTRIBUTION OF QUEUE LENGTHS (%)
Key Data from this Report (Listing 23.6) is highlighted in Table 23.5.
Table 23.5 Key Data from RMF CPU Activity Report
Term | Definition
z/OS V1R3 | OS specification.
System ID XXX1 | Name of the LPAR that collected the RMF records.
CPU 2064 Model 210 | Processor specification; helps determine the overall capacity of the system, used in conjunction with the Partition Data Report.
LPAR Busy Time Perc | Percentage of time this CPU was dispatched with work from this LPAR during the interval. In LPAR mode (the norm), this number is more important than MVS busy time.
MVS Busy Time Perc | Percentage of online time this CPU was busy overall during the interval.
IN READY queue | Indication of queuing for CPU; interpret it relative to the number of processors. More than 90% of samples should show a queue length less than 3 * the number of processors (as is the case here).
TYPE       MIN   MAX   AVG       0    1    2    3    4    5    6  7-8 9-10 11-12
---      ----- ----- --------  ---- ---- ---- ---- ---- ---- ---- ---- ---- ----
IN READY     1    29    5.3     0.0  8.4 13.6 17.1 15.2 10.4  7.7  9.0  6.5  5.0
Workload Activity
All of the preceding reports focused on the details and health of the overall system. If the recommendations from the WLM section were followed, and definitions were made to measure the WebSphere workload at a granular level, then the Workload Activity reports on service and reporting classes can provide a great deal of information about the actual work flowing through WebSphere. The following excerpts from Workload Activity reports were taken from actual runs of a WebSphere application. We show the Controller service class (MC5CTLR), the Servant service class (MC5SVNT), and the enclaves (MCCLUST1).
Listing 23.7 is for the enclave as it contains some information not found (or not valuable) in the other Workload Activity Reports.
Listing 23.7 Workload Activity report, enclave service class.
REPORT BY: POLICY=WEB_RAL  WORKLOAD=CB  SERVICE CLASS=MCCLUST1  CRITICAL=NONE

 TRANSACTIONS    TRANS.-TIME HHH.MM.SS.TTT   --SERVICE--    SERVICE RATES--
 AVG      6.00   ACTUAL               65     IOC        0   ABSRPTN    8950
 MPL      6.00   EXECUTION            40     CPU   13041K   TRX SERV   8950
 ENDED   36198   QUEUED               25     MSO        0   TCB        93.4
 END/S  148.99   R/S AFFINITY          0     SRB        0   SRB         0.0
 #SWAPS      0   INELIGIBLE            0     TOT   13041K   RCT         0.0
 EXCTD       0   CONVERSION            0     /SEC   53674   IIT         0.0
 AVG ENC  6.00   STD DEV             197                    HST         0.0
 REM ENC  0.00                                              APPL %     38.4
 MS ENC   0.00

 -RESPONSE TIME---         EX    PERF  AVG    --USING%--   ---- EXECUTION DELAYS %
 HH.MM.SS.TTT              VEL   INDX  ADRSP   CPU   I/O   TOTAL  QMPL  CPU
 GOAL    00.00.02.000 90.0%
 ACTUALS 3090 99.0% 0.0%   0.5   9.9   0.0     0.0  52.1   45.1   7.0

 --------RESPONSE TIME DISTRIBUTION--------
 ----TIME----      --NUMBER OF TRANSACTIONS--    ------PERCENT------
 HH.MM.SS.TTT      CUM TOTAL      IN BUCKET      CUM TOTAL  IN BUCKET
 <  00.00.01.000       35828          35828         99.0       99.0
 <= 00.00.01.200       35840             12         99.0        0.0
 <= 00.00.01.400       35842              2         99.0        0.0
 <= 00.00.01.600       35842              0         99.0        0.0
 <= 00.00.01.800       35842              0         99.0        0.0
 <= 00.00.02.000       35842              0         99.0        0.0
 <= 00.00.02.200       36196            354         100         1.0
 <= 00.00.02.400       36197              1         100         0.0
 <= 00.00.02.600       36198              1         100         0.0
 <= 00.00.02.800       36198              0         100         0.0
 <= 00.00.03.000       36198              0         100         0.0
 <= 00.00.04.000       36198              0         100         0.0
 <= 00.00.08.000       36198              0         100         0.0
 >  00.00.08.000       36198              0         100         0.0
Key Data from this Report (Listing 23.7) is highlighted in Table 23.6.
Table 23.6 Key Data from RMF Workload Activity Report
Term | Definition
Workload CB | Workload classification from WLM. All WebSphere work is in the CB subsystem.
Service Class MCCLUST1 | WLM service class in which this workload runs. This is the class for the enclave work.
Transactions AVG 6.00 | Number of transactions being processed concurrently (on average).
Transactions END/S 148.99 | Number of transactions in this class ended per second, on average, for this interval.
Trans-Time ACTUAL 65 | Average response time during this interval (in milliseconds).
Service Rates APPL % 38.4 | Average percentage of a single CPU used by running transactions. Note that on a multiprocessor this number can exceed 100%, as the workload may consume cycles on any processor in the LPAR.
Goal 2 seconds 90% | WLM response time goal for this service class.
Response Time Distribution | Percentage distribution of response times (the full report includes a histogram). Breakouts are based on the goal.
You can get a rough feel for the CPU time per transaction within the enclave if you know the number of MIPS per engine of your processor (which you could look up based on the 2064-210 model shown in the CPU Activity report). The formula is: APPL % * MIPSPerEngine / (END/S * 100). In this example, that is 38.4 * 230.75 / (148.99 * 100) = .59 MIPS per transaction. Note that this does not include CPU in the Controller or Servant (which is usually light) or the time spent in CTG or CICS (DB2 time is reported in the enclave, but CTG and CICS are not). This example is for a simple transaction driving an EJB; real work with DB2, XML, and so on, will likely cost notably more.
GOTCHA
It is not unusual for WebSphere Servers to do re-directions to other URLs. If this URL is handled by this same server and transaction class, then the END/S number may be double what the client perceives. Also, some Web pages may involve multiple individual requests, and again the perception of the client may be different than the transactions as RMF reports them.
Listing 23.8 shows a similar report only for the Controller.
Listing 23.8 Controller Address Space Service Class report.
REPORT BY: POLICY=WEB_RAL  WORKLOAD=CB  SERVICE CLASS=MC5CTLR  CRITICAL=NONE
           DESCRIPTION=MikeC WASV5 Controller ASIDs

 TRANSACTIONS    TRANS.-TIME HHH.MM.SS.TTT   --SERVICE--    SERVICE RATES--
 AVG      3.00   ACTUAL                0     IOC     4875   ABSRPTN    3885
 MPL      3.00   EXECUTION             0     CPU    15030   TRX SERV   3885
 ENDED       0   QUEUED                0     MSO    2812K   TCB         0.1
 END/S    0.00   R/S AFFINITY          0     SRB       16   SRB         0.0
 #SWAPS      0   INELIGIBLE            0     TOT    2832K   RCT         0.0
 EXCTD       0   CONVERSION            0     /SEC   11655   IIT         0.0
 AVG ENC  0.00   STD DEV               0                    HST         0.0
 REM ENC  0.00                                              APPL %      0.0
 MS ENC   0.00
As is normal here, the Controller took a very small amount of CPU, and there is no workload distribution since there are no measurable transactions in this address space. The Servant address space is similar (and, thus, not shown). If there is a significant amount of CPU consumed in the Controller or Servant, it is worth further research to determine why. Typical Servant problems could be high garbage collection processing, user threads doing application work in the region, or monitoring tools that do much of their work in the Servant and not in the enclaves.
The CTG and CICS service classes are also similar, but they can consume much larger amounts of CPU. Unless the workload is highly isolated relative to other CICS workload, it may be difficult to isolate the amount of CTG and CICS work that this application does relative to the overall CICS workload. For this purpose, these service classes were not included in the reporting class. It is always worth reviewing these service classes, however, to get an idea for resources being consumed in the various elements on the system.
23.6.4 DB2
DB2 is likely the most common back end or data store employed by WebSphere on z/OS. This section focuses on the standard Type 2 Java Database Connectivity (JDBC) driver and on monitoring performance from a JDBC and DB2 perspective. Tuning DB2 itself is beyond the scope of this chapter.
23.6.4.1 JDBC Tracing and Reduction
JDBC is the way that WebSphere, or any Java application, connects to DB2. Most current JDBC drivers implement the JDBC 2.0 specification; soon most will move to the JDBC 3.0 specification. There are four types of JDBC drivers, based on how they communicate between the database client and the database server. The types are
Type 1: JDBC/ODBC bridge. Java interfaces to an ODBC layer in order to get to the database server.
Type 2: Full Java provider that accesses a local database.
Type 3: Full Java provider that interfaces (usually over a network) to a specialized daemon or code on the database server machine which then communicates with the database server.
Type 4: Full Java provider that communicates directly to the database server.
Note that Types 2 and 4 are the only commonly used types. A Type 2 driver can connect to a local database client, which then uses database mechanisms (e.g., DRDA) to access a remote database server.
In addition to the JDBC specification level and driver type, with DB2 there is now a migration on z/OS from the legacy Type 2 driver to the universal driver, which can operate as either a Type 2 or a Type 4 driver. With the universal driver, the JDBC trace can be set through the WebSphere Administrative Console. Since most setups predate DB2 Version 8 and WebSphere 5.02 (W502002) and therefore use the legacy driver, this discussion of JDBC tracing focuses on the legacy z/OS driver, which is Type 2.
JDBC Tracing can impact performance by more than 50 percent, and it is nondynamic (i.e., turning it on and off requires a recycle of the server), so it should only be enabled for verification and/or for problem determination. JDBC Trace information is very valuable in understanding what is flowing between the application and DB2. To enable JDBC trace, modify the db2sqljJDBC.properties file (specified in was.env in your WebSphere configuration directory for this server) and set the lines as shown in Listing 23.9.
Listing 23.9 db2sqljJDBC.properties directives to enable a trace.
DB2SQLJ_TRACE_FILENAME=</yourdir/yourhfsfile>
DB2SQLJ_TRACE_BUFFERSIZE=1024
Normally, the DB2SQLJ_TRACE_FILENAME directive should be commented out or removed as this turns off the trace.
This trace will create two files: /yourdir/yourhfsfile and /yourdir/yourhfsfile.jtrace. The jtrace (Java trace) file is very large and very useful. It can be used to truly understand the flow of requests and replies between the application and DB2. A typical flow, for example, of preparing a statement would appear as shown in Listing 23.10.
Listing 23.10 JDBC Jtrace excerpt from a prepared statement.
<2003.06.19 23:53:27.692> <Entry> <prepareStatement>
  <COM.ibm.db2os390.sqlj.JDBC.DB2SQLJConnection@720d0865> <WebSphere t=009c1288>
  -- <p#1=SQLText=SELECT distinct CORP1_STATE_CD, STATE_CODE FROM
     DFBP50P.tst_bus_region_typ WHERE EFF_DT <= CURRENT DATE AND
     (EXP_DT > CURRENT DATE OR EXP_DT IS NULL)>
<2003.06.19 23:53:27.696> <Entry> <Initialize>
  <COM.ibm.db2os390.sqlj.JDBC.DB2SQLJJDBCCursor@218ec865> <WebSphere t=009c1288>
  -- <p#1=this=COM.ibm.db2os390.sqlj.JDBC.DB2SQLJJDBCCursor@218ec865[pSTMT=0]>
  -- <p#2=conn=COM.ibm.db2os390.sqlj.JDBC.DB2SQLJConnection@720d0865[pCONN=2938eb00]>
  -- <p#3=Type=3>
  -- <p#4=role=-101>
  -- <p#5=nativeSQLText=SELECT distinct CORP1_STATE_CD, STATE_CODE FROM
     DFBP50P.tst_bus_region_typ WHERE EFF_DT <= CURRENT DATE AND
     (EXP_DT > CURRENT DATE OR EXP_DT IS NULL)>
  -- <p#6=DB2StmtType=0>
  -- <p#7=numParameters=0>
  -- <p#8=metaData=0>
  -- <p#9=ResultSetCount=0>
  -- <p#10=JDBCProfile=COM.ibm.db2os390.sqlj.JDBC.DB2SQLJJDBCProfile@500cc865>
  -- <p#11=Section=350>
  -- <p#12=cursorName=DB2OS390HOLD100>
<2003.06.19 23:53:27.698> <Exit> <Initialize>
  <COM.ibm.db2os390.sqlj.JDBC.DB2SQLJJDBCCursor@218ec865> <WebSphere t=009c1288>
<2003.06.19 23:53:27.698> <Entry> <prepare>
  <COM.ibm.db2os390.sqlj.JDBC.DB2SQLJJDBCCursor@218ec865> <WebSphere t=009c1288>
  -- <p#1=COM.ibm.db2os390.sqlj.JDBC.DB2SQLJJDBCCursor@218ec865[pSTMT=548743dc]>
  -- <p#1=DB2SQLJConnection=COM.ibm.db2os390.sqlj.JDBC.DB2SQLJConnection@720d0865[pCONN=2938eb00]>
  -- <p#2=TransactionState=1>
  -- <p#3=COM.ibm.db2os390.sqlj.JDBC.DB2SQLJConnection@720d0865[pCONN=2938eb00]>
<2003.06.19 23:53:27.756> <Exit> <prepare>
  <COM.ibm.db2os390.sqlj.JDBC.DB2SQLJJDBCCursor@218ec865> <WebSphere t=009c1288>
JTrace contains a wealth of information if you take a bit of time to understand the format and learn to extract what you need. The non-JTrace file contains information that is likely more valuable to a DB2 expert than to an application developer or WebSphere administrator. It is a binary file that can be formatted to be viewable with the commands shown in Listing 23.11.
Listing 23.11 JDBC formatted and flowed traces.
db2sqljtrace fmt fname > fname.fmt   (to get the formatted version)
db2sqljtrace flw fname > fname.flw   (to get the flowed version)
Several configuration and application problems can be diagnosed with the jtrace files, including:
Connections not being closed and returned to the pool.
More SQL calls being executed than expected.
Incorrect SQL being generated by an application.
Commit/rollback issues (conn.setAutoCommit(true) issue).
Improper usage of prepared statements (preparing statements that do not get executed). Prepared statements are good to use as they can result in more hits on the dynamic cache.
All of these can manifest themselves as performance problems.
23.6.5 DB2 Tracing
DB2 Tracing on z/OS is completely dynamic and is excellent for understanding the behavior of your application, relative to DB2, at a high level. In addition to being dynamic, most DB2 systems are already running the type of tracing that can give an excellent high level view of the DB2 subsystem.
GOTCHA
If you use the legacy JDBC provider for z/OS and you have a local DB2 that passes the requests through to a remote DB2, then your analysis should be of the remote DB2. Specifically, if the local DB2 is trusted on the remote DB2, the signons on the remote DB2 should be low compared with the number of requests.
A common trace string is shown in Listing 23.12.
Listing 23.12 Standard DB2 trace options.
-<Db2CmdPref> start trace(stat) class(1 3 4 5 6) dest(smf)
-<Db2CmdPref> start trace(acctg) class(1 2 3) dest(smf)
These traces are generally already running on most DB2 systems. To report on this data, do an SMF extract of the type 100-102 records for the time period in question. Then run a DB2/PM post-processing job against the data, specifying:
Listing 23.13 DB2/PM control cards to review Accounting and Statistics Data.
DB2PM ACCOUNTING(REDUCE, REPORT(LEVEL(SUMMARY,DETAIL), ORDER(PLAN)))
DB2PM STATISTICS(REDUCE, TRACE, REPORT(LEVEL(SUMMARY,DETAIL)))
DB2PM EXEC
This will generate several reports.
23.6.5.1 System Parameter Report
The first item we review is the DB2PRMDD output. This provides all of the information from the ZParms. ZParms are DB2 system parameter settings that your DB2 System Programmers can modify if needed (but it does require a DB2 recycle). The items to focus on are
MAX NO OF USERS CONCURRENTLY RUNNING IN DB2 (CTHREAD). This should be at least 300 depending on the expected volume and concurrency.
MAXIMUM KEPT DYNAMIC STATEMENTS (MAXKEEPD). This should be at least 500, as it controls the number of prepared statements kept in the cache (which, therefore, do not need to be re-prepared).
CACHE DYNAMIC SQL (CACHEDYN). This turns on dynamic statement caching, which is critical to performance in a dynamic SQL environment (i.e., JDBC).
MAX NO OF BATCH CONNECTIONS (IDBACK). This may need to be increased with high concurrency WebSphere applications (we use at least 250).
CHECKPOINT FREQUENCY (CHKFREQ). This should be increased if possible to reduce DB2 checkpoint overhead (we set this to 50,000).
23.6.5.2 Statistics Reports
You should then do a quick review of the statistics report, which provides a view of overall DB2 subsystem health. The items to review are
Volume of requests in SQL DML and SQL DDL sections (focus on prepares and describes).
IDENTIFY and SIGNON requests under SUBSYSTEM SERVICES. This will identify the amount of time DB2 spends in authenticating connections. This will be impacted by the RunAS settings on EJBs and security settings from servlets.
Various locking information under LOCKING ACTIVITY.
Prepare requests in DYNAMIC SQL STMT, with a major focus on FULL vs. SHORT prepares (you want almost all SHORT), as summarized by GLOBAL CACHE HIT RATIO (%).
23.6.5.3 Accounting Reports
The accounting report is similar to the statistics report, except that here you can focus on a specific DB2 plan. If using DB2 local to WebSphere on z/OS, the plan should be DSNJDBC. The report is also split by Primary Authid on the thread, which can make for a long report requiring a great deal of filtering effort. In some odd scenarios with distributed usage, the work is listed under ?RRSAF (work not actually occurring under a plan). For the appropriate plan, you will want to review the same type of data as in the Statistics Report, plus the following:
Listing 23.14 DB2/PM accounting report information.
PRIMAUTH: SRNAME1   PLANNAME: DSNJDBC

ELAPSED TIME DISTRIBUTION                 CLASS 2 TIME DISTRIBUTION
-------------------------                 -------------------------
APPL |=========================> 92%      CPU    |==> 4%
DB2  |===> 7%                             NOTACC |===================> 88%
SUSP |> 1%                                SUSP   |====> 8%

AVERAGE       APPL(CL.1)  DB2 (CL.2)  IFI (CL.5)   CLASS 3 SUSPENSIONS    AVERAGE TIME  AV.EVENT
------------  ----------  ----------  ----------   --------------------   ------------  --------
ELAPSED TIME    1.212208    0.093436     N/P       LOCK/LATCH(DB2+IRLM)       0.000202      0.36
 NONNESTED      1.212208    0.093436     N/A       SYNCHRON. I/O              0.000000      0.00
 STORED PROC    0.000000    0.000000     N/A        DATABASE I/O              0.000000      0.00
 UDF            0.000000    0.000000     N/A        LOG WRITE I/O             0.000000      0.00
 TRIGGER        0.000000    0.000000     N/A       OTHER READ I/O             0.000000      0.00

(continuation lines within the report)

HIGHLIGHTS
----------------------
#OCCURRENCES    :  74873
#ALLIEDS        :      0
#ALLIEDS DISTRIB:  74873
#DBATS          :      0
#DBATS DISTRIB. :      0
Look at the average ELAPSED TIME for Class 1 and Class 2 (Class 1 is the total time from client connect to disconnect; Class 2 is the time spent actually doing DB2 work). If Class 2 is a small percentage of Class 1, then DB2 is not likely a big factor.
Review Class 3 (suspensions) to see if anything in particular stands out as having long delays.
A review of the DSNJDBC2 vs DSNJDBC3 packages can highlight the isolation level of the work being done. DSNJDBC2 work is more efficient than DSNJDBC3 work (lower isolation level). It is rare to see significant work under the DSNJDBC1 or DSNJDBC4 package.