Solaris 9 System Monitoring and Tuning
As a system administrator, it's your job to ensure that the systems you manage are operating at optimum performance levels. Because computer systems are frequently crucial to a company's profitability, corporate management cares that every system performs as intended. One of the most important tasks you will perform, from a job security point of view, is providing regular reports on the health of your systems to your superiors. It's excellent job security to show that you are maintaining the systems properly and that everything is operating efficiently. After all, that's what your job boils down to.
Performance bottlenecks can do more than degrade response time; they can bring systems down completely, and the costs can be staggering. For example, I once worked at a food distributor that estimated that a down system cost them $1 million per hour. The $5 billion company would be out of business if downtime went beyond 2 days, not to mention what an outage would do to its stock price. Managed performance is easily equated with money saved.
Tracking down a performance problem on a Sun system takes a great deal of detective work. You need to see the problem before users notice it. It's unacceptable to find out about a performance problem from a user or, worse yet, from your manager. Begin tracking performance problems by reviewing all of the clues you can gather from the various monitoring utilities available in Solaris.
Rather than have you purchase a third-party performance-monitoring package for several thousand dollars, I'm going to show you how to use the tools that come standard in the Solaris 9 operating environment.
I've always considered UNIX performance monitoring something of a "black art." There are lots of smoke and mirrors to look behind, and nothing is ever black and white. As you track performance, you gather system activity data on all aspects of the hardware and operating system, and you must determine when the data you've compiled points to a performance problem. Tracing a bottleneck often involves many days of gathering data and many more hours of analyzing it. What initially appears to be the cause of the bottleneck might be a symptom of another problem that is not quite so obvious.
In this chapter, I'll introduce various methods of gathering system activity data from a Solaris system. Unfortunately, interpreting this data comes more from experience than from anything I can teach you in this book. There are many variables involved. For example, a CPU might show 90% utilization, but this doesn't necessarily mean that the CPU is overloaded and is contributing to the slow performance of a system. Actually, 90% CPU utilization could be the sign of a very healthy, well-balanced system. On another system, memory utilization could be at 100%, and the data could suggest that memory is the cause of the performance problem when it's actually an inefficient database query that's causing the system to work extra hard.
Being Proactive
I can't overemphasize the importance of being proactive in monitoring your system's performance from the first day it is installed. The data that is gathered early on, before the system gets loaded down with users, will allow you to set the mark for how the system should be performing. These are the "best-case" figures that you can measure your system's future performance against. Later, as you gather system activity data on a loaded system, you'll have a baseline to compare it against.
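To make the idea of a baseline concrete, here is a minimal sketch in Python of comparing current readings against "best-case" figures. Everything in it is an assumption made for illustration: the metric names (cpu_busy_pct, disk_service_ms), the sample values, and the 25% tolerance are hypothetical and are not output from any Solaris utility.

```python
# Minimal sketch: flag metrics whose current average has drifted
# well above the baseline average. All names and numbers below are
# hypothetical examples, not real measurements.

def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)

def compare_to_baseline(baseline, current, tolerance=0.25):
    """Return {metric: fractional increase} for every metric whose
    current mean exceeds its baseline mean by more than `tolerance`."""
    flagged = {}
    for name, base_samples in baseline.items():
        if name not in current:
            continue  # nothing to compare for this metric
        base = mean(base_samples)
        now = mean(current[name])
        if base == 0:
            continue  # a ratio against zero is not meaningful
        change = (now - base) / base
        if change > tolerance:
            flagged[name] = round(change, 2)
    return flagged

# Samples taken on the lightly loaded system (the baseline) ...
baseline = {"cpu_busy_pct": [10, 12, 11], "disk_service_ms": [8, 9, 7]}
# ... and samples taken later, under load.
current = {"cpu_busy_pct": [12, 13, 11], "disk_service_ms": [20, 22, 18]}

print(compare_to_baseline(baseline, current))
```

A flagged metric is only a clue. As noted earlier, a high figure does not by itself prove a bottleneck; the comparison simply tells you where to look first.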
Every time you make a change, carefully watch how the change impacts your system's performance. If you wait until users are complaining and performance is severely degraded, you'll have a great deal of backtracking to do before finding the bottleneck that caused this issue to arise. Was it a change made to the kernel? Was a patch recently installed? Was it a sudden increase in user activity? Is there a hardware malfunction?
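One way to avoid that backtracking is to keep a dated log of every change and consult it when performance drops. The sketch below, in Python, assumes such a log exists; the entries, dates, and the seven-day window are hypothetical and chosen only to illustrate the lookup.

```python
from datetime import datetime, timedelta

def changes_before(change_log, degraded_at, window_days=7):
    """Return the changes made within `window_days` before the time
    degradation was first noticed, most recent first."""
    start = degraded_at - timedelta(days=window_days)
    recent = [(when, what) for when, what in change_log
              if start <= when <= degraded_at]
    return sorted(recent, reverse=True)

# Hypothetical change log: (timestamp, description) pairs.
change_log = [
    (datetime(2003, 3, 1, 9, 0), "kernel parameter changed"),
    (datetime(2003, 3, 10, 22, 0), "patch installed"),
    (datetime(2003, 3, 12, 8, 0), "new user group added"),
]

# Hypothetical moment at which performance was seen to degrade.
degraded_at = datetime(2003, 3, 12, 14, 0)

for when, what in changes_before(change_log, degraded_at):
    print(when.isoformat(), what)
```

The most recent changes come first because they are usually the first suspects, though an older change can still be the real cause if its effect only shows under load.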