- Installing the Oracle Solaris OS on a Cluster Node
- Securing Your Solaris Operating System
- Solaris Cluster Software Installation
- Time Synchronization
- Cluster Management
- Cluster Monitoring
- Service-Level Management and Telemetry
- Patching and Upgrading Your Cluster
- Backing Up Your Cluster
- Creating New Resource Types
- Tuning and Troubleshooting
Tuning and Troubleshooting
If you are running your applications on a Solaris Cluster system, then service availability is one of your main concerns. The two major causes of service migration are cluster-node failures and resource fault probes detecting that an application is unhealthy.
The Solaris Cluster framework is responsible for detecting and reacting quickly to node failures. When all the cluster private interconnects fail, the default heartbeat settings allow only 10 seconds to elapse before a cluster reconfiguration is triggered and any affected service is restarted on one of the remaining cluster nodes.
The application fault probe is responsible for detecting and reacting to an application that is unhealthy or has failed. Each fault probe can be tuned to take the appropriate action based on the number of failures within a given time interval. Similarly, the resource type Start and Stop methods for the application are responsible for ensuring that the application is started and stopped successfully. The Start, Stop, and Monitor_start (fault probe) methods each have a tunable timeout, which is the maximum duration allowed for the operation. The fault probe also has a probe interval, which determines how often the probe runs.
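As an illustration, the heartbeat parameters and the per-resource method timeouts can be inspected and adjusted from the command line on Solaris Cluster 3.2 and later. The property names below are standard, but the values shown and the resource name myapp-rs are placeholders only.

```
# Review the current private-interconnect heartbeat settings
# (heartbeat_quantum and heartbeat_timeout are in milliseconds;
# the defaults of 1000 and 10000 give the 10-second window above).
cluster show | grep -i heartbeat

# Tighten the heartbeat cluster-wide -- illustrative values only.
cluster set -p heartbeat_quantum=500 -p heartbeat_timeout=5000

# Adjust the method timeouts (in seconds) for a single resource.
clresource set -p Start_timeout=300 -p Stop_timeout=300 myapp-rs
```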
When tuning the probe interval and any associated timeout, you must trade off the impact of the probe on system load and application performance against how quickly problems are detected. If you set the values too low, you risk false failovers; if you set them too high, you risk waiting longer than necessary to detect a critical problem. Because every application and workload is different, the only realistic way to arrive at an optimal setting is through thorough testing: start with high values and gradually reduce them until lower values begin to cause misdiagnosed problems, then increase the values again to give the fault probe some margin for variation. Of course, all of this testing must be performed under realistic workload conditions.
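As a sketch of that process, assuming a resource named myapp-rs, the probe-related properties might be tuned as follows. Thorough_probe_interval, Retry_count, and Retry_interval are standard resource properties, Probe_timeout is an extension property of many data services, and the values shown are only starting points to be refined through testing.

```
# Begin with generous settings and reduce them step by step while
# testing under a realistic workload ("myapp-rs" is a placeholder).
clresource set -p Thorough_probe_interval=120 \
    -p Probe_timeout=60 \
    -p Retry_count=2 \
    -p Retry_interval=600 myapp-rs

# Confirm the values that are actually in effect.
clresource show -v myapp-rs
```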
For the start and stop timeouts, your task is slightly easier because the system logs (in /var/adm/messages) state how much of the available timeout was used by each method to complete the relevant action. Thus, you can tune your start and stop timeouts such that the application takes no more than, for example, 50 percent of the configured timeout under normal conditions.
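For example, a quick way to review those messages is to search the log for the timeout figures; the exact wording varies between releases, so the pattern below is indicative rather than definitive.

```
# Find how long each method actually took, reported as a
# percentage of the configured timeout.
grep -i "of timeout" /var/adm/messages
```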
If it is not obvious why a resource is not functioning correctly, then you can often obtain additional debugging information to help diagnose the problem. The /var/adm/messages file is your main source of help. All your resources will log messages here. You can increase the amount of debugging information logged by ensuring that your syslog.conf file directs daemon.debug output to a suitable file.
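For instance, an entry along the following lines captures daemon.debug output in a dedicated file; the file name is arbitrary, and the two fields must be separated by tabs.

```
# /etc/syslog.conf -- send daemon.debug output to its own file
daemon.debug	/var/adm/clusterdebug.log
```

Because syslogd only writes to a file that already exists, create the file and then restart syslogd so that the change takes effect.

```
touch /var/adm/clusterdebug.log
svcadm restart svc:/system/system-log:default
```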
Some resource types use flags to specify the amount of debugging output they produce. Others, written using the Data Service Development Library (see the section "Data Service Development Library"), rely on you to specify a log level from 1 to 9 (with 9 being the most verbose) in a file named /var/cluster/rgm/rt/resource-type-name/loglevel.
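Continuing that example, enabling the maximum DSDL log level for a resource type could look like this; the directory name SUNW.example stands in for the actual resource-type name.

```
# Request maximum DSDL debug output (level 9) for the resource type;
# "SUNW.example" is a placeholder for the real resource-type name.
echo 9 > /var/cluster/rgm/rt/SUNW.example/loglevel

# Drop back to the minimum level once the problem is understood.
echo 1 > /var/cluster/rgm/rt/SUNW.example/loglevel
```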