The Stack
FIGURE 3 and FIGURE 4 show the stack under consideration. FIGURE 3 illustrates the layering relationship between the various software components in the stack, and FIGURE 4 shows the basic cluster configuration. The stack consists of a two-node cluster providing the Oracle9i Real Application Clusters (RAC) data service.
FIGURE 3 Layering Relationship Between the Various Software Components in the Stack
FIGURE 4 Basic Stack Configuration
The stack has the following main components:
- Sun Fire™ 4800 servers as nodes
- Sun StorEdge™ T3 array storage
- FCAL switches for host-to-storage connectivity
- Gigabit Ethernet for the private network
- Solaris™ 8 Operating Environment (Solaris OE) Update 6
- Sun Cluster 3.0 Update 2 on each node
- Oracle9i Real Application Clusters (Oracle9i RAC) software on each node
- Veritas Volume Manager (VxVM) with the cluster feature (CVM)
Database and system stress is provided by an online transaction processing (OLTP) workload. This workload simulates an application in which each transaction consists of updates to fixed-size tables and inserts into other tables. It was chosen because its transactions are uniform, which keeps the white-box analysis of the relevant layers in the stack tractable.
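To make the shape of this workload concrete, the following is a minimal sketch of one such transaction in Python. The table names, column names, Oracle-style bind variables, and the conn database connection object are illustrative assumptions, not the actual benchmark schema.

```python
# Illustrative sketch of one OLTP transaction: an update against a fixed-size
# table followed by an insert into a growing history table. The schema and
# bind-variable style are hypothetical, not the real benchmark.
import random
import time

def run_transaction(conn):
    """Execute and commit one synthetic OLTP transaction."""
    cur = conn.cursor()
    account = random.randint(1, 100_000)   # row in a fixed-size table
    delta = random.randint(-500, 500)

    # Update a fixed-size table
    cur.execute(
        "UPDATE accounts SET balance = balance + :1 WHERE account_id = :2",
        (delta, account),
    )
    # Insert into another (growing) table
    cur.execute(
        "INSERT INTO history (account_id, delta, ts) VALUES (:1, :2, :3)",
        (account, delta, time.time()),
    )
    conn.commit()
```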
Preliminary Results
The black-box measurements for this project are ongoing, and TABLE 1 summarizes some of the preliminary results obtained for this stack. Note that these results are preliminary and, more importantly, should not be extrapolated to any other Sun Cluster stack.
TABLE 1 Preliminary Measurements
Event Type | Outage
Node Failure | 60-80 seconds
Interconnect Failure | 80 seconds
Cluster rgmd dying | 70 seconds
Oracle process dying | 30 seconds
Node Rejoin | 20 seconds
The five fault injection tests reported here are: (1) a node going down, (2) a split brain being induced by causing a failure of the private networks between the nodes, (3) one of the critical Sun Cluster framework daemons dying and causing that node to go down, (4) an Oracle process dying, and (5) a down node rejoining the cluster.
The first three fault injection scenarios correspond to one of the nodes going down, either immediately as a result of fault injection (as in the node failure case) or as a by-product of the fault that was injected (as in the split brain induced by interconnect failure, which then results in one of the nodes going down). A node going down results in the following outage subcomponents:
- Detection of the node death by the clustering software on the other node
- Reconfiguration of the Sun Cluster and VxVM software as well as the Oracle9i RAC layers
- Recovery of the VxVM and/or Oracle9i RAC layers
The outages reported in TABLE 1 are measured from the client's perspective and are defined as the time interval, in seconds, during which no transactions were committed to the database. The outage interval spans detection, reconfiguration, and part of the VxVM and Oracle recovery process; the remainder of the VxVM and Oracle recovery proceeds asynchronously after the outage interval.
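As a rough illustration of how such a client-side outage could be measured, the sketch below drives the hypothetical run_transaction() from the earlier sketch in a loop and records the longest gap between successful commits. This is an assumption about the measurement approach, not a description of the actual test harness.

```python
# Sketch of a client-side outage probe: the outage is taken to be the longest
# interval between successful commits observed during a fault-injection run.
# run_transaction() and conn are the hypothetical objects sketched earlier.
import time

def measure_longest_outage(conn, duration_s=600):
    """Drive transactions for duration_s seconds and return the longest gap,
    in seconds, between successful commits."""
    last_commit = time.time()
    longest_gap = 0.0
    deadline = time.time() + duration_s
    while time.time() < deadline:
        try:
            run_transaction(conn)
            now = time.time()
            longest_gap = max(longest_gap, now - last_commit)
            last_commit = now
        except Exception:
            # Transaction failed; back off briefly while the cluster detects
            # the fault, reconfigures, and recovers. (A real harness would
            # also re-establish the database connection here.)
            time.sleep(0.1)
    return longest_gap
```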
The fourth fault injection scenario consists of killing one of the Oracle processes, which leads to the Oracle instance on that node going down. This causes the Oracle instance on the other node to perform recovery, leading to an outage.
The last scenario follows from the first three fault injections: the node that went down reboots and rejoins the cluster. During the resulting reconfiguration, the various layers reintegrate with their counterparts in the existing cluster, which results in a short outage (TABLE 1).
TABLE 2 System Availability and Yearly Downtime
Parameter | PARAMSET 1 | PARAMSET 2
MTBF | 4000 hours | 3000 hours
MTTR_1 | 1 hour | 1 hour
MTTR_2 | 2 hours | 4 hours
Recovery_Time | 60 seconds | 60 seconds
Node_Rejoin_Time | 20 seconds | 20 seconds
p | 0.99 | 0.98
a | 0.2 | 0.2
System availability | 0.99998 | 0.99993
System yearly downtime | 11 minutes, 13.87 seconds | 36 minutes, 18.12 seconds
These preliminary results can be used for the outage-related parameters in the RAScad model developed in "RAScad Modeling." For the remaining parameters, data collection efforts are underway. In the meantime, to analyze the effect of these parameters on system availability, preliminary values obtained from a few Sun internal deployment environments were used.
Two different sets of values have been picked for each such parameter. With these values plugged into the tool, RAScad can compute several useful metrics, such as system availability and system downtime. The differential analysis that follows shows that both sets of data yield the same prioritized list of parameters. Thus, at least from the differential analysis perspective, the actual values of these parameters are not as critical, within the specified ranges.
Availability Calculations
TABLE 2 shows the RAScad results for the stack with the two parameter sets, PARAMSET 1 and PARAMSET 2. The parameters are listed in the first seven rows (MTBF through a). The two sets differ in the values used for MTBF, MTTR_2, and p; from an availability perspective, the values in the first set are better than those in the second. The parameter values are followed by the calculated system availability and system yearly downtime for each set.
For the first set, the system availability is calculated to be 0.99998 and the system yearly downtime to be 11 minutes, 13.87 seconds. With the second set, the system availability decreases to 0.99993 and the system yearly downtime increases to 36 minutes, 18.12 seconds. Note that these results are based on preliminary parameter values. Also, the RAScad model developed in "Proposed Methodology" models only outages due to a node going down, so these results apply only to that scenario.
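While the RAScad model itself is not reproduced here, the relationship between steady-state availability and yearly downtime in TABLE 2 is simple arithmetic. The sketch below assumes a 365-day year; because the availability figures in TABLE 2 are rounded to five decimal places, converting back and forth yields only approximations of the tabulated values.

```python
# Converting between steady-state availability and yearly downtime,
# assuming a 365-day year.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

def yearly_downtime_minutes(availability):
    """Yearly downtime implied by a steady-state availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR

def availability_from_downtime(minutes):
    """Steady-state availability implied by a yearly downtime in minutes."""
    return 1.0 - minutes / MINUTES_PER_YEAR

print(availability_from_downtime(11 + 13.87 / 60))  # ~0.999979, rounds to 0.99998
print(availability_from_downtime(36 + 18.12 / 60))  # ~0.999931, rounds to 0.99993
```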
Differential Analysis in RAScad
For both sets, each parameter was decreased by ten percent and a differential analysis was performed to yield the percentage change in downtime caused by that change. These results are reported in TABLE 3 for the two parameter sets. Negative values in TABLE 3 indicate that downtime decreases when the parameter is decreased; positive values indicate that it increases.
TABLE 3 Differential Analysis Results
Parameter | PARAMSET 1 (% change in downtime) | PARAMSET 2 (% change in downtime)
MTBF | 11.28 | 11.29
MTTR_1 | -0.13 | -0.14
MTTR_2 | -4.82 | -7.87
Recovery_Time | -3.90 | -1.61
Node_Rejoin_Time | -1.29 | -0.52
p | 462 | 377
a | -0.023 | -0.025
The largest change in downtime is contributed by the parameter p. Decreasing its value by 10 percent increases downtime by 462 percent and 377 percent for initial values of 0.99 and 0.98, respectively. Similarly, decreasing MTTR_2 by 10 percent decreases downtime by 4.82 percent and 7.87 percent for initial values of two hours and four hours, respectively. Comparing the results across the two data sets clearly demonstrates the same ordering of the factors' importance to system availability.
In decreasing order of importance, this ordering is:
- p
- MTBF
- Recovery_Time, MTTR_2
- Node_Rejoin_Time
- MTTR_1, a
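The differential analysis procedure itself is straightforward to sketch. The function below perturbs each parameter downward by 10 percent and reports the resulting percentage change in downtime; the downtime function passed in is a stand-in for the RAScad model, which is not reproduced here, so the parameter names and units in the usage comment are illustrative only.

```python
# Generic sketch of the differential analysis: cut each parameter by 10
# percent, recompute yearly downtime, and report the percentage change.
# downtime_fn stands in for the RAScad model, which is not reproduced here.
def differential_analysis(downtime_fn, params, delta=0.10):
    """Return {parameter: % change in downtime when it is reduced by delta}."""
    baseline = downtime_fn(**params)
    changes = {}
    for name, value in params.items():
        perturbed = dict(params, **{name: value * (1.0 - delta)})
        changes[name] = 100.0 * (downtime_fn(**perturbed) - baseline) / baseline
    return changes

# Hypothetical usage with PARAMSET 1 (downtime_model is a placeholder):
# changes = differential_analysis(downtime_model, {
#     "MTBF": 4000.0, "MTTR_1": 1.0, "MTTR_2": 2.0,       # hours
#     "Recovery_Time": 60.0, "Node_Rejoin_Time": 20.0,    # seconds
#     "p": 0.99, "a": 0.2,
# })
# sorted(changes.items(), key=lambda kv: abs(kv[1]), reverse=True)
```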
This analysis reveals the main areas of focus not only for testing and development purposes, but also for establishing a set of best practices that will lead to high availability environments. The next section discusses the basic philosophy behind these best practices, and then lists a set of these practices.
The RAScad model from "RAScad Modeling" applies directly to scalable services only. A similar model has been built for failover services, where only one node actively services the client. Although not discussed in this article, a differential analysis of the failover model has also been done; it yields the same list of the top four parameters to focus on as in the scalable services case.