Determining Scope
Once you have determined that your application can be made highly available using Sun Cluster, consider the scope of the project you want to attempt. The scope can vary widely depending on the requirements of your particular situation, and it directly affects how long it will take to develop the agent and have the data service running successfully in the cluster.
The chief topics to consider when determining scope are:
- Cluster awareness
- Simple failover or scalable
- Fault recovery
- Complex monitoring
- Enterprise architecture considerations
Cluster Awareness
Most applications that are made highly available in a Sun Cluster framework have no internal knowledge of the cluster framework, or even of whether they are running in a cluster. For these applications, the agent consists of wrapper programs used by the cluster to control the starting and stopping of the application. The in-memory state of the application is unlikely to be communicated between cluster nodes or maintained across failovers to different nodes.
Applications of this sort could be termed cluster compatible: they can run in the cluster environment (having satisfied the qualification requirements discussed earlier in this chapter), but they do not change their internal behavior. They are often reasonably easy to integrate into the Sun Cluster framework.
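To make the idea of a wrapper program concrete, here is a minimal sketch in Python. It is not a Sun Cluster API: the AppWrapper class and its method names are hypothetical, and a real agent would run its start and stop methods as separate invocations and would have to record the process ID somewhere persistent. The sketch keeps the process handle in memory for brevity.

```python
import subprocess
import sys


class AppWrapper:
    """Hypothetical start/stop wrapper around an unmodified application."""

    def __init__(self, command):
        self.command = command   # argument list used to launch the application
        self.process = None      # handle to the running child, if any

    def is_running(self):
        # poll() returns None while the child process is still alive.
        return self.process is not None and self.process.poll() is None

    def start(self):
        # Launch the application only if it is not already running.
        if not self.is_running():
            self.process = subprocess.Popen(self.command)

    def stop(self, timeout=10.0):
        # Ask the application to exit, then force it if the timeout expires.
        if not self.is_running():
            return
        self.process.terminate()
        try:
            self.process.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            self.process.kill()
            self.process.wait()
```

The application itself is untouched: the wrapper only launches it, checks that it is alive, and stops it, which is why cluster-compatible applications tend to be the easier case.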
At the other end of the scale are cluster-aware applications. These applications communicate directly with the Sun Cluster framework and change their internal behavior based on the cluster state. Furthermore, these applications may run instances simultaneously on multiple nodes of the cluster and communicate among those instances to maintain in-memory state even in the event of a node failure. One example of this sort of application is the Oracle 9i Real Application Clusters (RAC) database server. Applications of this sort are usually quite complex and require correspondingly more elaborate handling to integrate them into the Sun Cluster environment.
For most projects, the application will be cluster compatible, but if you decide to make one of your own applications cluster aware, you should allow more time for the development effort. Cluster-aware applications are discussed in more detail in Chapter 13, "Developing Cluster-Aware Applications."
Failover or Scalable
The Sun Cluster 3 environment allows for two types of data service: failover and scalable. A failover service runs on one node in the cluster at any given time, and if that node crashes, the application will start on a different node. To clients of the failover service, the restart will appear as if the server rebooted very quickly, but there will still be some delay (however small) between when the service stops on the first node and when it restarts on the second node.
By contrast, a scalable service takes advantage of the global file service (GFS) and shared IP addresses to run an application on multiple nodes at the same time, providing extra processing power through horizontal scaling and load balancing. A scalable service can also provide much better availability than a normal failover service because, even if one node fails, the other nodes are still running the service and can often accept new connections immediately.
In general, if you can make your application into a scalable service, you can more fully exploit the capabilities of the Sun Cluster 3 system. However, failover services more easily fit a wider range of applications without resorting to application code changes.
Scalable services are covered in more detail in Chapter 11, "Writing Scalable Services."
Fault Recovery
When planning the scope of your data service project, you should consider the process of fault recovery, particularly when a cluster node fails.
If client applications do not retry connections to a failed service, then downtime can be extended beyond what is required by the cluster because end users will have to manually detect and recover from the failure. Client retry behavior should be identified during the initial phase of agent development, since it may affect the entire scope of the project.
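As an illustration of what client-side retry might look like, the following Python sketch retries a TCP connection with a doubling delay between attempts. The function name, the parameter defaults, and the injectable connect and sleep arguments are assumptions made for this example. They are not part of Sun Cluster or of any particular client library.

```python
import socket
import time


def connect_with_retry(host, port, attempts=5, initial_delay=0.5,
                       connect=socket.create_connection, sleep=time.sleep):
    """Try to connect several times, doubling the delay after each failure."""
    delay = initial_delay
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            # A failover looks like a brief outage, so a later attempt may succeed.
            return connect((host, port), timeout=5.0)
        except OSError as exc:
            last_error = exc
            if attempt < attempts:
                sleep(delay)
                delay *= 2
    raise ConnectionError(
        f"could not reach {host}:{port} after {attempts} attempts"
    ) from last_error
```

A client that calls something like connect_with_retry("logical-host", 8080) (a placeholder address) can ride out the short gap while the service restarts on another node, instead of surfacing the first failed connection to the end user.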
If an application to be made highly available is itself dependent on the availability of some other service (for example, a database), then this dependency must also be taken into account. In some cases, supporting programs must be developed to check for the availability of the service before the application is started. There may also be special requirements on the fault monitoring programs to take action (such as restarting the application) if the required service itself fails over.
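A supporting program of the kind described above could be as simple as the following Python sketch, which waits until a required service accepts TCP connections before the application is started. The function names, the use of a plain TCP connect as the availability test, and the timing defaults are all assumptions for illustration. A real check would often need to go further, for example by running a trivial query against the database.

```python
import socket
import time


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_dependency(check, max_wait=60.0, interval=2.0,
                        clock=time.monotonic, sleep=time.sleep):
    """Poll check() until it succeeds or max_wait seconds have elapsed."""
    deadline = clock() + max_wait
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

A start wrapper might call wait_for_dependency(lambda: port_is_open("db-host", 1521)) and start the application only if the call returns True. The host name and port here are placeholders.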
Complex Monitoring
Fault monitoring and the associated application management can be the most complex part of the entire agent design and development process. While there are usually only a few ways to start and stop an application, there are often countless ways to monitor its behavior and to take remedial action if required. Questions that you should ask when determining the scope of fault monitoring include:
- Should we only check for node failure?
- Should we check if the application has crashed?
- Should we check if the application has hung?
- Should we check if the application provides incorrect data?
- How frequently should we check?
- How many retries should we allow before taking action?
- How do we detect failures outside the cluster environment?
This is where previous knowledge of an application is extremely useful, if not vital. It is also the area where the scope of the project can balloon enormously.
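The questions about retries and remedial action can be made concrete with a small sketch. The Python class below counts consecutive probe failures, asks for a restart once a threshold is reached, and escalates to a failover when too many restarts occur within a time window. The class name, the parameter names, and the default values are hypothetical and chosen only for this example. The probe itself is whatever check you decide on (process alive, port responding, correct data returned), and the caller decides how often to run it.

```python
import time
from collections import deque


class FaultMonitor:
    """Hypothetical policy: retry, then restart, then fail over."""

    def __init__(self, probe, failures_before_restart=3, max_restarts=2,
                 restart_window=300.0, clock=time.monotonic):
        self.probe = probe                      # callable returning True if healthy
        self.failures_before_restart = failures_before_restart
        self.max_restarts = max_restarts        # restarts allowed per window
        self.restart_window = restart_window    # window length in seconds
        self.clock = clock
        self.consecutive_failures = 0
        self.restart_times = deque()

    def check_once(self):
        """Run one probe and return 'ok', 'retry', 'restart', or 'failover'."""
        if self.probe():
            self.consecutive_failures = 0
            return "ok"
        self.consecutive_failures += 1
        if self.consecutive_failures < self.failures_before_restart:
            return "retry"
        self.consecutive_failures = 0
        now = self.clock()
        # Forget restarts that fall outside the sliding window.
        while self.restart_times and now - self.restart_times[0] > self.restart_window:
            self.restart_times.popleft()
        if len(self.restart_times) >= self.max_restarts:
            return "failover"
        self.restart_times.append(now)
        return "restart"
```

Even this small policy has four tunable values, and it says nothing yet about what the probe checks. That is a fair indication of how quickly the monitoring portion of an agent can grow.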
Enterprise Architecture
No cluster system exists in isolation, and external elements in your IT infrastructure may affect service availability, which would render pointless any work to make the application highly available. Before starting work on a data service agent, you should investigate your IT infrastructure to check that there are as few single points of failure as possible in areas such as network access and power. You should also consider software components in your infrastructure: for example, investigate whether using a transaction processing monitor (TP monitor) would help speed up client recovery times.