- The Cookbook for Setting Up a Serviceguard Package-less Cluster
- The Basics of a Failure
- The Basics of a Cluster
- The "Split-Brain" Syndrome
- Hardware and Software Considerations for Setting Up a Cluster
- Testing Critical Hardware before Setting Up a Cluster
- Setting Up a Serviceguard Package-less Cluster
- Constant Monitoring
- Chapter Review
- Test Your Knowledge
- Answers to Test Your Knowledge
- Chapter Review Questions
- Answers to Chapter Review Questions
25.5 Hardware and Software Considerations for Setting Up a Cluster
I won't go through every permutation of supported disk and LAN technologies, but I do want to jog your memory about Single Points of Failure (SPOFs) in relation to hardware components. I will leave it to you to perform a hardware inventory to ensure that you do not have an SPOF in your design.
- SPU: It is not a requirement for each node in a cluster to be configured exactly the same way from a hardware perspective. It is not inconceivable to use a lower-powered development server as a standby node in case your main application server fails. You should take some time to understand the performance and high availability implications of running user applications on a server with a dissimilar configuration.
- Disk Drives:
  - These are the devices most likely to fail. Ensure that you offer adequate protection for your operating system disks as well as your data disks.
  - Utilizing highly available RAID disk arrays improves your chances of not sustaining an outage due to a single disk failure.
  - If you are utilizing Fibre Channel, ensure that each node has two separate connections to your storage devices via two separate Fibre Channel switches.
  - Ensure that hardware solutions have multiple power supplies fed from different sources.
  - Software components can offer RAID capabilities as well: LVM offers RAID 0, 1, and 0/1; VxVM offers RAID 0, 1, 0/1, 1/0, and 5. When utilizing software RAID, ensure that mirrored disks are on separate controllers and powered from separate power supplies (see the sketch after this list).
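As an illustration of the software RAID point above, here is a minimal sketch of mirroring a logical volume with HP-UX LVM and the optional MirrorDisk/UX product. The volume group and device file names are hypothetical; substitute devices that really do sit on separate controllers and separate power feeds in your own configuration.

    # Make the second disk (on a different controller) an LVM physical volume.
    pvcreate /dev/rdsk/c4t0d0

    # Add it to the existing volume group.
    vgextend /dev/vg01 /dev/dsk/c4t0d0

    # Create a single mirror copy of an existing logical volume on the new
    # disk (requires MirrorDisk/UX).
    lvextend -m 1 /dev/vg01/lvol1 /dev/dsk/c4t0d0

    # Verify that both copies of the logical volume are current.
    lvdisplay -v /dev/vg01/lvol1

Keeping each half of the mirror on its own controller and power feed is what turns the mirror into genuine protection rather than a second copy in the same failure domain.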
- Networks:
  - Ideally, have at least one dedicated heartbeat LAN.
  - Ideally, have a standby LAN for every active LAN interface, including heartbeat LANs (see the sketch after this list).
  - Utilize multiple bridges/switches between active and standby LANs.
  - If utilizing multiple routed IP networks between servers and clients, ensure that multiple routers are used.
  - Ensure that all members of a network can support dynamic routing.
  - Endeavor to utilize the most robust routing protocols, e.g., RIP-2 or, even better, OSPF.
  - If utilizing FDDI, ensure that dual-attached stations are attached to separate concentrators.
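One way to see how your heartbeat and standby LANs will be treated is to let Serviceguard probe the nodes and write a draft cluster configuration file for you to review. The node names and output path below are hypothetical; in the generated file, interfaces carrying an IP address typically appear as HEARTBEAT_IP or STATIONARY_IP entries, while bridged interfaces with no IP address are listed as possible standby interfaces.

    # List the LAN interfaces and their hardware paths on this node.
    lanscan

    # Probe both nodes and generate a draft cluster ASCII file; review the
    # heartbeat entries and the interfaces reported as possible standbys.
    cmquerycl -v -C /etc/cmcluster/cluster.ascii -n node1 -n node2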
- Power Supplies:
  - Ensure that you have at least two independent power supplies.
  - Independent power supplies should be fed from two external power generators; that includes power supply companies. Take the situation where you have two independent power feeds into your data center, both from XYZ Generating Company. If XYZ Generating Company goes "bust" or loses the capability to supply you with electricity, you are somewhat in a pickle. If all else fails, your second power supply should come from an onsite generator.
  - Regularly test the ability of your second power supply to "kick in" seamlessly when your primary supply fails.
- Data Center:
  - You need to consider the data center itself as an SPOF. How are you going to deal with this? The answer can range from a Disaster Recovery Plan including offsite tape data storage to an advanced cluster solution such as Metrocluster or Continentalclusters incorporating asynchronous data replication over a DWDM or WAN link.
  - Ensure that management understands the implications of not including your data center in the overall High Availability Plan.
- Performance:
  - Should an application fail over to an active adoptive node, you will have two applications running on one node. Do the consumers of both applications understand and accept this?
  - Have you set up any Service Level Agreements with your user communities relating to individual application performance?
  - How will you manage the performance of individual applications in the event of a failover to an active adoptive node?
  - Will you employ technologies such as Process Resource Manager (PRM) and Workload Manager (WLM), or leave performance management to the basic UNIX scheduler?
- User Access:
  - Do users need to perform a UNIX login to the node that is running their application?
  - Does the user's application require a UNIX user ID to perform its own level of client authentication?
  - How will you provide consistent UNIX user and group IDs across the entire cluster (see the sketch after this list)?
  - Do you use NIS, NIS+, or LDAP?
  - Do you use Trusted Systems?
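If user and group definitions live in local files rather than NIS or LDAP, a simple consistency check across the cluster can catch mismatched UIDs and GIDs before a failover turns them into file-ownership problems. This is only a sketch under stated assumptions: the node names are hypothetical, and it assumes remote command execution (ssh here) is already configured between the nodes.

    #!/usr/bin/sh
    # Compare username:UID:GID triplets across cluster nodes, using node1
    # as the reference copy; any diff output indicates a mismatch to resolve.
    REFERENCE=node1
    ssh $REFERENCE "cut -d: -f1,3,4 /etc/passwd | sort" > /tmp/passwd.$REFERENCE

    for NODE in node2 node3
    do
        ssh $NODE "cut -d: -f1,3,4 /etc/passwd | sort" > /tmp/passwd.$NODE
        echo "=== Differences between $REFERENCE and $NODE ==="
        diff /tmp/passwd.$REFERENCE /tmp/passwd.$NODE
    done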
- Security:
  - Do you have a security policy for individual nodes?
  - Do you have a security policy for your network(s)?
  - Do you have a security policy for the cluster?
  - Do you have a security policy for your data center?
  - Do you have a security policy for users?
  - Does everyone concerned know and understand your security policies?
  - Do you employ third-party security consultants to perform penetration tests?
  - Does your organization have an IT Security Department? If so, do they perform regular security audits? Do they understand the security implications of an HP High Availability Cluster?
That should jog your memory about SPOFs and other matters relating to the availability of your applications. As you can see, it's not just about "throwing" hardware at the problem; there are lots of other technological and process-related challenges that you will need to face if you are to offer maximized uptime to your customers.
Let's move on to look at the mechanics of setting up a basic High Availability Cluster.