- Sun Cluster 3.0 Series: Guide to Installation—Part 2
- Objectives
- Prerequisites
- Major Resources
- Introduction
- Sun Cluster 3.0 Software
- Install SC 3.0 on First Cluster Node—Without Reboot
- Identify the Device ID (DID) on the First Cluster Node
- Verify DID is Available on Each Additional Cluster Node
- Install SC 3.0 Patches on the First Cluster Node
- Verify the Install Mode is Enabled
- Install SC 3.0 on Additional Cluster Nodes—Without Reboot
- Install SC 3.0 Patches on Additional Cluster Nodes
- Establish the SC 3.0 Quorum Device - First Cluster Node Only
- Configure Additional Public Network Adapters - NAFO
- Configure ntp.conf on Each Cluster Node
- Verify /etc/nsswitch Entries
- Update Private Interconnect Addresses on All Cluster Nodes
- Add Diagnostic Toolkit
Sun Cluster 3.0 Software
SC 3.0 is integrated with the Solaris operating environment (OE), and includes specialized software that implements highly available and scalable data services, and manages the Sun Cluster. SC 3.0 supports:
- Volume management software for administering shared data storage
- Software to enable all nodes to access all storage devices (even SC 3.0 nodes that are not directly connected to disks)
- Software to enable remote files to appear on every node as though they were locally attached to that node
- Software monitoring of user applications
- Software monitoring of cluster connectivity between nodes
- Software for configuring and creating highly available (HA) and scalable data services, including configuration files and management methods for starting, stopping, and monitoring both off-the-shelf and custom applications using the application programming interfaces (APIs)
Highly Available Data and Applications
The general design goal for the SunPlex™ platform is to reduce or eliminate system downtime due to a software or hardware failure, and to ensure data and applications are available even if an entire node (server) fails. SC 3.0 data services can be implemented to increase application performance (for example, scalable data services), and to provide enhanced availability of the system (for example, enabling maintenance of a node to occur without shutting down the entire cluster).
Failover and Scalable Services
SC 3.0 software provides high availability through application failover, which is the process by which a cluster automatically relocates a service from a failed primary node to a designated (that is, preconfigured) secondary node.
Scalable data services are designed to provide constant response times or throughputs, scaling to meet an increased workload. Each cluster node can process client requests and access shared data.
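The relocation that failover performs automatically can also be exercised manually, for example during planned maintenance of a node. The following is a minimal sketch, assuming a hypothetical failover resource group named nfs-rg and a secondary node named clustnode2:

    # scswitch -z -g nfs-rg -h clustnode2
    # scstat -g

The scswitch(1M) command relocates the resource group to the specified node, and scstat -g then reports which node currently hosts it.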
Topologies
SC 3.0 supports cluster-pairs and N+1 (star) topologies (cluster interconnect schemes). Each supported topology requires careful consideration and planning to ensure that failover (or scalability) can be achieved for each hosted application, and that systems, subsystems, and networks are configured to handle any additional workload that can result; that is, alternate nodes must have the performance and capacity to handle the increased workload.
Cluster Interconnect
The cluster interconnect is crucial to all cluster operations and should not be used to route any other traffic or data. This private network establishes exclusive use of preassigned (hard-coded) IP addresses.
The cluster interconnect supports cluster application data (shared nothing databases) and locking semantics (shared disk databases) between cluster nodes.
Redundant, fault-tolerant network links are implemented for SC 3.0, and all links are active at any given time. Adding links increases performance over the interconnect. Upon failure of a single link, failover is transparent and immediate. The application is the primary factor in determining actual performance (scaling) and which types of interconnect are supported.
Key Practice: Refer to Figure 1, and Table 1 through Table 5 of Module 1 (Sun™ Cluster 3.0 Series: Guide to Installation—Part I, Sun BluePrints OnLine, April 2003), to configure additional private interconnects to increase interconnect performance and availability. For this configuration, qfe2 and qfe3 are shown as being 'unused.' Either of these connections (or both) can be configured using Ethernet crossover cables between each cluster node. Sun Cluster software is then used to configure the additional private interconnects, before making the appropriate /etc/inet/hosts entries.
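A minimal sketch of adding one such interconnect with scconf(1M) follows, assuming node names clustnode1 and clustnode2 and the unused qfe2 adapters joined by a crossover cable (the interactive scsetup(1M) utility offers the same operations through menus):

    # scconf -a -A trtype=dlpi,name=qfe2,node=clustnode1
    # scconf -a -A trtype=dlpi,name=qfe2,node=clustnode2
    # scconf -a -m endpoint=clustnode1:qfe2,endpoint=clustnode2:qfe2
    # scstat -W

The first two commands register qfe2 as a transport adapter on each node, the third defines the cable joining the two endpoints, and scstat -W verifies that the new transport path comes online.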
Cluster Membership Monitor
The Cluster Membership Monitor (CMM) is a distributed set of agents, one per cluster member. Agents exchange messages over the cluster interconnect to ensure valid cluster membership is preserved and maintained. The CMM drives the synchronized cluster reconfiguration process in response to changes in cluster membership, and handles cluster partitioning (for example, split-brain or amnesia). The CMM monitors all cluster members to ensure full connectivity.
Ultimately, the CMM ensures valid cluster membership and data integrity by maintaining a valid cluster quorum, and protects the cluster from partitioning itself into multiple, separate clusters in the event of cluster interconnect failure.
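Cluster membership as seen by the CMM can be checked at any time with scstat; for example:

    # scstat -n

This lists each configured cluster node and whether it is currently Online or Offline.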
Failfast Mechanism
If the CMM detects a critical failure with a node, it calls upon the cluster framework to forcibly shut down (panic) the failing node and to remove it from the cluster membership.
Failfast will cause a node to shut down in two ways:
- If the node leaves the cluster and then attempts to start a new cluster without having a quorum, it will be "fenced" off from accessing shared data.
- If one or more cluster-specific daemons die (for example, clexecd, rpc.pmfd, rgmd, or rpc.fed) on a given node, that node will be forced to leave the cluster membership when the CMM forces a panic.
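A quick, informal way to confirm that these cluster-specific daemons are running on a node is a simple process check:

    # ps -ef | egrep 'clexecd|rpc.pmfd|rgmd|rpc.fed'

Each daemon named in the pattern should appear in the output on a healthy cluster member.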
Global Devices
SC 3.0 uses global devices to provide cluster-wide, highly available access to any device in the cluster. The cluster automatically assigns "globally" unique IDs to each disk, CD-ROM, and tape device within the cluster, which enables consistent access to each device from any node in the cluster. The global device name space is held in the /dev/global directory.
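The device ID (DID) instances behind the global name space can be listed with scdidadm(1M); for example:

    # scdidadm -L
    # ls /dev/global/dsk

scdidadm -L lists every DID device together with its physical path on each node, and the /dev/global/dsk directory holds the corresponding global disk device nodes.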
Disk Device Groups
For SC 3.0, all multihost disks are under the control of Sun Cluster software, for which you must perform the following:
- Create volume manager disk groups using the multihost disks.
- Register the volume manager disk groups as disk device groups (a type of global device).
SC 3.0 then registers every individual disk as a disk device group. After registration, the volume manager disk groups become accessible within the cluster. If more than one cluster node can write to (master) a disk device group, the data stored in that disk device group becomes highly available.
NOTE
Refer to the Sun Cluster 3.0 Data Services Installation and Configuration Guide for additional information about volume manager disk groups, and the association between disk device groups and resource groups.
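As a sketch of the registration step for VxVM disk groups (SDS disksets are registered automatically when they are created with metaset), assuming a hypothetical disk group named nfsdg that can be mastered by clustnode1 and clustnode2:

    # scconf -a -D type=vxvm,name=nfsdg,nodelist=clustnode1:clustnode2
    # scstat -D

scstat -D then shows the registered device group and its current primary node.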
Global Name Space
The SC 3.0 software mechanism that enables global devices is the global name space, which includes the /dev/global hierarchy and the volume manager name spaces. Normally, for SDS 4.2.1, the volume manager name spaces reside in the /dev/md/diskset/[r]dsk directories, and for Veritas Volume Manager (VxVM) 3.2, they reside in the /dev/vx/[r]dsk directories.
Within the Sun Cluster, each device node in the local volume manager name space is replaced by a symbolic link to a device node in the /global/.devices/node@nodeID file system, where nodeID is an integer value (for example, node@1 for clustnode1, and node@2 for clustnode2).
The global name space is automatically generated during the SC 3.0 installation procedure (scgdevs(1M)) and is updated during every reconfiguration reboot.
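If devices are added after installation, scgdevs(1M) can be rerun manually to repopulate the name space; the node@1 path below assumes a node ID of 1:

    # scgdevs
    # ls /global/.devices/node@1/dev

The per-node /global/.devices file systems hold the actual device nodes to which the links in the local name spaces point.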
Cluster File Systems
A cluster file system is a proxy between the kernel on one cluster node and the underlying file system and volume manager running on a cluster node that has a physical connection to the disks.
Programs can access a file in a cluster file system from any node in the cluster through the same file name (for example, /global/datafile1). Nodes do not require a physical connection to the disks where the file is stored. A cluster file system is mounted on all cluster members and uses UFS as the underlying file system.
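A cluster file system is mounted with the global option, typically from identical /etc/vfstab entries on every node. The following entry is a sketch that assumes a hypothetical SDS diskset named nfsds, a metadevice d100, and a mount point of /global/nfs:

    /dev/md/nfsds/dsk/d100  /dev/md/nfsds/rdsk/d100  /global/nfs  ufs  2  yes  global,logging

Once the entry is in place on all nodes, mounting /global/nfs from any one node makes the file system available cluster-wide.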
Quorum and Quorum Devices
To ensure cluster and data integrity, it is important that a cluster never be allowed to split itself into separate, active partitions. The CMM guarantees that only one cluster is operational at a time and that this cluster is able to access shared data. A majority of votes (a "quorum") determines whether an active partition is allowed to form a cluster.
A quorum device is used to maintain a valid quorum vote count. A quorum device contributes to the vote count only if at least one of the cluster nodes to which it is currently attached is a valid cluster member. During cluster boot, a quorum device contributes to the vote count only if at least one of the nodes to which it is currently attached is booting and was a member of the most recently booted cluster at the time of shutdown.
A quorum device (for example, a dual-ported, shared disk device) is required for all two-node clusters since two or more quorum votes are required in order for a cluster to form. The two votes can come from the cluster nodes, or from just one node and a quorum device.
Key Practice: Protect against individual quorum device failures by configuring more than one quorum device between sets of cluster nodes. Use disks from different enclosures, and always configure an odd number of quorum devices between each set of nodes.
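A quorum device is assigned by its DID name. The following sketch assumes a shared disk that appears as DID device d2 (on a newly installed cluster, scsetup performs the same configuration interactively):

    # scconf -a -q globaldev=d2
    # scstat -q

scstat -q reports the votes contributed by each node and quorum device, along with the number of votes required for quorum.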
Cluster Reconfiguration
Cluster reconfiguration is performed to ensure a reliable and consistent cluster state. Each time a cluster is started (or when a node joins or leaves the cluster), the cluster framework performs a reconfiguration, which can be observed on the console.
Public Network Management (PNM)
PNM monitors the network interfaces and subnets of the system. Public network interfaces within an SC 3.0 configuration are assigned to a network adapter failover (NAFO) group. Both primary and redundant interfaces are defined within the NAFO group.
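A NAFO group is created with pnmset(1M); the sketch below assumes a group named nafo0 containing a primary adapter qfe0 and a backup adapter qfe4:

    # pnmset -c nafo0 -o create qfe0 qfe4
    # pnmstat -l

pnmstat -l lists each NAFO group, its status, and the adapter currently carrying the traffic.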