Managing Shared Storage in a Sun Cluster 3.0 Environment With Solaris Volume Manager Software
- Using Solaris Volume Manager Software With Sun Cluster 3.0 Framework
- Configuring Solaris Volume Manager Software in the Sun Cluster 3.0 Environment
- Advantages of Using Solaris Volume Manager Software in a Sun Cluster 3.0 Environment
- Ordering Sun Documents
With Sun™ Cluster 3.0 software, you can use two volume managers: VERITAS Volume Manager (VxVM) software, and Sun's Solaris™ Volume Manager software, which was previously called Solstice DiskSuite™ software.
Traditionally, VxVM has been the volume manager of choice for shared storage in enterprise-level configurations. In this Sun BluePrints™ OnLine article, we describe a free and easy-to-use alternative, Solaris Volume Manager software, which is part of the Solaris™ 9 Operating Environment (Solaris 9 OE). This mature product offers similar functionality to VxVM. Moreover, it is tightly integrated into the Sun Cluster 3.0 software framework and, therefore, should be considered to be the volume manager of choice for shared storage in this environment. It should be noted that Solaris Volume Manager software cannot be used to provide volume management for Oracle RAC/OPS clusters.
To support our recommendation to use Solaris Volume Manager software, we present the following topics:
"Using Solaris Volume Manager Software With Sun Cluster 3.0 Framework" explains how Solaris Volume Manager software functions in a Sun Cluster 3.0 environment.
"Configuring Solaris Volume Manager Software in the Sun Cluster 3.0 Environment" provides a run book and a reference implementation for creating disksets and volumes (metadevices) in a Sun Cluster 3.0 framework.
"Advantages of Using Solaris Volume Manager Software in a Sun Cluster 3.0 Environment" summarizes the advantages of using Solaris Volume Manager software for shared storage in a Sun Cluster 3.0 environment.
NOTE
The recommendations presented in this article are based on the use of the Solaris 9 OE and Sun Cluster 3.0 update 3 software.
Using Solaris Volume Manager Software With Sun Cluster 3.0 Framework
Before we present our reference configuration, we describe some concepts to help you understand how Solaris Volume Manager software functions in a Sun Cluster 3.0 environment. Specifically, we focus on the following topics:
Sun Cluster software's use of DID (disk ID) devices to provide a unique and consistent device tree on all cluster nodes.
Solaris Volume Manager software's use of disksets, which enable disks and volumes to be shared among different nodes, and the diskset's representation in the cluster, which is called a device group.
The use of mediators to enhance the tight replica quorum (which is different from the cluster quorum) rule of Solaris Volume Manager software, and to allow clusters to operate in the event of specific multiple failures.
The use of soft partitions and the mdmonitord daemon with Solaris Volume Manager software. While these components are not related to the software's use in a Sun Cluster environment, they should be considered part of any good configuration.
Using DID Names to Ensure Device Path Consistency
With Sun Cluster 3.0 software, it is not necessary to have an identical hardware configuration on all nodes. However, different configurations may lead to different logical Solaris OE names on each node. Consider a cluster where one node has a storage array attached on a host bus adapter (HBA) in the first peripheral component interconnect (PCI) slot. On the other node, the array is attached to an HBA in the second slot. A shared disk on target 30 may end up being referred to as /dev/rdsk/c1t30d0 on the first node and as /dev/rdsk/c2t30d0 on the other node. In this case, the physical Solaris OE device path is different on each node and it is likely that the major-minor number combination is different, as well.
In a non-clustered environment, Solaris Volume Manager software uses the logical Solaris OE names as building blocks for volumes. However, in a clustered environment, the volume definitions are accessible on all the nodes and should, therefore, be consistent; the name and the major/minor numbers should be consistent across all the nodes. Sun Cluster software provides a framework of consistent and unique disk names and major/minor number combinations. Such names are created when you install the cluster and they are referred to as DID names. They can be found in /dev/did/rdsk and /dev/did/dsk and are automatically synchronized on the cluster nodes such that the names and the major/minor numbers are consistent between nodes. Sun Cluster 3.0 uses the device ID of the disks to guarantee that the same name exists for a given disk in the cluster.
Always use DID names when referring to disk drives to create disksets and volumes with Solaris Volume Manager software in a Sun Cluster 3.0 environment.
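As a rough illustration of why DID names matter, the following Python sketch models two nodes that see the same shared disk under different logical names. It is a toy model, not a description of how the cluster framework is implemented: the device ID string, the DID instance number, and the build_did_map helper are all hypothetical.

```python
# Toy model only: real DID names are assigned by the cluster framework.
# The device ID string and the resulting instance number are made up.

def build_did_map(node_views):
    """Map each device ID to one cluster-wide DID name.

    node_views: {node_name: {logical_path: device_id}}
    """
    did_names = {}
    for node in sorted(node_views):
        for path in sorted(node_views[node]):
            devid = node_views[node][path]
            if devid not in did_names:
                # Hypothetical numbering: next free instance number.
                did_names[devid] = "/dev/did/rdsk/d%d" % (len(did_names) + 1)
    return did_names


# The same physical disk, seen through different controllers on each node.
node_views = {
    "node1": {"/dev/rdsk/c1t30d0": "id1,sd@example-disk-A"},
    "node2": {"/dev/rdsk/c2t30d0": "id1,sd@example-disk-A"},
}

did = build_did_map(node_views)
for node in sorted(node_views):
    for path, devid in node_views[node].items():
        print(node, path, "->", did[devid])
```

Both nodes resolve their different logical paths to the same DID name, which is the property that volume definitions in a diskset rely on.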
Using Disksets to Share Disks and Volumes Among Nodes
Disksets, which are a component of Solaris Volume Manager, are used to store the data within a Sun Cluster environment.
On all nodes, local state database replicas must be created. These local state database replicas contain configuration information for locally created volumes, for example, volumes that are part of the mirrors on the boot disk. Local state database replicas also contain information about disksets that are created in the cluster: the name of the set, the names of the hosts that can own the set, the disks in it and whether they have a replica on them and, if configured, the mediator hosts. This is a major difference between Solaris Volume Manager software and VxVM, because in VxVM, each diskgroup is self-contained: each disk within the group records the group to which it belongs and the host that currently owns the group. If the last disk in a VxVM diskgroup is deleted, the group is deleted by definition.
At any one time, a diskset has a single host that has access to it. The node that has access is deemed to be the owner of the diskset. The action of getting ownership is called "take" and the action of relinquishing ownership is called "release." In VxVM terms, the take and release of a diskset correspond to the import and export of a diskgroup. The owner of a diskset is called the current primary of that diskset. This means that although more nodes can be attached to the diskset and can potentially take the diskset upon failure of the primary node, only one node can effectively do input/output (I/O) to the volumes in the diskset. The term shared storage merits further explanation: the disks are not shared in the sense that all nodes access them simultaneously, but in the sense that different nodes are potential primaries for the set.
The creation of disksets involves three steps. First, a diskset receives a name and a primary host. This action creates an entry for the diskset in the local state database of that host. Second, the other hosts are added to the diskset. While Solaris Volume Manager software allows for a maximum of 8 hosts, Sun Cluster software (at this time) supports only up to 4 hosts. The rpc.metad daemon on the first node contacts the rpc.metad daemon on the second host, instructing it to create an entry for the diskset in the second host's local state database.
Now, disks can be added to the diskset. Again, the primary host's rpc.metad daemon contacts the second host so that the local state databases on both nodes contain the same information.
Note that you can add disks from any node that can potentially own the diskset; the request is forwarded (proxied) to the primary node. This is done through the rpc.metacld daemon, which allows you to administer disksets from any cluster node. Neither rpc.metad nor rpc.metacld should be removed when hardening a cluster that uses Solaris Volume Manager software, because both are essential to the operation of the Solaris Volume Manager software components.
When you add a new disk to a diskset, Solaris Volume Manager software checks the disk format and, if necessary, repartitions the disk to ensure that the disk has an appropriately configured slice 7 with adequate space for a state database replica. The precise size of slice 7 depends on the disk geometry, but it will be no less than 4 Mbytes, and probably closer to 6 Mbytes (depending on where the cylinder boundaries lie).
NOTE
The minimum size for slice 7 will likely change in the future, based on a variety of factors, including the size of the state database replica and the information to be stored in the state database replica.
For use in disksets, disks must have a slice 7 that meets specific criteria:
Starts at sector 0
Includes enough space for disk label and state database replicas
Cannot be mounted
Does not overlap with any other slices, including slice two
If the existing partition table does not meet these criteria, Solaris Volume Manager software will repartition the disk. A small portion of each drive is reserved in slice 7 for use by Solaris Volume Manager software. The remainder of the space on each drive is placed into slice 0. Any existing data on the disks is lost by repartitioning.
After you add a drive to a diskset, you may repartition it as necessary, provided that slice 7 is not altered in any way.
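The slice 7 criteria listed above can be expressed as a small check. The following Python sketch is only an illustration of those rules, not the test that Solaris Volume Manager software actually performs: the Slice tuple, the slice7_ok helper, the 512-byte sector size, and the use of 4 Mbytes as the minimum size are assumptions made for the example.

```python
from collections import namedtuple

# Hypothetical, simplified view of one partition table entry.
Slice = namedtuple("Slice", "number start_sector sectors mounted")

SECTOR_BYTES = 512                   # assumed sector size
MIN_SLICE7_BYTES = 4 * 1024 * 1024   # lower bound quoted in the text


def overlaps(a, b):
    """True if two slices share at least one sector."""
    return (a.start_sector < b.start_sector + b.sectors and
            b.start_sector < a.start_sector + a.sectors)


def slice7_ok(slices):
    """Apply the four criteria from the text to a partition table."""
    in_use = {s.number: s for s in slices if s.sectors > 0}
    s7 = in_use.get(7)
    if s7 is None:
        return False
    if s7.start_sector != 0:                          # starts at sector 0
        return False
    if s7.sectors * SECTOR_BYTES < MIN_SLICE7_BYTES:  # room for label and replicas
        return False
    if s7.mounted:                                    # cannot be mounted
        return False
    # Must not overlap any other slice, including slice 2.
    return not any(overlaps(s7, s) for n, s in in_use.items() if n != 7)


# Layout after repartitioning: small slice 7, remainder in slice 0.
repartitioned = [Slice(7, 0, 12288, False), Slice(0, 12288, 1000000, False)]
# A layout with the usual whole-disk slice 2, which overlaps slice 7.
with_backup_slice = repartitioned + [Slice(2, 0, 1012288, False)]

print(slice7_ok(repartitioned))      # acceptable as is
print(slice7_ok(with_backup_slice))  # would be repartitioned
```

A table that fails this kind of check is the case in which the disk is repartitioned and existing data is lost.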
Using Device Groups to Manage Disks and Volumes
Sun Cluster 3.0 software automatically takes and releases Solaris Volume Manager disksets, and automatically imports and exports VxVM diskgroups. To accomplish this, you have to identify the diskset or diskgroup to the cluster. For each device (disk, tape, Solaris Volume Manager diskset, or VxVM diskgroup) that should be managed by the cluster, ensure that there is an entry in the cluster configuration repository.
When a diskset or diskgroup is known to the cluster, it is referred to as a device group. A device group is an entry in the cluster repository that defines extra properties for the diskgroup or diskset. A device group can have the following characteristics:
A node list that corresponds to the node list defined in the diskset.
A preferred node where the cluster attempts to bring the device group online when the cluster boots. This effectively means that when all cluster nodes are booted at the same time, the diskset is taken by its preferred node.
A failback policy that, if set to true, migrates the diskset to the preferred node whenever that node is online. If the preferred node joins the cluster later, it becomes the owner of the diskset (that is, the diskset switches from the node that currently owns it to the preferred node).
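The interaction between the node list, the preferred node, and the failback policy can be sketched as a small decision function. This is a simplified model for illustration only: choose_primary is a hypothetical helper, and the rule that the first online node in the node list takes over when the preferred node is unavailable is an assumption, not documented cluster behavior.

```python
def choose_primary(node_list, preferred, failback, current_primary, online):
    """Return the node that should own the device group, or None.

    node_list       -- nodes that can own the diskset, in order
    preferred       -- the preferred node
    failback        -- True if the failback policy is enabled
    current_primary -- node that owns the group now (None at cluster boot)
    online          -- set of nodes that are currently cluster members
    """
    candidates = [n for n in node_list if n in online]
    if not candidates:
        return None
    if current_primary not in candidates:
        # Cluster boot, or the owner has left the cluster.
        # Assumption: fall back to the first online node in the list.
        return preferred if preferred in candidates else candidates[0]
    if failback and preferred in candidates:
        # The preferred node is a member, so the group moves to it.
        return preferred
    return current_primary


nodes = ["node1", "node2"]

# All nodes boot together: the preferred node takes the diskset.
print(choose_primary(nodes, "node1", False, None, {"node1", "node2"}))

# node2 owns the group and node1 rejoins: ownership depends on failback.
print(choose_primary(nodes, "node1", False, "node2", {"node1", "node2"}))
print(choose_primary(nodes, "node1", True, "node2", {"node1", "node2"}))
```

With failback disabled, the group stays on node2 until it is switched manually; with failback enabled, it returns to node1 as soon as node1 is a cluster member again.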
Sun Cluster software also provides extensive failure fencing mechanisms to avoid data access by unauthorized nodes during device group transitions.
One of the major advantages of using Solaris Volume Manager software in a Sun Cluster 3.0 environment is that the creation and deletion of device groups does not involve extra administration. When you create or delete a diskset with Solaris Volume Manager software commands, the cluster framework is automatically notified that it should create or delete a corresponding entry in the cluster configuration repository. You can also manually change the preferred node and failback policy with standard cluster interfaces.
Using Mediators to Manage Replica Quorum Votes
Disksets have their own replicas, which are added to a disk when the disk is put into the diskset, provided that the maximum number of replicas (50) has not been exceeded. It is possible to manually administer these replicas through the metadb command, but generally this is not required. The need to do so is discussed in the next section. Replicas should be evenly distributed across the storage enclosures that contain the disks, and they should be evenly distributed across the disks on a per-disk-controller basis. In an ideal environment, this distribution means that any one failure in the storage (disk, controller, or storage enclosure) does not affect the operation of Solaris Volume Manager software.
In a physical configuration that has an even number of storage enclosures, the loss of half of the storage enclosures (for example, due to power loss) leaves only 50 percent of the diskset replicas available. While the diskset is owned by a node, this does not create a problem. However, if the diskset is released, the replicas are marked as stale on a subsequent take because the replica quorum of greater than 50 percent has not been reached. This means that all the data on the diskset is read-only, and operator intervention is required. If, at any point, the number of available replicas for either a diskset or the local state database falls below 50 percent, the node aborts to maintain data integrity.
To enhance this feature, you can configure a set to have mediators. Mediators are hosts that can take (import) a diskset and that provide an additional vote when a quorum vote is required (for example, when a diskset is taken). To support the replica quorum requirement, mediators have a quorum requirement of their own: either more than 50 percent of them are available, or the available mediators are marked as being up to date. A mediator that is marked as up to date is said to be golden. Mediators, whether they are golden or not, are only used when a diskset is taken. If the mediators are golden and one of the nodes is rebooted, the mediator on that node gets the current state from the node that is still in the cluster when it starts up. However, if all nodes in the cluster are rebooted when the mediators are golden, the mediators are not golden on startup, and operator intervention is required to take ownership of the diskset. The actual mediator information is held by the rpc.metamedd(1M) daemon.
For example, consider two hosts (node1 and node2) and two storage enclosures (pack1 and pack2), with the diskset replicas distributed evenly between pack1 and pack2 and with node1 owning the diskset. If pack1 dies, only 50 percent of the diskset replicas are available, and the mediators on both hosts are marked as golden. If node1 now dies, node2 can take (import) the diskset because 50 percent of the diskset replicas are available and the mediator on node2 is golden. If mediators were not configured, node2 would not be able to take the diskset without operator intervention.
Mediators do not, however, protect against simultaneous failures. If both pack1 and node1 fail at the same time, the mediator on node2 will not have been marked as golden and there will not be an extra vote for the diskset replica quorum, making operator intervention necessary. Because the nodes should be on an uninterruptible power supply (UPS), which gives the mediators enough time to be marked as golden, this type of failure is unlikely.
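The replica quorum and mediator rules described in this section can be summarized in a few lines of Python. This is a simplified model of the rules as stated in the text, not the logic implemented by Solaris Volume Manager software; the can_take function and its parameters are hypothetical.

```python
def can_take(replicas_ok, replicas_total,
             mediators_reachable=0, mediators_total=0, golden=False):
    """Return True if a diskset take can proceed without an operator.

    replicas_ok / replicas_total -- available and configured replicas
    mediators_reachable / total  -- reachable and configured mediator hosts
    golden                       -- a reachable mediator is marked golden
    """
    if 2 * replicas_ok > replicas_total:
        return True          # replica quorum (> 50 percent) reached
    if 2 * replicas_ok < replicas_total:
        return False         # less than half: mediators cannot help
    # Exactly 50 percent of the replicas: mediators provide the extra vote.
    if mediators_total == 0:
        return False
    if 2 * mediators_reachable > mediators_total:
        return True          # mediator quorum reached
    return mediators_reachable > 0 and golden


# pack1 fails first (mediators become golden), then node1 fails.
print(can_take(2, 4, mediators_reachable=1, mediators_total=2, golden=True))

# pack1 and node1 fail at the same time: the mediator is not golden.
print(can_take(2, 4, mediators_reachable=1, mediators_total=2, golden=False))

# Same storage failure without mediators configured.
print(can_take(2, 4))
```

The first call corresponds to the node1/node2 example above; the second and third correspond to the simultaneous failure and the configuration without mediators, both of which need operator intervention.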
Reasons to Manually Change the Replica Allocation Within a Diskset
As alluded to in the previous paragraph, administrator intervention can be required under certain failure scenarios, even with mediators configured. One such scenario is that of a two-room cluster, in which each room has one node and one storage device. If a room fails, any diskset that was owned by the node in that room requires manual intervention. On the surviving node, the administrator must take the set using the metaset or scswitch command and remove the replicas that are marked as errored. When this is done, the diskset must be released and retaken so that the node can gain write access to the configured metadevices.
By manually redistributing the replicas, you can minimize the need for manual intervention. To do this, "weight" one room over the other so that if the non-weighted room fails, the remaining (weighted) room still holds more than 50 percent of the replicas and can take the diskset. If the weighted room fails, manual intervention is still required. To weight a room, add more replicas to the disks that reside in that room, or delete replicas from the disks that do not reside in that room.
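The effect of weighting can be checked with simple arithmetic. The following Python sketch compares an even and a weighted replica layout for a two-room cluster; the room names, the replica counts, and the survives helper are made up for illustration, and mediators are ignored because a room failure takes out a node and its storage at the same time.

```python
def survives(replicas_by_room, failed_room):
    """True if the surviving room holds more than 50 percent of the replicas."""
    total = sum(replicas_by_room.values())
    remaining = total - replicas_by_room[failed_room]
    return 2 * remaining > total


even = {"roomA": 3, "roomB": 3}       # evenly distributed replicas
weighted = {"roomA": 4, "roomB": 3}   # one extra replica in roomA

for layout_name, layout in (("even", even), ("weighted", weighted)):
    for room in sorted(layout):
        print(layout_name, "layout,", room, "fails:",
              "automatic take" if survives(layout, room) else "manual intervention")
```

With the even layout, the loss of either room leaves exactly half of the replicas, so both cases need an administrator. With the weighted layout, only the loss of roomA does.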
Using Soft Partitions as a Basis for File Systems
After adding a disk to a diskset, you can modify the partition layout, that is, break up the default slice 0 and spread the space among the slices (including slice 0). If slice 7 contains a replica, leave it alone to avoid corrupting the replica on it. Because Solaris Volume Manager software supports soft partitioning, we recommend that you leave slice 0 untouched.
Consider a soft partition as a subdivision of a physical Solaris OE slice or as a subdivision of a mirror, redundant array of independent disks (RAID) 5, or striped volume. The number of soft partitions you can create is limited by the size of the underlying device and by the number of possible volumes (nmd, as defined in /kernel/drv/md.conf). The default number of possible volumes is 128. Note that all soft partitions created in a diskset belong to that diskset, so they cannot have different primary nodes.
Soft partitions are composed of a series of extents that are located at arbitrary locations on the underlying media. The locations are automatically determined by the software at initialization time. It is possible to manually set these locations, but it is not recommended for general day-to-day administration. Locations should be manually set only during recovery scenarios where the metarecover(1M) command is insufficient.
You can create and use soft partitions in two ways:
Create them on top of a physical disk slice and use them as building blocks for mirrors or RAID 5 volumes, just as you would use a physical slice.
Create them on top of a mirror or RAID 5 volume.
In our example, we use the second approach. We consider this the best solution for two reasons:
Sizing and resizing soft partitions is limited only by the size of the underlying device. If the underlying device is a Solaris OE slice, it is not always possible to increase the size of the soft partition while keeping the file system on the slice intact. However, it is much easier to grow a Solaris Volume Manager software volume, and then grow the soft partition on top of it.
Creating different soft partitions on top of a large mirrored volume allows you to use the Solaris Volume Manager software namespace more efficiently and consistently. Consider the following example: You create one large mirror (d2) on top of two submirrors (d0 and d1). On top of d2, you create soft partitions d10, d11, d12, and so on. On these soft partitions, you create file systems. If you did it the other way around, you would have to create soft partitions d10, d11, d12, and so on, as well as corresponding soft partitions on the other disks d20, d21, d22, and so on. Then, you would have to create the stripes to use as submirrors on top of the soft partitions and finally create the mirrors. In this scenario, you would have used twice as many soft partition names and, therefore, more of the Solaris Volume Manager software namespace.
While we recommend the second approach, keep in mind that a disadvantage of this approach is that you will have to perform a complete disk resync if the disk fails, while the first approach would only require a resync of the defined soft partitions.
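To make the namespace argument concrete, the following Python sketch counts the volume names that each approach consumes for a given number of file systems. The counting rules are our reading of the example above (two submirrors plus one mirror for the second approach; two soft partitions, two submirror stripes, and one mirror per file system for the first approach), and the helper names are hypothetical.

```python
NMD_DEFAULT = 128  # default number of possible volumes, as stated in the text


def names_soft_partitions_on_mirror(file_systems):
    """Second approach: d0, d1, d2, plus one soft partition per file system."""
    return 3 + file_systems


def names_mirrors_on_soft_partitions(file_systems):
    """First approach: per file system, two soft partitions,
    two stripes used as submirrors, and one mirror."""
    return 5 * file_systems


def max_file_systems(count_names, nmd=NMD_DEFAULT):
    """Largest number of file systems whose volume names fit within nmd."""
    n = 0
    while count_names(n + 1) <= nmd:
        n += 1
    return n


for label, fn in (("soft partitions on a mirror", names_soft_partitions_on_mirror),
                  ("mirrors on soft partitions", names_mirrors_on_soft_partitions)):
    print(label, "- names for 10 file systems:", fn(10),
          "- file systems within nmd:", max_file_systems(fn))
```

Under these assumptions, the second approach leaves room for far more file systems before the default nmd limit is reached.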
Using the mdmonitord Daemon to Enable Active Volume Monitoring
The mdmonitord daemon quickly fails volumes that have faulty disk components. It does this by probing configured volumes, including volumes in disksets that are currently owned by the node where the daemon runs (note that the daemon runs on all the nodes in a cluster). The probe is a simple open(2) of the top level volume that causes the Solaris Volume Manager software kernel components to open underlying devices. The probe eventually causes the physical disk device to open. If the disk has failed, the probe will fail all the way back up the chain, and the daemon can take the appropriate action.
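The probe itself is conceptually very small. The following Python sketch mimics it with an open and a close of a device path; it only illustrates the idea and is not the daemon's code. The probe helper is hypothetical, and a real volume path would look like the block or raw device of a metadevice rather than the placeholder paths used here.

```python
import os


def probe(volume_path):
    """Open and close a volume, as the monitoring probe does.

    Returns (True, None) if the open succeeds, or (False, reason) if the
    open fails, which is the signal that something underneath is broken.
    """
    try:
        fd = os.open(volume_path, os.O_RDONLY)
    except OSError as err:
        return False, err.strerror
    os.close(fd)
    return True, None


# os.devnull stands in for a healthy volume; the second path is a
# placeholder that does not exist, standing in for a failed device.
for path in (os.devnull, "/nonexistent/placeholder/d100"):
    ok, reason = probe(path)
    print(path, "ok" if ok else "failed: %s" % reason)
```

On a real system, the open of the top-level volume is what forces the kernel components to open the underlying devices, so the failure of a physical disk surfaces as a failed open at the top of the chain.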
If the volume is a mirror, the mirror must be in use for the submirror's component to be marked as errored (that is, another application must have it open, for example, because it has a mounted file system on it). If the mirror is not open, the daemon reports an error and performs no other action, which prevents unnecessary resyncs of unused mirrors.
In the case of a RAID 5 device, the device is failed right away because there is not as much redundancy as there is in a mirror (a second failure in the RAID 5 device makes the device inoperable), and it is better to incur the cost of the failure immediately.
The mdmonitord daemon has two modes of operation: interrupt mode and periodic probing mode.
In interrupt mode, the daemon waits for disk failure events. If the daemon detects a failure, it probes the configured volumes as previously described. This is the default behavior.
In periodic probing mode, you can specify certain time intervals for the daemon to perform probes by giving the daemon the -t option, followed by the number of seconds between each probe. The daemon also waits for disk failure events.
The mdmonitord daemon is useful if your system contains volumes that are accessed infrequently. Without the daemon, a disk failure can go unnoticed for quite some time, unless you manually check the configuration with the metastat -i command. This might not be a problem at first sight, but can be catastrophic if the failed disk is the cluster's quorum disk, or if an entire storage array has failed. These scenarios are described as follows:
If a quorum disk fails and, subsequently, a node fails, cluster operation is seriously impaired. We recommend that you put the quorum disk in a diskset and make it part of a submirror so it is monitored by the mdmonitord daemon. Depending on the usage of the mirror, you might want to consider configuring the mdmonitord daemon to do timed probes. If the mirror is well used, that is, if it has plenty of I/O going to it, you might not want this.
If mediators are configured, they provide extra votes to guarantee replica quorum. Moreover, if, after an array fails, the remaining replicas are updated, the mediators are set to golden. Now, if a node is lost, the golden status on the remaining node allows Solaris Volume Manager software to continue updating the replicas without intervention. Using the mdmonitord daemon to regularly check the status of volumes increases the possibility of replica updates after a storage array fails.
If a host and a disk fail at the same time, the mediators are not golden and you suffer an outage. However, if the host is UPS protected (so that it stays up for a period of time after power fails), the mdmonitord daemon can cause an update to the replicas, which in turn causes a mediator update. If the node then fails (because the UPS has run out), the mediator is already golden and the cluster survives a failure that it might not have survived before. This is rather a corner case, but it shows that the mdmonitord daemon can be used to improve uptime.