Solaris Volume Manager Striping Considerations
Software striping is becoming far less necessary as more intelligent hardware storage solutions emerge. Before considering Solaris Volume Manager striping, investigate the capabilities of your current I/O subsystem. If software striping is still deemed necessary, then take care when configuring the volumes.
The main problem with Solaris Volume Manager striping is split I/O. Splitting an individual I/O operation across more than one disk degrades performance. There are several ways to avoid this:
Use hardware striping where possible. This prevents Solaris Volume Manager software or the operating system from having to split an I/O operation.
Increase the stripe width to lessen the frequency of splitting an I/O operation.
Calculate alignment based on the I/O size and offsets. This works well for databases with a known I/O size.
Increase the Default Solaris Volume Manager Stripe Width
The probability of splitting an I/O operation is inversely proportional to the stripe width. Consider an OLTP system which mostly performs 8-Kbyte I/O. A 32-Kbyte stripe width has a probability of splitting 1 of 4 I/O operations, whereas a 1-Mbyte stripe width splits only 1 of 128 I/O operations (less than one percent). Increasing the stripe width is the single most important improvement you can make to decrease the probability of splitting an I/O operation.
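The arithmetic above can be sketched in a few lines of Python. This is a minimal sketch of the simple ratio model the text uses (I/O size divided by stripe width), assuming I/O start offsets are spread evenly and are not deliberately aligned to the stripe. The function name split_fraction is hypothetical.

```python
from fractions import Fraction

def split_fraction(io_kbytes, stripe_kbytes):
    """Approximate fraction of I/O operations that straddle a stripe boundary.

    Assumption: start offsets are evenly distributed, so the fraction is
    simply io_size / stripe_width, capped at 1.
    """
    return min(Fraction(1), Fraction(io_kbytes, stripe_kbytes))

if __name__ == "__main__":
    for stripe_kbytes in (32, 1024):
        frac = split_fraction(8, stripe_kbytes)
        print(f"{stripe_kbytes}-Kbyte stripe: about {frac} of 8-Kbyte I/Os "
              f"split ({float(frac):.2%})")
```

Under this model the benefit of a wider stripe falls off quickly once the stripe is already many times larger than the I/O size.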
Make the Stripe Width Large Relative to the I/O Size
If you use the metainit command without the -i option, Solaris Volume Manager software uses a default of 16 Kbytes for the stripe width. A narrow stripe width doesn't allow the application to take advantage of read-ahead for sequential I/O provided by the underlying storage subsystem. A stripe width of 1 Mbyte or greater is common, especially when implementing a stripe and mirror everywhere (SAME) strategy.
Align Soft Partitions on the Stripe Boundary
In Solaris Volume Manager software before the Solaris 8 OE, each volume had to match a hard disk partition or volume table of contents (VTOC). With this older scheme, there is a direct relationship between a Solaris Volume Manager device and the physical disk or LUN. This scheme limits the number of volumes to the number of partitions or VTOCs that can be created.
Soft partitioning enables one piece of disk to be partitioned into more slices than a device VTOC allows. Soft partitioning provides a great deal of flexibility because it enables volumes to be created on top of individual disks or existing Solaris Volume Manager volumes. Layering of volumes is particularly useful with large RAID devices, which can easily exceed 1 terabyte. But soft partitioning on top of striped hard partitions can lead to the split I/O problem.
As mentioned, if the stripe width is large compared to the I/O size, there is less splitting of I/O. In the case where 8 Kbytes is the I/O size and 1 Mbyte is the stripe width, less than one percent of the I/O operations are split. But still, there are some split I/O operations. To get total alignment, striped soft partitions must be aligned on a stripe boundary.
Consider the example of an Oracle application that uses an 8-Kbyte block size. The following example shows the default soft partitions or layered volumes that are created with the metainit command.
# metainit d10 1 2 c3t1d0s7 c5t1d0s7 -i 256k
# metainit d11 -p d10 1024m
# metainit d12 -p d10 1024m
Meta device d11 happens to begin on a stripe boundary, as shown in FIGURE 2. Random block reads to this device are exactly 8 Kbytes in size and involve exactly one disk. Because a 512-byte label (the watermark) is added to each soft partition, meta device d12 does not begin on a stripe boundary, so an I/O operation on d12 that straddles a stripe boundary is split.
FIGURE 2 Split I/O Due to Watermark
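The effect of the watermark can be illustrated with a small Python sketch. It assumes a 256-Kbyte stripe width (512 blocks of 512 bytes), 8-Kbyte I/O (16 blocks), and a soft partition whose data starts one block past a stripe boundary. The one-block shift and the function names are illustrative assumptions, not values reported by Solaris Volume Manager.

```python
def crosses_boundary(start_block, n_blocks, stripe_blocks):
    """True if the block range spans more than one stripe unit."""
    first_unit = start_block // stripe_blocks
    last_unit = (start_block + n_blocks - 1) // stripe_blocks
    return first_unit != last_unit

def count_split_ios(partition_start, io_blocks, stripe_blocks, n_ios):
    """Count back-to-back fixed-size I/Os that straddle a stripe boundary.

    partition_start is the block at which the soft partition's data begins
    on the underlying stripe (an assumed value for illustration).
    """
    return sum(
        crosses_boundary(partition_start + i * io_blocks, io_blocks, stripe_blocks)
        for i in range(n_ios)
    )

if __name__ == "__main__":
    STRIPE_BLOCKS = 512   # 256 Kbytes in 512-byte blocks
    IO_BLOCKS = 16        # 8 Kbytes in 512-byte blocks
    print("aligned start:", count_split_ios(512, IO_BLOCKS, STRIPE_BLOCKS, 3200))
    print("shifted by one block:", count_split_ios(513, IO_BLOCKS, STRIPE_BLOCKS, 3200))
```

With an aligned start, no 8-Kbyte block ever straddles a boundary. With the one-block shift, one block in every stripe unit does.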
To correct this, use the -o option to specify the offset to be a multiple of the stripe width. The following example shows how to use the -o option with the metainit command to specify the starting offset of the soft-partition to align on the stripe boundary.
# metainit d10 1 2 c3t1d0s7 c5t1d0s7 -i 256k
# metainit d11 -p d10 -o 512 -b 2097152
Taking this example one step further, use the following formula to determine the next offset. The formula assumes that the previous offset and size are themselves multiples of the stripe width:
next_offset = last_offset + last_size + stripe_width (all values in 512-byte blocks)
For this example, the result is:
512 + 2097152 + 512 = 2098176 blocks
The corresponding metainit command:
# metainit d12 -p d10 -o 2098176 -b 2097152
This offset guarantees that the d12 meta device starts on a stripe boundary and that the 8-Kbyte I/O will not be split unnecessarily.
To administer volumes using this scheme, you must keep track of offsets. While this is easy during the initial creation of volumes, it is useful to be able to observe the current settings. The following example shows how the metastat -p command can be used to show the volume offsets. This command is useful for documenting the storage configuration as well as for ongoing administration.
# metastat -p |egrep 'd2'
d208 -p d2 -o 61473953 -b 9680
d2 1 1 c5t1d0s0
d207 -p d2 -o 61464272 -b 9680
d206 -p d2 -o 60440255 -b 1024016
d205 -p d2 -o 59416238 -b 1024016
d204 -p d2 -o 52264621 -b 7151616
d203 -p d2 -o 45113004 -b 7151616
d202 -p d2 -o 37961387 -b 7151616
d201 -p d2 -o 30809770 -b 7151616
d29 -p d2 -o 25677481 -b 5132288
d28 -p d2 -o 20545192 -b 5132288
d27 -p d2 -o 15412903 -b 5132288
d26 -p d2 -o 10280614 -b 5132288
d25 -p d2 -o 5148325 -b 5132288
d24 -p d2 -o 16036 -b 5132288
d23 -p d2 -o 10691 -b 5344
d22 -p d2 -o 5346 -b 5344
d21 -p d2 -o 1 -b 5344
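Output in this form is easy to check with a script. The following Python sketch parses soft-partition lines of the form shown above and lists the volumes whose offsets are not a multiple of a given stripe width. The 512-block stripe width is a hypothetical value for illustration (the d2 device in this listing is a single-slice volume), and the function names are assumptions.

```python
import re

# Matches each "-o <offset> -b <size>" extent on a soft-partition line.
EXTENT = re.compile(r"-o\s+(\d+)\s+-b\s+(\d+)")

def soft_partition_extents(metastat_text):
    """Return (name, offset, size) tuples for every soft-partition extent."""
    extents = []
    for line in metastat_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[1] == "-p":
            for offset, size in EXTENT.findall(line):
                extents.append((fields[0], int(offset), int(size)))
    return extents

def misaligned(extents, stripe_blocks):
    """Names of soft partitions whose offset is not on a stripe boundary."""
    return [name for name, offset, _ in extents if offset % stripe_blocks != 0]

SAMPLE = """\
d2 1 1 c5t1d0s0
d23 -p d2 -o 10691 -b 5344
d22 -p d2 -o 5346 -b 5344
d21 -p d2 -o 1 -b 5344
"""

if __name__ == "__main__":
    extents = soft_partition_extents(SAMPLE)
    print("not on a 512-block boundary:", misaligned(extents, 512))
```

The same check could be run against saved metastat -p output as part of documenting a configuration.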
Make the Software Stripe Width a Multiple of the Hardware Segment Size
Hardware RAID and software RAID are sometimes combined to increase availability and throughput, but there can be performance consequences if the underlying storage layout is not considered.
Make sure that you understand the stripe unit or segment size of the underlying storage architecture when implementing software striping on top of hardware RAID. Failure to do this can cause a single I/O operation to be split between multiple devices on the underlying storage. This unwanted split I/O operation increases latency and degrades overall throughput.
The underlying segment size differs from array to array. For example:
The Sun StorEdge 9980 system uses a 48-Kbyte segment size.
The Sun StorEdge 6910 system and Sun StorEdge T3 arrays use a 32-Kbyte or 64-Kbyte segment size (64-Kbyte provides the best performance).
When combining multiple LUNs, it is best to use a fairly large stripe width. For the StorEdge 9980 example, a stripe width of 20 * 48 Kbytes, or 960 Kbytes, is a good place to start for an expected 8-Kbyte I/O size. This enables the LUN to benefit from read-ahead while reducing the probability of splitting an I/O operation.
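The relationship between the software stripe width and the hardware segment size is simple to verify. The following Python sketch builds a stripe width from a whole number of segments and checks whether a candidate width is a segment multiple. The function names are hypothetical, and the 1024-Kbyte comparison is only an illustration.

```python
def stripe_width_kbytes(segment_kbytes, segments_per_stripe):
    """Software stripe width built from a whole number of hardware segments."""
    return segment_kbytes * segments_per_stripe

def is_segment_multiple(stripe_kbytes, segment_kbytes):
    """True if the software stripe width is a multiple of the segment size."""
    return stripe_kbytes % segment_kbytes == 0

if __name__ == "__main__":
    width = stripe_width_kbytes(48, 20)
    print(f"20 x 48 Kbytes = {width} Kbytes, "
          f"segment multiple: {is_segment_multiple(width, 48)}")
    print(f"1024 Kbytes, segment multiple: {is_segment_multiple(1024, 48)}")
```

Note that a round figure such as 1024 Kbytes is not a multiple of a 48-Kbyte segment, which is why the 960-Kbyte value is the better starting point for that array.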
Limit Striping and Meta Devices
Administrators coming from a Veritas background tend to map Veritas Volume Manager (VxVM) concepts onto Solaris Volume Manager software. VxVM creates subdisks for every portion of disk that is used in a stripe. This can be simulated with Solaris Volume Manager software by using soft partitions. There are no performance implications, but this technique can use an excessively large number of meta devices, especially as the number of devices in a stripe increases.
Consider an example where 200 volumes are created as stripes across 16 drives. If a soft-partition is created for each subdisk, a total of 16*200+200 = 3400 meta devices is needed. If the soft-partitions are created on top of a striped hard partition, only 1+200 = 201 meta devices are needed.
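The two counts can be compared for other configurations with a short Python sketch. The function names are hypothetical. The formulas are the ones used in the example above.

```python
def devices_with_subdisk_soft_partitions(volumes, drives):
    """One soft partition per drive per volume, plus one stripe per volume."""
    return drives * volumes + volumes

def devices_with_soft_partitions_on_stripe(volumes):
    """One striped hard partition, plus one soft partition per volume."""
    return 1 + volumes

if __name__ == "__main__":
    volumes, drives = 200, 16
    print("soft partition per subdisk:",
          devices_with_subdisk_soft_partitions(volumes, drives))
    print("soft partitions on one stripe:",
          devices_with_soft_partitions_on_stripe(volumes))
```

The first layout grows with the product of volumes and drives, while the second grows only with the number of volumes.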
By default, Solaris Volume Manager software can create only 128 meta devices. You can increase this to a maximum of 8192 by modifying the nmd field in the /kernel/drv/md.conf file. For the change to take effect, the machine must be rebooted with boot -r. This procedure is discussed in more detail in the Solaris Volume Manager software administration guide.