vSphere Storage Design for Maximum SQL Performance
We have so far covered SQL Server VM storage architecture from the database down to the data store. We are now ready to dive into VMware vSphere storage design and physical storage design to achieve maximum performance. This section will build on what we’ve covered already and help you to design an underlying storage architecture that supports your high-performance SQL Server systems on top of it. We will cover the impacts of number of data stores, data store queues, storage performance quality of service (QoS), storage device multipathing, RAID, and storage array features such as auto-tiering.
Number of Data Stores and Data Store Queues
The number of data stores you specify for your SQL Servers has a direct impact on the number of VMs and hosts that you can support in a vSphere cluster. The maximum number of data stores per host is 256, and all data stores should be visible to all hosts in a single cluster to ensure features such as VMware HA and DRS function correctly. For SQL Servers that will have a low IO requirement, you may be able to host a number of them on a single data store. This is one of the great benefits of using VMFS data stores over RDMs. Ultimately the number of data stores you need depends on how many IOPS you can get from a single data store, the combined IOPS and queue depth (QD) requirement of the VMs, and the queue depth you have configured per LUN on each vSphere host. For example, if each SQL Server consumes six LUNs or data stores and you can support four SQL Servers per host, your vSphere cluster would be limited to 10 hosts, plus one host for failure.
The IOPS for a particular data store is usually measured and specified in terms of IOPS per TB. This makes it very easy to explain to application owners what performance they should expect from their storage related back to the capacity. However, the calculation can become a little more complicated when features such as array auto-tiering, compression, and de-duplication are used. As part of designing your storage environment, we recommend you specify an SLA for each type of data store that is backed by a different class of storage (or different storage policy). As part of the SLA, calculate the IOPS per TB achievable and make this known to the application owners. Knowing the IOPS per TB achievable and required will also help if you are looking to host any SQL servers in a cloud environment. Whatever the IOPS per TB is for a particular data store, it will potentially be divided by the number of hosts sharing the data store, so you will most likely not be able to run a single host to the limit, unless there is only one VM on the data store.
In many cases, you can reduce the number of data stores you need to manage by increasing the queue depth per HBA LUN on each vSphere host. This allows you to place additional virtual disks on the data store, but without sacrificing the aggregate number of available storage IO queues. We recommend you do not increase the aggregate queue depth to the storage processors. By this we mean that by reducing the number of LUNs and increasing the queue depth per LUN, the total queue depth to the storage processor ports should be the same.
In Table 6.9, where the data store maximum number of VMs per host is 1, the maximum VMs on a given data store is effectively the maximum number of hosts that can be supported in a cluster. To increase the aggregate amount of active IOs per VM, you need to increase the number of LUNs and ensure VMs sharing those LUNs are split across hosts.
Table 6.9 Calculating Load on a VMFS Volume for Sample Configurations
Max Outstanding IO per LUN (n) |
Avg. Active IO per VM (a) |
LUN Queue Depth (d) |
Max VM per Host m=(d/a) |
Max VM on Data Store (n/a) |
256 |
4 |
32 |
8 |
64 |
256 |
4 |
64 |
16 |
64 |
1,024 |
4 |
64 |
16 |
256 |
256 |
32 |
32 |
1 |
8 |
1,024 |
32 |
32 |
1 |
32 |
1,024 |
1 |
32 |
32 |
1,024 |
256 |
64 |
64 |
1 |
4 |
1,024 |
64 |
64 |
1 |
16 |
Table data sourced from http://www.vmware.com/files/pdf/scalable_storage_performance.pdf, with additional scenarios added.
You don’t just have to worry about your maximum LUN queue depths. You also have to consider the queue depths of your HBA. Many HBAs have a queue depth of 4,096, which means you’d only be able to support 64 LUNs per host at a queue depth of 64, assuming all queues were being used. Fortunately, this is rarely the case, and overcommitting queues at the host level has less drastic consequences than overcommitting queues at the storage array level. Any IOs that can’t be placed into the HBA queue will be queued within your vSphere host, and the consequence is increased IO latency, the amount of which will depend on your IO service times from your storage. Queuing inside your vSphere host can be determined by monitoring the QUED value and KAVG in ESXTOP. Recommended thresholds for average and maximum values can be found in Chapter 10.
The LUN queue depth isn’t the only value that you may need to modify in order to increase performance from your data store. The LUN queue setting goes hand in hand with the VMware vSphere advanced parameter Disk.SchedNumReqOutstanding (DSNRO). DSNRO is used to control the queue maximum depth per VM when there are multiple VMs per data store. The goal of this setting is to ensure fairness of IO access between different VMs. When there is only one VM per VMFS data store, the LUN queue depth will always be used. In vSphere, Disk.SchedNumReqOutstanding is a global value up until vSphere 5.5. In vSphere 5.5, Disk.SchedNumReqOutstanding is specified on a per-device basis. This setting is modified dynamically, as is the LUN queue depth when Storage IO Control is enabled on a data store with multiple VMs that is experiencing performance constraints.
Figure 6.23 shows the different queues at each level of the vSphere storage architecture. The two values that are usually worth monitoring as a vSphere admin are the AQLEN and the DQLEN. DQLEN can be adjusted up or down, depending on your requirements. For high-IO SQL Server systems where PVSCSI is used on VMDKs, we suggest you set the DQLEN to 64 as a starting point, while taking into account our previous recommendations when modifying queue depths.
Figure 6.23 VMware vSphere storage queues.
Figure 6.24 shows the different areas where storage IO latency is measured and the relevant values inside vSphere. DAVG, which is the device latency, will indicate if you have a bottleneck in your storage array, which may mean you need to add more disks or reduce the load on that device. If you start to see KAVG constantly above 0.1ms, this means the vSphere kernel is queuing IOs and you may need to increase device queue depth, especially if the DAVG is still reasonable (< 10ms).
Figure 6.24 VMware vSphere storage latency.
We want to optimize the queues through the IO stack so that the disk devices are the constraint, and not the software or queues higher in the stack. Periodic spikes in DAVG and KAVG are acceptable, provided the averages are not consistently high. Brief spikes in DAVG and KAVG are acceptable; however, high average values are a sign of a performance problem. Suggested thresholds are listed in Chapter 10.
Number of Virtual Disks per Data Store
This section is only relevant if you’re building standalone SQL Server or using AlwaysOn Availability Groups with virtual disks (VMDKs). SQL FCI requires RDMs, and therefore each drive is mapped to a LUN and you can’t share the LUN with multiple VMDKs. You can, however, share the LUN and drive with more data files and achieve the balance of outstanding IOs to queue depth that way.
The number of VMDKs per data store will be limited by the performance characteristics of the data store and the performance requirements of the VMs and their VMDKs. Our primary goal when we decide on the number of VMDKs per data store is to try and balance the average number of active outstanding IOs per host with the queue depth of the data store. In most cases, not all VMDKs will use all their available queue depth all of the time, and not all VMs will use their available queue depth all the time either, but they may have peaks. We need to be able to handle these peaks within a reasonable time in terms of the IO latency or service time.
The example in Figure 6.25 shows a configuration where two VMDKs are on each data store. Each VMDK has a queue depth of 64, resulting in an over-commitment in queue depth of 2:1 from the VMDKs to the data store. On average, each VMDK will be able to issue 32 outstanding IOs (assuming they’re on the same vSphere host) before any additional queuing occurs in the vSphere kernel. If one VMDK is idle, the other VMDK can issue the maximum number of outstanding IOs to the data store and make use of the full queue depth. This may seem to be a rather conservative number of VMDKs per data store, but for very-high-performance systems this (or even 1:1 VMDK to data store) may be necessary to achieve the performance requirements.
Figure 6.25 Two VMDK per data store.
The example in Figure 6.26 shows a queue depth over-commitment of 4:1, assuming all VMDKs from a single VM on the single vSphere host. Each VMDK would be able to issue on average 16 outstanding IOs, while if the other VMDKs are idle an individual VMDK will be able to fill the entire queue.
Figure 6.26 Four VMDK per data store.
This is quite possibly fine for a single host and a single VM for this data store. But a data store is shared between all hosts in the cluster. If we only host a single VM on the data store and only on a single host, we are not able to utilize all of the queue depth that is usually available at the storage array. This assumes that the physical LUN configuration can support a higher aggregate queue depth and higher IOPS at the storage array level. If your backend storage is already performance constrained by its configuration, adding more queue depth and more VMs and VMDKs to the data store will only serve to increase latencies and IO service times.
Figure 6.27 shows two SQL VMs on two different ESXi hosts accessing the same data store. In this scenario, because each host has a LUN queue depth of 64, the combined queue depth to the LUN at the storage array could be up to 128. Provided the LUN can support the additional queue depth and IOPS without increasing latency, this would allow us to extract more performance from the same LUN, while reducing the number of LUNs that need to be managed. For this reason, sharing data stores between multiple VMs and VMDKs across multiple hosts can produce more optimal performance than alternatives. But it is important to make sure that each VM gets a fair share of the performance resources of the data store.
Figure 6.27 Multiple VMs on different ESXi hosts per data store.
Storage IO Control—Eliminating the Noisy Neighbor
One of the potential impacts of working in a shared storage environment is having one VM monopolize storage performance resources to the detriment of other VMs. We call this the Noisy Neighbor Effect. If one VM suddenly starts issuing a lot more IO than all the other VMs, it could potentially slow down other VMs on the same data store, or on the same array. To combat this problem, VMware introduced Storage IO Control (SIOC) in vSphere 4.1 and has made enhancements to it in vSphere 5.x.
Where there is more than one VM sharing a data store and SIOC is enabled, if the latency exceeds a threshold (default 30ms), vSphere will take action to reduce the latency. The way it reduces the latency is by dynamically modifying the device queue depth of each of the hosts sharing the data store. What it is doing is in effect trading off throughput for latency. The result is, individual VMs may see higher latency from storage but they each get their fair share of the storage performance resources.
SIOC should be activated only to deal with unexpected peaks of IO activity and should not be stepping in all the time. SIOC should be seen as more of a last resort. If you observe higher latency in your VMs and SIOC working constantly, this is an indication that your data store or storage platform can’t support the required IO workload. You may need to add more physical disks to your storage platform or reorganize some of the LUNs to reduce hot spots.
As shown in Figure 6.28, if one VM or one host begins to monopolize the available performance resources, the other VMs sharing the same data store or storage array suffer.
Figure 6.28 The Noisy Neighbor Effect.
In some cases, it’s not just that other VM’s performance suffers, but other more important VMs sharing the same data store don’t get the IO resources they are entitled to.
Figure 6.29 provides an example where three VMs share the same data store. One important VM and a less important VM share a vSphere host, while another less important VM is on another vSphere host. The relative importance is defined by the shares value, which uses a proportional share algorithm to carve up the performance resources. Because this doesn’t work across hosts, the less important VM on its own host has full access to the available queue depth and therefore is getting more than its fair share of IO performance resources.
Figure 6.29 Storage congestion without SIOC.
With Storage IO Control activated, the proportional share of resources and fairness are enforced across all hosts and for all VMs that share the same data store. In the example shown in Figure 6.30, SIOC takes action to reduce the queue depth that the less important VM has access to and to ensure that the most important VM gets its full entitlement to the available IO resources of the data store. Because Storage IO Control is only going to become active when there is congestion on the data stores, it is perfectly safe to use with array auto-tiering. SIOC will simply balance out the latency while the array moves blocks around if the operations cause any latency spikes.
Figure 6.30 Storage controlled with SIOC.
In vSphere 5.5, Storage IO Control uses an injector that periodically tests the performance capability of a given data store and can dynamically change the thresholds it uses to determine data store congestion. If you prefer to use the traditional method of a static latency threshold, you still can. The static latency threshold will be preferable if your storage array is using sub-LUN auto-tiering, where blocks of data may be migrated to different types of storage dynamically based on the performance profile requirements. If you used the injector method to determine congestion in conjunction with an auto-tiering array, there is a high probability the injector would get inaccurate data, because sometimes it would hit high-performance blocks and sometimes it would hit low-performance blocks.
We recommend you enable SIOC as a standard on all of your data stores when using traditional block-based storage arrays, regardless of whether or not they are hosting more than one VM. This will ensure if things change in the future you know that your VMs will always receive their fair share of the storage IO performance resources available. If you have an auto-tiering array, we would recommend using the traditional default values of 30ms for the static latency threshold and not using the injector with vSphere 5.5.
Figure 6.31 shows the vSphere 5.5 Storage IO Control Settings dialog box. By setting SIOC to Manual, you effectively disable the injector, which is the preferred setting when using auto-tiering arrays, or storage platforms where the injector is likely to get inaccurate data.
Figure 6.31 vSphere 5.5 Storage IO Control settings.
vSphere Storage Policies and Storage DRS
With vSphere 5.x, we use Storage Policies and Storage DRS not only to reduce management overheads in a vSphere environment but also to improve performance. By using vSphere Storage Policies, you can take some of the guesswork out of provisioning your SQL Servers. By creating policies that align to the IOPS per TB and protection or availability requirements of your databases, it becomes very easy to provision new databases to the correct storage to achieve their requirements. You can manually assign storage capabilities to data stores and then create policies for those capabilities. Alternatively, you can use a storage vendor provider that leverages the vSphere API for Storage Awareness (VASA) to automatically provide visibility of the capabilities to vSphere. With VASA, when LUNs are provisioned at the physical array, the capabilities will flow through to vCenter. Storage Vendor Providers and storage capabilities are then visible when creating data stores. This allows vSphere Storage Administrators to easily include the correct storage into the correct data store, and this can later be used to create data store clusters.
Figure 6.32 shows a virtual data center where there are three possible storage policies that could be used based on the requirements of the SQL Server. You might choose Gold or Silver for different production or test database systems and you might choose Bronze for development databases. Your policies would be based on your particular requirements. To make it easy to architect for storage performance, the IOPS per TB should be known for each storage policy, and this should be communicated to all the key stakeholders until it is clear what they are getting when they provision VMs.
Figure 6.32 vSphere storage policies.
By pooling multiple (up to 32) similar data stores into data store clusters and using Storage DRS, you can ensure that initial placement of virtual disks to the best data store is automated, and this reduces the number of individual elements you need to actively manage. Storage DRS can be configured to load balance based on capacity, IO performance, or both, and can be set to simply make recommendations (manual) or be fully automated. If your array does not include automated storage block tiering, you can use Storage DRS to load balance data stores for IO performance, in addition to simply load balancing for capacity. When IO Load Balancing is enabled, Storage DRS works cooperatively with Storage IO Control and will collect IO metrics from the data stores and uses the IO injector to determine performance capabilities. The data is then analyzed periodically (by default, every 8 hours) to make IO load-balancing decisions. Importantly, the cost of any storage migrations is taken into consideration when making IO load-balancing decisions. Load balancing based on capacity or IO is achieved by performing Storage vMotion migrations between the source and destination data stores within a data store cluster.
The example shown in Figure 6.33 is of the standard storage DRS options, including the Storage DRS Automation Level, configured for Fully Automated, and the I/O metrics settings, which are disabled. You may wish to set Storage DRS to No Automation (Manual Mode) for a period of time during operational verification testing or if you are unfamiliar with Storage DRS and data store clusters, until you are familiar and comfortable with the recommendations it makes.
Figure 6.33 vSphere Storage DRS options.
The example in Figure 6.34 shows the Storage DRS Advanced Options expanded. Here, you can set whether to keep VMDKs together by default and other settings. These parameters will influence how much of an imbalance there needs to be before Storage DRS will consider taking action. The most relevant settings for SQL Server are “Keep VMDKs together by default” and the advanced option shown in this figure, “IgnoreAffinityRulesForMaintenance.”
Figure 6.34 vSphere Storage DRS advanced options.
The default option for Storage DRS will keep all VMDKs from a VM on the same data store. For a high-performance database, this is not what you would want. You will want to leverage the available data stores and queue depth to get the best performance while Storage IO Control sorts out any bumps in the road and ensures quality of service. Our recommendation for SQL Server environments is to have Keep VMDKs Together unchecked. This will cause Storage DRS to spread out the VMDKs among the available data stores. If you have large numbers of SQL Servers, it may be preferable to run them in a dedicated data store cluster, because this could limit the impact they have on other workloads, and vice versa.
If at a later stage you want to add data store performance as well as capacity, you can simply add more data stores to the data store cluster and they will be used for load-balancing operations per VMDK as well as during initial placement. Separating the VMDKs among the data stores will ensure quality of service and access to performance of all the databases added to the data store cluster while making administration and management significantly easier. We would recommend you leave the IgnoreAffinityRulesForMaintenance advanced setting at 0, unless you are willing to compromise your affinity rules and performance during data store maintenance operations.
In Figure 6.35, we have combined storage policies with multiple data store clusters. With the different virtual disks of each VM configured with a storage policy based on the required capabilities, the storage policy then maps to a particular data store cluster. Whenever a new VM is provisioned, its virtual disks will be provisioned in the correct data store cluster. The advantage of this method is that you can have the different VMDKs of a VM on a different class of storage—for example, where you want backup on a lower tier, or the OS on a lower tier, while the database files and transaction logs files are on a higher tier.
Figure 6.35 vSphere storage policies with data store clusters.
Having the flexibility of provisioning the VM to multiple storage clusters and different classes of storage sounds okay at a distance, but it also introduces additional management overheads. In storage platforms that already do automated block tiering, there is limited benefit to this approach. This approach is also difficult in Infrastructure as a Service (IaaS) environments or Cloud environments (including VMware vCloud Director or vCloud Automation Center), in which case a single VM may only be associated with a single storage profile, and automated tiering is used to manage the performance of the particular VM within the defined physical storage policy.
The sample diagram in Figure 6.36 shows multiple SQL Server VMs entirely within a single data store cluster, which would be backed by a single class of storage or single physical storage policy. Each VM’s individual VMDKs would be split among the data stores of the data store cluster. Storage Policies on each VM would dictate which data store cluster the SQL Server is assigned, but an individual VM is not split between multiple data store clusters, as was the case in Figure 6.35. This is the recommended approach in environments that support automated block tiering at the storage array.
Figure 6.36 Multiple SQL Servers—single vSphere storage cluster.
This design ensures simplified management and operations while ensuring the appropriate performance of the group of SQL Servers. It is also compatible with IaaS environments and use with Cloud environments, such as VMware vCloud Automation Center and VMware vCloud Director. You may still support multiple storage policies and service levels for the storage, each being a different data store cluster. But the VMs that map to those policies would be entirely contained within the relevant data store cluster.
vSphere Storage Multipathing
Each block storage device (FC, FCoE, iSCSI, and so on) on the VMware Hardware Compatibility List (HCL, http://www.vmware.com/go/hcl) leverages VMware Native Multipathing (NMP) and will have a Storage Array Type Plugin (SATP) and a default Path Selection Policy (PSP). The default SATP and PSP for your storage device will depend on the vendor, and in some cases it will use a VMware generic SATP, such as VMW_DEFAULT_AA. The PSPs that are part of the built-in VMware NMP are referred to as initiator-side load-balancing or path selection policies. This is because all path selection decisions are made from the host only.
There are three built-in PSPs to choose from: VMW_PSP_MRU, VMW_PSP_FIXED, and VMW_PSP_RR. To get the best performance out of your storage and provide the highest performance and lowest latency to SQL Server, we recommend you use the VMware Round Robin PSP (VMW_PSP_RR) where possible. Your storage vendor may have a particular best practice with regard to advanced options when using Round Robin that you should follow.
VMware has designed vSphere’s storage multipathing to be flexible and to allow storage vendors to write their own multipathing plugins. The advantage of many of the third-party vSphere multipathing plugins, such as EMC’s PowerPath/VE, is that they use target-side load balancing. This is where the load on the storage array’s paths, storage processors, and individual queue depths may be taken into consideration when choosing the best path for a particular IO operation. This can produce greatly improved performance and lower latency. Many vendors offer their own plugins, so you should check with your storage vendor to see if they have a plugin and what advantages it might have for your environment. Most of these plugins come at an additional cost, but in our experience it can usually be justified based on the additional performance.
The VMware vSphere Native Multipathing modules eliminate a lot of the problems and complications traditionally associated with in-guest multipathing drivers. To simplify your environment further, you could choose to put your VMDKs onto NFS data stores mounted to vSphere. When using NFS, your load balancing will most likely be done on the array, or by using the correct network teaming. NFS as a data store instead of VMFS is a great solution, provided it is designed and deployed correctly to meet the performance needs of your SQL Servers. The protocol itself will not be your limiting factor for performance, especially on 10GB Ethernet. Whichever storage option or protocol you choose, you just need to design it to meet your performance requirements and verify through testing that it does. There are many situations where NFS could be a valid option, and some of the benefits are covered in the section “SQL Server on Hyperconverged Infrastructure.”
vSphere 5.5 Failover Clustering Enhancements
In response to customer demands for increasing levels of database availability over and above the 99.9% easily obtainable with vSphere HA, VMware has provided a number of enhancements to the support of Windows Failover Clustering over the years. From vSphere 5.1, VMware supported five-node Windows Failover Clusters, where it previously supported only two nodes. In vSphere 5.5, VMware has again enhanced the Windows Failover Clustering support, and this is particularly relevant to high-performance SQL server databases that wish to make use of AlwaysOn Failover Cluster Instances.
Figure 6.37 shows the enhancements available when using AlwaysOn Failover Cluster Instances on vSphere 5.5.
Figure 6.37 vSphere 5.5 failover clustering enhancements.
The new failover clustering support in vSphere 5.5 means you can use the Round Robin multipathing policy to load-balance multiple active storage paths, Windows Server 2012 clusters are fully supported (up to five nodes when using RDMs), and FCoE and iSCSI protocols in addition to FC are supported for the RDMs.
RAID Penalties and Economics
Most storage arrays in use today use RAID (Redundant Array of Independent Disks) as a way to protect data from physical disk failures. Even though many newer storage arrays are starting to use different techniques for data protection, it’s still important to understand RAID. Using the right RAID levels can have a big impact on performance and also on cost of your SQL environments and virtual infrastructures. This section more than any other will clearly demonstrate how designing for performance will take care of capacity, at least where using RAID is involved, especially as you reach for even higher performance from your SQL databases. Before we discuss RAID penalties, we will cover some IO characteristics that have a direct performance impact when used with RAID.
Randomness of IO Pattern
The randomness of IO is a very important consideration in storage design and has a direct impact on IO latency and throughput when using spinning disks. Most virtualization environments will generate a completely random IO pattern, even with sequential IO from individual VMs, as we covered previously in “The IO Blender Effect.” This is because the underlying VMFS data stores are shared between multiple VMs in most cases. With SQL Server, you will have cases where VMs should still share some common VMFS data stores, as we have covered, in order to get maximum performance utilization from your arrays.
The reason that random IOs have such an impact is because the disk heads have to move between different sectors and the disk has to spin around to the correct location for a block to be read from or written to. For this reason, the average seek time and rotational speed of the disks are very important. On average, the disk heads will need to wait for 50% of the disk to spin past it prior to performing a read or write operation. Each operation is then multiplied by the RAID penalties of that operation.
The impact of randomness on reads can be worse than the randomness for writes. In most storage systems, writes will be cached (backed by battery or some other persistent form), ordered, and then written to physical disk in a way that reduces the overall impact. For reads, however, the chances of getting a cache hit in your array when the randomness increases are very low; therefore, most reads may have to come from spinning disks. The alternative would be to assign very large read cache on the array, but that is not efficient or cost effective in most cases, and still may not result in significant cache hits. The end result is that many more disks may be needed to get the best read latency and throughput for your database.
Fortunately, SQL is very good at caching, and this is why the buffer pool in a SQL Database is so big. This is also the reason there is a direct tradeoff between assigning SQL Server RAM and using it in the buffer pool and read IO from disk. This becomes especially important when things fail, such as disks in your RAID groups, which causes additional delays and additional latency.
Read/Write Bias
Just because your applications drive SQL to generate a read-biased workload doesn’t mean the underlying storage system will see a read-biased IO pattern. The reason for this is the SQL buffer pool is likely to mask a lot of read IO if you have sized your VM correctly. This will mean your IO patterns may be very write biased. Writes will be going to your data files, Temp DB files, and your transaction log all at the same time. You will need to make sure you have sufficient array write cache so you don’t get into a position of a force flush and a subsequent instance of the cache going write through, which will significantly degrade performance. You must have sufficient numbers of disks in the array to handle the cache flushes easily.
Plan Your Performance for Failure
Your storage system at some point will experience a failure. You need to ensure that your critical SQL systems will perform at the minimum acceptable level during these failure operations. During a disk failure in certain RAID configurations, you will have significantly slower performance for both read and write operations; this is due to parity calculations and the performance required for rebuilding data on replacement disks. Disk rebuild can take a significant amount of time, and during rebuild situations you may have a risk of multiple disk failure. The bigger and slower the disk, the longer the rebuild.
RAID Penalties
Random IO patterns, read/write bias, and failure events have a big impact on performance due to the overheads and penalties for read and write operations associated with using RAID. This is especially so with spinning disks. Storage array vendors have come up with many ways to try and work around some of the limitations with RAID, including the smart use of read and write caches. In your storage design, though, we recommend you plan your performance based on the physical characteristics of the underlying disks and plan for the rest to be a bonus. Table 6.10 displays the IO penalties during normal operations for each of the common RAID schemes.
Table 6.10 RAID IO Penalties During Normal Operations
RAID Level |
RAID Write Penalty |
Read IOPS (15K RPM) |
Write IOPS (15K RPM) |
Read IOPS (7.2K RPM) |
Write IOPS (7.2K RPM) |
RAID 0 |
1 |
175–210 |
175–210 |
75–100 |
75–100 |
RAID 1 |
2 |
175–210 |
88–105 |
75–100 |
38–50 |
RAID 5 |
4 |
175–210 |
44–52 |
75–100 |
18–25 |
RAID 6 |
6 |
175–210 |
30–35 |
75–100 |
12–16 |
RAID 1+0 (10) |
2 |
175–210 |
88–105 |
75–100 |
38–50 |
RAID DP |
2 |
175–210 |
88–105 |
75–100 |
38–50 |
IOPS listed in Table 6.10 are per disk. RAID 0 is included for illustrative purposes only and is not recommended, as it is simple disk striping with no data protection.
As you can see from Table 6.10, if you have a very write-biased workload, you could get very low effective IOPS from your RAID disks. This is the primary reason why arrays have write cache—and in some cases, lots of it. This allows the array to offset much of the penalty associated with writes to RAID groups of disks. But the arrays assume there will be some quiet time in order to flush the cache; otherwise, there will be an impact to performance. The calculation for write IOPS is as follows:
Write IOPS = Disk IOPS / RAID Write Penalty
However, this only works when things are going well. If you fill your cache by having too much write IO on slow spindles, or just from general overloading, your array will stop caching writes and bypass the cache altogether (go write through). In this case, you’ll get at best the raw performance of the RAID groups. This problem can be made worse when there is a disk failure and a group of RAID disks needs to be rebuilt. Depending on the type of disks, this can take many hours and severely impact performance during the rebuild operation.
Let’s take the RAID penalties a bit further and look at an example where we are sizing for performance. In this example, we will look at the requirements of a SQL data store that needs to be able to deliver 5,000 IOPS. We will assume that the workload is 70% read and 30% write, which is typical for some OLTP systems.
First, we need to calculate the effective number of IOPS required. This takes the 5,000 IOPS of a 70/30 read/write workload and adjusts for the RAID penalty as follows:
Required Array IOPS = (Required IOPS * Read %) + RAID Write Penalty * (Required IOPS * Write %) Example RAID 5 Required IOPS = (5000 * 70%) + 4 * (5000 * 30%) = 9500
You can see from the example in Table 6.11 that to achieve 5,000 IOPS for a 70% read-biased SQL workload, we need 9,500 IOPS at RAID 5 from the array. Now that we know the required array IOPS, we can calculate the number of disks required to achieve this performance at each of the RAID levels. To do this, we divide the number of IOPS by the number of IOPS per disk. RAID penalties have already been taken into consideration due to the previous calculations.
Table 6.11 Array IOPS Required at Different RAID Levels to Achieve 5,000 SQL IOPS
RAID Level |
RAID WritePenalty |
Array IOPS (70% Read) |
Array IOPS (50% Read) |
Array IOPS (30% Read) |
RAID 0 |
1 |
5,000 |
5,000 |
5,000 |
RAID 1 |
2 |
6,500 |
7,500 |
8,500 |
RAID 5 |
4 |
9,500 |
12,500 |
15,500 |
RAID 6 |
6 |
12,500 |
17,500 |
22,500 |
RAID 1+0 (10) |
2 |
6,500 |
7,500 |
8,500 |
RAID DP |
2 |
6,500 |
7,500 |
8,500 |
To calculate the number of disks required to meet the required IOPS of a workload, we use the following formula:
Required Disks for Required RAID IOPS = Required Array IOPS / IOPS per Disk
Example RAID 5 Disks = 9500 Array IOPS / 210 IOPS per 15K Disk = 45 Disks
As Table 6.12 demonstrates, to achieve 5,000 SQL IOPS 70% read at RAID 5 on 15K RPM disks requires 45 disks, whereas it only requires 31 disks at RAID 1, RAID 10, or RAID DP—a saving of 14 disks. If the workload is only 30% read, then we would require 74 15K RPM disks at RAID 5 and only 40 15K RPM disks at RAID 1, RAID 10, or RAID DP. This would be a saving of 34 disks to achieve the same performance. This assumes each disk can achieve the high end of the IOPS for that device. The less number of IOPS per disk, the more disks in total will be needed. In this example, we’ve used the high-end IOPS of each disk for the calculations. Be sure to check with your storage vendor on their recommendations for IOPS per disk when doing any calculations.
Table 6.12 Min Disks Required at Different RAID Levels to Achieve 5,000 SQL IOPS
RAID Level |
15K RPM Disk 70% Read |
15K RPM Disk 30% Read |
7.2K RPM Disk 70% Read |
7.2K RPM Disk 30% Read |
RAID 0 |
24 |
24 |
50 |
50 |
RAID 1 |
31 |
40 |
65 |
85 |
RAID 5 |
45 |
74 |
95 |
155 |
RAID 6 |
60 |
107 |
125 |
225 |
RAID 1+0 (10) |
31 |
40 |
65 |
85 |
RAID DP |
31 |
40 |
65 |
85 |
To achieve 5,000 IOPS at RAID 6 70% read on 7.2K RPM disks, we’d need 125 disks in total. At RAID 10 on 7.2K RPM disks, the required disks falls to 65, a saving of 60 disks. The difference is even more pronounced when the workload is only 30% read. At RAID 6, we would require 225 disks, whereas at RAID 10, we would only require 85 disks—a saving of a whopping 140 disks.
Those of you who know RAID will be thinking at this point that some of the numbers in Table 6.12 are wrong, and you’d be right. How do you get 31 disks in RAID 1 or 10, or 225 disks in RAID 6? The answer is, you don’t. The numbers in Table 6.12 have not been adjusted for the minimum required for a complete RAID group, or the likely size of each RAID group that would be required to make up an entire aggregate or volume to be created from. You would need to increase the numbers of disks to be able to build complete RAID groups. For example, in RAID 5, it’s common to build RAID groups consisting of 7 data disks +1 parity disk (8 total), and in RAID 6, it is common to build 8+2 or 10+2 RAID groups. RAID5 7+1 or RAID6 10+2 may be terms you’ve heard before when talking to storage administrators.
Now that we’ve adjusted the figures in Table 6.13 for the RAID groups, you can see that RAID 1 and 10 are even more efficient than RAID 5 and 6 in terms of the number of disks to achieve the same performance. This is important to understand because it also has a direct impact on the amount of capacity that will be provisioned to reach the desired performance level.
Table 6.13 Min Disks per RAID Group Adjusted to Achieve 5,000 SQL IOPS
RAID Level |
15K RPM Disk 70% Read |
15K RPM Disk 30% Read |
7.2K RPM Disk 70% Read |
7.2K RPM Disk 30% Read |
RAID 0 |
24 |
24 |
50 |
50 |
RAID 1 |
32 (1+1) |
40 (1+1) |
66 (1+1) |
86 (1+1) |
RAID 5 |
48 (7+1) |
80 (7+1) |
96 (7+1) |
160 (7+1) |
RAID 6 |
60 (8+2) |
110 (8+2) |
130 (8+2) |
230 (8+2) |
RAID 1+0 (10) |
32 (1+1) |
40 (1+1) |
66 (1+1) |
86 (1+1) |
RAID DP |
36 (10+2) |
48 (10+2) |
72 (10+2) |
96 (10+2) |
For this part of the example, we’ll imagine that our SQL database that needs 5,000 IOPS will be 2TB in size. There will be an additional 200GB for transaction logs, 200GB for Temp DB, and another 100GB for the OS, page file, and so on. In totally, the capacity required is approximately 2.5TB.
From Table 6.14, you can see the usable capacity after taking into consideration the redundant or parity disks of the various RAID types needed to achieve 5,000 IOPS based on the previous examples. The 2.5TB usable capacity requirement for our sample SQL Server can easily be met by any of the selected RAID levels based on the number of disks required to achieve 5,000 IOPS. In fact, all of the RAID levels provide a lot more capacity than is actually required—some in the extreme.
Table 6.14 Usable Capacity Deployed to Achieve 5,000 SQL IOPS
RAID Level |
300GB 15K RPM Disk 70% Read |
300GB 15K RPM Disk 30% Read |
1TB 7.2K RPM Disk 70% Read |
1TB 7.2K RPM Disk 30% Read |
RAID 0 |
7.2TB |
7.2TB |
50TB |
50TB |
RAID 1 |
4.8TB |
6TB |
33TB |
43TB |
RAID 5 |
12.6TB |
21TB |
84TB |
140TB |
RAID 6 |
14.4TB |
26.4TB |
104TB |
184TB |
RAID 1+0 (10) |
4.8TB |
6TB |
33TB |
43TB |
RAID DP |
9TB |
12TB |
60TB |
80TB |
Table 6.14 shows that a large amount of the deployed usable capacity is actually unusable from a performance perspective. Or to put it another way, you have way too much capacity at the end of your performance. This clearly demonstrates Principle 3 of sizing for performance, and in doing so, capacity will usually take care of itself.
Now that we have calculated the usable capacity that needs to be provisioned to achieve the 5,000 SQL IOPS, we can calculate the IOPS per TB. As mentioned, previously using IOPS per TB is a good way to communicate with applications teams how much performance they should expect for each TB of data based on the different available storage policies available. For this example, we will take a conservative approach so that the application teams are planning on a worst-case scenario and their performance surprises will be positive. You’ll remember the quote from Principle 3: “The bitterness of poor performance lasts long after the sweetness of a cheap price is forgotten.”
To illustrate this, we will define three tiers of storage or storage policies:
- Gold—RAID10 300GB 15K RPM disks
- Silver—RAID5 7+1 300GB 15K RPM disks
- Wood—RAID6 8+2 1TB 7.2K RPM disks
We will base the IOPS per TB calculation on a 30% read-biased IO pattern. This will mean our DBAs and applications teams will likely get better performance than the defined service level.
Based on our example in Table 6.15, we could set an SLA for Gold at 800 IOPS per TB, Silver at 200 IOPS per TB, and Wood at 20 IOPS per TB. We have rounded down to take a conservative approach and ensure the SLA can always be met.
Table 6.15 IOPS per TB Based on Example 30% Read Workload at 5000 IOPS
Storage Policy |
Disks Required for 5000 IOPS |
Usable Capacity |
IOPS per TB |
Recommended SLA IOPS/TB |
Gold (RAID10) |
40 |
6TB |
833 (5000 / 6) |
800 |
Silver (RAID5 7+1) |
80 |
21TB |
238 (5000 / 21) |
200 |
Wood (RAID6 8+2) |
230 |
184TB |
27 (5000 /184) |
20 |
The Economics of RAID Performance
You have seen how performance requirements can drive storage design, and how many spinning disks are required when using different RAID levels to meet performance requirements. In our example, we used a SQL Server requiring 5,000 IOPS and 2.5TB capacity. Now we will look at the economics of different RAID choices and using solid state disks (SSDs) or enterprise flash disks (EFDs) instead of spinning disks.
From Table 6.15, in order to meet a 30% read 5,000 IOPS requirement and a 2.5TB capacity requirement, the Gold storage policy is the most cost effective. It would use half the number of disks to deliver the performance required, and more than covers the capacity requirements. It would be half the cost of the Silver storage policy for this workload. Now let’s take a look at how this might change if EFDs were used instead of spinning disks.
Table 6.16 shows the effective read and write IOPS after accounting for RAID penalties associated with using EFD disks with an assumed 5,000 IOPS per disk.
Table 6.16 RAID IO Penalties During Normal Operations of Enterprise Flash Disk
RAID Level |
RAID Write Penalty |
EFD Read IOPS |
EFD Write IOPS |
RAID 10 |
2 |
5,000 |
2,500 |
RAID 5 |
4 |
5,000 |
1250 |
RAID 6 |
6 |
5,000 |
833 |
Table 6.16 assumes a performance level of 5,000 IOPS for a single EFD disk. Depending on the type of EFD or SSD, these numbers could be very different. You should check with your particular vendor for their latest numbers. Also, it’s quite common for the read and write performance to be different even without the RAID penalties.
Table 6.17 shows the number of EFD disks required at different RAID levels to meet the IOPS as well as the capacity requirements of our sample SQL database workload.
Table 6.17 EFDs at Different RAID Levels Required for Example SQL DB
RAID Level |
RAID Write Penalty |
Array IOPS (30% Read) |
Required # of EFD for IOPS |
# of 400GB EFD for 2.5TB Capacity |
RAID 10 |
2 |
8500 |
2 |
14 |
RAID 5 |
4 |
15500 |
4 |
8 |
RAID 6 |
6 |
22500 |
5 |
9 |
Table 6.17 illustrates the number of EFDs required to meet both the performance and capacity requirements of our sample SQL DB. In this example, the RAID 5 option is the most cost effective from a performance and capacity perspective.
Comparing the number of 400GB EFDs required to meet the SQL requirements against the most cost effective options for spinning disks (Gold Policy RAID 10), we can see that we need five times less EFDs. For this workload, the eight EFDs may be the best option if their combined cost is less than the 40 spinning disks. In many cases, the EFDs will be less cost, especially when the reduced space, power consumption, and cooling of EFDs is considered.
Let’s add a Platinum storage policy in addition to the previous defined policies and calculate the effective IOPS per TB based on our 400GB EFD example.
With the new Platinum storage policy in Table 6.18, we can easily meet the performance requirement of 5000 IOPS, but we need additional disks to meet the capacity requirement. Table 6.15 shows us that we need eight EFDs at 400GB in order to achieve the required 2.5TB. Based on provisioning 2.8TB of usable capacity, we can calculate that our achievable IOPS from that capacity at a conservative 4000 IOPS per TB at RAID5 with write penalty of 4 is 11,200 IOPS. At this point, it’s likely that we’d run out of capacity well before running out of performance.
Table 6.18 IOPS per TB Based on Example 30% Read 5,000 IOPS and 2.5TB Capacity
Storage Policy |
Disks Required for 5000 IOPS |
Usable Capacity |
IOPS per TB |
Recommended SLA IOPS / TB |
Platinum (400GB EFD RAID5) |
4 |
1.2TB |
4,166 (5000 / 1.2) |
4000 |
Gold (300GB 15K RPM RAID10) |
40 |
6TB |
833 (5000 / 6) |
800 |
Silver (300GB 15K RPM RAID5 7+1) |
80 |
21TB |
238 (5000 / 21) |
200 |
Wood (1TB 7.2K RPM RAID6 8+2) |
230 |
184TB |
27 (5000 /184) |
20 |
At this point, you might consider doubling the size of each EFD to 800GB. This would halve the number of disks required to meet the capacity requirements. Assuming that each individual 800GB EFD has the same IOPS performance as the 400GB versions, you could achieve a better balance of performance and capacity. The larger EFDs would have half the IOPS per TB—in this case, to around 2,000. Five EFDs would be required to reach the required capacity. This would mean 3.2TB of usable capacity is deployed. The achievable IOPS from the deployed usable capacity would drop to 6,400. This is still a more performance than required. Also, although we are only using 5 × 800GB EFDs instead of 8 × 400GB EFDs, because they are double the capacity, they are also likely to be double or more the cost.
An EFD might be marketed at 400GB or 800GB in size, but to protect against wear of the NAND flash cells, the disk will usually have more physical capacity. This is to provide more endurance and a longer service life. This may vary between different vendors and individual SSDs, and we recommend you check with your storage vendor.