- SCSI-2 Reservation Issues
- Performance-Gathering and Hardware Agents Within a VM
- Data Store Performance or Bandwidth Issues
- Other Operational Issues
- Conclusion
Data Store Performance or Bandwidth Issues
Because storage bandwidth is a finite, shared resource, it is important to give every data store as much bandwidth as possible and to use that bandwidth sparingly on each data store.
"As much bandwidth as possible" and "use sparingly" may sound contradictory, but from an operational perspective they are not. Normal operation of a VM often includes full-disk virus scans, backups, spyware scans, and other extremely disk-intensive activities. Although none of these activities requires any form of locking on the data store where the VMDK resides, they all consume a serious amount of bandwidth. The bandwidth requirement of a single VM is small compared to that of an ESX host running many VMs, but disk activity is additive. What you do within one VM, from a disk perspective, affects all other VMs on the same data store and, depending on the storage solution, VMs on other data stores as well.
How is this possible? Think about the links involved. With traditional iSCSI or NFS over the network, your ultimate bandwidth is limited by the speed of the links used, so a single gigabit Ethernet link is much more limited than links that use host bus adapters (Fibre Channel or iSCSI). This is why it is important to have as much bandwidth as possible, including load balancing of the storage links for each data store in use. If you have access to a multipath plug-in (MPP) driver, you may also be able to aggregate your storage links into one larger trunk to your storage device and thereby increase your overall storage bandwidth. Even with MPP and bandwidth aggregation, load balancing, whether done by hand or automatically, is a step in the right direction.
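The additive nature of disk activity can be estimated with simple arithmetic. The following Python sketch is illustrative only: the function name `link_utilization` and the per-VM throughput figures are assumptions, not measurements, and 125 MB/s is just the nominal line rate of a single gigabit Ethernet link before any protocol overhead. It also assumes that aggregated links scale perfectly, which real multipathing may not achieve.

```python
# Nominal line rate of one gigabit Ethernet link: 1000 Mb/s / 8 = 125 MB/s.
GIGABIT_MB_PER_S = 125.0


def link_utilization(vm_demands, link_capacity, link_count=1):
    """Return total VM disk demand as a fraction of available link capacity.

    vm_demands    -- per-VM disk throughput in MB/s (hypothetical figures)
    link_capacity -- capacity of one storage link in MB/s
    link_count    -- number of links, assumed to aggregate perfectly
    """
    if link_capacity <= 0 or link_count < 1:
        raise ValueError("link capacity and link count must be positive")
    total_demand = sum(vm_demands)  # disk activity is additive across VMs
    return total_demand / (link_capacity * link_count)


# Ten VMs all running a scan at an assumed 20 MB/s each, at the same time.
scan_load = [20.0] * 10

one_link = link_utilization(scan_load, GIGABIT_MB_PER_S)
two_links = link_utilization(scan_load, GIGABIT_MB_PER_S, link_count=2)

print(f"one link:  {one_link:.0%}")   # above 100% means the link is oversubscribed
print(f"two links: {two_links:.0%}")
```

Under these assumed numbers, no single VM comes close to filling the link, yet ten simultaneous scans oversubscribe it. Adding a second link or staggering the scans brings the load back under capacity.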
Staggering storage-intensive activities in time greatly reduces the strain on the storage environment. Staggering across ESX hosts also helps, as long as each ESX host is working against a different data store. For example, backing up VMs that reside on the same LUN but on different ESX hosts at the same time would cause locking issues, unless the backups use in-VM agents, in which case no locking issues exist. Locking should be avoided. Virus scans, by contrast, cause few issues when run in multiple VMs on the same LUN from multiple ESX hosts, because operations on the VMDK do not cause locks at the LUN level. By running backup and vStorage-based antivirus tasks on different ESX hosts, you use different links to the SAN, which spreads your overall bandwidth usage across multiple links and, ideally, uses less of each link than running everything on a single ESX host would.
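The staggering rule can be pictured as a small scheduling problem. The Python sketch below is a minimal illustration, not a backup product's algorithm: the `Job` tuple, the `stagger` function, and all VM, host, and LUN names are hypothetical. It uses a greedy first-fit pass that places each job in the earliest time slot where neither its ESX host nor its data store is already busy. It treats every job as one that could lock the LUN and does not model in-VM agents.

```python
from collections import namedtuple

# One disk-intensive task (for example, a backup) for one VM. Names are hypothetical.
Job = namedtuple("Job", ["vm", "host", "datastore"])


def stagger(jobs):
    """Greedy first-fit: assign each job to the earliest time slot in which
    no other job uses the same ESX host or the same data store."""
    slots = []  # slots[i] is the list of jobs that run together in time slot i
    for job in jobs:
        for slot in slots:
            conflict = any(
                other.host == job.host or other.datastore == job.datastore
                for other in slot
            )
            if not conflict:
                slot.append(job)
                break
        else:
            # Every existing slot conflicts, so open a new, later slot.
            slots.append([job])
    return slots


jobs = [
    Job("vm1", "esx1", "lun1"),
    Job("vm2", "esx2", "lun1"),  # same LUN as vm1, different host
    Job("vm3", "esx2", "lun2"),
    Job("vm4", "esx1", "lun2"),  # same LUN as vm3, different host
]

plan = stagger(jobs)
for index, slot in enumerate(plan):
    print(f"slot {index}: " + ", ".join(f"{j.vm}@{j.host}/{j.datastore}" for j in slot))
```

With this input, each time slot ends up with two jobs that share neither a host nor a LUN, so both storage links are in use but no LUN is touched from two hosts at once. A first-fit pass does not guarantee the fewest possible slots. It only guarantees that jobs in the same slot do not share a host or a data store.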
Running disk-intensive tools within a VM can produce results similar to those seen with SCSI reservation conflicts, such as overloaded links that return errors instead of completing the requested operation. These failures are not SCSI reservation conflicts. They are load issues: the SAN or NAS is overworked and therefore presents failures that resemble SCSI-2 reservation conflicts.