- SCSI-2 Reservation Issues
- Performance-Gathering and Hardware Agents Within a VM
- Data Store Performance or Bandwidth Issues
- Other Operational Issues
- Conclusion
Performance-Gathering and Hardware Agents Within a VM
Performance and other types of monitoring are important from an operational point of view. Many customers monitor the health of their hardware and servers with hardware and performance agents. Although hardware agents should monitor the health of the ESX host, they should not monitor the health of a VM, because the health of the virtual hardware depends entirely on the physical hardware beneath it. In addition, most hardware agents talk to specific chips, and those chips do not exist inside a VM, so running hardware agents inside a VM will often just slow it down.
Measuring performance is an important operational tool in the virtual environment: it tells you when to invest in a new ESX host and how to balance the load among the ESX hosts you already have. Although there are automated ways to balance the load among ESX hosts (covered in Chapter 11, "Dynamic Resource Load Balancing"), most if not all balancing of VM load across hosts is still performed by hand, because there are more than just a few metrics to review when moving VMs from host to host.
There is an argument that the VMware Distributed Resource Scheduler (DRS) will balance VMs across all hosts, but DRS balances only when CPU contention exists. If you never have contention, you may still want to balance your loads by hand, regardless of your DRS settings.
The first item to understand is that adding a VM to a host impacts the performance of the ESX host, sometimes in small ways and sometimes in ways that are far more noticeable. The second item to understand is how performance tools that run within a VM, within Windows for example, calculate utilization. The guest increments a tick counter in its idle loop and then subtracts that idle time from the system clock time interval. Because the VM is put to sleep when it is idle, the idle-time counter is skewed, which results in a higher utilization figure than is actually the case. Because there are often more VMs than CPUs or cores, a VM shares a CPU with others, and as more VMs are added, the slice of time each VM gets to run on a CPU shrinks further. A greater time lag therefore exists between each use of the CPU, and thus a longer effective CPU cycle. Because performance tools use the CPU cycle to measure performance and to keep time, the data they report is relatively inaccurate. When the system is loaded to the desired level, a set of baseline data should be gathered using VMware vCenter or other performance management tools.
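The following minimal sketch, written in Python with invented figures, illustrates why idle-tick accounting inside a guest overstates utilization; a real guest counts timer ticks rather than seconds, but the arithmetic is the same.

```python
# Illustration only: all numbers are invented.
INTERVAL = 10.0        # measurement interval in wall-clock seconds
busy_in_guest = 2.0    # seconds the guest actually ran its workload
idle_counted = 3.0     # seconds the guest's idle loop ran and counted ticks
descheduled = 5.0      # seconds the hypervisor kept the idle VM off the CPU
assert busy_in_guest + idle_counted + descheduled == INTERVAL

# Seen from outside the VM: busy time divided by the interval.
outside_view = busy_in_guest / INTERVAL              # 0.20 -> 20%

# Seen from inside the VM: the guest only counts idle time while its idle
# loop is running, so the descheduled time is missing from its idle counter.
inside_view = 1.0 - idle_counted / INTERVAL          # 0.70 -> 70%

print(f"Utilization measured outside the VM: {outside_view:.0%}")
print(f"Utilization reported inside the VM:  {inside_view:.0%}")
```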
After a set of baseline data is available, performance tools running inside the VM can determine whether a change in performance has occurred, but they cannot give you raw numbers, only a ratio of change from the baseline. For example, if the baseline CPU utilization measured from within the VM is roughly 20% and it suddenly shows 40%, we know there was a 2x change from the original value. The original value is not really 20%, but some other number, and even though this shows 2x more CPU utilization for the VM, it does not imply a 2x change in actual server utilization. Therefore, to gain performance data for a VM, other tools that do not run from within the VM must be used. VMware vCenter, a third-party tool such as Vizioncore vFoglight, or esxtop from the command line or resxtop from the remote CLI are the tools to use, because they all measure VM and ESX host performance from outside the VM. In addition, they all give a clearer picture of the entire ESX host. The key item to realize is that when an ESX host shows sustained CPU utilization above 80%, as measured by vCenter or one of these tools, a new ESX host is warranted and the load on the existing hosts needs to be rebalanced. The same mechanism can be used to determine whether more network and storage bandwidth is warranted.
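As a minimal sketch of that rule of thumb, the following Python fragment flags sustained host CPU utilization above 80% from a list of samples; the sample values and the six-sample window (30 minutes of 5-minute vCenter historical samples) are assumptions chosen for illustration.

```python
def sustained_overload(samples, threshold=80.0, window=6):
    """Return True if `window` consecutive samples all exceed `threshold` percent."""
    run = 0
    for pct in samples:
        run = run + 1 if pct > threshold else 0
        if run >= window:
            return True
    return False

# Invented 5-minute host CPU samples gathered outside the VMs (vCenter or esxtop).
host_cpu_pct = [62, 71, 85, 88, 91, 86, 84, 89, 90]
if sustained_overload(host_cpu_pct):
    print("Sustained >80% host CPU: add an ESX host or rebalance VMs with vMotion.")
```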
Balancing ESX hosts can happen daily, or even periodically during the day, by using vMotion to migrate running VMs from host to host with zero downtime. Although this can be done dynamically (see Chapter 11), using vMotion and Storage vMotion by hand can give a better view of the system and the capability to rebalance as necessary. For example, if an ESX host's CPU utilization climbs to 95%, the culprit VM needs to be found using one of the tools; once found, that VM can be moved to an unused or lightly used ESX host using vMotion. If this movement becomes routine, it might be best to place the VM on a lesser-used host permanently. This is often the major reason an N+1 host configuration is recommended.
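A sketch of that manual decision, with invented host and VM numbers, might look like the following; the actual migration would still be performed with vMotion through vCenter, not by this snippet.

```python
host_cpu = {"esx01": 95, "esx02": 40, "esx03": 55}        # % CPU per ESX host (invented)
vm_cpu_on_esx01 = {"db01": 48, "web03": 22, "app07": 9}    # % host CPU per VM on esx01

overloaded = max(host_cpu, key=host_cpu.get)               # the 95% host
culprit = max(vm_cpu_on_esx01, key=vm_cpu_on_esx01.get)    # heaviest VM on that host
target = min(host_cpu, key=host_cpu.get)                   # lightest-loaded host

print(f"{overloaded} is at {host_cpu[overloaded]}% CPU.")
print(f"Move {culprit} to {target} ({host_cpu[target]}%) using vMotion.")
```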
Deployment of VMs can increase CPU utilization. Deployment is discussed in detail in a later chapter, but the recommendation is to create a deployment server that can see all LUNs. This server would be responsible for deploying any new VM, which allows the VM to be tested on the deployment server until it is ready to be migrated to a true production server using vMotion.
For example, a customer wanted to measure the performance of all VMs to determine how loaded the ESX host could become with the current networking configuration. To do so, we explained the CPU cycle issues and developed a plan of action. We employed two tools in this example: VMware vCenter, and esxtop running from the service console or from the vMA in batch mode (esxtop -b). For performance-problem resolution, esxtop is the best tool to use, but it spits out reams of data for later graphing. vCenter averages its historical data over 5-minute or larger increments, but its real-time statistics are collected every 20 seconds. esxtop uses real, not averaged, data gathered as often as every 2 seconds, with a default of 5 seconds. The plan was to measure performance with each tool while each VM was running its application. Performance of ESX truly depends on the application within each VM. It is extremely important to realize this and, when discussing performance issues, not to localize to just a single VM but to look at the host as a whole. This is why VMware generally does not allow performance numbers to be published: the numbers are workload dependent. It is best to do your own analysis using your own applications, because one company's virtualized application suite has nothing to do with another company's; there can be dramatic variations in workload even with the same application set.
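To turn the reams of esxtop batch data into something graphable, a short script can average the counters of interest. The sketch below assumes a capture made with something like esxtop -b > perf.csv; esxtop writes perfmon-style CSV, and because the exact counter names vary by version, the script matches column headers by substring rather than by a fixed name.

```python
import csv
from statistics import mean

def average_counters(path, name_fragment):
    """Average every CSV column whose header contains name_fragment."""
    with open(path, newline="") as f:
        rows = list(csv.reader(f))
    header, data = rows[0], rows[1:]
    picked = [i for i, col in enumerate(header) if name_fragment in col]
    return {header[i]: mean(float(r[i]) for r in data if len(r) > i and r[i])
            for i in picked}

# Example: average the aggregate physical CPU counters across the whole capture.
for counter, avg in average_counters("perf.csv", "Physical Cpu(_Total)").items():
    print(f"{counter}: {avg:.1f}")
```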
If you do want to measure the performance of your ESX hosts for comparison with others, VMware has developed VMmark, which provides a common workload for comparison across multiple servers and hypervisors. Unfortunately, VMmark is not a standard yet. There is also SPECvirt_sc2010 from the Standard Performance Evaluation Corporation (SPEC), located at www.spec.org/virt_sc2010/.
Network Utilization
Network utilization or I/O is a constant operational concern within the physical data center, and this does not change within the virtual environment. What does change is that network performance is now affected by all the virtual machines sharing the link in question, and that the virtual switches are tied to the CPU utilization of the ESX host in question. Many people claim that no single VM would ever saturate a gigabit connection, and that with 10 gigabit connections saturation is now impossible. Neither claim holds; there is more than enough capability in modern hypervisors to saturate any link. However, as with the discussion of disk I/O and CPU performance, we must remember that many VMs share those same network links and that the bandwidth used by one VM affects all other VMs using the same link.
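The arithmetic of a shared uplink is worth making concrete. The short sketch below, with invented per-VM traffic figures, shows how a handful of modest workloads can oversubscribe a single gigabit link.

```python
LINK_MBPS = 1000                                    # one physical gigabit uplink
vm_traffic_mbps = {"web01": 180, "web02": 220,      # invented steady-state demand
                   "backup01": 450, "file01": 300}

total = sum(vm_traffic_mbps.values())
print(f"Aggregate demand: {total} Mbps on a {LINK_MBPS} Mbps link")
if total > LINK_MBPS:
    print("Link oversubscribed: every VM sharing this uplink sees reduced bandwidth.")
```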
Even when you use VLANs, the traffic for all those VLANs runs over a single wire, or perhaps a few wires if you are using the built-in ESX load-balancing methods. Even so, it is possible for all the VMs together to adversely affect overall network utilization. When we then throw VMsafe-Net and other network and security virtual appliances into the mix, we can throttle bandwidth down even more; at the very least, we are adding to the overall CPU requirements for networking.
In a recent class, I was asked, "Why is this the case when virtual switches are 100% in memory?" The problem is that although the data and the virtual switch code are in memory, that code must still be executed by the CPU as part of the vmkernel. So as you add more VMs, virtual switches, and snapshots, overall CPU utilization increases, because the vmkernel now has more virtual networking to handle (more snapshots likewise mean more CPU work to handle disk blocks, including the changed block tracking mode in vSphere).
Anything happening within the vmkernel will impact CPU requirements, just as VMsafe will impact virtual switch performance and therefore directly impact a host's virtual networking; what happens within a given VM can also impact a host's virtual networking. Technologies such as Intel VT and AMD RVI reduce overall CPU overhead by offloading work the vmkernel would otherwise have to do. The vStorage APIs for Array Integration (VAAI) will also decrease overall vmkernel load, because repetitive storage actions are pushed down into the arrays.
Virtual Machine Mobility
Virtual machine mobility is becoming an increasing concern for operations and compliance tracking because we often need to answer the question: "Where is our data?"
Given VMware vMotion, DRS, DPM, FT, HA, and Storage vMotion, we could surmise that our virtual machines are always in motion and therefore the data within the VM is never actually at rest. Given this, it is becoming more of an issue to know exactly where that VM is at all times. Did the VM end up on a host where it should not be running because of compliance, networking, or other concerns?
The current guidance with respect to compliance and virtualization security is to silo VMs within security zones contained within specific clusters. One way to enforce this is to tag the VMs, hosts, virtual switches, and other virtualization host objects with security-zone tags (this is how the HyTrust Appliance and Reflex Systems approach compliance) so that a VM cannot be placed on a host, virtual switch, and so forth that does not share the same tag.
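A minimal sketch of such a tag check follows; the tag names and the policy function are assumptions for illustration, and products such as the HyTrust Appliance enforce this inside the management platform rather than in a script.

```python
vm_tags = {"dmz-web01": {"DMZ"}, "hr-db01": {"Internal", "PCI"}}
host_tags = {"esx-dmz01": {"DMZ"}, "esx-prod02": {"Internal", "PCI"}}

def placement_allowed(vm, host):
    """Allow placement only when the VM and host share at least one zone tag."""
    return bool(vm_tags[vm] & host_tags[host])

print(placement_allowed("dmz-web01", "esx-dmz01"))   # True:  same DMZ zone
print(placement_allowed("dmz-web01", "esx-prod02"))  # False: would cross zones
```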
Without these types of tags, a VM could end up on a host that does not have the proper virtual trust zones configured. Host Profiles can help solve many of these configuration issues within the same cluster, but they do not solve the problem for a different cluster, enclave, datacenter, and so on. Tags, however, apply to manual operations; the automatic operations from HA, DRS, and DPM are already limited to the hosts within the cluster. Hence we see the oft-required security-zone silo per cluster.
If tags are not in use, and they are not in use for the vast majority of virtualization systems today, there is an increasing risk that VMs end up on mis-configured hosts, and therefore application availability is impacted. In addition, a VM could end up somewhere else within your virtual environment. Perhaps it ends up on a single development host that is part of the virtual environment but shares the same LUNs as your production hosts.
Operationally, it is important to know where a virtual machine is at all times. To aid in this there are tools such as the HyTrust Appliance, VMware vCenter, and Hyper9, as well as other virtualization search tools. For large environments, you may need to search for the location of your critical virtual machines or have canned reports that flag anomalies caused by vMotion and Storage vMotion. Anomalies to look for include VMs from one trust zone ending up on hosts not vetted for that trust zone (DMZ VMs are a good example). These anomalies can happen on hosts outside a given cluster but within the same datacenter, as defined by VMware vCenter.