CPU Performance, Meet the Internet
Most of those who leverage cloud computing do so using public cloud providers. Although some companies can afford dedicated connections into a cloud provider, most of us will leverage cloud services over the open Internet. That causes a few problems when it comes to CPU performance, as well as data consumption and transmission.
As you can see in Figure 3-4, network-bound applications present a problem. These applications leverage the open Internet to transmit and receive inter-process communications (IPCs), sharing data and application message traffic between applications. Applications designed to avoid IPCs and data exchange over the open Internet, perhaps sharing data only over the much faster network internal to the cloud provider, are not affected. By avoiding the Internet and its often-bursty latency, your application's overall performance won't be bound to the speed of the Internet.
FIGURE 3-4 Although you might leverage the fastest CPUs, if you move data or inter-process communications over the open Internet, the speed of network communications will become the bottleneck. Many public cloud users attempt this, and then end up redesigning the applications, or rehosting the applications and data back on-site, so that network latency is diminished.
Yes, most of us are aware of performance issues when heavy-duty processing and data communications run over the Internet. What we often overlook are the changes to the criteria we should use when selecting compute platforms, including CPU speed and type, memory size, and even the operating systems we plan to leverage, such as Linux and Windows.
Figure 3-4 shows performance at various levels of network performance, as bandwidth increases and decreases, with a significant drop-off at the end. Note that the processor speed stays about the same, and even rises slightly moving to the right. However, as network performance decreases, it really does not matter which processor you picked, because the processor is not what determines overall performance. For example, suppose you have a saturated Internet router, or even a denial-of-service attack. If the application or applications are bound to Internet speeds due to dependencies on IPCs or on the data exchanges needed to drive the application(s), CPU speed and performance become irrelevant.
This is not a call to rewrite all your migrated applications to remove or reduce IPCs or data exchanged over the Internet, or communications that once existed only on your corporate network. Instead, consider the cost you’ll pay for the platform, including CPU, memory, and so on.
The moral of this story: If you don’t get the extra benefit of faster CPU processors, why spend the extra money?
The Slowest Components Determine Performance
Here’s another way to think about buying compute. If your applications are processor bound, meaning they consistently wait for a process to complete or finish a compute cycle to continue, then you’ll likely get real value out of deploying the fastest and most expensive CPUs. This includes high-performance computing (HPC) platforms that are now available, which clients often leverage in response to slow-running applications.
However, for many applications, the CPU, memory size, and speed have little to do with overall performance. It's the application design that's at fault; trying to fix the problem on the provider side just wastes money. Sometimes it costs tens of thousands of dollars extra per month to run network-bottlenecked applications on expensive CPUs and HPCs, which removes any value gained by using public clouds.
Many enterprises go wrong here when they pick cloud computing services and components, which often includes the compute platform. With an incomplete understanding of what determines overall application performance, the typical response is to upgrade and up-spend the cloud provider’s CPUs. After all, it’s a simple click of a mouse to do the upgrade; you don’t have to visit a data center to integrate a new compute server with the other physical servers and the network. The process is so easy that you can get into real trouble with cloud spending before you realize the root causes of application performance problems.
A detailed discussion of typical application performance problems, how they are diagnosed, and how to design an application for good performance is beyond the scope of this book. However, it’s an area of focus that many cloud computing and cloud application architects should better understand.
Remember: The slowest-running component determines overall performance. The most frequent culprit is the network. However, storage I/O delays, bus latency, and, yes, even the CPU can also be factors.
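A quick way to see this is to model a request as a chain of serialized stages. The numbers below are purely illustrative (not measurements from any real system), but they show why doubling CPU speed barely moves the needle when the network dominates per-request latency:

```python
# A minimal sketch (hypothetical numbers) of why the slowest component
# dominates: per-request time is the sum of each stage's latency, so
# shrinking an already-small CPU share barely changes the total.

def request_time_ms(cpu_ms: float, network_ms: float, storage_ms: float) -> float:
    """Total per-request latency across serialized stages."""
    return cpu_ms + network_ms + storage_ms

# Baseline: a network-bound application (illustrative numbers).
baseline = request_time_ms(cpu_ms=5, network_ms=120, storage_ms=15)

# Doubling CPU speed halves only the CPU share of the latency.
faster_cpu = request_time_ms(cpu_ms=2.5, network_ms=120, storage_ms=15)

# Fixing the network (e.g., keeping IPC traffic inside the provider's
# internal network) attacks the actual bottleneck.
faster_net = request_time_ms(cpu_ms=5, network_ms=20, storage_ms=15)

print(f"baseline:   {baseline:.1f} ms")    # 140.0 ms
print(f"faster CPU: {faster_cpu:.1f} ms")  # 137.5 ms -- barely better
print(f"faster net: {faster_net:.1f} ms")  # 40.0 ms  -- the real win
```

The arithmetic is the whole point: paying for the faster CPU buys a 2 percent improvement, while addressing the bottleneck buys more than a 3x improvement.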
This point brings up more complexities when you pick components to build your application(s) or system. It's not about what you spend for better-performing cloud services. It's more about the design of your application(s) or system and where the performance bottlenecks will likely exist. The kneejerk reaction to a poorly performing application is to toss money at the problem and hope that fixes it. Instead, that approach introduces a whole new set of problems: you'll spend too much on the application's infrastructure without solving the root cause of the performance issue.
For example, a client of mine migrated an inventory control application from an existing LAMP stack (Linux, Apache, MySQL, PHP/Perl/Python) in their data center to a public cloud provider. They made minimal changes to the application, choosing instead a lift-and-shift approach to minimize the costs of migration.
After the migration was complete, performance issues were noted during acceptance testing. The user interface ran 40 percent slower than it had on the traditional platform. Without truly diagnosing the problem, the application owners decided to leverage a more powerful and more costly platform, meaning a higher-end CPU cluster and more memory. The result was a 5 percent increase in performance, which was still deemed unacceptable.
In this example, it could be any number of components or cloud services that caused the performance problem or problems. Without a sound diagnosis of what’s at the root of the performance issues, you’re just taking shots in the dark by renting better components and cloud services and hoping for the best.
Upon detailed diagnosis, we determined that a combination of the database and the network was the root cause of the problems. The database was fixed by changing a few tunable parameters, in this case significantly increasing the data cache size. The network issue was caused by a failing Internet router in the department that used the application the most. Spending more money on CPU and memory resources did not help and only confused the matter. By reviewing the application's overall design and usage, we identified the actual issues and fixed them for a minimal amount of money.
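For a MySQL back end like the one in this story, that kind of cache fix can be a few lines in the server's option file. The parameter names below are real InnoDB tunables, but the values are illustrative; you'd size the buffer pool to your actual working set, not copy these numbers:

```ini
# /etc/mysql/my.cnf -- illustrative values, not a recommendation
[mysqld]
# The InnoDB buffer pool caches data and index pages in memory;
# raising it from a small default sharply reduces disk I/O for
# read-heavy workloads.
innodb_buffer_pool_size = 8G

# Splitting a large pool into multiple instances reduces internal
# contention under concurrent load.
innodb_buffer_pool_instances = 4
```

The point is not these particular settings; it's that a minor, nearly free tuning change outperformed thousands of dollars of extra compute.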
How to Speed Things Up Through Design
The lesson here is that the key to selecting the best compute configuration (covered next) is to first understand the design of the application that will leverage the compute instance. Many of us hate the “it depends” answer, but here it really depends on how the application was structured, and thus how it leverages compute, storage, memory, and the network.
Fortunately, you can run the application within an application profiler to understand how the application leverages infrastructure resources, such as I/O, storage, compute, memory, and network. In many instances, this is done prior to migrating the application to the cloud so that you can make a more educated determination of how to set up the target cloud’s infrastructure to better support the application and data storage.
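As a minimal sketch of what "profiling before you migrate" looks like, here is a standard-library-only example. The `handle_request` function is a hypothetical stand-in for your real workload entry point; in practice you would drive representative traffic through the actual code paths:

```python
# Profile a stand-in workload with Python's built-in profiler to see
# where time is actually spent before choosing cloud resources.
import cProfile
import io
import pstats

def handle_request():
    # Hypothetical workload: in a real profile, this would exercise
    # the application's actual request-handling path.
    return sum(i * i for i in range(100_000))

profiler = cProfile.Profile()
profiler.enable()
for _ in range(10):
    handle_request()
profiler.disable()

# Summarize the hottest call paths, sorted by cumulative time.
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
print(out.getvalue())
```

A CPU profiler only covers the compute side; you would pair it with network and storage monitoring to build the full resource picture, but even this much tells you whether a bigger CPU could possibly help.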
You can also understand the design of the application in other ways. A good old-fashioned review of programming code and database structure comes to mind. A review of the documentation left behind from the original design is another sound idea, as is talking to those who originally designed and built the application. During speedy migration projects, you’ll find that most of those who migrate the application don’t go to this level of due diligence and end up having performance problems. Reminder: Tossing more resources at the application after the fact is the costliest way to bandage over performance problems.
Enterprises consistently under- or overestimate the amount and configuration of cloud-based resources needed for a single application, or many applications. This is more than an application-level problem. It’s now a holistic migration problem.
The key to understanding the resulting performance problems is to first understand the design of each application, how it leverages different resources, and thus how the target cloud compute instances should be configured, along with other services that the application may need, such as storage.
Application design needs to be considered when building net-new cloud applications or when migrating an existing application to the cloud. Almost all performance issues I encounter, in terms of applications being migrated to the cloud or built on the cloud, end up being issues that were fixed by changing the application’s design.
Examples include leveraging new models for utilizing memory more efficiently, reducing calls across the network, and even performing foundational tasks such as leveraging a database caching system to reduce disk I/O and network utilization. Some of these are tweaks, such as tuning your database. Others are major surgery, such as leveraging a new and more efficient sorting approach. Potential design patterns pretty much number in the millions. It’s important that you have some visibility into these patterns to make the most of your cloud deployments.
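One of those foundational fixes, caching to cut repeated database trips, can be sketched in a few lines. The lookup function and call counter below are hypothetical stand-ins for a real data-access layer, not any particular framework's API:

```python
# A minimal sketch of caching repeated lookups so they don't hit the
# database (and the network) on every request.
from functools import lru_cache

DB_CALLS = 0  # counts simulated trips to the database

@lru_cache(maxsize=1024)
def product_price(sku: str) -> float:
    """Pretend database lookup; a real version would issue a query."""
    global DB_CALLS
    DB_CALLS += 1
    return {"A100": 19.99, "B200": 5.49}.get(sku, 0.0)

# Ten requests for the same two SKUs cost only two database trips.
for _ in range(5):
    product_price("A100")
    product_price("B200")

print(DB_CALLS)  # 2
```

In a distributed system you'd reach for a shared cache tier rather than an in-process decorator, but the design principle is identical: do the expensive work once and reuse the result.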
Picking the Most Optimized Compute Configuration
So, let’s say we do most things right, in terms of understanding the design of our application workloads and data storage requirements. That means we pretty much know what we need for CPU, memory, and other platform requirements such as operating systems. How do we pick the most optimized compute configuration? Keep in mind that we can make mistakes here in two different directions.
As you can see in Figure 3-5, the value delivered is at its lowest points when we leverage either too few or too many resources. Leveraging too few resources saves money on cloud resource usage, but application performance and resiliency suffer, reducing the value the application delivers to the business.
FIGURE 3-5 Here’s what happens when you leverage too few or too many resources, and the corresponding effect on value delivered to the business. The objective here is a fully optimized system that delivers the most value to the business. You will find that just tossing money at problems fails to deliver the value. Also, not spending enough on the resources that you need also removes value. It’s a balancing act.
It's the same case for using too many resources. Although application performance and resiliency should be good, we pay too much for unnecessary resources, and the extra cost of overspending reduces the value delivered to the business. We maximize that value near the point where both curves meet, where the quantity of resources is fully optimized against the value delivered.
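You can capture this trade-off in a toy model. The curve shapes and numbers here are assumptions for illustration, not provider data: value rises with resources but saturates (diminishing returns), while cost grows linearly, so net value peaks in the middle, just as the figure shows:

```python
# A toy model of the resources-versus-value trade-off: benefit
# saturates, cost keeps climbing, and net value peaks in between.
import math

def net_value(resources: float, benefit_cap: float = 100.0,
              cost_per_unit: float = 4.0) -> float:
    benefit = benefit_cap * (1 - math.exp(-resources / 5.0))  # diminishing returns
    cost = cost_per_unit * resources                          # linear spend
    return benefit - cost

# Scan a range of resource levels and pick the one with the best net value.
levels = [r / 2 for r in range(1, 61)]  # 0.5 to 30.0 units
best = max(levels, key=net_value)
print(f"best resource level ~ {best} units")
```

Under- or over-shooting this peak in either direction gives away value, which is exactly the balancing act the figure describes.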
Keep in mind that we may be facing a mindset issue here. Customers build in the cloud like they used to build data centers. They build like they are retailers trying to build for the Christmas rush. This is largely because it’s so easy. The best analogy that I have is the current rise of food delivery services. The ease in which we can obtain our favorite foods, with pretty much no effort or pain, means that we will have the mindset to buy more, and thus get fat (or fatter). The mindset around cloud computing means that we’re getting fat with cloud services that we most likely don’t need.
What's interesting about this chart is that the current number of fully optimized enterprise cloud applications is pretty much zero. Most of those charged with picking cloud resources, including the CPU and memory resources, over- and underestimate the number of required resources. They end up on either end of the resource and value curves and almost never near the center. One problem is that few cloud architects suffer the immediate consequences of resource allocation mistakes, so they remain blissfully unaware of the ripple effects of their choices.
Although overallocated resources result in higher cloud bills, the cloud-based application typically performs just fine, which is the metric most application owners use. Even those who underallocate resources may not know they have a value delivery problem unless they measure productivity, which may suffer due to application latency and outages. It will take time to go back and identify where a system failed to live up to expectations, but these ghosts in the machines will eventually come back to haunt you.
Again, the ability to optimize systems comes down to understanding your requirements before selecting the cloud resources you should leverage. The process should never be about guessing, nor trial and error. It should be about mathematically understanding the requirements around processor speed and memory use to get as close as you can to full optimization. This is a bit like horseshoes and hand grenades: you win only by getting close. Very few will obtain full optimization when it comes to business value; what's important is that you get as close as possible.
Figure 3-6 looks at your options when it comes to selecting a cloud compute platform. Selecting and configuring a compute instance sounds simple: just select the CPU type (such as x86), including brand (Intel, AMD, and so on) and processor speed. However, you must also consider the number of processors configured and the size and speed of the memory. And then there are different types of processors, such as microprocessors, microcontrollers, embedded processors, and digital signal processors, as well as different processor generations, brands, and standards.
FIGURE 3-6 When picking a compute platform, you must decide the power and type of processor, number of processors, memory configuration and size, and operating system. Input/output usually needs to be configured as well.
Yes, you can configure and deploy some pretty powerful compute platforms. However, it’s not only about what you pick to align the platform to the requirements of the application, but what goes on behind those choices, in terms of power and cost of the compute resource.
The operating system choice for the platform comes down to type and brand. For example, if your application runs on Linux on its traditional platform, you'll need to pick a compatible Linux distribution for the cloud migration, such as Red Hat, or perhaps a version created by the cloud provider, such as Amazon Linux. You might also pick Windows Server or other operating systems that you need to configure.
You'll find that the types, brands, and even the versions of operating systems vary a great deal from public cloud provider to public cloud provider. It would be nice to have a list here by platform; however, there could be a dozen more operating system types, brands, and configurations available by the time of publication. I've found that most of them do the same things, and today arguing about which operating system is best has diminished returns. Don't get caught up in the silliness of becoming a believer in one specific operating system or another, certainly not in the public cloud, where changing operating systems takes very little time and does not require touching physical hardware.
Normally, cloud providers give you a few choices. First, you can select one of their prebuilt configurations, with the processor, memory, and operating system preconfigured for you. These "packages" are often the easy choice because the components are known to work with the provider's systems. Alternatively, you can select the processor(s), memory, and operating system you need a la carte from the provider's lists of supported options. So, what's the best path?
I never really liked the idea of selecting the packages that the public cloud providers put together. Not that there is anything wrong with prebuilt bundles that are known to work together, but the odds are against a prebuilt bundle meeting the exact needs of your application workload. This is especially true if you’re only guessing about application requirements without having a detailed profile, and little thought is given to your exact and specific needs.
Picking the correct compute configuration is an often-overlooked art form, certainly in the world of cloud computing. Let’s look again at Figure 3-6, where there are three to four components to choose. You’ll also need to attach storage (perhaps a database), configure the network, and even attach special services such as machine learning and data analytics. For now, let’s just keep the discussion to compute.
For most migrated applications, you’ll need to profile the application workload using an automated profiling tool to determine the exact processor and memory needs. Or, you can use formulas to determine the compute needs of the application based on what it’s built to do and how it should use processor and memory. Guessing is only a last resort that pretty much ensures a wrong answer.
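As a sketch of the "formula" approach, here is a back-of-the-envelope sizing calculation. This is not a provider API; the inputs (peak request rate, per-request CPU time, target utilization) are assumptions you would take from profiling data or the application's design documents:

```python
# Back-of-the-envelope vCPU sizing: convert measured per-request CPU
# time and expected peak load into a vCPU count, with headroom so the
# instance isn't running hot at peak.
import math

def vcpus_needed(peak_rps: float, cpu_ms_per_request: float,
                 target_utilization: float = 0.6) -> int:
    """CPU-seconds demanded per second, divided by how busy we
    allow each vCPU to run, rounded up to whole vCPUs."""
    demand = peak_rps * (cpu_ms_per_request / 1000.0)
    return math.ceil(demand / target_utilization)

# Example: 500 requests/sec peak, 12 ms of CPU per request,
# targeting 60 percent utilization.
print(vcpus_needed(500, 12))  # 10
```

A calculation like this won't be exact, but it anchors the instance choice to measured requirements instead of a guess, which is the entire argument of this section.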