Climbing the Mountain to 24/7 Data Access
- 1.1 Fault Tolerance
- 1.2 Performance
- 1.3 Scalability
- 1.4 Reliability
- 1.5 Designing Reliable Data Access
- 1.6 Summary
Technology is, in many ways, more an art than a science. It is driven more by innovative minds than by textbook models. Just as our drive to invent products for the future is often hindered by the limits of engineering, the development of products for the Internet generation has often placed a distance between what the developer envisioned and the final product. After years of developing products under such constricting realities, it is easy to see how integrating those products could be challenging. For example, the tools to produce media-rich Web content were conceived and developed years before the telecommunications industry could provide the bandwidth needed to make that content practical. So, for many years, companies trying to integrate media-filled Web content with their own products and ideas were hampered both by the difficulty of that integration and by the limits of bandwidth.
Amidst the mass of connectivity now flowing through our homes and telephone lines, rarely a day goes by that we do not have an "imperfect" experience on the Internet. We have all abandoned a Web site's shopping cart because of badly written programs or slow response times. Even on our best day, we often cannot connect to our favorite site or online store. Most people don't give a thought to why they cannot access a site; they just accept it as commonplace, because the Internet has always been known as an unreliable, public medium.
A chief executive from one of the largest software companies in the world once criticized the automobile industry, stating, "If the automobile industry had grown the way the technology industry has grown, we would all be driving hovercraft." In retort, a chief executive from a leading automobile maker said that although the automobile industry could have grown that fast, the result would have been unreliable automobiles that stopped for no reason, consistently failed to start, and required constant maintenance to stay on the road.
Although the automobile executive responded in jest, his point rings clear. People who purchase automobiles expect them to be reliable. They expect them to operate when the key is turned, to go when the accelerator is pressed, and to stop when the brake is applied. Unreliability and component failure are not acceptable.
In the technology arena, users accept intermittent failures as commonplace, viewing them as a price paid for innovation. The demand for bulletproof hardware and software has not been great because the personal computer, from the date of its inception, has always lacked fault tolerance, the ability to recover from a single point of failure (SPOF). As the computing environment in companies began to move from internal data processing to a model that was more accessible to their clients, it made sense to replace the large mainframe environment with a lower cost client/server architecture accessible from a World Wide Web browser, giving birth to the era of e-commerce.
Unfortunately, as the systems in use changed from large, reliable mainframe systems to low-cost client/server solutions, companies sacrificed fault tolerance for cost and the ease of integration with existing systems. Systems and applications became less reliable, even though they were more functional and easier to use. The sacrifice of fault tolerance was unavoidable due to a lack of client/server technologies that could provide the reliability and performance of a mainframe. But things are now changing.
Today, systems and services are being designed for high availability. Applications such as ERP (Enterprise Resource Planning) databases, business-to-business (B2B) e-commerce storefronts, and Web portals are in high demand. Along with that demand comes a need for servers that can provide data to users reliably. Just as mainframes were often used in data center environments that demanded 24/7 access, groups of servers, known as clusters, are now being deployed within data centers to rival, if not equal, the reliability and stability of the mainframe. A cluster, loosely defined, is a group of computers that work together to create a virtual server that can provide seamless services to a client or group of clients.
To better understand why a cluster is needed and how it can provide the same reliability as a mainframe environment, several areas of concern must be addressed: fault tolerance, performance, scalability, and reliability.
1.1 Fault Tolerance
Fault tolerance is the ability of a system to respond to a failure in a way that does not hinder the service offering provided by the server. This is often referred to as a "graceful" response because the system handles the failure gracefully, without interrupting service to users. Fault tolerance theory was applied in several other industries before it was ever used in a technology setting.
Consider the airline industry. If an airline did not have fault tolerance built into its airplanes, one small failure could bring a plane crashing to the ground. To avoid such a catastrophe, airlines have eliminated every SPOF they can. Most commercial airplanes have not one engine but several, any of which can keep the plane in the air if one or more of the others should stop working.
The average home computer user does not have fault tolerance built into his or her system. What happens if the power supply, the component that provides power to the machine, fails? The machine no longer functions. What if the hard drive stops working? The user will likely lose not only the ability to run his or her computer, but also the data resident on the hard drive. Computers built for the consumer market have several SPOFs and typically do not provide fault tolerance. This is a matter of cost versus uptime: the machine you run at home does not provide services to other machines that may need 24/7 access to it, so the cost of fault tolerance is outweighed by the home user's need for inexpensive computing.
Many people run backup software to eliminate the risk of losing data, which is good practice, but backup software is a disaster recovery option, not a substitute for fault tolerance. Disaster recovery is a means of restoring your computer and all your data in the event of a failure, but it does not prevent the failure or assist the computer in functioning in spite of the failure.
In a corporate environment, fault tolerance is a bit more important, not necessarily for the workstations used by the company's employees, but for the servers that contain the company's data. In most companies today, business cannot operate without the data stored on the server. How many times have you called a merchant only to hear, "I would love to help you with that, but our systems are down right now"? Companies have developed a dependence upon technology to the point that they cannot operate without the "system."
To ensure that downtime is minimized, network engineers apply fault tolerance theory to their systems and applications. They ask the question, "If blank were to fail, would my system still run and provide service to users?" That blank could be anything that assists in providing an application or service to the company's users. For instance, if the power were to fail, would the system still run and provide service to users? Unless your system is battery-powered or protected by an uninterruptible power supply (UPS), the answer is no, and power becomes an SPOF for your system.
Fault tolerance theory should be applied not only to environmental issues such as power and cabling, but also to the machine or server itself. Within the server, you have several points of failure: the power supply, the hard drive(s), the network card, the motherboard, the memory, and so on. A failure in any one of those areas can cause your system to stop functioning. Hardware manufacturers have begun to provide fault tolerance within their architectures, often building redundancy into the power supply or network card, and technologies such as RAID (Redundant Array of Inexpensive Disks) have been developed to protect as many parts of the machine as possible. However, even in the most advanced systems built today, redundancy is not available for every component. If your server has redundant power supplies, redundant network cards, redundant hard disks, and redundant processors, and the machine experiences a motherboard failure, the system still goes down.
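To make the engineer's question concrete, the following is a minimal sketch of an SPOF check. It assumes a hypothetical inventory in which each server component is listed with the number of redundant instances installed; the component names and counts are illustrative only, not drawn from any particular product.

```python
# Toy single-point-of-failure (SPOF) check: a hypothetical illustration,
# not a real diagnostic tool. Each entry records how many redundant
# instances of the component the server has; any component with fewer
# than two instances is flagged as an SPOF.

server_components = {
    "power supply": 2,   # redundant pair
    "network card": 2,   # teamed adapters
    "hard disk": 4,      # RAID array
    "processor": 2,
    "motherboard": 1,    # no redundancy available
    "memory": 1,
}

def single_points_of_failure(components):
    """Return the components whose individual failure would take the system down."""
    return [name for name, count in components.items() if count < 2]

if __name__ == "__main__":
    spofs = single_points_of_failure(server_components)
    if spofs:
        print("Single points of failure:", ", ".join(spofs))
    else:
        print("No single points of failure in this simple model.")
```

With this illustrative inventory, the check would flag the motherboard and the memory, which echoes the point above: some components simply cannot be made redundant within a single machine.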
So, although every machine you run that provides business-critical functionality should have every form of redundancy available, even the most advanced hardware/software combinations don't offer complete redundancy and fault tolerance within a single machine.