The Intelligent Architectures Design Philosophy
IS/IT staff are continually faced with the need to architect systems, whether to deploy new systems and services or to redesign existing ones. Sun's Enterprise Engineering group has developed Intelligent Architectures as a technique to architect systems and datacenters.
This paper introduces the design philosophy and tenets of the Intelligent Architectures (IA) approach to systems architecture, a philosophy centered on the use of archetypes: original models after which similar things are patterned.
This paper will present the IA archetypes in brief, as well as rules and recommendations for combining archetypes to architect systems and datacenters. Additionally, recommendations for the combination of archetypes through the use of traditional design techniques such as top-down or bottom-up and structured or ad-hoc design will be explored.
Throughout this paper the terms "application" and "service" are used interchangeably to refer to a software construct that provides a function to other systems or end-users. The term "server" is used to refer to the hardware platform upon which a service executes. Finally, the term "system" is used to denote the combination of a server and all of the applications provided by that server.
Quality of Design
The IA approach is built on several tenets that we consider to be essential to "good design." Before examining these tenets, however, we should clarify what is meant by a "good design."
For our purposes, a good design is one that helps maximize the reliability, availability and serviceability of a system. Further, a good design is as simple as possible while providing maximum extensibility. This simplicity enables the system to be easily understood, enhancing the maintainability of the system.
A good design must allow for the inevitable failure of a component or of the system as a whole. A good design provides recovery tools that aid in the rapid recovery of the system.
Finally, a good design must meet the requirements for the system, regardless of whether all of the requirements are known or whether they are accurate. It is only through an agreed-upon set of requirements that a solution can be formed and eventually judged a success or failure. In the absence of known requirements, assumptions must be documented and, most importantly, agreed upon with the customer (or end-user). The requirements definition process, like the design process, can be viewed as an iterative process of successive refinement.
Intelligent Architectures Design Tenets
Clustering for highly-available (HA) services, for example, with Sun™ Cluster software, is an area where service requirements must be examined very carefully. When implementing an HA service, there is a common misconception that the boot disk does not need to be mirrored because the service can be failed over to another host in the event of a boot disk failure. While this is true, it does not account for the time the service is unavailable during the fail-over process. Although the fail-over may not take very long to complete, if the service requires five nines (99.999%) of availability, you should avoid the fail-over time altogether by mirroring the boot disk.
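To make the five-nines figure concrete, the following minimal sketch converts an availability target into a yearly downtime budget and counts how many fail-overs fit inside it. The 120-second fail-over time and the function names are assumptions chosen for illustration; they do not describe any particular cluster product.

```python
# Illustrative arithmetic only: the fail-over duration below is a hypothetical value.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # non-leap year


def downtime_budget_seconds(availability: float) -> float:
    """Return the seconds of downtime per year permitted by an availability target."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * SECONDS_PER_YEAR


def failovers_within_budget(availability: float, failover_seconds: float) -> int:
    """Return how many fail-overs of the given length fit in one year's budget."""
    if failover_seconds <= 0:
        raise ValueError("failover_seconds must be positive")
    return int(downtime_budget_seconds(availability) // failover_seconds)


five_nines = 0.99999
budget = downtime_budget_seconds(five_nines)
print(f"Downtime budget: {budget:.1f} s/year ({budget / 60:.2f} minutes)")
print("120 s fail-overs that fit:", failovers_within_budget(five_nines, 120.0))
```

Under these assumptions the yearly budget is a little over five minutes, so only two fail-overs of that hypothetical length fit before the target is missed.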
The inverse of this concept is also interesting to consider: it may not always be necessary to mirror the boot disk on systems providing an HA service. Consider the HA system management services provided by the System Service Processor (SSP) to the Sun Enterprise™ 10000 platform. All Sun Enterprise 10000 configurations should have two physically distinct SSPs: a main SSP and a spare SSP. With this configuration, there is no need to use a logical volume manager (LVM) to mirror the system disk of the main or spare SSP. In the event of a system disk failure in the main SSP, the SSP services can simply be failed over to the spare SSP. The unavailability of the SSP during fail-over will not affect any running domains on the platform managed by the SSP. Further, because the SSP has only one system board and one SCSI disk controller (both of which are single points of failure), the complete physical redundancy of SSPs gives you greater protection than mirroring with an LVM on the main or spare SSP.
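The SSP reasoning can be sketched with the standard series and parallel availability formulas. The component availabilities below are hypothetical numbers picked only to show the shape of the comparison, and the model assumes independent failures and ignores fail-over time.

```python
from math import prod


def series_availability(components):
    """Every component must work, so the availabilities multiply."""
    return prod(components)


def redundant_availability(unit, copies=2):
    """At least one of `copies` independent, identical units must work."""
    return 1.0 - (1.0 - unit) ** copies


# Hypothetical availabilities, not measured values for any real hardware.
board, controller, disk = 0.999, 0.999, 0.995

# One SSP: the board, controller, and disk are all single points of failure.
single_ssp = series_availability([board, controller, disk])

# One SSP with a mirrored disk: the board and controller are still unprotected.
ssp_mirrored_disk = series_availability(
    [board, controller, redundant_availability(disk)]
)

# Two physically distinct SSPs, each with an unmirrored disk.
two_ssps = redundant_availability(single_ssp)

print(f"single SSP:         {single_ssp:.6f}")
print(f"SSP, mirrored disk: {ssp_mirrored_disk:.6f}")
print(f"two SSPs:           {two_ssps:.6f}")
```

With these made-up inputs, the pair of SSPs comes out ahead of a single SSP with a mirrored disk, because mirroring protects only the disk while the spare SSP covers the board and controller as well.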
In short, no single rule or best practice can be given that is appropriate for all systems in all datacenters. While you should take the datacenter site standards and experience of the datacenter personnel into account, the requirements and needs of the application must be the driving force for the decisions and compromises made in designing a system and planning its boot disk.
The IA design tenets are:
Design for the needs of the application
Employ an iterative design process
Design with simplicity in mind
Employ reusable design components
The following sections explain these tenets.
Design For the Needs of the Application
It is crucial that the development or choice of a software tool be based on the requirements and needs of the system or datacenter. Consider the choice of an LVM, such as the Solstice DiskSuite™ software or VERITAS Volume Manager (VxVM). All too often, the choice of LVM is based on emotion, uninformed opinion, misunderstanding, or misconception. The system administrator's experience and comfort level with an LVM are important and should contribute to the decision of which LVM to implement, but they must not be the driving force in the choice.
Before creating or implementing a system architecture, you must define and understand the availability and serviceability requirements of an application or service. These requirements must be the driving force in selecting and implementing a system and system software. The availability and serviceability needs of the application are paramount.
Further, the availability and serviceability requirements of an application must be addressed in the design of a system. For example, a system used exclusively to provide a data warehouse service is ideally suited to have its database implemented on a RAID5 volume, especially RAID5 implemented in the hardware of the storage device or enclosure. The data warehouse transaction mix, which is almost entirely database reads, is well suited to RAID5: data redundancy and availability are achieved without the high cost that RAID1+0 would impose. The read-oriented nature of the data warehouse allows the major weakness of RAID5 (extremely slow writes) to be avoided.
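The trade-off in the data warehouse example can be sketched numerically. The model below uses the commonly cited small-write penalties (four back-end I/Os per RAID5 write, two per RAID1+0 write) and simple capacity fractions; these are textbook approximations assumed for illustration, not figures from this paper, and a hardware RAID controller with a write cache will behave differently.

```python
# Simplified, assumed model: back-end I/Os generated per front-end write.
WRITE_PENALTY = {"raid5": 4, "raid10": 2}


def usable_fraction(level: str, disks: int) -> float:
    """Fraction of raw capacity left for data under a simple layout model."""
    if level == "raid5":
        if disks < 3:
            raise ValueError("RAID5 needs at least 3 disks")
        return (disks - 1) / disks  # one disk's worth of parity
    if level == "raid10":
        if disks < 2 or disks % 2:
            raise ValueError("RAID1+0 needs an even number of disks")
        return 0.5  # every block is stored twice
    raise ValueError(f"unknown RAID level: {level}")


def backend_ios(level: str, reads: int, writes: int) -> int:
    """Back-end disk I/Os needed to serve the given front-end reads and writes."""
    return reads + writes * WRITE_PENALTY[level]


for name, reads, writes in [("warehouse (95% reads)", 950, 50),
                            ("write-heavy (50% reads)", 500, 500)]:
    for level in ("raid5", "raid10"):
        print(f"{name:24} {level:6} "
              f"usable={usable_fraction(level, 8):.3f} "
              f"backend I/Os={backend_ios(level, reads, writes)}")
```

In this model the read-mostly mix costs RAID5 only slightly more back-end I/O than RAID1+0 while leaving far more usable capacity, whereas the write-heavy mix makes the RAID5 write penalty dominate.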
It is important to note that the preceding example did not mention the operating system (OS) or boot disk. The OS and the on-disk image provided by the boot disk exist to provide an environment for the application or service to function. The system architecture, of which the boot disk is a key component, must be designed and implemented to optimize the application, not the OS. In the preceding example, the LVM that provides the most efficient implementation of software RAID5, or the LVM that works best with the hardware RAID5 subsystem of the data warehouse, is the best LVM for the system.
This concept of system architecture is the architecture side of the axiom used in performance tuning. That axiom states that the greatest performance increases are gained in tuning the application, and the smallest performance increases are achieved in tuning the OS.
Employ an Iterative Design Process
When creating a system architecture, be aware that very few (if any) systems are deployed and then put into stasis. Systems, by definition and by necessity, are in an almost constant state of flux. The system life cycle is used to describe the phases a system or component goes through. These phases usually include initial concept, deployment, sustaining maintenance, and retirement. With many systems or components in a modern IT datacenter, the system life cycle dictates that constant changes will occur.
This concept of a system life cycle maintains that nearly every system will be designed and then redesigned several times in its lifetime. In fact, the analysis of existing system architectures and system redesign, or remodeling, is quite possibly one of the most common activities for system administrators and IT architects. They are rarely given an empty datacenter and a multibillion-dollar budget and told to "create a datacenter." More often, system architects are given a datacenter with legacy systems, very little remaining floor space, aggressive schedules, and insufficient budget.
The main constraint when remodeling is that one can't start over from scratch. Further, constraints like cost, risk, time, fallback plans, and return on investment create additional problems for remodeling systems. Making appropriate choices regarding the system architecture is critical to being able to work within these constraints.
Simplicity in Design
In addition to masking potential reliability issues, system complexity is often one of the largest inhibitors to system recovery. Because less-complex systems are more readily and quickly understood (and, therefore, easier to troubleshoot) than complex systems, decreased system complexity may help speed system recovery in the event of a failure.
Simplicity in design also benefits junior system administrators and junior datacenter operations staff by enabling them to administer systems that otherwise may have been beyond their experience or abilities.
Additionally, decreased system complexity helps minimize potential exposure to software bugs or harmful interactions between software components. Simply by decreasing the number of installed software components, you can reduce the potential for bugs or harmful interactions between software.
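One rough way to see why fewer components means fewer harmful interactions is to count the possible pairs of components, which grows much faster than the component count itself. This is only an upper bound on pairwise interactions under the assumption that any two components might interact; the component counts are hypothetical.

```python
from math import comb


def potential_pairwise_interactions(components: int) -> int:
    """Upper bound on distinct pairs of installed components that could interact."""
    if components < 0:
        raise ValueError("components must be non-negative")
    return comb(components, 2)  # n * (n - 1) / 2


# Hypothetical component counts for comparison.
for count in (5, 10, 20):
    print(f"{count:2d} components -> "
          f"{potential_pairwise_interactions(count)} possible pairs")
```

Halving the installed components from 20 to 10 cuts the possible pairs from 190 to 45, which is more than a fourfold reduction.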
Reusability of Design Components
Reusing system components offers many advantages to the system architect. By having a common and reusable component design, the time required to architect systems decreases.
Additionally, reusable components provide a "known quantity" to system designs. The software components can be debugged and their correctness verified before integration into the system architecture. By combining these reusable components, you can design and deploy systems faster than you could by designing each system from the ground up.
Further, utilizing reusable components increases consistency across all systems in the datacenter. This minimizes the "one-off solutions" and ensures adherence to site standards.
This consistency across systems is, quite possibly, the greatest benefit of using reusable design components. Consistency across systems in your datacenter improves system recovery, simplifies maintenance procedures and run books, and enables systems to be installed and deployed faster.
The benefits and importance of reusable components are so great that the IA design philosophy is built around the use of archetypes (original models after which similar things are patterned), which are essentially reusable components.