Introduction
This section provides a technical description of dynamic reconfiguration (DR) concepts and of attach and detach operations. It also presents the functionality and architecture of the software components of the Sun Fire 15K/12K DR framework.
DR on a Sun Fire 15K/12K server is a powerful technology for reallocating resources in a domain while the domain is still running the Solaris OE and applications.
To successfully use DR on Sun Fire 15K/12K servers, it is strongly recommended that you test all DR operations in a test domain before a domain is put into production. In addition, all DR operations should be documented in a runbook for quick reference. Only tested and well-documented DR procedures should be performed on a production system.
Following best practices for DR in the architect, implement, and manage phases allows you to unleash the full potential of DR.
What is Dynamic Reconfiguration?
Dynamic reconfiguration (DR) allows hardware resources to be reconfigured, removed, installed, and reallocated through a physical or logical restructuring of the hardware components of Sun Fire 15K/12K servers. These actions can be accomplished without rebooting the Solaris OE.
DR allows the user to alter the configuration of a running domain by bringing components online or taking them offline. These configuration changes can be performed with minimal or no disruption to domain operation and do not require a domain reboot. With the availability of DR, system boards can be logically and physically included in the domain configuration, or logically deactivated and removed while the domain is running.
DR is a critical part of the availability strategy prevalent in enterprise computing environments. In mission-critical environments, DR is useful if:
A CPU/Memory board, MaxCPU board, or hsPCI I/O board needs to be added or removed.
A memory DIMM or a bank of memory DIMMs needs to be added or removed.
PCI I/O adapters need to be added or removed.
Redundant I/O adapters need to be replaced.
Redundant I/O assemblies need to be replaced.
DR provides the following capabilities:
Dynamic reconfiguration of CPU/Memory boards, MaxCPU boards, hsPCI I/O boards, and PCI cards.
Movement of a CPU/Memory board, an I/O board, or a MaxCPU board between dynamic system domains.
Reconfiguration Coordination Manager (RCM) support.
DR support for IP multipathing (IPMP).
DR support for Sun StorEdge™ Traffic Manager software.
DR support for Sun Java™ System Cluster 3.x (formerly Sun™ Cluster).
DR is a significant feature of high-end Sun Fire servers. Now in its fifth generation, DR continues to give Sun Fire servers virtually unmatched capabilities in the high-end UNIX server marketplace. Since 1995, DR technology has evolved to exploit the new capabilities of each new server product line and Solaris OE release:
1st generation: Solaris 2.4 OE (Cray version), Cray CS6400 servers
2nd generation: Solaris 2.5.1 OE, Sun Enterprise™ 10000 servers
3rd generation: Solaris 2.6 OE, Sun Enterprise 3000 to Sun Enterprise 10000 servers
4th generation: Solaris OE versions 7 and 8, Sun Enterprise 10000 server
5th generation: Solaris OE versions 8 and 9, Sun Fire 3800 to 15000 servers
In this article, we present the value that DR adds on Sun Fire 15K/12K servers, the concepts underlying the DR framework, and the proper planning of DR operations.
Thorough planning and testing of DR operations are the essential prerequisites for successful usage of DR on Sun Fire 15K/12K servers.
The goal of this two-part series is to provide the reader with the necessary knowledge to successfully integrate DR in existing data center processes and thereby maximize system availability.
Why Use DR?
The primary benefit of using DR is that it allows hardware reconfigurations to occur while domains are running Solaris OE and production applications. This section details other benefits of DR.
Maximizing System Availability
Configuring a domain for DR is one of the most important prerequisites in maximizing system availability.
The DR feature enables you to insert and remove hardware components without bringing the domain down. For example, DR can be used to unconfigure a faulty hardware component from a running domain so that the component can be removed from the platform. The replacement component can be inserted while the domain continues to run. DR then configures the hardware component into the domain.
Proactive hardware replacements (for example, DIMMs experiencing an increasing number of correctable errors) can be performed without waiting for a scheduled downtime.
In most cases, these replacements can be performed online while the domains are running the Solaris OE; therefore such replacements become transparent.
With DR, maintenance windows can be minimized and system service disruption and downtime can be reduced, thus maximizing availability.
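The replacement scenario described in this section can be sketched as an ordered plan. The following Python sketch is illustrative only: it assumes that the DR steps are driven with the Solaris cfgadm(1M) command and its -c unconfigure and -c configure state changes, and it uses a hypothetical attachment point ID (SB3). It only builds and prints the plan; it does not run any command. A real procedure typically includes additional steps (for example, disconnecting and powering off the board before it is physically removed) that are not modeled here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    description: str
    # Command to run in the domain; None marks a manual (physical) action.
    command: Optional[List[str]]


def board_replacement_plan(ap_id: str) -> List[Step]:
    """Return the ordered steps for replacing one component with DR.

    ap_id is the attachment point ID of the component (assumed value,
    supplied by the caller).
    """
    return [
        Step("Unconfigure the faulty component from the running domain",
             ["cfgadm", "-c", "unconfigure", ap_id]),
        Step("Remove the faulty component from the platform", None),
        Step("Insert the replacement component", None),
        Step("Configure the replacement component into the domain",
             ["cfgadm", "-c", "configure", ap_id]),
    ]


if __name__ == "__main__":
    # "SB3" is a hypothetical attachment point ID used for illustration.
    for number, step in enumerate(board_replacement_plan("SB3"), start=1):
        action = " ".join(step.command) if step.command else "(manual step)"
        print(f"{number}. {step.description}: {action}")
```

Keeping the plan as data rather than executing it directly makes it easy to review each step and record it in a runbook before anything is attempted on a production domain.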
Reduced Maintenance Costs
Using DR not only maximizes availability but also reduces downtime and its associated costs. Customers save on the cost of downtime as well as on service costs.
Scalability Without Downtime
DR allows you to upgrade existing Sun Fire 15K/12K domains. For example, you can add new hardware resources such as faster CPUs and additional memory or I/O interfaces. Such hardware reconfigurations can be performed while the domains are running the Solaris OE.
Simplified Resource Management
DR is the key technology for enabling dynamic provisioning of hardware resources. With DR, hardware resources can be dynamically provisioned to meet peak system load levels with better utilization of existing resources during non-peak periods.
Success Barriers
Although DR provides many benefits, there are some circumstances when the use of DR may not be appropriate.
Some customers might not allow DR because of datacenter policy or because they perceive the risk of using DR as too high. Other customers might be unwilling to use DR because they have not carefully prepared for its use on their Sun Fire 15K/12K servers and, therefore, cannot experience its benefits.
In some domain configurations, the domain might not be able to sustain the removal of hardware resources. Careful planning and architecting of domain configurations for DR can mitigate or eliminate these barriers.
Special restrictions do exist for dynamically removing components in clustered domains; however, DR is supported in Sun Java System Cluster 3.x configurations. These restrictions are addressed in detail in "Dynamic Reconfiguration in Clustered Domains" on page 42.
Domains containing third-party I/O cards require special planning and testing before DR operations are implemented in a production environment. All Sun PCI cards, firmware, and drivers that are certified for use in Sun Fire 15K/12K servers (apart from HIPPI/P 1.1) are considered DR-safe.
Conflicts can also arise with the Solaris OE and applications. Examples include the use of Intimate Shared Memory (ISM), kernel memory residing on the CPU/Memory board to be removed, processes bound to CPUs, and real-time processes and threads during Solaris OE quiesce (see "Operating System Quiescence" on page 15).
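The potential conflicts listed above can be turned into a simple pre-detach checklist. The following Python sketch is a hypothetical illustration, not part of the DR framework: the BoardUsage structure, its field names, and the detach_concerns function are assumptions introduced here, and the values would have to be gathered by the administrator. The sketch only reports which of the conditions named above apply; it does not decide whether a detach operation will succeed.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BoardUsage:
    """Hypothetical summary of what depends on the board to be removed."""
    ism_in_use: bool = False
    holds_kernel_memory: bool = False
    cpu_bound_processes: int = 0
    realtime_processes: int = 0


def detach_concerns(usage: BoardUsage) -> List[str]:
    """List the potential conflicts to review before detaching the board."""
    concerns: List[str] = []
    if usage.ism_in_use:
        concerns.append("Intimate Shared Memory (ISM) is in use")
    if usage.holds_kernel_memory:
        concerns.append("kernel memory resides on the board to be removed")
    if usage.cpu_bound_processes > 0:
        concerns.append(
            f"{usage.cpu_bound_processes} process(es) bound to CPUs")
    if usage.realtime_processes > 0:
        concerns.append(
            f"{usage.realtime_processes} real-time process(es) or thread(s) "
            "to consider during quiesce")
    return concerns


if __name__ == "__main__":
    # Example values are invented for illustration.
    usage = BoardUsage(holds_kernel_memory=True, cpu_bound_processes=2)
    for concern in detach_concerns(usage):
        print("Review before detach:", concern)
```

An empty result means only that none of the four conditions named in this section was flagged; other restrictions, such as those for clustered domains, still need to be checked separately.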
While some of these reasons are legitimate (for example, datacenter policy and configurations that do not allow DR), DR often goes unused because of misconceptions and a lack of preparation. This article addresses these misconceptions and the proper planning for DR, with the intent of broadening the acceptance of DR.