Preparing the System: Requirements
In this section, we provide the prerequisites and requirements that must be fulfilled to guarantee successful DR attach and detach operations on Sun Fire 15K/12K servers. Some prerequisites apply to all configurations; others apply only to specific configurations.
Keep in mind that all prerequisites that are valid for your particular configuration must be fulfilled before you attempt a DR operation.
Hardware, Software, and Firmware
The following are requirements and recommendations for hardware, software, and firmware.
Hardware
Full DR support, for example, DR operations on hsPCI assemblies, requires the AXQ 6.1 ASIC on the expander board and the Schizo 2.3 ASIC on the hsPCI I/O board.
NOTE
Do not perform DR operations on MaxCPU boards that would result in a split-expander configuration. POST does not allow MaxCPU boards to be configured in split-expander configurations.
Software
This section describes the requirements on the Solaris OE and patches for both slot 0 DR and slot 1 DR.
NOTE
Slot 0 DR denotes operations involving boards that physically reside in slot 0 of an expander. Slot 1 DR denotes operations involving boards that physically reside in slot 1 of an expander.
Slot 0 DR
For both the SCs and the domains, the minimum Solaris OE version for using slot 0 DR is Solaris 8 OE (02/02) or Solaris 9 OE (04/03) with patches. The minimum SMS software version is SMS 1.2.
In addition, patches are required for both the SMS software on the SC and the Solaris OE on the domains. The minimum patch levels for the SMS software and the Solaris OE are detailed in the Sun Fire 15K/12K DR Installation Guide and Release Notes. As a best practice, install the most recent patch levels. A listing of the currently available SMS patches for the SC can be obtained from the PatchPro Interactive tool on the SunSolve Web site at:
http://sunsolve.sun.com
Slot 1 DR
The minimum Solaris OE version for the SC is Solaris 8 OE (02/02). For domains, the minimum Solaris OE is Solaris 8 OE (02/02) with patches or Solaris 9 OE (04/03). The minimum SMS software for slot 1 DR is SMS 1.3.
The minimum patch levels for the SMS software and the Solaris OE are detailed in the Sun Fire 15K/12K DR Installation Guide and Release Notes. As a best practice, install the most recent patch levels. A listing of the currently available SMS patches for the SC can be obtained through the PatchPro Interactive tool at:
http://sunsolve.sun.com
The minimum software requirements for both slot 0 and slot 1 DR are listed in the following table.
TABLE 3 Minimum Software Requirements for Slot 0 and Slot 1 DR
DR Phase | SMS Software Version | System Controller | Domain
Slot 0 DR (Phase 1) | SMS 1.2 and patches | Solaris 8 OE (02/02) | Solaris 8 OE (02/02) and patches or Solaris 9 OE (04/03) and patches
Slot 1 DR (Phase 2) | SMS 1.3 | Solaris 8 OE (02/02) and patches | Solaris 8 OE (02/02) and patches or Solaris 9 OE (04/03)
Firmware
Every board in the system must have the correct firmware levels installed. The minimum firmware levels are as follows:
Slot 0 DR CPU/Memory board Local Power On Self Test (LPOST) version: 5.13.4 (or newer)
Slot 1 DR CPU/Memory board LPOST version: 5.14.2 (or newer)
NOTE
It is always best practice to have the latest firmware level installed. Check http://sunsolve.sun.com for the latest firmware version.
DR Attach
Before attaching a new component to a domain, verify that the new hardware is fault-free. Verifying a new component in a test domain first is a best practice.
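A minimal sketch of exercising a new board in a spare domain before moving it to production, run from the SC; SB4 (the new board) and domain R (an unused test domain) are hypothetical names, and the board must already be listed in the test domain's available component list:
# rcfgadm -d R -c configure SB4      # connect, POST-test, and configure the board into the test domain
# rcfgadm -d R -c unconfigure SB4    # release the board again after testing, for example, with SunVTS
# rcfgadm -d R -c disconnect SB4
The same operations can be run from the test domain itself with the cfgadm command, as shown in the code examples later in this article.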
CPU/Memory Board
Attaching a CPU/Memory board requires only a few prerequisites. The board must be powered on and must not belong to another domain. Also keep in mind the following prerequisites regarding the firmware and Capacity on Demand boards.
Firmware Version
Before inserting a new CPU/Memory board, verify the board contains the same firmware as all other boards in the platform. Use the flashupdate command to obtain the firmware levels and to upgrade or downgrade the firmware if necessary.
The firmware of the CPU/Memory boards should be at least firmware level 5.13.4 for slot 0 DR and firmware level 5.14.2 for slot 1 DR. As a best practice, always use the latest firmware level that is available at http://sunsolve.sun.com.
Capacity on Demand (COD) CPU/Memory Board
If the CPU/Memory board being attached is a capacity on demand (COD) board, verify that a Right To Use (RTU) license for each CPU has been purchased.
hsPCI I/O Board
This section describes the considerations that must be made when attaching an hsPCI I/O board.
Qualified PCI Adapters
All Sun PCI cards, firmware, and drivers that are certified for use in Sun Fire 15K/12K (apart from HIPPI/P 1.1) are considered DR-safe.
For third-party adapters, it is the responsibility of the vendor to ensure the adapter is suspend-safe. Before performing DR operations involving third-party adapters, it is always best practice to test the DR operation in a test domain using the PCI adapter with the same firmware.
PCI Adapter Testing
When attaching PCI adapters or I/O boards containing PCI adapters, the adapters are not tested by the Power On Self Test (POST) code; only the presence of the adapter is reported by POST. To avoid inserting faulty PCI cards into a production system, the PCI adapters should be tested in a test domain, for example, by running the Sun Validation Test Suite (SunVTS) software against a connected peripheral device or network.
I/O Board Testing
When an hsPCI assembly is attached into a domain, the board is automatically tested by POST. This testing requires dedicated CPU and memory resources. Because I/O boards do not contain processors or physical memory to run the POST tests, the Solaris OE must temporarily use a CPU and some memory from a CPU/Memory board (from the target domain to which the board will be attached) to test the board. The CPU and memory become logically isolated from the target domain.
If the applications cannot tolerate running with one less CPU, the DR attach of the I/O board cannot be performed. Also, if processes are bound to all CPUs on the CPU/Memory boards in the domain, no CPU can be taken offline and the DR attach operation fails. An administrator must rebind the processes to other CPUs by using the pbind or psrset commands.
When a CPU can be borrowed for testing the I/O board, the SC performs the POST test of the new board in a Transaction/Error cage so that transactions only go to the borrowed CPU and memory. Therefore, the new hsPCI I/O board is isolated and cannot inadvertently interrupt the running domain.
After the testing, the CPU and the memory are released again. However, during the POST testing in the attach phase, the borrowed CPU and memory are not available to the hosting domain.
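A minimal attach sketch, run from the domain and assuming the new hsPCI assembly appears as attachment point IO1 (a hypothetical name) and that a CPU can be borrowed for the POST testing described above:
# cfgadm -al IO1            # show the current state of the attachment point
# cfgadm -c configure IO1   # connect, POST-test, and configure the board into the domain
During the testing phase, the borrowed CPU and memory are unavailable to the domain; they are released again when the attach completes.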
MaxCPU Boards
Before attaching MaxCPU boards into a domain, the firmware must be checked and updated if necessary, as described in this section.
NOTE
Do not perform DR operations on MaxCPU boards that would result in a split-expander configuration. POST does not allow MaxCPU boards to be configured in split-expander configurations.
Firmware Version
Before inserting a new MaxCPU board, verify the board contains the same firmware as all other boards in the platform. Use the flashupdate command to verify the firmware levels and to upgrade or downgrade the firmware if necessary.
The firmware of the MaxCPU boards should be at least firmware level 5.14.2. As a best practice, always use the latest firmware level.
Check the PatchPro Interactive tool at http://sunsolve.sun.com for the latest firmware patch.
DR Detach
While a DR attach is a straightforward operation, detaching a component involves many more considerations. The DR detach operation requires strategic planning because it reduces the overall resources available to a domain. The ability to successfully perform a DR detach operation depends on how the OS uses the resources provided by the component to be detached. In some cases, it might not be possible to perform the DR detach operation at all.
The following sections describe the restrictions that impact DR detach operations.
CPU/Memory Board
This section describes the restrictions when detaching a CPU/Memory board and suggests possible solutions.
CPU and Memory Resource Removal
Detaching a CPU/Memory board reduces the domain's CPU processing power and memory capacity, which might have an impact on domain applications. As a general rule, do not attempt to detach a CPU/Memory board if CPU utilization is above 95 percent, if memory utilization is high, or if the available memory is less than the amount being detached.
To lessen the impact of hardware removal on DR detach operations, a hot spare CPU/Memory board can be attached to the domain prior to the detach operation. The new CPU/Memory board provides additional CPU processing power and memory capacity so that the performance impact of a DR detach operation can be made transparent.
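Before attempting the detach, the utilization guidelines above can be checked from the domain with standard Solaris tools; a minimal sketch, where the 5-second, three-sample intervals are arbitrary:
# mpstat 5 3    # per-CPU utilization; usr + sys consistently above 95 percent argues against a detach
# vmstat 5 3    # free memory (free column, in Kbytes) and paging activity
# swap -l       # configured swap devices and remaining free blocks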
Bound Processes
CPUs with bound processes cannot be detached. DR does not automatically rebind processes to other CPUs. If there are any bound processes to the CPU/Memory board being detached, the administrator must rebind the processes manually.
Processor sets must be removed using the psrset command, and bindings of processes to single processors must be removed using the pbind command.
The following code sample shows a DR detach operation of a CPU/Memory board with processes bound to one of the CPUs.
CODE EXAMPLE 5 DR Detach and Bound Processes
# cfgadm -c unconfigure SB1
Jan 30 14:01:42 15k-dom dr: WARNING: Failed to off-line: dr@0:SB1::cpu2
cfgadm: Hardware specific failure: unconfigure SB1: Failed to off-line: dr@0:SB1::cpu2
# Jan 30 14:01:42 15k-dom dr: WARNING: dr_pre_release_cpu: thread(s) bound to cpu 34
#
# ps -ealfP
 F S     UID   PID  PPID PSR  C PRI NI     ADDR   SZ    WCHAN    STIME TTY     TIME CMD
19 T    root     0     0   -  0   0 SY        ?    0          17:50:15 ?       0:00 sched
 8 S    root     1     0   -  0  40 20        ?  191        ? 17:50:15 ?       0:00 /etc/init -
...
 8 S    root  2176     1   -  0  40 20        ?  253        ? 17:52:07 ?       0:00 /opt/SUNWsymon/base/sbin/sparc-sun-
 8 S    root  1286     1  34  0  40 20        ?  617        ? 17:51:09 ?       0:00 /usr/lib/ssh/sshd
 8 S    root  2154   727   -  0  40 20        ?  342        ? 17:52:01 ?       0:09 esd - shell perftool-shell.tcl
 8 S    root  3432     1   -  0  40 20        ?  802        ? 13:58:03 ?       0:00 /usr/lib/rcm/rcm_daemon
 8 S    root  2179   727   -  0  40 20        ?  132        ? 17:52:45 ?       0:00 sh
 8 O    root  3448  1253   -  0  53 20        ?  419          14:02:22 console 0:00 ps -ealfP
#
# pbind -u 1286
process id 1286: was 34, now not bound
# cfgadm -c unconfigure SB1
Jan 30 14:02:50 15k-dom dr: OS unconfigure dr@0:SB1::cpu0
...
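The example above removes a binding to a single processor with pbind. If the bound processes belong to a processor set instead, the set must be handled with psrset; a minimal sketch, where processor set ID 1 and <pid> are hypothetical placeholders:
# psrset -i          # display the configured processor sets and their processors
# psrset -u <pid>    # unbind a process from its processor set
# psrset -d 1        # destroy the processor set so that its CPUs can be taken offline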
Swap Space
Verify that there is enough swap space available before a CPU/Memory board is detached. When memory is detached, there must be enough memory and disk swap space remaining in the domain to accommodate all of the currently running programs. During a DR detach operation, swap space is used to store memory pages when the DR drain flushes pageable memory from the detaching system board. Therefore, the minimum amount of swap space is the maximum memory size of any board in the domain. Following the general rule of configuring twice the maximum amount of main memory on any CPU/Memory board in the domain ensures that enough swap space is always available for DR operations.
Insufficient swap space prevents DR from completing the DR detach operation. The memory drain phase of the DR detach operation is not completed.
DR does not check the swap space configuration before starting the DR detach operation. The administrator must check how much memory will be drained (using the cfgadm command), then check if the current swap space is sufficient (using the swap -l command).
NOTE
These recommendations are solely for using DR. Swap space requirements for applications are not included and must be considered separately.
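The check described above can be performed as follows; SB4 is a hypothetical board name, and the cfgadm output format matches CODE EXAMPLE 9 later in this article:
# cfgadm -av | grep SB4::memory    # reports the total (and permanent) KBytes of memory on the board
# swap -l                          # lists the swap devices; the free column is in 512-byte blocks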
Intimate Shared Memory
Intimate shared memory (ISM) pages cannot be paged out and, therefore, cannot be released from a CPU/Memory board to be detached. ISM pages must be relocated to a new CPU/Memory board on DR detach operations.
ISM is used in database applications such as ORACLE, Sybase, and Informix. Detaching a CPU/Memory board containing ISM pages can impact database performance and can take a long time due to the relocation of ISM pages. As with other restrictions, it is best practice to test DR operations on configurations containing ISM pages before implementing them in a production environment.
It is recommended that you check for ISM pages before detaching a CPU/Memory board. ISM allocations can be seen on a live domain with the ipcs(1M) command. A separate core analysis tool, scat (the Solaris Core Analysis Tool), can be downloaded from SunSolve. This tool provides a method of mapping ISM pages to the physical address space that is assigned to each board. In this manner, you can map ISM pages to a board to predict the effect of a detach operation on an application. The following code example shows this analysis with scat.
CODE EXAMPLE 6 Locating Non-Pageable Memory Using scat
# scat(vmcore.2) ipc -M
...
0x1017c800000 - 0x1017cbfffff (4M)
0x10188800000 - 0x10188bfffff (4M)
0x10189800000 - 0x10189bfffff (4M)
...
0x101ffc62000 - 0x101ffc63fff (8K)
0x1217c800000 - 0x1217cbfffff (4M)
0x12188800000 - 0x12188bfffff (4M)
0x12189800000 - 0x12189bfffff (4M)
...
0x121fdd4e000 - 0x121fdd4ffff (8K)
0x121fec6a000 - 0x121fec6bfff (8K)
0x1417c000000 - 0x1417c3fffff (4M)
# scat(vmcore.2) seg phys
...
phys_install list:
pa range                        size
===========================     ==============
0x10000000000-0x101ffffffff     0x200000000 (8G)
0x12000000000-0x121ffffffff     0x200000000 (8G)
0x14000000000-0x141ffffffff     0x200000000 (8G)
Note that there are ISM chunks on three boards, although the third board has only a single segment. The seg phys command maps the physical address ranges to all boards in the domain at the time it is executed.
In the previous code example, the following mapping is derived:
First CPU/Memory board has 3x4M and 1x8K segments
Second CPU/Memory board has 3x4M and 2x8K segments
Third CPU/Memory board has 1x4M segment
For more information and to get the latest Solaris Core Analysis Tool software, refer to:
http://wwws.sun.com/software/download/operating_sys.html
Detach of CPU/Memory Board with Kernel Memory
The kernel permanent memory cannot be paged or swapped out. Therefore, detaching the CPU/Memory board containing the kernel memory requires OS quiescence. During the OS quiescence, the kernel memory is relocated to another CPU/Memory board in the domain.
The location of the kernel permanent memory can be determined by using the cfgadm command from the domain.
# cfgadm -alv | grep SB | grep permanent
or by using the rcfgadm command from the SC:
# rcfgadm -d <domain_id> -alv | grep SB | grep permanent
where <domain_id> represents the domain (A through R for a Sun Fire 15K server, or A through I for a Sun Fire 12K server).
Before the OS can achieve quiescence, it must temporarily suspend all execution threads (processes, CPUs, device activities). In general, if the detach operation fails due to a process that will not suspend, it is a temporary condition. Verify that the OS could not quiesce due to a failure to suspend a process, then retry the operation.
In particular, pay attention to the following considerations when you plan to detach a CPU/Memory board with kernel memory:
1. Relocating the Kernel Memory: A second CPU/Memory board containing the same or a larger amount of memory than the CPU/Memory board being detached must be available to relocate the kernel permanent memory as a single, contiguous slice of physical memory. If no suitable target memory on another CPU/Memory board can be found, the relocation of the kernel memory aborts. To provide suitable target memory, a hot spare CPU/Memory board can be attached to the domain prior to the detach operation.
2. Running Real Time (RT) Processes in Solaris 8 OE: Detaching the CPU/Memory board with the kernel memory fails if the domain is running Solaris 8 OE and has RT threads in use. The DR framework does not suspend the system if there is an RT process in the system.
NOTE
In Solaris 9 OE, the RT check is no longer performed. Therefore, RT threads no longer block copy rename operations on Solaris 9 OE domains.
The most common RT process is the Network Time Protocol (NTP) daemon. Other applications might be scheduled as real time, for example, Sun Java System Cluster and ORACLE LMS and LMD processes.
To perform the detach operation on a Solaris 8 OE domain when the DR detach of the board with kernel memory is blocked by RT threads, use one of the following workarounds (a command sketch follows this list):
If possible, stop the RT process or put the RT process temporarily into the timeshare (TS) class by using the priocntl command.
Disable the checks for user thread activity by adding the following entry to the /etc/system file: set dr:dr_skip_user_threads = 1.
Force the DR detach operation by using the -f option of the cfgadm command.
3. Using the Correct Quad FastEthernet (QFE) Cards Patch: When detaching the CPU/Memory board containing the kernel memory, verify that the Sun Quad FastEthernet qfe driver patch 108806-14 is not installed if QFE cards are configured in the domain. Instead, use patch 108806-17 (or newer) on Solaris 8 OE systems and patch 112764-06 (or newer) on Solaris 9 OE systems.
4. Using the Correct Fibre Channel Cards in Fabric Mode Patch: In a domain with fibre channel cards in fabric mode, verify that the fctl/fp/fcp/usoc driver patch 111095-13 (or newer) for Solaris 8 OE or patch 113040-04 (or newer) for Solaris 9 OE is installed before attempting to detach the CPU/Memory board containing the kernel memory. Otherwise, access to all disks attached via QLogic-driven fibre channel cards is lost.
5. Using the Correct Sun StorEdge Traffic Manager Software Patch: Any attempt to detach the CPU/Memory board with the kernel memory while the Sun StorEdge Traffic Manager multi-pathing software is running causes a stack overflow panic. To avoid this panic, verify that kernel patch 108528-15 (or newer) for Solaris 8 OE or kernel patch 112233-01 (or newer) for Solaris 9 OE is installed.
6. Understanding Restrictions with Clustered Domains: Although DR is supported in a clustered environment, some restrictions apply when performing a DR detach of the board with kernel memory in a clustered domain. In a Sun Java System Cluster software configuration, the following operations are not supported:
DR detach of a board containing permanent memory: Sun Java System Cluster software does not support detach operations on permanent memory, and the system prevents such operations.
DR detach of quorum devices: DR detach operations cannot be performed on a device currently configured as a quorum device.
DR detach on active devices on the primary node: DR detach operations on active devices in the primary node are not allowed. DR operations can be performed on non-active devices in the primary node and on any devices in the secondary node.
DR detach on an active private interconnect interface: DR detach operations cannot be performed on active private interconnect interfaces. A workaround is to disable and remove the interface from the active interconnect.
Refer to the Sun Cluster 3.0 System Administration Guide for details on DR in clustered Sun Fire 15K/12K domains.
Similar restrictions apply for third-party cluster products. For Veritas Cluster software, the Veritas Cluster Server (VCS) must be stopped when detaching the CPU/Memory board with the kernel memory. Refer to the Veritas Cluster Server Application Note Sun Fire 12K/15K Dynamic Reconfiguration for planning DR operations in VCS clustered domains.
7. Handling Suspend-Unsafe Device Drivers: When DR suspends the OS, all of the device drivers attached to the OS must also be suspended. If a driver does not support the DDI suspend and resume functions, the DR operation fails. Suspend-unsafe drivers continue to access memory or send interrupts while the OS is quiesced, in some cases leading to domain stops (dstops) or unscheduled domain interrupts. Suspend-unsafe drivers can be added to the unsafe driver list in the dr.conf file to prevent the devices from accessing memory or sending interrupts during OS quiescing.
8. Using Third-Party Adapters and Driver Software: Sun does not qualify all third-party PCI adapters and driver software with DR. It is the responsibility of the vendor to ensure that the adapter, the driver stack, and the adapter firmware are suspend-safe. A conservative approach is to add third-party device drivers to the unsafe driver list when it is not known whether the device driver is suspend-safe. Before detaching the CPU/Memory board with kernel memory, the administrator can manually suspend I/O to the device and unload the driver.
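The workarounds in Step 2 can be expressed as commands; the following minimal sketch assumes the blocking RT process is an NTP daemon named xntpd and that SB4 is the board being detached (both names are hypothetical):
# priocntl -d -i pid `pgrep xntpd`        # display the scheduling class of the RT process
# priocntl -s -c TS -i pid `pgrep xntpd`  # move the process temporarily into the timeshare class
# cfgadm -f -c unconfigure SB4            # alternatively, force the detach with the -f option
Entries in /etc/system are read at boot time, so the dr_skip_user_threads setting takes effect only after a reboot:
set dr:dr_skip_user_threads = 1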
Page Retirement
Solaris 8 OE kernel update patch 108528-23 (and older) and Solaris 9 OE kernel update patch 112233-08 (and older) contain a rudimentary implementation of page retirement. The page retirement mechanism is disabled by default and should not be enabled; otherwise, DR operations on system boards containing known retired pages hang.
For information on the page retirement feature, refer to the Sun BluePrints OnLine article titled "Solaris Operating System Availability Features."
hsPCI I/O Board
This section describes the restrictions that exist when detaching an hsPCI assembly and suggests possible solutions.
Stopping I/O Device Activity
Before an hsPCI board with I/O devices can be unconfigured (and eventually removed) from the domain, all of its devices must be closed and all of its file systems must be unmounted. The following list provides tasks that should be performed prior to detaching an hsPCI board; a command sketch follows the list.
Use the fuser command to see which processes have a device open.
Run the showdevices command on the SC to determine the state and usage of devices.
Either kill any process that directly opens a device or raw partition, or direct it to close the open device on the board.
Unmount all file systems hosted by the board to be detached unless multi-pathing software, for example, Sun StorEdge Traffic Manager Software, is configured.
If disk mirroring is used, reconfigure the device so that it is accessible by an alternate path on another I/O board.
Remove multi-pathing or logical volume databases from board-resident partitions. The location of multi-pathing databases is explicitly chosen by the user and can be changed.
Remove any private regions used by volume managers. By default, volume managers use a private region on each device that they control. Such devices must be removed from volume manager control before they can be detached.
Remove any swap files or disk partitions hosted by the board to be detached from the swap configuration. All swap devices on the board to be detached must be deleted using the swap command and removed from the /etc/vfstab file.
If a suspend-unsafe device is present on the board, close all instances of the device and use the modunload command to unload the driver.
Verify that IPMP is properly configured if the PCI card to be detached has a NIC. Alternatively, move the network interface to another I/O board by using the ifconfig command.
Reconfigure the dump configuration using the dumpadm command to use a dump device on another board.
Verify that there is sufficient swap space prior to detaching an I/O board.
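A minimal sketch of the most common of these tasks, reusing the device names that appear in CODE EXAMPLE 7 below (qfe0, /export/home, and c0t1d0s1); the alternate dump device path and the module ID are hypothetical and vary per system:
# fuser -c /export/home          # list processes holding the file system open
# umount /export/home            # unmount the file system hosted by the board
# swap -d /dev/dsk/c0t1d0s1      # remove the board-hosted swap device (also update /etc/vfstab)
# dumpadm -d /dev/dsk/c2t0d0s1   # move the dump device to a disk on another board
# ifconfig qfe0 down
# ifconfig qfe0 unplumb          # take down and unplumb the network interface on the board
# modinfo | grep qfe             # find the module ID of a driver to unload
# modunload -i <module_id>       # unload a suspend-unsafe or detach-unsafe driver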
If you do not check these conditions prior to detaching an I/O board, a message similar to the following might be displayed:
CODE EXAMPLE 7 Detaching an I/O Board: Error Message
# cfgadm -c unconfigure IO1
cfgadm: Library error:
Resource                                    Information
---------------------------------------------------------------
/devices/pci@3d,700000/pci@1/SUNW,qfe@0,1   Network interface qfe0
SUNW_network/qfe0                           qfe0 hosts IP addresses: 192.168.210.255
/dev/dsk/c0t1d0s0                           mounted filesystem "/"
/dev/dsk/c0t1d0s1                           swap area
/dev/dsk/c0t1d0s1                           dump device (swap)
/dev/dsk/c0t1d0s7                           mounted filesystem "/export/home"
On DR detach of a single PCI card, an error message similar to the following might be displayed:
CODE EXAMPLE 8 Detaching a Single PCI Card: Error Message
# cfgadm -c disconnect pcisch7:e02b1slot2
cfgadm: Component system is busy, try again:
Resource                           Information
--------------------------------   -------------------------------
/devices/pci@5d,600000/network@1   Network interface ce2
SUNW_network/ce2                   ce2 hosts IP addresses: 192.
Qualified PCI Adapters
For a device to be detachable, it must support the DDI_DETACH function (that is, the device driver must be detach-safe), or its device driver must not currently be loaded into memory.
All Sun PCI cards, firmware, and drivers that are certified for use in Sun Fire 15K/12K (apart from HIPPI/P 1.1) are considered DR-safe.
For third-party adapters, the vendor is responsible for verifying that the adapter is suspend-safe. Before performing DR operations in a production domain, install and test the adapter in a test domain.
Alternate Paths to Storage and Network
Before attempting to detach an hsPCI board, verify for each I/O adapter being detached that an alternate path to system critical resources on another hsPCI board exists. If one of the physical paths is disabled, I/O access continues on the remaining physical path with little interruption. Alternate paths to storage and networks can be provided by multi-pathing software. The Solaris OE provides two forms of multi-pathed I/O:
For network paths: Internet Protocol Multi-pathing (IPMP)
For storage paths: Multiplexed I/O (Sun StorEdge Traffic Manager Software)
The multi-pathing software must be DR-safe. When using Sun StorEdge Traffic Manager Software, see "Detach of CPU/Memory Board with Kernel Memory" on page 35, Step 5, for detailed information.
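As an illustration of the network case, two interfaces on different I/O boards can be placed into the same IPMP group so that traffic fails over when one board is detached; the interface names and group name below are hypothetical, and a complete Solaris 8/9 IPMP configuration also requires test addresses for the in.mpathd daemon:
# ifconfig qfe0 group production     # interface on the board to be detached
# ifconfig ce2 group production      # alternate interface on another I/O board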
Storage and Network Bandwidth Resource Removal
Detaching I/O boards or single PCI adapters can result in loss of bandwidth. Before detaching I/O components, verify that the bandwidth provided by a single path to storage and networks is sufficient for the running applications. This bandwidth is particularly important for I/O intensive applications.
Third Party Adapters and Driver Software
Sun does not qualify all third-party PCI adapters and driver software with DR. The vendor is responsible for verifying that the adapter, driver stack, and adapter firmware are suspend-safe. For more information about suspend-safe and suspend-unsafe devices, see Step 7 of "Detach of CPU/Memory Board with Kernel Memory" on page 35.
A conservative approach is to add third-party device drivers to the unsafe driver list when it is not known whether the device driver is suspend-safe. Before detaching the CPU/Memory board with kernel memory, the administrator can manually suspend I/O to the device and unload the driver.
Clustered Domains
Although DR is supported in Sun Java System Cluster software configurations, a few restrictions apply when detaching hsPCI boards or I/O cards. In a Sun Java System Cluster software configuration, the following operations are not supported:
DR detach of quorum device: DR detach operations cannot be performed on a device that is currently configured as a quorum device.
DR detach on active devices on primary node: DR detach operations on active devices in the primary node are not allowed. DR operations can be performed on non-active devices in the primary node and on any devices in the secondary node.
DR detach on active private interconnect interface: DR detach operations cannot be performed on active private interconnect interfaces. A workaround is to disable and remove the interface from the active interconnect.
Refer to the Sun Cluster 3.0 System Administration Guide for details on DR in clustered Sun Fire 15K/12K domains.
Similar restrictions apply for third-party cluster products. For Veritas Cluster software, the VCS must be stopped when detaching the CPU/Memory board with the kernel memory. Refer to the Veritas Cluster Server Application Note Sun Fire 12K/15K Dynamic Reconfiguration for planning DR operations in VCS clustered domains.
Swap Space
Verify that there is enough swap space available if an I/O board is detached. When disk swap space is detached, there must be enough disk-swap space remaining in the domain to accommodate all currently running programs.
MaxCPU Board
Before detaching a MaxCPU board from a domain, check whether any of the following restrictions on its CPUs apply.
Bound Processes
CPUs with bound processes cannot be detached, and DR does not automatically rebind processes to other CPUs. If any processes are bound to CPUs on the board being detached, the administrator must rebind them manually by using the pbind or psrset command.
CPU Resource Removal
Detaching a MaxCPU board reduces the domain's CPU processing power, which might have an impact on domain applications. As a general rule, do not attempt to detach a MaxCPU board if CPU utilization is above 95 percent.
To lessen the impact of hardware removal on DR detach operations, a hot spare CPU/Memory board or MaxCPU board can be attached to the domain prior to the detach operation. The new board provides additional CPU processing power so that the performance impact of a DR detach operation can be made transparent.
DR and Applications
This section presents the restrictions that apply to DR in clustered domains and in domains running database applications, and describes the interplay between DR and multi-pathing software.
Dynamic Reconfiguration in Clustered Domains
Although DR is supported in a clustered environment, some restrictions apply when performing a DR detach of a board with kernel memory in a clustered domain. In a Sun Java System Cluster software configuration, the following operations are not supported:
DR detach of a board containing permanent memory: Sun Java System Cluster software does not support detach operations on permanent memory, and the system prevents such operations.
DR detach of a quorum device: DR detach operations cannot be performed on a device currently configured as a quorum device.
DR detach on active devices on a primary node: DR detach operations on active devices in the primary node are not allowed. DR operations can be performed on non-active devices in the primary node and on any devices in the secondary node.
DR detach on an active private interconnect interface: DR detach operations cannot be performed on active private interconnect interfaces. A workaround is to disable and remove the interface from the active interconnect.
The following code sample shows the result of attempting a DR detach operation of the CPU/Memory board containing the kernel memory.
CODE EXAMPLE 9 DR Detach Operation of a CPU/Memory Board with Kernel Memory
# cfgadm -av | grep permanent
SB15::memory connected configured ok base address 0x1e000000000, 8388608 KBytes total, 1061136 KBytes permanent
# cfgadm -c disconnect SB15
System may be temporarily suspended, proceed (yes/no)? yes
cfgadm: Library error:
Resource   Information
--------   -----------
SUNW_OS    Sun Cluster
To successfully perform the DR detach operation, the domain must first be configured out of the cluster. The scstat command confirms that the domain is no longer a cluster member:
# /usr/cluster/bin/scstat
scstat: not a cluster member.
Afterwards, the board can be disconnected, as shown in the following code sample.
CODE EXAMPLE 10 Disconnecting a CPU/Memory Board
# cfgadm -c disconnect SB15
System may be temporarily suspended, proceed (yes/no)? yes
Oct 22 10:16:47 15k-dom dr: OS unconfigure dr@0:SB15::cpu0
Oct 22 10:16:47 15k-dom dr: OS unconfigure dr@0:SB15::cpu1
Oct 22 10:16:48 15k-dom dr: OS unconfigure dr@0:SB15::cpu2
Oct 22 10:16:48 15k-dom dr: OS unconfigure dr@0:SB15::cpu3
Oct 22 10:16:48 15k-dom dr: OS unconfigure dr@0:SB15::memory
DR: suspending user threads...
DR: checking devices...
DR: suspending kernel daemons...
DR: suspending drivers...
suspending ssd@g60020f20000097a23cd9a9c9000cacf0
suspending ssd@g60020f20000097a23cd9a987000536a1
...
resuming ssd@g60020f20000097a23cd9a987000536a1
resuming ssd@g60020f20000097a23cd9a9c9000cacf0
DR: resuming user threads...
DR: resume COMPLETED
Refer to the Sun Cluster 3.0 System Administration Guide for details about DR in clustered Sun Fire 15K/12K domains.
Similar restrictions apply for third-party cluster products. Note that the restrictions for clustered domains are in addition to all other restrictions for non-clustered domains. For example, hardware components must be redundantly configured before a DR detach operation can be performed.
For Veritas Cluster software, the Veritas Cluster Server (VCS) must be shut down and the Global Atomic Broadcast (GAB) and Low Latency Transport (LLT) must be unconfigured before detaching the CPU/Memory board with the kernel memory. Similarly, before detaching a network interface card used for the VCS private heartbeat, the VCS must be stopped. Veritas advises persistently freezing the service groups running in a domain and stopping VCS before a DR operation is performed.
Refer to the Veritas Cluster Server Application Note Sun Fire 12K/15K Dynamic Reconfiguration for planning DR operations in VCS clustered domains.
Dynamic Reconfiguration and Database Applications
Databases make extensive use of shared memory to cache frequently-used data (the buffer cache) and to facilitate inter-process communication. The Solaris OE provides shared memory capability in the form of ISM. The size of ISM is fixed and cannot be changed dynamically.
Performing DR detach operations on domains running database applications such as ORACLE imposes restrictions on the physical memory configuration. At a minimum, the physical memory provided by the CPU/Memory boards in the domain must be large enough to contain the kernel, the shared memory segments (such as the system global area, or SGA, of ORACLE databases), and the database instances.
Database applications such as ORACLE use intimate shared memory, which is non-pageable. Thus, on DR detach operations, the shared memory segments must be relocated. If there is not sufficient physical memory available for relocating the shared memory segments, the DR detach operation fails. Because the size of the ISM segments cannot be changed dynamically, large ISM segments impose a barrier to successful DR detach operations.
Changing the size of the ISM segments would require a shutdown of the database instance, which would directly impact application availability.
Additionally, ISM segments are automatically locked by the kernel when the segments are created. Hence, pages involved in I/O operations are not accessible to DR operations until the I/O operation completes. This process lengthens the DR operation on a heavily loaded domain.
The location of ISM segments can be identified by running the ipcs command. Note that the cfgadm command does not report the location of shared memory segments.
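As an illustration, the shared memory segments and their sizes on a live domain can be listed as follows; mapping segments to a particular board still requires a tool such as scat, as shown in CODE EXAMPLE 6:
# ipcs -mb    # shared memory segments; the SEGSZ column gives each segment's size in bytes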
To lessen the impact of ISM pages on DR detach operations, support for Dynamic Intimate Shared Memory (DISM) was introduced in Solaris 8 OE (01/01) and Solaris 9 OE. DISM allows active shared memory segments to be resized dynamically. A large DISM segment can be created when the database instance is started, and sections of the DISM segment can be locked or unlocked as memory requirements change.
DISM memory is locked by the application rather than by the kernel, as is the case with ISM. Because the application performs the locking, it can lock additional memory when the shared memory segments need to grow and unlock memory when the shared memory segments need to shrink in size.
Reducing the amount of ISM pages increases the probability of a successful and fast DR detach operation.
In addition to the previously mentioned minimum revisions of the Solaris OE, the database applications must also support DISM. The first major application to support DISM was Oracle 9i, where DISM is used for the dynamic SGA capability.
For more information on the interplay of DISM and DR, refer to the Sun BluePrints OnLine article titled "Dynamic Reconfiguration and Oracle 9i Dynamically Resizable SGA" and the Sun BluePrints book titled "Configuring and Tuning Databases on the Solaris Platform."
Dynamic Reconfiguration and Multi-Pathing Software
Before performing DR detach operations on I/O boards, verify that a redundant path to the network and storage resources on another I/O board exists. In the same way, detaching a single PCI card requires a redundant path via an alternative PCI card. Although the alternate PCI card need not be located on a different I/O board, it is best practice to configure alternate resources on different boards to maximize redundancy.