DR Architecture
This section describes the DR architecture of the Sun Fire 15K/12K servers.
Dynamically Reconfigurable Components
Hardware components that can be dynamically reconfigured are called domain configurable units (DCUs). The following DCUs can be dynamically reconfigured:
System board types:
CPU/Memory board
hsPCI assembly (I/O board)
MaxCPU board
System board components:
An individual CPU
All of the memory on a CPU/Memory board
Any I/O device, controller, or bus
Note that several hardware components are not DCUs, for example, expander boards and centerplane support boards. System controllers (SCs), power supplies, fan trays, and centerplane support boards are fully redundant and hot-swappable. Expander boards are not fully redundant, but can be hot-swapped once the system board or I/O board installed in the expander has been replaced under the control of DR.
DR Concepts
This section introduces the DR framework and provides definitions of DR concepts.
Attachment Points
DR operates on a variety of devices across Sun platforms. To keep the implementation consistent across those platforms, DR uses a common model of the logical places in a system where it can operate. Such a place is called an attachment point: a location where hardware resources can be added or removed while the Solaris OE is running. Physical insertion or removal of hardware resources occurs at attachment points and results in a receptacle gaining or losing an occupant.
Physically and collectively, an attachment point refers to a board slot (the receptacle), a board installed in the slot, and any devices connected to the board (the occupant).
Without exception, configuration administration operations, attach or detach, are performed on attachment points in Sun Fire 15K/12K domains.
To be visible to a domain, an attachment point must be available to that domain; that is, it must be in the domain's available component list (ACL) and must not be assigned to another domain.
The state of the attachment point refers to the condition of the receptacle and occupant. You can use the cfgadm command or the Sun Management Center software DR module to retrieve the status of attachment points, occupants, and receptacles along with their condition.
The following example shows a detailed listing of all logical attachment points for a sample domain.
CODE EXAMPLE 1 Logical Attachment Points for a Domain
# cfgadm -al
Ap_Id                Type         Receptacle   Occupant       Condition
IO1                  HPCI         connected    configured     ok
IO1::pci0            io           connected    configured     ok
IO1::pci1            io           connected    configured     ok
IO1::pci2            io           connected    configured     ok
IO1::pci3            io           connected    configured     ok
IO2                  HPCI         connected    configured     ok
IO2::pci0            io           connected    configured     ok
IO2::pci1            io           connected    configured     ok
IO2::pci2            io           connected    configured     ok
IO2::pci3            io           connected    configured     ok
SB1                  CPU          connected    configured     ok
SB1::cpu0            cpu          connected    configured     ok
SB1::cpu1            cpu          connected    configured     ok
SB1::cpu2            cpu          connected    configured     ok
SB1::cpu3            cpu          connected    configured     ok
SB1::memory          memory       connected    configured     ok
SB2                  CPU          connected    configured     ok
SB2::cpu0            cpu          connected    configured     ok
SB2::cpu1            cpu          connected    configured     ok
SB2::cpu2            cpu          connected    configured     ok
SB2::cpu3            cpu          connected    configured     ok
SB2::memory          memory       connected    configured     ok
c0                   scsi-bus     connected    configured     unknown
c0::dsk/c0t0d0       disk         connected    configured     unknown
c0::dsk/c0t1d0       disk         connected    configured     unknown
c1                   scsi-bus     connected    unconfigured   unknown
pcisch0:e01b1slot1   pci-pci/hp   connected    configured     ok
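A listing like the one above is easy to post-process in a script. The following is a minimal sketch, not part of the DR software, that parses the five-column output into records. The names AttachmentPoint and parse_cfgadm_listing are hypothetical, and the parser assumes that no column value contains white space, which holds for the sample shown here.

```python
from typing import List, NamedTuple


class AttachmentPoint(NamedTuple):
    """One row of a `cfgadm -al` listing (hypothetical record type)."""
    ap_id: str
    ap_type: str
    receptacle: str
    occupant: str
    condition: str


def parse_cfgadm_listing(text: str) -> List[AttachmentPoint]:
    """Parse five-column `cfgadm -al` output into records.

    Assumption: no field contains white space, so a plain split() is enough.
    """
    points = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, the command line itself, and the column header.
        if len(fields) != 5 or fields[0] == "Ap_Id":
            continue
        points.append(AttachmentPoint(*fields))
    return points


# A shortened copy of the listing above, used as sample input.
SAMPLE = """\
# cfgadm -al
Ap_Id            Type      Receptacle  Occupant      Condition
IO1              HPCI      connected   configured    ok
IO1::pci0        io        connected   configured    ok
SB1              CPU       connected   configured    ok
SB1::memory      memory    connected   configured    ok
c1               scsi-bus  connected   unconfigured  unknown
"""

points = parse_cfgadm_listing(SAMPLE)
# Dynamic attachment points carry a "::" separator in their Ap_Id.
dynamic = [p.ap_id for p in points if "::" in p.ap_id]
print(dynamic)
```

Filtering on the "::" separator, as in the last lines, separates board-level attachment points from their components.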
Receptacle: Slot
A Sun Fire 15K server has 18 full-bandwidth slots (slot 0) and 18 half-bandwidth slots (slot 1). A Sun Fire 12K server has 9 full-bandwidth slots (slot 0) and 9 half-bandwidth slots (slot 1).
At any time, the state of a receptacle can be as follows:
Not Visible: cfgadm is not aware of the receptacle
Empty: visible but unoccupied
Disconnected: occupied and not part of the physical domain
Connected: occupied and part of the physical domain
FIGURE 1 shows the state of a receptacle.
FIGURE 1 Receptacle State
Occupant: Board
An occupant corresponds to a board. Occupants are CPU/Memory boards (SBx), hsPCI I/O boards (IOx), and MaxCPU boards (IOx), where x represents the number of a particular board (0 through 17 for a Sun Fire 15K server and 0 through 8 for a Sun Fire 12K server). At any one time, the state of an occupant can be as follows:
Unconfigured: all components on the board are unconfigured
Configured: at least one component on the board is configured
FIGURE 2 shows the state of an occupant.
FIGURE 2 Occupant State
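The receptacle and occupant states described above can be captured in a small model. This is an illustrative sketch only: the enum and function names are hypothetical, and the consistency rule in is_consistent (a configured occupant requires a connected receptacle) is an assumption drawn from the state tables later in this article, not a statement about the cfgadm implementation.

```python
from enum import Enum


class Receptacle(Enum):
    NOT_VISIBLE = "not visible"    # cfgadm is not aware of the receptacle
    EMPTY = "empty"                # visible but unoccupied
    DISCONNECTED = "disconnected"  # occupied, not part of the physical domain
    CONNECTED = "connected"        # occupied and part of the physical domain


class Occupant(Enum):
    UNCONFIGURED = "unconfigured"
    CONFIGURED = "configured"


def board_occupant_state(component_states):
    """A board is configured if at least one of its components is configured."""
    if any(state is Occupant.CONFIGURED for state in component_states):
        return Occupant.CONFIGURED
    return Occupant.UNCONFIGURED


def is_consistent(receptacle, occupant):
    """Assumption: only a connected receptacle can hold a configured occupant."""
    return occupant is Occupant.UNCONFIGURED or receptacle is Receptacle.CONNECTED
```

For example, a board with one configured CPU and unconfigured memory still reports the configured occupant state.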
Dynamic Attachment Points
A dynamic attachment point is named relative to a base attachment point present in the system. The dynamic component is hardware-specific and generated by the corresponding hardware-specific library. Dynamic attachment points are essentially board components, for example, CPUs, memory, etc.
A CPU/Memory board provides the following five dynamic attachment points:
SB<board_id>::cpu0
SB<board_id>::cpu1
SB<board_id>::cpu2
SB<board_id>::cpu3
SB<board_id>::memory
where board_id represents the number of a board (0 through 17 for a Sun Fire 15K server and 0 through 8 for a Sun Fire 12K server).
An hsPCI assembly has four dynamic attachment points for the PCI busses:
IO<board_id>::pci0
IO<board_id>::pci1
IO<board_id>::pci2
IO<board_id>::pci3
where board_id represents the number of a board (0 through 17 for a Sun Fire 15K server and 0 through 8 for a Sun Fire 12K server).
A MaxCPU board provides the following dynamic attachment points:
IO<board_id>::cpu0
IO<board_id>::cpu1
where board_id represents the number of a board (0 through 17 for a Sun Fire 15K server and 0 through 8 for a Sun Fire 12K server).
The receptacle state of a dynamic attachment point must always be connected. For example:
# cfgadm -v -c disconnect IO16::pci3
cfgadm: Library error: command not supported: disconnect pci
The receptacle state of a dynamic attachment point can never be in the disconnected state; however, its occupant state can be either unconfigured or configured.
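The naming rules above can be expressed as a short helper. The sketch below generates the dynamic attachment point names for each board type and encodes the rule that a dynamic attachment point cannot be disconnected. The board type keys ("cpu_memory", "hspci", "maxcpu") and the function names are hypothetical labels chosen for this example.

```python
# Prefix and component names per board type, as listed in this section.
DYNAMIC_COMPONENTS = {
    "cpu_memory": ("SB", ["cpu0", "cpu1", "cpu2", "cpu3", "memory"]),
    "hspci": ("IO", ["pci0", "pci1", "pci2", "pci3"]),
    "maxcpu": ("IO", ["cpu0", "cpu1"]),
}

# Highest valid board number per platform (boards are numbered from 0).
MAX_BOARD_ID = {"15K": 17, "12K": 8}


def dynamic_attachment_points(board_type, board_id, platform="15K"):
    """Return the dynamic attachment point names for one board."""
    if not 0 <= board_id <= MAX_BOARD_ID[platform]:
        raise ValueError(f"board_id {board_id} is out of range for a Sun Fire {platform}")
    prefix, components = DYNAMIC_COMPONENTS[board_type]
    return [f"{prefix}{board_id}::{name}" for name in components]


def can_disconnect(ap_id):
    """Only base attachment points (no '::' component) can be disconnected."""
    return "::" not in ap_id
```

With these definitions, dynamic_attachment_points("maxcpu", 16) yields IO16::cpu0 and IO16::cpu1, and can_disconnect("IO16::pci3") is false, which matches the error shown above.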
Attach Operations
On DR attach operations, a new component (for example, a CPU/Memory board, individual CPU, memory, hsPCI board, individual PCI card, or MaxCPU board) is connected and configured into the domain while the domain is running. At the end of the DR attach operation, the new component is available for use by the Solaris OE.
The flow of a Sun Fire 15K/12K DR attach operation is described in the following sections.
Assign
By assigning a board to a domain, the board becomes part of the logical domain. This operation is rejected if the board is not in the domain's ACL. After a slot has been assigned to a domain, it becomes visible to that domain and unavailable and invisible to any other domain.
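The assignment rules can be summarized in a few lines of code. This is a simplified sketch with hypothetical names (assign_board, visible_boards, AssignError) and plain dictionaries standing in for the platform configuration; it is not how the SC software stores or checks this data.

```python
class AssignError(Exception):
    """Raised when a board cannot be assigned to a domain."""


def assign_board(board, domain, acl, assignments):
    """Assign a board to a domain.

    acl maps a domain name to the set of boards in its available component
    list; assignments maps a board to the domain that currently owns it.
    """
    if board not in acl.get(domain, set()):
        raise AssignError(f"{board} is not in the ACL of domain {domain}")
    owner = assignments.get(board)
    if owner is not None and owner != domain:
        raise AssignError(f"{board} is already assigned to domain {owner}")
    assignments[board] = domain


def visible_boards(domain, acl, assignments):
    """Boards a domain can see: in its ACL and not assigned to another domain."""
    return sorted(
        board for board in acl.get(domain, set())
        if assignments.get(board) in (None, domain)
    )
```

In this model, once SB2 is assigned to one domain it disappears from the visible list of every other domain whose ACL contains it.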
Connect
After the slot has been assigned, DR requests the SC to power on and test the board. After the board has been tested, DR requests the SC to connect the board electronically to the system bus, which makes the board part of the physical domain. The board is then probed by Embedded Fcode Interpreter (efcode). Solaris OE device nodes are created at this time, but they are still offline.
Configure
The Solaris OE assigns functional roles to the boards. It brings the CPUs and I/O devices online, and initializes and adds memory to the system memory pool. At the end of the configure function, the CPUs and memory are ready for use. I/O devices need to be configured using the mount and ifconfig commands before they can be used.
CODE EXAMPLE 2 Using cfgadm to Add a CPU/Memory Board to a Domain
# cfgadm -v -c configure SB1
assign SB1
assign SB1 done
poweron SB1
poweron SB1 done
test SB1
test SB1 done
connect SB1
connect SB1 done
configure SB1
Sep 5 02:11:33 15k-dom dr: OS configure dr@0:SB1::cpu0
Sep 5 02:11:33 15k-dom dr: OS configure dr@0:SB1::cpu1
Sep 5 02:11:33 15k-dom dr: OS configure dr@0:SB1::cpu2
Sep 5 02:11:33 15k-dom dr: OS configure dr@0:SB1::cpu3
Sep 5 02:11:33 15k-dom dr: OS configure dr@0:SB1::memory
configure SB1 done
notify online SUNW_cpu/cpu32
notify online SUNW_cpu/cpu33
notify online SUNW_cpu/cpu34
notify online SUNW_cpu/cpu35
notify add capacity (4 cpus)
notify add capacity (1048576 pages)
notify add capacity SB1 done
Detach Operations
On DR detach operations, an existing component (for example, CPU/Memory board, individual CPU, memory, hsPCI board, individual PCI card, or MaxCPU board) is configured out of the domain then disconnected while the domain is running. At the end of the detach operation, the component can be safely removed from the domain.
Unassign
The unassign operation removes the board from the logical domain and makes the board available for use by other domains. This step is not performed by default. The cfgadm option -o unassign must be supplied.
Disconnect
In the disconnect operation, the board is deprobed using efcode. The DR framework communicates with the SC to program the interconnect so that the board is removed from the physical domain. Power to the slot is turned off. This step is always performed, unless the cfgadm option -o nopoweroff is supplied. A board can be in the disconnected state without being powered off. However, to remove a board from the platform, the board must be powered off and in the disconnected state.
CODE EXAMPLE 3 Using cfgadm to Disconnect a CPU/Memory Board
# cfgadm -v -c disconnect SB1
request delete capacity (4 cpus)
request delete capacity (1048576 pages)
request delete capacity SB1 done
request offline SUNW_cpu/cpu32
request offline SUNW_cpu/cpu33
request offline SUNW_cpu/cpu34
request offline SUNW_cpu/cpu35
request offline SUNW_cpu/cpu32 done
request offline SUNW_cpu/cpu33 done
request offline SUNW_cpu/cpu34 done
request offline SUNW_cpu/cpu35 done
unconfigure SB1
Sep 5 02:12:41 15k-dom dr: OS unconfigure dr@0:SB1::cpu0
Sep 5 02:12:42 15k-dom dr: OS unconfigure dr@0:SB1::cpu1
Sep 5 02:12:42 15k-dom dr: OS unconfigure dr@0:SB1::cpu2
Sep 5 02:12:42 15k-dom dr: OS unconfigure dr@0:SB1::cpu3
Sep 5 02:12:43 15k-dom dr: OS unconfigure dr@0:SB1::memory
unconfigure SB1 done
notify remove SUNW_cpu/cpu32
notify remove SUNW_cpu/cpu33
notify remove SUNW_cpu/cpu34
notify remove SUNW_cpu/cpu35
notify remove SUNW_cpu/cpu32 done
notify remove SUNW_cpu/cpu33 done
notify remove SUNW_cpu/cpu34 done
notify remove SUNW_cpu/cpu35 done
disconnect SB1
disconnect SB1 done
poweroff SB1
poweroff SB1 done
unassign SB1 skipped
Unconfigure
During the unconfigure operation, the Solaris OE takes the CPUs and I/O devices offline, and drains the memory. Environmental monitoring continues, but devices on the board are not available for use by the Solaris OE. The unconfigured board remains in the connected state until the disconnect operation is performed.
If the CPU/Memory board to be detached contains the kernel memory, the operating system (OS) quiescence is performed and the kernel memory is relocated to another CPU/Memory board.
Memory
Before a CPU/Memory board can be detached from a domain, the OS must relocate and release all of the pages of memory that are being used. The process of relocating pages is referred to as a drain operation. Different actions are performed during a drain, depending on what these pages of memory are being used for. The following sections describe the three basic types of memory structures that must be considered prior to allowing the drain to proceed.
Permanent Memory
Some of the virtual memory used by the OS must be fixed in physical memory. The permanent memory contains the OpenBoot PROM software and the kernel that controls the internal system and configuration information, such as page tables and device drivers. Permanent memory, also called kernel memory, cannot be paged or swapped out.
Detaching a CPU/Memory board with kernel permanent memory requires additional tasks because the memory cannot be paged or swapped out to free the CPU/Memory board's memory banks. To detach a CPU/Memory board containing kernel memory, the OS must be temporarily suspended. This suspension phase is also called OS quiescence (see "Operating System Quiescence"). During OS quiescence, the kernel memory is relocated to a new target board to free the memory of the board to be detached. Prior to the relocation operation, the target memory is cleared. The relocation process is also called a copy rename operation. After the relocation, all threads and device drivers are resumed, and normal system operation continues.
Non-Pageable, Non-Permanent Memory
Non-pageable memory segments not assigned to the kernel (that is, non-pageable, non-permanent memory) can be relocated without OS quiescence. An example of non-pageable, non-permanent memory is Intimate Shared Memory (ISM). The ISM is a Solaris kernel feature that improves the performance of programs that use shared memory.
Pageable Memory
Pageable memory can be relocated to free space on another CPU/Memory board or flushed out to disk. When a CPU/Memory board is to be detached, its pageable memory can be flushed out to swap space. No further steps are necessary to free the memory resources of a CPU/Memory board that does not contain kernel permanent memory.
Examples of pageable memory are regular user job pages or Dynamic Intimate Shared Memory (DISM) pages.
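The three memory types differ in how they are drained and in whether the drain requires OS quiescence. The following sketch restates that mapping as a lookup table. The names MemoryKind, DRAIN_ACTIONS, and drain_plan are hypothetical, and the action strings are summaries of the text above rather than kernel interfaces.

```python
from enum import Enum


class MemoryKind(Enum):
    PERMANENT = "permanent"
    NONPAGEABLE_NONPERMANENT = "non-pageable, non-permanent"
    PAGEABLE = "pageable"


# kind -> (drain action, whether OS quiescence is required)
DRAIN_ACTIONS = {
    MemoryKind.PERMANENT: ("copy rename to a target board", True),
    MemoryKind.NONPAGEABLE_NONPERMANENT: ("relocate to another board", False),
    MemoryKind.PAGEABLE: ("relocate to another board or flush to swap", False),
}


def drain_plan(kinds_on_board):
    """Return (actions, needs_quiescence) for the memory kinds on a board."""
    actions = {kind: DRAIN_ACTIONS[kind][0] for kind in kinds_on_board}
    needs_quiescence = any(DRAIN_ACTIONS[kind][1] for kind in kinds_on_board)
    return actions, needs_quiescence
```

A board that holds only pageable and ISM pages drains without suspending the OS; adding permanent memory to the set flips the quiescence flag.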
Operating System Quiescence
When unconfiguring a CPU/Memory board that has permanent memory, the OS is briefly paused. This pause phase is called operating system quiescence. The length of the quiescence depends on the running workloads and the domain I/O and memory configuration. During this quiescence, permanent memory is relocated from the CPU/Memory board being detached to a target CPU/Memory board. Before it can achieve quiescence, the OS must temporarily suspend all processes (both user and kernel threads), CPUs, and device activities.
After the copy rename operation, the CPUs are started and the device activities, kernel threads, and user threads are resumed.
The following code example shows the execution of an example DR detach operation of a CPU/Memory board containing kernel permanent memory.
CODE EXAMPLE 4 Detach of a CPU/Memory Board With Permanent Memory
# cfgadm -c disconnect SB1
System may be temporarily suspended, proceed (yes/no)? y
Jan 30 13:14:10 15k-dom dr: OS unconfigure dr@0:SB1::cpu0
Jan 30 13:14:10 15k-dom dr: OS unconfigure dr@0:SB1::cpu1
Jan 30 13:14:11 15k-dom dr: OS unconfigure dr@0:SB1::cpu2
Jan 30 13:14:11 15k-dom dr: OS unconfigure dr@0:SB1::cpu3
Jan 30 13:14:12 15k-dom dr: OS unconfigure dr@0:SB1::memory
DR: suspending user threads...
DR: checking devices...
DR: suspending kernel daemons...
DR: suspending drivers...
suspending memory-controller@20,400000 (aka mc-us3)
suspending memory-controller@21,400000 (aka mc-us3)
suspending memory-controller@22,400000 (aka mc-us3)
suspending memory-controller@23,400000 (aka mc-us3)
suspending pci108e,abba@0 (aka ce)
suspending pci108e,abba@1 (aka ce)
suspending pci1000,b@2 (aka glm)
suspending sd@0,0
...
resuming sd@0,0
resuming pci1000,b@2 (aka glm)
resuming pci108e,abba@1 (aka ce)
resuming pci108e,abba@0 (aka ce)
resuming memory-controller@23,400000 (aka mc-us3)
resuming memory-controller@22,400000 (aka mc-us3)
resuming memory-controller@21,400000 (aka mc-us3)
resuming memory-controller@20,400000 (aka mc-us3)
DR: resuming user threads...
DR: resume COMPLETED
#
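In the trace above, drivers are resumed in the reverse of the order in which they were suspended. The sketch below reproduces only that ordering with a list used as a stack. The function name quiesce and the copy_rename callback are hypothetical; this illustrates the visible sequence and is not the kernel's suspend and resume code.

```python
def quiesce(drivers, copy_rename):
    """Suspend drivers in order, run copy_rename, then resume in reverse order.

    Returns the log lines in the order they would be printed.
    """
    log = []
    suspended = []
    for driver in drivers:
        log.append(f"suspending {driver}")
        suspended.append(driver)
    # Memory relocation happens while everything is suspended.
    copy_rename()
    # Resume in last-suspended, first-resumed order.
    while suspended:
        log.append(f"resuming {suspended.pop()}")
    return log


relocated = []
log = quiesce(["mc-us3", "ce", "glm", "sd"], lambda: relocated.append("kernel memory"))
print("\n".join(log))
```

Resuming in reverse order means a device is brought back only after everything it depends on, which was suspended later, is already running again.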
Suspend and Resume
To support DR, all I/O device drivers must implement the DDI_ATTACH, DDI_DETACH, DDI_SUSPEND, and DDI_RESUME functions.
The DDI_ATTACH and DDI_DETACH functions provide the ability to attach or detach an instance of a driver without impacting other instances already servicing separate devices. With the DDI_DETACH function, a device driver can be instructed to attempt to pause any pending I/O streams and allow the driver to release associated devices. Device drivers that support the DDI_DETACH operation are called detach-safe.
The DDI_SUSPEND and DDI_RESUME functions allow I/O to be interrupted so that the OS can be suspended and the copy rename process can progress.
Although all Sun Microsystems device drivers (apart from HIPPI/P 1.1) support these functions, other drivers might not. Before using DR with third-party I/O cards and software stacks, ensure that the card supports the DDI functions.
Suspend-Safe and Suspend-Unsafe Devices
A device is called suspend-safe if it satisfies both of the following conditions:
The device does not access memory during OS quiescence.
The device does not interrupt the system during OS quiescence.
NOTE
A driver is called DR-safe if it supports OS quiescence, that is, if it can be suspended and subsequently resumed.
A device is called suspend-unsafe if either of the following conditions applies:
The device allows memory access during OS quiescence.
The device allows system interrupts during OS quiescence.
DR uses an unsafe driver list in the dr.conf file to prevent unsafe devices from accessing memory or interrupting the system during OS quiescence. The dr.conf file is located in the /platform/SUNW,Sun-Fire-15000/kernel/drv/ directory. The unsafe driver list is a property in the dr.conf file with the following format:
unsupported-io-drivers="driver1","driver2","driver3";
DR reads the unsafe driver list when it prepares to quiesce the OS. If DR finds an active driver in the unsafe driver list, it aborts the DR operation and returns an error message identifying the active and unsafe driver. The administrator must then take manual steps to stop pending I/O transactions and release the associated devices before retrying the DR operation.
Sun does not qualify all third-party PCI adapters and driver software with DR. The third-party vendor is responsible for verifying that the adapter, driver stack, and adapter firmware are suspend-safe. A conservative approach is to add third-party device drivers to the unsafe driver list when it is not known whether the device driver is suspend-safe.
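Before starting a DR operation, an administrator might want to check which active drivers appear in the unsafe driver list. The following sketch parses the unsupported-io-drivers property in the format shown above and compares it with a list of active driver names. The function names are hypothetical, the parser handles only this one property, and how the list of active drivers is obtained is left out.

```python
import re


def parse_unsafe_drivers(conf_text):
    """Extract driver names from the unsupported-io-drivers property.

    Lines that start with '#' are not matched, so a commented-out
    property is ignored.
    """
    match = re.search(
        r'^\s*unsupported-io-drivers\s*=\s*([^;]*);', conf_text, re.MULTILINE
    )
    if match is None:
        return []
    return re.findall(r'"([^"]*)"', match.group(1))


def blocking_drivers(active_drivers, conf_text):
    """Return the active drivers that are on the unsafe driver list."""
    unsafe = set(parse_unsafe_drivers(conf_text))
    return [name for name in active_drivers if name in unsafe]


# Sample dr.conf content using the placeholder names from the text.
SAMPLE_CONF = '''\
# unsafe driver list
unsupported-io-drivers="driver1","driver2","driver3";
'''
```

If blocking_drivers returns a non-empty list, the corresponding devices need to be released manually before the DR operation is retried.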
Golden IOSRAM
There is an Input/Output Static Random Access Memory (IOSRAM) device on each I/O board, which is used for SC-to-domain communication. In a multiple I/O board domain, only one of these devices is in use at any given time, which is referred to as the active or golden IOSRAM. A DR detach operation of an I/O board containing the golden IOSRAM therefore requires the ability to relocate it to another I/O board.
Hot-Plug and Hot-Swap Modules and Boards
Domain configurable units (DCUs) are hot-pluggable. Hot-plug boards and modules have special connectors that supply electrical power to the board or module before the data pins make contact. Therefore, hot-plug boards and modules can be inserted or removed while the power is still applied. When performing DR detach operations on hot-pluggable components, verify that the Removal OK LED is illuminated before the component is removed from the system.
Hot-pluggable boards and modules are called hot-swappable if the components can be inserted and removed (that is, hot-plug) and the system can be reconfigured to use the new component while the system is running. Although hot-plugging provides the electrical basis for DR, the full potential is achieved with hot-swapping. Only with hot-swapping can new devices be provided to the OS while the system is running. All components on which DR operations can be performed are hot-swappable.
DR Architecture Diagram and Daemons
This section describes the software components that reside on the SC and the domain, making DR operations possible.
Various processes and daemons on the SC and on the domain work together to accomplish DR operations. The processes and/or daemons that are used depend entirely on the point of execution of the DR operation. For instance, if you execute the DR operation from the SC, the system uses several more processes and/or daemons to accomplish the DR operation than it would if you executed the DR operation from the domain.
The two main communication paths between the SC and a domain are as follows:
Internal secure socket connection between the domain configuration agent (DCA) and the domain configuration server (DCS)
SC tunnel: IOSRAM mailboxes
FIGURE 3 shows the software modules involved in the Sun Fire 15K/12K DR for the SC and the domain.
FIGURE 3 Software Modules Involved for the DR of SC and Domain
System Controller
The system controller (SC) manages all Sun Fire 15K/12K domains both administratively and with respect to their hardware configurations.
The SC communicates with its domains over secure internal Ethernet connections and through the SC tunnel. The SC tunnel is a set of data structures and mailboxes in the IOSRAM of the domain's primary I/O board, that is, the golden IOSRAM.
Domain Configuration Agent
The domain configuration agent (DCA) is a daemon process that runs on the SC. It accepts DR requests from applications on the SC, such as the remote version of the cfgadm command (rcfgadm) and GUIs, and initiates the corresponding DR operation on a Sun Fire 15K/12K domain. The DCA manages DR communication between software applications running on the SC and the DCS running on the domain.
When an application on the SC requests a DR operation, the DCA daemon process establishes a session with the DCS daemon running on the domain and forwards the request to the DCS. The DCS daemon process attempts to honor the request and sends the result of the DR operation back to the DCA.
The SMS startup daemon (SSD) starts the DCA when the domain is brought up. When the DCA daemon is killed while the domain is still running, the DCA daemon is restarted by the SSD daemon. The DCA daemon is terminated when the domain is shut down.
An individual instance of the DCA runs on the SC for each configured domain in the platform.
Domain Configuration Server
The domain configuration server (DCS) is a daemon process that runs on the domain. A single instance of the DCS daemon process runs on each configured domain in the platform. The DCS daemon allows applications on the SC (for example, rcfgadm command) to control DR operations on the domain by way of the DCA daemon.
The DCS daemon process communicates with the DCA daemon running on the SC over a secure connection, IPSec. The DCS daemon accepts DR requests from the DCA daemon. After the DCS daemon accepts a DR operation, the DCS daemon interfaces with libcfgadm to translate the DCA request into cfgadm command operations. After performing the DR request, the DCS daemon returns the results to the DCA daemon.
The DCS process is started by inetd when the first DR request is received. The DCS daemon listens on the network service labeled sun-dr. The /etc/inetd.conf file contains the following entry for the DCS process.
sun-dr stream tcp wait root /usr/lib/dcs dcs
sun-dr stream tcp6 wait root /usr/lib/dcs dcs
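To verify that a domain is set up to start the DCS, a script can look for the sun-dr entries in a copy of the inetd.conf file. The sketch below splits each entry into the standard inetd.conf fields (service name, endpoint type, protocol, wait status, user, server program, and server arguments). The names InetdEntry and parse_inetd_conf are hypothetical.

```python
from typing import List, NamedTuple


class InetdEntry(NamedTuple):
    service: str
    socket_type: str
    protocol: str
    wait_status: str
    user: str
    program: str
    arguments: List[str]


def parse_inetd_conf(text: str) -> List[InetdEntry]:
    """Parse inetd.conf-style lines, ignoring comments and short lines."""
    entries = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        if len(fields) < 7:
            continue  # not a complete entry
        entries.append(InetdEntry(*fields[:6], fields[6:]))
    return entries


SAMPLE = """\
sun-dr stream tcp wait root /usr/lib/dcs dcs
sun-dr stream tcp6 wait root /usr/lib/dcs dcs
"""

dcs_entries = [e for e in parse_inetd_conf(SAMPLE) if e.service == "sun-dr"]
print([e.protocol for e in dcs_entries])
```

Two entries are expected, one for tcp and one for tcp6, both starting /usr/lib/dcs.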
libcfgadm Library and Plug-In
The libcfgadm framework consists of the basic libcfgadm interface and accompanying plug-ins. Platforms that support DR can share the user interface, because the user interface relies on the libcfgadm framework for implementation. The libcfgadm plug-ins customize the framework to each platform.
The libcfgadm library resides in the OS. It exports the cfgadm command interface used by DR applications and provides mechanisms for plug-in modules to register with that framework. It translates DR requests received through the DCS daemon into cfgadm command operations and forwards the requests to the dr driver and the RCM framework.
Platform Configuration Daemon
The platform configuration daemon (PCD) runs on the SC. The PCD's main responsibility is to manage and provide controlled access to the platform and domain configuration databases. The PCD maintains three sets of configuration data: platform configuration data, domain configuration data, and system board configuration data.
All changes to the configuration of the Sun Fire 15K/12K system must go through the PCD. For DR, the PCD and the domain server (DXS) exchange notifications of DR events that affect board availability changes. When pertinent platform configuration changes occur within the system, the PCD daemon sends out notification of the changes to clients who have registered to receive the notification.
Domain Server
The domain server (DXS) process interacts with the PCD and the IOSRAM mailboxes to communicate domain configuration changes between the SC and the drmach driver on the domain. An individual instance of the DXS runs on the SC for each domain on the Sun Fire 15K/12K system.
dr Driver
The dr driver is the platform-independent part of the system board DR driver, which is common across Sun Enterprise 10000 and Sun Fire 15K/12K platforms. The dr driver interfaces with libcfgadm to perform domain configuration events as instructed by the cfgadm command and the DCS daemon. The dr driver notifies the DXS running on the SC of DR operations through the IOSRAM mailboxes to validate the PCD.
drmach Driver
The drmach driver is the platform-specific part of the system board dr driver. It translates cfgadm board configuration state changes specific to the Sun Fire 15K/12K hardware as DR event requests. The requests are sent to the SC to update the PCD via the IOSRAM mailbox and the DXS.
Reconfiguration Coordination Manager
The reconfiguration coordination manager (RCM) is a daemon process that coordinates DR operations on resources present in the domain. The RCM daemon uses generic application programming interfaces (APIs) to coordinate DR operations between DR initiators and RCM clients. Normally, the DR initiator is the cfgadm command. However, it can also be a GUI such as the Sun Management Center software. The RCM clients can be any of the following:
Software layers that export high-level resources comprised of one or more hardware devices (for example, multi-pathing applications like Sun StorEdge Traffic Manager Software and IPMP)
Applications that monitor DR operations (for example, Sun Management Center software)
Entities on a remote system, such as the SC on a server
Flowchart of DR Attach Operations
If a board is newly inserted into a slot, the following functions must be performed to make the resources of the board ready to use for the OS:
Assign (availability change function)
Power-On (condition change function)
Test (condition change function)
Connect (state change function)
Configure (state change function)
Depending on the state and condition of the board at the start of the configure function, one or several of the previous functions might be skipped.
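The way functions are skipped can be illustrated with a small planning function. This sketch derives the remaining attach functions from the current receptacle and occupant state plus three flags. The function name and the flag names (assigned, powered_on, tested) are hypothetical, and the logic is an interpretation of this section rather than the actual cfgadm plug-in behavior.

```python
def remaining_attach_steps(receptacle, occupant,
                           assigned=False, powered_on=False, tested=False):
    """Return the attach functions still to be run, in order."""
    if receptacle == "empty":
        raise ValueError("no board is installed in the slot")
    if occupant == "configured":
        return []                # nothing left to do
    if receptacle == "connected":
        return ["configure"]     # board is already part of the physical domain
    steps = []
    if not assigned:
        steps.append("assign")
    if not powered_on:
        steps.append("poweron")
    if not tested:
        steps.append("test")
    steps += ["connect", "configure"]
    return steps
```

For a newly inserted board (disconnected, unconfigured, no flags set) the function returns all five steps, which matches the sequence in CODE EXAMPLE 2.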
The states and conditions of the attachment point during a DR attach operation are listed in the following table.
TABLE 1 Status and Conditions for Attachment Points During DR Attach Operations
Phase | Comments | Receptacle State | Occupant State | Condition
 | Power-on (after power-on, before board is inserted) | Empty | Unconfigured | Unknown
 | After board is physically inserted | Disconnected | Unconfigured | Unknown
Connect | After attachment point is logically connected | Connected | Unconfigured | OK
Configure | After configure | Connected | Configured | OK
The board is now configured into the domain and can be used by the OS.
Flowchart of DR Detach Operations
During DR detach operations, the following functions are performed:
Unconfigure (state change function)
Disconnect (state change function)
Power-Off (condition change function)
Unassign (availability change function)
The states and conditions of the detachment point during a DR detach operation are listed in the following table.
TABLE 2 Status and Conditions for Detachment Points During DR Detach Operations
Phase | Comments | Receptacle State | Occupant State | Condition
Configure | In configured state | Connected | Configured | OK
Unconfigure | After unconfigure | Connected | Unconfigured | OK
Disconnect | Board is powered off (unless the cfgadm option -o nopoweroff was used) | Disconnected | Unconfigured | Unknown
The board is now disconnected and powered off (unless the cfgadm option -o nopoweroff was used) and can be safely removed from the slot.
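The effect of the two cfgadm options on the detach sequence can be summarized in code. The sketch below builds the list of detach functions from two flags named after the -o unassign and -o nopoweroff options, and checks the removal condition stated earlier (powered off and disconnected). The function names are hypothetical.

```python
def detach_plan(unassign=False, nopoweroff=False):
    """Return the detach functions that run for a disconnect request."""
    steps = ["unconfigure", "disconnect"]
    if not nopoweroff:
        steps.append("poweroff")   # performed unless nopoweroff is given
    if unassign:
        steps.append("unassign")   # performed only when unassign is given
    return steps


def safe_to_remove(receptacle, powered_on):
    """A board can be pulled only when it is disconnected and powered off."""
    return receptacle == "disconnected" and not powered_on
```

The default plan ends with poweroff and skips unassign, which matches the final "unassign SB1 skipped" line in CODE EXAMPLE 3.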