Using ORACLE Database Configurations
This section describes the Solaris OE users and groups used to access the ORACLE software, and the ORACLE database configurations you can choose: the Real Application Clusters (RAC) service or the HA ORACLE data service.
Solaris OE Users and Groups
The Solaris OE user oracle owns the ORACLE software and database files, whether on raw devices or UNIX™ file system (UFS). Although the database components reside on the Sun StorEdge™ T3 arrays, the ORACLE software for each node always resides on that node's local storage, thereby providing redundancy.
The oracle user belongs to a primary group of oinstall and a secondary group named dba. Members of the oinstall group are responsible for maintaining the software on each cluster node and upgrading it as necessary. Members of the dba group create and maintain databases.
If needed, an additional Solaris OE user can be defined with membership in group dba and not in group oinstall, thus giving the user DBA authority but withholding the ability to modify the software.
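The following Solaris OE commands illustrate this account setup as a minimal sketch; the home directory and login shell shown here are assumptions, not values mandated by the installation.

    # Create the software-owner and DBA groups, then the oracle user
    # (home directory and shell are illustrative choices)
    groupadd oinstall
    groupadd dba
    useradd -g oinstall -G dba -m -d /export/home/oracle -s /bin/ksh oracle
    passwd oracle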
Real Application Clusters (RAC) Service
Real Application Clusters (RAC) allows two or more cluster nodes to perform transactions against a single database simultaneously. We use the term "active-active" or "scalable" to refer to this type of architecture. Multiple nodes synchronize their accesses to database objects.
Each node starts up an ORACLE database instance, comprising the necessary background processes, such as the system monitor, process monitor, log writer, and database writer (SMON, PMON, LGWR, and DBWR, respectively). Furthermore, each node maintains its own System Global Area (SGA) in memory, including the Database Buffer Cache, the Redo Log Buffer, and the Shared Pool.
Distributed Lock Manager (DLM) processes run on each instance, synchronizing data block accesses, thus creating a memory-resident repository of lock objects equally distributed among all instances. Each instance is "mastering" a subset of the distributed resource locks. Background processes that support the DLM include the Global Enqueue Service Monitor (LMON) and the Global Enqueue Service Daemon (LMD). See the following figure.
FIGURE 2 Real Application Clusters (RAC) Background Processes and Memory Structures
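As a quick check, a DBA can confirm these background processes and SGA structures from the Solaris OE command line. The sketch below assumes an instance named orcl1, the name the starter database uses later in this section, and assumes the oracle user can connect as SYSDBA through operating system authentication.

    # List the RAC background processes of the orcl1 instance
    ps -ef | egrep 'ora_(pmon|smon|lgwr|dbw0|lmon|lmd0)_orcl1'

    # Display the sizes of the SGA components
    echo 'select * from v$sga;' | sqlplus -s "/ as sysdba"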
On shared storage, the database instance of each node is assigned its own redo log files and rollback segments. Redo log files are the means by which a database recovers to a consistent state following a system crash. They record changes made to blocks of any object, including tables, indices, and rollback segments, and they guarantee that all committed transactions are preserved in the event of a crash, even if the resulting data block changes have not yet been written to data files. Rollback segments store database "undo" information; for example, they store the information needed to cancel or "roll back" a transaction, if the application needs to do so. Rollback segments also provide a form of SQL statement isolation: a long-running query against a set of tables must see their contents only as they were at the time the query began.
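For example, a DBA can list each instance's redo log groups (by thread number) and the available rollback segments with queries such as the following; connecting as SYSDBA is assumed.

    # Redo log groups, one set per RAC instance (thread)
    echo 'select thread#, group#, members, status from v$log;' | sqlplus -s "/ as sysdba"

    # Rollback segments and the tablespaces that hold them
    echo 'select segment_name, tablespace_name from dba_rollback_segs;' | sqlplus -s "/ as sysdba"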
As with any RAC installation, CDP 280/3 implements shared storage on top of raw volumes, in this case built using VxVM software. Using raw volumes allows each ORACLE instance to access the other instance's redo log files and rollback segments, particularly in the event of a node failure.
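A raw volume of this kind might be created with VxVM commands along the following lines; the disk group, volume name, and size are illustrative, not the values used by the CDP 280/3 installation.

    # Mirrored volume with a dirty region log, then ownership for raw access
    vxassist -g oradg make orcl_redo1_1 200m layout=mirror,log
    vxedit -g oradg set user=oracle group=dba mode=660 orcl_redo1_1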
Recovering a Database Instance
When a node leaves the cluster, the resources it was mastering must be remastered on the surviving node. With the improved hashing algorithm introduced in Oracle9i, locks already mastered on the surviving instance are not affected. While remastering is occurring, the SMON process on the surviving node performs instance recovery. All transactions performed on the failed instance are recorded in its redo log files, but only the changes recorded prior to the most recent checkpoint are guaranteed to have been written to the data files.
Because the redo log files for both instances reside on raw devices, the instance performing recovery can access the failed instance's redo log, applying to the data files the changes of transactions that committed after the final checkpoint (also known as "rolling forward") and rolling back those that had not committed. To roll back uncommitted transactions, it reads the "before images" found in the failed instance's rollback segments.
SMON also frees up any resources that pending transactions may have acquired. During the roll forward period, the database is only partially available: the surviving instance can access only the data blocks it currently has cached. It cannot perform any I/O to the database, nor can it ask for any additional resource locks during this period. Rolling back uncommitted transactions can occur in parallel with the creation of new work.
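The progress of the roll forward and roll back phases is reported in the surviving instance's alert log, which can be watched from the Solaris OE command line. The path below assumes a conventional OFA layout under /oracle and an instance named orcl1; both are assumptions that may differ on a given installation.

    # Watch instance recovery messages as they are written (path is illustrative)
    tail -f /oracle/admin/orcl/bdump/alert_orcl1.log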
Maintaining Availability
Using ORACLE's Transparent Application Failover (TAF), client connections performing read-only queries can continue on the surviving node, unaware that the original instance has failed. Users might experience longer response times, because each query must be restarted from the beginning against a "cold" database buffer cache on the surviving instance, and that restart cannot commence until the surviving instance has recovered the failed instance.
Applications that modify data must be specially coded to handle the status code that the ORACLE Call Interface (OCI) returns when an instance fails: they must recognize that such a return code has occurred and then reconnect to the surviving instance to continue processing the statement.
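TAF itself is enabled through the client-side Oracle Net configuration. The following tnsnames.ora fragment is a generic sketch, not the entry shipped with the CDP 280/3; the alias, host names, port, and retry settings are assumptions.

    # TAF-enabled alias: in-flight SELECTs resume on the surviving instance
    orcl_taf =
      (DESCRIPTION =
        (ADDRESS_LIST =
          (ADDRESS = (PROTOCOL = TCP)(HOST = node1)(PORT = 1521))
          (ADDRESS = (PROTOCOL = TCP)(HOST = node2)(PORT = 1521)))
        (CONNECT_DATA =
          (SERVICE_NAME = orcl)
          (FAILOVER_MODE = (TYPE = SELECT)(METHOD = BASIC)(RETRIES = 20)(DELAY = 5))))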
In the unlikely event that all instances fail, the first instance restarted after the failure performs instance recovery on the redo log files of all failed instances, including its own, if necessary. This action is known as "database crash recovery."
When the nodes in a CDP 280/3 cluster are first installed via JumpStart software, a general-purpose RAC database is initialized:
First, the node that VxVM software determines is "mastering" the shared storage creates a series of mirrored, raw volumes, each with Dirty Region Logging™ (DRL) enabled.
Next it copies each component file of the database from local, file-system based storage out to its corresponding raw volume.
Using the host name chosen during the installation question-and-answer session, it configures the Oracle listener file, listener.ora, and initiates the listener process daemon.
It defines two service names in the tnsnames.ora file for the newly installed database: orcl1, which provides a direct connection into the instance of the first node, and orcl, which provides a load balancing scheme (a sketch of such entries appears after this list).
Client connections that use a service name of orcl connect to the RAC instance with the lightest load observed at the time the connections are requested.
Finally, it brings up the SGA and background processes that form the orcl1 instance on the first node.
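The entries for these two service names might resemble the following generic Oracle Net fragment; the host names and port are illustrative, and the file generated by the actual installation may differ. The least-loaded-instance behavior described above also relies on listener-based connection load balancing, which is configured on the server side.

    # Direct connection to the first node's instance
    orcl1 =
      (DESCRIPTION =
        (ADDRESS = (PROTOCOL = TCP)(HOST = node1)(PORT = 1521))
        (CONNECT_DATA = (SERVICE_NAME = orcl)(INSTANCE_NAME = orcl1)))

    # Load-balanced alias spanning both nodes
    orcl =
      (DESCRIPTION =
        (ADDRESS_LIST =
          (LOAD_BALANCE = ON)
          (ADDRESS = (PROTOCOL = TCP)(HOST = node1)(PORT = 1521))
          (ADDRESS = (PROTOCOL = TCP)(HOST = node2)(PORT = 1521)))
        (CONNECT_DATA = (SERVICE_NAME = orcl)))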
The remaining cluster node may be thought of as the "slave," with regard to VxVM software. During its initial JumpStart software process, it must wait for the master node to complete the creation and initialization of the raw volumes that form the orcl database on shared storage.
Once per minute (for a maximum of 15 minutes), it "wakes up" to see if the volumes are ready. If they are, it begins its own listener process and instance, in this case named orcl2. No type of recovery work is necessary. The orcl2 instance begins to record the modification of database blocks to its redo log files and the prior state of those blocks to its rollback segments.
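A polling loop of this kind can be sketched in Bourne shell as follows; the disk group and volume names are illustrative, and the actual JumpStart finish script is not reproduced here.

    # Wake up once per minute, for at most 15 minutes, until the shared
    # volumes created by the master node are visible
    i=1
    while [ $i -le 15 ]; do
        if vxprint -g oradg orcl_system_1 >/dev/null 2>&1; then
            break               # volumes are ready; start the listener and orcl2
        fi
        sleep 60
        i=`expr $i + 1`
    done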
Of course, this synchronization of nodes against the creation of raw volumes only occurs the first time each node is installed in the cluster. From that point forward, if a node needs to be rebooted, it rejoins the cluster and automatically starts up its Oracle instance and listener daemon.
The orcl starter database is based upon a "general purpose" database configuration that ORACLE supplies its customers via the ORACLE Universal Installer™ (OUI). This installer implements a database with an eight-Kbyte block size, useful for either On-line Transaction Processing (OLTP) or Data Warehouse applications. The data center DBA is free to modify initialization parameters, except for block size, to tune the instances on each node. By tuning the instances, the DBA can more effectively support a particular OLTP or warehouse environment, for example, by resizing SGA components or by altering the behavior of background processes. At this point the database appears to the DBA no different from one that the DBA might have created manually. In fact, the DBA can create additional raw volumes on the storage array and use the ORACLE Database Creation Assistant to build other RAC databases, resulting in one or more instances running on each cluster node, each instance serving a particular underlying database.
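For instance, Oracle9i allows several SGA components to be resized dynamically, within the limit set by SGA_MAX_SIZE; the sizes in the sketch below are illustrative only.

    # Resize dynamic SGA components on a running instance (values are examples)
    echo 'alter system set db_cache_size = 256M;' | sqlplus -s "/ as sysdba"
    echo 'alter system set shared_pool_size = 128M;' | sqlplus -s "/ as sysdba"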
HA ORACLE Data Service
HA ORACLE data service allows only one cluster node at a time to host transactions against a database. Using JumpStart software, the ORACLE services are configured as a "failover resource group," to borrow from Sun Cluster 3.0 software terminology. This approach results in a highly available ORACLE data service.
ORACLE registers its database and listener services to the cluster via Sun Cluster Resource Group Manager (RGM), which chooses one of the nodes to host them. A storage resource (representing the Sun StorEdge™ T3 arrays) and a logical host name and IP address are registered with RGM. The node chosen to host the ORACLE resources hosts the storage and logical host name too. Client sessions connect to the database service using the logical host name, which is distinct from the physical host name assigned to each node.
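The registration described above is performed with Sun Cluster 3.0 commands along the following lines. This is a sketch only: the resource group, resource names, logical host name, ORACLE paths, and the fault-monitor connect string are illustrative values, not those used by the CDP 280/3 installation.

    # Register the ORACLE resource types and a storage resource type
    scrgadm -a -t SUNW.oracle_server
    scrgadm -a -t SUNW.oracle_listener
    scrgadm -a -t SUNW.HAStorage

    # Create the failover resource group with its logical host name
    scrgadm -a -g ora-rg
    scrgadm -a -L -g ora-rg -l ora-lh

    # Storage, database server, and listener resources
    scrgadm -a -j ora-stor-rs -g ora-rg -t SUNW.HAStorage \
        -x ServicePaths=/global/oracle
    scrgadm -a -j ora-db-rs -g ora-rg -t SUNW.oracle_server \
        -x ORACLE_HOME=/oracle/product/9.0.1 -x ORACLE_SID=orcl \
        -x Alert_log_file=/oracle/admin/orcl/bdump/alert_orcl.log \
        -x Connect_string=scott/tiger
    scrgadm -a -j ora-lsnr-rs -g ora-rg -t SUNW.oracle_listener \
        -x ORACLE_HOME=/oracle/product/9.0.1 -x LISTENER_NAME=LISTENER

    # Bring the resource group online on one node
    scswitch -Z -g ora-rg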
When it is necessary to switch ORACLE data services to another node in the cluster, the storage resource changes hosts. The logical IP address "floats" to the other node so that any packets routed to that address are then handled by the subsequent node. Note that this data services switch can occur either by request or by reason of a failed node. In the first situation (perhaps a maintenance window is required on the node currently hosting the services), ORACLE performs a graceful database shutdown on the first node and simply restarts on the second node; HA ORACLE data service is unaware that it is now running on a different host.
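A planned switchover of this kind is requested with a single Sun Cluster 3.0 command; the group and node names continue the hypothetical names used in the sketch above.

    # Move the whole failover resource group to the other cluster node
    scswitch -z -g ora-rg -h node2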
In the case of node failure, the Sun Cluster software's "heartbeat" on the surviving node determines that the departing node has left the cluster and proceeds to rehost the logical host name, IP address, and storage. ORACLE's SMON process starts up and performs crash recovery in "roll forward" and "roll back" phases. Unlike the case of RAC, the DLM remastering does not occur, because DLM processes are not needed in a configuration such as HA ORACLE data service, where only one instance is alive at any given time. Once again, HA ORACLE data service is not aware that it is restarting on a different host, just that an instance crash occurred. In either case, client connections disconnect and reconnect to the surviving node, after it has completed failover processing.
Because HA ORACLE data service only uses one active instance, only one ORACLE license needs to be purchased. Of course, availability suffers, because any type of cluster switch involves a finite amount of downtime, in addition to the need for clients to reconnect.
It is interesting to note that the term "shared storage" is actually a misnomer in the HA ORACLE architecture, because only one node accesses a given logical unit at any point. Raw devices are not necessary in this configuration, because there is no concept of a second instance accessing the redo logs of the first. Database files reside on a UFS on the Sun StorEdge™ T3 arrays and are mounted under /global, which is visible to either node. Depending on the choice made at installation, this file system can be constructed on top of either Solaris Volume Manager software or VxVM software, and it is mounted using the direct I/O feature of Solaris OE. The Solaris OE UFS with direct I/O demonstrates nearly the same performance as raw devices, but with the ease of management of a file system. Solaris OE UFS logging is enabled by default to improve file system recovery.
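The corresponding /etc/vfstab entry might resemble the following line. The metadevice names are illustrative (a VxVM volume path would appear instead if VxVM software were chosen), and the fsck pass and mount-at-boot settings may differ for a given configuration.

    # device to mount        device to fsck           mount point     FS   pass  boot  options
    /dev/md/oradg/dsk/d100   /dev/md/oradg/rdsk/d100  /global/oracle  ufs  2     yes   global,logging,forcedirectio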
Similar to the RAC configuration, a starter database is furnished; however, in the case of HA ORACLE data service, you choose between an OLTP and a Decision Support System (DSS) version, based upon the configurations ORACLE provided within OUI. Each version features an eight-Kbyte block size; however, the initialization parameter SORT_AREA_SIZE, which limits the amount of memory ORACLE uses internally for sorting result sets, is twice as large in the DSS version as it is in the OLTP version. Also similar to the RAC installation, the DBA can extend and further tune the starter database as needed.