Data Synchronization
Maintaining more than one copy of data creates a data synchronization problem. This section describes these data synchronization issues.
Throughout this book, the concept of ownership of data is important. Ownership identifies the authoritative source of the single view of the data. Thinking in terms of a single, authoritative owner of data is useful for understanding the intricacies of modern clustered systems. In the event of failures, ownership can migrate to another entity. "Synchronization" on page 99 describes how the Sun Cluster 3.0 architecture handles the complex synchronization problems and issues that the following sections describe.
Data Uniqueness
Data uniqueness poses a problem for computer system or cluster architectures that duplicate data to enhance availability. The data must be presented to people as unique, yet multiple identical copies exist behind that single view, and those copies must remain synchronized.
Complexity and Reliability
Since the first vacuum tube computers were built, the reliability of computing machinery has improved significantly. The increase in reliability resulted from technology improvements in the design and manufacturing of the devices themselves. At the same time, the demand for new features increases complexity, and in general, the more complex a system is, the less reliable it is. This creates a dilemma: increasing complexity to satisfy the desire for new features works against the desire for reliability (the perfection of existing systems).
As you increase the number of components in a system, the reliability of the system tends to decrease. Another way to look at the problem of clusters is to realize that a fully redundant cluster has more than twice as many components as a single system. Thus, the cost of a clustered system is typically more than twice the cost of a single system, yet the reliability of the clustered system is less than half the reliability of a single system. Though this may seem discouraging, it is important to understand that the reliability of a system is not the same as the availability of the service provided by the system. The difference between reliability and availability is that the former deals only with one event, a failure, whereas the latter also takes recovery into account. The key is to build a system in which components fail at normal rates, but which recovers from these failures quickly.
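To make the distinction concrete, the following sketch uses assumed, illustrative numbers (not measured values) to show how the reliability of a series system falls as components are added, while availability also credits fast recovery:

    #include <stdio.h>
    #include <math.h>

    /*
     * Illustrative sketch only; the component reliability, MTBF, and MTTR
     * values are assumed, not measured.  A series system of n components
     * (all must work) has reliability r^n, so adding components lowers
     * reliability.  Availability also credits recovery speed:
     * A = MTBF / (MTBF + MTTR).
     */
    int main(void)
    {
        double r = 0.999;            /* assumed per-component reliability */
        for (int n = 100; n <= 400; n += 100)
            printf("%3d components: system reliability = %.3f\n", n, pow(r, n));

        double mtbf = 10000.0;       /* assumed mean time between failures, hours */
        double mttr = 0.5;           /* assumed mean time to recover, hours */
        printf("availability = %.5f\n", mtbf / (mtbf + mttr));
        return 0;
    }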
An important technique for recovering from failures in data storage is data duplication. Data duplication occurs often in modern computer systems. The most obvious examples are backups, disk mirroring, and hierarchical storage management solutions. In general, data is duplicated to increase its availability. At the same time, duplication uses more components, thus reducing the overall system reliability. Also, duplication introduces synchronization fault opportunities. Fortunately, for most cases, the management of duplicate copies of data can be reliably implemented as processes. For example, the storage and management of backup tapes is well understood in modern data centers.
A special case of the use of duplicate data occurs in disk mirrors. Most disk mirroring software or hardware implements a policy in which writes are committed to both sides of the mirror before an acknowledgement of the write operation is returned. Each read operation is satisfied from only one side of the mirror, which increases the efficiency of the system because the two sides together can service twice as many read operations for a given data set size. This duplication also introduces a synchronization failure mode, in which one side of the mirror might not actually contain the same data as the other side. This is not a problem for write operations, because the data will be overwritten, but it is a serious problem for read operations.
Depending on the read policy, the side of the mirror that satisfies a given read operation may not be predictable. Two solutions are possible: periodically check the synchronization, or always check the synchronization. Using the former solution maintains the performance improvements of read operations while periodic synchronization occurs in the background, preferably during times of low utilization. The latter solution does not offer any performance benefit but ensures that all read operations are satisfied by synchronized data. This solution is more common in fault-tolerant systems.
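As a simplified illustration of these read policies (hypothetical structures and function names, not the interface of any particular volume manager), the following sketch contrasts a fast one-sided read with a read that verifies both sides before returning:

    #include <stdbool.h>
    #include <stddef.h>
    #include <string.h>

    /* Hypothetical mirror: two complete copies of the same data. */
    struct mirror {
        unsigned char *side[2];
        size_t         block_size;
    };

    /* Fast policy: satisfy each read from one side only (round-robin). */
    void read_fast(const struct mirror *m, size_t blk, unsigned char *buf)
    {
        static unsigned next;
        memcpy(buf, m->side[next++ & 1] + blk * m->block_size, m->block_size);
    }

    /*
     * Checked policy, as a fault-tolerant system might use: read both
     * sides and compare before returning.  A false return means the
     * sides are out of sync and resynchronization is needed.
     */
    bool read_verified(const struct mirror *m, size_t blk, unsigned char *buf)
    {
        const unsigned char *a = m->side[0] + blk * m->block_size;
        const unsigned char *b = m->side[1] + blk * m->block_size;

        memcpy(buf, a, m->block_size);
        return memcmp(a, b, m->block_size) == 0;
    }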
RAID 5 protection of data also represents a special case of duplication in which the copy is virtual. There is no direct, bit-for-bit copy of the original data. However, there is enough information to re-create the original data. This information is spread across the other disks in the form of data and parity. The original data can be re-created by a mathematical manipulation of the other data and the parity.
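In the common single-failure case, that mathematical manipulation is a bitwise exclusive OR. The following minimal sketch uses hypothetical buffers; a real RAID 5 implementation also rotates the parity blocks across the disks in the array:

    #include <stddef.h>

    /* Parity for a stripe is the XOR of its data blocks. */
    void compute_parity(const unsigned char *data[], int ndisks,
                        size_t len, unsigned char *parity)
    {
        for (size_t i = 0; i < len; i++) {
            unsigned char p = 0;
            for (int d = 0; d < ndisks; d++)
                p ^= data[d][i];
            parity[i] = p;
        }
    }

    /* A lost block is re-created by XORing the parity with the
     * blocks on the surviving disks. */
    void rebuild_lost_block(const unsigned char *surviving[], int nsurviving,
                            const unsigned char *parity, size_t len,
                            unsigned char *rebuilt)
    {
        for (size_t i = 0; i < len; i++) {
            unsigned char p = parity[i];
            for (int d = 0; d < nsurviving; d++)
                p ^= surviving[d][i];
            rebuilt[i] = p;
        }
    }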
Synchronization Techniques
Modern computer systems use synchronization extensively. Fortunately, only a few synchronization techniques are in common use, so the topic has been researched and written about extensively. Once you understand the techniques, you can begin to understand how they function when components fail.
Microprocessor Cache Coherency
Microprocessors designed for multiprocessor computers must maintain a consistent view of memory among themselves. Because these microprocessors often have caches, the synchronization is done through a cache-coherency protocol. The term coherence describes the values returned by read operations to the same memory location. Consistency describes when a written value will be returned by a subsequent read. Coherency and consistency are complementary: coherence defines the behavior of reads and writes to the same memory location, and consistency defines the behavior of reads and writes with respect to accesses to other memory locations. In terms of failures, loss of either coherency or consistency is a major problem that can corrupt data and increase recovery time.
UltraSPARC™ processors use two primary types of cache-coherency protocols: snooping and distributed directory-based coherency.
Snooping protocol is used by all multiprocessor SPARC implementations. No centralized state is kept. Every processor cache maintains metadata tags that describe the shared status of each cache line, along with the data in the cache line. All of the caches share one or more common address buses, and each cache snoops the address bus to see which processors might need a copy of the data owned by that cache.
Distributed directory-based coherency protocol is used in the UltraSPARC III processor. The status of each cache line is kept in a directory that has a known location. This technique removes the snooping protocol's restriction that all caches must see all address bus transactions. The distributed directory protocol scales to larger numbers of processors than the snooping protocol and allows large, multiprocessor UltraSPARC III systems to be built. The ORACLE 9i RAC database implements a distributed directory protocol for its cache synchronization. "Synchronization" on page 99 describes this protocol in more detail.
As demonstrated in the Sun Fire™ servers, both protocols can be used concurrently. The Sun Fire server uses the snooping protocol among the four processors on a board and the directory-based coherency protocol between boards. Regardless of the cache-coherency protocol, UltraSPARC processors have an atomic test-and-set operation, ldstub, which is used by the kernel. Atomic operations must be guaranteed to complete successfully or not at all. The test-and-set operation implements simple locks, including spin locks.
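The following minimal sketch shows a spin lock built on an atomic test-and-set. It uses the GCC/Clang __atomic built-ins as a portable stand-in for ldstub; the kernel's actual lock implementation differs:

    /* One byte of lock state: 0 = free, nonzero = held
     * (ldstub atomically sets a byte to all ones and returns its old value). */
    typedef unsigned char spinlock_t;

    void spin_lock(spinlock_t *lk)
    {
        /* Atomically set the byte and test its previous value;
         * keep spinning while another CPU already holds the lock. */
        while (__atomic_test_and_set(lk, __ATOMIC_ACQUIRE))
            ;   /* spin */
    }

    void spin_unlock(spinlock_t *lk)
    {
        __atomic_clear(lk, __ATOMIC_RELEASE);   /* mark the lock free */
    }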
Kernel-Level Synchronization
The kernel is re-entrant [Vahalia96], which means that many threads can execute kernel code at the same time. The kernel uses a number of lock primitives that are built on the test-and-set operation [MM00]:
Mutual exclusion (mutex) locks provide exclusive access semantics. Mutex locks are one of the simplest locking primitives.
Reader/writer locks are used when multiple threads can read a memory location concurrently, but only one thread can write.
Kernel semaphores are based on Dijkstra's [Dijkstra65] implementation in which the semaphore is a positive integer that can be incremented or decremented by an atomic operation. If after a decrement the value is zero, the thread blocks until another thread increments the semaphore. Semaphores are used sparingly in the kernel.
Dispatcher locks provide synchronization that is protected from interrupts and are used primarily by the kernel dispatcher.
Higher-level synchronization facilities, such as the condition variables (also called queuing locks) that implement the traditional UNIX sleep/wake-up facility, are built on these primitives.
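These kernel primitives are not directly callable from applications, but the sleep/wake-up pattern they support resembles the following user-level POSIX threads sketch, which uses a hypothetical work counter:

    #include <pthread.h>

    /* A condition variable pairs with a mutex: consumers sleep in
     * pthread_cond_wait() and the producer wakes one of them. */
    static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  q_nonempty = PTHREAD_COND_INITIALIZER;
    static int q_items;                      /* hypothetical work counter */

    void producer_put(void)
    {
        pthread_mutex_lock(&q_lock);
        q_items++;
        pthread_cond_signal(&q_nonempty);    /* wake-up */
        pthread_mutex_unlock(&q_lock);
    }

    void consumer_get(void)
    {
        pthread_mutex_lock(&q_lock);
        while (q_items == 0)                 /* re-check after every wake-up */
            pthread_cond_wait(&q_nonempty, &q_lock);   /* sleep, releasing the lock */
        q_items--;
        pthread_mutex_unlock(&q_lock);
    }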
Application-Level Synchronization
The Solaris Operating Environment (Solaris OE) offers several application programming interfaces (APIs) that you can use to build synchronization into multithreaded and multiprocessing programs.
The System Interface Guide [SunSIG99] introduces the API concept and describes the process control, scheduling control, file input and output, interprocess communication (IPC), memory management, and real-time interfaces. The POSIX and System V IPC APIs are described; these include message queues, semaphores, and shared memory. The System V IPC API is popular, being widely implemented on many operating systems. However, the System V IPC semaphore facility used for synchronization has more overhead than the techniques available to multithreaded programs.
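For comparison, a minimal System V semaphore lock and unlock looks like the following sketch; each semop() call is a system call, which is where the extra overhead relative to in-process thread primitives comes from:

    #include <sys/types.h>
    #include <sys/ipc.h>
    #include <sys/sem.h>

    /* semctl()'s fourth argument; most systems leave this union
     * for the application to define. */
    union semun { int val; struct semid_ds *buf; unsigned short *array; };

    /* Create one semaphore and initialize it to 1 (a binary lock). */
    int sysv_sem_create(void)
    {
        int id = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
        if (id != -1) {
            union semun arg;
            arg.val = 1;
            semctl(id, 0, SETVAL, arg);
        }
        return id;
    }

    /* P operation: decrement, blocking while the value is zero. */
    void sysv_sem_acquire(int id)
    {
        struct sembuf op = { 0, -1, 0 };
        semop(id, &op, 1);
    }

    /* V operation: increment, waking a blocked process if one is waiting. */
    void sysv_sem_release(int id)
    {
        struct sembuf op = { 0, 1, 0 };
        semop(id, &op, 1);
    }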
The Multithreaded Programming Guide [SunMPG99] describes the POSIX and Solaris OE threads APIs, programming with synchronization objects, compiling multithreaded programs, and finding analysis tools for multithreaded programs. The thread-level synchronization primitives are very similar to those used by the kernel. This guide also discusses the use of shared memory for synchronizing multiple multithreaded processes.
Synchronization Consistency Failures
Condition variables offer an economical method of protecting data structures being shared by multiple threads. The data structure has an added condition variable, which is used as a lock. However, broken software may indiscriminately alter the data structure without checking the condition variables, thereby ignoring the consistency protection. This represents a software fault that may be latent and difficult to detect at runtime.
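The following hypothetical sketch shows this class of fault, using a mutex for simplicity: one code path honors the lock embedded in the structure, while a broken path alters the structure directly and silently bypasses the protection:

    #include <pthread.h>

    /* Hypothetical shared structure with its protecting lock embedded. */
    struct account {
        pthread_mutex_t lock;
        long            balance;
    };

    /* Correct: every access goes through the lock. */
    void deposit(struct account *a, long amount)
    {
        pthread_mutex_lock(&a->lock);
        a->balance += amount;
        pthread_mutex_unlock(&a->lock);
    }

    /* Broken: updates the structure without taking the lock.  It compiles
     * and usually appears to work, which is why the fault can remain
     * latent and be difficult to detect at runtime. */
    void deposit_broken(struct account *a, long amount)
    {
        a->balance += amount;    /* race: ignores the consistency protection */
    }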
Two-Phase Commit
The two-phase commit protocol ensures an atomic write of a single datum to two or more different memories. This solves a problem similar to the consistency problem described previously, but applied slightly differently. Instead of multiple processors or threads synchronizing access to a single memory location, the protocol replicates a single memory location to another memory. These memories have different, independent processors operating on them. However, the copies must remain synchronized.
In phase one, the memories confirm their ability to perform the write operation. Once all of the memories have confirmed, phase two begins and the writes are committed. If a failure occurs, phase one does not complete and some type of error handling may be required. For example, the write may be discarded and an error message returned to the requestor.
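The following sketch outlines the coordinator side of such a protocol under stated assumptions: the participant callbacks are hypothetical, and timeouts and crash-recovery logging are omitted:

    #include <stdbool.h>
    #include <stddef.h>

    /* Hypothetical participant: a memory that can confirm (prepare),
     * commit, or discard a pending write. */
    struct participant {
        bool (*prepare)(void *ctx, const void *datum, size_t len);
        void (*commit)(void *ctx);
        void (*discard)(void *ctx);
        void *ctx;
    };

    bool two_phase_commit(struct participant *p, int n,
                          const void *datum, size_t len)
    {
        int i;

        /* Phase one: every memory must confirm its ability to write. */
        for (i = 0; i < n; i++) {
            if (!p[i].prepare(p[i].ctx, datum, len)) {
                /* Failure: discard the write everywhere that prepared,
                 * and let the caller return an error to the requestor. */
                while (--i >= 0)
                    p[i].discard(p[i].ctx);
                return false;
            }
        }

        /* Phase two: all confirmed, so the writes are committed. */
        for (i = 0; i < n; i++)
            p[i].commit(p[i].ctx);
        return true;
    }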
The two-phase commit is one of the simplest synchronization protocols and is used widely. However, it has scalability problems. The time to complete the confirmation is based on the latency between the memories. For many systems, this is not a problem, but in a wide area network (WAN), the latency between memories may be significant. Also, as the number of memories increases, the time required to complete the confirmation tends to increase. Attempts to relax these restrictions are available in some software products, but this relaxation introduces the risk of loss of synchronization, and thus the potential for data corruption. Recovery from such a problem may be difficult and time consuming, so you must carefully consider the long term risks and impact of relaxing these restrictions. For details on how Sun Cluster 3.0 uses the two-phase commit protocol, see "Mini-Transactions" on page 60.
Systems also use the two-phase commit protocol for three functions: disk mirroring (RAID 1); mirrored caches, such as in the Sun StorEdge T3 array and the Sun StorEdge Network Data Replicator software; and the cluster configuration repository (CCR).
Locks and Lock Management
Locks that are used to ensure consistency require lock management and recovery when failures occur. For node failures, the system must store the information about the locks and their current state in shared, persistent memory or communicate it through the interconnect to a shadow agent on another node.
Storing the state information in persistent memory can lead to performance and scalability issues because the latency of each store operation adds overhead. These locks work best when the state of the lock does not change often. For example, locking a file tends to cause much less lock activity than locking records in the file. Similarly, locking a database table creates less lock activity than locking rows in the table. In either case, the underlying support and management of the locks does not change, but the utilization of the locks can change. High lock utilization is an indication that the service or application will have difficulty scaling.
An alternative to storing the state information in persistent memory is to use shadow agents: processes that receive updates on lock information from the owner of the locks. This state information is kept in volatile main memory, which has much lower latency than shared, persistent storage. If the lock owner fails, the shadow agent already knows the state of the locks and can begin to take over the lock ownership very quickly.
Lock Performance
Most locking and synchronization software provides a method for monitoring its utilization. For example, databases provide performance tables for monitoring lock utilization and contention. The mpstat(1M), vmstat(1M), and iostat(1M) commands give some indication of lock or synchronization activity, although that is not their primary purpose. The lockstat(1M) command provides detailed information on kernel lock activity: it monitors lock contention events, gathers frequency and timing data on the events, and presents the data.