Synchronization: Kinds and Techniques
When shared mutable state is present, synchronization is the only remaining technique for ensuring correctness. As you might guess, given that there’s an entire chapter in this book dedicated to this topic—Chapter 11, Concurrency Hazards—implementing a properly synchronized system is complicated. In addition to ensuring correctness, synchronization often is necessary for behavioral reasons: threads in a concurrent system often depend on or communicate with other threads in order to accomplish useful functionality.
The term synchronization is admittedly overloaded and too vague on its own to be very useful. Let’s be careful to distinguish between two different, but closely related, categories of synchronization, which we’ll explore in this book:
- Data synchronization. Shared resources, including memory, must be protected so that threads using the same resource in parallel do not interfere with one another. Such interference could cause problems ranging from crashes to data corruption, and worse, could occur seemingly at random: the program might produce correct results one time but not the next. A piece of code meant to move money from one bank account to another, written with the assumption of sequential execution, for instance, would likely fail if concurrency were naively added; this includes the possibility of reaching a state in which the transferred money is in neither account! (A brief sketch of this example follows this list.) Fixing this problem often requires using mutual exclusion to ensure no two threads access data at the same time.
- Control synchronization. Threads can depend on each others’ traversal through the program’s flow of control and state space. One thread often needs to wait until another thread or set of threads have reached a specific point in the program’s execution, perhaps to rendezvous and exchange data after finishing one step in a cooperative algorithm, or maybe because one thread has assumed the role of orchestrating a set of other threads and they need to be told what to do next. In either case, this is called control synchronization.
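To make the bank transfer example concrete, here is a minimal C# sketch (not from the original text); the Account type, its field, and the single coarse lock are hypothetical illustrations of the mutual exclusion fix described above.

using System;

class Account
{
    public decimal Balance; // shared state
}

static class Bank
{
    // One coarse-grained lock protects all transfers (an illustrative choice).
    private static readonly object s_transferLock = new object();

    public static void Transfer(Account from, Account to, decimal amount)
    {
        // Without this lock, two concurrent transfers could interleave their
        // read-modify-write sequences and lose or duplicate money; holding it
        // across both updates also hides the "in neither account" state.
        lock (s_transferLock)
        {
            from.Balance -= amount;
            to.Balance += amount;
        }
    }
}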
The two techniques are not mutually exclusive, and it is quite common to use a combination of the two. For instance, we might want a producer thread to notify a consumer that some data has been made available in a shared buffer, with control synchronization, but we also have to make sure both the producer and consumer access the data safely, using data synchronization.
Although all synchronization can be logically placed into the two general categories mentioned previously, the reality is that there are many ways to implement data and control synchronization in your programs on Windows and the .NET Framework. The choice is often fundamental to your success with concurrency, mostly because of performance. Many design forces come into play during this choice: correctness, that is, whether the choice leads to correct code; performance, that is, the impact on the sequential performance of your algorithm; and liveness and scalability, that is, the ability of your program to ensure that, as more and more processors are added, the throughput of the system improves commensurately (or at least does not degrade).
Because these are such large topics, we will tease them apart and review them in several subsequent chapters. In this chapter, we stick to the general ideas, providing motivating examples as we go. In Chapter 5, Windows Kernel Synchronization, we look at the foundational Windows kernel support used for synchronization, and then in Chapter 6, Data and Control Synchronization, we will explore higher level primitives available in Win32 and the .NET Framework. We won’t discuss performance and scalability in great depth until Chapter 14, Performance and Scalability, although it’s a recurring theme throughout the entire book.
Data Synchronization
The solution to the general problem of data races is to serialize concurrent access to shared state. Mutual exclusion is the most popular technique used to guarantee no two threads can be executing the sensitive region of instructions concurrently. The sequence of operations that must be serialized with respect to all other concurrent executions of that same sequence of operations is called a critical region.
Critical regions can be denoted using many mechanisms in today’s systems, ranging from language keywords to API calls, and involving such terminology as locks, mutexes, critical sections, monitors, binary semaphores, and, recently, transactions (see Further Reading, Shavit, Touitou). Each has its own subtle semantic differences. The desired effect, however, is usually roughly the same. So long as all threads use critical regions consistently to access certain data, they can be used to avoid data races.
Some regions support shared modes, such as reader/writer locks, for cases in which it is safe for many threads to read shared data concurrently. We’ll look at examples of this in Chapter 6, Data and Control Synchronization. We will assume strict mutual exclusion for the discussion below.
What happens if multiple threads attempt to enter the same critical region at once? If one thread wants to enter the critical region while another is already executing code inside, it must either wait until the thread leaves or it must occupy itself elsewhere in the meantime, perhaps checking back again sometime later to see if the critical region has become available. The kind of waiting used differs from one implementation to the next, ranging from busy waiting to relying on Windows’ support for waiting and signaling. We will return to this topic later.
Let’s take a brief example. Given some statement or compound statement of code, S, that depends on shared state and may run concurrently on separate threads, we can make use of a critical region to eliminate the possibility of data races.
EnterCriticalRegion();
S;
LeaveCriticalRegion();
(Note that these APIs are completely fake and simply used for illustration.)
The semantics of the faux EnterCriticalRegion API are rather simple: only one thread may be inside the region at a time; any other thread must wait for the thread currently inside to issue a call to LeaveCriticalRegion. This ensures that only one thread may be executing the statement S at once in the entire process and, hence, serializes all executions. It appears as if all executions of S happen atomically—provided there is no possibility of concurrent access to the state accessed in S outside of critical regions, and that S may not fail part-way through—although clearly S is not really atomic in the most literal sense of the word.
Using critical regions can solve both of the data invariant violations illustrated earlier, that is, when S is (*a)++. Here is the first problematic interleaving we saw, with critical regions added into the picture.
T    t1                                 t2
0    t1(E): EnterCriticalRegion();
1    t1(0): MOV EAX,[a]      #0
2                                       t2(E): EnterCriticalRegion();
3    t1(1): INC EAX          #1
4    t1(2): MOV [a],EAX      #1
5    t1(L): LeaveCriticalRegion();
6                                       t2(0): MOV EAX,[a]      #1
7                                       t2(1): INC EAX          #2
8                                       t2(2): MOV [a],EAX      #2
9                                       t2(L): LeaveCriticalRegion();
In this example, t2 attempts to enter the critical region at time 2. But the thread is not permitted to proceed because t1 is already inside the region and it must wait until time 5 when t1 leaves. The result is that no two threads may be operating on a simultaneously.
As alluded to earlier, any other accesses to a in the program must also be done under the protection of a critical region to preserve atomicity and correctness across the whole program. Should one thread forget to enter the critical region before writing to a, shared state can become corrupted, causing cascading failures throughout the program. For better or for worse, critical regions in today’s programming systems are very code-centric rather than being associated with the data accessed inside those regions.
A Generalization of the Idea: Semaphores
The semaphore was invented by E. W. Dijkstra in 1965 as a generalization of the critical region idea. It permits more sophisticated patterns of data synchronization in which a fixed number of threads are permitted to be inside the critical region simultaneously.
The concept is simple. A semaphore is assigned an initial count when created, and, so long as the count remains above 0, threads may continue to decrement the count without waiting. Once the count reaches 0, however, any threads that attempt to decrement the semaphore further must wait until another thread releases the semaphore, increasing the count back above 0. The names Dijkstra invented for these operations are P, for the fictitious word prolaag, meaning to try to take, and V, for the Dutch word verhoog, meaning to increase. Since these words are meaningless to those of us who don’t speak Dutch, we’ll refer to these activities as taking and releasing, respectively.
A critical region (a.k.a. mutex) is therefore just a specialization of the semaphore in which its current count is always either 0 or 1, which is also why critical regions are often called binary semaphores. Semaphores with maximum counts of more than 1 are typically called counting semaphores. Windows and .NET both offer intrinsic support for semaphore objects. We will explore this support further in Chapter 6, Data and Control Synchronization.
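As a quick preview of that support, here is a minimal C# sketch of the take/release pattern using the .NET Semaphore class; the counts and the connection pool scenario are invented for illustration.

using System.Threading;

static class ConnectionPool
{
    // At most three threads may hold a "connection" at once (counts illustrative).
    private static readonly Semaphore s_available = new Semaphore(3, 3);

    public static void UseConnection()
    {
        s_available.WaitOne();     // take: decrement the count, waiting if it is 0
        try
        {
            // ... use one of the three connections ...
        }
        finally
        {
            s_available.Release(); // release: increment the count, waking a waiter if any
        }
    }
}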
Patterns of Critical Region Usage
The faux syntax shown earlier for entering and leaving critical regions maps closely to real primitives and syntax. We’ll generally interchange the terminology enter/leave, enter/exit, acquire/release, and begin/end to mean the same thing. In any case, there is a pair of operations for the critical region: one to enter and one to exit. This syntax might appear to suggest there is only one critical region for the entire program, which is almost never true. In real programs, we will deal with multiple critical regions, protecting different disjoint sets of data, and therefore, we often will have to instantiate, manage, and enter and leave specific critical regions, either by name, object reference, or some combination of both, during execution.
A thread entering some region 1 does not interfere with threads entering a separate region 2, and vice versa. Therefore, we must ensure that all threads consistently enter the correct region when accessing certain data. As an illustration, imagine we have two separate CriticalRegion objects, each with Enter and Leave methods. If two threads try to increment a shared variable s_a, they must both acquire the same CriticalRegion first. If they acquire separate regions, mutual exclusion is not guaranteed and the program has a race.
Here is an example of such a broken program.
static int s_a;
static CriticalRegion cr1, cr2; // initialized elsewhere

void f()
{
    cr1.Enter();
    s_a++;
    cr1.Leave();
}

void g()
{
    cr2.Enter();
    s_a++;
    cr2.Leave();
}
This example is flawed because f acquires critical region cr1 and g acquires critical region cr2. But there are no mutual exclusion guarantees between these separate regions. If one thread runs f concurrently with another thread that is running g, we will see data races.
Critical regions are most often—but not always—associated with some static lexical scope, in the programming language sense, as shown above. The program enters the region, performs the critical operation, and exits, all occurring on the same stack frame, much like a block scope in C-based languages. Keep in mind that this is just a common way to group synchronization-sensitive operations under the protection of a critical region and not necessarily a restriction imposed by the mechanisms you will be using. (Many encourage it, however, like C# and VB, which offer keyword support.) It’s possible, although often more difficult and much more error prone, to be more dynamic about where regions are entered and left.
BOOL f()
{
    if (...)
    {
        EnterCriticalRegion();
        S0; // some critical work
        return TRUE;
    }
    return FALSE;
}

void g()
{
    if (f())
    {
        S1; // more critical work
        LeaveCriticalRegion();
    }
}
This style of critical region use is more difficult for a number of reasons, some of which are subtle. First, it is important, for performance reasons, to write programs that spend as little time as possible in critical regions; this example inserts some unknown number of instructions into the region (i.e., the function return epilogue of f and whatever the caller decides to do before leaving). Synchronization is difficult enough already, and spreading a single region across multiple functions adds difficulty where it is not needed.
But perhaps the most notable problem with the more dynamic approach is reacting to an exception from within the region. Normally, programs will want to guarantee the critical region is exited, even if the region is terminated under exceptional circumstances (although not always, as this failure can indicate data corruption). Using a statically scoped block allows you to use constructs like try/finally to ensure this.
EnterCriticalRegion();
__try
{
    S0; S1; // critical work
}
__finally
{
    LeaveCriticalRegion();
}
Achieving this control flow for failure and success becomes more difficult with more dynamism. Why might we care so much about guaranteeing release? Well, if we don’t always guarantee the lock is released, another thread may subsequently attempt to enter the region and wait indefinitely. This is called an orphaned lock and leads to deadlock.
Simply releasing the lock in the face of failure is seldom sufficient, however. Recall that our definition of atomicity specifies two things: that the effects appear instantaneously and that they happen either completely or not at all. If we release the lock immediately when a failure occurs, we may be opening up data corruption to the rest of the program. For example, say we had two shared variables x and y with some known invariant relating them; if a region modified x but failed before it had a chance to modify y, releasing the region would expose the corrupt data and likely lead to additional failure in other parts of the program. Deadlock is generally more debuggable than data corruption, so if the code cannot be written to revert the update to x in the face of such a failure, it’s often a better idea to leave the region in an acquired state. That said, we will use a try/finally type of scheme in examples to ensure the region is exited properly.
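For managed code, here is a minimal C# sketch of the same try/finally discipline, assuming a simple shared counter; the Counter type is invented for illustration, and the C# lock keyword expands into roughly this Monitor.Enter/try/finally shape.

using System.Threading;

static class Counter
{
    private static int s_value;
    private static readonly object s_region = new object(); // the region's lock object

    public static void Increment()
    {
        Monitor.Enter(s_region);
        try
        {
            s_value++;              // critical work; invariants hold on exit
        }
        finally
        {
            Monitor.Exit(s_region); // guaranteed release, even if an exception occurs
        }
    }
}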
Coarse- vs. Fine-Grained Regions
When using critical regions, you must decide what data is to be protected by which regions. Coarse- and fine-grained regions represent the two extreme ends of the spectrum. At one extreme, a single critical region could be used to protect all data in the program; this would force the program to run effectively single-threaded because only one thread could make forward progress at once. At the other extreme, every byte in the heap could be protected by its own critical region; this might alleviate scalability bottlenecks, but would be ridiculously expensive to implement, not to mention nearly impossible to understand, to keep free of deadlock, and so on. Most systems must strike a careful balance between these two extremes.
The critical region mechanisms available today are defined by regions of program statements in which mutual exclusion is in effect, as shown above, rather than being defined by the data accessed within such regions. The data accessed is closely related to the program logic, but not directly: any given data can be manipulated by many regions of the program and similarly any given region of the program is apt to manipulate different data. This requires many design decisions and tradeoffs to be made around the organization of critical regions.
Programs are often organized as a collection of subsystems and composite data structures whose state may be accessed concurrently by many threads at once. Two reasonable and useful approaches to organizing critical regions are as follows:
- Coarse-grained. A single lock is used to protect all constituent parts of some subsystem or composite data structure. This is the simplest scheme to get right. There is only one lock to manage and one lock to acquire and release: this reduces the space and time spent on synchronization, and the decision of what comprises a critical region is driven entirely by the need of threads to access some large, easy to identify thing. Much less work is required to ensure safety. This overly conservative approach may have a negative impact on scalability due to false sharing, however. False sharing prevents concurrent access to some data unnecessarily, that is, access that does not actually need to be guarded to ensure correctness.
- Fine-grained. As a way of improving scalability, we can use a unique lock per constituent piece of data (or some grouping of data), enabling many threads to access disjoint data objects simultaneously. This reduces or eliminates false sharing, allowing threads to achieve greater degrees of concurrency and, hence, better liveness and scalability. The downside to this approach is the increased number of locks to manage and the potential need for multiple lock acquisitions when more than one data structure must be accessed at once, both of which are bad for space and time complexity. This strategy also can lead to deadlocks if not used carefully. If there are complex invariant relationships between multiple data structures, it can also become more difficult to eliminate data races.
No single approach will be best for all scenarios. Programs will use a combination of techniques on this spectrum. But as a general rule of thumb, starting with coarse-grained locking to ensure correctness first, and then moving to successively finer-grained regions as scalability requirements demand, typically leads to a more maintainable, understandable, and bug-free program.
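As an illustration of the two approaches, here is a minimal C# sketch; the counter types and fields are invented, and real systems would of course be weighing much larger structures.

// Coarse-grained: one lock protects both counters. Simple, but threads
// updating unrelated counters still contend on the same region.
class CoarseCounters
{
    private readonly object _lock = new object();
    private int _hits, _misses;

    public void Hit()  { lock (_lock) { _hits++; } }
    public void Miss() { lock (_lock) { _misses++; } }
}

// Fine-grained: each counter has its own lock, so unrelated updates can
// proceed in parallel, at the cost of more locks to manage and reason about.
class FineCounters
{
    private readonly object _hitLock = new object();
    private readonly object _missLock = new object();
    private int _hits, _misses;

    public void Hit()  { lock (_hitLock) { _hits++; } }
    public void Miss() { lock (_missLock) { _misses++; } }
}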
How Critical Regions Are Implemented
Before moving on, let’s briefly explore how critical regions might be implemented. There are a series of requirements for any good critical region implementation.
- The mutual exclusion property holds. That is, there can never be a circumstance in which more than one thread enters the critical region at once.
- Liveness of entrance and exit of the region is guaranteed. The system as a whole will continue to make forward progress, meaning that the algorithm can cause neither deadlock nor livelock. More formally, given an infinite amount of time, each thread that arrives at the region is guaranteed to eventually enter the region, provided that no thread stays in the region indefinitely.
- Some reasonable degree of fairness, such that a thread’s arrival time at the region somehow gives it (statistical) preference over other threads, is desirable though not strictly required. This does not necessarily dictate that there is a deterministic fairness guarantee—such as first-in, first-out—but often regions strive to be reasonably fair, probabilistically speaking.
- Low cost is yet another subjective criterion. It is important that entering and leaving the critical region be very inexpensive. Critical regions are often used pervasively in low-level systems software, such as operating systems, and thus, there is a lot of pressure on the efficiency of the implementation.
As we’ll see, there is a progression of approaches that can be taken. In the end, however, we’ll see that all modern mutual exclusion mechanisms rely on a combination of atomic compare and swap (CAS) hardware instructions and operating system support. But before exploring that, let’s see why hardware support is even necessary. In other words, shouldn’t it be easy to implement EnterCriticalRegion and LeaveCriticalRegion using familiar sequential programming constructs?
The simplest, overly naïve approach won’t work at all. We could have a single flag variable, initially 0, which is set to 1 when a thread enters the region and 0 when it leaves. Each thread attempting to enter the region first checks the flag and then, once it sees the flag at 0, sets it to 1.
int taken = 0;

void EnterCriticalRegion()
{
    while (taken != 0) /* busy wait */ ;
    taken = 1; // Mark the region as taken.
}

void LeaveCriticalRegion()
{
    taken = 0; // Mark the region as available.
}
This is fundamentally very broken. The reason is that the algorithm uses a sequence of reads and writes that isn’t atomic. Imagine that two threads each read taken as 0 and, based on this, each decides to write 1 into it. Both threads would then believe they own the critical region and would run code inside it at once. This is precisely the thing we’re trying to avoid with the use of critical regions in the first place!
Before reviewing the state of the art—that is, the techniques all modern critical regions use—we’ll take a bit of a historical detour in order to better understand the evolution of solutions to mutual exclusion during the past 40+ years.
Strict Alternation
We might first try to solve this problem with a technique called strict alternation, granting ownership to thread 0, which then grants ownership to thread 1 when it is done, which then grants ownership to 2 when it is done, and so on, for N threads, finally returning ownership back to 0 after thread N – 1 has been given ownership and finished running inside the region. This might be implemented in the form of the following code snippet:
const int N = ...; // # of threads in the system.
int turn = 0;      // Thread 0 gets its turn first.

void EnterCriticalRegion(int i)
{
    while (turn != i) /* busy wait */ ;
    // Someone gave us the turn... we own the region.
}

void LeaveCriticalRegion(int i)
{
    // Give the turn to the next thread (possibly wrapping to 0).
    turn = (i + 1) % N;
}
This algorithm ensures mutual exclusion inside the critical region for precisely N concurrent threads. In this scheme, each thread is given a unique identifier in the range [0 . . . N), which is passed as the argument i to EnterCriticalRegion. The turn variable indicates which thread is currently permitted to run inside the critical region, and when a thread tries to enter the critical region, it must wait for its turn to be granted by another thread, in this particular example by busy spinning. With this algorithm, we have to choose someone to be first, so we somewhat arbitrarily decide to give thread 0 its turn first by initializing turn to 0 at the outset. Upon leaving the region, each thread simply notifies the next thread that its turn has come up: it does this notification by setting turn, either wrapping it back around to 0, if we’ve reached the maximum number of threads, or by incrementing it by one otherwise.
There is one huge deal breaker with strict alternation: the decision to grant a thread entry to the critical region is not based in any part on the arrival of threads at the region. Instead, there is a predefined ordering: 0, then 1, then . . ., then N – 1, then 0, and so on, which is nonnegotiable and always fixed. This is hardly fair and effectively means a thread that isn’t currently in the critical region can keep another thread from entering it. This can threaten the liveness of the system because threads must wait to enter the critical region even when there is no thread currently inside of it. This kind of “false contention” isn’t a correctness problem per se, but it reduces the performance and scalability of any program that uses the algorithm. The algorithm also only works if threads regularly enter and exit the region, since that’s the only way to pass on the turn. Another problem, which we won’t get to solving for another few pages, is that the critical region cannot accommodate a varying number of threads. It’s quite rare to know a priori the number of threads a given region must serve, and even rarer for this number to stay fixed for the duration of a process’s lifetime.
Dekker’s and Dijkstra’s Algorithms (1965)
The first widely publicized general solution to the mutual exclusion problem, which did not require strict alternation, was a response submitted by a reader of a 1965 paper by E. W. Dijkstra in which he identified the mutual exclusion problem and called for solutions (see Further Reading, Dijkstra, 1965, Co-operating sequential processes). One particular reader, T. Dekker, submitted a response that met Dijkstra’s criteria but that works only for two concurrent threads. It’s referred to as “Dekker’s algorithm” and was subsequently generalized in a paper by Dijkstra, also in 1965 (see Further Reading, Dijkstra, 1965, Solution of a problem in concurrent programming control), to accommodate N threads.
Dekker’s solution works similarly to strict alternation, in which turns are assigned, but extends this with the capability for each thread to note an interest in taking the critical region. If a thread desires the region but it isn’t its turn to enter, it may “steal” the turn if the other thread has not also noted interest (i.e., isn’t in the region).
In our sample implementation, we have a shared 2-element array of Booleans, flags, initialized to contain false values. A thread stores true into its respective element (index 0 for thread 0, 1 for thread 1) when it wishes to enter the region, and false as it exits. So long as only one thread wants to enter the region, it is permitted to do so. This works because a thread first writes true into its own element of the shared flags array and then checks whether the other thread has also stored into its element; if we write true and then read false from the other thread’s element, we can be assured that the other thread will see our true value. (Note that modern processors perform out-of-order reads and writes that actually break this assumption. We’ll return to this topic later.)
We must deal with the case of both threads entering simultaneously. The tie is broken by using a shared turn variable, much like we saw earlier. Just as with strict alternation, when both threads wish to enter, a thread may only enter the critical region when it sees turn equal to its own index and that the other thread is no longer interested (i.e., its flags element is false). If a thread finds that both threads wish to enter but it’s not its turn, the thread will “back off” and wait by setting its flags element to false and waiting for the turn to change. This lets the other thread enter the region. When a thread leaves the critical region, it just resets its flags element to false and changes the turn.
This entire algorithm is depicted in the following snippet.
static bool[] flags = new bool[2];
static int turn = 0;

void EnterCriticalRegion(int i) // i will only ever be 0 or 1
{
    int j = 1 - i;       // the other thread's index
    flags[i] = true;     // note our interest

    while (flags[j])     // wait until the other is not interested
    {
        if (turn == j)   // not our turn, we must back off and wait
        {
            flags[i] = false;
            while (turn == j) /* busy wait */ ;
            flags[i] = true;
        }
    }
}

void LeaveCriticalRegion(int i)
{
    turn = 1 - i;        // give away the turn
    flags[i] = false;    // and exit the region
}
Dijkstra’s modification to this algorithm supports N threads. While it still requires N to be determined a priori, it does accommodate systems in which fewer than N threads are active at any moment, which admittedly makes it much more practical.
The implementation is slightly different from Dekker’s algorithm. We have a flags array of size N, but instead of Booleans it holds tri-state values. Each element can take on one of three values, and in our example, we will use an enumeration: passive, meaning the thread is uninterested in the region at this time; requesting, meaning the thread is attempting to enter the region; and active, which means the thread is currently executing inside of the region.
A thread, upon arriving at the region, notes interest by setting its flag to requesting. It then attempts to “steal” the current turn: if the current turn is assigned to a thread that isn’t interested in the region, the arriving thread will set turn to its own index. Once the thread has stolen the turn, it notes that it is actively in the region. Before actually moving on, however, the thread must verify that no other thread has stolen the turn in the meantime and possibly already entered the region, or we could break mutual exclusion. This is verified by ensuring that no other thread’s flag is active. If another active thread is found, the arriving thread will back off and go back to a requesting state, continuing the process until it is able to enter the region. When a thread leaves the region, it simply sets its flag to passive.
Here is a sample implementation in C#.
const int N = ...; // # of threads that can enter the region.
enum F : int { Passive, Requesting, Active }
F[] flags = new F[N]; // all initialized to passive
int turn = 0;

void EnterCriticalRegion(int i)
{
    int j;
    do
    {
        flags[i] = F.Requesting;          // note our interest

        while (turn != i)                 // spin until it's our turn
            if (flags[turn] == F.Passive)
                turn = i;                 // steal the turn

        flags[i] = F.Active;              // announce we're entering

        // Verify that no other thread has entered the region.
        for (j = 0; j < N && (j == i || flags[j] != F.Active); j++)
            ;
    }
    while (j < N);
}

void LeaveCriticalRegion(int i)
{
    flags[i] = F.Passive; // just note we've left
}
Note that, just as with Dekker’s algorithm above, this code will not work as written on modern compilers and processors due to the high likelihood of out-of-order execution. It is meant to illustrate the logical sequence of steps only.
Peterson’s Algorithm (1981)
Some 16 years after the original Dekker algorithm was published, a simplified algorithm was developed by G. L. Peterson and detailed in his provocatively titled paper, “Myths About the Mutual Exclusion Problem” (see Further Reading, Peterson). It is simply referred to as Peterson’s algorithm. In fewer than two pages, he showed a two-thread algorithm alongside a slightly more complicated N-thread version, both of which were simpler than the intervening years of efforts to simplify Dekker’s and Dijkstra’s original proposals.
For brevity’s sake, we review just the two thread version here. The shared variables are the same, that is, a flags array and a turn variable, as in Dekker’s algorithm. Unlike Dekker’s algorithm, however, a requesting thread immediately gives away the turn to the other thread after setting its flags element to true. The requesting thread then waits until either the other thread is not in its critical region or until the turn has been given back to the requesting thread.
bool[] flags = new bool[2];
int turn = 0;

void EnterCriticalRegion(int i)
{
    flags[i] = true;    // note our interest in the region
    turn = 1 - i;       // give the turn away

    // Wait until the region is available or it's our turn.
    while (flags[1 - i] && turn != i) /* busy wait */ ;
}

void LeaveCriticalRegion(int i)
{
    flags[i] = false;   // just exit the region
}
Peterson’s algorithm, just like Dekker’s, satisfies all of the basic mutual exclusion, fairness, and liveness properties outlined above. It is also much simpler, so it tends to be preferred over Dekker’s algorithm when teaching mutual exclusion.
Lamport’s Bakery Algorithm (1974)
L. Lamport also proposed an alternative algorithm, which he called the bakery algorithm (see Further Reading, Lamport, 1974). This algorithm nicely accommodates varying numbers of threads, but has the added benefit that the failure of one thread midway through executing the critical region entrance or exit code does not destroy liveness of the system, as is the case with the other algorithms seen so far. All that is required is that the failed thread reset its ticket number to 0 and move to its noncritical region. Lamport was interested in applying his algorithm to distributed systems, in which such fault tolerance was obviously a critical component of any viable algorithm.
The algorithm is called the “bakery” algorithm because it works a bit like your neighborhood bakery. When a thread arrives, it takes a ticket number, and only when its ticket number is called (or more precisely, those threads with lower ticket numbers have been serviced) will it be permitted to enter the critical region. The implementation properly deals with the edge case in which multiple threads happen to be assigned the same ticket number by using an ordering among the threads themselves—for example, a unique thread identifier, name, or some other comparable property—to break the tie. Here is a sample implementation.
const int N = ...; // # of threads that can enter the region.
int[] choosing = new int[N];
int[] number = new int[N];

void EnterCriticalRegion(int i)
{
    // Let others know we are choosing a ticket number.
    // Then find the max current ticket number and add one.
    choosing[i] = 1;
    int m = 0;
    for (int j = 0; j < N; j++)
    {
        int jn = number[j];
        m = jn > m ? jn : m;
    }
    number[i] = 1 + m;
    choosing[i] = 0;

    for (int j = 0; j < N; j++)
    {
        // Wait for threads to finish choosing.
        while (choosing[j] != 0) /* busy wait */ ;

        // Wait for those with lower tickets to finish. If we took
        // the same ticket number as another thread, the one with the
        // lowest ID gets to go first instead.
        int jn;
        while ((jn = number[j]) != 0 &&
               (jn < number[i] || (jn == number[i] && j < i)))
            /* busy wait */ ;
    }

    // Our ticket was called. Proceed to our region...
}

void LeaveCriticalRegion(int i)
{
    number[i] = 0;
}
This algorithm is also unique when compared to previous efforts because threads are truly granted fair entrance into the region. Tickets are assigned on a first-come, first-served basis (FIFO), and this corresponds directly to the order in which threads enter the region.
Hardware Compare and Swap Instructions (Fast Forward to Present Day)
Mutual exclusion has been the subject of quite a bit of research. It’s easy to take it all for granted given how ubiquitous and fundamental synchronization has become, but nevertheless you may be interested in some of the references to learn more than what’s possible to describe in just a few pages (see Further Reading, Raynal).
Most of the techniques shown also share one thing in common. Aside from the bakery algorithm, each relies on the fact that reads and writes from and to natural word-sized locations in memory are atomic on all modern processors. But they specifically do not require atomic sequences of instructions in the hardware. These are truly “lock free” in the most literal sense of the phrase. However, most modern critical regions are not implemented using any of these techniques. Instead, they use intrinsic support supplied by the hardware.
One additional drawback of many of these software only algorithms is that one must know N in advance and that the space and time complexity of each algorithm depends on N. This can pose serious challenges in a system where any number of threads—a number that may only be known at runtime and may change over time—may try to enter the critical region. Windows and the CLR assign unique identifiers to all threads, but unfortunately these identifiers span the entire range of a 4-byte integer. Making N equal to 2^32 would be rather absurd.
Modern hardware supports atomic compare and swap (CAS) instructions. These are supported in Win32 and the .NET Framework where they are called interlocked operations. (There are many related atomic instructions supported by the hardware. This includes an atomic bit-test-and-set instruction, for example, which can also be used to build critical regions. We’ll explore these in more detail in Chapter 10, Memory Models and Lock Freedom.) Using a CAS instruction, software can load, compare, and conditionally store a value, all in one atomic, uninterruptible operation. This is supported in the hardware via a combination of CPU and memory subsystem support, differing in performance and complexity across different architectures.
Imagine we have a CAS API that takes three arguments: (1) a pointer to the address we are going to read and write, (2) the value we wish to place into this location, and (3) the value that must be in the location in order for the operation to succeed. It returns true if the comparison succeeded—that is, if the value specified in (3) was found in location (1), and therefore the write of (2) succeeded—or false if the operation failed, meaning that the comparison revealed that the value in location (1) was not equal to (3). With such a CAS instruction in hand, we can use an algorithm similar to the first intuitive guess we gave at the beginning of this section:
int taken = 0;

void EnterCriticalRegion()
{
    // Mark the region as taken.
    while (!CAS(&taken, 1, 0)) /* busy wait */ ;
}

void LeaveCriticalRegion()
{
    taken = 0; // Mark the region as available.
}
A thread trying to enter the critical region continuously tries to write 1 into the taken variable, but only if it reads it as 0 first, atomically. Eventually the region will become free and the thread will succeed in writing the value. Only one thread can enter the region because the CAS operation guarantees that the load, compare, and store sequence is done completely atomically.
This implementation gives us a much simpler algorithm that happens to accommodate an unbounded number of threads, and does not require any form of alternation. It does not give any fairness guarantee or preference as to which thread is given the region next, although it could clearly be extended to do so. In fact, busy waiting indefinitely as shown here is usually a bad idea, and instead, true critical region primitives are often built on top of OS support for waiting, which does have some notion of fairness built in.
Most modern synchronization primitives are built on top of CAS operations. Many other useful algorithms also can be built on top of CAS. For instance, returning to our earlier motivating data race, (*a)++, we can use CAS to achieve a race-free and serializable program rather than using a first class critical region. For example:
void AtomicIncrement(int * p)
{
    int seen;
    do
    {
        seen = *p;
    }
    while (!CAS(p, seen + 1, seen));
}

// ... elsewhere ...
int a = 0;
AtomicIncrement(&a);
If another thread changes the value in location p in between the reading of it into the seen variable, the CAS operation will fail. The function responds to this failed CAS by just looping around and trying the increment again until the CAS succeeds. Just as with the lock above, there are no fairness guarantees. The thread trying to perform an increment can fail any number of times, but probabilistically it will eventually make forward progress.
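As a rough mapping of this faux CAS onto the interlocked operations mentioned above, here is a C# sketch using Interlocked.CompareExchange, which returns the value it observed at the location; the helper shown is illustrative (in practice, Interlocked.Increment does this directly).

using System.Threading;

static class Atomic
{
    public static void Increment(ref int location)
    {
        int seen;
        do
        {
            seen = location; // read the current value
        }
        // Store seen + 1 only if the location still holds seen; the call
        // returns what it observed, so anything other than seen means
        // another thread intervened and we must retry.
        while (Interlocked.CompareExchange(ref location, seen + 1, seen) != seen);
    }
}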
The Harsh Reality of Reordering, Memory Models
The discussion leading up to this point has been fairly naïve. With all of the software-only examples of mutual exclusion algorithms above, there is a fundamental problem lurking within. Modern processors execute instructions out of order and modern compilers perform sophisticated optimizations that can introduce, delete, or reorder reads and writes. Reference has already been made to this point. But if you try to write and use a critical region as I’ve shown, it will likely not work as expected. The hardware-based version (with CAS instructions) will typically work on modern processors because CAS guarantees a certain level of read and write reordering safety.
Here are a few concrete examples where the other algorithms can go wrong.
- In the original strict alternation algorithm, we use a loop that continually rereads turn, waiting for it to become equal to the thread’s index i. Because turn is not written in the body of the loop, a compiler may conclude that turn is loop invariant and thus hoist the read into a temporary variable before the loop even begins. This will lead to an infinite loop for threads trying to enter a busy critical region. Moreover, a compiler may only do this under some conditions, such as when optimizations are enabled in nondebug builds. This same problem is present in each of the algorithms shown. (A small sketch of the hoisting issue, and one way to suppress it, follows this list.)
- Dekker’s algorithm fundamentally demands that a thread’s write to its flags entry happens before the read of its partner’s flags variable. If this were not the case, both could read each other’s flags variable as false and proceed into the critical region, breaking the mutual exclusion guarantee. This reordering is legal and quite common on all modern processors, rendering this algorithm invalid. Similar requirements are present for many of the reads and writes within the body of the critical region acquisition sequence.
- Critical regions typically have the effect of communicating data written inside the critical region to other threads that will subsequently read the data from inside the critical region. For instance, our earlier example showed each thread executing a++. We assumed that surrounding this with a critical region meant that a thread, t2, running later in time than another thread, t1, would always read the value written by t1, resulting in the correct final value. But it’s legal for code motion optimizations in the compiler to move reads and writes outside of the critical regions shown above. This breaks concurrency safety and exposes the data race once again. Similarly, modern processors can execute individual reads and writes out of order, and modern cache systems can give the appearance that reads and writes occurred out of order (based on what memory operations are satisfied by what level of the cache).
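To make the first bullet’s hoisting problem concrete, here is a small C# sketch; the type and field are invented for illustration. The volatile modifier merely forces a fresh read of the flag on each iteration and prevents the compiler from caching it; it does not, by itself, repair the deeper reordering problems in the algorithms above, which is the subject of Chapter 10.

class SpinFlag
{
    // Without 'volatile', the compiler or JIT may hoist the read of _taken
    // out of the loop in WaitUntilFree and spin forever on a stale value.
    private volatile int _taken = 1;

    public void WaitUntilFree()
    {
        while (_taken != 0) { /* busy wait: reread _taken each iteration */ }
    }

    public void Free()
    {
        _taken = 0; // another thread eventually lets the waiter's loop exit
    }
}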
Each of these issues invalidates one or more of the requirements we sought to achieve at the outset. All modern processors, compilers, and runtimes specify which of these optimizations and reorderings are legal and, most importantly, which are not, through a memory model. These guarantees can, in principle, then be relied on to write a correct implementation of a critical region, though it’s highly unlikely anybody reading this book will ever have to take on such a task. The guarantees vary from compiler to compiler and from one processor to the next (when the compiler’s guarantees are weaker than the processor’s guarantees), making it extraordinarily difficult to write correct code that runs everywhere.
Using one of the synchronization primitives from Win32 or the .NET Framework alleviates all need to understand memory models. Those primitives should be sufficient for 99.9 percent (or more) of the scenarios most programmers face. For the cases in which these primitives are not up to the task (which is rare, but can happen for efficiency reasons), or if you’re simply fascinated by the topic, we will explore memory models and some lock free techniques in Chapter 10, Memory Models and Lock Freedom. If you thought that reasoning about program correctness and timings was tricky, just imagine if any of the reads and writes could happen in a randomized order and didn’t correspond at all to the order in the program’s source.
Coordination and Control Synchronization
If it’s not obvious yet, interactions between components change substantially in a concurrent system. Once you have multiple things happening simultaneously, you will eventually need a way for those things to collaborate, either via centrally managed orchestration or autonomous and distributed interactions. In the simplest form, one thread might have to notify another when an important operation has just finished, such as a producer thread placing a new item into a shared buffer for which a consumer thread is waiting. More complicated examples are certainly commonplace, such as when a single thread must orchestrate the work of many subservient threads, feeding them data and instructions to make forward progress on a larger shared problem.
Unlike in sequential programs, state transitions happen in parallel in concurrent programs and are thus more difficult to reason about. It’s not necessarily the fact that things are happening at once that makes concurrency difficult so much as getting the interactions between threads correct. Leslie Lamport said it very well:
- We thought that concurrent systems needed new approaches because many things were happening at once. We have learned instead that . . . the real leap is from functional to reactive systems. A functional system is one that can be thought of as mapping an input to an output. . . . A (reactive) system is one that interacts in more complex ways with its environment (see Further Reading, Lamport, 1993).
Earlier in this chapter, we saw how state can be shared in order to speed up communication between threads and the burden that implies. The patterns of communication present in real systems often build directly on top of such sharing. In the scenario with a producer thread and a consumer thread mentioned earlier, the consumer may have to wait for the producer to generate an item of interest. Once an item is available, it could be written to a shared memory location that the consumer directly accesses, using appropriate data synchronization to eliminate a class of concurrency hazards. But how does one go about orchestrating the more complex part: waiting, in the case that a consumer arrives before the producer has something of interest, and notification, in the case that a consumer has begun waiting by the time the producer creates that thing of interest? And how does one architect the system of interactions in the most efficient way? These are some topics we will touch on in this section.
Because thread coordination can take on many diverse forms and spans many specific implementation techniques, there are many details to address. As noted in the first chapter, there isn’t any “one” correct way to write a concurrent program; instead, there are certain ways of structuring and writing programs that make one approach more appropriate than another. There are quite a few primitives in Win32 and the .NET Framework and design techniques from which to choose. For now we will focus on building a conceptual understanding of the approaches.
State Dependence Among Threads
As we described earlier, programs are composed of big state machines that are traversed during execution. Threads themselves also are composed of smaller state machines that contribute to the overall state of the program itself. Each carries around some interesting data and performs some number of activities. An activity is just some abstract operation that possibly reads and writes the data and, in doing so, also possibly transitions between states, both local to the thread and global to the program. As we already saw, some level of data synchronization often is needed to ensure invalid states are not reached during the execution of such activities.
It is also worth differentiating between internal and external states, for example, those that are just implementation details of the thread itself versus those that are meant to be observed by other threads running in a system, respectively.
Threads frequently have to interact with other threads running concurrently in the system to accomplish some work, forming a dependency. Once such a dependency exists, a dependent thread will typically have some knowledge of the (externally visible) states the depended-upon thread may transition between. It’s even common for a thread to require that another thread is in a specific state before proceeding with an operation. A thread might only transition into such a state with the passing of time, as a result of external stimuli (like a GUI event or incoming network message), via some third thread running concurrently in the system producing some interesting state itself, or some combination of these. When one thread depends on another and is affected by its state changes (such as by reading memory that it has written), the thread is said to be causally dependent on the other.
Thinking about control synchronization in abstract terms is often helpful, even if the actual mechanism used is less formally defined. As an example, imagine that there is some set of states SP in which the predicate P will evaluate to true. A thread that requires P to be true before it proceeds is actually just waiting for any of the states in SP to arise. Evaluating the predicate P is really asking the question, “Is the program currently in any such state?” And if the answer is no, then the thread must do one of three things: (1) perform some set of reads and writes to transition the program from its current state to one of those in SP; (2) wait for another concurrent thread in the system to perform this activity; or (3) forget about the requirement and do something else instead.
The one example of waiting we’ve seen so far is that of a critical region. In the CAS-based examples, a thread must wait for any state in which the taken variable is false to arise before proceeding into the critical region. Either that is already the case, or the thread trying to enter the region must resort to option (2): waiting for another thread in the system to enable the desired state by leaving the region.
Waiting for Something to Happen
We’ve encountered the topic of waiting a few times now. As just mentioned, a thread trying to enter a critical region that another thread is already actively running within must wait for it to leave. Many threads may simultaneously try to enter a busy critical region, but only one of them will be permitted to enter at a time. Similarly, control synchronization mechanisms require waiting, for example for an occurrence of an arbitrary event, some data of interest to become available, and so forth. Before moving on to the actual coordination techniques popular in the implementation of control synchronization, let’s discuss how it works for a moment.
Busy Spin Waiting
Until now we’ve shown nothing but busy waiting (a.k.a. spin waiting). This is the simplest (and most inefficient) way to “wait” for some condition to become true, particularly in shared memory systems. With busy waiting, the thread simply sits in a loop reevaluating the predicate until it yields the desired answer, continuously rereading shared memory locations.
For instance, if P is some arbitrary Boolean predicate statement and S is some statement that must not execute until P is true, we might do this:
while (!P) /* busy wait */ ;
S;
We say that statement S is guarded by the predicate P. This is an extremely common pattern in control synchronization. Elsewhere there will be a concurrent thread that makes P evaluate to true through a series of writes to shared memory.
Although this simple spin wait is sufficient to illustrate the behavior of our guarded region—allowing many code illustrations in this chapter that would have otherwise required an up-front overview of various other platform features—it has some serious problems.
Spinning consumes CPU cycles, meaning that the thread spinning will remain scheduled on the processor until its quantum expires or until some other thread preempts it. On a single processor machine, this is a complete waste because the thread that will make P true can’t be run until the spinning thread is switched out. Even on a multiprocessor machine, spinning can lead to noticeable CPU spikes, in which it appears as if some thread is doing real work and making forward progress, but the utilization is just caused by one thread waiting for another thread to run. And the thread remains runnable during the entire wait, meaning that other threads waiting to be scheduled (to perform real work) will have to wait in line behind the waiting thread, which is really not doing anything useful. Lastly, if evaluating P touches shared memory that is frequently accessed concurrently, re-evaluating the predicate so often will have a negative effect on the performance of the memory system, both for the processor that is actually spinning and also for those doing useful work.
Not only is spin waiting inefficient, but the aggressive use of CPU cycles, memory accesses, and frequent bus communications all consume considerable amounts of power. On battery-powered devices, embedded electronics, and in other power constrained circumstances, a large amount of spinning can be downright annoying, reducing battery time to a fraction of its normal expected range, and it can waste money. Spinning can also increase heat in data centers, increasing air conditioning costs, making it attractive to keep CPU utilization far below 100 percent.
As a simple example of a problem with spinning, I’m sitting on an airplane as I write this paragraph. Moments ago, I was experimenting with various mutual exclusion algorithms that use busy waiting, of the kind we looked at above, when I noticed my battery had drained much more quickly than usual. Why was this so? I was continuously running test case after test case that made use of many threads using busy waits concurrently. At least I was able to preempt this problem. I just stopped running my test cases. But if the developers who created my word processor of choice had chosen to use a plethora of busy waits in the background spellchecking algorithm, it’s probable that this particular word processor wouldn’t be popular among those who write when traveling. Thankfully that doesn’t appear to be the case.
Needless to say, we can do much better.
Real Waiting in the Operating System’s Kernel
The Windows OS offers support for true waiting in the form of various kernel objects. There are two kinds of event objects, for example, that allow one thread to wait and have some other thread signal the event (waking the waiter[s]) at some point in the future. There are other kinds of kernel objects, and they are used in the implementation of various other higher-level waiting primitives in Win32 and the .NET Framework. They are all described in Chapter 5, Windows Kernel Synchronization.
When a thread waits, it is put into a wait state (versus a runnable state), which triggers a context switch to remove it from the processor immediately, and ensures that the Windows thread scheduler will subsequently ignore it when considering which thread to run next. This avoids wasting CPU availability and power and permits other threads in the system to make forward progress. Imagine a fictional API WaitSysCall that allows threads to wait. Our busy wait loop from earlier might become something like this:
if (!P) WaitSysCall();
S;
Instead of simply making P true, the thread that enables P must now take into consideration that other threads might be waiting. It wakes them with a corresponding call to WakeSysCall.
Enable(P); // ... make P true ...
WakeSysCall();
By now, you probably have picked up a negative outlook on busy waiting altogether. Yet busy waiting can be used (with care) to improve performance and scalability on multiprocessor machines, particularly for fine-grained concurrency. The reason is subtle, having to do with the cost of context switching, waiting, and waking. Getting it correct requires an intelligent combination of both spinning and true waiting. There are also some architecture specific considerations that you will need to make. (If it’s not obvious by now, the spin wait as written above is apt to cause you many problems, so please don’t try to use it.) We will explore this topic in Chapter 14, Performance and Scalability.
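To give a rough flavor of that combination, here is a C# sketch of a two-phase wait; the spin counts are arbitrary, the event is assumed to be a manual-reset event that is set (and stays set) once the awaited condition holds, and real implementations involve considerably more care than this.

using System.Threading;

static class TwoPhaseWait
{
    public static void WaitFor(ManualResetEvent done)
    {
        // Phase 1: spin briefly in the hope the condition becomes true soon,
        // avoiding the cost of a context switch on a multiprocessor.
        for (int i = 0; i < 100; i++)
        {
            if (done.WaitOne(0)) return; // poll the event without blocking
            Thread.SpinWait(20);         // brief pause between polls
        }

        // Phase 2: give up spinning and truly wait in the kernel so the
        // processor can do useful work (or save power) in the meantime.
        done.WaitOne();
    }
}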
Continuation Passing as an Alternative to Waiting
Sometimes it’s advantageous to avoid waiting altogether. This is for a number of reasons, including avoiding the costs associated with blocking a Windows thread. But perhaps more fundamentally, waiting can present scheduling challenges. If many threads wait and are awoken nearly simultaneously, they will contend for resources. The details depend heavily on the way in which work is mapped to threads in your system of choice.
As an alternative to waiting, it is often possible to use continuation passing style (CPS), a popular technique in functional programming environments (see Further Reading, Hoare, 1974). A continuation is an executable closure that represents “the rest” of the computation. Instead of waiting for an event to happen, it is sometimes possible to package up the response to that event in the form of a closure and to pass it to some API that then assumes responsibility for scheduling the continuation when the wait condition has been satisfied.
Because neither Windows nor the CLR offers first-class support for continuations, CPS can be difficult to achieve in practice. As we’ll see in Chapter 8, Asynchronous Programming Models, the .NET Framework’s asynchronous programming model offers a way to pass a delegate to be scheduled in response to an activity completing, as do the Windows and CLR thread pools and various other components. In each case, it’s the responsibility of the user of the API to deal with the fact that the remainder of the computation involves a possibly deep callstack at the time of the call. Transforming “the rest” of the computation is, therefore, difficult to do and is ordinarily only a reasonable strategy for application-level programming where components are not reused in various settings.
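As a simple flavor of the idea, here is a C# sketch that passes a continuation to a worker queued on the CLR thread pool instead of blocking for a result; the Downloader type and its members are invented for illustration and are not a platform API.

using System;
using System.Threading;

static class Downloader
{
    // Rather than block the caller until the data is ready, accept a
    // continuation that runs once the work completes.
    public static void FetchAsync(string url, Action<string> continuation)
    {
        ThreadPool.QueueUserWorkItem(delegate
        {
            string result = "...contents of " + url + "..."; // stand-in for real work
            continuation(result); // "the rest" of the computation runs here
        });
    }
}

// Usage: no thread waits for the result; the response is packaged as a closure.
// Downloader.FetchAsync("http://example.com/", s => Console.WriteLine(s.Length));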
A Simple Wait Abstraction: Events
The most basic control synchronization primitive is the event, also sometimes referred to as a latch, which is a concrete reification of our fictional WaitSysCall and WakeSysCall functions shown above. Events are a flexible waiting and notification mechanism that threads can use to coordinate among one another in a less-structured and free-form manner when compared to critical regions and semaphores. Additionally, there can be many such events in a program to wait and signal different interesting circumstances, much like there can be multiple critical regions to protect different portions of shared state.
An event can be in one of two states at a given time: signaled or nonsignaled. If a thread waits on a nonsignaled event, it does not proceed until the event becomes signaled; otherwise, the thread proceeds right away. Various kinds of events are commonplace, including those that stay signaled permanently (until manually reset to nonsignaled), those that automatically reset back to the nonsignaled state after releasing a single waiting thread, and so on. In subsequent chapters, we will look at the actual event primitives available to you.
To continue with the previous example of guarding a region of code by some arbitrary predicate P, imagine we have a thread that checks P and, if it is not true, wishes to wait. We can use an event E that is signaled when P is enabled and nonsignaled when it is not. That event internally uses whatever waiting mechanism is most appropriate, most likely involving some amount of spinning plus true OS waiting. Threads enabling and disabling P must take care to ensure that E’s state mirrors P correctly.
// Consuming thread:
if (!P)
    E.Wait();
S;

// Enabling thread:
Enable(P); // ... make P true ...
E.Set();
If it is possible for P to subsequently become false in this example and the event is not automatically reset, we must also allow a thread to reset the event.
E.Reset();
Disable(P); // ... make P false ...
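For illustration, this pattern maps almost directly onto the .NET Framework’s ManualResetEvent, one of the real event primitives we examine in later chapters. In the C# sketch below, P, S, Enable, and Disable remain the placeholders from the running example.

    using System.Threading;

    // false == the event starts out nonsignaled; its state mirrors P.
    static ManualResetEvent e = new ManualResetEvent(false);

    // Consuming thread:
    if (!P)
        e.WaitOne();   // block until the event is signaled
    S;

    // Enabling thread:
    Enable(P);         // ... make P true ...
    e.Set();           // signal: wake current waiters and let later ones through

    // Disabling thread:
    e.Reset();         // back to nonsignaled
    Disable(P);        // ... make P false ...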
Each kind of event may reasonably implement different policies for waiting and signaling. One event may decide to wake all waiting threads, while another might decide to wake one and automatically put the event back into a nonsignaled state afterward. Yet another technique may wait for a certain number of calls to Set before waking up any waiters.
As we’ll see, there are some tricky race conditions in all of these examples that we will have to address. For events that stay signaled or have some degree of synchronization built in, you can get away without extra data synchronization, but most control synchronization situations are not quite so simple.
One Step Further: Monitors and Condition Variables
Although events are a general-purpose and flexible construct, the pattern of usage shown here is very common, for example, to implement guarded regions. In other words, some event E being signaled represents some interesting program condition, namely some related predicate P being true, and thus the event state mirrors P’s state accordingly. To accomplish this reliably, data and control synchronization often are needed together. For instance, the evaluation of the predicate P may depend on shared state, in which case data synchronization is required during its evaluation to ensure safety. Moreover, there are data races, mentioned earlier, that we need to handle. Imagine we support setting and resetting; we must avoid the problematic timing of:
t1: Enable(P) -> t2: E.Reset() -> t2: Disable(P) -> t1: E.Set()
In this example, t1 enables the predicate P, but before it has a chance to set the event, t2 comes along and disables P. The result is that we wake up waiting threads although P is no longer true. These threads must take care to re-evaluate P after being awakened to avoid proceeding blindly. But unless they use additional data synchronization, even that re-evaluation cannot be done reliably.
A nice codification of this relationship between state transitions and data and control synchronization was invented in the 1970s (see Further Reading, Hansen; Hoare, 1974): the monitor. Each monitor implicitly has a critical region and may have one or more condition variables associated with it, each representing some condition (like P evaluating to true) for which threads may wish to wait. In this sense, a condition variable is just a fancy kind of event.
All waiting and signaling of a monitor’s condition variables must occur within the critical region of the monitor itself, ensuring data race protection. When a thread decides to wait on a condition variable, it implicitly releases ownership of the monitor (i.e., leaves the critical region), waits, and then reacquires it immediately after being woken up by another thread. This release-wait sequence is done such that other threads entering the monitor are not permitted to enter until the releaser has made it known that it is waiting (avoiding the aforementioned data races). There are also usually mechanisms offered to either wake just one waiting thread or all waiting threads when signaling a condition variable.
Keeping with our earlier example, we may wish to enable threads to wait for some arbitrary predicate P to become true. We could represent this with some monitor M (with methods Enter and Leave) and a condition variable CV (with methods Wait and Set) to represent the condition in which a state transition is made that enables P. (We could have any number of predicates and associated condition variables for M, but our example happens to use only one.) Our example above, which used events, now may look something like this:
// Consuming thread:
M.Enter();
while (!P)
    CV.Wait();
M.Leave();
S; // (or inside the monitor, depending on its contents)

// Enabling thread:
M.Enter();
Enable(P);
CV.Set();
M.Leave();

// Disabling thread:
M.Enter();
Disable(P);
M.Leave();
Notice in this example that the thread that disables P has no additional requirements because it does so within the critical region. The next thread that is granted access to the monitor will re-evaluate P and notice that it has become false, causing it to wait on CV. There is something subtle in this program. The consuming thread continually re-evaluates P in a while loop, waiting whenever it sees that it is false. This re-evaluation is necessary to avoid the case where a thread enables P, setting CV, but where another thread “sneaks in” and disables P before the consuming thread has a chance to enter the monitor. There is generally no guarantee, just because the condition variable on which a thread was waiting has become signaled, that such a thread is the next one to enter the monitor’s critical region.
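As a concrete illustration, the CLR exposes this pattern through the System.Threading.Monitor class and the C# lock statement, where Monitor.Wait plays the role of CV.Wait and Monitor.PulseAll plays the role of CV.Set. The sketch below keeps P, S, Enable, and Disable as placeholders from the running example.

    using System.Threading;

    static readonly object m = new object();   // the monitor protecting P

    // Consuming thread:
    lock (m)
    {
        while (!P)
            Monitor.Wait(m);    // releases the lock, waits, then reacquires it
        S;                      // (or run S outside the lock, as appropriate)
    }

    // Enabling thread:
    lock (m)
    {
        Enable(P);              // ... make P true ...
        Monitor.PulseAll(m);    // wake waiters so they can re-evaluate P
    }

    // Disabling thread:
    lock (m)
    {
        Disable(P);             // ... make P false ...
    }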
Structured Parallelism
Some parallel constructs hide concurrency coordination altogether, so that programs that use them do not need to concern themselves with the low-level events, condition variables, and associated coordination challenges. The most compelling example is data parallelism, where partitioning of the work is driven completely by data layout. The term structured parallelism is used to refer to such parallelism, which typically has well-defined begin and end points.
Some examples of structured parallel constructs follow.
- Cobegin normally takes the form of a block in which each of the contained program statements may execute concurrently; an alternative is an API that accepts an array of function pointers or delegates. The cobegin statement spawns threads to run the statements in parallel and returns only once all of these threads have finished, hiding all coordination behind a clean abstraction. (A minimal sketch of such an API appears after this list.)
- Forall, a.k.a. parallel do loops, in which all iterations of a loop body can run concurrently with one another on separate threads. The statement following the loop itself runs only once all concurrent iterations have finished executing.
- Futures, in which some value is bound to a computation that may happen at an unspecified point in the future. The computation may run concurrently, and consumers of the future’s value can choose to wait for the value to be computed, without having to know that waiting and control synchronization are involved.
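As a rough sketch of the first of these constructs, a cobegin-style API can be assembled today out of plain threads. The CoBegin helper below is hypothetical, not an existing Win32 or .NET API, and a production-quality version would reuse threads rather than create one per statement.

    using System;
    using System.Threading;

    static class StructuredParallelism
    {
        // Hypothetical cobegin: run each action concurrently and return only
        // once all of them have finished, hiding the coordination details.
        public static void CoBegin(params Action[] actions)
        {
            Thread[] threads = new Thread[actions.Length];
            for (int i = 0; i < actions.Length; i++)
            {
                Action a = actions[i];              // capture for the closure
                threads[i] = new Thread(() => a());
                threads[i].Start();
            }
            foreach (Thread t in threads)
                t.Join();                           // wait for all to complete
        }
    }

    // Usage: A, B, and C may run in parallel; CoBegin returns when all are done.
    // StructuredParallelism.CoBegin(() => A(), () => B(), () => C());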
The languages on Windows and the .NET Framework currently do not offer direct support for these constructs, but we will build up a library of them in Chapter 12, Parallel Containers, and Chapter 13, Data and Task Parallelism. This library enables higher level concurrent programs to be built with greater ease. Appendix B, Parallel Extensions to .NET, also takes a look at the future of concurrency APIs on .NET, which offers similar constructs.
Message Passing
In shared memory systems—the dominant concurrent programming model on Microsoft’s development platform (including native Win32 and the CLR)—there is no apparent distinction in the programming interface between state that is used to communicate between threads and state that is thread local. The language and library constructs to work with these two very different categories of memory are identical. At the same time, reads from and writes to shared state usually mean very different things than those that work with thread-private state: they are usually meant to instruct concurrent threads about the state of the system so they can react to the state change. The fact that it is difficult to identify operations that work with this special category of state also makes it difficult to identify where synchronization is required and, hence, to reason about the subtle interactions among concurrent threads.
In message passing systems, all interthread state sharing is encapsulated within the messages sent between threads. This typically requires that state is copied when messages are sent and normally implies handing off ownership of state at the messaging boundary. Logically, at least, this is the same as performing atomic updates in a shared memory system, but is physically quite different. (In fact, using shared memory could be viewed as an optimization for message passing, when it can be proven safe to turn message sends into writes to shared memory. Recent research in operating system design in fact has explored using such techniques [see Further Reading, Aiken, Fahndrich, Hawblitzel, Hunt, Larus].) Due to the copying, message passing in most implementations is less efficient from a performance standpoint. But the overall task of state management is usually simplified.
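To ground the idea, the sketch below shows a minimal blocking channel built on the monitor facilities described earlier; the only state the sender and receiver share is hidden inside the channel, and messages are handed off rather than updated in place. This is an illustrative sketch, not the API of any particular messaging system.

    using System.Collections.Generic;
    using System.Threading;

    // A simple blocking channel: all interthread communication flows through
    // Send and Receive rather than through ad hoc shared variables.
    class Channel<T>
    {
        private readonly Queue<T> messages = new Queue<T>();
        private readonly object sync = new object();

        public void Send(T message)
        {
            lock (sync)
            {
                messages.Enqueue(message);
                Monitor.Pulse(sync);       // wake one waiting receiver
            }
        }

        public T Receive()
        {
            lock (sync)
            {
                while (messages.Count == 0)
                    Monitor.Wait(sync);    // wait until a sender provides a message
                return messages.Dequeue();
            }
        }
    }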
The first popular message passing system was proposed by C. A. R. Hoare as his Communicating Sequential Processes (CSP) research (see Further Reading, Hoare, 1978, 1985). In a CSP system, all concurrency is achieved by having independent processes running asynchronously. When they must interact, they send messages to one another to request or to provide information. Various primitives are supplied to encourage certain communication constructs and patterns, such as interleaving results among many processes, waiting for one of many to produce data of interest, and so on. Using a system like CSP appreciably raises the level of abstraction, from thinking about shared memory and informal state transitions to thinking about independent actors that communicate through well-defined interfaces.
The CSP idea has shown up in many subsequent systems. In the 1980s, actor languages evolved the ideas from CSP, mostly in the context of LISP and Scheme, for the purpose of supporting richer AI programming such as in the Act1 and Act2 systems (see Further Reading, Lieberman). It turns out that modeling agents in an AI system as independent processes that communicate through messages is not only a convenient way of implementing a system, but also leads to increased parallelism that is bounded only by the number of independent agents running at once and their communication dependencies. Actors in such a system also sometimes are called “active objects” because they are usually ordinary objects but use CSP-like techniques transparently for function calls. The futures abstraction mentioned earlier is also typically used pervasively. Over time, programming systems such as Ada and Erlang (see Further Reading, Armstrong) have pushed the envelope of message passing, incrementally moving more and more usage from academia into industry.
Many CSP-like concurrency facilities have been modeled mathematically, which subsequently led to the development of the pi-calculus, among other formalisms, to describe the notion of independently communicating agents. The pi-calculus has also found recent uses outside the domain of computer science (see Further Reading, Sangiorgi, Walker).
Windows and the .NET Framework offer only limited support for fine-grained message passing. CLR AppDomains can be used for fine-grained isolation, possibly using CLR Remoting to communicate between objects in separate domains. But the programming model is not nearly as nice as the aforementioned systems in which message passing is first class. Distributed programming systems such as Windows Communication Foundation (WCF) offer message passing support, but are more broadly used for coarse-grained parallel communication. The Concurrency and Coordination Runtime (CCR), downloadable as part of Microsoft’s Robotics SDK (available on MSDN), offers fine-grained message passing as a first-class construct in the programming model.
As noted in Chapter 1, Introduction, the ideal architecture for building concurrent systems demands a hybrid approach. At a coarse grain, asynchronous agents are isolated and communicate in a mostly loosely coupled fashion; message passing is great for this. At a fine grain, parallel computations share memory and use data and task parallel techniques.