Concurrent Programming on Windows: Synchronization and Time
- Managing Program State
- Synchronization: Kinds and Techniques
- Where Are We?
- Further Reading
State is an important part of any computer system. This point seems so obvious that it sounds silly to say it explicitly. But state within even a single computer program is seldom a simple thing, and, in fact, is often scattered throughout the program, involving complex interrelationships and different components responsible for managing state transitions, persistence, and so on. Some of this state may reside inside a process’s memory—whether that means memory allocated dynamically in the heap (e.g., objects) or on thread stacks—as well as files on-disk, data stored remotely in database systems, spread across one or more remote systems accessed over a network, and so on. The relationships between related parts may be protected by transactions, handcrafted semitransactional systems, or nothing at all.
The broad problems associated with state management, such as keeping all sources of state in sync and architecting consistency and recoverability plans, grow in complexity as the system itself grows and are traditionally very tricky to get right. If one part of the system fails, either state must have been protected so as to avoid corruption entirely (which is generally not possible) or some means of recovering from a known safe point must be put in place.
While state management is primarily outside of the scope of this book, state “in-the-small” is fundamental to building concurrent programs. Most Windows systems are built with a strong dependency on shared memory due to the way in which many threads inside a process share access to the same virtual memory address space. The introduction of concurrent access to such state introduces some tough challenges. With concurrency, many parts of the program may simultaneously try to read or write to the same shared memory locations, which, if left uncontrolled, will quickly wreak havoc. This is due to a fundamental concurrency problem called a data race or often just race condition. Because such things manifest only during certain interactions between concurrent parts of the system, it’s all too easy to be given a false sense of security—that the possibility of havoc does not exist.
In this chapter, we’ll take a look at state and synchronization at a fairly high level. We’ll review the three general approaches to managing state in a concurrent system:
- Isolation, ensuring each concurrent part of the system has its own copy of state.
- Immutability, meaning that shared state is read-only and never modified, and
- Synchronization, which ensures multiple concurrent parts that wish to access the same shared state simultaneously cooperate to do so in a safe way.
We won’t explore the real mechanisms offered by Windows and the .NET Framework yet. The aim is to understand the fundamental principles first, leaving many important details for subsequent chapters, though pseudo-code will be used often for illustration.
We also will look at the relationship between state, control flow, and the impact on coordination among concurrent threads in this chapter. This brings about a different kind of synchronization that helps to coordinate state dependencies between threads. This usually requires some form of waiting and notification. We use the term control synchronization to differentiate this from the kind of synchronization described above, which we will term data synchronization.
Managing Program State
Before discussing the three techniques mentioned above, let’s first be very precise about what the term shared state means. In short, it’s any state that is accessible by more than one thread at a time. It’s surprisingly difficult to pin down more precisely than that, and the programming languages commonly in use on the platform are of little help.
Identifying Shared vs. Private State
In object oriented systems, state in the system is primarily instance and static (a.k.a. class) fields. In procedural systems, or in languages like C++ that support a mixture of object oriented and procedural constructs, state is also held in global variables. In thread based programming systems, state may also take the form of local variables and arguments on thread stacks used during the execution and invocation of functions. There are also several other subtle sources of state distributed throughout many layers in the overall infrastructure: code, DLLs, thread local storage (TLS), runtime and OS resources, and even state that spans multiple processes (such as memory mapped files and even many OS resources).
Now the question is “What constitutes ‘shared state’ versus ‘private state?’” The answer depends on the precise mechanisms you are using to introduce concurrency into the system. Stated generally, shared state is any state that may, at any point in time, be accessed by multiple threads concurrently. In the systems we care about, that means:
- All state pointed to by a global or static field is shared.
- Any state passed during thread creation (from creator to createe) is shared.
- Any state reachable through references in said state is also shared, transitively.
As a programmer, it’s important to be very conscious of these points, particularly the last. The transitive nature of sharing and the fact that, given any arbitrary pointer, you cannot tell whether the state it refers to has been shared or not, cause tremendous difficulty in building concurrent systems on Windows. Once something becomes shared, it can be difficult to track its ownership in the system, particularly to determine precisely at what point it becomes shared and at what point it becomes unshared in the future (if at all). These can be referred to as data publication and privatization, respectively. Certain programming patterns, such as producer/consumer, use consistent sharing and ownership transfer conventions, making the points of publication and privatization more apparent. Even then it’s easy to trip up and make a mistake, such as treating something as private although it is still shared, causing race conditions.
It’s also important to note that the above definitions depend to some degree on modern type safety. In the .NET Framework this is generally not negotiable, whereas in systems like C++ it is highly encouraged but can be circumvented. When any part of the program can manufacture a pointer to any arbitrary address in the process’s address space, all data in the entire address space is shared state. We will ignore this loophole. But when pointer arithmetic is involved in your system, know that many of the same problems we’ll look at in this chapter can manifest. They can be even more nondeterministic and hard to debug, however.
To illustrate some of the challenges in identifying shared state, here’s a class definition in C++. It has one simple method, f, and two fields, one static (s_f) and the other instance (m_f). Despite the use of C++ here, the same principles clearly apply to managed code too.
    class C
    {
        static int s_f;
        int m_f;

    public:
        void f(int * py)
        {
            int x = 0;
            x++;        // local variable
            s_f++;      // static class member
            m_f++;      // class member
            (*py)++;    // pointer to something
        }
    };
The method contains four read/increment/write operations (via C++’s ++ unary operator). In a concurrent system, it is possible that multiple threads could be invoking f on the same instance of C concurrently with one another. Some of these increments will be safe to perform while others are not. Others still might only be safe if f is called in certain ways. We’ll see many detailed examples of what can go wrong with this example. Simply put, any increments of shared data are problematic. This is not strictly true because higher level programming conventions and constructs may actually prevent problematic shared interactions, but given the information above, we have no choice but to assume the worst.
By simply looking at the class definition above, how do we determine what state is shared? Unfortunately we can’t. We need more information. The answer to this question depends on how instances of C are used in addition to where the py pointer came from.
We can quickly label the operations that do not act on shared state because there are so few (just one). The only memory location not shared with other threads is the x variable, so the x++ statement doesn’t modify shared state. (Similar to the statement above about type safety, we are relying on the fact that we haven’t previously shared the address of x on the thread’s stack with another thread. Of course, another thread might have found an address to the stack through some other means and could perform address arithmetic to access x indirectly, but this is a remote possibility. Again, we will assume some reasonable degree of type safety.) Though it doesn’t appear in this example, if there was a statement to increment the value of py, i.e., py++, it would not affect shared state because py is passed by value.
The s_f++ statement affects shared state because, by the definition of static variables, the class’s static memory is visible to multiple threads running at once. Had we used a static local variable in f in the above example, it would fall into this category too.
Here’s where it becomes complicated. The m_f++ line might, at first glance, appear to act on private memory, but we don’t have enough information to know. Whether it modifies shared state or not depends on if the caller has shared the instance of C across multiple threads (or itself received the pointer from a caller that has shared the instance). Remember, m_f++ is a pointer dereference internally, (this->m_f)++. The this pointer might refer to an object allocated on the current thread’s stack or allocated dynamically on the heap and may or may not be shared among threads in either case.
    class D
    {
        static C s_c;   // initialized elsewhere...
        C m_c;          // also initialized elsewhere...

        void g()
        {
            int x = 0;

            C c1;                  // stack-alloc
            c1.f(&x);

            C * c2 = new C();      // heap-alloc
            c2->f(&x);

            s_c.f(&x);
            m_c.f(&x);

            delete c2;
        }
    };
In the case of the c1.f(&x) function call, the object is private because it was allocated on the stack. Similarly, with c2->f(&x) the object is probably private because, although allocated on the heap, the instance is not shared with other threads. (Neither case is simple: C’s constructor could publish a reference to itself to a shared location, making the object shared before the call to f even happens.) When called through s_c, clearly the object is shared because it is stored in a shared static variable. And the answer for the call through m_c is “it depends.” What does it depend on? It depends on the allocation of the instance of D through which g is being invoked. Is it referred to by a static variable elsewhere, another shared object, and so forth? This illustrates how quickly identifying shared state becomes a transitive exercise, often depending on complex, dynamically composed object graphs.
Because the member variable and explicit pointer dereference are similar in nature, you can probably guess why “it depends” for (*py)++ too. The caller of f might be passing a pointer to a private or shared piece of memory. We really have no way of telling.
Determining all of this statically is impossible without some form of type system support (which is not offered by VC++ or any mainstream .NET languages). Calculating the set of shared objects dynamically is also difficult, though possible. The process can be modeled much in the same way garbage collection works: by defining the set of shared roots as those objects referenced directly by static variables, we could then traverse the entire reachable set of objects beginning with those roots, marking all objects as we encounter them (avoiding cycles). At the end, we know that all marked objects are shared. But this approach is too naïve. An object can also become shared at thread creation time by passing a pointer to it as an argument to the thread creation routine. The same goes for thread pool APIs, among others. Some objects are special, such as the one global shared OutOfMemoryException object that the CLR throws when memory is very low. Some degree of compiler analysis could help. A technique called escape analysis determines when private memory “escapes” into the shared memory space, but its application is limited mostly to academic papers (see Further Reading, Choi, Gupta, Serrano, Sreedhar, Midkiff). In practice, complications such as late bound method calls, pointer aliasing, and hidden sources of cross-thread sharing make static analysis generally infeasible and subject to false negatives without restrictions on the programming model. There is research exploring such ideas, such as ownership types, but it is probably years from mainstream use (see Further Reading, Boyapati, Liskov, Shrira).
In the end, logically separating memory that is shared from memory that is private is of utmost importance. This is perhaps the most fundamental and crucial skill to develop when building concurrent systems in modern programming environments: accurately identifying and properly managing shared state. And, more often than not, shared state must be managed carefully and with a great eye for detail. This is also why understanding and debugging concurrent code that someone else wrote is often very difficult.
State Machines and Time
All programs are state machines. Not all people think of their programs this way, but it turns out to be a convenient mental model for concurrent programs. Even if you don’t think about your program as a state machine proper, you probably at least think about your program in terms of time and the sequence of program events on a sequential timeline: the order in which reads from and writes to variables occur, the time distance between two such events, and so on. A unique problem with concurrency thus arises. We are accustomed to reasoning about the code we write on the screen in sequential order, which is necessarily written in a sequential layout. We form mental models and conclusions about the state transitions possible with these assumptions firmly in mind. However, concurrency invalidates many such assumptions.
When state is shared, multiple concurrent threads, each of which may have been constructed with a set of sequential execution assumptions, may end up overlapping in time. And when they overlap in time, their operations become interleaved. If these operations access common memory locations, they may possibly violate the legal set of state transitions that the program’s state machine was planned and written to accommodate. Once this happens, the program may veer wildly off course, doing strange and inexplicable things that the author never intended, including performing bogus operations, corrupting memory, or crashing.
Broken Invariants and Invalid States
As an illustration, let’s say on your first day at a new programming job you were assigned the task of implementing a reusable, dynamically resizing queue data structure. You’d probably start out with a sketch of the algorithms and outline some storage alternatives. You’d end up with some fields and methods and some basic decisions having been made, perhaps such as using an array to store elements versus a linked list. If you’re really methodical, you might write down the state invariants and transitions, capture them as asserts in the code, or even use a formal specification system to capture (and later verify) them. But even if you didn’t go to these lengths, those invariants still exist. Break any one of them during development, or worse after the code has been embedded into a system, and you’ve got a bug.
Let’s consider a really simple invariant. The count of the queue must be less than or equal to the length of the array used to store the individual elements. (There are of course several others: the head and tail indices must be within the legal range, and so on.) If this queue was meant only to be used by sequential programs, then preserving the invariant at the entrance and exit of all public methods would be sufficient as a correctness condition. It would be trivial: only those methods that modify the fields need to be written to carefully respect the invariant. The most difficult aspect of attaining this would be dealing with failures, such as an inability to allocate memory when needed.
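To make the invariant concrete, here is a hedged C# sketch of what a fragment of the sequential version might look like; the field names and the growth policy are illustrative only, not a prescribed design, and only the enqueue path is shown.

    using System.Diagnostics;

    public class SimpleQueue<T>
    {
        private T[] m_array = new T[8];   // storage for queued elements
        private int m_count;              // number of elements currently held

        // The invariant: the count never exceeds the storage array's length.
        [Conditional("DEBUG")]
        private void CheckInvariant()
        {
            Debug.Assert(m_count >= 0 && m_count <= m_array.Length,
                "count must stay within the bounds of the storage array");
        }

        public void Enqueue(T item)
        {
            if (m_count == m_array.Length)
            {
                T[] bigger = new T[m_array.Length * 2];   // grow before storing
                System.Array.Copy(m_array, bigger, m_count);
                m_array = bigger;
            }
            m_array[m_count++] = item;
            CheckInvariant();   // the invariant holds again at method exit
        }
    }

In a sequential program, the assert only needs to hold at the method’s entry and exit, which is easy to arrange when nothing else can observe the object mid-update.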
Things become much more difficult as soon as concurrency is added to the system. Unless another approach is used, you would have to ensure invariants held at every single line of code in your implementation. And even that might not be sufficient if some lines of code (in whatever higher level language you are programming in) were compiled into multiple instructions in the machine language. Moreover, this task becomes impossible when there are multiple variables involved in the operation (as is probably the case with our queue), leading to the requirement of some extra form of state management: i.e., isolation, immutability, or synchronization.
The fact is that it’s very easy to accidentally expose invalid program states as a result of subtle interactions between threads. These states might not exist on any legal state machine diagram we would have drawn for our data structure, but interleaving can cause them. Such problems frequently differ in symptom from one execution of your code to the next (new exceptions, data corruption, and so forth) and depend on timing in order to manifest. The constant change in symptom and dependence on timing make it difficult to anticipate the types of failures you will experience when more concurrency is added to the system and make such failures incredibly hard to debug and fix.
The various solutions hinted at above can solve this problem. The simplest solutions are to avoid sharing data or to avoid updating data completely. Unfortunately, taking such an approach does not completely eliminate the need to synchronize. For instance, you must keep intermediate state changes confined within one thread until they are all complete and then, once the changes are suitable to become visible, you must use some mechanism to publish state updates to the globally visible set of memory as a single, indivisible operation (i.e., atomic operation). All other threads must cooperate by reading such state from the global memory space as a single, indivisible atomic operation.
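To make that concrete, here is a minimal C# sketch of the publish-as-a-single-operation idea for the simple case in which all of the new state hangs off one reference. The PublishedList name is illustrative, the sketch assumes a single writer thread with any number of readers, and it sets aside the memory reordering subtleties covered in Chapter 10, Memory Models and Lock Freedom.

    public class PublishedList
    {
        // Readers only ever see a completely built array, because publication below
        // is a single atomic write of a reference.
        private int[] m_data = new int[0];

        public int[] Read()
        {
            return m_data;                  // one atomic read of the reference
        }

        // Assumed to be called from only one writer thread in this sketch.
        public void Append(int value)
        {
            int[] oldData = m_data;
            int[] newData = new int[oldData.Length + 1];   // private, not yet visible
            System.Array.Copy(oldData, newData, oldData.Length);
            newData[oldData.Length] = value;
            m_data = newData;               // publish: one atomic reference write
        }
    }

This only works because publication collapses to a single pointer sized write; when several independent locations must appear to change together, something more is needed.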
This is not simple to achieve. Because reading and writing an arbitrary number of memory locations atomically at once is not supported by current hardware, software must simulate this effect using critical regions. A critical region ensures that only one thread executes a certain piece of code at once, eliminating problematic interleaved operations and forcing one-after-the-other timing. This implies some threads in the system will have to wait for others to finish work before doing their own. We will discuss critical regions later. But first, let’s look at a motivating example where data synchronization is direly needed.
A Simple Data Race
Consider this deceivingly simple program statement.
    int * a = ...;
    (*a)++;
(Forgive the C++-isms for those managed programmers reading this. (*a)++ is used instead of a++, just to make it obvious that a points to some shared memory location.)
When translated into machine code by the compiler this seemingly simple, high-level, single-line statement involves multiple machine instructions:
    MOV EAX, [a]
    INC EAX
    MOV [a], EAX
Notice that, as a first step, the machine code dereferences a to get some virtual memory address and copies 4 bytes’ worth of memory starting at that address into the processor local EAX register. The code then increments the value of its private copy in EAX, and, lastly, makes yet another copy of the value, this time to copy the incremented value held in its private register back to the shared memory location referred to by a.
The multiple steps and copies involved in the ++ operator weren’t apparent in the source file at all. If you were manipulating multiple variables explicitly, the fact that there are multiple steps would be a little more apparent. In fact, it’s as though we had written:
    int * a = ...;
    int tmp = *a;
    tmp++;
    *a = tmp;
Any software operation that requires multiple hardware instructions is nonatomic. And thus we’ve now established that ++ is nonatomic (as is --), meaning we will have to take extra steps to ensure concurrency safety. There are some other nonobvious sources of nonatomic operations. Modern processors guarantee that single reads from and writes to memory in increments of the natural word size of the machine will be carried out atomically, covering 32-bit values on 32-bit machines and 64-bit values on 64-bit machines. Conversely, reading or writing data larger than the machine’s natural word size is nonatomic. For instance, if you wrote a 64-bit value on a 32-bit machine, it will entail two move instructions from processor private registers to shared memory, each copying a 4-byte segment. Similarly, reading from or writing to unaligned addresses (i.e., address ranges that span a natural word boundary) also requires multiple memory operations in addition to some bit masking and shifting, even if the size of the value is less than or equal to the machine’s word size. Alignment is a tricky subject and is discussed in much more detail in Chapter 10, Memory Models and Lock Freedom.
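As a hedged C# illustration of the 64-bit case: in a 32-bit process, a plain read or write of a long compiles to two 4-byte memory operations, so a reader can observe a “torn” value made of halves of two different writes. The TickCounter type here is purely illustrative; Interlocked.Read and Interlocked.Exchange are one way to obtain atomic 64-bit accesses, and Chapter 10 treats this topic in depth.

    using System.Threading;

    public class TickCounter
    {
        private long m_ticks;   // 64-bit: plain reads/writes are not atomic on a 32-bit CLR

        public void Record(long ticks)
        {
            // A plain "m_ticks = ticks;" could tear into two 32-bit stores.
            Interlocked.Exchange(ref m_ticks, ticks);
        }

        public long Read()
        {
            // A plain "return m_ticks;" could observe half of an old value and half of
            // a new one. Interlocked.Read performs a single atomic 64-bit read.
            return Interlocked.Read(ref m_ticks);
        }
    }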
So why is all of this a problem?
An increment statement is meant to monotonically increase the value held in some memory location by a delta of 1. If three increments were made to a counter with an original value 0, you’d expect the final result to be 3. It should never be possible (overflow aside) for the value of the counter to decrease from one read to the next; therefore, if a thread executes two (*a)++ operations, one after the other, you would expect that the second update always yields a higher value than the first. These are some very basic correctness conditions for our simple (*a)++ program. (Note: You shouldn’t be expecting that the two values will differ by precisely 1, however, since another thread might have snuck in and run between them.)
There’s a problem. While the actual loads and stores execute atomically by themselves, the three operation sequence of load, increment, and store is nonatomic, as we’ve already established. Imagine three threads, t1, t2, and t3, are running the compiled program instructions simultaneously.
    t1                    t2                    t3
    t1(0): MOV EAX,[a]    t2(0): MOV EAX,[a]    t3(0): MOV EAX,[a]
    t1(1): INC EAX        t2(1): INC EAX        t3(1): INC EAX
    t1(2): MOV [a],EAX    t2(2): MOV [a],EAX    t3(2): MOV [a],EAX
Each thread is running on a separate processor. Of course, this means that each processor has its own private EAX register, but all threads see the same value in a and therefore access the same shared memory. This is where time becomes a very useful tool for explaining the behavior of our concurrent programs. Each of these steps won’t really happen “simultaneously.” Although separate processors can certainly execute instructions simultaneously, there is only one central, shared memory system with a cache coherency system that ensures a globally consistent view of memory. We can therefore describe the execution history of our program in terms of a simple, sequential time scale.
In the following time scale, the y-axis (labeled T) represents time, and the x-axis holds one column per thread. Each entry shows a label of the form thread (sequence#) and the instruction itself, followed by a value in the form #n, where n is the value held in the instruction’s target (register or memory location) after the instruction has executed.
    T    t1                       t2                       t3
    0    t1(0): MOV EAX,[a] #0
    1    t1(1): INC EAX     #1
    2    t1(2): MOV [a],EAX #1
    3                             t2(0): MOV EAX,[a] #1
    4                             t2(1): INC EAX     #2
    5                             t2(2): MOV [a],EAX #2
    6                                                      t3(0): MOV EAX,[a] #2
    7                                                      t3(1): INC EAX     #3
    8                                                      t3(2): MOV [a],EAX #3
If a is an integer that begins with a value of 0 at time step 0, then after three (*a)++ operations have executed, we expect the value to be 0 + 3 = 3. Indeed, we see that this is true for this particular history: t1 runs to completion, leaving value 1 in *a, and then t2, leaving value 2, and finally, after executing the instruction at time 8 in our timeline, t3 has finished and *a contains the expected value 3.
We can compress program histories into more concise representations so that they fit on one line instead of needing a table like this. Because only one instruction executes at any time step, this is simple to accomplish. We’ll write each event in sequence, each with a thread (sequence#) label, using the notation a -> b to denote that event a happens before b. A sequence of operations is written from left to right, with the time advancing as we move from one operation to the next. Using this scheme, the above history could be written instead as follows.
t1(0)->t1(1)->t1(2)->t2(0)->t2(1)->t2(2)->t3(0)->t3(1)->t3(2)
We’ll use one form or the other depending on the level of scrutiny in which we’re interested for that particular example. The longhand form is often clearer to illustrate specific values and is better at visualizing subtle timing issues, particularly for larger numbers of threads.
No matter the notation, examining timing like this is a great way of reasoning about the execution of concurrent programs. Programmers are accustomed to thinking about programs as a sequence of individual steps. As you develop your own algorithms, writing out the concurrent threads and exploring various legal interleavings (and what they mean for the state of your program) is imperative to understanding the behavior of your concurrent programs. When you think you might have a problematic timing issue, going to the whiteboard and trying to devise some problematic history, perhaps in front of a colleague, is often an effective way to uncover concurrency hazards (or determine their absence).
Simple, noninterleaved histories pose no problems for our example. The following histories are also safe with our algorithm as written.
    t1(0)->t1(1)->t1(2)->t3(0)->t3(1)->t3(2)->t2(0)->t2(1)->t2(2)
    t2(0)->t2(1)->t2(2)->t1(0)->t1(1)->t1(2)->t3(0)->t3(1)->t3(2)
    t2(0)->t2(1)->t2(2)->t3(0)->t3(1)->t3(2)->t1(0)->t1(1)->t1(2)
    t3(0)->t3(1)->t3(2)->t1(0)->t1(1)->t1(2)->t2(0)->t2(1)->t2(2)
    t3(0)->t3(1)->t3(2)->t2(0)->t2(1)->t2(2)->t1(0)->t1(1)->t1(2)
These histories yield correct results because none results in one thread’s statements interleaving amongst another’s. In each scenario, the first thread runs to completion, then another, and then the last one. In these histories, the threads are serialized with respect to one another (or the history is serializable).
But this example is working properly by virtue of sheer luck. There is nothing to prevent the other interleaved histories from occurring at runtime, where two (or more) threads overlap in time, leading to an interleaved timing and resulting race conditions. Omitting t3 from the example for a moment, consider this simple timing, written out longhand so we can emphasize the state transitions from one time step to the next.
    T    t1                       t2
    0    t1(0): MOV EAX,[a] #0
    1                             t2(0): MOV EAX,[a] #0
    2                             t2(1): INC EAX     #1
    3                             t2(2): MOV [a],EAX #1
    4    t1(1): INC EAX     #1
    5    t1(2): MOV [a],EAX #1
The value of *a starts at 0. Because two increments happen, we would expect the resulting value to be 0 + 2 = 2. However, *a ends up at 1. This clearly violates the first correctness condition of our algorithm as stated initially: for each thread that invokes the increment operator, the global counter increments by exactly 1.
This is a classic race condition, or more precisely, a data race, because, in this case, our problems are caused by a lack of data synchronization. It is called a “race” because the correctness of our code depends squarely on the outcome of multiple threads racing with one another. It’s as if each is trying to get to the finish line first, and, depending on which gets there first, the program will yield different results, sometimes correct and sometimes not. Races are just one of many issues that can arise when shared state is involved and can be a serious threat to program correctness. A thorough exploration of concurrency hazards, including races, is presented in Chapter 11, Concurrency Hazards.
Why did this race manifest? It happened because t1 and t2 each made a copy of the shared memory value in their own processor local register, one after the other, both observing the same value of 0, and then incremented their own private copies. Then both copied their new values back into the shared memory without any validation or synchronization that would prevent one from overwriting the other’s value. Both threads calculate the value 1 in their private registers, without knowledge of each other, and, in this particular case, t1 just overwrites t2’s earlier write of 1 to the shared location with the same value.
One might question how likely this is to occur. (Note that the likelihood matters very little. The mere fact that it can occur means that it is a very serious bug. Depending on the statistical improbability of such things is seriously discouraged. A program is not correct unless all possible sources of data races have been eliminated.) This interleaved history can happen quite easily, for obvious reasons, if t1 and t2 were running on separate processors. How often it occurs depends on how frequently the routine in question is called, among other things. This problem can also arise on a single processor machine, if a context switch occurred—because t1’s quantum had expired, because t2 was running at a higher priority, and so forth—right after t1 had moved the contents of a into its EAX register or after it had incremented its private value. The probability of this happening is higher on a machine with multiple processors, but just having multiple threads running on a single processor machine is enough. The only way this can be ruled out entirely is if code accessing the same shared state is never called from multiple threads simultaneously.
Other execution histories exhibit the same problem.
    t1(0)->t2(0)->t1(1)->t1(2)->t2(1)->t2(2)
    t1(0)->t1(1)->t2(0)->t1(2)->t2(1)->t2(2)
    t2(0)->t1(0)->t1(1)->t1(2)->t2(1)->t2(2)
    ...and so on
If we add the t3 thread back into the picture, we can violate the second correctness condition of our simple increment statement, in addition to the first, all at once.
    T    t1                       t2                       t3
    0                                                      t3(0): MOV EAX,[a] #0
    1    t1(0): MOV EAX,[a] #0
    2    t1(1): INC EAX     #1
    3    t1(2): MOV [a],EAX #1
    4                             t2(0): MOV EAX,[a] #1
    5                             t2(1): INC EAX     #2
    6                             t2(2): MOV [a],EAX #2
    7                                                      t3(1): INC EAX     #1
    8                                                      t3(2): MOV [a],EAX #1
In this program history, the global counter is updated to 1 by t1, and then to 2 by t2. Everything looks fine from the perspective of other threads in the system at this point in time. But as soon as t3 resumes, it wipes out t1’s and t2’s updates, “losing” two values from the counter and going backward to a value of 1. This is because t3 made its private copy of the shared value of *a before t1 and t2 even ran. The second correctness condition was that the value only ever increases; but if t2 runs again, it will see a value smaller than the one it previously published. This is clearly a problem that is apt to break whatever algorithm is involved. As we add more and more threads that are frequently running close together in time, we increase the probability of such problematic timings accordingly.
All of these histories demonstrate different kinds of hazards.
- Read/write hazard. A thread, t1, reads from a location; t2 then writes to that location; and t1 subsequently makes a decision based on its (now invalid) read of it. This also can be referred to as a stale read.
- Write/write hazard. A thread, t1, writes to the same location as t2 in a concurrency unsafe way, leading to lost updates, as in the example given above.
- Write/read hazard. A thread, t1, writes to a location and then t2 reads from it before it is safe to do so. In some cases, t1 may decide to undo its partial update to state due to a subsequent failure, leading t2 to make decisions on an invalid snapshot of state that should never have been witnessed. This also can be referred to as a dirty read.
- Read/read hazard. There is no problem with multiple concurrent threads reading the same shared data simultaneously. This property can be exploited to build a critical region variant called a reader/writer lock to provide better performance for read/read conflicts; this idea is explored more in Chapter 6, Data and Control Synchronization.
(This last point is a simplification. Normally read/read conflicts are safe in the case of simple shared memory, but there are some cases in which they are not: when a read has a side effect, like reading a stack’s guard page, or when reading some data associated with a physical device, it may be necessary to ensure no two threads try to do it concurrently.)
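As a small preview of that idea, here is a hedged C# sketch using the .NET Framework’s ReaderWriterLockSlim; the cached table and its contents are illustrative only, and the construct itself is covered properly in Chapter 6.

    using System.Collections.Generic;
    using System.Threading;

    public class SharedTable
    {
        private readonly ReaderWriterLockSlim m_lock = new ReaderWriterLockSlim();
        private readonly Dictionary<string, int> m_table = new Dictionary<string, int>();

        public bool TryGet(string key, out int value)
        {
            m_lock.EnterReadLock();        // any number of readers may hold this at once
            try
            {
                return m_table.TryGetValue(key, out value);
            }
            finally
            {
                m_lock.ExitReadLock();
            }
        }

        public void Set(string key, int value)
        {
            m_lock.EnterWriteLock();       // writers exclude both readers and other writers
            try
            {
                m_table[key] = value;
            }
            finally
            {
                m_lock.ExitWriteLock();
            }
        }
    }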
Very little of this discussion is specific to the ++ operator itself. It just turns out to be a convenient example because it intrinsically exhibits all of the problematic conditions that lead to these timing issues.
- Multiple threads make private copies of data from a shared location.
- Threads publish results back to shared memory, overwriting existing values.
- Compound updates may be made with the intent of establishing or preserving invariants between multiple independent shared locations.
- Threads run concurrently such that their timing overlaps and operations interleave.
There is no greater skill that differentiates great concurrent programmers from the rest than the ability to innately predict and consider various timings to some reasonable depth of complexity. With experience comes the ability to see several steps ahead and proactively identify the timings that can lead to race conditions and other hazards. This is especially important when writing sophisticated lock free algorithms, which eschew isolation, immutability, and synchronization in favor of strict discipline and reliance on hardware guarantees, which we’ll review in Chapter 10, Memory Models and Lock Freedom.
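Before moving on, here is a hedged preview of how the (*a)++ race can be repaired in C# with the data synchronization mechanisms described in later chapters; the SafeCounter class is illustrative, and the two methods show alternative approaches rather than something to mix casually.

    using System.Threading;

    public class SafeCounter
    {
        private readonly object m_sync = new object();
        private int m_count;

        // Option 1: a critical region. Only one thread at a time may execute the block,
        // so the load/increment/store sequence cannot interleave with another thread's.
        public void IncrementWithLock()
        {
            lock (m_sync)
            {
                m_count++;
            }
        }

        // Option 2: a hardware supported atomic read-modify-write of a single location.
        public void IncrementWithInterlocked()
        {
            Interlocked.Increment(ref m_count);
        }
    }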
On Atomicity, Serializability, and Linearizability
A fundamental problem is that many program operations are not truly atomic because an operation consists of multiple logical steps, a certain logical step is comprised of many physical steps, or both. Atomicity, quite simply, is the property that a single operation or set of operations appear as if they happened at once. Any state modifications and side effects performed are completely instantaneous, and no other thread in the system can observe intermediary (and invalid) states that occur in the midst of such an atomic operation. Similarly, the atomic operation must not be permitted to fail part way through the update, or if it does so, there must be a corresponding roll back of state updates to the previous state.
By this definition, atomicity would seldom be practical to achieve, at least physically. Although processors guarantee single writes to aligned words of memory are truly atomic, higher level logical operations—like the execution of a single method call on an object, consisting of several statements, function calls, and reads and writes—are not so simple. In fact, sometimes the operations we’d like to make atomic can even span physical machines, perhaps interacting with a Web service or database, at which point the difficulty of ensuring atomicity is greater. System wide control mechanisms must be used to achieve atomicity except for very simple read and write operations. As already noted, critical regions can simulate atomicity for in-memory updates. Transactions, of the ilk found in databases, COM+, and the System.Transactions namespace in .NET, are also attractive solutions when multiple or persistent durable resources are involved.
When two operations are atomic, they do not appear to overlap in time. If we were to plot several atomic operations on a timeline, then we could place one before or after the other without worrying about having to interleave them. We did this earlier for individual reads and writes, and it was possible because of the guarantees made by the hardware that they are atomic. Object oriented programs are typically built from higher level atomic methods, however, and reasoning about concurrency at this level (like “puts an element in the queue,” “writes data to disk,” and so forth), and not about the individual memory reads and writes involved, is often more useful.
Serializability is when two operations happen one after the other; if a happens before b, then a serializes before b. Building your program out of atomic operations achieves serializability. It’s as though your program were executed sequentially, by a single processor, executing each atomic operation in the order it appears in the resulting serializable schedule. But serializability on its own is insufficient for correctness, and it is often in the eye of the beholder: remember that even individual reads and writes are themselves atomic. For a concurrent program to be correct, all possible serialization orders must be legal. Techniques like critical regions can be used to constrain which serialization orders are possible.
Linearizability is a property related to serializability and also is used to describe correctness of atomic operations (see Further Reading, Herlihy, Wing): a linearization point is a place when a batch of atomic updates becomes visible to other threads. This commonly corresponds to exiting a critical region where the updates made within suddenly become visible. It is typically easier to reason about linearization points instead of the more abstract serialization property.
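A brief, hypothetical C# illustration of a linearization point: both updates inside the critical region below become visible to other lock users together, and the release of the lock is the point at which the whole transfer appears to have happened.

    public class Account
    {
        private readonly object m_sync = new object();
        private decimal m_checking;
        private decimal m_savings;

        public void Transfer(decimal amount)
        {
            lock (m_sync)
            {
                // The intermediate state (deducted but not yet deposited) is never
                // visible to other threads that also acquire m_sync.
                m_checking -= amount;
                m_savings += amount;
            }   // linearization point: both updates become visible as one
        }

        public decimal TotalBalance()
        {
            lock (m_sync)
            {
                return m_checking + m_savings;   // always observes a consistent pair
            }
        }
    }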
Atomic operations also must be reorderable, such that having one start completely before the other still leads to a correct program schedule. That’s not to say that subsequently initiated operations will not change behavior based on the changed order of commutative operations, due to causality, but this reordering should not fundamentally alter the correctness of a program.
As software developers, we like to think of serializable schedules and atomic operations. But we’d also like to use concurrency for the reasons identified earlier in this book, for performance, responsiveness, and so on. For this reason, the Win32 and .NET Framework platforms give you a set of tools to achieve atomicity via data synchronization constructs, as implied earlier. Those familiar with relational databases will recognize a similarity: databases employ transactions to achieve serializable operations, giving the programmer an interface with atomicity, consistency, isolation, and durability (a.k.a. ACID). You will notice many similarities, but you will also notice that these properties must be achieved “by hand” in general purpose concurrent programming environments.
Isolation
An obvious approach to eliminating problematic shared state interactions is to avoid sharing state in the first place. We described how concurrent systems are typically formed out of higher level components that eschew sharing in favor of isolation, and that lower level components typically do share data for purposes of fine-grained, performance sensitive operations. This is a middle ground, but the two extremes are certainly possible: on one hand, all components in the system may share state, while, on the other hand, no components share state and instead communicate only via loosely coupled messages. And there are certainly situations in which the architecture is less clearly defined: i.e., some lower level components will use isolation, while some higher level components will share state for efficiency reasons.
When it comes to employing isolation, there are three basic techniques from which to choose.
- Process isolation. Each Windows process has a separate memory address space, ensuring that one process cannot read or write memory used by another. Hardware protection is used to absolutely guarantee that there is no chance of accidental sharing by bleeding memory references. Note that processes do share some things, like machine-wide kernel objects, the file system, memory mapped files, and so on, so even rigid process isolation can be broken. An even more extreme technique is isolating components on separate machines or inside virtualized partitions on a single machine.
- Intraprocess isolation. If you are using managed code, CLR Application Domains (AppDomains) can be used to isolate objects so that code running in one AppDomain cannot read or write an object running in another AppDomain. While hardware protection is not used to enforce this isolation, the verifiable type safety employed by the CLR ensures that no sharing will occur. There are some specific ways to circumvent this broadly stated policy, but they are generally opt-in and rare.
- By convention. When some code allocates a piece of memory or an object, either dynamically from the heap or on the stack, this data begins life as unshared and, hence, is in effect isolated. This data remains isolated so long as care is taken not to share it (as described previously), that is, by not storing a reference to it in a shared location (like a static variable or an object reachable through a static variable). This is the trickiest of the three approaches to implement safely because it is based entirely on programming convention and care: it is not checkable in any way and has no supporting enforcement such as hardware isolation or type system verification.
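Here is a hedged C# sketch of the by-convention approach: each worker thread allocates and fills in a data structure that only it references, and the results are read by the main thread only after the workers have finished. The names and the work performed are illustrative only.

    using System;
    using System.Threading;

    class IsolatedWorkers
    {
        static void Main()
        {
            const int workerCount = 4;
            int[][] results = new int[workerCount][];     // one slot per worker
            Thread[] workers = new Thread[workerCount];

            for (int i = 0; i < workerCount; i++)
            {
                int index = i;   // capture a private copy of the loop variable
                workers[i] = new Thread(() =>
                {
                    // This array is created and used by exactly one thread, so it is
                    // isolated by convention: no other thread is handed a reference
                    // to it while it is being filled in.
                    int[] local = new int[1000];
                    for (int j = 0; j < local.Length; j++)
                        local[j] = index * local.Length + j;

                    results[index] = local;   // each worker writes only its own slot
                });
                workers[i].Start();
            }

            foreach (Thread t in workers)
                t.Join();   // after Join, the main thread may safely read every result

            long sum = 0;
            foreach (int[] local in results)
                foreach (int value in local)
                    sum += value;
            Console.WriteLine(sum);
        }
    }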
It’s common to use isolated state as a form of cache. In other words, though some state is physically isolated, it is merely a copy of some master copy that is not isolated. Such designs require that the master copy is periodically refreshed (if updates are made to the cache) and that caches are refreshed as the master copy changes. Depending on the requirements, a more sophisticated cache coherency mechanism may be needed, to guarantee that refreshes happen in a safe and serializable way, requiring a combination of isolation and synchronization techniques.
The last mechanism, enforcement by convention, requires that programs follow some strict disciplines, each of which is cause for concern because they are informal and inherently brittle. It can be useful to think of state in terms of being “owned” by particular “agents” at any point in time. Thinking this way allows you to very clearly articulate where ownership changes for a particular piece of data, including at what point data transitions between isolated and shared.
Data Ownership
At any point in time, a particular piece of isolated data can be said to be owned by one agent in the system. Ownership is used in this context to mean that the agent understands what other components or agents may concurrently access that piece of data, and what this means for the read and write safety of its own operations. If, at any time, multiple agents believe they own the same piece of data, it is likely that the data is no longer truly isolated. Clearly there are many kinds of ownership patterns a system might employ, including shared ownership, but let’s stick to the idea of single agent ownership for a moment.
An agent may transfer ownership, but it must do so with care. For example, some agent may allocate and initialize some interesting object, but then insert it into a global shared list. This is called publication. Publication transfers ownership from the initializing agent to the global namespace; at some point in the future, an agent may remove the data from the shared list, at which point the ownership transfers from the global namespace to that agent. This is called privatization. Publication must be done such that the agent doing the transferring no longer tries to access the state as though it is the sole owner: the data is no longer confined (or isolated) within the agent. Similarly, privatization must be done such that other agents do not subsequently try to access the privatized data.
One of the more difficult aspects of ownership is that a piece of data may move between isolation and shared status over the course of its life. These publication and privatization points must be managed with care. A slight misstep, such as erroneously believing an object is private and no longer shared when in reality other threads still have outstanding references to it that they might use, can introduce all of the same kinds of race condition problems noted earlier.
Another challenge with isolation is determining where the points of escape in the program might be. Publication is not always such a clear-cut point in the program’s execution. This requires that agents attempting to control ownership of data only ever share references to this data with trusted agents. The agent is trusting that the other agents will not publish the reference so that the data becomes shared, either directly or indirectly (e.g., by passing the reference along to another untrusted agent). Similarly, an agent that receives a reference to data from an outside source must assume the worst—that the data is shared—unless an alternative arrangement is known, such as only ever being called by an agent that guarantees the data is isolated. Again, the lack of type system and verification support makes this convention notoriously tricky to implement and manage in real programs, particularly when multiple developers are involved.
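The following hedged C# sketch shows the two transfer points just described in the shape of a simple producer/consumer hand-off; the WorkItem type and the queue are illustrative only.

    using System.Collections.Generic;

    public class WorkItem
    {
        public string Data;   // mutated only while the item is privately owned
    }

    public static class WorkQueue
    {
        private static readonly object s_sync = new object();
        private static readonly Queue<WorkItem> s_items = new Queue<WorkItem>();

        // Publication: after Publish returns, the producer must no longer touch 'item'
        // as though it were the sole owner; ownership has transferred to the shared queue.
        public static void Publish(WorkItem item)
        {
            lock (s_sync)
            {
                s_items.Enqueue(item);
            }
        }

        // Privatization: the dequeued item is owned solely by the calling consumer,
        // which may now read and write it without further synchronization.
        public static WorkItem TryPrivatize()
        {
            lock (s_sync)
            {
                return s_items.Count > 0 ? s_items.Dequeue() : null;
            }
        }
    }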
Immutability
As noted earlier, read/read “hazards” are not really hazardous at all. Many threads can safely read from some shared memory location concurrently without cause for concern. Therefore, if some piece of shared state is guaranteed to be immutable—that is, read-only—then accessing it from many threads inside a concurrent system will be safe.
Proving that a piece of complex data is immutable is not terribly difficult with some discipline. Both C++ and .NET offer constructs to help make immutable types. If each of an object’s fields never changes during its lifetime, it is shallow immutable. If the object’s fields also only refer to objects whose state does not change over time, the object is deeply immutable. An entire object graph can be transitively immutable if all objects within it are themselves deeply immutable.
In the case that data transitions between private and shared throughout its lifetime, as discussed above in the context of isolation, it is sometimes useful to have a conditionally immutable type, which remains immutable so long as it is shared but can be mutated while it is private. So, for example, a thread may remove a piece of shared state from public view, making it temporarily private, mutate it, and then later return the state to public view.
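One common (one-way) realization of this idea is a “freezable” object, sketched here in hedged form: the object may be mutated freely while it is private and is frozen just before being shared. The names are illustrative, though WPF’s Freezable type follows a similar pattern.

    using System;

    public class FreezablePoint
    {
        private bool m_isFrozen;
        private double m_x, m_y;

        public double X
        {
            get { return m_x; }
            set { ThrowIfFrozen(); m_x = value; }
        }

        public double Y
        {
            get { return m_y; }
            set { ThrowIfFrozen(); m_y = value; }
        }

        // Called by the owning thread just before the object is published for sharing.
        public void Freeze()
        {
            m_isFrozen = true;
        }

        private void ThrowIfFrozen()
        {
            if (m_isFrozen)
                throw new InvalidOperationException("Object is frozen and may not be mutated.");
        }
    }

Safely publishing the frozen object still involves the memory model considerations discussed in Chapter 10.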
Single Assignment
A popular technique for enforcing the immutability of data is to use single assignment variables. Many programming systems offer static verification that certain data is indeed assigned a value only once, leading to the term static single assignment, or SSA.
The CLR offers limited support for single assignment variables in its common type system through the initonly field modifier, surfaced in C# through the readonly keyword. And C++ offers the const modifier to achieve a similar effect, though it is far more powerful: pointers may be marked as being const, ensuring (statically) that the instance referred to is not modified through such a pointer (though, unlike readonly, C++ programmers can explicitly “cast away the const-ness” of a reference, bypassing the safety guarantees). Using these constructs can be tremendously useful because it avoids having to depend on brittle and subtle programming convention and rules. Let’s look at each briefly.
CLR initonly Fields (a.k.a. C# readonly Fields)
When you mark a field as readonly in C#, the compiler emits a field with the initonly modifier in the resulting IL. The only writes to such variables that will pass the type system’s verification process are those that occur inside that type’s constructor or field initializers. This ensures that the value of such a field cannot change after the object has been constructed. While it is not a true single assignment variable, as it can be written multiple times during initialization, it is similar in spirit.
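A small, illustrative C# example of the rule just described (the Circle class is hypothetical):

    public class Circle
    {
        private readonly double m_radius;    // compiled as an initonly field

        public Circle(double radius)
        {
            m_radius = radius;       // OK: assignment inside a constructor
        }

        public void Grow(double delta)
        {
            // m_radius += delta;    // would not compile: a readonly field may only be
            //                       // assigned in a constructor or field initializer
        }

        public double Radius
        {
            get { return m_radius; }
        }
    }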
Another subtle issue can arise if a reference to an object with readonly fields escapes from its constructor. Fields are not guaranteed to hold their permanent, immutable values until after the constructor has finished running, and they may be assigned multiple values during the construction process. If an object’s constructor shares itself before finishing initialization, other concurrent threads in the system cannot safely depend on the readonly nature of the fields. Letting the object’s this reference escape before the object is fully constructed is a bad practice anyway and is easily avoided.
Also note that when a field is marked readonly, it simply means the field itself cannot be reassigned; if the field is a reference, the object it refers to can still be mutated. In other words, a type with only readonly fields is shallow immutable but not necessarily deeply immutable. If an object depends on the state of the objects it references, then those objects should be immutable also. Unfortunately, the CLR offers no type system support for building deeply immutable types. Of course, they may be immutable by convention, by readonly fields, or by a combination of both.
There are some cases where the mutability of referenced objects does not matter. For example, if we had an immutable pair class that refers to two mutable objects but never accesses the state of those objects, then is the pair itself immutable?
    class ImmutablePair<T, U>
    {
        private readonly T m_first;
        private readonly U m_second;

        public ImmutablePair(T first, U second)
        {
            m_first = first;
            m_second = second;
        }

        public T First { get { return m_first; } }
        public U Second { get { return m_second; } }
    }
From one perspective, the answer is yes. The ImmutablePair<T, U> implementation itself cannot tell whether the m_first or m_second objects have been mutated, since it never accesses their internal state. If it relied on a stable ToString value, then it might matter. Those who instantiate ImmutablePair<T, U> may or may not care about deep immutability, depending on whether they access the pair’s fields; they control this by the arguments they supply for T and U. So it seems shallow immutability here is sufficient. That said, if a developer desires deep immutability, they need only supply immutable types for T and U.
C++ Const
C++ const is a very powerful and feature-rich programming language construct, extending well beyond simple single assignment variable capabilities and encompassing variables, pointers, and class members. A complete overview of the feature is outside the scope of this book. Please refer instead to a book such as The C++ Programming Language, Third Edition (see Further Reading, Stroustrup), for a detailed overview.
Briefly, the const modifier can be a useful and effective way of relying on the C++ compiler to guarantee a certain level of immutability in your data structures, including single assignment variables. In summary:
- Class fields may be marked const, which enforces that their value is assigned at initialization time in the constructor’s field initialization list and may not subsequently change. This effectively turns a field into a single assignment variable, though it may still be modified by a pointer that has been cast a certain way (as we’ll see soon).
- The value of static const fields cannot depend on runtime evaluation, unlike class member fields that can involve arbitrary runtime computation to generate a value, much like CLR initonly fields. This means they are limited to compiler constants, statically known addresses, and inline allocated arrays of such things.
- Member functions may be marked const, which means that the function body must not modify any fields and ensures that other non-const member functions cannot be invoked (since they may modify fields).
- Pointers can be marked as “pointing to a constant,” via the prefix const modifier. For instance, the following declaration states that d points to a constant object of type C: const C * d. When a pointer refers to a constant, only const member functions may be called on it, and the pointer may not be passed where an ordinary non-const pointer is expected. A pointer to a const integral type cannot be written through. A non-const pointer can be supplied where a const one is expected. Constant references are also possible.
As noted earlier, a pointer to a constant can be cast to a non-const pointer, which violates most of what was mentioned above. For example, the C++ compiler enforces that a pointer to a const member field also must be a pointer to const; but you can cast this to a non-const pointer and completely subvert the const guarantees protecting the field. For example, given the following class declaration, pointers may be manufactured and used in certain ways.
    class C
    {
    public:
        const int d;
        C(int x) : d(x) {}
    };

    // ... elsewhere ...
    C * pC = ...;
    const int * pCd1 = &pC->d;                // ok!
    pC->d = 42;                               // compiler error: cannot write to const
    int * pCd2 = &pC->d;                      // compiler error: non-const pointer to const field
    int * pCd3 = const_cast<int *>(&pC->d);   // succeeds!
Casting away const is a generally frowned upon practice, but is sometimes necessary. And, a const member function can actually modify state, but only if those fields have been marked with the mutable modifier. Using this modifier is favored over casting. Despite these limitations, liberal and structured use of const can help build up a stronger and more formally checked notion of immutability in your programs. Some of the best code bases I have ever worked on have used const pervasively, and in each case, I have found it to help tremendously with the maintainability of the system, even with concurrency set aside.
Dynamic Single Assignment Verification
In most systems, single assignment is enforced statically, as with the C# and C++ constructs just described. It’s possible to enforce single assignment dynamically too. You would just have to reject all attempts to set the variable after the first (perhaps via an exception) and handle the case where threads attempt to use a variable that has not yet been initialized. Implementing this does require some understanding of the synchronization topics about to be discussed, particularly if you wish the end result to be efficient; some sample implementation approaches can be found in research papers (see Further Reading, Drejhammar, Schulte).
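For illustration, here is a hedged sketch of a dynamically checked single assignment variable in C#; the WriteOnce<T> type is hypothetical, and this lock based version favors simplicity over the efficiency concerns mentioned above.

    using System;

    public sealed class WriteOnce<T>
    {
        private readonly object m_sync = new object();
        private T m_value;
        private bool m_isSet;

        public void Set(T value)
        {
            lock (m_sync)
            {
                if (m_isSet)
                    throw new InvalidOperationException("Value may only be assigned once.");
                m_value = value;
                m_isSet = true;
            }
        }

        public T Get()
        {
            lock (m_sync)
            {
                if (!m_isSet)
                    throw new InvalidOperationException("Value has not been assigned yet.");
                return m_value;
            }
        }
    }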