- 13.1 Concurrentgate
- 13.2 A Brief History of Data Sharing
- 13.3 Look, Ma, No (Default) Sharing
- 13.4 Starting a Thread
- 13.5 Exchanging Messages between Threads
- 13.6 Pattern Matching with receive
- 13.7 File Copying with a Twist
- 13.8 Thread Termination
- 13.9 Out-of-Band Communication
- 13.10 Mailbox Crowding
- 13.11 The shared Type Qualifier
- 13.12 Operations with shared Data and Their Effects
- 13.13 Lock-Based Synchronization with synchronized classes
- 13.14 Field Typing in synchronized classes
- 13.15 Deadlocks and the synchronized Statement
- 13.16 Lock-Free Coding with shared classes
- 13.17 Summary
13.16 Lock-Free Coding with shared classes
The theory of lock-based synchronization was established in the 1960s. As early as 1972 [23], researchers started making inroads toward avoiding the slow, ham-fisted mutexes as much as possible in multithreaded programs. For example, some types were assignable atomically, so people reckoned there was no need to guard such assignments with mutex acquisition. Also, some processors offered more advanced lightweight interlocked instructions such as atomic increment or test-and-set. About two decades later, in 1990, there was a definite beam of hope that some clever combination of atomic read-write registers could help avoid the tyranny of locks. At that point, a seminal piece of work had the last word in one line of research and the first word in another.
Herlihy's 1991 paper "Wait-free synchronization" [31] marked a watershed in concurrent programming. Prior to it, hardware and software developers alike were unclear on which synchronization primitives would be best to work with. For example, a processor with atomic reads and writes for ints could intuitively be considered less powerful than one that also offers an atomic +=. It may appear that one offering an atomic *= is better still; generally, the more atomic primitives one has at one's disposal, the merrier.
Herlihy blew that theory out of the water; in particular, he showed that certain seemingly powerful synchronization primitives, such as test-and-set, fetch-and-add, and even one global shared FIFO queue, are virtually useless. These impossibility results were proved clearly enough to instantly disabuse anyone of the illusion that such mechanisms could provide the magic concurrency potion. Fortunately, Herlihy also proved universality results: certain synchronization primitives can, in theory, synchronize an infinite number of concurrent threads. Remarkably, the "good" primitives are no more difficult to implement than the "bad" ones, and they don't look particularly powerful to the naked eye. Of the useful synchronization primitives, one known as compare-and-swap has caught on and is implemented today by virtually all processors. Compare-and-swap has the following semantics:
    // This function executes atomically
    bool cas(T)(shared(T)* here, shared(T) ifThis, shared(T) writeThis)
    {
        if (*here == ifThis)
        {
            *here = writeThis;
            return true;
        }
        return false;
    }
In plain language, cas atomically compares a memory location with a given value, and if the location is equal to that value, it stores a new value; otherwise, it does nothing. The result of the operation tells whether the store took place. The entire cas operation is atomic and must be provided as a primitive. The set of possible Ts is limited to integers of the native word size of the host machine (i.e., 32 or 64 bits). An increasing number of machines offer double-word compare-and-swap, sometimes dubbed cas2. That operation atomically manipulates 64-bit data on a 32-bit machine and 128-bit data on a 64-bit machine. In view of the increasing support for cas2 on contemporary machines, D offers double-word compare-and-swap under the same name (cas) as an overloaded intrinsic function. So in D you can cas values of types int, long, float, double, all arrays, all pointers, and all class references.
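To make the loop-around-cas idiom concrete before diving into full data structures, here is a minimal sketch of how cas can synthesize an atomic operation the hardware does not offer directly, such as the atomic *= mentioned above. The helper atomicScale is hypothetical; the sketch assumes D's core.atomic module, which provides cas and atomicLoad:

    import core.atomic;

    // Hypothetical helper: atomically multiplies a shared integer,
    // composing the missing primitive out of cas.
    void atomicScale(shared(int)* value, int factor)
    {
        int oldVal;
        do
        {
            // Take a snapshot of the current value
            oldVal = atomicLoad(*value);
            // If another thread changes *value after the snapshot,
            // cas fails and the loop retries with a fresh snapshot
        } while (!cas(value, oldVal, oldVal * factor));
    }

The pattern (snapshot, compute, attempt to publish, retry on interference) recurs throughout cas-based code, as the structures below illustrate.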
13.16.1 shared classes
Following Herlihy's universality proofs, many data structures and algorithms took off around the nascent "cas-based programming." Now, although a cas-based implementation is theoretically possible for any synchronization problem, nobody said it's easy. Defining cas-based data structures and algorithms, and particularly proving that they work correctly, is a difficult feat. Fortunately, once such an entity is defined and encapsulated, it can be reused to the benefit of many [57].
To tap into cas-based lock-free goodness, use the shared attribute with a class or struct definition:
    shared struct LockFreeStruct
    {
        ...
    }

    shared class LockFreeClass
    {
        ...
    }
The usual transitivity rules apply: shared propagates to the fields of the struct or class, and methods offer no special protection. All you can count on are atomic assignments, cas calls, the guarantee that the compiler and machine won't do any reordering of operations, and your unbridled confidence. But be warned—if coding were walking and message passing were jogging, lock-free programming would be no less than the Olympics.
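As a quick illustration of the transitivity rule (the names here are hypothetical), every field of a shared struct is itself shared, so ordinary read-modify-write operations on fields won't compile; only explicitly atomic operations apply:

    import core.atomic;

    // Sketch: the shared qualifier propagates to all fields
    shared struct Counter
    {
        int value;   // behaves as shared(int)
        int* data;   // behaves as shared(int*), pointing to shared(int)
    }

    void bump(ref shared Counter c)
    {
        // c.value += 1;             // disallowed: ordinary read-modify-write
        atomicOp!"+="(c.value, 1);   // fine: an explicitly atomic operation
    }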
13.16.2 A Couple of Lock-Free Structures
As a warmup exercise, let's implement a lock-free stack type. The basic idea is simple: the stack is maintained as a singly linked list, and insertions as well as removals proceed at the front of the list:
    shared struct Stack(T)
    {
        private shared struct Node
        {
            T _payload;
            Node* _next;
        }
        private Node* _root;

        void push(T value)
        {
            auto n = new Node(value);
            shared(Node)* oldRoot;
            do
            {
                oldRoot = _root;
                n._next = oldRoot;
            } while (!cas(&_root, oldRoot, n));
        }

        shared(T)* pop()
        {
            typeof(return) result;
            shared(Node)* oldRoot;
            do
            {
                oldRoot = _root;
                if (!oldRoot) return null;
                result = &oldRoot._payload;
            } while (!cas(&_root, oldRoot, oldRoot._next));
            return result;
        }
    }
Stack is a shared struct, and as a direct consequence pretty much everything inside of it is also shared. The internal type Node has the classic payload-and-pointer structure, and the Stack itself stores the root of the list.
The do/while loops in the two primitives may look a bit odd, but they are very common; slowly but surely, they dig a deep groove in the cortex of every cas-based programming expert-to-be. The way push works is to first create a new Node that will store the new value. Then, in a loop, _root is assigned the pointer to the new node, but only if in the meantime no other thread has changed it! It's quite possible that another thread has also performed a stack operation, so push needs to make sure that the root assumed in oldRoot has not changed while the new node was being primed.
The pop method does not return by value, but instead by pointer. This is because pop may find the stack empty, which is not an exceptional condition (as it would be in a single-threaded stack). For a shared stack, checking for an element, removing it, and returning it must be one organic operation. Aside from the return aspect, pop is similar in implementation to push: _root is replaced with care such that no other thread changes it while the payload is being fetched. At the end of the loop, the extracted value is off the stack and can be safely returned to its caller.
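To see both primitives in action, here is a minimal usage sketch; the producer function, the element count, and the drain loop are made up for illustration, and thread creation relies on std.concurrency's spawn:

    import std.concurrency;

    shared Stack!int stack;

    void producer()
    {
        // Any number of threads may push concurrently;
        // cas arbitrates the races at _root
        foreach (i; 0 .. 100)
        {
            stack.push(i);
        }
    }

    void main()
    {
        spawn(&producer);
        spawn(&producer);
        // Drain whatever has been pushed so far; a null result
        // means "empty right now," not an error
        while (stack.pop() !is null) {}
    }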
If Stack didn't seem that complicated, let's look at actually exposing a richer singly linked interface; after all, most of the infrastructure is built inside Stack already.
Unfortunately, for a list things are bound to become more difficult. How much more difficult? Brutally more difficult. One fundamental problem is insertion and deletion of nodes at arbitrary positions in the list. Say we have a list of int containing a node with payload 5 followed by a node with payload 10, and we want to remove the 5 node. No problem here—just do the cas magic to swing _root to point to the 10 node. The problem is, if at the same time another thread inserts a new node right after the 5 node, that node will be irretrievably lost: _root knows nothing about it.
Several solutions exist in the literature; none of them is trivially simple. The implementation described below, first proposed by Harris [30] in the suggestively entitled paper "A pragmatic implementation of non-blocking linked-lists," has a hackish flavor to it because it relies on setting the unused least significant bit of the _next pointer. The idea is first to mark that pointer as "logically deleted" by setting its bit to one, and then to excise the node entirely in a second step:
    shared struct SharedList(T)
    {
        shared struct Node
        {
            private T _payload;
            private Node* _next;

            @property shared(Node)* next()
            {
                return clearlsb(_next);
            }

            bool removeAfter()
            {
                shared(Node)* thisNext, afterNext;
                // Step 1: set the lsb of _next for the node to delete
                do
                {
                    thisNext = next;
                    if (!thisNext) return false;
                    afterNext = thisNext.next;
                } while (!cas(&thisNext._next, afterNext, setlsb(afterNext)));
                // Step 2: excise the node to delete
                if (!cas(&_next, thisNext, afterNext))
                {
                    afterNext = thisNext._next;
                    while (!haslsb(afterNext))
                    {
                        thisNext._next = thisNext._next.next;
                    }
                    _next = afterNext;
                }
                return true; // Success: the node was unlinked
            }

            void insertAfter(T value)
            {
                auto newNode = new Node(value);
                for (;;)
                {
                    // Attempt to find an insertion point
                    auto n = _next;
                    while (n && haslsb(n))
                    {
                        n = n._next;
                    }
                    // Found a possible insertion point, attempt insert
                    auto afterN = n._next;
                    newNode._next = afterN;
                    if (cas(&n._next, afterN, newNode))
                    {
                        break;
                    }
                }
            }
        }

        private Node* _root;

        void pushFront(T value)
        {
            ... // Same as for Stack.push
        }

        shared(T)* popFront()
        {
            ... // Same as for Stack.pop
        }
    }
The implementation is tricky but can be understood if you keep in mind a couple of invariants. First, it's OK for logically deleted nodes (i.e., Node objects with the field _next having its least significant bit set) to hang around for a little bit. Second, a node is never inserted after a logically deleted node. That way, the list stays coherent even though nodes may appear and disappear at any time.
The implementations of clearlsb, setlsb, and haslsb are as barbaric as it gets; for example:
    T* setlsb(T)(T* p)
    {
        return cast(T*) (cast(size_t) p | 1);
    }
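For completeness, the two companions might look like this, a sketch consistent with the calls in SharedList:

    T* clearlsb(T)(T* p)
    {
        return cast(T*) (cast(size_t) p & ~cast(size_t) 1);
    }

    bool haslsb(T)(T* p)
    {
        return (cast(size_t) p & 1) != 0;
    }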