Writing Concurrent Systems, Part 2: Lockless Data Structures
When you start making a sequential system support concurrency, the easiest approach is to protect individual parts of it with locks. This design works, but it doesn't always provide the best solution. Locking is relatively expensive and can limit scalability, as I described in part 1 of this series.
In this article, the second in the series, we're going to take a look at some lockless data structures. These structures allow concurrent access, without the need for explicit locking, and can be significantly faster.
The Building Blocks
Lockless algorithms depend on the existence of a small number of atomic operations and on memory ordering. The key idea behind any lockless operation is that several threads can attempt it at the same time, as long as every step happens in the right order.
This requirement makes x86 a good platform, for once. The x86 architecture is strongly ordered, meaning that memory accesses become visible in the order in which they appear in the instruction stream. Many other architectures are not so ordered: the CPU is free to reorder memory accesses, so you need to insert explicit memory barriers to prevent reordering where it matters.
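To make that ordering explicit, you issue a barrier between the operations whose order matters. The following is a minimal sketch, assuming the GCC builtins; __sync_synchronize() is GCC's full memory barrier, and the names data, ready, and publish are purely illustrative. Without the barrier, a weakly ordered CPU could make the flag visible before the value it announces.

    int data;
    int ready;

    void publish(int value)
    {
        data = value;           /* write the payload first                */
        __sync_synchronize();   /* full barrier: payload before the flag  */
        ready = 1;              /* announce that the payload is ready     */
    }

A reader on a weakly ordered machine would issue a matching barrier between checking ready and reading data.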
At a minimum, lockless algorithms require an atomic compare-and-exchange instruction. We used one form of the GCC intrinsic for this instruction in part 1 to implement a spinlock. The instruction is quite simple. You can think of this intrinsic as being implemented like this:
    type __sync_val_compare_and_swap (type *ptr, type oldval, type newval)
    {
        if (*ptr == oldval)
        {
            *ptr = newval;
            return oldval;
        }
        return *ptr;
    }
Unlike this function, the intrinsic operates atomically with respect to other memory accesses. In a single-processor x86 system, it's a single instruction, so it can't be preempted. In a multicore system, the instruction will lock the cache line containing *ptr, preventing it from being modified by other cores until the instruction completes.
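To see why that atomicity matters, consider this small sketch, in which several threads race to claim a resource. The names owner and try_claim are made up for the example; the point is that exactly one concurrent caller can observe the old value and win.

    static int owner = 0;   /* 0 means unclaimed */

    /* Returns nonzero if the calling thread claimed the resource. */
    int try_claim(int my_id)
    {
        /* Only one concurrent caller can see owner == 0; every other
           caller gets back the winner's id and fails the comparison. */
        return __sync_val_compare_and_swap(&owner, 0, my_id) == 0;
    }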
On RISC architectures, it's common to implement this intrinsic as a short sequence of instructions using a reservation mechanism. For example, PowerPC provides two instructions:
- lwarx loads a word and sets the reservation bit on a region of memory. The region depends on the implementation: possibly the word, probably the cache line, maybe the page.
- stwcx performs a store that succeeds only if the reservation bit for that region of memory is still set. Any other write to that area of memory, by any processor, clears the reservation bit.
You can build any single-word atomic operation from this pair of instructions simply by loading the value, doing something to it, conditionally storing it, and trying again if the conditional store failed.
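The same retry pattern, written in C against the GCC intrinsic rather than lwarx/stwcx, looks like the sketch below. The name atomic_add is hypothetical, but the shape is the one just described: load, modify, conditionally store, retry.

    /* Add amount to *ptr atomically, built from compare-and-swap alone. */
    int atomic_add(int *ptr, int amount)
    {
        int old, next;
        do {
            old = *ptr;            /* load the current value          */
            next = old + amount;   /* do something to it              */
            /* The swap succeeds only if *ptr still holds old;
               otherwise another thread got there first, so retry.    */
        } while (__sync_val_compare_and_swap(ptr, old, next) != old);
        return next;
    }

On architectures such as PowerPC, GCC expands the intrinsic itself into just this kind of load-reserved/store-conditional loop.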