Understanding C11 and C++11 Atomics
- New Atomic Types
- Memory Barriers
- C++11 and C11 Memory Orderings
- Acquire-and-Release Barriers
- Building Blocks
Both the C11 and C++11 specifications contain a new set of atomic types and operations. They're designed to have interoperable semantics, but they don't share quite the same syntax.
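To make the syntactic difference concrete, here's a minimal C11 sketch of declaring and using an atomic integer, with the C++11 equivalent noted in a comment:

```c
#include <stdatomic.h>
#include <stdio.h>

int main(void)
{
    /* C11 spells an atomic integer with the _Atomic qualifier,
     * or with the atomic_int typedef from <stdatomic.h>. */
    atomic_int counter = ATOMIC_VAR_INIT(0);

    atomic_fetch_add(&counter, 1);          /* atomic read-modify-write */
    printf("%d\n", atomic_load(&counter));  /* atomic read */

    /* The C++11 spelling of the same thing is:
     *     #include <atomic>
     *     std::atomic<int> counter(0);
     *     counter.fetch_add(1);
     * The semantics of the two are designed to interoperate. */
    return 0;
}
```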
Older versions of C and C++ had no support for atomic operations at all. If you're using a GCC-compatible compiler, such as ICC or Clang, you may have come across the `__sync_*` family of built-in functions, which provide some support. These functions were originally designed for Itanium, so they map quite closely to the atomic instructions that Itanium provides. Everyone else had to write inline (or out-of-line) assembly code for each architecture, or make (much more expensive) calls to library functions.
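For example, an atomic increment with those builtins looks like this (a minimal sketch; `__sync_fetch_and_add` is one of the real functions in that family, and works on GCC, Clang, and ICC):

```c
#include <stdio.h>

static int counter;  /* shared between threads in real code */

int main(void)
{
    /* Atomically performs counter += 1 and returns the old value. */
    int old = __sync_fetch_and_add(&counter, 1);
    printf("old: %d, new: %d\n", old, counter);
    return 0;
}
```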
Atomic operations typically perform a read-modify-write sequence on a memory address. For example, an atomic increment loads a value, increments it, and stores the result in such a way that no other thread can modify the value in the middle. If two threads each perform an atomic increment on the same value a thousand times, the result will be exactly 2,000 greater than the initial value, irrespective of how the threads interleave. With a non-atomic increment, both threads could read the value, increment their copies (in registers), and then store the results, so the observable result would be a single increment rather than two.
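The following C11 program sketches that scenario (assuming POSIX threads are available): two threads each atomically increment a shared counter a thousand times, so it always prints 2000.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int counter = ATOMIC_VAR_INIT(0);

static void *increment(void *unused)
{
    (void)unused;
    for (int i = 0; i < 1000; i++)
        atomic_fetch_add(&counter, 1);  /* no other thread can intervene */
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, increment, NULL);
    pthread_create(&t2, NULL, increment, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    /* Always 2000; with a plain ++ the result could be less,
     * because concurrent updates can be lost. */
    printf("%d\n", atomic_load(&counter));
    return 0;
}
```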
Why should you care about atomics? Because they're often very important for getting good performance out of multithreaded code. Consider the trivial example of a monotonically incrementing counter. When I wrote the Cocoa Programming Developer's Handbook, I included a little demonstration program using an unsafe counter (just incremented using ++) as a baseline, and then a version using an NSLock to protect the counter, a version using a POSIX mutex, and a version using atomic operations (as well as some other versions that aren't relevant here). The costs of these various versions, as multiples of the cost of a single-threaded version, were as shown in the following table:
| Version          | Cost Multiple |
|------------------|---------------|
| NSLock           | 30            |
| POSIX mutex      | 14            |
| Atomic intrinsic | 6             |
These results were from my old laptop with a Core 2 Duo. Running the test again with a Core i7, I get the results shown in the next table:
| Version          | Cost Multiple |
|------------------|---------------|
| NSLock           | 29            |
| POSIX mutex      | 15            |
| Atomic intrinsic | 3             |
As the two tables show, the lock-based versions cost about the same on both chips, but the relative cost of the atomic version has halved (from 6 to 3 times the single-threaded cost). Recently, people have started writing a lot of code that depends on atomic operations, so it has become worth optimizing them in the silicon. Anyone who cares about performance these days is likely to write multithreaded code, and any multithreaded code needs atomic operations to ensure correct memory accesses.
Even if you don't use atomic operations directly, they're used to implement higher-level synchronization primitives such as mutexes, and often you can use them to implement more interesting concurrent data structures such as lockless ring buffers.
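As one illustration of building a higher-level primitive, here is a minimal spinlock sketch based on C11's `atomic_flag`; the `spinlock_t`, `spin_lock`, and `spin_unlock` names are mine for illustration, not from any standard library:

```c
#include <stdatomic.h>

/* Illustrative names only; not part of any standard library. */
typedef struct { atomic_flag held; } spinlock_t;
#define SPINLOCK_INIT { ATOMIC_FLAG_INIT }

static void spin_lock(spinlock_t *l)
{
    /* atomic_flag_test_and_set atomically sets the flag and returns
     * its previous value; loop until we see it previously clear,
     * at which point this thread holds the lock. */
    while (atomic_flag_test_and_set(&l->held))
        ;  /* busy-wait */
}

static void spin_unlock(spinlock_t *l)
{
    atomic_flag_clear(&l->held);
}
```

These operations default to sequentially consistent ordering, which is stronger than a lock strictly needs; the sections on memory orderings and acquire-and-release barriers cover the weaker orderings that suffice here.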