Building Blocks
Memory orderings are the complicated part of the specification; the operations themselves are significantly simpler. The syntax varies slightly between C and C++: in C, the operations are defined in <stdatomic.h> as functions, whereas in C++ they're members of the std::atomic<> family of templates in <atomic>. I'll use the C versions for the rest of this article, but the C++ versions are very similar.
Most of the functions take a memory order as the final argument. For example, a simple atomic increment looks like this:
_Atomic(int) i; // Implicitly memory_order_seq_cst
atomic_fetch_add(&i, 1);
atomic_fetch_add_explicit(&i, 1, memory_order_relaxed);
This is a single instruction on x86, but on other architectures it's implemented in terms of simpler building blocks. The most general form is exposed as the compare-and-exchange operation, which is quite complex. It loads a value from an address and compares that value against an expected one; if they're the same, it writes a new value. On some architectures, this is a single instruction. On RISC architectures, it's usually implemented by a pair of load-linked and store-conditional operations, where the store fails if something has written to the memory address since the load.
The simplest compare-and-exchange in the standard is atomic_compare_exchange_strong(), which takes a pointer to the memory address, a pointer to the expected value (which will be updated with the current contents of memory if the operation fails), and the new value. A simple use would be an atomic update loop, like this:
_Atomic(int) *v;
int old = atomic_load(v);
int new;
do {
    new = someCalculation(old);
} while (!atomic_compare_exchange_strong(v, &old, new));
This will keep performing the calculation on the value loaded from the memory address and trying to write the result until the write succeeds; on each failure, the compare-and-exchange updates old with the value currently in memory, so the next iteration works with fresh data.
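To show how compare-and-exchange works as a building block, here is a sketch of a fetch-add built from the same loop. The name my_fetch_add is mine, not part of the standard; this is an illustration, not how any particular implementation is written.

```c
#include <stdatomic.h>

// Hypothetical illustration: an atomic fetch-add built from a
// compare-and-exchange loop. Returns the value before the addition,
// matching atomic_fetch_add()'s convention.
static int my_fetch_add(_Atomic(int) *v, int amount)
{
    int old = atomic_load(v);
    // On failure, atomic_compare_exchange_strong() stores the current
    // contents of *v into old, so the next iteration retries with it.
    while (!atomic_compare_exchange_strong(v, &old, old + amount))
        ;
    return old;
}
```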
There is also a weak version, which is allowed to fail spuriously. This version should be used if someCalculation() is very cheap. It may generate the same code as the strong version, but in some cases it can generate more-efficient code that might sometimes loop when it shouldn't.
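The weak variant drops into the same loop shape. Here is a sketch using a trivial stand-in for someCalculation() (the function body is mine, chosen only to be cheap, as the text recommends):

```c
#include <stdatomic.h>

// Stand-in for the article's someCalculation(): cheap enough that an
// occasional spurious retry costs almost nothing.
static int someCalculation(int x)
{
    return x * 2 + 1;
}

static void update(_Atomic(int) *v)
{
    int old = atomic_load(v);
    int new;
    do {
        new = someCalculation(old);
        // The weak version may fail spuriously, but the loop simply
        // retries, so the structure is identical to the strong case.
    } while (!atomic_compare_exchange_weak(v, &old, new));
}
```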
If this example were compiled for ARM or PowerPC, the strong compare-and-exchange operation would be implemented with a load-linked and store-conditional instruction pair. The weak version allows the initial atomic load to become the load-linked instruction itself. If another thread wrote to v between the load and the compare-and-exchange, but wrote back the same value, then you would get a spurious failure, but in most cases you would get the same behavior from a shorter instruction sequence.
Both versions come in variants with the _explicit suffix. Unlike most of the atomic operations, which take one memory ordering, these take two: one for the case where they succeed, and one for the case where they fail. This design is particularly useful for implementing locks: if you fail to acquire a lock, you can use relaxed ordering, but if you succeed then you probably want acquire ordering. Again, this is most useful on RISC architectures. If the store-conditional instruction succeeds, it will be followed by a memory-barrier instruction; if it fails, the code jumps back to before the load-linked instruction and starts again.
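A minimal spinlock sketch makes the two orderings concrete. This is illustrative only, not a production lock (no backoff, no fairness); the function names are mine:

```c
#include <stdatomic.h>
#include <stdbool.h>

static void spin_lock(_Atomic(bool) *lock)
{
    bool expected = false;
    // Acquire ordering on success makes everything the previous holder
    // wrote visible; relaxed ordering on failure, because a failed
    // attempt doesn't synchronize with anything.
    while (!atomic_compare_exchange_weak_explicit(lock, &expected, true,
                                                  memory_order_acquire,
                                                  memory_order_relaxed)) {
        expected = false; // the failed exchange wrote true into expected
    }
}

static void spin_unlock(_Atomic(bool) *lock)
{
    // Release ordering publishes this thread's writes to the next holder.
    atomic_store_explicit(lock, false, memory_order_release);
}
```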
For other operations, you may want to insert explicit memory barriers. The atomic_thread_fence() and atomic_signal_fence() functions take a memory ordering and insert a memory barrier at the correct point. This is useful for optimization in some cases; for example, you can use memory_order_relaxed on a small sequence of operations and then insert an explicit memory_order_seq_cst fence after them, rather than making all of them sequentially consistent. Or you can execute the fence conditionally, inserting it only when an operation returns a particular result.
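The batching optimization above might be sketched like this, assuming a group of counters that only needs to be ordered as a whole (the variable and function names are mine):

```c
#include <stdatomic.h>

_Atomic(int) a, b, c;

void update_counters(void)
{
    // Each store is individually relaxed...
    atomic_store_explicit(&a, 1, memory_order_relaxed);
    atomic_store_explicit(&b, 2, memory_order_relaxed);
    atomic_store_explicit(&c, 3, memory_order_relaxed);
    // ...followed by a single sequentially consistent fence, rather
    // than paying the seq_cst cost on every store.
    atomic_thread_fence(memory_order_seq_cst);
}
```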
This article has only given you a fairly broad overview of the C11 and C++11 atomic operations specifications, but hopefully it's enough to see how flexible they are. The design of the new atomic operations is my favorite bit of C11 — clearly, a lot of thought has gone into it.