Memory Barriers
If you've used gcc's __sync* family of built-ins, you've probably read the bit of the documentation telling you that each is a "full barrier." This means that no memory operation written before the barrier is allowed to complete after the barrier, or vice versa.
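As a minimal sketch of what "full barrier" means in practice, the following uses the real gcc built-in __sync_fetch_and_add. The function name publish_and_count and the variables message and counter are hypothetical, chosen only for illustration.

```c
/* Hypothetical shared state, for illustration only. */
static int message;
static int counter;

/* Store a value, then atomically bump a counter.
   Returns the counter's value from before the increment. */
int publish_and_count(int value) {
    message = value;   /* ordinary store, written before the barrier */

    /* gcc documents this built-in as a full barrier: the store to
       message above may not be moved after it, and no memory
       operation written after it may be moved before it. */
    return __sync_fetch_and_add(&counter, 1);
}
```

The ordering guarantee comes with the atomic operation itself, so no separate barrier call is needed here.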
The C memory model, by default, is quite weak in terms of ordering constraints. Consider this code:
a = 1; b = 2; a = 3;
The compiler is free to emit several different things for this sequence. A toy C compiler would do three memory writes. A slightly cleverer one would notice that the first line is a dead store (the value written to a is never read before it is overwritten), so it would emit code only for the latter two. An even more clever compiler might notice that b is used soon after, so keeping the value of b in a register is preferable, and it would move the line a = 3 to the start, with the following end result:
a = 3; b = 2;
This is no use if you're writing a device driver and you actually wanted to write those values to memory-mapped I/O locations. The volatile keyword was added to address this issue. The compiler must ensure that every write to and read from a volatile variable remains in the code, in the order in which it was written. That constraint applies only to volatile accesses, however; ordinary accesses can still be moved around them. If a were volatile and b were not, the following examples would still be valid:
a = 1; a = 3; b = 2;
Or:
b = 2; a = 1; a = 3;
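The device-driver case can be sketched as follows. The struct fake_device layout and the function device_send are assumptions made up for this example; a real driver would map the structure onto the device's actual register addresses.

```c
#include <stdint.h>

/* Hypothetical register block; the field names are illustrative only. */
struct fake_device {
    volatile uint32_t control;
    volatile uint32_t data;
};

/* Mirrors the a = 1; b = 2; a = 3; sequence from the text. */
void device_send(struct fake_device *dev, uint32_t value) {
    dev->control = 1;    /* kept, even though it is overwritten below */
    dev->data = value;   /* volatile, so it stays between the two control writes */
    dev->control = 3;
}
```

Because both fields are volatile, the compiler has to emit all three stores in this order. If data were an ordinary field, only the two writes to control would be pinned down.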
This isn't enough for multithreading; you can't make any guarantees about the order in which writes to different variables will occur. For example, consider the following simple mutex implementation:
void lock(volatile int *l)
{
    while (*l) {}
    *l = 1;
}
void unlock(volatile int *l)
{
    *l = 0;
}
Ignore for a second the potential race, where another thread can acquire the lock between the while loop and the assignment. Consider a typical case for a mutex:
lock(someLock); sharedState = someValue; unlock(someLock);
The compiler would be free to inline these, giving you code like this:
*l = 1; sharedState = someValue; *l = 0;
The value of l isn't used in the update, and sharedState is not volatile, so the first assignment can be reordered after the update, or the second before it. The functions acquiring and releasing the lock need to guarantee that nothing will be reordered outside those functions. With a full barrier, you have this guarantee, but you actually have a stronger guarantee than you need, which is not ideal for optimization.
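One way to get that guarantee with full barriers is sketched below, using the gcc built-in __sync_bool_compare_and_swap. The names spin_lock and spin_unlock are hypothetical, and this is an illustration of the idea rather than a production-quality lock (it spins without backing off, for one thing).

```c
/* Acquire: atomically change *l from 0 to 1, retrying while it is held.
   The compare-and-swap is atomic, which closes the race in the naive
   version, and it is a full barrier, so the protected accesses cannot
   be moved above it. */
void spin_lock(volatile int *l) {
    while (!__sync_bool_compare_and_swap(l, 0, 1)) {
        /* spin until the current holder releases the lock */
    }
}

/* Release: atomically change *l from 1 back to 0. This is again a full
   barrier, so the protected accesses cannot be moved below it. */
void spin_unlock(volatile int *l) {
    (void)__sync_bool_compare_and_swap(l, 1, 0);
}
```

Both operations order everything on both sides of them, which is more than a lock strictly needs: acquiring only has to keep later accesses from moving up, and releasing only has to keep earlier ones from moving down.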
It's worth noting that the compiler is not the only thing that can reorder memory accesses. Some CPUs will as well, unless explicit barrier instructions are inserted. The Alpha was particularly aggressive about this behavior: it could even reorder loads that depend on one another, so it needed explicit memory barrier instructions in places where most other architectures do not.
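A standalone barrier can be requested with the gcc built-in __sync_synchronize, which constrains the compiler and also emits whatever barrier instruction the target CPU requires. The sketch below assumes one thread calls publish and another calls try_consume; those names, and the variables payload and ready, are made up for this example.

```c
static int payload;          /* the data being handed over */
static volatile int ready;   /* set to 1 once payload is valid */

/* Producer: write the data, then raise the flag. */
void publish(int value) {
    payload = value;
    __sync_synchronize();    /* payload must be visible before ready */
    ready = 1;
}

/* Consumer: returns 1 and stores the data in *out if it has been
   published, or returns 0 without touching *out otherwise. */
int try_consume(int *out) {
    if (!ready) {
        return 0;
    }
    __sync_synchronize();    /* do not read payload ahead of the flag */
    *out = payload;
    return 1;
}
```

Without the two barriers, either the compiler or a weakly ordered CPU could make the flag visible before the data, and the consumer could read a stale payload.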