Lightweight Locking
The locks that I've talked about so far have been mutual exclusion locks (usually abbreviated to mutexes), occasionally called binary, or non-counting, semaphores. These are conceptually trivial. They support two operations: lock and unlock. If you try to lock a locked mutex, you wait until it's unlocked. Most implementations provide a few extra features, such as testing whether a mutex is locked or acquiring it with a timeout.
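The POSIX threads API is a concrete example of this interface. Here's a minimal sketch; the function names are illustrative and error checking is omitted:

#include <pthread.h>

static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

void update_shared_resource(void)
{
    pthread_mutex_lock(&mutex);    // Waits if another thread holds the lock.
    // ...modify the protected resource here...
    pthread_mutex_unlock(&mutex);
}

int try_update_shared_resource(void)
{
    // One of the extra features: returns EBUSY immediately instead of waiting.
    if (pthread_mutex_trylock(&mutex) != 0) { return 0; }
    // ...modify the protected resource here...
    pthread_mutex_unlock(&mutex);
    return 1;
}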
If you read Apple's marketing material for Grand Central Dispatch, you'll discover that the new-and-improved synchronization primitives are clever, initially using atomic operations in userspace, and only waiting in the kernel if the lock is contended. On pretty much any other platform, system-provided mutexes have been implemented this way for around a decade.
In general, there are three cases for a lock operation: The lock may be free, it may be locked temporarily, or it may be locked for a long time. It's difficult to tell the latter two cases apart, although you can monitor the contention on a lock and use past behavior to try to predict how the lock will behave in the future.
In well-designed code, the first case is the most common. Locks are only used to protect resources that are held briefly and then released, so it's very important that this case is fast. Most mutex implementations use a one-word flag in memory to indicate whether the lock is held. This can be tested with an atomic test-and-set operation. If the test-and-set succeeds, the lock was free, and acquiring it cost only a few cycles. If it fails, you call down into the kernel and block until the lock is released. When unlocking, you need to notify the kernel to wake up any threads that blocked while waiting for the lock. (If you're clever, you can test in userspace whether this is needed, and call into the kernel only if there is actually a waiter to wake.)
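As a concrete illustration, here is a minimal sketch of this fast-path/slow-path split using Linux futexes and GCC's atomic intrinsics. The three-state encoding (0 = unlocked, 1 = locked with no waiters, 2 = locked with possible waiters) follows the classic design from Ulrich Drepper's paper "Futexes Are Tricky"; the function names are illustrative, error handling is omitted, and it assumes a target where __sync_lock_test_and_set() performs a full atomic exchange:

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

static void mutex_lock(volatile int *lock)
{
    // Fast path: atomically go from 0 to 1. No system call if uncontended.
    int c = __sync_val_compare_and_swap(lock, 0, 1);
    if (c != 0)
    {
        // Slow path: mark the lock as contended (state 2), then sleep in
        // the kernel until the holder wakes us, and retry.
        if (c != 2) { c = __sync_lock_test_and_set(lock, 2); }
        while (c != 0)
        {
            syscall(SYS_futex, lock, FUTEX_WAIT, 2, NULL, NULL, 0);
            c = __sync_lock_test_and_set(lock, 2);
        }
    }
}

static void mutex_unlock(volatile int *lock)
{
    // If the old value was 1, nobody was waiting: this is the userspace
    // test that lets us skip the system call entirely.
    if (__sync_fetch_and_sub(lock, 1) != 1)
    {
        // Somebody may be waiting: release the lock and wake one waiter.
        *lock = 0;
        syscall(SYS_futex, lock, FUTEX_WAKE, 1, NULL, NULL, 0);
    }
}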
In some situations, you might never want to sleep. Real-time systems need to guarantee a maximum response time; one of the ways that they do this is by placing an upper bound on the amount of time during which you're allowed to hold a lock. The Symbian nanokernel is an example of this kind of design.
If your own code has some locks that are most likely to be uncontended, and they're never held for more than a few instructions, you might want to use a lightweight spinlock. A spinlock is a specific implementation of a mutex that repeatedly tests whether it can be locked, rather than relinquishing control to the scheduler.
Unfortunately, you can't implement a spinlock in portable C99, because doing so requires atomic operations. You can implement one using GCC intrinsic functions, however. The __sync_bool_compare_and_swap() function takes three arguments: an address, the expected value, and the new value. If the value in memory at the address to which the first argument points is equal to the expected value, it is replaced by the new value; the function returns true if this exchange took place and false otherwise.
This intrinsic is implemented using the target architecture's native compare-and-exchange instruction, which ensures that the entire operation is atomic. From this operation, you can build a simple spinlock, like this:
#include <sched.h>

static inline void spinlock(volatile int *lock)
{
    // Atomically replace 0 with 1; this fails while another thread holds the lock.
    while (!__sync_bool_compare_and_swap(lock, 0, 1))
    {
        sched_yield();
    }
}
This code tries to store the value 1 in the integer if the current value is 0. If that fails, it calls sched_yield(), which is part of the POSIX scheduling API (declared in <sched.h>) and tells the scheduler to run other threads. You can improve this example slightly by having it try the atomic operation a few times before calling sched_yield().
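A minimal sketch of that refinement; the function name and the iteration count of 40 are arbitrary choices:

static inline void spinlock_with_backoff(volatile int *lock)
{
    // Spin briefly in the hope that the holder releases the lock soon...
    for (int i = 0; i < 40; i++)
    {
        if (__sync_bool_compare_and_swap(lock, 0, 1)) { return; }
    }
    // ...then fall back to yielding between attempts.
    while (!__sync_bool_compare_and_swap(lock, 0, 1))
    {
        sched_yield();
    }
}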
To unlock the spinlock, you don't need an atomic operation:

static inline void spinunlock(volatile int *lock)
{
    // A plain store is enough here; the next paragraph explains why.
    *lock = 0;
}
The lock is already held, so there's no potential for races. If the write happens at the same time as an atomic test-and-set, then either it sneaks in just before (in which case the lock operation succeeds) or just after (in which case the lock operation spins and succeeds on a later attempt). It doesn't even matter if this change requires multiple instructions, because you're only changing a single bit in memory. If you wanted to expand this example into a counting semaphore, however, you would need an atomic operation.
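For example, here is a minimal sketch of a counting semaphore built on the same intrinsics (the names are illustrative). The wait operation loops until it can atomically decrement a positive count; the signal operation uses __sync_fetch_and_add(), a GCC intrinsic that performs an atomic increment:

static inline void semaphore_wait(volatile int *count)
{
    while (1)
    {
        int old = *count;
        // Only attempt the decrement when a slot appears to be free; the
        // compare-and-swap fails if another thread got there first.
        if (old > 0 && __sync_bool_compare_and_swap(count, old, old - 1))
        {
            return;
        }
        sched_yield();
    }
}

static inline void semaphore_signal(volatile int *count)
{
    // Unlike the spinlock's unlock, this must be atomic: several threads
    // may signal concurrently, and a plain read-modify-write of the
    // counter could lose an increment.
    __sync_fetch_and_add(count, 1);
}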