Caveats
Shared memory is relatively cheap, but it makes your code hard to port to a system without a unified address space, such as a cluster. Newer AMD chips have a unified address space but not a uniform memory architecture (some regions of memory are faster to access from some processors), so shared-memory performance can be harder to predict in advance.
Operations in C are often translated into multiple machine instructions. Just because something looks like a single operation in C doesn't mean it will be executed atomically.
Locks are typically very expensive: acquiring or releasing one requires at least a system call. The example in the preceding section executed no code outside the lock body. In cases like this, where you're likely to spend much of your time in the locking operations themselves, you may be able to optimize things considerably by using atomic read-modify-write instructions. Unfortunately, doing so makes your code non-portable between CPU architectures.
Locks are also easy to forget to use, and tracking down the one place where you access a data structure without the correct locking can be a pain.
A good rule is that debugging complexity for an application using shared memory scales exponentially with the number of threads accessing a data structure. In part 2 of this series, we’ll look at using message-passing approaches to mitigate this problem.