Writing Concurrent Systems, Part 1: Locking
Buying a computer with only one core is becoming increasingly difficult. The latest generation of ARM designs supports up to four cores, so even most mobile phones are likely to be multicore within the next couple of years.
With this fact in mind, it's becoming increasingly important to write code that takes advantage of more than one thread of execution if you want to make full use of the processing power of a modern computer. It's not always important, though. If your code uses only 10% of one core, making it concurrent won't buy you anything. You'll have more overhead and something that's harder to debug, and at best it will end up using 5% of each of two cores; more likely, it will use 6-7% of each of two cores.
On the other hand, if your code is using a significant proportion of one core, concurrency is important. Even if the code uses less than 100% of one core, concurrency matters if that code is likely to run on a system alongside other similar processes. For example, imagine that you have three single-threaded processes running on a dual-core system, each wanting 60% of a core. No possible arrangement allows all three to get all the CPU time they want, even though you have more than enough processing power in total. But if you split one of the processes into two threads, each of which uses only 40% of a core, then you can fit everything onto your dual-core system: each core runs one of the remaining 60% processes alongside one of the new 40% threads.
Concurrency isn't just important for scalability; it can also help improve latency. A process that uses only 10% of the CPU may spend most of its time waiting for I/O. If you program in an asynchronous style, with or without true concurrency, it's easier for your process to do some work on the CPU while it waits for data from the network or the disk.
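As a rough sketch of that asynchronous style, the loop below polls a file descriptor with select() and does other CPU work whenever no data has arrived yet. The functions do_background_work() and handle_data() are hypothetical placeholders, not part of any real API.

/*
 * Sketch: do useful CPU work while waiting for data, instead of
 * blocking in read(). A zero timeout makes select() a non-blocking
 * poll of the descriptor.
 */
#include <sys/select.h>
#include <unistd.h>

extern void do_background_work(void);                 /* hypothetical CPU-bound work */
extern void handle_data(const char *buf, ssize_t len); /* hypothetical consumer */

void event_loop(int fd)
{
    char buf[4096];

    for (;;) {
        fd_set readfds;
        struct timeval timeout = { 0, 0 };  /* return immediately */

        FD_ZERO(&readfds);
        FD_SET(fd, &readfds);

        if (select(fd + 1, &readfds, NULL, NULL, &timeout) > 0) {
            ssize_t len = read(fd, buf, sizeof(buf));
            if (len <= 0)
                break;                      /* EOF or error */
            handle_data(buf, len);          /* data was ready; process it */
        } else {
            do_background_work();           /* nothing to read yet; keep busy */
        }
    }
}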
Over the course of this three-part series, we'll take a look at some approaches to concurrency. This first article considers the simplest form of synchronization: explicit locking.
Big Giant Lock
When people first started wanting to run UNIX kernels on multiprocessor machines, they had to take an inherently serial blob of code and make it support parallel execution. The bottom half of the kernel wasn't such a large problem; it was used to being interrupted periodically by higher-priority interrupt handlers, so true concurrency was only a slight change.
The top half, which handled system calls, was a bigger problem. An early UNIX system only ever had one process running, and therefore able to make system calls, at any given time. When running on a multiprocessor system, however, suddenly two or more processes could call into the kernel at once.
The solution was to wrap the entire system-call handler in a giant lock. When you made a system call, you acquired a lock; when you exited the system call, you released the lock. Only one process at a time could make system calls. This approach had one massive advantage over other approaches: It was very easy to implement.
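The structure is easy to sketch. The version below uses userspace C with a POSIX mutex, and the names (giant_lock, handle_syscall, do_syscall) are invented for illustration; a real kernel has its own locking primitives, but the shape is the same: one global lock around the whole system-call path.

/*
 * Sketch of the "giant lock" idea: a single global lock serializes
 * every system call, so only one process can be in the kernel at once.
 */
#include <pthread.h>

static pthread_mutex_t giant_lock = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical worker that actually services a system call. */
extern long do_syscall(int number, void *args);

long handle_syscall(int number, void *args)
{
    long result;

    pthread_mutex_lock(&giant_lock);    /* enter the kernel: acquire the lock */
    result = do_syscall(number, args);
    pthread_mutex_unlock(&giant_lock);  /* leave the kernel: release the lock */

    return result;
}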
The "giant lock" idea worked better than you might expect. Most processes only spend around 1050% of their time in system callsoften even less. With only 24 cores, you didn't see much of a performance problem. Because most processes were trying to make system calls at different times, few were held up waiting for the lock.
This approach is quite common even now for libraries. Before multithreading became common, a lot of libraries were designed with some global state. To support multiple threads, they use a global lock, which is acquired and released in every call that touches the global state. Only one thread can be executing in the library at once, but linked programs should be doing most of their work outside of the library.
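A minimal sketch of that pattern, assuming a POSIX mutex and an invented library whose only global state is a call counter, might look like this; every public entry point that touches the shared state takes the same lock.

/*
 * Sketch of a library retrofitted with a single global lock. The
 * library and its functions are invented for illustration.
 */
#include <pthread.h>

static pthread_mutex_t lib_lock = PTHREAD_MUTEX_INITIALIZER;
static int call_count;                /* example of global library state */

void lib_do_work(void)
{
    pthread_mutex_lock(&lib_lock);    /* only one thread inside at a time */
    call_count++;                     /* touch the global state safely */
    /* ... the library's real work would go here ... */
    pthread_mutex_unlock(&lib_lock);
}

int lib_get_call_count(void)
{
    int count;

    pthread_mutex_lock(&lib_lock);
    count = call_count;
    pthread_mutex_unlock(&lib_lock);
    return count;
}

The cost is that callers serialize on every library call, which is why this only works well when programs spend most of their time outside the library.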