POSIX Parallel Programming, Part 1
- There Is No Spoon
- Will You Remember Something for Me?
- One for You, One for Me
- Caveats
As I write this article, almost every new CPU has two cores. AMD and Intel are rushing to put four cores on a die; Sun is already shipping chips that can execute 32 threads at once. These days, fast code means parallel code. (See my article "The Future of CPUs: What’s After Multi-Core?" for more on upcoming CPU designs.)
Writing parallel code isn’t much more difficult than writing serial code. Much of the time, explicit parallelism isn’t even required, and when it is, algorithms are often available that map nicely onto a parallel approach. This article describes a few ways of approaching the problem on a *NIX platform.
If you’re starting from scratch, choosing the correct language for the project can give you an immediate benefit. A language like Erlang, which is designed for concurrency, can enable you to write code that’s scalable to about 1,000 nodes. Unfortunately, this luxury is uncommon, so for the remainder of this article I’ll assume that the project must be completed in C.
There Is No Spoon
Traditionally, concurrency is implemented in UNIX by multiple processes. These are created with the fork(2) system call. The origins of the fork(2) mechanism date back to a time before computers had the necessary hardware for proper multitasking. A single process at a time would be memory-resident. When context-switching, the entire contents of memory would be written out, and another process image read in.
In this kind of system, creating a new process by forking is a logical mechanism. You already have a copy of the process image in memory; all you need to do is save another copy out-of-core. On a modern system, this approach makes somewhat less sense. The kernel must iterate across the process’s page table and mark each page as copy-on-write (CoW). Then, the first time either process writes to a page, the write causes a trap into the kernel, which copies the page.
Consider the following piece of code, which populates an array with random numbers and then calculates the average:
#include <stdlib.h>
#include <stdio.h>

const unsigned int ARRAY_SIZE = 0x500000;

void populate_array(int* array, unsigned int elements)
{
    for(unsigned int i=0 ; i<elements ; i++)
    {
        array[i] = random();
    }
}

int average_array(int* array, unsigned int elements)
{
    long long total = 0;
    for(unsigned int i=0 ; i<elements ; i++)
    {
        total += array[i];
    }
    return total / elements;
}

int main(void)
{
    int * array = calloc(ARRAY_SIZE, sizeof(int));
    populate_array(array, ARRAY_SIZE);
    printf("Average: %d\n", average_array(array, ARRAY_SIZE));
    return 0;
}
Note that this code is for example purposes only and contains a few issues, such as a potential overflow, that would make it unsuitable for production use (even if you were ever going to need to find the average of a set of random numbers in the real world). Now, consider what we could do to parallelize this code. To make it run on two processors, we could split the array in half, generate and average each half in a separate process, and then average the two averages to get the result.
Since our processes don’t share an address space, however, we have a slight problem: getting the average value from each child process back to the parent. Fortunately, POSIX has a simple mechanism that allows us to do exactly that. When a process terminates, either by calling exit(3) or by returning from the main() function, its exit status is kept by the operating system and can be retrieved with the wait(2) system call. wait(2) takes a pointer to an integer and stores the status there; the value passed to exit(3) (or, more precisely, its low eight bits) can be extracted from that status with the WEXITSTATUS macro. The return value of wait(2) is the process ID (pid) of the child that terminated. These days, it’s quite uncommon to use plain wait(2); it has largely been superseded by waitpid(2), wait3(2), and wait4(2). We’ll use it in this example, though, because it has a simple interface:
#include <stdlib.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

const unsigned int ARRAY_SIZE = 0x500000;
const unsigned int PROCESSES = 2;

void populate_array(int* array, unsigned int elements)
{
    for(unsigned int i=0 ; i<elements ; i++)
    {
        array[i] = random();
    }
}

int average_array(int* array, unsigned int elements)
{
    long long total = 0;
    for(unsigned int i=0 ; i<elements ; i++)
    {
        total += array[i];
    }
    return total / elements;
}

int main(void)
{
    unsigned int array_size = ARRAY_SIZE / PROCESSES;
    //Fork off PROCESSES children:
    for(unsigned int child=0 ; child<PROCESSES ; child++)
    {
        pid_t pid = fork();
        if(pid == 0)
        {
            //We are in a child process
            int * array = calloc(array_size, sizeof(int));
            populate_array(array, array_size);
            exit(average_array(array, array_size));
        }
    }
    //Array to store child results in:
    int child_averages[PROCESSES];
    for(unsigned int i=0 ; i<PROCESSES ; i++)
    {
        int status;
        wait(&status);
        //Only the low 8 bits of the exit value survive; production
        //code would return results through a pipe or shared memory.
        child_averages[i] = WEXITSTATUS(status);
    }
    printf("Average: %d\n", average_array(child_averages, PROCESSES));
    return 0;
}
The majority of this program is the same as the single-process variant. The differences lie primarily in the main() function. Here, we begin by forking off all of the child processes. Note the test statement after the fork(2). The fork(2) function is called once, but returns twice. In the parent process, it returns the pid of the child. In the child process, it returns 0. This program checks whether we’re in the child; if we are, it allocates a smaller array and returns the average by using exit(3).