4.4 THE TASK PARALLELISM PATTERN
Problem
When the problem is best decomposed into a collection of tasks that can execute concurrently, how can this concurrency be exploited efficiently?
Context
Every parallel algorithm is fundamentally a collection of concurrent tasks. These tasks and any dependencies among them can be identified by inspection (for simple problems) or by application of the patterns in the Finding Concurrency design space. For some problems, focusing on these tasks and their interaction might not be the best way to organize the algorithm: In some cases it makes sense to organize the tasks in terms of the data (as in the Geometric Decomposition pattern) or the flow of data among concurrent tasks (as in the Pipeline pattern). However, in many cases it is best to work directly with the tasks themselves. When the design is based directly on the tasks, the algorithm is said to be a task parallel algorithm.
The class of task parallel algorithms is very large. Examples include the following.
-
Ray-tracing codes such as the medical-imaging example described in the Task Decomposition pattern: Here the computation associated with each "ray" becomes a separate and completely independent task.
-
The molecular-dynamics example described in the Task Decomposition pattern: The update of the nonbonded force on each atom is a task. The dependencies among tasks are managed by replicating the force array on each UE to hold the partial sums for each atom. When all the tasks have completed their contributions to the nonbonded force, the individual force arrays are combined (or "reduced") into a single array holding the full summation of nonbonded forces for each atom.
-
Branch-and-bound computations, in which the problem is solved by repeatedly removing a solution space from a list of such spaces, examining it, and either declaring it a solution, discarding it, or dividing it into smaller solution spaces that are then added to the list of spaces to examine. Such computations can be parallelized using this pattern by making each "examine and process a solution space" step a separate task. The tasks weakly depend on each other through the shared queue of tasks.
The common factor is that the problem can be decomposed into a collection of tasks that can execute concurrently. The tasks can be completely independent (as in the medical-imaging example) or there can be dependencies among them (as in the molecular-dynamics example). In most cases, the tasks will be associated with iterations of a loop, but it is possible to associate them with larger-scale program structures as well.
In many cases, all of the tasks are known at the beginning of the computation (the first two examples). However, in some cases, tasks arise dynamically as the computation unfolds, as in the branch-and-bound example.
Also, while it is usually the case that all tasks must be completed before the problem is done, for some problems, it may be possible to reach a solution without completing all of the tasks. For example, in the branch-and-bound example, we have a pool of tasks corresponding to solution spaces to be searched, and we might find an acceptable solution before all the tasks in this pool have been completed.
Forces
-
To exploit the potential concurrency in the problem, we must assign tasks to UEs. Ideally we want to do this in a way that is simple, portable, scalable, and efficient. As noted in Section 4.1, however, these goals may conflict. A key consideration is balancing the load, that is, ensuring that all UEs have roughly the same amount of work to do.
-
If the tasks depend on each other in some way (via either ordering constraints or data dependencies), these dependencies must be managed correctly, again keeping in mind the sometimes-conflicting goals of simplicity, portability, seal-ability, and efficiency.
Solution
Designs for task-parallel algorithms involve three key elements: the tasks and how they are defined, the dependencies among them, and the schedule (how the tasks are assigned to UEs). We discuss them separately, but in fact they are tightly coupled, and all three must be considered before final decisions are made. After these factors are considered, we look at the overall program structure and then at some important special cases of this pattern.
Tasks
Ideally, the tasks into which the problem is decomposed should meet two criteria: First, there should be at least as many tasks as UEs, and preferably many more, to allow greater flexibility in scheduling. Second, the computation associated with each task must be large enough to offset the overhead associated with managing the tasks and handling any dependencies. If the initial decomposition does not meet these criteria, it is worthwhile to consider whether there is another way of decomposing the problem into tasks that does meet the criteria.
For example, in image-processing applications where each pixel update is independent, the task definition can be individual pixels, image lines, or even whole blocks in the image. On a system with a small number of nodes connected by a slow network, tasks should be large to offset high communication latencies, so basing tasks on blocks of the image is appropriate. The same problem on a system containing a large number of nodes connected by a fast (low-latency) network, however, would need smaller tasks to make sure enough work exists to keep all the UEs occupied. Notice that this imposes a requirement for a fast network, because otherwise the smaller amount of work per task will not be enough to compensate for communication overhead.
Dependencies
Dependencies among tasks have a major impact on the emerging algorithm design. There are two categories of dependencies, ordering constraints and dependencies related to shared data.
For this pattern, ordering constraints apply to task groups and can be handled by forcing the groups to execute in the required order. For example, in a task-parallel multidimensional Fast Fourier Transform, there is a group of tasks for each dimension of the transform, and synchronization or other program constructs are used to make sure computation on one dimension completes before the next dimension begins. Alternatively, we could simply think of such a problem as a sequential composition of task-parallel computations, one for each task group.
Shared-data dependencies are potentially more complicated. In the simplest case, there are no dependencies among the tasks. A surprisingly large number of problems can be cast into this form. Such problems are often called embarrassingly parallel. Their solutions are among the simplest of parallel programs; the main considerations are how the tasks are defined (as discussed previously) and scheduled (as discussed later). When data is shared among tasks, the algorithm can be much more complicated, although there are still some common cases that can be dealt with relatively easily. We can categorize dependencies as follows.
-
Removable dependencies. In this case, the dependency is not a true dependency between tasks, but an apparent dependency that can be removed by simple code transformations. The simplest case is a temporary variable whose use is completely local to each task; that is, each task initializes the variable without reference to other tasks. This case can be handled by simply creating a copy of the variable local to each UE. In more complicated cases, iterative expressions might need to be transformed into closed-form expressions to remove a loop-carried dependency. For example, consider the following simple loop:
int ii = 0, jj = 0; for(int i = 0; i< N; i++) { ii = ii + 1; d[ii] = big_time_consuming_work(ii); jj = jj + i; a[jj] = other_big_calc(jj); }
The variables ii and jj create a dependency between tasks and prevent parallelization of the loop. We can remove this dependency by replacing ii and jj with closed-form expressions (noticing that the values of ii and i are the same and that the value of jj is the sum of the values from 0 through i):
for(int i = 0; i< N; i++){ d[i] = big_time_consuming_work(i); a[(i*i+i)/2] = other_big_calc((i*i+i)/2)); }
-
"Separable" dependencies. When the dependencies involve accumulation into a shared data structure, they can be separated from the tasks ("pulled outside the concurrent computation") by replicating the data structure at the beginning of the computation, executing the tasks, and then combining the copies into a single data structure after the tasks complete. Often the accumulation is a reduction operation, in which a collection of data elements is reduced to a single element by repeatedly applying a binary operation such as addition or multiplication.
In more detail, these dependencies can be managed as follows: A copy of the data structure used in the accumulation is created on each UE. Each copy is initialized (in the case of a reduction, to the identity element for the binary operationfor example, zero for addition and one for multiplication). Each task then carries out the accumulation into its local data structure, eliminating the shared-data dependency. When all tasks are complete, the local data structures on each UE are combined to produce the final global result (in the case of a reduction, by applying the binary operation again). As an example, consider the following loop to sum the elements of array f:
for(int i = 0; i< N; i++){ sum = sum + f(i); }
This is technically a dependency between loop iterations, but if we recognize that the loop body is just accumulating into a simple scalar variable, it can be handled as a reduction.
Reductions are so common that both MPI and OpenMP provide support for them as part of the API. Sec. 6.4.2 in the Implementation Mechanisms design space discusses reductions in more detail.
-
Other dependencies. If the shared data cannot be pulled out of the tasks and is both read and written by the tasks, data dependencies must be explicitly managed within the tasks. How to do this in a way that gives correct results and also acceptable performance is the subject of the Shared Data pattern.
Schedule
The remaining key element to consider is the schedulethe way in which tasks are assigned to UEs and scheduled for execution. Load balance (as described in Chapter 2) is a critical consideration in scheduling; a design that balances the computational load among PEs will execute more efficiently than one that does not. Fig. 4.3 illustrates the problem.
Figure 4.3 Good versus poor load balance
Two classes of schedules are used in parallel algorithms: static schedules, in which the distribution of tasks among UEs is determined at the start of the computation and does not change; and dynamic schedules, in which the distribution of tasks among UEs varies as the computation proceeds.
In a static schedule, the tasks are associated into blocks and then assigned to UEs. Block size is adjusted so each UE takes approximately the same amount of time to complete its tasks. In most applications using a static schedule, the computational resources available from the UEs are predictable and stable over the course of the computation, with the most common case being UEs that are identical (that is, the computing system is homogeneous). If the set of times required to complete each task is narrowly distributed about a mean, the sizes of the blocks should be proportional to the relative performance of the UEs (so, in a homogeneous system, they are all the same size). When the effort associated with the tasks varies considerably, a static schedule can still be useful, but now the number of blocks assigned to UEs must be much greater than the number of UEs. By dealing out the blocks in a round-robin manner (much as a deck of cards is dealt among a group of card players), the load is balanced statistically.
Dynamic schedules are used when (1) the effort associated with each task varies widely and is unpredictable and/or (2) when the capabilities of the UEs vary widely and unpredictably. The most common approach used for dynamic load balancing is to define a task queue to be used by all the UEs; when a UE completes its current task and is therefore ready to process more work, it removes a task from the task queue. Faster UEs or those receiving lighter-weight tasks will access the queue more often and thereby be assigned more tasks.
Another dynamic scheduling strategy uses work stealing, which works as follows. The tasks are distributed among the UEs at the start of the computation. Each UE has its own work queue. When the queue is empty, the UE will try to steal work from the queue on some other UE (where the other UE is usually randomly selected). In many cases, this produces an optimal dynamic schedule without incurring the overhead of maintaining a single global queue. In programming environments or packages that provide support for the construct, such as Cilk [BJK+96], Hood [BP99], or the FJTask framework [Lea00b, Lea], it is straightforward to use this approach. But with more commonly used programming environments such as OpenMP, MPI, or Java (without support such as the FJTask framework), this approach adds significant complexity and therefore is not often used.
Selecting a schedule for a given problem is not always easy. Static schedules incur the least overhead during the parallel computation and should be used whenever possible.
Before ending the discussion of schedules, we should mention again that while for most problems all of the tasks are known when the computation begins and all must be completed to produce an overall solution, there are problems for which one or both of these is not true. In these cases, a dynamic schedule is probably more appropriate.
Program structure
Many task-parallel problems can be considered to be loop-based. Loop-based problems are, as the name implies, those in which the tasks are based on the iterations of a loop. The best solutions for such problems use the Loop Parallelism pattern. This pattern can be particularly simple to implement in programming environments that provide directives for automatically assigning loop iterations to UEs. For example, in OpenMP a loop can be parallelized by simply adding a "parallel for" directive with an appropriate schedule clause (one that maximizes efficiency). This solution is especially attractive because OpenMP then guarantees that the resulting program is semantically equivalent to the analogous sequential code (within roundoff error associated with different orderings of floatingpoint operations).
For problems in which the target platform is not a good fit with the Loop Parallelism pattern, or for problems in which the model of "all tasks known initially, all tasks must complete" does not apply (either because tasks can be created during the computation or because the computation can terminate without all tasks being complete), this straightforward approach is not the best choice. Instead, the best design makes use of a task queue; tasks are placed on the task queue as they are created and removed by UEs until the computation is complete. The overall program structure can be based on either the Master/Worker pattern or the SPMD pattern. The former is particularly appropriate for problems requiring a dynamic schedule.
In the case in which the computation can terminate before all the tasks are complete, some care must be taken to ensure that the computation ends when it should. If we define the termination condition as the condition that when true means the computation is completeeither all tasks are complete or some other condition (for example, an acceptable solution has been found by one task)then we want to be sure that (1) the termination condition is eventually met (which, if tasks can be created dynamically, might mean building into it a limit on the total number of tasks created), and (2) when the termination condition is met, the program ends. How to ensure the latter is discussed in the Master/Worker and SPMD patterns.
Common idioms
Most problems for which this pattern is applicable fall into the following two categories.
Embarrassingly parallel problems are those in which there are no dependencies among the tasks. A wide range of problems fall into this category, ranging from rendering frames in a motion picture to statistical sampling in computational physics. Because there are no dependencies to manage, the focus is on scheduling the tasks to maximize efficiency. In many cases, it is possible to define schedules that automatically and dynamically balance the load among UEs.
Replicated data or reduction problems are those in which dependencies can be managed by "separating them from the tasks" as described earlierreplicating the data at the beginning of computation and combining results when the termination condition is met (usually "all tasks complete"). For these problems, the overall solution consists of three phases, one to replicate the data into local variables, one to solve the now-independent tasks (using the same techniques used for embarrassingly parallel problems), and one to recombine the results into a single result.
Examples
We will consider two examples of this pattern. The first example, an image-construction example, is embarrassingly parallel. The second example will build on the molecular dynamics example used in several of the Finding Concurrency patterns.
Image construction
In many image-construction problems, each pixel in the image is independent of all the other pixels. For example, consider the well known Mandelbrot set [Dou86]. This famous image is constructed by coloring each pixel according to the behavior of the quadratic recurrence relation
where C and Z are complex numbers and the recurrence is started with Z0 = C. The image plots the imaginary part of C on the vertical axis and the real part on the horizontal axis. The color of each pixel is black if the recurrence relation converges to a stable value or is colored depending on how rapidly the relation diverges.
At the lowest level, the task is the update for a single pixel. First consider computing this set on a cluster of PCs connected by an Ethernet. This is a coarsegrained system; that is, the rate of communication is slow relative to the rate of computation. To offset the overhead incurred by the slow network, the task size needs to be large; for this problem, that might mean computing a full row of the image. The work involved in computing each row varies depending on the number of divergent pixels in the row. The variation, however, is modest and distributed closely around a mean value. Therefore, a static schedule with many more tasks than UEs will likely give an effective statistical balance of the load among nodes. The remaining step in applying the pattern is choosing an overall structure for the program. On a shared-memory machine using OpenMP, the Loop Parallelism pattern described in the Supporting Structures design space is a good fit. On a network of workstations running MPI, the SPMD pattern (also in the Supporting Structures design space) is appropriate.
Before moving on to the next example, we consider one more target system, a cluster in which the nodes are not heterogeneousthat is, some nodes are much faster than others. Assume also that the speed of each node may not be known when the work is scheduled. Because the time needed to compute the image for a row now depends both on the row and on which node computes it, a dynamic schedule is indicated. This in turn suggests that a general dynamic load-balancing scheme is indicated, which then suggests that the overall program structure should be based on the Master/Worker pattern.
Molecular dynamics
For our second example, we consider the computation of the nonbonded forces in a molecular dynamics computation. This problem is described in Sec. 3.1.3 and in [Mat95, PH95] and is used throughout the patterns in the Finding Concurrency design space. Pseudocode for this computation is shown in . The physics in this example is not relevant and is buried in code not shown here (the computation of the neighbors list and the force function). The basic computation structure is a loop over atoms, and then for each atom, a loop over interactions with other atoms. The number of interactions per atom is computed separately when the neighbors list is determined. This routine (not shown here) computes the number of atoms within a radius equal to a preset cutoff distance. The neighbor list is also modified to account for Newton's third law: Because the force of atom i on atom j is the negative of the force of atom j on atom i, only half of the potential interactions need actually be computed. Understanding this detail is not important for understanding this example. The key is that this causes each loop over j to vary greatly from one atom to another, thereby greatly complicating the load-balancing problem. Indeed, for the purposes of this example, all that must really be understood is that calculating the force is an expensive operation and that the number of interactions per atom varies greatly. Hence, the computational effort for each iteration over i is difficult to predict in advance.
Example 4.4. Pseudocode for the nonbonded computation in a typical molecular dynamics code
function non_bonded_forces (N, Atoms, neighbors, Forces) Int const N // number of atoms Array of Real :: atoms (3,N) //3D coordinates Array of Real :: forces (3,N) //force in each dimension Array of List :: neighbors(N) //atoms in cutoff volume Real :: forceX, forceY, forceZ loop [i] over atoms loop [j] over neighbors(i) forceX = non_bond_force(atoms(1,i), atoms(1,j)) forceY = non_bond_force(atoms(2,i), atoms(2,j)) forceZ = non_bond_force(atoms(3,i), atoms(3,j)) force(1,i) += forceX; force(1,j) -= forceX; force(2,i) += forceY; force(2,j) -= forceY; force(3,i) += forceZ; force(3,j) -= forceZ; end loop [j] end loop [i] end function non_bonded_forces
Each component of the force term is an independent computation, meaning that each (i, j) pair is fundamentally an independent task. The number of atoms tends to be on the order of thousands, and squaring that gives a number of tasks that is more than enough for all but the largest parallel systems. Therefore, we can take the more convenient approach of defining a task as one iteration of the loop over i. The tasks, however, are not independent: The force array is read and written by each task. Inspection of the code shows that the arrays are only used to accumulate results from the computation, however. Thus, the full array can be replicated on each UE and the local copies combined (reduced) after the tasks complete.
After the replication is defined, the problem is embarrassingly parallel and the same approaches discussed previously apply. We will revisit this example in the Master/Worker, Loop Parallelism, and SPMD patterns. A choice among these patterns is normally made based on the target platforms.
Known uses
There are many application areas in which this pattern is useful, including the following.
Many ray-tracing programs use some form of partitioning with individual tasks corresponding to scan lines in the final image [BKS91].
Applications written with coordination languages such as Linda are another rich source of examples of this pattern [BCM+91]. Linda [CG91] is a simple language consisting of only six operations that read and write an associative (that is, content-addressable) shared memory called a tuple space. The tuple space provides a natural way to implement a wide variety of shared-queue and master/worker algorithms.
Parallel computational chemistry applications also make heavy use of this pattern. In the quantum chemistry program GAMESS, the loops over two electron integrals are parallelized with the task queue implied by the Nextval construct within TCGMSG. An early version of the distance geometry program DGEOM was parallelized with the master/worker form of this pattern. These examples are discussed in [Mat95].
PTEP (Parallel Telemetry Processor) [NBB01], developed by NASA as the downlink processing system for data from a planetary rover or lander, also makes use of this pattern. The system is implemented in Java but can incorporate components implemented in other languages. For each incoming data packet, the system determines which instrument produced the data, and then performs an appropriate sequential pipeline of processing steps. Because the incoming data packets are independent, the processing of individual packets can be done in parallel.