4.3 EXAMPLES
4.3.1 Medical Imaging
For example, consider the medical imaging problem described in Sec. 3.1.3. This application simulates a large number of gamma rays as they move through a body and out to a camera. One way to describe the concurrency is to define the simulation of each ray as a task. Because they are all logically equivalent, we put them into a single task group. The only data shared among the tasks is a large data structure representing the body, and since access to this data structure is read-only, the tasks do not depend on each other.
Because there are many independent tasks for this problem, it is less necessary than usual to consider the target platform: The large number of tasks should mean that we can make effective use of any (reasonable) number of UEs; the independence of the tasks should mean that the cost of sharing information among UEs will not have much effect on performance.
Thus, we should be able to choose a suitable structure by working through the decision tree shown previously in Fig. 4.2. Given that in this problem the tasks are independent, the only issue we really need to worry about as we select an algorithm structure is how to map these tasks onto UEs. That is, for this problem, the major organizing principle seems to be the way the tasks are organized, so we start by following the Organize By Tasks branch.
We now consider the nature of our set of tasks: whether they are arranged hierarchically or reside in an unstructured or flat set. For this problem, the tasks are in an unstructured set with no obvious hierarchical structure among them, so we choose the Task Parallelism pattern. Note that in this problem the tasks are independent, a fact that we will be able to use to simplify the solution.
Finally, we review this decision in light of possible target-platform considerations. As we observed earlier, the key features of this problem (the large number of tasks and their independence) make it unlikely that the chosen structure will be difficult to implement on the target platform, so we do not expect to need to reconsider this decision.
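To make the outcome of this walk through the decision tree concrete, the following C sketch treats the simulation of each ray as one loop iteration and lets an OpenMP parallel for map those independent tasks onto UEs. The types and routines (Body, Detection, trace_ray) are illustrative stand-ins rather than part of the application described above; this is a minimal sketch, assuming each ray writes only its own result so no synchronization among tasks is needed.

#include <stdio.h>
#include <stdlib.h>

/* Illustrative types only: a read-only model of the body and the
   result of tracing one gamma ray out to the camera. */
typedef struct { double attenuation; } Body;
typedef struct { double x, y, energy; } Detection;

/* Stand-in for the real physics of one ray; each call reads only the
   shared body model and its own seed, so calls are independent. */
static Detection trace_ray(const Body *body, long seed)
{
    Detection d;
    d.x = (double)(seed % 100);
    d.y = (double)(seed % 50);
    d.energy = 1.0 - body->attenuation;
    return d;
}

/* Task Parallelism: one task per ray.  Every iteration writes only its
   own slot of results[], so the tasks can be mapped onto any number of
   UEs without synchronization. */
static void simulate(const Body *body, Detection *results, long num_rays)
{
    #pragma omp parallel for schedule(dynamic)
    for (long i = 0; i < num_rays; i++)
        results[i] = trace_ray(body, i);
}

int main(void)
{
    Body body = { 0.3 };
    long num_rays = 1000000;
    Detection *results = malloc(num_rays * sizeof *results);
    if (!results) return 1;
    simulate(&body, results, num_rays);
    printf("energy of first detection: %f\n", results[0].energy);
    free(results);
    return 0;
}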
4.3.2 Molecular Dynamics
As a second example, consider the molecular dynamics problem described in Sec. 3.1.3. In the Task Decomposition pattern, we identified the following groups of tasks associated with this problem:
- Tasks that find the vibrational forces on an atom
- Tasks that find the rotational forces on an atom
- Tasks that find the nonbonded forces on an atom
- Tasks that update the position and velocity of an atom
- A task to update the neighbor list for all the atoms
The tasks within each group are expressed as the iterations of a loop over the atoms within the molecular system.
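A small code sketch makes this structure concrete. The function names below are hypothetical, not taken from the application, but they show that each of the first four task groups is a loop over the atoms, while the neighbor-list update is a single task covering all atoms.

#define N_ATOMS 1000

/* Hypothetical stubs standing in for the real per-atom computations. */
static void vibrational_forces(int i)           { (void)i; }
static void rotational_forces(int i)            { (void)i; }
static void nonbonded_forces(int i)             { (void)i; }
static void update_position_and_velocity(int i) { (void)i; }
static void update_neighbor_list(void)          { }

/* One time step: each task group is a loop over the atoms, except the
   neighbor-list update, which is a single task covering all atoms. */
static void md_time_step(void)
{
    for (int i = 0; i < N_ATOMS; i++) vibrational_forces(i);
    for (int i = 0; i < N_ATOMS; i++) rotational_forces(i);
    for (int i = 0; i < N_ATOMS; i++) nonbonded_forces(i);
    for (int i = 0; i < N_ATOMS; i++) update_position_and_velocity(i);
    update_neighbor_list();
}

int main(void) { md_time_step(); return 0; }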
We can choose a suitable algorithm structure by working through the decision tree shown earlier in Fig. 4.2. One option is to organize the parallel algorithm in terms of the flow of data among the groups of tasks. Note that only the first three task groups (the vibrational, rotational, and nonbonded force calculations) can execute concurrently; the force computations must finish before the atomic positions, velocities, and neighbor lists can be updated. This is not very much concurrency to work with, so a different branch of Fig. 4.2 should be used for this problem.
Another option is to derive exploitable concurrency from the set of tasks within each group, in this case the iterations of a loop over the atoms. This suggests an organization by tasks with a linear arrangement of tasks; based on Fig. 4.2, the Task Parallelism pattern should therefore be used. The total available concurrency is large (on the order of the number of atoms), providing a great deal of flexibility in designing the parallel algorithm.
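As a rough illustration of what the Task Parallelism pattern might look like here, the sketch below parallelizes the nonbonded force loop with OpenMP. The data layout (a fixed-size neighbor list) and the placeholder "physics" are assumptions made only so the example is self-contained; the point is that each iteration (one atom) is a task that reads shared, read-only data and writes only its own force entry.

#define MAX_NEIGHBORS 64

/* One task per atom: iteration i reads the shared, read-only position
   and neighbor-list arrays and writes only force[i], so the iterations
   are independent and can be distributed across UEs. */
void nonbonded(int n_atoms, const double pos[][3], double force[][3],
               const int *nbr_count, const int (*nbr)[MAX_NEIGHBORS])
{
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < n_atoms; i++) {
        for (int k = 0; k < nbr_count[i]; k++) {
            int j = nbr[i][k];
            /* placeholder pairwise term; the real code would evaluate
               the nonbonded potential between atoms i and j */
            force[i][0] += pos[j][0] - pos[i][0];
            force[i][1] += pos[j][1] - pos[i][1];
            force[i][2] += pos[j][2] - pos[i][2];
        }
    }
}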
The target machine can have a major impact on the parallel algorithm for this problem. The dependencies discussed in the Data Decomposition pattern (coordinates replicated on each UE and partial sums from each UE combined into a global force array) suggest that on the order of 2 · 3 · N terms (where N is the number of atoms) will need to be passed among the UEs. The computation, however, is of order n · N, where n is the number of atoms in the neighborhood of each atom, which is considerably less than N and does not grow with the size of the system. Hence, communication and computation are of the same order (both scale linearly with N), and management of communication overhead will be a key factor in designing the algorithm.
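To give a feel for these 3 · N-sized data movements, the following sketch assumes a message-passing environment (MPI, which the discussion above does not prescribe) and shows the step that combines each UE's partial force sums into the global force array; replicating the updated coordinates back to every UE accounts for the other 3 · N terms.

#include <mpi.h>

/* Combine the partial force sums computed on each UE into a global
   force array that every UE receives: 3 * n_atoms values per UE. */
void combine_forces(double *local_force, double *global_force, int n_atoms)
{
    MPI_Allreduce(local_force, global_force, 3 * n_atoms,
                  MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
}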