- Using Multiple Processes to Improve System Productivity
- Multiple Users Utilizing a Single System
- Improving Machine Efficiency Through Consolidation
- Using Parallelism to Improve the Performance of a Single Task
- Parallelization Patterns
- How Dependencies Influence the Ability Run Code in Parallel
- Identifying Parallelization Opportunities
- Summary
Identifying Parallelization Opportunities
The steps necessary to identify parallelization opportunities in codes are as follows:
- Gather a representative runtime profile of the application, and identify the regions of code where the most time is currently being spent.
- For these regions, examine the code for dependencies, and determine whether the dependencies can be broken so that the code can be performed either as multiple parallel tasks or as a loop over multiple parallel iterations. At this point, it may also be worth investigating whether a different algorithm or approach would give code that could be more easily made parallel.
- Estimate the overheads and likely performance gains from this parallelization strategy. If the approach promises close to linear scaling with the number of threads, then it is probably a good approach; if the scaling does not look very efficient, it may be worth broadening the scope of the analysis.
- Broaden the scope of the analysis by considering the routine that calls the region of interest. Is it possible to make this routine parallel?
The important point to remember is that parallelization incurs synchronization costs, so the more work that each thread performs before it needs synchronization, the better the code will scale. Consequently, it is always worth looking further up the call stack of a region of code to determine whether there is a more effective parallelization point. For example, consider the pseudocode shown in Listing 3.14.
Listing 3.14. Opportunities for Parallelization at Different Granularities
void handlePacket(packet_t *packet) { doOneTask(packet); doSecondTask(packet); } void handleStream( stream_t* stream ) { for( int i=0; i < stream->number_of_packets; i++) { handlePacket( stream->packets[i] ); } }
In this example, there are two long-running tasks; each performs some manipulation of a packet of data. It is quite possible that the two tasks, doOneTask() and doSecondTask(), could be performed in parallel. However, that would introduce one synchronization point after every packet that is processed. So, the synchronization cost would be O(N) where N is the number of packets.
Looking further up the stack, the calling routine, handleStream(), iterates over a stream of packets. So, it would probably be more appropriate to explore whether this loop could be made to run in parallel. If this was successful, then there would be a synchronization point only after an entire stream of packets had been handled, which could represent a significant reduction in the total synchronization costs.