- 4.1 Case Study
- 4.2 Creating a Naive Parallel Version
- 4.3 Performance of OpenACC Programs
- 4.4 An Optimized Parallel Version
- 4.5 Summary
- 4.6 Exercises
4.4 An Optimized Parallel Version
So far we have marked the parallel regions for acceleration. Now it is time to introduce data regions to optimize data transfers.
4.4.1 Reducing Data Movement
Now that you have identified the problem, you know you must apply some data directives. OpenACC lets you completely control the residency of the data. It has routines to set up data during program initialization, to automatically migrate data going into or out of any region or block of code, and to update at any given point in the code. So don’t worry about what OpenACC can do. Worry about what you want to do.
Pause here and see whether you can come up with a strategy to minimize data movement. What directives does that strategy translate to? Feel free to experiment with the code on your own before reading the answer, which is provided later.
In general, we want the entire simulation to be executed on the device. That is certainly the ideal case and eliminates all the data transfer costs. But most of the time you can’t achieve that objective; the entire problem may not fit in device memory, there may be portions of the code that must execute on the host, or IO may be required at some point.
But let’s start with that objective in mind. If you load your data onto the device at the beginning of the main loop, when do you next need it on the host? Think the first iteration through as a start: there is no reason for the two big arrays to return to the host between the two kernels. They can stay on the device.
What about worst_dt? It is insignificant in size, so you don’t care what it does as long as it is available when needed, as per the default kernels behavior. Once you start to use data regions, you uncouple the execution from the data regions and could prevent unnecessary data movement. Because there is no real performance gain, you won’t override the default by including it in any data directives. It will continue to be set to 0 on the host, get to a maximum in the second nested loop (actually a reduction from all of the “local maximums” found by each processing element (PE) on the device), and get copied back to the host so that it can be checked as the condition to continue the while loop every iteration. Again, this is all default kernels behavior, so we don’t worry about the details.
After that, you run into the output routine. It isn’t an issue for the first 100 iterations, so let’s ignore it for a moment and continue around the loop for the second iteration. At the start of the second iteration, you would like both big arrays to be on the device. That is just where you left them! So it looks as if you can just keep the data on the device between iterations of the while loop. The obvious data directives would be data copy clauses applied to the while loop.
// C #pragma acc data copy(Temperature_previous, Temperature) while ( worst_dt > TEMP_TOLERANCE ) { . . . ! Fortran !$acc data copy(temperature_previous, temperature) do while ( worst_dt > temp_tolerance ) . . .
This is indeed the key. It will significantly speed up the code, and you will get the right answer at the end.
However, you do need to address that track_progess() output routine that gets invoked every 100 iterations. You need for the temperature to be back on the host at that point. Otherwise, the host copy of temperature will remain at the initial condition of all zeros until the data copy happens at the termination of the while loop, which is the end of the data region. Many programmers encounter this oversight when they apply the data directives, run the code to a quick completion in the expected 3,372 iterations, and assume victory, only to notice that all of their printed output has been zeros. Make sure you understand exactly how this happens, because it is a good example of what can occur when we decouple the data and execution regions using data directives.
The fix is easy. You just need an update at that point.
// C . . . if((iteration % 100) == 0) { #pragma acc update host(Temperature) track_progress(iteration); } . . . ! Fortran . . . if( mod(iteration,100).eq.0 ) then !$acc update host(temperature) call track_progress(temperature, iteration) endif . . .
It is important to realize that all the tools for convenient data management are already in OpenACC. Once you decide how you want to manage the data conceptually, some combination of data copy, declare, enter/exit, and update clauses should allow you to accomplish that as you wish. If you find yourself fighting the scope or blocking of your code to make the directives match your wishes, take a breath and ask yourself whether the other clauses will allow you to accomplish this more naturally.
4.4.2 Extra Clever Tweaks
There is one more tweak you can apply to the code before you declare victory. If you look a little more carefully at the code, you might notice that you don’t actually need to copy both big arrays into the while loop. It happens that temperature_previous is the array that is initialized in the initialization routine, and temperature uses these values to set itself in the first iteration. So you don’t need to copy it in.
Continuing with that line of thought, you don’t need for both arrays to exit the while loop with the final data; one will suffice. Once again, temperature_previous has the correct values so that you can abandon temperature on the device. This means that temperature is really just a temporary array used on the device, and there is no need to copy it in or out. That is exactly what the data create clause is for.
Note that this last optimization is really not very important. The big win was recognizing that you were copying the large arrays needlessly every iteration. You were copying two large arrays into and out of each of the two kernels each loop:
(2 arrays) × (in and out) × (2 pairs of loops) × (3,372 iterations) = 26,976 copies
Getting rid of all those transfers with a data copy was the big win. Using data create instead of copy for the Temperature array saved one copy in at the beginning of the entire run, and one copy out at the end. It wasn’t significant. So don’t feel bad if you didn’t spot that opportunity.
Likewise, using an update for the track progress routine caused 33 transfers over the course of the run. It was a quick fix for the problem. In comparison to the original 26,876 copies, having 33 remaining is nothing. However now that you are down to one copy in and one copy out for the whole run, it does have an impact on the order of 5 percent of the new and significantly reduced total run time. Given the huge performance improvement you have achieved, you may not care, but for those of you seeking perfection, see Exercise 1 at the end of the chapter.
4.4.3 Final Result
Listing 4.7 shows the final C version of the OpenACC enabled routine, and Listing 4.8 shows the Fortran version.
Listing 4.7. Final C OpenACC Laplace code main loop
#pragma acc data copy(Temperature_previous), create(Temperature) while ( worst_dt > TEMP_TOLERANCE ) { #pragma acc kernels for(i = 1; i <= HEIGHT; i++) { for(j = 1; j <= WIDTH; j++) { Temperature[i][j] = 0.25 * (Temperature_previous[i+1][j] + Temperature_previous[i-1][j] + Temperature_previous[i][j+1] + Temperature_previous[i][j-1]); } } worst_dt = 0.0; #pragma acc kernels for(i = 1; i <= HEIGHT; i++){ for(j = 1; j <= WIDTH; j++){ worst_dt = fmax( fabs(Temperature[i][j]- Temperature_previous[i][j]), worst_dt); Temperature_previous[i][j] = Temperature[i][j]; } } if((iteration % 100) == 0) { #pragma acc update host(Temperature) track_progress(iteration); } iteration++; }
Listing 4.8. Final Fortran OpenACC Laplace code main loop
!$acc data copy(temperature_previous), create(temperature) do while ( worst_dt > temp_tolerance ) !$acc kernels do j=1,width do i=1,height temperature(i,j) =0.25*(temperature_previous(i+1,j)& + temperature_previous(i-1,j)& + temperature_previous(i,j+1)& + temperature_previous(i,j-1)) enddo enddo !$acc end kernels worst_dt=0.0 !$acc kernels do j=1,width do i=1,height worst_dt = max( abs(temperature(i,j) – & temperature_previous(i,j)),& worst_dt ) temperature_previous(i,j) = temperature(i,j) enddo enddo !$acc end kernels if( mod(iteration,100).eq.0 ) then !$acc update host(temperature) call track_progress(temperature, iteration) endif iteration = iteration+1 enddo !$acc end data
You compile exactly as before. If you again use the compiler verbose information option (-Minfo=acc for PGI), you see that the generated copies are now outside the while loop, as intended. Here is the result.
. . . . . . ---------- Iteration number: 3200 ------------ . . .[998,998]: 99.18 [999,999]: 99.56 [1000,1000]: 99.86 ---------- Iteration number: 3300 ------------ . . .[998,998]: 99.19 [999,999]: 99.56 [1000,1000]: 99.87 Max error at iteration 3372 was 0.009995 Total time was 1.054768 seconds.
This is much better. Table 4.1 sums it up. With only a handful of directives, you have managed to speed up the serial code more than 20 times. But you had to think about your data migration in order to get there. This is typical of accelerator development.
Table 4.1. Laplace code performance
OPTIMIZATION | TIME (SECONDS) | SPEEDUP |
Serial | 21.3 | |
kernels directive | 35.2 | 0.60 |
data directives | 1.05 | 20.3 |
To review, you looked for the large loops and placed kernels directives there. Then (prompted by terrible performance) you thought about how the data should really flow between the host and the device. Then you used the appropriate data directives to make that happen. Further performance improvements are possible (see the exercises), but you have achieved the lion’s share of what can be done.