- 4.1 Case Study
- 4.2 Creating a Naive Parallel Version
- 4.3 Performance of OpenACC Programs
- 4.4 An Optimized Parallel Version
- 4.5 Summary
- 4.6 Exercises
4.2 Creating a Naive Parallel Version
In many other types of parallel programming, you would be wise to stare at your code and plot various approaches and alternative algorithms before you even consider starting to type. With OpenACC, the low effort and quick feedback allow you to dive right in and try some things without much risk of wasted effort.
4.2.1 Find the Hot Spot
Almost always the first thing to do is find the hot spot: the point of highest numerical intensity in your code. A profiler like those you’ve read about will quickly locate and rank these spots. Often, as is the case here, it is obvious where to start. A large loop is a big flag, and you have two of them within the main loop. This is where we focus.
4.2.2 Is It Safe to Use kernels?
The biggest hammer in your toolbox is the kernels directive. Refer to Chapter 1 for full details on kernels. Don’t resist the urge to put it in front of some large, nested loop. One nice feature about this directive is that it is safe out of the box; until you start to override its default behavior with additional directives, the compiler will be able to see whether there are any code-breaking dependencies, and it will make sure that the device has access to all the required data.
4.2.3 OpenACC Implementations
Let’s charge ahead and put kernels directives in front of the two big loops. The C and Fortran versions then become the code shown in Listings 4.5 and 4.6.
Listing 4.5. C Laplace code main loop with kernels directives
while ( worst_dt > TEMP_TOLERANCE ) {

    #pragma acc kernels
    for(i = 1; i <= HEIGHT; i++) {
        for(j = 1; j <= WIDTH; j++) {
            Temperature[i][j] = 0.25 * (Temperature_previous[i+1][j]
                                      + Temperature_previous[i-1][j]
                                      + Temperature_previous[i][j+1]
                                      + Temperature_previous[i][j-1]);
        }
    }

    worst_dt = 0.0;

    #pragma acc kernels
    for(i = 1; i <= HEIGHT; i++){
        for(j = 1; j <= WIDTH; j++){
            worst_dt = fmax( fabs(Temperature[i][j]-
                                  Temperature_previous[i][j]),
                             worst_dt);
            Temperature_previous[i][j] = Temperature[i][j];
        }
    }

    if((iteration % 100) == 0) {
        track_progress(iteration);
    }

    iteration++;
}
Listing 4.6. Fortran Laplace code main loop with kernels directives
do while ( worst_dt > temp_tolerance )

   !$acc kernels
   do j=1,width
      do i=1,height
         temperature(i,j) = 0.25*(temperature_previous(i+1,j) &
                                + temperature_previous(i-1,j) &
                                + temperature_previous(i,j+1) &
                                + temperature_previous(i,j-1))
      enddo
   enddo
   !$acc end kernels

   worst_dt = 0.0

   !$acc kernels
   do j=1,width
      do i=1,height
         worst_dt = max( abs(temperature(i,j) -          &
                             temperature_previous(i,j)), &
                         worst_dt )
         temperature_previous(i,j) = temperature(i,j)
      enddo
   enddo
   !$acc end kernels

   if( mod(iteration,100).eq.0 ) then
      call track_progress(temperature, iteration)
   endif

   iteration = iteration+1

enddo
The compilation is also straightforward. All you do is activate the directives using, for example, the PGI compiler, for the C version:
pgcc -acc laplace.c
Or for the Fortran version:
pgf90 -acc laplace.f90
If you do this, the executable pops right out and you can be on your way. However, you probably want to verify that your directives actually did something. OpenACC’s defense against compiling a loop with dependencies or other issues is to simply ignore the directives and deliver a “correct,” if unaccelerated, executable. With the PGI compiler, you can request feedback on the C OpenACC compilation by using this:
pgcc -acc -Minfo=acc laplace.c
Here it is for Fortran:
pgf90 -acc -Minfo=acc laplace.f90
Similar options are available for other compilers. Among the informative output, you see the “Accelerator kernel generated” message for both of your kernels-enabled loops. You may also notice that the compiler automatically generated a reduction for worst_dt; it was nice of it to catch that for us. So far so good.
If you run this executable, you will get something like this:
. . .

---------- Iteration number: 3200 ------------
. . . [998,998]: 99.18  [999,999]: 99.56  [1000,1000]: 99.86

---------- Iteration number: 3300 ------------
. . . [998,998]: 99.19  [999,999]: 99.56  [1000,1000]: 99.87

Max error at iteration 3372 was 0.009995
Total time was 35.258830 seconds.
This was executed on an NVIDIA K80, the fastest GPU available at the time of this writing. For our efforts thus far, we have managed to slow down the code by about 70 percent, which is not impressive at all.