- Understanding the GPU
- Setting Up OpenCL
- Running the Game
- OpenCL and OpenGL
Running the Game
The final step is to run the kernel on all of the data. This step has two parts. The first is setting up the input and output buffers and the kernel arguments, which only needs to happen once:
void prepareKernel(void) { input = clCreateBuffer(context, CL_MEM_READ_ONLY, sizeof(board), NULL, NULL); output = clCreateBuffer(context, CL_MEM_WRITE_ONLY, sizeof(board), NULL, NULL); if (!input || !output) { fail("Unable to create buffers"); } int err = 0; err = clSetKernelArg(kernel, 0, sizeof(cl_mem), &input); err |= clSetKernelArg(kernel, 1, sizeof(cl_mem), &output); err |= clSetKernelArg(kernel, 2, sizeof(unsigned int), &height); err |= clSetKernelArg(kernel, 3, sizeof(unsigned int), &width); if (err != CL_SUCCESS) { fail("Unable to set arguments"); } }
The two buffers, input and output, are regions of GPU memory (or OpenCL-managed CPU memory) that are used for the first two arguments of the kernel. Because the kernel runs in parallel and depends on data from adjacent cells, we must use separate input and output. The other two arguments, the width and height of the kernel, are constants.
Because the kernel is stored in the OpenCL stack as some intermediate representation, it may be compiled to native code when you run it, after using a constant propagation-optimization pass to remove the arguments. This feature is useful because the values of the arguments are the same for every invocation of the kernel.
The real work happens in the runGame() function of the example, which copies the data to the GPU, runs the kernel, and then gets the data back again:
void runGame(int iterations) { if (iterations == 0) { return; } int err; size_t workgroup_size; err = clGetKernelWorkGroupInfo(kernel, device_id, CL_KERNEL_WORK_GROUP_SIZE, sizeof(size_t), &workgroup_size, NULL); if (err != CL_SUCCESS) { fail("Unable to get kernel work-group size"); } // Send the board to the OpenCL stack err = clEnqueueWriteBuffer(queue, input, CL_TRUE, 0, sizeof(board), board, 0, NULL, NULL); if (err != CL_SUCCESS) { fail("Unable to enqueue buffer"); } for (unsigned int i=0 ; i<iterations ; i++) { // Run the kernel on every cell in the board err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &board_size, &workgroup_size, 0, NULL, NULL); if (err) { fail("Unable to enqueue kernel"); } if (i < iterations - 1) { // Copy the output to the input for the next iteration err = clEnqueueCopyBuffer(queue, output, input, 0, 0, sizeof(board), 0, NULL, NULL); if (err) { fail("Unable to enqueue copy"); } } } err = clEnqueueReadBuffer(queue, output, CL_TRUE, 0, sizeof(board), board, 0, NULL, NULL ); if (err != CL_SUCCESS) { fail("Unable to read results"); } }
We use clEnqueue*() functions, all of which are asynchronous operations, except for the last one. Setting the third argument of clEnqueueReadBuffer() to CL_TRUE makes this read synchronous. The read instruction will be added to the queue, but this call won't return until after the read operation has finished.
If you want to flush a set of OpenCL commands explicitly, you can use the clFinish() function, which blocks until all previously queued commands have completed.
We copy the data directly from the output buffer into the input buffer if we're running another iteration of the game without displaying the results. Using clEnqueueCopyBuffer() for this purpose is much more efficient than reading the data out and writing it back again, which often involves copying across the PCIe (or AGP) bus, whereas copying in OpenCL just copies in GPU memory. Ideally, you would just allocate both buffers read-write and then swap the order around, eliminating the copy altogether. You might want to try modifying this example to use that technique.