- Understanding the GPU
- Setting Up OpenCL
- Running the Game
- OpenCL and OpenGL
Setting Up OpenCL
void createQueue(void) { int err; err = clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, 1, &device_id, NULL); if (err != CL_SUCCESS) { fail("Unable to enumerate device IDs"); } context = clCreateContext(0, 1, &device_id, NULL, NULL, &err); if (!context) { fail("Unable to create context"); } queue = clCreateCommandQueue(context, device_id, 0, &err); if (!queue) { fail("Unable to create command queue"); } }
Basic OpenCL use requires three objects: a device, a context, and a queue. The device can be a CPU, a GPU, or something more esoteric such as a Cell SPU or a DSP. For this example, we'll just get the first device that the clGetDeviceIDs() returns, irrespective of its type. You can also request just GPU or CPU devices.
Once you have one or more devices, you need to create a context, which is an object that manages a group of devices and memory resources. You can think of a context as a virtual device. You copy data to and from contexts and run kernels on contexts. In turn, the context runs kernels on devices.
Finally, we need to create a command queue. As I said earlier, all OpenCL interactions happen via a queue. You submit operations to the queue, and those operations are completed asynchronously in the background. By default, queues are first-in, first-out. Each command you submit to the queue will execute in order, although individual commands such as executing a kernel may be parallel. You can also enable out-of-order mode for queues and use explicit synchronization between kernels.
Writing an OpenCL Kernel
const char *kernel_source = "__kernel void life( \n" " constant bool*input, \n" " global bool* output, \n" " const unsigned int height, \n" " const unsigned int width) \n" "{ \n" " int i = get_global_id(0); \n" " int rowUp = i - width; \n" " int rowDown = i + width; \n" " bool outOfBounds = (i < width); \n" " outOfBounds |= (i > (width * (height-1))); \n" " outOfBounds |= (i % width == 0); \n" " outOfBounds |= (i % width == width-1); \n" " if (outOfBounds) { output[i] = false; return; } \n" " int neighbours = input[rowUp-1] + input[rowUp] + input[rowUp+1]; \n" " neighbours += input[i-1] + input[i+1]; \n" " neighbours += input[rowDown-1] + input[rowDown] + input[rowDown+1]; \n" " if (neighbours == 3 || (input[i] && neighbours == 2)) \n" " output[i] = true; \n" " else \n" " output[i] = false; \n" "} \n";
OpenCL C is a language based on C99, used for writing OpenCL kernels. In this article's example, you can see it embedded in a C source file as a string, which we'll compile later.
Most of the code in this kernel should look familiar to anyone who has written C. Notice the global and constant qualifiers on the two pointer arguments. These qualifiers specify the address space on the argument. All pointers in OpenCL must be qualified in this way.
When you run a kernel, all of its instances will receive the same arguments. The get_global_id() built-in function returns the work-item ID for this invocation of the kernel. Our example uses a one-dimensional partitioning, but it's possible to use more partitioning and get the other dimensions by passing a number greater than 0 to the get_global_id() function.
For this implementation, I'm explicitly ignoring the outer border of the board, in order to demonstrate one common source of problems with OpenCL kernels: Bounds are not checked, so you must clamp your kernel's memory accesses explicitly. It's also a good idea to try to reduce the number of branches in OpenCL code, because GPUs are notoriously bad at branching.
This example doesn't use vectors at all. Like GLSL, OpenCL C has native support for vector types. You can perform the same operations on vectors as on scalars, and a lot of built-in functions are available for common vector operations.
Loading an OpenCL Kernel
Once you have the device and context set up, the next thing to do is load the kernel. The simplest approach is to load it from source:
cl_kernel createKernelFromSource(cl_device_id device_id, cl_context context, const char *source, const char *name) { int err; // Load the source cl_program program = clCreateProgramWithSource(context, 1, &source , NULL, &err); if (err != CL_SUCCESS) { fail("Unable to create program"); } // Compile it. err = clBuildProgram(program, 0, NULL, NULL, NULL, NULL); if (err != CL_SUCCESS) { size_t len; char buffer[2048]; clGetProgramBuildInfo(program, device_id, CL_PROGRAM_BUILD_LOG, sizeof(buffer), buffer, &len); fprintf(stderr, "%s\n", buffer); fail("Unable to build program"); } // Load it. cl_kernel kernel = clCreateKernel(program, name, &err); if (!kernel || err != CL_SUCCESS) { fail("Unable to create kernel"); } clReleaseProgram(program); return kernel; }
The clCreateProgramWithSource() function creates a new program object from the source. An OpenCL program can contain multiple kernels (and subroutines), just as a C program can contain multiple functions.
The program created initially is not compiled. You need to call clBuildProgram() to compile this program for the target architecture(s). When we created the program from the source, we provided the context as an argument. When we compile it, we get a binary for each of the devices in the context.
If the compilation in this example fails, we get the messages that the compiler generated. If you introduce a mistake into the kernel code, you get something like this:
$ ./a.out <program source>:4:19: error: expected ')' wibble global float* output, ^ Unable to build program
If you've ever used clang to compile code, the format of this error may look familiar, because Apple's implementation of OpenCL uses clang to compile the OpenCL sources. clang can parse all of OpenCL C, but the implementations of the built-in functions are proprietary.
If the code is compiled correctly, we can get the kernel from the resulting program. Note that we release the program at the end of this function, because we won't use it again. OpenCL uses the same retain/release mechanisms as in Objective-C and Core Foundation.