10.2 Texture Memory
Before describing the features of the fixed-function texturing hardware, let’s spend some time examining the underlying memory to which texture references may be bound.
CUDA can use texture to read from either device memory or CUDA arrays.
10.2.1 Device Memory
In device memory, the textures are addressed in row-major order. A 1024x768 texture might look like this:

Figure 10-1. 1024x768 image.
where Offset is the offset (in elements) from the base pointer of the image:
(Equation 10-1) Offset = Y*Width + X
For a byte offset, multiply by the size of the elements:
(Equation 10-2) ByteOffset = Offset*sizeof(element)

10.2.2 CUDA Arrays

The format of each CUDA array element is described by a cudaChannelFormatDesc structure:

struct cudaChannelFormatDesc {
    int x, y, z, w;
    enum cudaChannelFormatKind f;
};

The x, y, z, and w members give the width in bits of each channel, and the f member specifies whether the channel data is signed integer, unsigned integer, or floating point:
enum cudaChannelFormatKind {
    cudaChannelFormatKindSigned = 0,
    cudaChannelFormatKindUnsigned = 1,
    cudaChannelFormatKindFloat = 2,
    cudaChannelFormatKindNone = 3
};
Developers can create cudaChannelFormatDesc structures using the cudaCreateChannelDesc function:
cudaChannelFormatDesc cudaCreateChannelDesc(int x, int y, int z, int w, cudaChannelFormatKind kind);
Alternatively, a templated family of functions can be invoked as follows:
template<class T> cudaChannelFormatDesc cudaCreateChannelDesc(void);
where T may be any of the native formats supported by CUDA. Here are two examples of the specializations of this template:
template<> __inline__ __host__ cudaChannelFormatDesc
cudaCreateChannelDesc<float>(void)
{
    int e = (int)sizeof(float) * 8;
    return cudaCreateChannelDesc(e, 0, 0, 0, cudaChannelFormatKindFloat);
}

template<> __inline__ __host__ cudaChannelFormatDesc
cudaCreateChannelDesc<uint2>(void)
{
    int e = (int)sizeof(unsigned int) * 8;
    return cudaCreateChannelDesc(e, e, 0, 0, cudaChannelFormatKindUnsigned);
}
3D CUDA arrays may be allocated with cudaMalloc3DArray():
cudaError_t cudaMalloc3DArray(
    struct cudaArray **array,
    const struct cudaChannelFormatDesc *desc,
    struct cudaExtent extent,
    unsigned int flags __dv(0)
);
Rather than taking width, height and depth parameters, cudaMalloc3DArray() takes a cudaExtent structure:
struct cudaExtent {
    size_t width;
    size_t height;
    size_t depth;
};
The flags parameter, like that of cudaMallocArray(), must be cudaArraySurfaceLoadStore if the CUDA array will be used for surface read/write operations.
Driver API
The driver API equivalents of cudaMallocArray() and cudaMalloc3DArray() are cuArrayCreate() and cuArray3DCreate(), respectively:
CUresult cuArrayCreate(CUarray *pHandle, const CUDA_ARRAY_DESCRIPTOR *pAllocateArray);
CUresult cuArray3DCreate(CUarray *pHandle, const CUDA_ARRAY3D_DESCRIPTOR *pAllocateArray);

cuArray3DCreate() can be used to allocate 1D or 2D CUDA arrays by specifying 0 as the height or depth, respectively. The CUDA_ARRAY3D_DESCRIPTOR structure is as follows:
typedef struct CUDA_ARRAY3D_DESCRIPTOR_st {
    size_t Width;             /**< Width of 3D array */
    size_t Height;            /**< Height of 3D array */
    size_t Depth;             /**< Depth of 3D array */
    CUarray_format Format;    /**< Array format */
    unsigned int NumChannels; /**< Channels per array element */
    unsigned int Flags;       /**< Flags */
} CUDA_ARRAY3D_DESCRIPTOR;
Together, the Format and NumChannels members describe the size of each element of the CUDA array: NumChannels may be 1, 2 or 4, and Format specifies the channels’ type, as follows:
typedef enum CUarray_format_enum {
    CU_AD_FORMAT_UNSIGNED_INT8  = 0x01,
    CU_AD_FORMAT_UNSIGNED_INT16 = 0x02,
    CU_AD_FORMAT_UNSIGNED_INT32 = 0x03,
    CU_AD_FORMAT_SIGNED_INT8    = 0x08,
    CU_AD_FORMAT_SIGNED_INT16   = 0x09,
    CU_AD_FORMAT_SIGNED_INT32   = 0x0a,
    CU_AD_FORMAT_HALF           = 0x10,
    CU_AD_FORMAT_FLOAT          = 0x20
} CUarray_format;
Sometimes, CUDA array handles are passed to subroutines that need to query the dimensions and/or format of the input array. The cuArray3DGetDescriptor() function is provided for that purpose:
CUresult cuArray3DGetDescriptor(CUDA_ARRAY3D_DESCRIPTOR *pArrayDescriptor, CUarray hArray);
Note that this function may be called on 1D and 2D CUDA arrays, even ones that were created with cuArrayCreate().
10.2.3 Device Memory versus CUDA Arrays
For applications that exhibit sparse access patterns, especially patterns with dimensional locality (for example, computer vision applications), CUDA arrays are a clear win. For applications with regular access patterns, especially ones with little to no reuse or whose reuse can be explicitly managed by the application in shared memory, device pointers are the obvious choice.
Some applications, such as image processing applications, fall into a gray area where the choice between device pointers and CUDA arrays is not obvious. All other things being equal, device memory is probably preferable to CUDA arrays; but the following considerations may be used to help in the decision-making process.
- Until CUDA 3.2, CUDA kernels could not write to CUDA arrays; they were only able to read from them via texture intrinsics. CUDA 3.2 added the ability for Fermi-class hardware to access 2D CUDA arrays via “surface read/write” intrinsics.
- CUDA arrays do not consume any CUDA address space.
- On WDDM drivers (Windows Vista and later), the system can automatically manage the residence of CUDA arrays: they can be swapped into and out of device memory transparently, depending on whether they are needed by the CUDA kernels that are executing. In contrast, WDDM requires all device memory to be resident in order for any kernel to execute.
- CUDA arrays can reside only in device memory, but the GPU can convert between the pitch and CUDA array representations while transferring the data across the bus. For some applications, keeping a pitch representation in host memory and a CUDA array representation in device memory is the best fit.