8.2. Integer Support
The SMs have the full complement of 32-bit integer operations.
- Addition with optional negation of an operand for subtraction
- Multiplication and multiply-add
- Integer division
- Logical operations
- Condition code manipulation
- Conversion to/from floating point
- Miscellaneous operations (e.g., SIMD instructions for narrow integers, population count, find first zero)
CUDA exposes most of this functionality through standard C operators. Nonstandard operations, such as 24-bit multiplication, may be accessed using inline PTX assembly or intrinsic functions.
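As a minimal sketch of the three routes just mentioned, the kernel below computes the same product with the standard C operator, with the __mul24() intrinsic, and with inline PTX. The names mul24_ptx, mul24_kernel, and run_mul24 are hypothetical, and the inline assembly assumes the PTX mul24.lo.s32 instruction; treat it as an illustration, not a recommended implementation.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: signed 24-bit multiply issued through inline PTX.
__device__ int mul24_ptx(int a, int b)
{
    int d;
    asm("mul24.lo.s32 %0, %1, %2;" : "=r"(d) : "r"(a), "r"(b));
    return d;
}

__global__ void mul24_kernel(int a, int b, int *out)
{
    out[0] = a * b;            // standard C operator (full 32-bit multiply)
    out[1] = __mul24(a, b);    // intrinsic function
    out[2] = mul24_ptx(a, b);  // inline PTX assembly
}

// Hypothetical host wrapper: launches one thread and copies the three
// results back. Returns false if any CUDA call fails.
bool run_mul24(int a, int b, int out[3])
{
    int *d_out = nullptr;
    if (cudaMalloc(&d_out, 3 * sizeof(int)) != cudaSuccess)
        return false;
    mul24_kernel<<<1, 1>>>(a, b, d_out);
    bool ok = cudaMemcpy(out, d_out, 3 * sizeof(int),
                         cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(d_out);
    return ok;
}
```

The three results agree only while both operands fit in 24 bits; beyond that, the 24-bit forms ignore the upper 8 bits of each input.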
8.2.1. Multiplication
Multiplication is implemented differently on Tesla- and Fermi-class hardware. Tesla implements a 24-bit multiplier, while Fermi implements a 32-bit multiplier. As a consequence, full 32-bit multiplication on SM 1.x hardware requires four instructions. For performance-sensitive code targeting Tesla-class hardware, it is a performance win to use the intrinsics for 24-bit multiply. Table 8.4 shows the intrinsics related to multiplication.
Table 8.4 Multiplication Intrinsics
| INTRINSIC | DESCRIPTION |
|---|---|
| __[u]mul24 | Returns the least significant 32 bits of the product of the 24 least significant bits of the integer parameters. The 8 most significant bits of the inputs are ignored. |
| __[u]mulhi | Returns the most significant 32 bits of the product of the inputs. |
| __[u]mul64hi | Returns the most significant 64 bits of the product of the 64-bit inputs. |
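The "high half" intrinsics are easiest to understand next to the ordinary multiply, which returns the low half. The sketch below assembles a full 64-bit product from two 32-bit inputs using operator* and __umulhi(), and also calls __umul64hi() on a pair of 64-bit inputs. The kernel and the host wrapper run_widemul are hypothetical names introduced for illustration.

```cuda
#include <cuda_runtime.h>

__global__ void widemul_kernel(unsigned int a, unsigned int b,
                               unsigned long long x, unsigned long long y,
                               unsigned long long *out)
{
    unsigned int lo = a * b;           // least significant 32 bits
    unsigned int hi = __umulhi(a, b);  // most significant 32 bits
    out[0] = ((unsigned long long)hi << 32) | lo;

    // Most significant 64 bits of the 128-bit product x * y.
    out[1] = __umul64hi(x, y);
}

// Hypothetical host wrapper: launches one thread and copies both results
// back. Returns false if any CUDA call fails.
bool run_widemul(unsigned int a, unsigned int b,
                 unsigned long long x, unsigned long long y,
                 unsigned long long out[2])
{
    unsigned long long *d_out = nullptr;
    if (cudaMalloc(&d_out, 2 * sizeof(unsigned long long)) != cudaSuccess)
        return false;
    widemul_kernel<<<1, 1>>>(a, b, x, y, d_out);
    bool ok = cudaMemcpy(out, d_out, 2 * sizeof(unsigned long long),
                         cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(d_out);
    return ok;
}
```

For example, with a = b = 0xFFFFFFFF the assembled product is 0xFFFFFFFE00000001, and with x = 2^63 and y = 4 the high half of the 128-bit product is 2.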
8.2.2. Miscellaneous (Bit Manipulation)
The CUDA compiler implements a number of intrinsics for bit manipulation, as summarized in Table 8.5. On SM 2.x and later architectures, these intrinsics map to single instructions. On pre-Fermi architectures, they are valid but may compile into many instructions. When in doubt, disassemble and look at the microcode! 64-bit variants have “ll” (two ells for “long long”) appended to the intrinsic name: __clzll(), __ffsll(), __popcll(), __brevll().
Table 8.5 Bit Manipulation Intrinsics
| INTRINSIC | SUMMARY | DESCRIPTION |
|---|---|---|
| __brev(x) | Bit reverse | Reverses the order of bits in a word |
| __byte_perm(x,y,s) | Permute bytes | Returns a 32-bit word whose bytes were selected from the two inputs according to the selector parameter s |
| __clz(x) | Count leading zeros | Returns number of zero bits (0–32) before most significant set bit |
| __ffs(x) | Find first set bit | Returns the position of the least significant set bit. The least significant bit is position 1. For an input of 0, __ffs() returns 0. |
| __popc(x) | Population count | Returns the number of set bits |
| __[u]sad(x,y,z) | Sum of absolute differences | Adds the absolute value of (x-y) to z and returns the result |
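To make the table concrete, the following sketch applies each intrinsic to a single input word. The kernel name, the host wrapper run_bitops, and the choice of selector and operands for __byte_perm() and __usad() are assumptions made for illustration.

```cuda
#include <cuda_runtime.h>

__global__ void bitops_kernel(unsigned int x, unsigned int *out)
{
    out[0] = __brev(x);                  // bit reverse
    out[1] = __clz((int)x);              // count leading zeros
    out[2] = __ffs((int)x);              // 1-based position of lowest set bit
    out[3] = __popc(x);                  // population count
    out[4] = __byte_perm(x, 0, 0x0123);  // selector 0x0123 swaps byte order of x
    out[5] = __usad(x, 0x100u, 1u);      // |x - 0x100| + 1
}

// Hypothetical host wrapper: launches one thread and copies the six
// results back. Returns false if any CUDA call fails.
bool run_bitops(unsigned int x, unsigned int out[6])
{
    unsigned int *d_out = nullptr;
    if (cudaMalloc(&d_out, 6 * sizeof(unsigned int)) != cudaSuccess)
        return false;
    bitops_kernel<<<1, 1>>>(x, d_out);
    bool ok = cudaMemcpy(out, d_out, 6 * sizeof(unsigned int),
                         cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(d_out);
    return ok;
}
```

For x = 0x000000F0, the results are 0x0F000000 (bit reverse), 24 (leading zeros), 5 (first set bit), 4 (population count), 0xF0000000 (byte swap), and 17 (sum of absolute differences).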
8.2.3. Funnel Shift (SM 3.5)
GK110 added a 64-bit “funnel shift” instruction that concatenates two 32-bit values together (the least significant and most significant halves are specified as separate 32-bit inputs, but the hardware operates on an aligned register pair), shifts the resulting 64-bit value left or right, and then returns the most significant (for left shift) or least significant (for right shift) 32 bits.
Funnel shift may be accessed with the intrinsics given in Table 8.6. These intrinsics are implemented as inline device functions (using inline PTX assembler) in sm_35_intrinsics.h. By default, the shift count is masked to its least significant 5 bits (sh & 31), so it wraps modulo 32; the _lc and _rc intrinsics instead clamp the shift value to the range 0..32.
Table 8.6 Funnel Shift Intrinsics
| INTRINSIC | DESCRIPTION |
|---|---|
| __funnelshift_l(lo, hi, sh) | Concatenates [hi:lo] into a 64-bit quantity, shifts it left by (sh&31) bits, and returns the most significant 32 bits |
| __funnelshift_lc(lo, hi, sh) | Concatenates [hi:lo] into a 64-bit quantity, shifts it left by min(sh,32) bits, and returns the most significant 32 bits |
| __funnelshift_r(lo, hi, sh) | Concatenates [hi:lo] into a 64-bit quantity, shifts it right by (sh&31) bits, and returns the least significant 32 bits |
| __funnelshift_rc(lo, hi, sh) | Concatenates [hi:lo] into a 64-bit quantity, shifts it right by min(sh,32) bits, and returns the least significant 32 bits |
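The difference between the wrapping and clamping variants only shows up for shift counts of 32 or more. The sketch below runs all four intrinsics on the same inputs so the two behaviors can be compared. Note that the CUDA headers declare these intrinsics with the low word as the first argument, followed by the high word and the shift count. The kernel and the host wrapper run_funnel are hypothetical names, and the code assumes compilation for SM 3.5 or later.

```cuda
#include <cuda_runtime.h>

__global__ void funnel_kernel(unsigned int lo, unsigned int hi,
                              unsigned int sh, unsigned int *out)
{
    out[0] = __funnelshift_l (lo, hi, sh);  // left,  shift count sh & 31
    out[1] = __funnelshift_lc(lo, hi, sh);  // left,  shift count min(sh, 32)
    out[2] = __funnelshift_r (lo, hi, sh);  // right, shift count sh & 31
    out[3] = __funnelshift_rc(lo, hi, sh);  // right, shift count min(sh, 32)
}

// Hypothetical host wrapper: launches one thread and copies the four
// results back. Returns false if any CUDA call fails.
bool run_funnel(unsigned int lo, unsigned int hi, unsigned int sh,
                unsigned int out[4])
{
    unsigned int *d_out = nullptr;
    if (cudaMalloc(&d_out, 4 * sizeof(unsigned int)) != cudaSuccess)
        return false;
    funnel_kernel<<<1, 1>>>(lo, hi, sh, d_out);
    bool ok = cudaMemcpy(out, d_out, 4 * sizeof(unsigned int),
                         cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(d_out);
    return ok;
}
```

With hi = 0x12345678, lo = 0x9ABCDEF0, and sh = 32, the wrapping forms behave as a shift by 0 and return hi (left) and lo (right), while the clamping forms shift by a full 32 bits and return lo (left) and hi (right). For sh = 8 both forms agree: 0x3456789A for the left shifts and 0x789ABCDE for the right shifts.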
Applications for funnel shift include the following.
- Multiword shift operations
- Memory copies between misaligned buffers using aligned loads and stores
- Rotate
To right-shift data sizes greater than 64 bits, use repeated __funnelshift_r() calls, operating from the least significant to the most significant word. The most significant word of the result is computed using operator>>, which shifts in zero or sign bits as appropriate for the integer type. To left-shift data sizes greater than 64 bits, use repeated __funnelshift_l() calls, operating from the most significant to the least significant word. The least significant word of the result is computed using operator<<. If the hi and lo parameters are the same, the funnel shift effects a rotate operation.
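The multiword right shift and the rotate described above can be sketched as follows. The example assumes a 128-bit value stored as four 32-bit words with w[0] least significant, a shift count in the range 0..31, and compilation for SM 3.5 or later; shr128, rotl32, shift_kernel, and run_shift_demo are hypothetical names introduced for illustration.

```cuda
#include <cuda_runtime.h>

// In-place 128-bit logical right shift by sh bits (0 <= sh <= 31).
// Working from the least significant word upward means each word is
// read before it is overwritten.
__device__ void shr128(unsigned int w[4], unsigned int sh)
{
    w[0] = __funnelshift_r(w[0], w[1], sh);
    w[1] = __funnelshift_r(w[1], w[2], sh);
    w[2] = __funnelshift_r(w[2], w[3], sh);
    w[3] = w[3] >> sh;  // unsigned type, so zeros are shifted in
}

// Rotate left: passing the same word as both halves turns the funnel
// shift into a rotate.
__device__ unsigned int rotl32(unsigned int x, unsigned int n)
{
    return __funnelshift_l(x, x, n);
}

// buf[0..3] holds the 128-bit value; buf[4] holds a word to rotate.
__global__ void shift_kernel(unsigned int *buf, unsigned int sh)
{
    shr128(buf, sh);
    buf[4] = rotl32(buf[4], sh);
}

// Hypothetical host wrapper: copies buf to the device, runs one thread,
// and copies the results back. Returns false if any CUDA call fails.
bool run_shift_demo(unsigned int buf[5], unsigned int sh)
{
    unsigned int *d_buf = nullptr;
    const size_t bytes = 5 * sizeof(unsigned int);
    if (cudaMalloc(&d_buf, bytes) != cudaSuccess)
        return false;
    bool ok = cudaMemcpy(d_buf, buf, bytes,
                         cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        shift_kernel<<<1, 1>>>(d_buf, sh);
        ok = cudaMemcpy(buf, d_buf, bytes,
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_buf);
    return ok;
}
```

A left shift would mirror this pattern with __funnelshift_l(), starting from the most significant word and finishing the least significant word with operator<<.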