8.3. Floating-Point Support
Fast native floating-point hardware is the raison d’être for GPUs, and in many ways they are equal to or superior to CPUs in their floating-point implementation. Denormals are supported at full speed, directed rounding may be specified on a per-instruction basis, and the Special Function Units deliver high-performance approximations for six popular single-precision transcendentals. In contrast, x86 CPUs implement denormals in microcode that runs perhaps 100x slower than operating on normalized floating-point operands. Rounding direction is specified by a control word that takes dozens of clock cycles to change, and the only transcendental approximation functions in the SSE instruction set are for reciprocal and reciprocal square root, which give 12-bit approximations that must be refined with a Newton-Raphson iteration before being used.
Since GPUs’ greater core counts are offset somewhat by their lower clock frequencies, developers can expect at most a 10x (or thereabouts) speedup on a level playing field. If a paper reports a 100x or greater speedup from porting an optimized CPU implementation to CUDA, chances are one of the above-described “instruction set mismatches” played a role.
8.3.1. Formats
Figure 8.2 depicts the three IEEE standard floating-point formats supported by CUDA: double precision (64-bit), single precision (32-bit), and half precision (16-bit). Each value is divided into three fields: sign, exponent, and mantissa. For double, single, and half, the exponent fields are 11, 8, and 5 bits in size, respectively; the corresponding mantissa fields are 52, 23, and 10 bits.
Figure 8.2 Floating-point formats.
The exponent field changes the interpretation of the floating-point value. The most common (“normal”) representation encodes an implicit 1 bit into the mantissa and multiplies that value by 2^(e-bias), where bias is the value added to the actual exponent before encoding it into the floating-point representation. The bias for single precision, for example, is 127.
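To make the layout concrete, the following host-side C sketch (our own illustration, not from the text; the helper name decode_float is arbitrary) pulls apart the sign, exponent, and mantissa fields of a single-precision value using the field widths and bias just described.

```c
#include <stdio.h>
#include <string.h>
#include <stdint.h>

// Decode the sign, biased exponent, and mantissa fields of a 32-bit float.
// Field widths (1/8/23 bits) and the bias of 127 follow the text above.
static void decode_float( float f )
{
    uint32_t u;
    memcpy( &u, &f, sizeof(u) );           // portable bit copy (no aliasing issues)
    uint32_t sign     = u >> 31;
    uint32_t exponent = (u >> 23) & 0xFF;  // 8-bit biased exponent
    uint32_t mantissa = u & 0x7FFFFF;      // 23-bit mantissa
    printf( "%g = sign %u, exponent %u (unbiased %d), mantissa 0x%06X\n",
            f, (unsigned) sign, (unsigned) exponent,
            (int) exponent - 127, (unsigned) mantissa );
}

int main()
{
    decode_float( 1.0f );    // sign 0, exponent 127, mantissa 0
    decode_float( -2.5f );   // sign 1, exponent 128, mantissa 0x200000
    return 0;
}
```

For 1.0f the program prints a biased exponent of 127 and a zero mantissa, matching the Normal case described in Table 8.7.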
Table 8.7 summarizes how floating-point values are encoded. For most exponent values (so-called “normal” floating-point values), the mantissa is assumed to have an implicit 1, and it is multiplied by the biased value of the exponent. The maximum exponent value is reserved for infinity and Not-A-Number values. Dividing by zero (or overflowing a division) yields infinity; performing an invalid operation (such as taking the square root or logarithm of a negative number) yields a NaN. The minimum exponent value is reserved for values too small to represent with the implicit leading 1. As the so-called denormals get closer to zero, they lose bits of effective precision, a phenomenon known as gradual underflow. Table 8.8 gives the encodings and values of certain extreme values for the three formats.
Table 8.7 Floating-Point Representations
DOUBLE PRECISION

EXPONENT | MANTISSA | VALUE | CASE NAME
0 | 0 | ±0 | Zero
0 | Nonzero | ±2^-1022 [0.mantissa] | Denormal
1 to 2046 | Any | ±2^(e-1023) [1.mantissa] | Normal
2047 | 0 | ±∞ | Infinity
2047 | Nonzero | NaN | Not-A-Number

SINGLE PRECISION

EXPONENT | MANTISSA | VALUE | CASE NAME
0 | 0 | ±0 | Zero
0 | Nonzero | ±2^-126 [0.mantissa] | Denormal
1 to 254 | Any | ±2^(e-127) [1.mantissa] | Normal
255 | 0 | ±∞ | Infinity
255 | Nonzero | NaN | Not-A-Number

HALF PRECISION

EXPONENT | MANTISSA | VALUE | CASE NAME
0 | 0 | ±0 | Zero
0 | Nonzero | ±2^-14 [0.mantissa] | Denormal
1 to 30 | Any | ±2^(e-15) [1.mantissa] | Normal
31 | 0 | ±∞ | Infinity
31 | Nonzero | NaN | Not-A-Number
Table 8.8 Floating-Point Extreme Values
DOUBLE PRECISION

VALUE | HEXADECIMAL | EXACT VALUE
Smallest denormal | 0...0001 | 2^-1074
Largest denormal | 000F...F | 2^-1022 × (1 - 2^-52)
Smallest normal | 0010...0 | 2^-1022
1.0 | 3FF0...0 | 1
Maximum integer | 4340...0 | 2^53
Largest normal | 7FEF...F | 2^1024 × (1 - 2^-53)
Infinity | 7FF0...0 | Infinity

SINGLE PRECISION

VALUE | HEXADECIMAL | EXACT VALUE
Smallest denormal | 00000001 | 2^-149
Largest denormal | 007FFFFF | 2^-126 × (1 - 2^-23)
Smallest normal | 00800000 | 2^-126
1.0 | 3F800000 | 1
Maximum integer | 4B800000 | 2^24
Largest normal | 7F7FFFFF | 2^128 × (1 - 2^-24)
Infinity | 7F800000 | Infinity

HALF PRECISION

VALUE | HEXADECIMAL | EXACT VALUE
Smallest denormal | 0001 | 2^-24
Largest denormal | 03FF | 2^-14 × (1 - 2^-10)
Smallest normal | 0400 | 2^-14
1.0 | 3C00 | 1
Maximum integer | 6800 | 2^11
Largest normal | 7BFF | 2^16 × (1 - 2^-11)
Infinity | 7C00 | Infinity
Rounding
The IEEE standard provides for four rounding modes.
- Round-to-nearest-even (also called “round-to-nearest”)
- Round toward zero (also called “truncate” or “chop”)
- Round down (or “round toward negative infinity”)
- Round up (or “round toward positive infinity”)
Round-to-nearest, where intermediate values are rounded to the nearest representable floating-point value after each operation, is by far the most commonly used rounding mode. Round up and round down (the “directed rounding modes”) are used for interval arithmetic, where a pair of floating-point values is used to bracket the intermediate result of a computation. To correctly bracket a result, the lower and upper values of the interval must be rounded toward negative infinity (“down”) and toward positive infinity (“up”), respectively.
The C language does not provide any way to specify round modes on a per-instruction basis, and CUDA hardware does not provide a control word to implicitly specify rounding modes. Consequently, CUDA provides a set of intrinsics to specify the round mode of an operation, as summarized in Table 8.9.
Table 8.9 Intrinsics for Rounding
INTRINSIC | OPERATION
__fadd_[rn|rz|ru|rd] | Addition (single precision)
__fmul_[rn|rz|ru|rd] | Multiplication (single precision)
__fmaf_[rn|rz|ru|rd] | Fused multiply-add (single precision)
__frcp_[rn|rz|ru|rd] | Reciprocal (single precision)
__fdiv_[rn|rz|ru|rd] | Division (single precision)
__fsqrt_[rn|rz|ru|rd] | Square root (single precision)
__dadd_[rn|rz|ru|rd] | Addition (double precision)
__dmul_[rn|rz|ru|rd] | Multiplication (double precision)
__fma_[rn|rz|ru|rd] | Fused multiply-add (double precision)
__drcp_[rn|rz|ru|rd] | Reciprocal (double precision)
__ddiv_[rn|rz|ru|rd] | Division (double precision)
__dsqrt_[rn|rz|ru|rd] | Square root (double precision)
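As a sketch of the interval-arithmetic usage described above (the device function below is our own illustration, not part of the CUDA runtime), the lower bound of a running sum can be accumulated with __fadd_rd() and the upper bound with __fadd_ru(), so that the true sum is always bracketed.

```cuda
// Accumulate a rigorous [lo, hi] bracket of the sum of n floats.
__device__ void intervalSum( const float *x, int n, float *lo, float *hi )
{
    float sumLo = 0.0f, sumHi = 0.0f;
    for ( int i = 0; i < n; i++ ) {
        sumLo = __fadd_rd( sumLo, x[i] );   // round toward -infinity: never too high
        sumHi = __fadd_ru( sumHi, x[i] );   // round toward +infinity: never too low
    }
    *lo = sumLo;
    *hi = sumHi;
}
```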
Conversion
In general, developers can convert between different floating-point representations and/or integers using standard C constructs: implicit conversion or explicit typecasts. If necessary, however, developers can use the intrinsics listed in Table 8.10 to perform conversions that are not in the C language specification, such as those with directed rounding.
Table 8.10 Intrinsics for Conversion
INTRINSIC | OPERATION
__float2int_[rn|rz|ru|rd] | float to int
__float2uint_[rn|rz|ru|rd] | float to unsigned int
__int2float_[rn|rz|ru|rd] | int to float
__uint2float_[rn|rz|ru|rd] | unsigned int to float
__float2ll_[rn|rz|ru|rd] | float to 64-bit int
__ll2float_[rn|rz|ru|rd] | 64-bit int to float
__ull2float_[rn|rz|ru|rd] | unsigned 64-bit int to float
__double2float_[rn|rz|ru|rd] | double to float
__double2int_[rn|rz|ru|rd] | double to int
__double2uint_[rn|rz|ru|rd] | double to unsigned int
__double2ll_[rn|rz|ru|rd] | double to 64-bit int
__double2ull_[rn|rz|ru|rd] | double to unsigned 64-bit int
__int2double_rn | int to double
__uint2double_rn | unsigned int to double
__ll2double_[rn|rz|ru|rd] | 64-bit int to double
__ull2double_[rn|rz|ru|rd] | unsigned 64-bit int to double
Because half is not standardized in the C programming language, CUDA uses unsigned short in the interfaces for __half2float() and __float2half(). __float2half() only supports the round-to-nearest rounding mode.
float __half2float( unsigned short );
unsigned short __float2half( float );
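A minimal sketch of the storage-only usage pattern (the kernels are our own, written against the interfaces given above): data is converted to half on the way out to memory and back to float on the way in.

```cuda
// Pack a float array into 16-bit storage; half values travel as unsigned short.
__global__ void FloatToHalfArray( unsigned short *out, const float *in, size_t N )
{
    for ( size_t i = blockIdx.x*blockDim.x + threadIdx.x; i < N;
          i += blockDim.x*gridDim.x )
        out[i] = __float2half( in[i] );      // round-to-nearest conversion
}

// Expand 16-bit storage back to float for computation.
__global__ void HalfToFloatArray( float *out, const unsigned short *in, size_t N )
{
    for ( size_t i = blockIdx.x*blockDim.x + threadIdx.x; i < N;
          i += blockDim.x*gridDim.x )
        out[i] = __half2float( in[i] );
}
```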
8.3.2. Single Precision (32-Bit)
Single-precision floating-point support is the workhorse of GPU computation. GPUs have been optimized to natively deliver high performance on this data type, not only for core standard IEEE operations such as addition and multiplication, but also for nonstandard operations such as approximations to transcendentals such as sin() and log(). The 32-bit values are held in the same register file as integers, so coercion between single-precision floating-point values and 32-bit integers (with __float_as_int() and __int_as_float()) is free.
Addition, Multiplication, and Multiply-Add
The compiler automatically translates +, –, and * operators on floating-point values into addition, multiplication, and multiply-add instructions. The __fadd_rn() and __fmul_rn() intrinsics may be used to suppress fusion of addition and multiplication operations into multiply-add instructions.
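The difference is easiest to see side by side; the two device functions below are our own illustration. In the first, the compiler is free to contract the multiply and add into a single multiply-add instruction; in the second, the intrinsics force separately rounded operations.

```cuda
__device__ float maddContracted( float a, float b, float c )
{
    return a*b + c;                          // may compile to a single multiply-add
}

__device__ float maddSeparate( float a, float b, float c )
{
    // __fmul_rn()/__fadd_rn() each round their result, so no contraction occurs.
    return __fadd_rn( __fmul_rn( a, b ), c );
}
```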
Reciprocal and Division
For devices of compute capability 2.x and higher, the division operator is IEEE-compliant when the code is compiled with --prec-div=true. For devices of compute capability 1.x, or for devices of compute capability 2.x when the code is compiled with --prec-div=false, the division operator and __fdividef(x,y) have the same accuracy, but for 2^126 < y < 2^128, __fdividef(x,y) delivers a result of zero, whereas the division operator delivers the correct result. Also, for 2^126 < y < 2^128, if x is infinity, __fdividef(x,y) returns NaN, while the division operator returns infinity.
Transcendentals (SFU)
The Special Function Units (SFUs) in the SMs implement very fast versions of six common transcendental functions.
- Sine and cosine
- Logarithm and exponential
- Reciprocal and reciprocal square root
Table 8.11, excerpted from the paper on the Tesla architecture, summarizes the supported operations and corresponding precision. The SFUs do not implement full precision, but they are reasonably good approximations of these functions and they are fast. For CUDA ports that are significantly faster than an optimized CPU equivalent (say, 25x or more), the code most likely relies on the SFUs.
Table 8.11 SFU Accuracy
FUNCTION | ACCURACY (GOOD BITS) | ULP ERROR
1/x | 24.02 | 0.98
1/sqrt(x) | 23.40 | 1.52
2^x | 22.51 | 1.41
log2 x | 22.57 | n/a
sin/cos | 22.47 | n/a
The SFUs are accessed with the intrinsics given in Table 8.12. Specifying the --use_fast_math compiler option causes the compiler to substitute the corresponding SFU intrinsics for conventional C runtime calls.
Table 8.12 SFU Intrinsics
INTRINSIC | OPERATION
__cosf(x) | cos x
__exp10f(x) | 10^x
__expf(x) | e^x
__fdividef(x,y) | x/y
__log2f(x) | log2 x
__log10f(x) | log10 x
__powf(x,y) | x^y
__sinf(x) | sin x
__sincosf(x,sptr,cptr) | *sptr=sin(x); *cptr=cos(x);
__tanf(x) | tan x
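The kernel below (our own sketch) contrasts the math library’s sinf() with the SFU intrinsic __sinf(); building with --use_fast_math makes the compiler perform this substitution automatically.

```cuda
// The math library call is accurate to a few ulps; the SFU intrinsic trades
// accuracy (see Table 8.11) for much higher throughput.
__global__ void SineBothWays( float *accurate, float *fast, const float *x, size_t N )
{
    for ( size_t i = blockIdx.x*blockDim.x + threadIdx.x; i < N;
          i += blockDim.x*gridDim.x ) {
        accurate[i] = sinf( x[i] );
        fast[i]     = __sinf( x[i] );
    }
}
```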
Miscellaneous
__saturate(x) returns 0 if x<0, 1 if x>1, and x otherwise.
8.3.3. Double Precision (64-Bit)
Double-precision floating-point support was added to CUDA with SM 1.3 (first implemented in the GeForce GTX 280), and much-improved double-precision support (both functionality and performance) became available with SM 2.0. CUDA’s hardware support for double precision features full-speed denormals and, starting in SM 2.x, a native fused multiply-add instruction (FMA), compliant with IEEE 754-2008, that performs only one rounding step. Besides being an intrinsically useful operation, FMA enables full accuracy on certain functions that converge via Newton-Raphson iteration.
As with single-precision operations, the compiler automatically translates standard C operators into multiplication, addition, and multiply-add instructions. The __dadd_rn() and __dmul_rn() intrinsics may be used to suppress fusion of addition and multiplication operations into multiply-add instructions.
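To illustrate why FMA matters for Newton-Raphson convergence, here is a sketch (our own, not CUDA’s actual division implementation) of one refinement step for a reciprocal: the fused operation computes the residual 1 - a·x without an intermediate rounding.

```cuda
// One Newton-Raphson step for x ~= 1/a:  x' = x + x*(1 - a*x).
__device__ double refineReciprocal( double a, double x )
{
    double e = fma( -a, x, 1.0 );   // residual 1 - a*x, computed with one rounding
    return fma( x, e, x );          // x + x*e
}
```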
8.3.4. Half Precision (16-Bit)
With 5 bits of exponent and 10 bits of significand, half values have enough precision for HDR (high dynamic range) images and can be used to hold other types of values that do not require float precision, such as angles. Half precision values are intended for storage, not computation, so the hardware only provides instructions to convert to/from 32-bit floating point. These instructions are exposed as the __half2float() and __float2half() intrinsics.
float __half2float( unsigned short );
unsigned short __float2half( float );
These intrinsics use unsigned short because the C language has not standardized the half floating-point type.
8.3.5. Case Study: float→half Conversion
Studying the float→half conversion operation is a useful way to learn the details of floating-point encodings and rounding. Because it’s a simple unary operation, we can focus on the encoding and rounding without getting distracted by the details of floating-point arithmetic and the precision of intermediate representations.
When converting from float to half, the correct output for any float too large to represent is half infinity. Any float too small to represent as a half (even a denormal half) must be clamped to 0.0. The maximum float that rounds to half 0.0 is 0x32FFFFFF, or approximately 2.98 × 10^-8, while the smallest float that rounds to half infinity is 65520.0. float values inside this range can be converted to half by propagating the sign bit, rebiasing the exponent (since float has an 8-bit exponent biased by 127 and half has a 5-bit exponent biased by 15), and rounding the float mantissa to the nearest half mantissa value. Rounding is straightforward in all cases except when the input value falls exactly between the two possible output values. When this is the case, the IEEE standard specifies rounding to the “nearest even” value. In decimal arithmetic, this would mean rounding 1.5 to 2.0, but also rounding 2.5 to 2.0 and (for example) rounding 0.5 to 0.0.
Listing 8.3 shows a C routine that exactly replicates the float-to-half conversion operation, as implemented by CUDA hardware. The variables exp and mag contain the input exponent and “magnitude,” the mantissa and exponent together with the sign bit masked off. Many operations, such as comparisons and rounding operations, can be performed on the magnitude without separating the exponent and mantissa.
The macro LG_MAKE_MASK, used in Listing 8.3, creates a mask with a given bit count: #define LG_MAKE_MASK(bits) ((1<<bits)-1). A volatile union is used to treat the same 32-bit value as float and unsigned int; idioms such as *((float *) (&u)) are not portable. The routine first propagates the input sign bit and masks it off the input.
After extracting the magnitude and exponent, the function deals with the special case when the input float is INF or NaN and does an early exit. Note that INF is signed, but NaN has a canonical unsigned value. The next block of code clamps the input float value to the minimum or maximum values that correspond to representable half values and recomputes the magnitude for clamped values. Don’t be fooled by the elaborate code constructing f32MinRInfin and f32MaxRf16_zero; those are constants with the values 0x477ff000 and 0x32ffffff, respectively.
The remainder of the routine deals with the cases of output normal and denormal (input denormals are clamped in the preceding code, so mag corresponds to a normal float). As with the clamping code, f32Minf16Normal is a constant, and its value is 0x38ffffff.
To construct a normal, the new exponent must be computed and the correctly rounded 10 bits of mantissa shifted into the output. To construct a denormal, the implicit 1 must be OR’d into the output mantissa and the resulting mantissa shifted by the amount corresponding to the input exponent. For both normals and denormals, the output mantissa is rounded in two steps. The first step adds a mask of 1s that ends just short of the output’s LSB, as seen in Figure 8.3.
Figure 8.3 Rounding mask (half).
This operation increments the output mantissa if bit 12 of the input is set; if the input mantissa is all 1’s, the overflow causes the output exponent to correctly increment. If we added one more 1 to the MSB of this adjustment, we’d have elementary school–style rounding where the tiebreak goes to the larger number. Instead, to implement round-to-nearest even, we conditionally increment the output mantissa if the LSB of the 10-bit output is set (Figure 8.4). Note that these steps can be performed in either order or can be reformulated in many different ways.
Figure 8.4 Round-to-nearest-even (half).
Listing 8.3. ConvertToHalf().
```c
/*
 * exponent shift and mantissa bit count are the same.
 * When we are shifting, we use [f16|f32]ExpShift
 * When referencing the number of bits in the mantissa,
 * we use [f16|f32]MantissaBits
 */
const int f16ExpShift = 10;
const int f16MantissaBits = 10;

const int f16ExpBias = 15;
const int f16MinExp = -14;
const int f16MaxExp = 15;
const int f16SignMask = 0x8000;

const int f32ExpShift = 23;
const int f32MantissaBits = 23;
const int f32ExpBias = 127;
const int f32SignMask = 0x80000000;

// mask with the low-order 'bits' bits set (given in the text above)
#define LG_MAKE_MASK(bits) ((1<<(bits))-1)

unsigned short
ConvertFloatToHalf( float f )
{
    /*
     * Use a volatile union to portably coerce
     * 32-bit float into 32-bit integer
     */
    volatile union {
        float f;
        unsigned int u;
    } uf;
    uf.f = f;

    // return value: start by propagating the sign bit.
    unsigned short w = (uf.u >> 16) & f16SignMask;

    // Extract input magnitude and exponent
    unsigned int mag = uf.u & ~f32SignMask;
    int exp = (int) (mag >> f32ExpShift) - f32ExpBias;

    // Handle float32 Inf or NaN
    if ( exp == f32ExpBias+1 ) {                        // INF or NaN
        if ( mag & LG_MAKE_MASK(f32MantissaBits) )
            return 0x7fff;                              // NaN
        // INF - propagate sign
        return w|0x7c00;
    }

    /*
     * clamp float32 values that are not representable by float16
     */
    {
        // min float32 magnitude that rounds to float16 infinity
        unsigned int f32MinRInfin = (f16MaxExp+f32ExpBias) << f32ExpShift;
        f32MinRInfin |= LG_MAKE_MASK( f16MantissaBits+1 ) <<
                        (f32MantissaBits-f16MantissaBits-1);

        if (mag > f32MinRInfin)
            mag = f32MinRInfin;
    }
    {
        // max float32 magnitude that rounds to float16 0.0
        unsigned int f32MaxRf16_zero = f16MinExp+f32ExpBias-
                                       (f32MantissaBits-f16MantissaBits-1);
        f32MaxRf16_zero <<= f32ExpShift;
        f32MaxRf16_zero |= LG_MAKE_MASK( f32MantissaBits );

        if (mag < f32MaxRf16_zero)
            mag = f32MaxRf16_zero;
    }

    /*
     * compute exp again, in case mag was clamped above
     */
    exp = (mag >> f32ExpShift) - f32ExpBias;

    // min float32 magnitude that converts to float16 normal
    unsigned int f32Minf16Normal = ((f16MinExp+f32ExpBias)<<f32ExpShift);
    f32Minf16Normal |= LG_MAKE_MASK( f32MantissaBits );

    if ( mag >= f32Minf16Normal ) {
        //
        // Case 1: float16 normal
        //

        // Modify exponent to be biased for float16, not float32
        mag += (unsigned int) ((f16ExpBias-f32ExpBias)<<f32ExpShift);

        int RelativeShift = f32ExpShift-f16ExpShift;

        // add rounding bias
        mag += LG_MAKE_MASK(RelativeShift-1);

        // round-to-nearest even
        mag += (mag >> RelativeShift) & 1;

        w |= mag >> RelativeShift;
    }
    else {
        /*
         * Case 2: float16 denormal
         */

        // mask off exponent bits - now fraction only
        mag &= LG_MAKE_MASK(f32MantissaBits);

        // make implicit 1 explicit
        mag |= (1<<f32ExpShift);

        int RelativeShift = f32ExpShift-f16ExpShift+f16MinExp-exp;

        // add rounding bias
        mag += LG_MAKE_MASK(RelativeShift-1);

        // round-to-nearest even
        mag += (mag >> RelativeShift) & 1;

        w |= mag >> RelativeShift;
    }
    return w;
}
```
In practice, developers should convert float to half by using the __float2half() intrinsic, which the compiler translates to a single F2F machine instruction. This sample routine is provided purely to aid in understanding floating-point layout and rounding; also, examining all the special-case code for INF/NaN and denormal values helps to illustrate why these features of the IEEE spec have been controversial since its inception: They make hardware slower, more costly, or both, due to increased silicon area and the engineering effort needed for validation.
In the code accompanying this book, the ConvertFloatToHalf() routine in Listing 8.3 is incorporated into a program called float_to_float16.cu that tests its output for every 32-bit floating-point value.
8.3.6. Math Library
CUDA includes a built-in math library modeled on the C runtime library, with a few small differences: CUDA hardware does not include a rounding mode register (instead, the rounding mode is encoded on a per-instruction basis), so functions such as rint() that reference the current rounding mode always round to nearest. Additionally, the hardware does not raise floating-point exceptions; results of aberrant operations, such as taking the square root of a negative number, are encoded as NaNs.
Table 8.13 lists the math library functions and the maximum error in ulps for each function. Most functions that operate on float have an “f” appended to the function name—for example, the functions that compute the sine function are as follows.
double sin( double angle );
float sinf( float angle );
Table 8.13 Math Library
FUNCTION | OPERATION | EXPRESSION | ULP ERROR (32-BIT) | ULP ERROR (64-BIT)
x+y | Addition | x+y | 0 [1] | 0
x*y | Multiplication | x*y | 0 [1] | 0
x/y | Division | x/y | 2 [2] | 0
1/x | Reciprocal | 1/x | 1 [2] | 0
acos[f](x) | Inverse cosine | cos^-1 x | 3 | 2
acosh[f](x) | Inverse hyperbolic cosine | | 4 | 2
asin[f](x) | Inverse sine | sin^-1 x | 4 | 2
asinh[f](x) | Inverse hyperbolic sine | | 3 | 2
atan[f](x) | Inverse tangent | tan^-1 x | 2 | 2
atan2[f](y,x) | Inverse tangent of y/x | | 3 | 2
atanh[f](x) | Inverse hyperbolic tangent | tanh^-1 x | 3 | 2
cbrt[f](x) | Cube root | | 1 | 1
ceil[f](x) | “Ceiling,” nearest integer greater than or equal to x | ⌈x⌉ | 0 | 0
copysign[f](x,y) | Sign of y, magnitude of x | | n/a | n/a
cos[f](x) | Cosine | cos x | 2 | 1
cosh[f](x) | Hyperbolic cosine | | 2 |
cospi[f](x) | Cosine, scaled by π | cos πx | 2 |
erf[f](x) | Error function | | 3 | 2
erfc[f](x) | Complementary error function | | 6 | 4
erfcinv[f](y) | Inverse complementary error function | Return x for which y = 1 - erf(x) | 7 | 8
erfcx[f](x) | Scaled complementary error function | e^(x^2) erfc(x) | 6 | 3
erfinv[f](y) | Inverse error function | Return x for which y = erf(x) | 3 | 5
exp[f](x) | Natural exponent | e^x | 2 | 1
exp10[f](x) | Exponent (base 10) | 10^x | 2 | 1
exp2[f](x) | Exponent (base 2) | 2^x | 2 | 1
expm1[f](x) | Natural exponent, minus one | e^x - 1 | 1 | 1
fabs[f](x) | Absolute value | |x| | 0 | 0
fdim[f](x,y) | Positive difference | x - y if x > y; +0 if x <= y; NaN if x or y is NaN | 0 | 0
floor[f](x) | “Floor,” nearest integer less than or equal to x | ⌊x⌋ | 0 | 0
fma[f](x,y,z) | Multiply-add | xy + z | 0 | 0
fmax[f](x,y) | Maximum | x if x > y or isNaN(y); otherwise y | 0 | 0
fmin[f](x,y) | Minimum | x if x < y or isNaN(y); otherwise y | 0 | 0
fmod[f](x,y) | Floating-point remainder | | 0 | 0
frexp[f](x,exp) | Fractional component | | 0 | 0
hypot[f](x,y) | Length of hypotenuse | | 3 | 2
ilogb[f](x) | Get exponent | | 0 | 0
isfinite(x) | Nonzero if x is not ±INF | | n/a | n/a
isinf(x) | Nonzero if x is ±INF | | n/a | n/a
isnan(x) | Nonzero if x is a NaN | | n/a | n/a
j0[f](x) | Bessel function of the first kind (n=0) | J0(x) | 9 [3] | 7 [3]
j1[f](x) | Bessel function of the first kind (n=1) | J1(x) | 9 [3] | 7 [3]
jn[f](n,x) | Bessel function of the first kind | Jn(x) | * | *
ldexp[f](x,exp) | Scale by power of 2 | x · 2^exp | 0 | 0
lgamma[f](x) | Logarithm of gamma function | ln(Γ(x)) | 6 [4] | 4 [4]
llrint[f](x) | Round to long long | | 0 | 0
llround[f](x) | Round to long long | | 0 | 0
lrint[f](x) | Round to long | | 0 | 0
lround[f](x) | Round to long | | 0 | 0
log[f](x) | Natural logarithm | ln(x) | 1 | 1
log10[f](x) | Logarithm (base 10) | log10 x | 3 | 1
log1p[f](x) | Natural logarithm of x+1 | ln(x + 1) | 2 | 1
log2[f](x) | Logarithm (base 2) | log2 x | 3 | 1
logb[f](x) | Get exponent | | 0 | 0
modf[f](x,iptr) | Split fractional and integer parts | | 0 | 0
nan[f](cptr) | Returns NaN | NaN | n/a | n/a
nearbyint[f](x) | Round to integer | | 0 | 0
nextafter[f](x,y) | Returns the FP value closest to x in the direction of y | | n/a | n/a
normcdf[f](x) | Normal cumulative distribution | | 6 | 5
normcdfinv[f](x) | Inverse normal cumulative distribution | | 5 | 8
pow[f](x,y) | Power function | x^y | 8 | 2
rcbrt[f](x) | Inverse cube root | | 2 | 1
remainder[f](x,y) | Remainder | | 0 | 0
remquo[f](x,y,iptr) | Remainder (also returns quotient) | | 0 | 0
rsqrt[f](x) | Reciprocal square root | | 2 | 1
rint[f](x) | Round to nearest int | | 0 | 0
round[f](x) | Round to nearest int | | 0 | 0
scalbln[f](x,n) | Scale x by 2^n (n is long int) | x · 2^n | 0 | 0
scalbn[f](x,n) | Scale x by 2^n (n is int) | x · 2^n | 0 | 0
signbit(x) | Nonzero if x is negative | | n/a | n/a
sin[f](x) | Sine | sin x | 2 | 1
sincos[f](x,s,c) | Sine and cosine | *s=sin(x); *c=cos(x); | 2 | 1
sincospi[f](x,s,c) | Sine and cosine, scaled by π | *s=sin(πx); *c=cos(πx); | 2 | 1
sinh[f](x) | Hyperbolic sine | | 3 | 1
sinpi[f](x) | Sine, scaled by π | sin πx | 2 | 1
sqrt[f](x) | Square root | | 3 [5] | 0
tan[f](x) | Tangent | tan x | 4 | 2
tanh[f](x) | Hyperbolic tangent | | 2 | 1
tgamma[f](x) | True gamma function | Γ(x) | 11 | 8
trunc[f](x) | Truncate (round to integer toward zero) | | 0 | 0
y0[f](x) | Bessel function of the second kind (n=0) | Y0(x) | 9 [3] | 7 [3]
y1[f](x) | Bessel function of the second kind (n=1) | Y1(x) | 9 [3] | 7 [3]
yn[f](n,x) | Bessel function of the second kind | Yn(x) | ** | **
* For the Bessel functions jnf(n,x) and jn(n,x), for n=128 the maximum absolute error is 2.2×10^-6 and 5×10^-12, respectively.
** For the Bessel function ynf(n,x), the error is ⌈2 + 2.5n⌉ ulps for |x| < n; otherwise, the maximum absolute error is 2.2×10^-6 for n=128. For yn(n,x), the maximum absolute error is 5×10^-12.
[1] On SM 1.x class hardware, the precision of addition and multiplication operations that are merged into FMAD instructions suffers due to truncation of the intermediate mantissa.
[2] On SM 2.x and later hardware, developers can reduce this error rate to 0 ulps by specifying --prec-div=true.
[3] For float, the error is 9 ulps for |x| < 8; otherwise, the maximum absolute error is 2.2×10^-6. For double, the error is 7 ulps for |x| < 8; otherwise, the maximum absolute error is 5×10^-12.
[4] The error for lgammaf() is greater than 6 ulps inside the interval [-10.001, -2.264]. The error for lgamma() is greater than 4 ulps inside the interval [-11.001, -2.2637].
[5] On SM 2.x and later hardware, developers can reduce this error rate to 0 ulps by specifying --prec-sqrt=true.
These are denoted in Table 8.13 as, for example, sin[f].
Conversion to Integer
According to the C runtime library definition, the nearbyint() and rint() functions round a floating-point value to the nearest integer using the “current rounding direction,” which in CUDA is always round-to-nearest-even. In the C runtime, nearbyint() and rint() differ only in their handling of the INEXACT exception. But since CUDA does not raise floating-point exceptions, the functions behave identically.
round() implements elementary school–style rounding: For floating-point values halfway between integers, the input is always rounded away from zero. NVIDIA recommends against using this function because it expands to eight instructions as opposed to one for rint() and its variants. trunc() truncates or “chops” the floating-point value, rounding toward zero. It compiles to a single instruction.
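A small kernel (our own sketch) makes the halfway-case behavior concrete for x = 2.5.

```cuda
__global__ void RoundingExamples( float *out )
{
    float x = 2.5f;
    out[0] = rintf( x );    // 2.0: round-to-nearest-even; single instruction
    out[1] = roundf( x );   // 3.0: halfway cases round away from zero; expands to more instructions
    out[2] = truncf( x );   // 2.0: round toward zero; single instruction
}
```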
Fractions and Exponents
float frexpf(float x, int *eptr);
frexpf() breaks the input into a floating-point significand in the range [0.5, 1.0) and an integral exponent of 2, such that
x = Significand · 2^Exponent
float logbf( float x );
logbf() extracts the exponent from x and returns it as a floating-point value. It is equivalent to floorf(log2f(x)), except it is faster. If x is a denormal, logbf() returns the exponent that x would have if it were normalized.
float ldexpf( float x, int exp );
float scalbnf( float x, int n );
float scalblnf( float x, long n );
ldexpf(), scalbnf(), and scalblnf() all compute x · 2^n by direct manipulation of floating-point exponents.
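A quick sketch (our own) of how the pieces fit together: frexpf() decomposes a value and ldexpf() reassembles it exactly.

```cuda
__device__ float frexpRoundTrip( float x )
{
    int e;
    float sig = frexpf( x, &e );    // x = sig * 2^e, with sig in [0.5, 1.0)
    return ldexpf( sig, e );        // reconstructs x exactly
}
```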
Floating-Point Remainder
modff() breaks the input into fractional and integer parts.
float modff( float x, float *intpart );
The return value is the fractional part of x, with the same sign.
remainderf(x,y) computes the floating-point remainder of dividing x by y. The return value is x-n*y, where n is x/y rounded to the nearest integer. If x/y is exactly halfway between two integers, n is chosen to be even.
float remquof(float x, float y, int *quo);
remquof() computes the remainder and passes back the low-order bits of the integral quotient x/y, with the same sign as x/y.
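For example (a sketch of our own), remquof(7.5f, 2.0f, &quo) returns -0.5 and passes back 4 in quo, since 7.5 = 4·2.0 - 0.5.

```cuda
__device__ float remquoExample( void )
{
    int quo;
    float r = remquof( 7.5f, 2.0f, &quo );   // r = -0.5, quo = 4
    return r;
}
```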
Bessel Functions
The Bessel functions of order n relate to the differential equation
x^2 (d^2y/dx^2) + x (dy/dx) + (x^2 - n^2) y = 0
n can be a real number, but for purposes of the C runtime, it is a nonnegative integer.
The general solution to this second-order ordinary differential equation combines Bessel functions of the first kind and of the second kind:
y(x) = c1 Jn(x) + c2 Yn(x)
The math runtime functions jn[f]() and yn[f]() compute Jn(x) and Yn(x), respectively. j0f(), j1f(), y0f(), and y1f() compute these functions for the special cases of n=0 and n=1.
Gamma Function
The gamma function Γ is an extension of the factorial function, with its argument shifted down by 1, to real numbers. It has a variety of definitions, one of which is the integral
Γ(x) = ∫_0^∞ t^(x-1) e^(-t) dt
The function grows so quickly that the return value loses precision for relatively small input values, so the library provides the lgamma() function, which returns the natural logarithm of the gamma function, in addition to the tgamma() (“true gamma”) function.
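Because Γ(n+1) = n!, lgamma() provides a way to work with factorials that would overflow tgamma(); a brief sketch (our own):

```cuda
// ln(n!) = lgamma(n+1); the logarithm stays representable long after n!
// itself has overflowed double precision.
__device__ double logFactorial( int n )
{
    return lgamma( (double) n + 1.0 );
}
```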
8.3.7. Additional Reading
Goldberg’s survey (with the captivating title “What Every Computer Scientist Should Know About Floating Point Arithmetic”) is a good introduction to the topic.
http://download.oracle.com/docs/cd/E19957-01/806-3568/ncg_goldberg.html
Nathan Whitehead and Alex Fit-Florea of NVIDIA have coauthored a white paper entitled “Precision & Performance: Floating Point and IEEE 754 Compliance for NVIDIA GPUs.”
http://developer.download.nvidia.com/assets/cuda/files/NVIDIA-CUDA-Floating-Point.pdf
Increasing Effective Precision
Dekker and Kahan developed methods to almost double the effective precision of floating-point hardware using pairs of numbers in exchange for a slight reduction in exponent range (due to intermediate underflow and overflow at the far ends of the range). Some papers on this topic include the following.
Dekker, T.J. A floating-point technique for extending the available precision. Numer. Math. 18, 1971, pp. 224–242.
Linnainmaa, S. Software for doubled-precision floating-point computations. ACM TOMS 7 (3), 1981, pp. 272–283.
Shewchuk, J.R. Adaptive precision floating-point arithmetic and fast robust geometric predicates. Discrete & Computational Geometry 18, 1997, pp. 305–363.
Some GPU-specific work on this topic has been done by Andrew Thall, Da Graça, and Defour.
Da Graça, Guillaume, and David Defour. Implementation of float-float operators on graphics hardware. 7th Conference on Real Numbers and Computers, RNC7 (2006).
http://hal.archives-ouvertes.fr/docs/00/06/33/56/PDF/float-float.pdf
Thall, Andrew. Extended-precision floating-point numbers for GPU computation. 2007.