8.7. Instruction Sets
NVIDIA has developed three major architectures: Tesla (SM 1.x), Fermi (SM 2.x), and Kepler (SM 3.x). Within those families, new instructions have been added as NVIDIA updated its products. For example, global atomic operations were not present in the very first Tesla-class processor (the G80, which shipped in 2006 as the GeForce 8800 GTX), but all subsequent Tesla-class GPUs included them. So when querying the SM version via cuDeviceComputeCapability(), the major and minor versions will be 1.0 for G80 and 1.1 (or greater) for all other Tesla-class GPUs. Conversely, if the SM version is 1.1 or greater, the application can use global atomics.
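The version test described above can be sketched in host code as follows. This is a minimal illustration, not a definitive implementation: the SMVersion struct and the atLeast() and hasGlobalAtomics() helpers are hypothetical names, and the version pair is assumed to have been obtained already from a device query such as cuDeviceComputeCapability(), so the sketch makes no CUDA calls and compiles without the toolkit.

```cpp
// Sketch: decide from an SM version whether global atomics are usable.
// The version pair is assumed to come from a device query such as
// cuDeviceComputeCapability(); no CUDA call is made here.

struct SMVersion {
    int smMajor;  // e.g., 1 for Tesla, 2 for Fermi, 3 for Kepler
    int smMinor;
};

// True if version v is at least (reqMajor, reqMinor).
constexpr bool atLeast(SMVersion v, int reqMajor, int reqMinor) {
    return v.smMajor > reqMajor ||
           (v.smMajor == reqMajor && v.smMinor >= reqMinor);
}

// Hypothetical helper: per the text, every GPU except SM 1.0 (G80)
// supports global atomics.
constexpr bool hasGlobalAtomics(SMVersion v) {
    return atLeast(v, 1, 1);
}

// Compile-time checks of the rule stated in the text.
static_assert(!hasGlobalAtomics(SMVersion{1, 0}), "G80 lacks global atomics");
static_assert(hasGlobalAtomics(SMVersion{1, 1}), "SM 1.1 has global atomics");
static_assert(hasGlobalAtomics(SMVersion{3, 5}), "Kepler has global atomics");
```

Comparing the major version first and the minor version only on a tie avoids the common mistake of testing the two numbers independently, which would wrongly reject SM 2.0 when the requirement is SM 1.1.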
Table 8.16 lists the SASS instructions that may be printed by cuobjdump when disassembling microcode for Tesla-class (SM 1.x) hardware. The Fermi and Kepler instruction sets closely resemble each other, except for the instructions that support surface load/store, so they are given together in Table 8.17. In both tables, the middle column specifies the first SM version to support a given instruction.
Table 8.16 SM 1.x Instruction Set
| OPCODE | SM | DESCRIPTION |
| --- | --- | --- |
| FLOATING POINT | | |
| COS | 1.0 | Cosine |
| DADD | 1.3 | Double-precision floating-point add |
| DFMA | 1.3 | Double-precision floating-point fused multiply-add |
| DMAX | 1.3 | Double-precision floating-point maximum |
| DMIN | 1.3 | Double-precision floating-point minimum |
| DMUL | 1.3 | Double-precision floating-point multiply |
| DSET | 1.3 | Double-precision floating-point conditional set |
| EX2 | 1.0 | Exponential (base 2) |
| FADD/FADD32/FADD32I | 1.0 | Single-precision floating-point add |
| FCMP | 1.0 | Single-precision floating-point compare |
| FMAD/FMAD32/FMAD32I | 1.0 | Single-precision floating-point multiply-add |
| FMAX | 1.0 | Single-precision floating-point maximum |
| FMIN | 1.0 | Single-precision floating-point minimum |
| FMUL/FMUL32/FMUL32I | 1.0 | Single-precision floating-point multiply |
| FSET | 1.0 | Single-precision floating-point conditional set |
| LG2 | 1.0 | Single-precision floating-point logarithm (base 2) |
| RCP | 1.0 | Single-precision floating-point reciprocal |
| RRO | 1.0 | Range reduction operator (used before SIN/COS) |
| RSQ | 1.0 | Reciprocal square root |
| SIN | 1.0 | Sine |
| FLOW CONTROL | | |
| BAR | 1.0 | Barrier synchronization (__syncthreads()) |
| BRA | 1.0 | Conditional branch |
| BRK | 1.0 | Conditional break from loop |
| BRX | 1.0 | Fetch an address from constant memory and branch to it |
| C2R | 1.0 | Condition code to data register |
| CAL | 1.0 | Unconditional subroutine call |
| RET | 1.0 | Conditional return from subroutine |
| SSY | 1.0 | Set synchronization point; used before potentially divergent instructions |
| DATA CONVERSION | | |
| F2F | 1.0 | Copy floating-point value with conversion to floating point |
| F2I | 1.0 | Copy floating-point value with conversion to integer |
| I2F | 1.0 | Copy integer value with conversion to floating point |
| I2I | 1.0 | Copy integer value with conversion to integer |
| INTEGER | | |
| IADD/IADD32/IADD32I | 1.0 | Integer addition |
| IMAD/IMAD32/IMAD32I | 1.0 | Integer multiply-add |
| IMAX | 1.0 | Integer maximum |
| IMIN | 1.0 | Integer minimum |
| IMUL/IMUL32/IMUL32I | 1.0 | Integer multiply |
| ISAD/ISAD32 | 1.0 | Integer sum of absolute differences |
| ISET | 1.0 | Integer conditional set |
| SHL | 1.0 | Shift left |
| SHR | 1.0 | Shift right |
| MEMORY OPERATIONS | | |
| A2R | 1.0 | Move address register to data register |
| ADA | 1.0 | Add immediate to address register |
| G2R | 1.0 | Move from shared memory to register. The .LCK suffix, used to implement shared memory atomics, causes the bank to be locked until an R2G.UNL has been performed. |
| GATOM.IADD/EXCH/CAS/IMIN/IMAX/INC/DEC/IAND/IOR/IXOR | 1.1 | Global memory atomic operations; performs an atomic operation and returns the original value. |
| GLD | 1.0 | Load from global memory |
| GRED.IADD/IMIN/IMAX/INC/DEC/IAND/IOR/IXOR | 1.1 | Global memory reduction operations; performs an atomic operation with no return value. |
| GST | 1.0 | Store to global memory |
| LLD | 1.0 | Load from local memory |
| LST | 1.0 | Store to local memory |
| LOP | 1.0 | Logical operation (AND/OR/XOR) |
| MOV/MOV32 | 1.0 | Move source to destination |
| MVC | 1.0 | Move from constant memory |
| MVI | 1.0 | Move immediate |
| R2A | 1.0 | Move register to address register |
| R2C | 1.0 | Move data register to condition code |
| R2G | 1.0 | Store to shared memory. When used with the .UNL suffix, releases a previously held lock on that shared memory bank. |
| MISCELLANEOUS | | |
| NOP | 1.0 | No operation |
| TEX/TEX32 | 1.0 | Texture fetch |
| VOTE | 1.2 | Warp-vote primitive |
| S2R | 1.0 | Move special register (e.g., thread ID) to register |
Table 8.17 SM 2.x and SM 3.x Instruction Sets
| OPCODE | SM | DESCRIPTION |
| --- | --- | --- |
| FLOATING POINT | | |
| DADD | 2.0 | Double-precision add |
| DMUL | 2.0 | Double-precision multiply |
| DMNMX | 2.0 | Double-precision minimum/maximum |
| DSET | 2.0 | Double-precision set |
| DSETP | 2.0 | Double-precision set predicate |
| DFMA | 2.0 | Double-precision fused multiply-add |
| FFMA | 2.0 | Single-precision fused multiply-add |
| FADD | 2.0 | Single-precision floating-point add |
| FCMP | 2.0 | Single-precision floating-point compare |
| FMUL | 2.0 | Single-precision floating-point multiply |
| FMNMX | 2.0 | Single-precision floating-point minimum/maximum |
| FSWZ | 2.0 | Single-precision floating-point swizzle |
| FSET | 2.0 | Single-precision floating-point set |
| FSETP | 2.0 | Single-precision floating-point set predicate |
| MUFU | 2.0 | Multifunction (SFU) operation |
| RRO | 2.0 | Range reduction operator (used before MUFU sin/cos) |
| INTEGER | | |
| BFE | 2.0 | Bit field extract |
| BFI | 2.0 | Bit field insert |
| FLO | 2.0 | Find leading one |
| IADD | 2.0 | Integer add |
| ICMP | 2.0 | Integer compare and select |
| IMAD | 2.0 | Integer multiply-add |
| IMNMX | 2.0 | Integer minimum/maximum |
| IMUL | 2.0 | Integer multiply |
| ISAD | 2.0 | Integer sum of absolute differences |
| ISCADD | 2.0 | Integer add with scale |
| ISET | 2.0 | Integer set |
| ISETP | 2.0 | Integer set predicate |
| LOP | 2.0 | Logical operation (AND/OR/XOR) |
| SHF | 3.5 | Funnel shift |
| SHL | 2.0 | Shift left |
| SHR | 2.0 | Shift right |
| POPC | 2.0 | Population count |
| DATA CONVERSION | | |
| F2F | 2.0 | Floating point to floating point |
| F2I | 2.0 | Floating point to integer |
| I2F | 2.0 | Integer to floating point |
| I2I | 2.0 | Integer to integer |
| SCALAR VIDEO | | |
| VABSDIFF | 2.0 | Scalar video absolute difference |
| VADD | 2.0 | Scalar video add |
| VMAD | 2.0 | Scalar video multiply-add |
| VMAX | 2.0 | Scalar video maximum |
| VMIN | 2.0 | Scalar video minimum |
| VSET | 2.0 | Scalar video set |
| VSHL | 2.0 | Scalar video shift left |
| VSHR | 2.0 | Scalar video shift right |
| VSUB | 2.0 | Scalar video subtract |
| VECTOR (SIMD) VIDEO | | |
| VABSDIFF2(4) | 3.0 | Vector video 2x16-bit (4x8-bit) absolute difference |
| VADD2(4) | 3.0 | Vector video 2x16-bit (4x8-bit) addition |
| VAVRG2(4) | 3.0 | Vector video 2x16-bit (4x8-bit) average |
| VMAX2(4) | 3.0 | Vector video 2x16-bit (4x8-bit) maximum |
| VMIN2(4) | 3.0 | Vector video 2x16-bit (4x8-bit) minimum |
| VSET2(4) | 3.0 | Vector video 2x16-bit (4x8-bit) set |
| VSUB2(4) | 3.0 | Vector video 2x16-bit (4x8-bit) subtraction |
| DATA MOVEMENT | | |
| MOV | 2.0 | Move |
| PRMT | 2.0 | Permute |
| SEL | 2.0 | Select (conditional move) |
| SHFL | 3.0 | Warp shuffle |
| PREDICATE/CONDITION CODES | | |
| CSET | 2.0 | Condition code set |
| CSETP | 2.0 | Condition code set predicate |
| P2R | 2.0 | Predicate to register |
| R2P | 2.0 | Register to predicate |
| PSET | 2.0 | Predicate set |
| PSETP | 2.0 | Predicate set predicate |
| TEXTURE | | |
| TEX | 2.0 | Texture fetch |
| TLD | 2.0 | Texture load |
| TLD4 | 2.0 | Texture load 4 texels |
| TXQ | 2.0 | Texture query |
| MEMORY OPERATIONS | | |
| ATOM | 2.0 | Atomic memory operation |
| CCTL | 2.0 | Cache control |
| CCTLL | 2.0 | Cache control (local) |
| LD | 2.0 | Load from memory |
| LDC | 2.0 | Load constant |
| LDG | 3.5 | Noncoherent global load (reads via texture cache) |
| LDL | 2.0 | Load from local memory |
| LDLK | 2.0 | Load and lock |
| LDS | 2.0 | Load from shared memory |
| LDSLK | 2.0 | Load from shared memory and lock |
| LDU | 2.0 | Load uniform |
| LD_LDU | 2.0 | Combines generic load LD with a load uniform LDU |
| LDS_LDU | 2.0 | Combines shared memory load LDS with a load uniform LDU |
| MEMBAR | 2.0 | Memory barrier |
| RED | 2.0 | Atomic memory reduction operation |
| ST | 2.0 | Store to memory |
| STL | 2.0 | Store to local memory |
| STUL | 2.0 | Store and unlock |
| STS | 2.0 | Store to shared memory |
| STSUL | 2.0 | Store to shared memory and unlock |
| SURFACE MEMORY (FERMI) | | |
| SULD | 2.0 | Surface load |
| SULEA | 2.0 | Surface load effective address |
| SUQ | 2.0 | Surface query |
| SURED | 2.0 | Surface reduction |
| SUST | 2.0 | Surface store |
| SURFACE MEMORY (KEPLER) | | |
| SUBFM | 3.0 | Surface bit field merge |
| SUCLAMP | 3.0 | Surface clamp |
| SUEAU | 3.0 | Surface effective address |
| SULDGA | 3.0 | Surface load generic address |
| SUSTGA | 3.0 | Surface store generic address |
| FLOW CONTROL | | |
| BRA | 2.0 | Branch to relative address |
| BPT | 2.0 | Breakpoint/trap |
| BRK | 2.0 | Break from loop |
| BRX | 2.0 | Branch to relative indexed address |
| CAL | 2.0 | Call to relative address |
| CONT | 2.0 | Continue in loop |
| EXIT | 2.0 | Exit program |
| JCAL | 2.0 | Call to absolute address |
| JMP | 2.0 | Jump to absolute address |
| JMX | 2.0 | Jump to absolute indexed address |
| LONGJMP | 2.0 | Long jump |
| PBK | 2.0 | Pre-break relative address |
| PCNT | 2.0 | Pre-continue relative address |
| PLONGJMP | 2.0 | Pre-long jump relative address |
| PRET | 2.0 | Pre-return relative address |
| RET | 2.0 | Return from call |
| SSY | 2.0 | Set synchronization point; used before potentially divergent instructions |
| MISCELLANEOUS | | |
| B2R | 2.0 | Barrier to register |
| BAR | 2.0 | Barrier synchronization |
| LEPC | 2.0 | Load effective program counter |
| NOP | 2.0 | No operation |
| S2R | 2.0 | Special register to register (used to read, for example, the thread or block ID) |
| VOTE | 2.0 | Query condition across warp |
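
Because the SM column gives the first version to support each instruction, an availability check reduces to a table lookup plus a version comparison. The following is a minimal sketch of that idea, assuming only a handful of rows copied from Table 8.17; the OpcodeInfo struct, the kOpcodes array, and the sameName() and opcodeAvailable() helpers are hypothetical names introduced for illustration. The result is meaningful only for SM 2.x and SM 3.x, the versions the table covers.

```cpp
// Sketch: look up the first SM version that supports an opcode.
// Only a few rows of Table 8.17 are reproduced; this is not the full table.

struct OpcodeInfo {
    const char *name;
    int firstMajor;  // first SM major version to support the opcode
    int firstMinor;  // first SM minor version to support the opcode
};

constexpr OpcodeInfo kOpcodes[] = {
    {"FFMA", 2, 0},  // single-precision fused multiply-add
    {"SHFL", 3, 0},  // warp shuffle
    {"SHF",  3, 5},  // funnel shift
    {"LDG",  3, 5},  // noncoherent global load
};

// Exact, case-sensitive comparison of two NUL-terminated strings.
constexpr bool sameName(const char *a, const char *b) {
    while (*a != '\0' && *a == *b) {
        ++a;
        ++b;
    }
    return *a == *b;
}

// True if the opcode is in kOpcodes and the given SM version is at least
// the version listed for it. Opcodes missing from the excerpt return false.
constexpr bool opcodeAvailable(const char *name, int smMajor, int smMinor) {
    for (const OpcodeInfo &op : kOpcodes) {
        if (sameName(name, op.name)) {
            return smMajor > op.firstMajor ||
                   (smMajor == op.firstMajor && smMinor >= op.firstMinor);
        }
    }
    return false;
}

// Examples drawn from the rows above.
static_assert(opcodeAvailable("SHFL", 3, 0), "SHFL is available on SM 3.0");
static_assert(!opcodeAvailable("SHFL", 2, 1), "SHFL is not available on SM 2.1");
static_assert(!opcodeAvailable("LDG", 3, 0), "LDG requires SM 3.5");
```

An exact string match matters here because some opcodes are prefixes of others: SHF (funnel shift, SM 3.5) must not be confused with SHFL (warp shuffle, SM 3.0).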