Some information relating to NV50 CUDA and shader code, for lack of a better page name.
See http://www.nvidia.com/object/cuda_develop.html for general information about CUDA.
NOTE: tesla == NV50TCL, NV54TCL, NVA0TCL, or NVA8TCL: the object used for 3d rendering. turing == NV50_COMPUTE, the object used for launching standalone computations. tesla can launch blocks of code as one of 3 types of shaders in 3d rendering pipeline, turing launches them as standalone programs.
There were four revisions of CUDA, all backwards-compatible:
| Version | Cards | New ISA stuff | New non-ISA stuff |
|---|---|---|---|
| 1.0 | NV50 | Original revision | |
| 1.1 | NV8x, NV9x | Atomic 32-bit instructions on g[], breakpoints | Debugging support |
| 1.2 | NVA5, NVA8, some other NVAx? | Atomic 64-bit instructions on g[], atomic 32-bit instructions on s[], vote instructions | 32 warps and 16384 registers per MP |
| 1.3 | NVA0, some other NVAx? | Double-precision floating-point instructions | |
They're also known as sm_10, sm_11, sm_12, sm_13. Note that 1.2 was actually after 1.3 and is just 1.3 with double-precision thrown out.
Types of programs you can run:
- VP: Vertex Program. Has inputs in a[], outputs in $o. Launched as first programmable stage of tesla rendering pipeline.
- GP: Geometry Program. Has inputs in a[] and p[], outputs in $o. Optionally launched as second programmable stage of tesla rendering pipeline.
- FP: Fragment Program. Has inputs in v[], outputs in $r. Launched as third programmable stage of tesla rendering pipeline, when rasterization is enabled.
- CP: Compute Program. Has inputs in s[], and is able to access s[] and g[]. Launched as standalone programs by turing.
General setup
| tesla VP | tesla GP | tesla FP | turing CP | name | description |
|---|---|---|---|---|---|
| 0x198 | 0x198 | 0x198 | 0x1c0 | DMA_CODE_CB | DMA segment for CBs and program code |
| 0xf7c | 0xf70 | 0xfa4 | 0x210 | [VGFC]P_ADDRESS_HIGH | Base address of code segment (high part) |
| 0xf80 | 0xf74 | 0xfa8 | 0x214 | [VGFC]P_ADDRESS_LOW | Base address of code segment (low part) |
| 0x140c | 0x1410 | 0x1414 | 0x3b4 | [VGFC]P_START_ID | Entry point: initial value of PC |
| 0x16b0 | 0x17a0 | 0x198c | 0x2c0 | [VGFC]P_REG_ALLOC_TEMP | Number of allocated $r registers |
| 0x16b8 | 0x17a8 | - | - | [VG]P_REG_ALLOC_RESULT | Number of allocated $o registers |
| ??? | ??? | ??? | 0x380 | CODE_CB_FLUSH | Writing 0 flushes on-GPU caches of code and CB data. Needed after changing code segment contents. |
All code addresses, including entry points, bra/call/joinat target fields, and the values pushed on stack, need to be aligned to 4-byte units and are relative to code segment base.
TODO: need to figure out if CB/code cache is per-MP or per-TP or what.
ISA
For now look at nv50dis sources for all your NV50 ISA needs, it'll get proper documentation one day.
Thread hierarchy
| What | Contained in | How many per parent | Notes |
|---|---|---|---|
| TP | Device | 1-10. Read PMC reg 0x1540 & 0xffff and count bits. | Tile/texture processor. A block containing several MPs, 8 texturing units, and some code/const cache. Name doesn't make sense, but |
| MP | TP | 1-3. Read PMC reg 0x1540 & 0x0f000000 and count bits. | Multiprocessor. Contains register and shared memory pools divided between warps/blocks. 8192 registers for <NVA0, 16384 for >=NVA0; 16kB of shared memory. Also known as SM [Streaming Multiprocessor]. |
| Block | MP | Variable | CP-only: A single block of warps that share s[] memory and barriers. They need to fit into a single MP [because of s[]], but can be spread out across several warps, and you can have several blocks on a single MP at the same time if they all fit. |
| SP | MP | 8 | Scalar Processor: A single execution unit of MP. Takes an active warp each cycle and executes the next insn from it. Mapping to handled warps is unknown. Actually, nvidia specs and CUDA docs are the only sources that say they actually exist. |
| Warp | MP | 24 for <NVA0, 32 for >=NVA0 | Warp: A single block of threads sharing program counter and execution units. If threads within a warp diverge, only one branch can remain active, the other lies dormant until the active branch exits or decides to rejoin. Also the level of granularity for the vote insn. |
| Quad | Warp | 8 | A group of 4 threads. In FP, they render a 2x2 square. Texture instructions assume this geometry for computing implicit derivatives, so you have to hack a bit to use them in non-FP. Quads are treated specially by some instructions. |
| Lane | Warp | 32 | A single thread. |
| Lane | Quad | 4 | A single thread. |
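The unit counts in the table can be derived from one register read. A minimal sketch, assuming you already have the raw 32-bit value of PMC reg 0x1540 in hand (how you read it is outside this page; the function name and the sample value in the usage note are made up for illustration):

```python
def count_units(pmc_1540):
    """Return (tp_count, mps_per_tp) from the raw value of PMC reg 0x1540.

    Both fields are enable bitmasks, so the count is the number of set bits.
    """
    tp_count = bin(pmc_1540 & 0x0000ffff).count("1")
    mps_per_tp = bin(pmc_1540 & 0x0f000000).count("1")
    return tp_count, mps_per_tp
```

For a hypothetical register value of 0x030000ff this gives 8 TPs with 2 MPs each.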
The quad of TP id, MP id, warp id, lane id unambiguously identifies a single thread context contained in a GPU. This quad is contained in a special read-only register called physid, laid out as follows:
physid&0x0000001f: lane id
physid&0x00001f00: warp id
physid&0x00030000: MP id
physid&0x00f00000: TP id
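The masks above translate directly into shifts. A small sketch of a decoder (the `PhysId` type and `decode_physid` name are illustrative only, not anything the hardware or nv50dis defines):

```python
from typing import NamedTuple


class PhysId(NamedTuple):
    tp: int
    mp: int
    warp: int
    lane: int


def decode_physid(physid):
    """Split a raw physid value into its four fields using the masks above."""
    return PhysId(
        tp=(physid & 0x00f00000) >> 20,
        mp=(physid & 0x00030000) >> 16,
        warp=(physid & 0x00001f00) >> 8,
        lane=physid & 0x0000001f,
    )
```

For example, `decode_physid(0x00310a05)` yields TP 3, MP 1, warp 10, lane 5.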
Environment
Available registers
| Name | Indices | Size | Type | Available in | Description |
|---|---|---|---|---|---|
| $r? | $r0-$r<rnum-1> | 32-bit | RW | All | General-purpose registers. You need to tell tesla/turing the needed number for your particular program. rnum can be up to 128. |
| $r? | $r<rnum>-$r127 | 32-bit | RW | All | Out-of-bounds GPRs. They usually read as 0 and can be used for that, but do something weird when you try to store them into mem. |
| $r?l | $r0l-$r63l | 16-bit | RW | All | Low 16-bit half of given $r register, usable in 16-bit insns. |
| $r?h | $r0h-$r63h | 16-bit | RW | All | High 16-bit half of given $r register, usable in 16-bit insns. |
| $r?d | $r0d-$r126d, number divisible by 2 | 64-bit | RW | All | Pair consisting of $r<num+1> [high] and $r<num> [low] used as a single 64-bit register. Usable in l[] and g[] load/stores and f64 insns. |
| $r?q | $r0q-$r124q, number divisible by 4 | 128-bit | RW | All | Quad consisting of $r<num+3>, $r<num+2>, $r<num+1>, $r<num> used as a single 128-bit register. Usable in l[] and g[] load/stores. |
| $o? | $o0-$o126 | 32-bit | WO | VP,GP | Output registers. They're write-only. Like $r, you need to tell tesla how many you need. Can be up to 128, but you won't be able to use that last one. |
| $o? | $o127 | 32-bit | WO | All | Bit-bucket register. Writes here get ignored, even if you declared output 127 [yes, this makes output 127 useless]. |
| $o?l | $o0l-$o63l | 16-bit | WO | VP,GP | 16-bit halves of output registers. They don't really work: writing to any half will duplicate the value into both halves of given output. Probably useless. |
| $o?h | $o0h-$o62h | 16-bit | WO | VP,GP | 16-bit halves of output registers. They don't really work: writing to any half will duplicate the value into both halves of given output. Probably useless. |
| $o?h | $o63h | 16-bit | WO | All | Bit-bucket register. Writes here get ignored, even if you declared output 63 [so you can use it as bit-bucket safely without disturbing $o63]. |
| $a? | $a1-$a4 | 16-bit | RW | All | Address registers: can be used for addressing all memory spaces except g[]. These 4 are used by ptxas. |
| $a? | $a0 | 16-bit | RW | All | Special address register hardcoded to 0. Absolute addressing in many modes is actually addressing with this reg. |
| $a? | $a5-$a6 | 16-bit | RW | All | These registers also seem to be hardcoded to 0, but for no good reason. Avoid. |
| $a? | $a7 | 16-bit | RW | All | This register, otoh, seems to work, but isn't used by ptxas for some reason. |
| $c? | $c0-$c3 | 4-bit | RW | All | Conditional registers. A lot of instructions can be told to set one of them according to insn result. They contain 4 different 1-bit flags. These flags and their various combinations can be used for conditional execution. |
| (Special registers) | 0:physid | 32-bit | RO | All | Identifies the physical place of the thread on the GPU, see thread hierarchy info above. |
| (Special registers) | 1:clock | 32-bit | RO | All | Counts clock ticks. Probably. |
| (Special registers) | 2:??? | 32-bit | RO | All | Seems to always be 0. |
| (Special registers) | 3:??? | 32-bit | RO | All | Seems to always be 0x20. |
| (Special registers) | 4-7:pm0-pm3 | 32-bit | RO | All | Performance monitoring registers. Value can be changed directly by SET_PM[] methods, and you can set them to count some stuff by PM_MODE methods. Mostly unknown. |
Bits in condition registers
| Bit | Name | Description |
|---|---|---|
| 0 | Zero | Set if result is 0 or NaN. |
| 1 | Sign | Set if result has highest bit set [integer] or is negative or NaN [float] |
| 2 | Carry | Set for integer addition if carry out of highest bit happened |
| 3 | Overflow | Set for integer addition if high bit of destination doesn't match the "true" sign of result. Calculated before saturation if insn does that, so you can check if saturation happened. |
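To make the flag definitions concrete, here is a software model of what a $c register would hold after a plain 32-bit integer add. This is a sketch only: it assumes no saturation and covers nothing but integer addition, and `add_flags` is a made-up helper, not a hardware or nv50dis name.

```python
def add_flags(a, b):
    """Model the 4-bit condition value for a 32-bit integer add a + b.

    Bit 0 = zero, bit 1 = sign, bit 2 = carry, bit 3 = overflow.
    """
    mask = 0xffffffff
    a &= mask
    b &= mask
    full = a + b                 # unbounded result, used to detect carry
    res = full & mask            # what lands in the destination register
    zero = res == 0
    sign = bool(res & 0x80000000)
    carry = full > mask
    # Signed overflow: both operands share a sign and the result's differs.
    overflow = bool(~(a ^ b) & (a ^ res) & 0x80000000)
    return int(zero) | (int(sign) << 1) | (int(carry) << 2) | (int(overflow) << 3)
```

Adding 1 to 0x7fffffff sets sign and overflow (0b1010), while adding 1 to 0xffffffff sets zero and carry (0b0101).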
Memory spaces
| Name | Size | Type | Available in | Description |
|---|---|---|---|---|
| c0-c15 | Up to 64kiB | RO cached VRAM | All | Constant spaces: a locally-cached chunk of VRAM assumed to be constant by the card. First access can be slow, subsequent accesses fast. |
| l | ??? CUDA doesn't want to go above 4kiB, but I got 64kiB on tesla just fine | RW VRAM | All | Local space: a per-thread chunk of VRAM. Just like g[], and just as slow, but with funky address mangling applied to get the real address. |
| g0-g15 | Up to 4GB | RW VRAM | CP | Global spaces: a directly-mapped writable and readable chunk of VRAM. Supports some atomic ops on >=sm_11. Slow compared to others. |
| s | Up to 16kB??? | RW on-MP | CP | Shared space: a block of fast memory on the SM itself, shared between threads in a single block. Supports some atomic ops since sm_12. First 0x10 bytes contain grid layout info, subsequent space contains parameters passed from host. |
| a | ??? | RO on-MP??? | VP,GP | Attribute space. Contains attributes/inputs to VP, pointers to primitives in p[] for GP. Probably like s[]. Not much is known. |
| p | ??? | RO on-MP??? | GP | Primitive space. Contains attributes/inputs to GP. Probably like s[]. Not much is known. |
| v | ??? | RO on-MP??? | FP | Varying space. Contains interpolated inputs to FP. Looks like there are at least 3 different ones for flat/normal/centroid varyings. Not much is known. |
Addressing modes: for everything but g[], you use [$a+offset] and addresses are 16-bit. For g[] you use g[$r] and addresses are 32-bit.
Stack and local memory
Return addresses for call insn and rejoin points are stored on a stack. The stack is per-warp. A single stack entry is two 32-bit words. Stack entries are grouped into blocks of 4 entries. The MP can hold up to 3 blocks per warp [tested on NV86] inside itself, then it starts spilling the blocks to memory, one block at a time. Format of offset inside the stack segment for my card seems to be:
&0x00000007: offset in a single entry
&0x00000018: entry number inside a block
&0x000007e0: MP&warp id. this field is probably larger for cards with more MPs.
&0xfffff800: block number. it's unknown how large the stack can be.
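The field layout above can be unpacked as follows. This is a sketch for the layout observed on one card only; as noted, the MP&warp field is probably wider elsewhere, and the function and key names are invented for illustration.

```python
def decode_stack_offset(off):
    """Split an offset inside the stack segment into the fields listed above."""
    return {
        "byte_in_entry": off & 0x00000007,
        "entry_in_block": (off & 0x00000018) >> 3,
        "mp_warp": (off & 0x000007e0) >> 5,
        "block": (off & 0xfffff800) >> 11,
    }
```

Each block of 4 entries is 32 bytes, which is why the MP&warp field starts at bit 5.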
Local memory is simply an area of VRAM with some space carved out for each physid. Format of offset inside local segment seems to be [addr == address as used in NV50 code]:
&0x00000003: addr&3
&0x0000003c: laneid&0xf
&0x000000c0: addr&0xc
Further bit assignments are variable in at least some cases, but seem to include the following, in this order:
laneid&0x10, disabled at least sometimes on turing when 16 lanes selected, but not on tesla
MP id, seems always included [with >1 MP]
warpid, usually included, but some combination of 0xf44-0xf50 regs on tesla was seen to disable that
TP id, seems always included [with >1 TP]
addr&0xfff0
To make matters even more fun, looks like MP id, warpid, TP id are massaged into a single field with multiplication by 3 instead of bit shifting on >= NVA0 [in the only test I tried, that_combined_bitfield = TP*32*3 + warpid*3 + MP]. Probably because that machine has 3 MPs.
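Only the low 8 bits of the local offset and the >= NVA0 combined field are pinned down well enough above to write out. A sketch of just those two pieces, with hypothetical function names; the position of the remaining fields varies and is deliberately not modeled:

```python
def local_offset_low_bits(addr, laneid):
    """Low 8 bits of the offset inside the local segment.

    addr is the address as used in NV50 code.
    """
    return (addr & 0x3) | ((laneid & 0xf) << 2) | ((addr & 0xc) << 4)


def combined_field_nva0(tp, warpid, mp):
    """Combined TP/warp/MP field as seen in a single test on >= NVA0."""
    return tp * 32 * 3 + warpid * 3 + mp
```

Note how 16 lanes accessing the same 32-bit word end up in 16 consecutive words of VRAM.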
Stack & local segments are specified in the following methods:
| tesla | turing | what |
|---|---|---|
| 0x194 | 0x1bc | stack DMA segment |
| 0x190 | 0x1b8 | local DMA segment |
| 0xd94 | 0x218 | stack address, high |
| 0xd98 | 0x21c | stack address, low |
| 0xd9c | 0x220 | log2(stack size/some_const), maybe? |
| 0x12d8 | 0x294 | local address, high |
| 0x12dc | 0x298 | local address, low |
| 0x12e0 | 0x29c | log2(local size/some_const). some_const seems to be 4 for turing, 8 for tesla, at least sometimes. Weird. |
| 0xf44-0xf50 | 0x2fc-0x308 | Look somehow related to stack and/or local |
The stack grows up from position 0. It is empty when execution starts.
Format for a single stack entry seems to be:
word0 & 0x003fffff: return or rejoin address, shifted right by 2.
word0 & 0x00c00000: a copy of bits 2 and 1 of this entry's number. [why?]
word0 & 0x1f000000: a copy of warp id for some reason.
word0 & 0xe0000000: type of entry. known types: 010 == call without 0x40 in second word, 011 == call with 0x40 in second word, 110 == joinat.
word1: bitmask of active warps.
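A sketch of decoding one stack entry according to the fields above, e.g. when inspecting a spilled block in memory. The names are made up, and entry types other than the three listed are simply reported as unknown.

```python
ENTRY_TYPES = {
    0b010: "call",
    0b011: "call (0x40 set in word1)",
    0b110: "joinat",
}


def decode_stack_entry(word0, word1):
    """Decode the two 32-bit words of a single stack entry."""
    return {
        # Stored shifted right by 2, so shift back to get a byte address.
        "address": (word0 & 0x003fffff) << 2,
        "entry_bits": (word0 & 0x00c00000) >> 22,
        "warp": (word0 & 0x1f000000) >> 24,
        "type": ENTRY_TYPES.get((word0 & 0xe0000000) >> 29, "unknown"),
        "mask": word1,
    }
```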
All joinat does is pushing an entry onto the stack [doesn't validate address nor anything].
To check: exact behavior of join, call, and ret. Will ret/join complain about mismatched types? Can I manipulate the stack by mapping the same area with g[] and forcing spilling? That would enable very hackish indirect jumping in CPs...
Const spaces
There are 16 const spaces. Each of them can be independently bound to one of 128 CBs [const buffers]. To set up a const buffer, write its address to CB_DEF_ADDRESS_*, then write its number and size to CB_DEF_SET. CBs share their DMA object with program code. To bind a buffer to a c[] space in a particular type of program, write to SET_PROGRAM_CB. The data in CBs is cached and needs to be explicitly flushed by poking 0 to CODE_CB_FLUSH when you change it externally.
There's also an upload function, which lets you upload data directly to a CB using tesla/turing, and automatically updates the cache [you don't need CODE_CB_FLUSH]. To use it, just write offset and buffer id to CB_ADDR, then throw data at CB_DATA. It doesn't matter which CB_DATA you use, they're all aliases [probably made so you can upload <=64 bytes with a single packet...]. The address behind CB_ADDR is autoincremented for each CB_DATA access.
The methods:
| tesla | turing | name | description |
|---|---|---|---|
| 0x198 | 0x1c0 | DMA_CODE_CB | DMA segment for CBs and program code |
| 0xf00 | 0x238 | CB_ADDR | Sets the upload address and buffer for subsequent CB_DATA upload. &0x7f: buffer id, &0x003fff00: target address, shifted right by 2 [or, in units of 32-bit words]. |
| 0xf04-0xf40 | 0x23c-0x278 | CB_DATA[0-15] | Upload method: anything written to any of these methods is stored at current upload address, then upload address is autoincremented by 4. It's an error to upload after the address overflows. |
| 0x1280 | 0x2a4 | CB_DEF_ADDRESS_HIGH | Address of CB, to be used by next CB_DEF_SET. |
| 0x1284 | 0x2a8 | CB_DEF_ADDRESS_LOW | Address of CB, to be used by next CB_DEF_SET. |
| 0x1288 | 0x2ac | CB_DEF_SET | Sets address and size for a given CB id. Address taken from CB_DEF_ADDRESS_*. &0x7f0000: buffer id, &0xffff: size. Size needs to be multiple of 0x100 bytes. Setting size field to 0 is special and really means size 65536. |
| ???? | 0x380 | CODE_CB_FLUSH | Writing 0 flushes on-SM caches of code and CB data. Needed after changing CB contents with anything other than CB_ADDR/CB_DATA. |
| 0x1694 | 0x3c8 | SET_PROGRAM_CB | Binds CB to a c[] space in a program. &0x7f000: CB id, &0xf00: c[] space number, &0xf0: program type for tesla [0 - VP, 2 - GP, 3 - FP], doesn't apply for turing, &0x1: unknown flag, must be set to 1. |
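The bitfields of the three packed methods can be assembled like this. A sketch only: the helper names are invented, the range checks are inferred from the field widths given above, and nothing here actually submits methods to the card.

```python
def cb_addr(buffer_id, byte_offset):
    """Value for CB_ADDR: buffer id in bits 0-6, word offset in bits 8-21."""
    if byte_offset % 4 or not 0 <= byte_offset < 0x10000:
        raise ValueError("offset must be 4-byte aligned and below 64kiB")
    return ((byte_offset >> 2) << 8) | (buffer_id & 0x7f)


def cb_def_set(buffer_id, size):
    """Value for CB_DEF_SET. A size of 65536 is encoded as 0 in the size field."""
    if size % 0x100 or not 0 < size <= 0x10000:
        raise ValueError("size must be a multiple of 0x100, at most 65536")
    return ((buffer_id & 0x7f) << 16) | (size & 0xffff)


def set_program_cb(cb_id, space, prog_type=0):
    """Value for SET_PROGRAM_CB. prog_type is only meaningful on tesla
    (0 = VP, 2 = GP, 3 = FP); the low bit is the unknown must-be-1 flag."""
    return ((cb_id & 0x7f) << 12) | ((space & 0xf) << 8) | ((prog_type & 0xf) << 4) | 1
```

For example, binding CB 5 to c2[] of an FP on tesla would write 0x5231 to SET_PROGRAM_CB.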
To check: flushes and that unknown flag.
CP
CPs are launched using turing. They're launched in grids consisting of blocks consisting of threads. Blocks are independent units of computation and new ones start execution as soon as there are enough free warps to hold them, giving partly-parallel partly-serial execution. Each block in a grid is identified by its x, y coordinates, and they appear to start execution in (y,x) lexicographical order.
A block, in turn, contains threads identified by their x, y, z coordinates. All threads in a block are guaranteed to be executed in parallel and have access to a barrier instruction that stops thread's execution until all threads in the block have reached it. Each block is also assigned its own s[] memory space that can be accessed by all its threads.
The CP-specific methods are:
| Method | Name | Description |
|---|---|---|
| 0x0388 | GRIDID | Grid ID: freeform 16-bit value |
| 0x03a4 | GRIDDIM | Grid dimensions: y<<16 \| x, both x and y in 0-65535 range. |
| 0x03a8 | SHARED_SIZE | Size of shared memory per block. Needs to be in units of 0x40 bytes. |
| 0x03ac | BLOCKDIM_XY | Block dimensions: y<<16 \| x |
| 0x03b0 | BLOCKDIM_Z | Block dimensions: z |
| 0x02b4 | THREADS_PER_BLOCK | Threads per block |
| 0x02b8 | - | Lane enable 1: accepts 0 and 1. 1 is needed for 32 lanes. |
| 0x03b8 | - | Lane enable 2: accepts 1 and 2. 2 is needed for 32 lanes. |
| 0x0374 | USER_PARAM_COUNT | Parameter count, shifted left by 8 bits. Max 64. |
| 0x0600-0x06fc | USER_PARAM[0-63] | Parameters. |
| 0x02f8 | - | Unknown purpose, but you need to put 1 here after setting up all of the above, otherwise 0x368 causes DATA_ERROR. |
| 0x0368 | LAUNCH | Write 0 here to actually launch the grid. |
CPs use 32 lanes if 1 is written to 0x02b8 and 2 to 0x03b8, 16 lanes otherwise.
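Putting the CP methods together, a launch boils down to a list of (method, value) writes on the turing object. The sketch below only builds that list; the function name is hypothetical, and it assumes that SHARED_SIZE takes a byte count that is a multiple of 0x40 and that the 16-lane setting is 0 in 0x02b8 and 1 in 0x03b8. Neither assumption is confirmed above. Code, CB, stack, local and global setup are not included.

```python
def cp_launch_methods(grid, block, threads, shared_size, params,
                      gridid=0, lanes32=True):
    """Return the (method, value) writes for one grid launch, in order."""
    gx, gy = grid
    bx, by, bz = block
    if shared_size % 0x40:
        raise ValueError("shared size assumed to be a multiple of 0x40 bytes")
    if len(params) > 64:
        raise ValueError("at most 64 parameters")
    if threads > bx * by * bz:
        raise ValueError("THREADS_PER_BLOCK exceeds block volume (DATA_ERROR)")
    methods = [
        (0x0388, gridid & 0xffff),        # GRIDID
        (0x03a4, (gy << 16) | gx),        # GRIDDIM
        (0x03a8, shared_size),            # SHARED_SIZE
        (0x03ac, (by << 16) | bx),        # BLOCKDIM_XY
        (0x03b0, bz),                     # BLOCKDIM_Z
        (0x02b4, threads),                # THREADS_PER_BLOCK
        (0x02b8, 1 if lanes32 else 0),    # lane enable 1
        (0x03b8, 2 if lanes32 else 1),    # lane enable 2
        (0x0374, len(params) << 8),       # USER_PARAM_COUNT
    ]
    methods += [(0x0600 + 4 * i, p) for i, p in enumerate(params)]
    methods += [(0x02f8, 1), (0x0368, 0)]  # unknown-but-required, then LAUNCH
    return methods
```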
For each launched block, all (z,y,x) tuples in range (0,0,0) through (blockdim.z-1, blockdim.y-1, blockdim.x-1) are created, sorted lexicographically. Then the first THREADS_PER_BLOCK of them are taken and assigned to sequential lanes, spanning multiple warps if needed. It is a DATA_ERROR if you LAUNCH with THREADS_PER_BLOCK > blockdim.x*blockdim.y*blockdim.z. So for blockdim.x == 2, blockdim.y == 3, blockdim.z == 4, THREADS_PER_BLOCK == 21, and 16 enabled lanes, threads in a block are assigned like this [a and b stand for whichever two warps the block gets]:
| tid.x | tid.y | tid.z | warp id | lane id |
|---|---|---|---|---|
| 0 | 0 | 0 | a | 0 |
| 1 | 0 | 0 | a | 1 |
| 0 | 1 | 0 | a | 2 |
| 1 | 1 | 0 | a | 3 |
| 0 | 2 | 0 | a | 4 |
| 1 | 2 | 0 | a | 5 |
| 0 | 0 | 1 | a | 6 |
| 1 | 0 | 1 | a | 7 |
| 0 | 1 | 1 | a | 8 |
| 1 | 1 | 1 | a | 9 |
| 0 | 2 | 1 | a | 0xa |
| 1 | 2 | 1 | a | 0xb |
| 0 | 0 | 2 | a | 0xc |
| 1 | 0 | 2 | a | 0xd |
| 0 | 1 | 2 | a | 0xe |
| 1 | 1 | 2 | a | 0xf |
| 0 | 2 | 2 | b | 0 |
| 1 | 2 | 2 | b | 1 |
| 0 | 0 | 3 | b | 2 |
| 1 | 0 | 3 | b | 3 |
| 0 | 1 | 3 | b | 4 |
| 1 | 1 | 3 | cut off by THREADS_PER_BLOCK | |
| 0 | 2 | 3 | cut off by THREADS_PER_BLOCK | |
| 1 | 2 | 3 | cut off by THREADS_PER_BLOCK | |
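The assignment rule can be reproduced in a few lines. In this sketch (function name made up) the warp is reported as an index relative to the block, 0 for "a" and 1 for "b" in the table, since the physical warp ids a block receives are not predictable from the launch parameters.

```python
def assign_threads(blockdim, threads_per_block, lanes=16):
    """Return a list of ((tid.x, tid.y, tid.z), warp_index, lane) tuples."""
    bx, by, bz = blockdim
    if threads_per_block > bx * by * bz:
        raise ValueError("THREADS_PER_BLOCK exceeds block volume (DATA_ERROR)")
    out = []
    n = 0
    # Lexicographic (z, y, x) order: x varies fastest, z slowest.
    for z in range(bz):
        for y in range(by):
            for x in range(bx):
                if n == threads_per_block:
                    return out
                out.append(((x, y, z), n // lanes, n % lanes))
                n += 1
    return out
```

`assign_threads((2, 3, 4), 21)` returns the 21 assigned rows of the table; the last three tuples of the block are never created.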
When your block starts, shared memory for the block contains the following:
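| Address | Size | PTX name | Description |
|---|---|---|---|
| 0x0 | 16-bit | gridid | Taken straight from GRIDID method. |
| 0x2 | 16-bit | ntid.x | Block dimensions, taken from BLOCKDIM* |
| 0x4 | 16-bit | ntid.y | Block dimensions, taken from BLOCKDIM* |
| 0x6 | 16-bit | ntid.z | Block dimensions, taken from BLOCKDIM* |
| 0x8 | 16-bit | nctaid.x | Grid dimensions, taken from GRIDDIM |
| 0xa | 16-bit | nctaid.y | Grid dimensions, taken from GRIDDIM |
| 0xc | 16-bit | ctaid.x | Coordinates of this block inside the grid, 0 through GRIDDIM.[XY]-1 |
| 0xe | 16-bit | ctaid.y | Coordinates of this block inside the grid, 0 through GRIDDIM.[XY]-1 |
| 0x10 | USER_PARAM_COUNT*4 | - | Parameters as specified in USER_PARAM |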
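For anyone dumping s[] to check a launch, the initial layout can be parsed with `struct`. This is a sketch that assumes little-endian byte order and 32-bit parameters; the function and field names are illustrative, with dots in the PTX names replaced by underscores.

```python
import struct

HEADER_FIELDS = (
    "gridid", "ntid_x", "ntid_y", "ntid_z",
    "nctaid_x", "nctaid_y", "ctaid_x", "ctaid_y",
)


def parse_shared_header(data, param_count):
    """Parse the first 0x10 bytes of s[] plus param_count 32-bit parameters."""
    header = dict(zip(HEADER_FIELDS, struct.unpack_from("<8H", data, 0)))
    params = list(struct.unpack_from("<%dI" % param_count, data, 0x10))
    return header, params
```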
Also, at CP start, $r0 contains coordinates of the current thread inside its block:
& 0x0000ffff: tid.x
& 0x03ff0000: tid.y
& 0xfc000000: tid.z
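Unpacking $r0 follows directly from the three masks. A minimal sketch with made-up helper names; the encoder exists only so the example can be checked on the host.

```python
def decode_tid(r0):
    """Split the initial $r0 value into (tid.x, tid.y, tid.z)."""
    return (
        r0 & 0x0000ffff,
        (r0 & 0x03ff0000) >> 16,
        (r0 & 0xfc000000) >> 26,
    )


def encode_tid(x, y, z):
    """Inverse of decode_tid, for testing: 16 bits of x, 10 of y, 6 of z."""
    return (x & 0xffff) | ((y & 0x3ff) << 16) | ((z & 0x3f) << 26)
```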
Global memory
In CPs, there are 16 segments of global memory available. Each of them is independent, and can be either a linear area of VRAM, or a normal 2D surface. In the linear case, address used in g[] reference is simply used as an offset wrt GLOBAL_BASE_*.
However, when bound to a 2D surface, addressing is funnier. The high 16 bits of address passed to g[] are taken as y coordinate, low 16 as byte offset inside a line. Tiling is applied onto that according to specified tiling mode, with GLOBAL_PITCH used as tile pitch [distance between start of consecutive rows of tiles]. The resulting offset is then added to GLOBAL_BASE_* and bastardised with tile flags 0x7000.
Or, if you prefer a formula, for g[REG]:
X = REG&0xffff
Y = REG>>16
TILE_SHIFT = TILE_MODE+2
TILE_MASK = (1<<TILE_SHIFT) - 1
OFFS = (X&0x3f) + ((Y&TILE_MASK)<<6) + ((X>>6)<<(6+TILE_SHIFT))
ADDR = fuckmeharder_0x7000( (GLOBAL_BASE_HIGH[i]<<32) + GLOBAL_BASE_LOW[i] + OFFS + (Y>>TILE_SHIFT)*GLOBAL_PITCH[i] )
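The same formula as runnable code, for checking addresses by hand. This sketch (function name invented) stops before the base address is added and before the 0x7000 tile-flag mangling, since that last step is not specified here.

```python
def global_2d_offset(reg, tile_mode, pitch):
    """Offset from GLOBAL_BASE for g[reg] on a 2D surface, before tile-flag mangling."""
    x = reg & 0xffff            # byte offset inside a line
    y = reg >> 16               # line number
    tile_shift = tile_mode + 2
    tile_mask = (1 << tile_shift) - 1
    offs = (x & 0x3f) + ((y & tile_mask) << 6) + ((x >> 6) << (6 + tile_shift))
    return offs + (y >> tile_shift) * pitch
```

With tile mode 0 a tile is 64 bytes wide and 4 lines high, so x == 0x45, y == 5 lands in the second tile of the second tile row.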
Accessible area is limited by the GLOBAL_LIMIT method. For linear memory, it's set to size-1, where size needs to be a multiple of 0x100 bytes. A write is out of bounds if REG > GLOBAL_LIMIT. For 2D surfaces, GLOBAL_LIMIT is a bitfield with limits for x and y parts. Access is considered out of bounds if (REG>>16) > (GLOBAL_LIMIT>>16) || (REG&0xffff) > (GLOBAL_LIMIT&0xffff).
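Both bounds rules fit in one small predicate. A sketch with a hypothetical name, taking the raw g[] register value and the raw GLOBAL_LIMIT value:

```python
def g_out_of_bounds(reg, limit, linear):
    """True if g[reg] falls outside GLOBAL_LIMIT for the given surface kind."""
    if linear:
        return reg > limit
    # 2D surface: y part in the high 16 bits, x part in the low 16 bits.
    return (reg >> 16) > (limit >> 16) or (reg & 0xffff) > (limit & 0xffff)
```

Note that in the 2D case a register value can be numerically below the limit and still be out of bounds, because x and y are checked separately.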
So, methods [i is index of gi[] space you're setting up]:
| Method | Name | Description |
|---|---|---|
| 0x1a0 | DMA_GLOBAL | DMA segment used for all g[] spaces |
| 0x400+(i<<5) | GLOBAL_BASE_HIGH | The base address of gi[] segment |
| 0x404+(i<<5) | GLOBAL_BASE_LOW | The base address of gi[] segment |
| 0x408+(i<<5) | GLOBAL_PITCH | Surface pitch for 2D surface, ignored for linear. Must be multiple of 0x100, max 0x800000. |
| 0x40c+(i<<5) | GLOBAL_LIMIT | Highest allowed address. Needs to be (multiple of 0x100)-1. Has separate x and y parts for 2D surfaces. |
| 0x410+(i<<5) | GLOBAL_MODE | Bit 0: 0 == 2D surface, 1 == linear buffer. Bits 8-10: tile mode if 2D surface. |
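Since every gi[] space has its own copy of the five methods at a 0x20 stride, the method numbers are easy to compute. A sketch with invented helper names:

```python
def global_methods(i):
    """Method offsets for setting up the gi[] space, 0 <= i <= 15."""
    if not 0 <= i < 16:
        raise ValueError("there are only 16 g[] spaces")
    base = i << 5
    return {
        "GLOBAL_BASE_HIGH": 0x400 + base,
        "GLOBAL_BASE_LOW": 0x404 + base,
        "GLOBAL_PITCH": 0x408 + base,
        "GLOBAL_LIMIT": 0x40c + base,
        "GLOBAL_MODE": 0x410 + base,
    }


def global_mode(linear, tile_mode=0):
    """Value for GLOBAL_MODE: bit 0 selects linear, bits 8-10 hold the tile mode."""
    return 1 if linear else (tile_mode & 0x7) << 8
```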

