CUDA

Some information relating to NV50 CUDA and shader code, for lack of a better page name.

See http://www.nvidia.com/object/cuda_develop.html for general information about CUDA.

NOTE: tesla == NV50TCL, NV54TCL, NVA0TCL, or NVA8TCL: the object used for 3d rendering. turing == NV50_COMPUTE, the object used for launching standalone computations. tesla can launch blocks of code as one of 3 types of shaders in 3d rendering pipeline, turing launches them as standalone programs.

There were four revisions of the CUDA compute capability on this hardware generation, all backwards-compatible:

Version | Cards                        | New ISA stuff | New non-ISA stuff
--------+------------------------------+---------------+------------------
1.0     | NV50                         | Original revision |
1.1     | NV8x, NV9x                   | Atomic 32-bit instructions on g[], breakpoints | Debugging support
1.2     | NVA5, NVA8, some other NVAx? | Atomic 64-bit instructions on g[], atomic 32-bit instructions on s[], vote instructions | 32 warps and 16384 registers per MP
1.3     | NVA0, some other NVAx?       | Double-precision floating-point instructions |

They're also known as sm_10, sm_11, sm_12, sm_13. Note that 1.2 was actually after 1.3 and is just 1.3 with double-precision thrown out.

Types of programs you can run:

General setup

tesla VP | tesla GP | tesla FP | turing CP | name                   | description
---------+----------+----------+-----------+------------------------+------------
0x198    | 0x198    | 0x198    | 0x1c0     | DMA_CODE_CB            | DMA segment for CBs and program code
0xf7c    | 0xf70    | 0xfa4    | 0x210     | [VGFC]P_ADDRESS_HIGH   | Base address of code segment
0xf80    | 0xf74    | 0xfa8    | 0x214     | [VGFC]P_ADDRESS_LOW    | Base address of code segment
0x140c   | 0x1410   | 0x1414   | 0x3b4     | [VGFC]P_START_ID       | Entry point: initial value of PC
0x16b0   | 0x17a0   | 0x198c   | 0x2c0     | [VGFC]P_REG_ALLOC_TEMP | Number of allocated $r registers
0x16b8   | 0x17a8   | -        | -         | [VG]P_REG_ALLOC_RESULT | Number of allocated $o registers
???      | ???      | ???      | 0x380     | CODE_CB_FLUSH          | Writing 0 flushes on-GPU caches of code and CB data. Needed after changing code segment contents.

All code addresses, including entry points, bra/call/joinat target fields, and the values pushed on stack, need to be aligned to 4-byte units and are relative to code segment base.

TODO: need to figure out if CB/code cache is per-MP or per-TP or what.

ISA

For now look at nv50dis sources for all your NV50 ISA needs, it'll get proper documentation one day.

Thread hierarchy

What  | Contained in | How many per parent | Notes
------+--------------+---------------------+------
TP    | Device       | 1-10. Read PMC reg 0x1540 & 0xffff and count bits. | Tile/texture processor. A block containing several MPs, 8 texturing units, and some code/const cache. Name doesn't make sense, but
MP    | TP           | 1-3. Read PMC reg 0x1540 & 0x0f000000 and count bits. | Multiprocessor. Contains register and shared memory pools divided between warps/blocks. 8192 registers for <NVA0, 16384 for >=NVA0; 16kB of shared memory. Also known as SM [Scalable Multiprocessor].
Block | MP           | Variable | CP-only: A single block of warps that share s[] memory and barriers. They need to fit into a single MP [because of s[]], but can be spread out across several warps, and you can have several blocks on a single MP at the same time if they all fit.
SP    | MP           | 8 | Scalar Processor: A single execution unit of an MP. Takes an active warp each cycle and executes the next insn from it. Mapping to handled warps is unknown. Actually, nvidia specs and CUDA docs are the only sources that say they actually exist.
Warp  | MP           | 24 for <NVA0, 32 for >=NVA0 | Warp: A single block of threads sharing program counter and execution units. If threads within a warp diverge, only one branch can remain active, the other lies dormant until the active branch exits or decides to rejoin. Also the level of granularity for the vote insn.
Quad  | Warp         | 8 | A group of 4 threads. In FP, they render a 2x2 square. Texture instructions assume this geometry for computing implicit derivatives, so you have to hack a bit to use them in non-FP. Quads are treated specially by some instructions.
Lane  | Warp         | 32 | A single thread.
Lane  | Quad         | 4 | A single thread.
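The bit-counting rule given for TP and MP above can be sketched like this. How you actually read PMC register 0x1540 is out of scope here; `pmc_1540` is a hypothetical, already-read raw value, and the sample value is made up.

```python
def count_units(pmc_1540: int) -> tuple[int, int]:
    """Return (number of TPs, number of MPs per TP) from a raw PMC 0x1540 value."""
    # One bit per present TP in the low 16 bits.
    tp_count = bin(pmc_1540 & 0xffff).count("1")
    # One bit per present MP in bits 24-27.
    mp_per_tp = bin(pmc_1540 & 0x0f000000).count("1")
    return tp_count, mp_per_tp


# Made-up sample value: 8 TP bits and 2 MP bits set.
print(count_units(0x030000ff))
```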

The quad of TP id, MP id, warp id, lane id unambiguously identifies a single thread context contained in a GPU. This quad is contained in a special read-only register called physid that contains the following:

Environment

Available registers

Name                | Indices                             | Size    | Type | Available in | Description
--------------------+-------------------------------------+---------+------+--------------+------------
$r?                 | $r0-$r<rnum-1>                      | 32-bit  | RW   | All          | General-purpose registers. You need to tell tesla/turing the needed number for your particular program. rnum can be up to 128.
$r?                 | $r<rnum>-$r127                      | 32-bit  | RW   | All          | Out-of-bounds GPRs. They usually read as 0 and can be used for that, but do something weird when you try to store them into mem.
$r?l                | $r0l-$r63l                          | 16-bit  | RW   | All          | Low 16-bit half of given $r register, usable in 16-bit insns.
$r?h                | $r0h-$r63h                          | 16-bit  | RW   | All          | High 16-bit half of given $r register, usable in 16-bit insns.
$r?d                | $r0d-$r126d, number divisible by 2  | 64-bit  | RW   | All          | Pair consisting of $r<num+1> [high] and $r<num> [low] used as a single 64-bit register. Usable in l[] and g[] load/stores and f64 insns.
$r?q                | $r0q-$r124q, number divisible by 4  | 128-bit | RW   | All          | Quad consisting of $r<num+3>, $r<num+2>, $r<num+1>, $r<num> used as a single 128-bit register. Usable in l[] and g[] load/stores.
$o?                 | $o0-$o126                           | 32-bit  | WO   | VP,GP        | Output registers. They're write-only. Like $r, you need to tell tesla how many you need. Can be up to 128, but you won't be able to use that last one.
$o?                 | $o127                               | 32-bit  | WO   | All          | Bit-bucket register. Writes here get ignored, even if you declared output 127 [yes, this makes output 127 useless].
$o?l                | $o0l-$o63l                          | 16-bit  | WO   | VP,GP        | 16-bit halves of output registers. They don't really work: writing to any half will duplicate the value into both halves of given output. Probably useless.
$o?h                | $o0h-$o62h                          | 16-bit  | WO   | VP,GP        | 16-bit halves of output registers. They don't really work: writing to any half will duplicate the value into both halves of given output. Probably useless.
$o?h                | $o63h                               | 16-bit  | WO   | All          | Bit-bucket register. Writes here get ignored, even if you declared output 63 [so you can use it as bit-bucket safely without disturbing $o63].
$a?                 | $a1-$a4                             | 16-bit  | RW   | All          | Address registers: can be used for addressing all memory spaces except g[]. These 4 are used by ptxas.
$a?                 | $a0                                 | 16-bit  | RW   | All          | Special address register hardcoded to 0. Absolute addressing in many modes is actually addressing with this reg.
$a?                 | $a5-$a6                             | 16-bit  | RW   | All          | These registers also seem to be hardcoded to 0, but for no good reason. Avoid.
$a?                 | $a7                                 | 16-bit  | RW   | All          | This register, otoh, seems to work, but isn't used by ptxas for some reason.
$c?                 | $c0-$c3                             | 4-bit   | RW   | All          | Conditional registers. A lot of instructions can be told to set one of them according to insn result. They contain 4 different 1-bit flags. These flags and their various combinations can be used for conditional execution.
(Special registers) | 0:physid                            | 32-bit  | RO   | All          | Identifies the physical place of the thread on the GPU, see thread hierarchy info above.
(Special registers) | 1:clock                             | 32-bit  | RO   | All          | Counts clock ticks. Probably.
(Special registers) | 2:???                               | 32-bit  | RO   | All          | Seems to always be 0.
(Special registers) | 3:???                               | 32-bit  | RO   | All          | Seems to always be 0x20.
(Special registers) | 4-7:pm0-pm3                         | 32-bit  | RO   | All          | Performance monitoring registers. Value can be changed directly by SET_PM[] methods, and you can set them to count some stuff by PM_MODE methods. Mostly unknown.

Bits in condition registers

Bit | Name     | Description
----+----------+------------
0   | Zero     | Set if result is 0 or NaN.
1   | Sign     | Set if result has highest bit set [integer] or is negative or NaN [float]
2   | Carry    | Set for integer addition if carry out of highest bit happened
3   | Overflow | Set for integer addition if high bit of destination doesn't match the "true" sign of result. Calculated before saturation if insn does that, so you can check if saturation happened.
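As a rough illustration of the four flags, here is how they would come out for a plain 32-bit integer addition according to the descriptions above. This is a software model only, not verified against hardware; the function name is made up, and treating both operands as signed to get the "true" sign is an assumption.

```python
def add_flags(a: int, b: int, bits: int = 32) -> int:
    """Model the 4-bit $c value for an integer add (bit 0 zero ... bit 3 overflow)."""
    mask = (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    a &= mask
    b &= mask
    full = a + b                 # unbounded sum of the unsigned operands
    result = full & mask         # what lands in the destination

    # Assumption: the "true" sign is the sign of the sum of the operands
    # interpreted as two's-complement signed numbers.
    sa = a - (1 << bits) if a & sign_bit else a
    sb = b - (1 << bits) if b & sign_bit else b
    true_negative = (sa + sb) < 0

    zero = result == 0
    sign = bool(result & sign_bit)
    carry = full > mask
    overflow = sign != true_negative
    return zero | (sign << 1) | (carry << 2) | (overflow << 3)


print(hex(add_flags(0x7fffffff, 1)))   # sign and overflow set
print(hex(add_flags(1, 0xffffffff)))   # zero and carry set
```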

Memory spaces

Name   | Size        | Type           | Available in | Description
-------+-------------+----------------+--------------+------------
c0-c15 | Up to 64kiB | RO cached VRAM | All          | Constant spaces: a locally-cached chunk of VRAM assumed to be constant by the card. First access can be slow, subsequent accesses fast.
l      | ??? CUDA doesn't want to go above 4kiB, but I got 64kiB on tesla just fine | RW VRAM | All | Local space: a per-thread chunk of VRAM. Has quite funny address translation applied to get real address. Just like g[], but with funky address mangling. And just as slow.
g0-g15 | Up to 4GB   | RW VRAM        | CP           | Global spaces: a directly-mapped writable and readable chunk of VRAM. Supports some atomic ops on >=sm_11. Slow compared to others.
s      | Up to 16kB??? | RW on-MP     | CP           | Shared space: a block of fast memory on the SM itself, shared between threads in a single block. Supports some atomic ops since sm_12. First 0x10 bytes contain grid layout info, subsequent space contains parameters passed from host.
a      | ???         | RO on-MP???    | VP,GP        | Attribute space. Contains attributes/inputs to VP, pointers to primitives in p[] for GP. Probably like s[]. Not much is known.
p      | ???         | RO on-MP???    | GP           | Primitive space. Contains attributes/inputs to GP. Probably like s[]. Not much is known.
v      | ???         | RO on-MP???    | FP           | Varying space. Contains interpolated inputs to FP. Looks like there are at least 3 different ones for flat/normal/centroid varyings. Not much is known.

Addressing modes: for everything but g[], you use [$a+offset] and addresses are 16-bit. For g[] you use g[$r] and addresses are 32-bit.

Stack and local memory

Return addresses for call insn and rejoin points are stored on a stack. The stack is per-warp. A single stack entry is two 32-bit words. Stack entries are grouped into blocks of 4 entries. The MP can hold up to 3 blocks per warp [tested on NV86] inside itself, then it starts spilling the blocks to memory, one block at a time. Format of offset inside the stack segment for my card seems to be:

Local memory is simply an area of VRAM with some space carved out for each physid. Format of offset inside local segment seems to be [addr == address as used in NV50 code]:

Further bit assignments are variable in at least some cases, but seem to include the following, in this order:

To make matters even more fun, looks like MP id, warpid, TP id are massaged into a single field with multiplication by 3 instead of bit shifting on >= NVA0 [in the only test I tried, that_combined_bitfield = TP*32*3 + warpid*3 + MP]. Probably because that machine has 3 MPs.
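For what it's worth, the combined field from that single test can be modelled as below. The constants 32 and 3 are taken straight from the formula above; whether they really mean "warps per MP" and "MPs per TP" on every >= NVA0 card is an assumption, and the helper names are made up.

```python
WARPS_PER_MP = 32   # assumption: the "32" in the observed formula
MPS_PER_TP = 3      # assumption: the "3" in the observed formula


def pack_combined(tp: int, warp: int, mp: int) -> int:
    """that_combined_bitfield = TP*32*3 + warpid*3 + MP, as observed once."""
    return tp * WARPS_PER_MP * MPS_PER_TP + warp * MPS_PER_TP + mp


def unpack_combined(value: int) -> tuple[int, int, int]:
    """Inverse of pack_combined: returns (tp, warp, mp)."""
    tp, rest = divmod(value, WARPS_PER_MP * MPS_PER_TP)
    warp, mp = divmod(rest, MPS_PER_TP)
    return tp, warp, mp


print(pack_combined(2, 5, 1), unpack_combined(208))
```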

Stack & local segments are specified in the following methods:

tesla       | turing      | what
------------+-------------+-----
0x194       | 0x1bc       | stack DMA segment
0x190       | 0x1b8       | local DMA segment
0xd94       | 0x218       | stack address, high
0xd98       | 0x21c       | stack address, low
0xd9c       | 0x220       | log2(stack size/some_const), maybe?
0x12d8      | 0x294       | local address, high
0x12dc      | 0x298       | local address, low
0x12e0      | 0x29c       | log2(local size/some_const). some_const seems to be 4 for turing, 8 for tesla, at least sometimes. Weird.
0xf44-0xf50 | 0x2fc-0x308 | Look somehow related to stack and/or local

The stack grows up from position 0. It is empty when execution starts.

Format for a single stack entry seems to be:

All joinat does is push an entry onto the stack [it doesn't validate the address or anything else].

To check: exact behavior of join, call, and ret. Will ret/join complain about mismatched types? Can I manipulate the stack by mapping the same area with g[] and forcing spilling? That would enable very hackish indirect jumping in CPs...

Const spaces

There are 16 const spaces. Each of them can be independently bound to one of 128 CBs [const buffers]. To set up a const buffer, write its address to CB_DEF_ADDRESS_*, then write its number and size to CB_DEF_SET. CBs share their DMA object with program code. To bind a buffer to a c[] space in a particular type of program, write to SET_PROGRAM_CB. The data in CBs is cached and needs to be explicitly flushed by poking 0 to CODE_CB_FLUSH when you change it externally.

There's also an upload function, which lets you upload data directly to a CB using tesla/turing, and automatically updates the cache [you don't need CODE_CB_FLUSH]. To use it, just write offset and buffer id to CB_ADDR, then throw data at CB_DATA. It doesn't matter which CB_DATA you use, they're all aliases [probably made so you can upload <=64 bytes with a single packet...]. The address behind CB_ADDR is autoincremented for each CB_DATA access.
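A minimal sketch of the upload sequence, using the CB_ADDR bit layout from the methods table below and the tesla method offsets. It only builds a list of (method, value) pairs; how those get submitted to the card is left out, the function name is made up, and the overflow check against the 14-bit address field is my reading of "it's an error to upload after the address overflows".

```python
CB_ADDR = 0xf00    # tesla offset; turing uses 0x238
CB_DATA0 = 0xf04   # first CB_DATA alias on tesla; turing uses 0x23c


def cb_upload(buffer_id: int, byte_offset: int, words: list[int]) -> list[tuple[int, int]]:
    """Build (method, value) writes that upload 32-bit words into a CB."""
    if not 0 <= buffer_id < 128:
        raise ValueError("buffer id must fit in 7 bits")
    if byte_offset % 4:
        raise ValueError("offset must be a multiple of 4 bytes")
    word_offset = byte_offset >> 2
    # Assumption: the address field is 14 bits wide (mask 0x003fff00).
    if word_offset + len(words) > 0x4000:
        raise ValueError("upload would run past the address field")
    writes = [(CB_ADDR, (word_offset << 8) | buffer_id)]
    # All CB_DATA methods are aliases, so the first one is used every time.
    writes += [(CB_DATA0, w & 0xffffffff) for w in words]
    return writes


for method, value in cb_upload(5, 0x10, [0x11111111, 0x22222222]):
    print(hex(method), hex(value))
```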

The methods:

tesla         | turing      | name                | description
--------------+-------------+---------------------+------------
0x198         | 0x1c0       | DMA_CODE_CB         | DMA segment for CBs and program code
0xf00         | 0x238       | CB_ADDR             | Sets the upload address and buffer for subsequent CB_DATA upload. &0x7f: buffer id, &0x003fff00: target address, shifted right by 2 [or, in units of 32-bit words].
0xf04-0xf40   | 0x23c-0x278 | CB_DATA[0-15]       | Upload method: anything written to any of these methods is stored at current upload address, then upload address is autoincremented by 4. It's an error to upload after the address overflows.
0x1280        | 0x2a4       | CB_DEF_ADDRESS_HIGH | Address of CB, to be used by next CB_DEF_SET.
0x1284        | 0x2a8       | CB_DEF_ADDRESS_LOW  | Address of CB, to be used by next CB_DEF_SET.
0x1288        | 0x2ac       | CB_DEF_SET          | Sets address and size for a given CB id. Address taken from CB_DEF_ADDRESS_*. &0x7f0000: buffer id, &0xffff: size. Size needs to be multiple of 0x100 bytes. Setting size field to 0 is special and really means size 65536.
???           | 0x380       | CODE_CB_FLUSH       | Writing 0 flushes on-SM caches of code and CB data. Needed after changing CB contents with anything other than CB_ADDR/CB_DATA.
0x1694        | 0x3c8       | SET_PROGRAM_CB      | Binds CB to a c[] space in a program. &0x7f000: CB id, &0xf00: c[] space number, &0xf0: program type for tesla [0 - VP, 2 - GP, 3 - FP], doesn't apply for turing, &0x1: unknown flag, must be set to 1.

To check: flushes and that unknown flag.
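The CB_DEF_SET and SET_PROGRAM_CB values can be put together as in the following sketch. The bitfields are the ones from the table above; the function names and the range checks are mine, and the unknown flag is simply always set to 1 as the table says it must be.

```python
VP, GP, FP = 0, 2, 3   # tesla program types from the table above


def cb_def_set(buffer_id: int, size: int) -> int:
    """Value for CB_DEF_SET: buffer id in bits 16-22, size in bits 0-15."""
    if not 0 <= buffer_id < 128:
        raise ValueError("buffer id must fit in 7 bits")
    if size % 0x100 or not 0x100 <= size <= 0x10000:
        raise ValueError("size must be a multiple of 0x100, at most 65536")
    # 65536 does not fit in 16 bits; a size field of 0 encodes it.
    return (buffer_id << 16) | (size & 0xffff)


def set_program_cb(cb_id: int, space: int, prog_type: int = 0) -> int:
    """Value for SET_PROGRAM_CB; prog_type is only meaningful on tesla."""
    if not (0 <= cb_id < 128 and 0 <= space < 16 and 0 <= prog_type < 16):
        raise ValueError("field out of range")
    return (cb_id << 12) | (space << 8) | (prog_type << 4) | 1


print(hex(cb_def_set(3, 0x10000)), hex(set_program_cb(3, 2, FP)))
```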

CP

CPs are launched using turing. They're launched in grids consisting of blocks consisting of threads. Blocks are independent units of computation and new ones start execution as soon as there are enough free warps to hold them, giving partly-parallel partly-serial execution. Each block in a grid is identified by its x, y coordinates, and they appear to start execution in (y,x) lexicographical order.

A block, in turn, contains threads identified by their x, y, z coordinates. All threads in a block are guaranteed to be executed in parallel and have access to a barrier instruction that stops thread's execution until all threads in the block have reached it. Each block is also assigned its own s[] memory space that can be accessed by all its threads.

The CP-specific methods are:

method        | name              | description
--------------+-------------------+------------
0x0388        | GRIDID            | Grid ID: freeform 16-bit value
0x03a4        | GRIDDIM           | Grid dimensions: y<<16 | x, both x and y in 0-65535 range.
0x03a8        | SHARED_SIZE       | Size of shared memory per block. Needs to be in units of 0x40 bytes.
0x03ac        | BLOCKDIM_XY       | Block dimensions: y<<16 | x
0x03b0        | BLOCKDIM_Z        | Block dimensions: z
0x02b4        | THREADS_PER_BLOCK | Threads per block
0x02b8        | -                 | Lane enable 1: accepts 0 and 1. 1 is needed for 32 lanes.
0x03b8        | -                 | Lane enable 2: accepts 1 and 2. 2 is needed for 32 lanes.
0x0374        | USER_PARAM_COUNT  | Parameter count, shifted left by 8 bits. Max 64.
0x0600-0x06fc | USER_PARAM[0-63]  | Parameters.
0x02f8        | -                 | Unknown purpose, but you need to put 1 here after setting up all of the above, otherwise 0x368 causes DATA_ERROR.
0x0368        | LAUNCH            | Write 0 here to actually launch the grid.

CPs use 32 lanes if 1 is written to 0x02b8 and 2 to 0x03b8, 16 lanes otherwise.
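Putting the methods above together, a launch could be assembled roughly as follows. This only produces a list of (method, value) writes for turing; code, CB and g[] setup as well as the actual submission are left out. The function name is made up, the write order simply follows the table (only "0x2f8 after everything else, then LAUNCH" is stated above), and passing SHARED_SIZE as a byte count that is a multiple of 0x40 is an assumption.

```python
def cp_launch_sequence(grid, block, threads_per_block, shared_size,
                       params, gridid=0, lanes32=True):
    """Build (method, value) writes that set up and launch one grid."""
    gx, gy = grid
    bx, by, bz = block
    if threads_per_block > bx * by * bz:
        raise ValueError("THREADS_PER_BLOCK > blockdim product is a DATA_ERROR")
    if len(params) > 64:
        raise ValueError("at most 64 user parameters")
    if shared_size % 0x40:
        raise ValueError("shared size must be a multiple of 0x40")  # assumption
    writes = [
        (0x0388, gridid & 0xffff),        # GRIDID
        (0x03a4, (gy << 16) | gx),        # GRIDDIM
        (0x03a8, shared_size),            # SHARED_SIZE
        (0x03ac, (by << 16) | bx),        # BLOCKDIM_XY
        (0x03b0, bz),                     # BLOCKDIM_Z
        (0x02b4, threads_per_block),      # THREADS_PER_BLOCK
        (0x02b8, 1 if lanes32 else 0),    # lane enable 1
        (0x03b8, 2 if lanes32 else 1),    # lane enable 2
        (0x0374, len(params) << 8),       # USER_PARAM_COUNT
    ]
    writes += [(0x0600 + 4 * i, p & 0xffffffff) for i, p in enumerate(params)]
    writes += [(0x02f8, 1), (0x0368, 0)]  # mystery method, then LAUNCH
    return writes


for method, value in cp_launch_sequence((4, 2), (2, 3, 4), 21, 0x40, [7, 8]):
    print(hex(method), hex(value))
```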

For each launched block, all (z,y,x) tuples in range (0,0,0) through (blockdim.z-1, blockdim.y-1, blockdim.x-1) are created, sorted lexicographically. Then the first THREADS_PER_BLOCK of them are taken and assigned to sequential lanes, spanning multiple warps if needed. It is a DATA_ERROR if you LAUNCH with THREADS_PER_BLOCK > blockdim.x*blockdim.y*blockdim.z. So for blockdim.x == 2, blockdim.y == 3, blockdim.z == 4, THREADS_PER_BLOCK == 21, and 16 enabled lanes, threads in a block are assigned like this:

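tid.x | tid.y | tid.z | warp id | lane id
------+-------+-------+---------+--------
0     | 0     | 0     | a       | 0
1     | 0     | 0     | a       | 1
0     | 1     | 0     | a       | 2
1     | 1     | 0     | a       | 3
0     | 2     | 0     | a       | 4
1     | 2     | 0     | a       | 5
0     | 0     | 1     | a       | 6
1     | 0     | 1     | a       | 7
0     | 1     | 1     | a       | 8
1     | 1     | 1     | a       | 9
0     | 2     | 1     | a       | 0xa
1     | 2     | 1     | a       | 0xb
0     | 0     | 2     | a       | 0xc
1     | 0     | 2     | a       | 0xd
0     | 1     | 2     | a       | 0xe
1     | 1     | 2     | a       | 0xf
0     | 2     | 2     | b       | 0
1     | 2     | 2     | b       | 1
0     | 0     | 3     | b       | 2
1     | 0     | 3     | b       | 3
0     | 1     | 3     | b       | 4
1     | 1     | 3     | cut off by THREADS_PER_BLOCK
0     | 2     | 3     | cut off by THREADS_PER_BLOCK
1     | 2     | 3     | cut off by THREADS_PER_BLOCK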
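The assignment rule can be written down as a short software model that reproduces the table above. The function name is made up, and the warp index it returns is relative to the block (0 stands for warp "a", 1 for warp "b"), since which physical warps get picked is not specified here.

```python
def assign_threads(blockdim, threads_per_block, lanes=16):
    """Return [((tid.x, tid.y, tid.z), relative warp, lane), ...] for one block."""
    bx, by, bz = blockdim
    if threads_per_block > bx * by * bz:
        raise ValueError("DATA_ERROR: THREADS_PER_BLOCK exceeds blockdim product")
    assigned = []
    # Nested z, y, x loops walk the tuples in (z, y, x) lexicographic order.
    for z in range(bz):
        for y in range(by):
            for x in range(bx):
                n = len(assigned)
                if n == threads_per_block:
                    return assigned      # the rest is cut off
                assigned.append(((x, y, z), n // lanes, n % lanes))
    return assigned


for tid, warp, lane in assign_threads((2, 3, 4), 21):
    print(tid, "ab"[warp], hex(lane))
```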

When your block starts, shared memory for the block contains the following:

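Address | Size               | PTX name | Description
--------+--------------------+----------+------------
0x0     | 16-bit             | gridid   | Taken straight from GRIDID method.
0x2     | 16-bit             | ntid.x   | Block dimensions, taken from BLOCKDIM*
0x4     | 16-bit             | ntid.y   | Block dimensions, taken from BLOCKDIM*
0x6     | 16-bit             | ntid.z   | Block dimensions, taken from BLOCKDIM*
0x8     | 16-bit             | nctaid.x | Grid dimensions, taken from GRIDDIM
0xa     | 16-bit             | nctaid.y | Grid dimensions, taken from GRIDDIM
0xc     | 16-bit             | ctaid.x  | Coordinates of this block inside the grid, 0 through GRIDDIM.[XY]-1
0xe     | 16-bit             | ctaid.y  | Coordinates of this block inside the grid, 0 through GRIDDIM.[XY]-1
0x10    | USER_PARAM_COUNT*4 | -        | Parameters as specified in USER_PARAM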
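If you have a raw dump of the start of a block's s[] space, the layout above can be decoded along these lines. The function name is made up, `shared` is a hypothetical byte dump, and little-endian byte order is an assumption.

```python
import struct

HEADER_FIELDS = ("gridid", "ntid.x", "ntid.y", "ntid.z",
                 "nctaid.x", "nctaid.y", "ctaid.x", "ctaid.y")


def parse_block_header(shared: bytes, param_count: int = 0) -> dict:
    """Decode the 0x10-byte header and the 32-bit user parameters after it."""
    # Eight 16-bit fields at 0x0-0xe (assumed little-endian).
    header = dict(zip(HEADER_FIELDS, struct.unpack_from("<8H", shared, 0)))
    # USER_PARAM_COUNT 32-bit parameters start at 0x10.
    header["params"] = list(struct.unpack_from(f"<{param_count}I", shared, 0x10))
    return header


dump = struct.pack("<8H2I", 1, 2, 3, 4, 5, 6, 0, 1, 0xdead, 0xbeef)
print(parse_block_header(dump, param_count=2))
```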

Also, at CP start, $r0 contains coordinates of the current thread inside its block:

Global memory

In CPs, there are 16 segments of global memory available. Each of them is independent, and can be either a linear area of VRAM, or a normal 2D surface. In the linear case, address used in g[] reference is simply used as an offset wrt GLOBAL_BASE_*.

However, when bound to a 2D surface, addressing is funnier. The high 16 bits of address passed to g[] are taken as y coordinate, low 16 as byte offset inside a line. Tiling is applied onto that according to specified tiling mode, with GLOBAL_PITCH used as tile pitch [distance between start of consecutive rows of tiles]. The resulting offset is then added to GLOBAL_BASE_* and bastardised with tile flags 0x7000.

Or, if you prefer a formula, for g[REG]:

Accessible area is limited by the GLOBAL_LIMIT method. For linear memory, it's set to size-1, where size needs to be a multiple of 0x100 bytes. An access is out of bounds if REG > GLOBAL_LIMIT. For 2D surfaces, GLOBAL_LIMIT is a bitfield with limits for x and y parts. Access is considered out of bounds if (REG>>16) > (GLOBAL_LIMIT>>16) || (REG&0xffff) > (GLOBAL_LIMIT&0xffff).
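The two bounds checks, written out as a small model. The function name is made up; `reg` is the 32-bit value used in g[REG], and `limit` is whatever was written to GLOBAL_LIMIT.

```python
def g_out_of_bounds(reg: int, limit: int, linear: bool) -> bool:
    """True if an access at g[reg] falls outside the area allowed by GLOBAL_LIMIT."""
    if linear:
        # Linear buffer: limit is size-1, plain comparison.
        return reg > limit
    # 2D surface: high 16 bits are y, low 16 bits are the byte offset in a line.
    return (reg >> 16) > (limit >> 16) or (reg & 0xffff) > (limit & 0xffff)


print(g_out_of_bounds(0x100, 0xff, linear=True))
print(g_out_of_bounds(0x000400ff, 0x000400ff, linear=False))
```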

So, methods [i is index of gi[] space you're setting up]:

method       | name             | description
-------------+------------------+------------
0x1a0        | DMA_GLOBAL       | DMA segment used for all g[] spaces
0x400+(i<<5) | GLOBAL_BASE_HIGH | The base address of gi[] segment
0x404+(i<<5) | GLOBAL_BASE_LOW  | The base address of gi[] segment
0x408+(i<<5) | GLOBAL_PITCH     | Surface pitch for 2D surface, ignored for linear. Must be multiple of 0x100, max 0x800000.
0x40c+(i<<5) | GLOBAL_LIMIT     | Highest allowed address. Needs to be (multiple of 0x100)-1. Has separate x and y parts for 2D surfaces.
0x410+(i<<5) | GLOBAL_MODE      | Bit 0: 0 == 2D surface, 1 == linear buffer. Bits 8-10: tile mode if 2D surface.
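The per-space method offsets and the GLOBAL_MODE value can be computed as in this sketch. The function names are made up; leaving the tile mode bits at 0 for a linear buffer is an assumption, since nothing above says what they do in that case.

```python
def global_methods(i: int) -> dict:
    """Method offsets for the gi[] space, i in 0-15."""
    if not 0 <= i < 16:
        raise ValueError("there are only 16 g[] spaces")
    base = 0x400 + (i << 5)
    return {
        "GLOBAL_BASE_HIGH": base,
        "GLOBAL_BASE_LOW": base + 0x4,
        "GLOBAL_PITCH": base + 0x8,
        "GLOBAL_LIMIT": base + 0xc,
        "GLOBAL_MODE": base + 0x10,
    }


def global_mode(linear: bool, tile_mode: int = 0) -> int:
    """Value for GLOBAL_MODE: bit 0 selects linear, bits 8-10 hold the tile mode."""
    if linear:
        return 1  # assumption: tile mode bits left at 0 for linear buffers
    if not 0 <= tile_mode < 8:
        raise ValueError("tile mode must fit in 3 bits")
    return tile_mode << 8


print({k: hex(v) for k, v in global_methods(3).items()}, hex(global_mode(False, 5)))
```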