Surface Layouts
A surface's layout refers to how pixels are mapped to memory locations. This page discusses the different layouts supported by Nvidia hardware and how to deal with them. Some of the details are educated guesses and may or may not be accurate.
Nvidia hardware generally supports 3 types of layouts (that we know of):
- Linear
- Tiled
- Swizzled
Linear
Linear surfaces are the simplest formats to understand. Pixels are arranged horizontally, one after the other, in rows. Linear surfaces have good spatial locality in one dimension (the x-axis), but not the other; moving 1 pixel along the y-axis produces a relatively large address displacement. The wider the surface is, the larger the displacement. Rows may have some padding at the end for alignment purposes.
Addressing:
bpp = bytes per pixel
Sw = surface width (including padding)
Pixel(x, y) = ( y * Sw + x ) * bpp

(32x32 image)
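The linear formula can be sketched in Python as follows. The function name `linear_offset` and the parameter name `pitch_pixels` (standing in for Sw) are illustrative only and are not taken from any driver.

```python
def linear_offset(x, y, pitch_pixels, bpp):
    """Byte offset of pixel (x, y) in a linear surface.

    pitch_pixels is Sw, the surface width in pixels including padding;
    bpp is bytes per pixel.
    """
    return (y * pitch_pixels + x) * bpp


# Moving one pixel along x advances by bpp bytes, while moving one
# pixel along y advances by a whole (padded) row.
step_x = linear_offset(1, 0, 32, 4) - linear_offset(0, 0, 32, 4)
step_y = linear_offset(0, 1, 32, 4) - linear_offset(0, 0, 32, 4)
```

For a 32-pixel-wide surface at 4 bytes per pixel, the y step is 32 times the x step, which is the poor vertical locality described above.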
Tiled
Tiled surfaces are similar to linear surfaces, except that instead of the entire surface being stored linearly, each tile is stored linearly, and tiles are arranged horizontally, one after the other, in rows. Tiled surfaces give more spatial locality, making better use of caches and prefetching. This is because the tiles are not nearly as wide as the entire surface; therefore moving 1 pixel along the y-axis produces a much smaller (and constant) displacement. However, crossing tile boundaries (especially along the y-axis) will produce larger displacements. The typical tile size on Nvidia hardware is 16x16. Tiled surface dimensions must be a multiple of the tile dimensions.
Addressing:
Tw = tile width
Th = tile height
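Ts = Tw * Th (pixels per tile)
Tx = x / Tw (tile column, integer division)
Ty = y / Th (tile row, integer division)
Tpx = x % Tw (x position within the tile)
Tpy = y % Th (y position within the tile)
Stw = Sw / Tw (surface width in tiles)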
Tile(x, y) = ( Ty * Stw + Tx ) * Ts
Pixel(x, y) = [ Tile(x, y) + Tpy * Tw + Tpx ] * bpp

(32x32 image, 16x16 tiles)
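A minimal Python sketch of the tiled formula, assuming the 16x16 tile size mentioned above as the default. The name `tiled_offset` and its parameters are made up for illustration.

```python
def tiled_offset(x, y, surface_width, bpp, tile_w=16, tile_h=16):
    """Byte offset of pixel (x, y) in a tiled surface.

    surface_width is Sw in pixels and must be a multiple of tile_w.
    """
    if surface_width % tile_w:
        raise ValueError("surface width must be a multiple of the tile width")
    tiles_per_row = surface_width // tile_w      # Stw
    tile_size = tile_w * tile_h                  # Ts, in pixels
    tx, tpx = divmod(x, tile_w)                  # Tx, Tpx
    ty, tpy = divmod(y, tile_h)                  # Ty, Tpy
    tile_start = (ty * tiles_per_row + tx) * tile_size
    return (tile_start + tpy * tile_w + tpx) * bpp


# Every pixel of a 32x32 surface should map to a distinct offset.
offsets = {tiled_offset(x, y, 32, 1) for y in range(32) for x in range(32)}
```

Within a tile, one step along y moves by `tile_w * bpp` bytes regardless of how wide the surface is; crossing into the next row of tiles is where the larger jump occurs.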
Swizzled
Swizzled surfaces are similar to tiled surfaces, but offer even more spatial locality due to their recursive nature. Displacement along either axis starts out very small and gets progressively larger. See Wikipedia for images of the access pattern and a more formal discussion: Z-order (curve). Swizzled surface dimensions must be powers of two.
Note that this has little to do with shuffling the components of a vector register in shader programs and such, which is also called swizzling. Probably the term is used in this case because we are shuffling the bits of pixel x/y coordinates to generate addresses.
Addressing:
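SwizzleBits(x, y) = y[n] x[n] y[n-1] x[n-1] ... y[1] x[1] y[0] x[0] (x[i] and y[i] are bit i of x and y; x[0] is the least significant bit of the result)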
Pixel(x, y) = SwizzleBits(x, y) * bpp

(32x32 image)
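The bit interleaving can be sketched in Python as below. This assumes a square surface, as in the 32x32 example; how the hardware treats non-square power-of-two surfaces is not covered here. The function names are illustrative.

```python
def swizzle_bits(x, y):
    """Interleave the bits of x and y as ... y1 x1 y0 x0."""
    result = 0
    bit = 0
    while (x >> bit) or (y >> bit):
        result |= ((x >> bit) & 1) << (2 * bit)       # x bit -> even position
        result |= ((y >> bit) & 1) << (2 * bit + 1)   # y bit -> odd position
        bit += 1
    return result


def swizzled_offset(x, y, bpp):
    """Byte offset of pixel (x, y) in a swizzled (Z-order) surface."""
    return swizzle_bits(x, y) * bpp


# The first four pixels in memory form a 2x2 block: (0,0) (1,0) (0,1) (1,1).
first_block = [swizzle_bits(x, y) for y in range(2) for x in range(2)]
```

Because each 2x2 block is contiguous, and each 2x2 group of such blocks is contiguous in turn, the displacement along either axis grows only as higher coordinate bits change.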
NV04-NV40
Individual surfaces referenced by the 3D engine have a format, which can be swizzled or not swizzled. NV_SCALED_IMAGE_FROM_MEMORY can be used to swizzle an image while copying it, although it seems to have a size limit of 1024x1024; larger surfaces need to be copied in chunks. In general, swizzling large surfaces this way seems to be slow compared with rendering to a swizzled surface using the 3D engine.
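Splitting a large copy into chunks could look roughly like the sketch below. It only enumerates rectangles that respect the apparent 1024x1024 limit; the helper `copy_chunks` is hypothetical, and actually submitting each rectangle to the hardware is not shown.

```python
def copy_chunks(width, height, max_w=1024, max_h=1024):
    """Yield (x, y, w, h) rectangles that together cover the surface.

    max_w and max_h default to the apparent 1024x1024 limit noted above.
    """
    for y in range(0, height, max_h):
        for x in range(0, width, max_w):
            yield x, y, min(max_w, width - x), min(max_h, height - y)


# A 2048x2048 surface would need four separate copies.
chunks = list(copy_chunks(2048, 2048))
```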
Tiled surfaces on the other hand seem to be set up completely differently. There appear to be MTRR-like registers in the MMIO space, where you can specify an offset+len pair to mark a region of memory as tiled. The number of registers available depends on which chip we're dealing with. Tiled memory regions don't appear linear to the CPU like they do on some other hardware, since we can dump the blob's framebuffer and see that it is obviously tiled. All hardware units reading from/writing to memory probably know to check the tile regs, so most likely you can DMA or render to/from tiled without doing anything more than usual. (What do you get if you copy+swizzle a texture to a tiled region?)
Swizzled surfaces are probably only useful as textures and render targets, not as back or front buffers. This is partly because their dimensions have to be powers of two (POT), which most textures already are but screen-sized buffers usually are not, and partly because they probably can't be used as scanout buffers.
NV50
Nvidia uses 5 sorts of tiles (pitch x height):
(0x00) 64x4
(0x10) 64x8
(0x20) 64x16
(0x30) 64x32
(0x40) 64x64
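The five entries above follow a simple pattern: the pitch stays at 64 and the height doubles each time the value goes up by 0x10. The sketch below encodes that observation; it is a reading of the list, not a documented hardware rule, and the names `TILE_PITCH` and `tile_height` are made up.

```python
TILE_PITCH = 64  # the same for all five listed tile sizes


def tile_height(tile_mode):
    """Tile height for one of the five listed tile mode values."""
    if tile_mode not in (0x00, 0x10, 0x20, 0x30, 0x40):
        raise ValueError("unknown tile mode: %#x" % tile_mode)
    # Assumption: the upper nibble is log2(height / 4), as the list suggests.
    return 4 << (tile_mode >> 4)


heights = [tile_height(mode) for mode in (0x00, 0x10, 0x20, 0x30, 0x40)]
```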
The internal memory layout, which the VM normally hides from you, reveals some information.
tile_flag: 0x1800/0x2800/0x4800/0x6c00
These are for stencil/depth buffer formats. They do byte reordering based on their purpose. They share the same long-range reordering as 0x7a00.
tile_flag: 0x7000
This is a 'standard' recursively tiled format; the subtile sizes are 64x8, 32x4 and 8x2. If your base format is smaller than some of these sizes, those subtile levels should be ignored.
tile_flag: 0x7400
Similar to 0x7000, no reordering is observed. Makes me wonder why it exists.
tile_flag: 0x7a00
Similar to 0x7000, except that the memory is additionally reordered in blocks of N*4096 bytes. When working out alignment, don't forget to take page sizes into account as well.
The periodic structure for nv8x is: +1 -1 +1 -1 +1 -1 +2 +3 -2 +2 -3 -2 (in blocks of 0x2000 bytes, tested on 8400GS)
The periodic structure for nv9x is: +1 -1 +2 +3 -2 +2 -3 -2 +1 -1 +1 -1 (in blocks of 0x4000 bytes, tested on 9600M GT)
The periodic structure for nva0 is: +6 +1 +5 0 -4 -1 -5 -2 (in blocks of 0x7000 bytes, tested on GTX260; is the 448-bit memory interface related to the 0x7000 periodicity?)
The periodic structure for nva5 is: +1 -1 +1 -1 +1 -1 +2 +3 -2 +2 -3 -2 (in blocks of 0x8000 bytes, tested on GT 220)
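This means each block of 8192 / 16384 / 28672 / 32768 bytes (depending on the chip) is shifted several block positions forwards or backwards, as given by the corresponding entry in the periodic structure. It seems this behaviour is enabled by page flags bit 0x0800.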
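One way to read the periodic structures is as a repeating table of per-block shifts: block i moves to position i + shift[i mod period]. The sketch below applies that reading. Which direction the mapping runs (linear to physical or the reverse) is an assumption here, and the names `PERIODS`, `reordered_block` and `reordered_offset` are illustrative.

```python
# (block size in bytes, shift per block within one period), from the list above.
PERIODS = {
    "nv8x": (0x2000, [+1, -1, +1, -1, +1, -1, +2, +3, -2, +2, -3, -2]),
    "nv9x": (0x4000, [+1, -1, +2, +3, -2, +2, -3, -2, +1, -1, +1, -1]),
    "nva0": (0x7000, [+6, +1, +5, 0, -4, -1, -5, -2]),
    "nva5": (0x8000, [+1, -1, +1, -1, +1, -1, +2, +3, -2, +2, -3, -2]),
}


def reordered_block(block, shifts):
    """Block index after applying the periodic shift for this block."""
    return block + shifts[block % len(shifts)]


def reordered_offset(offset, block_size, shifts):
    """Byte offset after long-range reordering (direction is assumed)."""
    block, within = divmod(offset, block_size)
    return reordered_block(block, shifts) * block_size + within


def is_permutation(shifts):
    """True if the shifts rearrange the blocks of one period among themselves."""
    n = len(shifts)
    return sorted(reordered_block(b, shifts) for b in range(n)) == list(range(n))
```

A useful sanity check on the observed sequences is that each one permutes the blocks within a single period, so no two blocks land on the same position; `is_permutation` tests exactly that.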
tile_flag: 0xe000
No long range reordering is observed. This is used for DEPTH32 STENCIL8; it first writes 32 dwords of Z, then skips 24 dwords, then writes 8 dwords of stencil.
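The 32 / 24 / 8 dword pattern can be sketched as follows. Several things are assumed rather than observed: that the pattern repeats every 64 dwords, that each sample has one Z dword and one stencil byte (so 8 stencil dwords cover the same 32 samples as 32 Z dwords), and that samples are stored in order within a group. How a sample index relates to pixel x/y is not addressed.

```python
Z_DWORDS = 32        # dwords of Z per group
SKIP_DWORDS = 24     # unused dwords between Z and stencil
STENCIL_DWORDS = 8   # dwords of stencil per group
GROUP_BYTES = (Z_DWORDS + SKIP_DWORDS + STENCIL_DWORDS) * 4


def z_offset(n):
    """Byte offset of the Z value for sample n (assumed ordering)."""
    group, i = divmod(n, Z_DWORDS)
    return group * GROUP_BYTES + i * 4


def stencil_offset(n):
    """Byte offset of the stencil byte for sample n (assumed ordering)."""
    group, i = divmod(n, Z_DWORDS)
    return group * GROUP_BYTES + (Z_DWORDS + SKIP_DWORDS) * 4 + i
```

Under these assumptions each group occupies 256 bytes, with Z in the first 128 bytes and stencil in the last 32.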

