arthur DOT huillet AT free DOT fr
Location: Grenoble, France (CET)
Inactive nouveau developer
Hardware : NV04, NV05, NV11, NV15, NV15GL, NV18 - probably dead, not sure -, NV25, NV28, NV30GL, NV92
2007 summer project, as part of X.org's Vacation of Code : improve XVideo support in nouveau.
Topics tackled : Xv (see below), XvMC (not written by me, not quite interested in the topic either), sending commands to the card (objects, channels, subchannels), context switching (I rewrote the page ContextSwitching), DMA, DRM pre-TTM memory management, EXA, XRENDER, NV04 3D engine, NV10 3D engine, register combiners.
This page tries to explain some of the things that I learnt, so that everyone can follow my work, and will hopefully be a useful resource.
I have detailed my work in chronological order, starting from "Activity report" below.
The XVideo extension (a.k.a. Xv) provides a means for media players to output video directly onto the screen. Xv itself is described in several other places, so there are no details about it here, just notes and comments on the topics I had trouble finding information about.
We are only interested in XvPutImage (function to put a frame of video on the screen). Video IN (XvPut / GetVideo) is not on my todo list.
A definite reference : http://people.freedesktop.org/~mhopf/fosdem_2007_xvideo_opengl.pdf
Blitter, overlay, texture ...
Three ways to put a video frame on the screen
Overlay : fill in the screen portion with a colorkey (usually blue), copy the frame in an offscreen portion of video RAM. The CRTC will jump to this zone upon encountering the colorkey. Supported : NV04 (specific overlay) and NV10 - NV40 (only the first NV40s, others have to use blitter/texture).
Blitter : copy the frame in an offscreen portion of the video RAM. Then the card will blit the data to the framebuffer (VRAM->VRAM, that's fast). Supported since NV04.
Texture : draw a textured quad. Possible to use shaders for format transformation. Supported : NV17+
The nv driver, as a 2D-only driver, implements blitter and overlay, with the YUY2 pixel format only. Nouveau used the nv code, which gave a slow (CPU data copies) and feature-limited Xv.
Videos are usually not in RGB; instead they are [shot and] encoded in YUV, where Y stands for luminance and UV for chrominance (color). The eye has a lower resolution for chroma than for luma, so fewer chroma samples are taken: e.g. YUYV. Those formats need to be converted to the RGB colorspace before being displayed. This involves gathering the Y, U and V samples that apply to a given pixel, and multiplying them by a matrix (the conversion between RGB and YUV is linear). The colorspace conversion is done by the card, which receives either RGB or YUY2 format (http://www.fourcc.org/). Update: in fact the overlay is able to receive YV12 format, which is the most common on this planet. See below in the chronological progress reports.
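To make the "gather the samples, then multiply by a matrix" step concrete, here is a minimal software sketch of what the card does in hardware. The coefficients are an assumption: they are the common BT.601 full-range (JFIF) ones, and nothing on this page says which matrix the hardware really uses. The function names are made up for the illustration.

```python
def clamp8(x):
    """Round to the nearest integer and clamp to the 0..255 range."""
    return max(0, min(255, int(round(x))))


def yuv_to_rgb(y, u, v):
    """Convert one pixel's Y, U, V samples (0..255) to an (r, g, b) tuple.

    Assumption: BT.601 full-range coefficients, chroma centred on 128.
    """
    d = u - 128
    e = v - 128
    r = y + 1.402 * e
    g = y - 0.344136 * d - 0.714136 * e
    b = y + 1.772 * d
    return clamp8(r), clamp8(g), clamp8(b)


def yuy2_to_rgb(data):
    """Convert packed YUY2 bytes (Y0 U Y1 V per pair of pixels) to RGB tuples.

    Two horizontally adjacent pixels share one U and one V sample, which is
    the "fewer chroma samples" idea from the paragraph above.
    """
    pixels = []
    for i in range(0, len(data), 4):
        y0, u, y1, v = data[i:i + 4]
        pixels.append(yuv_to_rgb(y0, u, v))
        pixels.append(yuv_to_rgb(y1, u, v))
    return pixels
```

For example, yuy2_to_rgb(bytes([0, 128, 255, 128])) yields one black and one white pixel, because neutral chroma (128) leaves only the luma.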
Video data transfer
As mentioned above, the card accepts RGB or YUY2 format; this means that any other pixel format such as YV12 (mpeg) has to be converted. This is currently done in software. Nvidia cards support YV12 natively, and the old Xv code didn't make use of that. The reverse engineering of the YV12 overlay was done; the documentation is further down.
The data is transmitted by the CPU (iterate over dst=src). While this would be fine for small chunks of data, this is unacceptable for things as large as video frames that are transmitted around 24 times a second. To lower CPU usage and increase performance, DMA transfers have to be used. See below for more details.
I tried to document it as much as possible : ContextSwitching. This is not related to Xv in any way but is an important topic to understand.
My plans are the following : at first I want to implement, by copypasting nv_exa code, AGP DMA. The reason behind that is that it can be done easily, directly from XVideo code, while helping me fully understand and structure the code correctly.
The very first thing is to clean up the current code : right now, the NvCopy* functions do two things at once - convert YUYV to YUV, and copy YUV data to the card. The conversion part will be kept in software, because, for now, I don't know if the card can do it, and it's probably easier to keep software conversion for a while. So what I'm going to do first is to make the driver convert to YUV format in AGP memory, and then DMA it to the card.
Current behavior : YUYV data in system memory - - - - -> YUV data in VRAM, copy being done by the CPU
Target behavior : YUYV data in system memory -- - - - -> YUV data in system memory (AGP memory actually), copy done by CPU, - - - - - --> YUV data in VRAM, copy by AGP DMA
At that point I expect to have working DMA transfers for at least AGP cards. This won't last though: the second step will be to make use of the DMA functionality provided by the DRM - I will have to change a few things because currently the DRM doesn't support page table DMA.
In the first step I won't have to wonder (much) how DMA works with Nvidia cards : I expect that copypasting EXA's UTS code will be enough. I'll have to manually create DMA objects and all when I'm at the second stage working on page tables.
I do know you won't like this idea of implementing AGP DMA first, but I really need it for my understanding of things; besides, it's better than nothing, and the very first thing to do (restructuring the code to separate conversion and copy to VRAM) has to be done anyway.
This step is done. Just to comment: the conversion is actually to YUY2, not YUV, which seems to be unsupported.
Now that I have AGP DMA working, I will be modifying the DRM to support PCI DMA. This will be done by creating a software PCIGART.
I am going to allocate some RAM space in the DRM (say 4MB), and get the DDX to allocate its GARTScratch zone in there. I will create one PCI DMA object that contains all the pages from the mentioned RAM space.
"TT", when referring to AGP memory, means "Translation Table" (TTM terminology).
I will create pci_heap, separate from agp_heap, so that AGP users can access this PCI zone as well, e.g. when doing 3D.
Objects are specific to the fifo they were created on, so practically the DRM creates one AGP and one FB DMA object per FIFO. However, before NV50, objects created on different FIFOs are not technically separated - an AGP object is 16 bytes large so we can create one per FIFO, but PCI objects, due to the page list, will be much larger, and for < NV50, I will create only one and add only the RAMHT entries, in order to not eat up all RAMIN space, especially since older PCI cards may be quite limited RAMIN-wise.
The PCI buffer would be created using the existing DRM semantics (create pci_heap, use drm_sg_alloc, etc.)
There is no GART in PCIE as there was in AGP, because there usually is an IOMMU (the same thing, but available to the whole bus). Early Intel implementations did not have such IOMMUs. (http://lwn.net/Articles/91870/) For now, we will not use the IOMMU of PCIE. Talked with marcheu about it, the conclusion is that using a GART or doing sgDMA probably doesn't differ much for Nvidia cards, but this is something that should be measured. I will do this once I have sgDMA working.
The memory allocation logic in the DRM uses an offset inside a given block to address a portion of memory. The offset in the case of PCI DMA is the kernel virtual address of the block start, minus the kernel virtual address of the sg zone start.
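The offset rule above is plain pointer arithmetic. As a small illustration (the function name, the addresses and the bounds check are all made up for the example; the real code lives in the DRM and is written in C):

```python
def sg_offset(block_start_vaddr, sg_start_vaddr, sg_size):
    """Offset of a block inside the scatter-gather zone.

    Both addresses are kernel virtual addresses; the result is what the
    allocator would use to address the block inside the PCI DMA zone.
    """
    offset = block_start_vaddr - sg_start_vaddr
    if not 0 <= offset < sg_size:
        raise ValueError("block does not lie inside the sg zone")
    return offset


# Hypothetical addresses, with the 4MB zone size mentioned earlier.
SG_START = 0xFFFF880000100000
SG_SIZE = 4 * 1024 * 1024
example = sg_offset(SG_START + 0x2000, SG_START, SG_SIZE)
```

Here example is 0x2000: the block starts two 4KB pages into the zone.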
Now we have PCI DMA integrated upstream. At the time of this writing, PCI DMA object creation works only on x86 and x86-64, and someone seems to be working on it for PPC64. PCI DMA reportedly sped up EXA on PCI-Express machines, according to some gtkperf results. My own tests showed that AGP DMA was slower than PCI DMA for gtkperf, though we were not able to figure out why. This is an important point, that I consider low priority for the time being, because I want to finish my Xv stuff.
Current status is that my work improved things for PCI-Express users, except for people with NV50s because those don't deal with DMA objects the same way as other Nvidia cards.
With help from a lot of people on #nouveau, and oprofile, we did some profiling of XVideo to find out why it was so slow. We made some interesting discoveries. The test results (you're only interested in system.symbols in each directory) are here : http://people.freedesktop.org/~ahuillet/xvtest/ The xvtest program is Keithp's xvtest modified to display a given number of frames (5000 in the 256x256 test, 10000 in the 800x600 test) as fast as possible, either in YUY2 or in YV12.
p0g profiled the blob on xvtest program, both for YUY2 and YV12. The results are that the blob is a little faster than nouveau Xv for YUY2 (10% of difference), and 25% faster on YV12 data than on YUY2, which exactly corresponds to the difference in size of pixel data (16bpp against 12bpp), which makes us think that the blob does no software conversion. This will have to be checked.
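The 16bpp against 12bpp figures can be checked with a little arithmetic. This sketch only computes frame sizes; it assumes even dimensions and tightly packed frames (no pitch padding), and the function name is made up.

```python
def frame_bytes(width, height, fmt):
    """Size in bytes of one tightly packed video frame."""
    if fmt == "YUY2":
        # Packed 4:2:2: 2 bytes per pixel (16 bits per pixel).
        return width * height * 2
    if fmt == "YV12":
        # Planar 4:2:0: a full luma plane plus two quarter-size chroma
        # planes, i.e. 1.5 bytes per pixel (12 bits per pixel).
        return width * height * 3 // 2
    raise ValueError("unknown format: " + fmt)


yuy2 = frame_bytes(800, 600, "YUY2")   # 960000 bytes
yv12 = frame_bytes(800, 600, "YV12")   # 720000 bytes
ratio = yv12 / yuy2                    # 0.75
```

An 800x600 YV12 frame is therefore 25% smaller than the same frame in YUY2, which is the size difference the paragraph above refers to. At 24 frames per second, the YUY2 stream is 960000 * 24 = 23040000 bytes per second.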
The conclusions are the following:
- The DMA transfers are slowed down by the busywait for the DMA notifier. This slowdown is 10% for AGP, and more than 50% for PCI. Removing the wait brought AGP and PCI DMA to the same performance level, with text corruption obviously.
- Blitter is slowed down by the NVSync call, that could probably be removed.
- We will need to valgrind-mmt the blob to find out whether it's doing YV12 -> YUY2 conversion, and find out how to avoid it, if relevant
We have DMA working for everything, and did some profiling. The results were not very good, and more improvements must be made to unleash the full potential of DMA. (now this sentence rocks !)
Implementing fencing based on interrupts would speed up 2D a lot, but I will not be doing that because things are unclear now as to whether we will be using TTM or not. This means status quo on fencing for some time. Instead, I have started work on host-side double buffered Xv (2 host buffers per Xv port), which will eat more memory but can be a lot faster, because we are not going to wait too long. It turned out to significantly improve performance.
2007-07-29: I have fixed a bug reported by pq, that my Xv code broke display of tvtime (a TV receiver software). This was due to me blatantly ignoring a few things regarding how X expects Xv implementations to behave. It turned out that there are use cases (tvtime with overscan enabled), in which the data transmitted to DDX's PutImage implementation is not to be displayed entirely (e.g., you only have to display a part of it). In this case, the way I was doing DMA transfers broke everything, because I was copying all the data and pointing the card at the wrong place.
In fact, the memory layout of the data transfer as it was done before (read: in nv) and as it is done now (read: in my code) is probably not so easy to understand, at least for beginners. I do not know how you do it, but I usually have a pen and a lot of paper around my computer, and I use them to clarify my thinking. Also, I talk to myself, a lot, which surprises many people. I ended up with a diagram that is not too bad at explaining the memory layout of the different things. I will therefore redo a more readable (and English) version and upload it right here; it will spare me lots of lines of text. Update: the document can be found here. It's not too readable but it explains how things work. Note: the overlay/blitter are pointed at the place where there would be data if we copied everything we are handed by the X server, that is, they are pointed right at potentially (though usually not) untouched VRAM - garbage. The hardware probably expects it this way, so I imitated the behavior.
Long time since the last progress report. I have worked mostly on reverse engineering the YV12 overlay, and implemented it in our Xv. I will not write about it here because I did it extensively on IRC and in the actual Xv code, so refer to one of those, or contact me directly if you have questions.
Let me do a quick bit of documentation of the YV12 overlay, for later reference: the card defines registers at PVIDEO + 0x800, 0x808 and 0x820, which follow the semantics of those at PVIDEO + 0x900 (BASE, LIMIT, and OFFSET_BUFF), but apply to the color plane. The FORMAT register uses bit (1 << 0) to denote usage of a separate color plane. So, you'd set (1 << 0) in the format register, write the correct things in the 0x8xx registers, and the overlay will read your planar data. The planar data actually is NV12 or NV21, that is, one luma plane (with a pitch aligned on 64-byte boundaries), followed by one chroma plane of alternating V and U (VUVUVU) (pitch aligned on 64-byte boundaries as well).
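The layout described above can be sketched as a software repacking of a YV12 frame. This is an illustration only, not the driver code: it assumes a tightly packed YV12 input (Y plane, then V plane, then U plane), even dimensions, and invented function names. It does not touch any register.

```python
def align(value, boundary):
    """Round value up to the next multiple of boundary."""
    return (value + boundary - 1) // boundary * boundary


def yv12_to_overlay_planes(src, width, height, pitch_align=64):
    """Repack YV12 into a luma plane and an interleaved VU chroma plane.

    Returns (luma, chroma, pitch). Both planes use the same pitch, since a
    chroma row holds width/2 (V, U) pairs, i.e. width bytes.
    """
    cw, ch = width // 2, height // 2
    y_size, c_size = width * height, cw * ch
    if len(src) != y_size + 2 * c_size:
        raise ValueError("unexpected YV12 buffer size")
    v_plane = src[y_size:y_size + c_size]
    u_plane = src[y_size + c_size:]
    pitch = align(width, pitch_align)

    # Luma: copy row by row, leaving the padding bytes at zero.
    luma = bytearray(pitch * height)
    for row in range(height):
        luma[row * pitch:row * pitch + width] = src[row * width:(row + 1) * width]

    # Chroma: interleave V and U samples (VUVUVU...).
    chroma = bytearray(pitch * ch)
    for row in range(ch):
        base = row * pitch
        for col in range(cw):
            chroma[base + 2 * col] = v_plane[row * cw + col]
            chroma[base + 2 * col + 1] = u_plane[row * cw + col]
    return bytes(luma), bytes(chroma), pitch
```

With a 4x2 frame, the pitch becomes 64, so each 4-byte luma row is followed by 60 bytes of padding, and the single chroma row starts with V0 U0 V1 U1.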
I've run into a bandwidth problem with the overlay when playing Elephants Dream 1024p. I can work around it by reducing the size of the source image to only what's necessary, that is, what will go on screen. I implemented that and this indeed got rid of the bandwidth problem. Xv is pretty much done now, and ready for wide testing on NV10 - NV40 (included) chips. NV04 will be done later when I get one of those cards, and NV50 will have to wait a little more because of the necessity to implement a texture adapter, my next target being NV10-NV20 EXA.
EXA Composite profiling - gkrellm, firefox, xchat, fluxbox menus, rxvt
Results:
http://people.freedesktop.org/~ahuillet/EXA_profile.txt
http://people.freedesktop.org/~ahuillet/EXA_profile2.txt
Here we see what operations we need to support in Composite.
I discussed the technical issues with marcheu, relevant to my forthcoming EXA implementation. We took some decisions regarding card classes.
Things have changed due to some problems I experienced; the lines below are no longer relevant. NV04 and NV10 share a common NV04_DX6_MULTITEXTURED_TRIANGLE (0x55/0x95) implementation. This is motivated by the possibility of sharing the exact same code for those cards, thereby providing fast EXA on NV04 and NV10 in one shot, and by the ease of use of NV04_TRI (which is an important factor for me ATM).
Note that to date, the chosen approach has a little problem: while we know NV04_TRI can work on NV10 cards, I did not manage to successfully run nv04_demo on my NV18 (the demo works but no triangles are drawn).
I received hardware donations from two people - I now have NV04/NV05 hardware as well as NV1x class cards for later testing. Due to NV04_TRI not working on NV10 yet for unknown reasons, I'll be starting my work on NV04. Unfortunately, X is unable to start with this card - before anything gets on screen, X exits, apparently because of some FIFO problem (it looks like we "loop" and joyfully overwrite the previous commands which have not made it to the card yet - unconfirmed). Reason is unknown, but getting NV04 to work is critical before I can do much EXA stuff. Marcheu and I are to investigate and fix this.
We still have some bandwidth problems with NV12 overlay, reported on a NV30 card. The bug exists in "nv" as well. Reported and unfixed (yet) in "nv".
Long time no report, because school began again. Two topics today : NV04 sw methods and NV10 EXA.
NV04 seems to lack the SET_CONTEXT* hw methods, as nvsdk suggests they are implemented in software. The card issues a PGRAPH NOTIFY interrupt, with ILLEGAL_METHOD. Offsets < 0x200 are software methods. I generated mmio dumps of my NV04 starting with the blob in order to see what those methods did, and tried to make sense out of it. It took me about one week of work before matc made me realise that I had been mistaken in my interpretation of the dump, and provided what seems to be a sensible explanation of what I had to do. The code needs to be ported to this "new" interpretation, and tested.
I wasted a lot of time due to something apparently related to my test system, which would boot the card only 50% of the time - the card hung on the SKIPS at the very beginning of the FIFO. Changing the system (motherboard, CPU and RAM) to the one I found in my building's dumpster seemed to "fix" the problem. (note: it happened to me both with NV04 and NV11)
NV04 EXA turned out to be difficult because, as my testing in nv04_demo and other investigations showed, the card does not seem to support untiled (non-swizzled) textures. Currently, tracking the swizzle/unswizzle state in the DDX is out of the question, so the NV04 EXA idea is dropped, at least for now. It is noteworthy that pq managed to get nv10_demo to work on an NV2x card, so we'll be doing a one-shot EXA implementation for both NV1x and NV2x classes using NV10_TCL. Due to my struggling with NV04_TRI (I did the work on my NV05 btw, since NV04 doesn't start), and with NV04 software methods, I was slow starting NV10 EXA, and marcheu did it for me. It is not complete right now since it lacks proper register combiner setup for no mask (single texturing) and masked (2 textures) drawing. I have been unsuccessful as of today in getting NV10_TCL to texture a triangle/quad for me, presumably due to incorrect register combiner setup - I can get plain quads (matching vertex colors), or plain black quads when I try to set up the combiners. They seem a bit difficult to use anyway, even on the GL_NV_register_combiners side, so it will take some time. I hope init and all are fine, because I'm starting to be a bit fed up with all the recent failures.
Gave up NV04 sw method for now, will wait for some success to raise my motivation. What's next? Working NV10 texturing
NV04 sw methods are now "mostly" in - they are implemented and allow the card to properly start X, however I get caught in a STATE_INVALID interrupt storm for 0x5f (IMAGE_BLIT) : 0x308. So X starts and works mostly, but it's not perfect. 3D sw methods are not done just yet. Stopped work on NV04 for now, as it is "mostly functional" (= you can start X and watch a movie, which is about all you should be doing with this card anyway).
hkBst started work on documenting PVIDEO registers in rules-ng with my help. NV10+ PVIDEO regs are almost done; what is left are the NV04 PRAMDAC overlay regs, and maybe the TEXTURE_FORMAT bits that I found in DirectFB. Those guys seem to know a lot about nvidia cards... Marcheu wants to race with NV30 EXA and jkolb against p0g and me on NV10 EXA. Pray for us :) Our current status is that while TCL seems to work fine enough for 2D stuff, we could not get a correctly textured quad.
The last month has seen a lot of work on NV10 EXA. We managed to get textured quads and then worked on implementing various Composite operations. As source and mask formats, we support A8R8G8B8 and X8R8G8B8 (cheating with register combiners), and A8. We can support R5G6B5 for 16bit formats. As destination we do what the card can do : A8R8G8B8 and (potentially) R5G6B5. No A8 destination is supported, which makes it very difficult to accelerate A8 + A8 operations which are heavily used in text rendering. We have no support for ABGR Over ARGB with a mask either, because of a lack of register combiners. I spent one week trying several hacks for A8 + A8, without success. AndrewR provided some profiling data: http://people.freedesktop.org/~ahuillet/NV10_EXA_accel-2007-11-08 http://people.freedesktop.org/~ahuillet/NV10_EXA_fallbacks_synthetic-2007-11-08
As we can see on the second file, we "only" have A8 + A8 and ABGR over ARGB with mask to support, then we're supposedly done.
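For readers new to Composite, this is roughly what a masked Over operation computes per pixel, written as a software sketch. It shows the Render operator itself (source IN mask OVER destination, with premultiplied alpha), not the register combiner setup used in the driver; the function names and the 8-bit rounding are choices made for the example.

```python
def mul8(a, b):
    """Multiply two 0..255 values treated as fractions of 255, with rounding."""
    return (a * b + 127) // 255


def over_with_mask(src, mask_alpha, dst):
    """Composite src IN mask OVER dst for one pixel.

    src and dst are premultiplied (a, r, g, b) tuples with 0..255 channels;
    mask_alpha is the A8 mask value, or None when there is no mask.
    """
    if mask_alpha is not None:
        # IN: scale every source channel by the mask's alpha.
        src = tuple(mul8(c, mask_alpha) for c in src)
    inv_alpha = 255 - src[0]
    # OVER: source plus destination attenuated by (1 - source alpha).
    return tuple(min(255, s + mul8(d, inv_alpha)) for s, d in zip(src, dst))
```

An opaque source with a full mask replaces the destination, and a mask of 0 leaves the destination untouched. Text rendering relies heavily on this kind of masked drawing, with the glyph coverage coming from an A8 mask.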
2007-12-03: Late update of this page. NV1x EXA is now ready - we do A8 + A8, but as I said above ABGR over ARGB with mask is not possible to do, except maybe on NV2x cards which have more regcombs, but it would require using NV20_TCL, which implies doing a separate implementation. So it is not in our plans ATM.
As far as A8 + A8 is concerned, the details are in nv10_exa.c. I still have a little PPC endianness-related bug to correct (the masked out bytes for left and right borders are not computed correctly).
So NV1x EXA is done and I'll be moving on to NV2x randr12 as soon as possible.
Long time no update, for various reasons. I'd like to keep this habit of writing progress reports from time to time because I think it can be useful and interesting to newcomers and people who care about what I'm doing on nouveau (Xv, EXA...). I have to make this one very short though. I don't remember what was done in the past six weeks exactly, but it's mostly nothing. Notably, I haven't been working on NV2x randr12 because it is stillunknown's and malc0's territory, and they are doing very well. There have been a few troop movements on the Xv front though - image quality improvements and cleanup of source code. Following stillunknown's and marcheu's NV40 texture adapter were some questions about the image quality, which led me to work on the blitter. Our YV12 to YUY2 conversion, inherited from nv, was doing a mere copy of chroma samples, whereas a linear interpolation could easily be done. I implemented that and got confirmation that it was better than before. The speed impact is unknown, reports appreciated (contact me on IRC to ask for the procedure). I also did a bit of cleanup on the Xv code, removing useless stuff and moving every adapter into its own file (nv04_video_blitter.c, nv04_video_overlay.c, nv10_video_overlay.c, nv40_texture_adapter.c). What's coming next? Nothing on my todo list, as NV1x 3D seems not too possible right now...
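To show the difference between copying chroma samples and interpolating them, here is a small software sketch of a YV12 to YUY2 conversion. It is not the driver code: the input is assumed to be tightly packed YV12 (Y, then V, then U) with even dimensions, and the interpolation is assumed to be a plain average of two neighbouring chroma rows, since the exact weights used in the driver are not given here.

```python
def yv12_to_yuy2(src, width, height, interpolate=True):
    """Convert tightly packed YV12 to packed YUY2 (Y0 U Y1 V).

    YV12 has one chroma row per two picture rows. Without interpolation,
    each chroma row is simply reused for both rows. With interpolation,
    the second row of each pair averages its chroma row with the next one.
    """
    cw, ch = width // 2, height // 2
    y_size, c_size = width * height, cw * ch
    v_plane = src[y_size:y_size + c_size]
    u_plane = src[y_size + c_size:y_size + 2 * c_size]
    out = bytearray(width * height * 2)
    for row in range(height):
        c_row = row // 2
        if interpolate and row % 2 == 1:
            other = min(c_row + 1, ch - 1)   # clamp at the bottom edge
        else:
            other = c_row
        for pair in range(cw):
            u = (u_plane[c_row * cw + pair] + u_plane[other * cw + pair] + 1) // 2
            v = (v_plane[c_row * cw + pair] + v_plane[other * cw + pair] + 1) // 2
            o = (row * width + 2 * pair) * 2
            out[o] = src[row * width + 2 * pair]
            out[o + 1] = u
            out[o + 2] = src[row * width + 2 * pair + 1]
            out[o + 3] = v
    return bytes(out)
```

On a 2x4 frame with chroma rows U = 50, 150 and V = 100, 200, the second picture row gets U = 100 and V = 150 when interpolating, instead of a repeat of 50 and 100.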