The LKML thread: http://lkml.org/lkml/2007/1/25/184
Recursive probe hit on SMP
Reported by jb17some on 27th March 2008 with a patched 2.6.25-rc6.
Mar 26 21:19:02 amd64linuxmsi kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module 169.12 Thu Feb 14 17:51:09 PST 2008
Mar 26 21:19:08 amd64linuxmsi kernel: mmiotrace: ioremap_*(0xfd000000, 0x1000000) = ffffc20004d80000
...
kmmio: recursive probe hit on CPU 0, for address 0xffffc20004d80140. Ignoring.
BUG: unable to handle kernel paging request at ffffc20004d80140
IP: [<ffffffff8845fad2>] :nvidia:_nv003829rm+0x16/0x1b
PGD 77a34067 PUD 77a35067 PMD 54ada067 PTE 80000000fd00017a
Oops: 0002 [1] SMP DEBUG_PAGEALLOC
With a single processor this bug does not trigger, which excludes the possibility of an instruction that legitimately faults twice.
Discussion and solution options: http://lkml.org/lkml/2008/3/28/391
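For context, the "recursive probe hit" message comes from a guard in the kmmio fault handler: each CPU remembers that it is in the middle of handling a probe, and if the handler is re-entered before the pending single step completes, it ignores the new hit instead of recursing. A minimal sketch of the idea (illustrative names and modern per-CPU accessors, not the actual kmmio code):

#include <linux/percpu.h>
#include <linux/printk.h>
#include <linux/smp.h>

/* One flag per CPU: set while a probe hit is being single-stepped. */
static DEFINE_PER_CPU(int, kmmio_busy);

/* Called from the page fault handler. Returns 1 if the fault was
 * taken over by the tracer, 0 to fall back to normal fault handling. */
static int kmmio_probe_hit_sketch(unsigned long addr)
{
    if (this_cpu_read(kmmio_busy)) {
        /* Second fault before the first single step finished. */
        pr_warn("kmmio: recursive probe hit on CPU %d, for address 0x%lx. Ignoring.\n",
                smp_processor_id(), addr);
        return 0;
    }
    this_cpu_write(kmmio_busy, 1);
    /* ... re-arm the page, record the access, set the TF flag;
     * the debug exception handler clears kmmio_busy afterwards ... */
    return 1;
}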
Hard lockup while tracing NV50 hardware
Netconsole does not output anything. Sometimes the cpu0 log gets tens of thousands of lines, sometimes it's empty. A user reported having to run X with the blob before starting to trace with the hooked blob. Koala_BR managed to get traces by using filter_offset.
There was a mapping of the range 0x0-0x600 in PCI space; I tried excluding that, but it doesn't help. The blob also maps addresses above 0xf0000000 multiple times, where there is nothing as far as I can see in lspci. I haven't tried excluding those yet.
I implemented a module parameter to disable the actual MMIO tracing in mmio.ko. This gets me a list of would-be-traced addresses.
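Sketched below is what such a parameter can look like; the parameter name and details are made up here, not the actual mmio.ko interface. The ioremap hook still logs the range, it just skips arming the pages:

#include <linux/module.h>
#include <linux/moduleparam.h>
#include <linux/io.h>

/* Illustrative parameter name; pass e.g. mmio.notrace_ranges=1. */
static bool notrace_ranges;
module_param(notrace_ranges, bool, 0444);
MODULE_PARM_DESC(notrace_ranges,
                 "log would-be-traced ranges without actually tracing them");

static void hook_ioremap_sketch(resource_size_t offset, unsigned long size,
                                void __iomem *addr)
{
    pr_info("mmiotrace: ioremap_*(0x%llx, 0x%lx) = %p\n",
            (unsigned long long)offset, size, addr);
    if (notrace_ranges)
        return;         /* only list the range */
    /* ... register a kmmio probe covering [addr, addr + size) ... */
}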
Among the three users (me, cbaoth and clearer), all have 64-bit Intel CPUs, but not all run 64-bit Linux. Clearer is running an nv40, though.
evidence: hard lockup, netconsole is useless. After the blob has run once, mmio-tracing works from then on. Sometimes the MAP and UNMAP events make it into the trace file before the machine freezes, but I never see the corresponding printk's in the netconsole kernel output. The NMI watchdog seems to detect a lockup, but the report never made it to disk. The only thing seen in the NMI report is something in spinlock code.
cbaoth says he gets nothing on the serial console, but the keyboard caps and scroll lock LEDs blink. Even after I added a per-CPU kmmio context, things did not change. It does not look like double faulting anymore, as that should now trigger a loud panic. All this is with blob version 100.14.19 or the one before it. Version 169.07 is said to work okay with mmio-trace.
hypothesis: For lack of better ideas, let's blame nvidia for implementing a deadly race condition that normally never triggers, but mmio-trace screws the odds and makes it trip. Or for mapping stuff that is not theirs.
status: On hold, until mmio-trace is in mainline kernel and the "unknown opcode" issue is fixed.
- See if the kernel boot option mem=nopentium makes a difference. In theory it should not...
- May be debugged using remote DMA over FireWire (the DMA busmastering hardware of FireWire can work even when the CPU is stuck with interrupts disabled, but the PCI bus and the DRAM controller must be alive). Quick howto: connect two boxes with FireWire, load ohci1394 on both, copy System.map from the lockup machine to the debug machine, install firescope (grep -r firescope Documentation/ in 2.6.25-rcX), run it and get the dmesg buffer readout working, then run mmiotrace. If firescope works during that lockup, you can also use a version of fireproxy (I think 0.33) to inspect the RAM of the locked-up machine using gdb with the kernel's debug symbols.
Null pointer dereferences in single step handler
existence: http://people.pwf.cam.ac.uk/sb476/nv/derefcrash.txt
hypothesis: We also seem to have a race where the page fault handler is called again before we can single-step the faulting instruction.
This could perhaps happen if an instruction caused two faults on different pages. However, this hypothesis has not been confirmed as the cause of the problem we see.
status: The code was updated to not touch the current page pointer when the address that caused the page fault is not being traced. This fixed the problem for sb.
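That is, the handler now bails out before writing any per-CPU state if the faulting address is not on a traced page. A rough sketch of the check (the helper names are assumptions, not the real kmmio functions):

#include <asm/page.h>

struct kmmio_probe;
/* Assumed helper: looks the page up in kmmio's list of armed pages. */
extern struct kmmio_probe *lookup_traced_page(unsigned long page_addr);
/* Assumed helper: stores the per-CPU pointer that the single-step
 * handler dereferences later. */
extern void set_current_probe(struct kmmio_probe *p);

static int kmmio_fault_sketch(unsigned long addr)
{
    struct kmmio_probe *p = lookup_traced_page(addr & PAGE_MASK);

    if (!p)
        return 0;   /* not ours: do NOT touch the current page
                     * pointer; let the normal fault path run */

    set_current_probe(p);   /* safe: the single step that follows
                             * is for a traced page */
    return 1;
}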
Something bad happens with the relay channel
pq saw some freezes caused by this when running mmio-trace at realtime priority.
Other Expected Problems
Think about accesses that cross page boundaries.
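A single unaligned instruction can touch two pages at once, so page-granular arming sees only half of the access (or faults twice). A small userspace illustration of the access pattern, just to show the shape of the problem:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    long page = sysconf(_SC_PAGESIZE);
    uint8_t *buf;

    /* Two adjacent pages; the write below straddles the seam. */
    if (posix_memalign((void **)&buf, page, 2 * page))
        return 1;

    /* 4-byte store starting 2 bytes before the page boundary:
     * bytes 0-1 land in page 0, bytes 2-3 in page 1. One
     * instruction, two pages. */
    *(volatile uint32_t *)(buf + page - 2) = 0xdeadbeef;

    printf("wrote across the boundary at offset %ld\n", page - 2);
    free(buf);
    return 0;
}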
Double fault lockups (Solved)
hypothesis: It seems that we are occasionally unable to convince the processor that pages are present.
We can confirm this hypothesis by testing the page present bit in the fault handler. Furthermore, if, in the fault handler, we try to set the page present and read from it, a recursive call of the fault handler is seen.
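On x86 that test is a short page table walk; a sketch using the kernel's lookup_address() helper (the surrounding function is illustrative):

#include <linux/printk.h>
#include <asm/pgtable.h>

/* Sketch: from inside the fault handler, ask the page tables whether
 * the faulting page is marked present. lookup_address() is the x86
 * helper that walks the kernel page tables down to the PTE. */
static void report_present_bit(unsigned long addr)
{
    unsigned int level = 0;
    pte_t *pte = lookup_address(addr, &level);

    if (!pte || !pte_present(*pte))
        pr_warn("kmmio: fault at 0x%lx, PTE not present (level %u)\n",
                addr, level);
}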
hypothesis: The problem seems independent of the nvidia driver.
sb appears to be able to reproduce the problem without the nvidia blob using https://no.spoon.fi/~pq/home_nv28/mmio13/kernel_kmmio.c and http://infidigm.net/~jeff/nv-test-0.2.tar.gz
evidence: http://people.pwf.cam.ac.uk/sb476/nv/nvtest02.txt.
hypothesis: The problem is specific to certain addresses.
test case: http://infidigm.net/~jeff/nv-test-0.5.tar.gz tries reading from a known good address and then writing to one of the 'bad' addresses.
Try loading the nv-test driver against git HEAD kmmio/mmio-trace (or any simple variations). Remember to hook-module nv-test.ko first. Also, I haven't tested the PCI ID stuff, so it might be buggy.
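The shape of such a test module is simple: ioremap the BAR, ioread32() a known good offset, then iowrite32() a suspect one, with the mapping hooked so mmiotrace sees both accesses. A hedged sketch (the BAR address and offsets are placeholders, not the values nv-test actually uses):

#include <linux/module.h>
#include <linux/io.h>

#define NV_BAR0      0xfd000000  /* matches the mapping in the log above */
#define NV_BAR0_LEN  0x1000000
#define GOOD_OFFSET  0x0         /* placeholder: known good register */
#define BAD_OFFSET   0x140       /* placeholder: a 'bad' address */

static void __iomem *bar;

static int __init nvtest_sketch_init(void)
{
    bar = ioremap(NV_BAR0, NV_BAR0_LEN);
    if (!bar)
        return -ENOMEM;
    pr_info("nv-test sketch: good read = 0x%08x\n",
            ioread32(bar + GOOD_OFFSET));
    iowrite32(0, bar + BAD_OFFSET);  /* the interesting access */
    return 0;
}

static void __exit nvtest_sketch_exit(void)
{
    iounmap(bar);
}

module_init(nvtest_sketch_init);
module_exit(nvtest_sketch_exit);
MODULE_LICENSE("GPL");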
evidence: early results indicate that specific addresses are not always bad, but they may still be bad some of the time.
Cause
The processes' page tables were not being synced with the reference page table.
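A plausible reading of this (hedged, since the actual fix is not quoted here): kernel mappings are created in the reference page table (init_mm) and reach each process's page tables lazily, so when kmmio edits entries in the reference table, a process whose copy was never synced still faults on pages kmmio believes it fixed. In kernels of that era, vmalloc_sync_all() forced the propagation (the function has since been removed from mainline); a sketch of where such a call would sit:

#include <linux/vmalloc.h>

/* Sketch: whenever kmmio (dis)arms a page by editing the reference
 * (init_mm) page table, make sure every process's page tables agree
 * before returning, instead of relying on a lazy vmalloc fault. */
static void arm_page_synced_sketch(unsigned long addr)
{
    /* ... flip the present bit for addr in the init_mm tables ... */
    vmalloc_sync_all();  /* propagate to all processes' page tables */
}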