I've been trying to debug some random lockups that I think are related to an Nvidia GeForce 9500 GT, and found a guide (http://www.gentoo.org/doc/en/nvidia-guide.xml) that talks about uncacheable regions reported in /proc/mtrr.
My /proc/mtrr looks like this:
> cat /proc/mtrr
reg00: base=0x000000000 ( 0MB), size= 2048MB, count=1: write-back
reg01: base=0x080000000 ( 2048MB), size= 1024MB, count=1: write-back
reg02: base=0x0c0000000 ( 3072MB), size= 256MB, count=1: write-back
reg03: base=0x0cff00000 ( 3327MB), size= 1MB, count=1: uncachable
reg04: base=0x100000000 ( 4096MB), size= 512MB, count=1: write-back
reg05: base=0x120000000 ( 4608MB), size= 256MB, count=1: write-back
Which (apparently) is "not good" because there are uncacheable regions and no write-combining regions. However, I am completely over my head here and have no idea what any of this means. Elsewhere online I found posts about ATI cards having trouble if all registers are write-back/uncacheable. The suggested fix in the Gentoo guide (change BIOS settings) doesn't work for me because my BIOS doesn't have the appropriate option. It also doesn't let me assign an IRQ to VGA, meaning my nvidia card is sharing an IRQ with USB controllers, which could possibly be the cause of my crashes. When I blacklisted ohci_hcd (the module servicing the interrupts) I still got a crash, but I'm not ruling this out yet. The mobo is a Gigabyte MA770T-UD3P, in case you want to avoid it (I certainly will in future).
Anyway, this led me to see which registers the geforce is using. The 9500GT has 512MB of RAM. This is confirmed by the nvidia splash screen during boot. However lspci -v gives me the following:
01:00.0 VGA compatible controller: nVidia Corporation G96 [GeForce 9500 GT] (rev a1) (prog-if 00 [VGA controller])
Flags: bus master, fast devsel, latency 0, IRQ 18
Memory at fa000000 (32-bit, non-prefetchable) [size=16M]
Memory at d0000000 (64-bit, prefetchable) [size=256M]
Memory at f8000000 (64-bit, non-prefetchable) [size=32M]
I/O ports at ef00 [size=128]
[virtual] Expansion ROM at fb000000 [disabled] [size=512K]
Capabilities: [60] Power Management version 3
Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
Capabilities: [78] Express Endpoint, MSI 00
Capabilities: [b4] Vendor Specific Information: Len=14 <?>
Capabilities: [100] Virtual Channel
Capabilities: [128] Power Budgeting <?>
Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
Kernel driver in use: nvidia
Kernel modules: nvidia, nvidiafb
I understand this to be reporting that the card has 16+256+32M of RAM. I am wondering whether this could be related to my lockups? Are there any other instances of lspci reporting incorrect memory sizes? It seems like the GPU memory isn't using the uncacheable register, but is this the whole story? Is the lack of write-combining registers causing problems?
As I said, I am in over my head at this point. Any help at all, be it links or personal experience, would be very much appreciated. I've had the ultimatum "fix your crashes or you're moving to Windows" from my boss, which needless to say is not somewhere I want to go!
Thanks
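For what it's worth, the sizes lspci prints are PCI BAR apertures (address windows the card exposes to the CPU), not the amount of video RAM, so 16+256+32M does not contradict a 512MB card. To look at the "is the GPU memory using the uncacheable register" question more directly, a small shell sketch like the one below could compare a BAR against the MTRR table. The function name mtrr_overlaps is made up for illustration, and it assumes every size in /proc/mtrr is printed in MB, as in the listing above.

```shell
# mtrr_overlaps BASE SIZE
# Reads a table in /proc/mtrr format on stdin and prints every entry whose
# range overlaps [BASE, BASE+SIZE). BASE and SIZE are byte values (hex or decimal).
# Hypothetical helper; assumes all sizes in the table are printed in MB.
mtrr_overlaps() {
    want_base=$(( $1 ))
    want_end=$(( $1 + $2 ))
    # Reduce each line to: name, base (hex), size in MB, caching type
    sed -n 's/^\(reg[0-9]*\): base=\(0x[0-9a-fA-F]*\) .*size= *\([0-9]*\)MB, count=[0-9]*: *\(.*\)$/\1 \2 \3 \4/p' |
    while read -r reg base size_mb type; do
        start=$(( $base ))
        end=$(( $start + $size_mb * 1024 * 1024 ))
        # Two half-open ranges overlap if each one starts before the other ends
        if [ "$start" -lt "$want_end" ] && [ "$want_base" -lt "$end" ]; then
            echo "$reg $type"
        fi
    done
}

# Example: the 256M prefetchable BAR at d0000000 from the lspci output above.
# mtrr_overlaps 0xd0000000 0x10000000 < /proc/mtrr
```

Fed the table from this post, the d0000000 BAR matches nothing: reg02 and reg03 both end at exactly 0xd0000000 and reg04 only starts at 4096MB. If that reading is right, the framebuffer aperture is not covered by the uncachable entry (or by any other entry) and simply gets the default memory type.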
Not much help here, but my notebook has an ATI card and this is what I have in /proc/mtrr:
reg00: base=0x000000000 ( 0MB), size= 4096MB, count=1: write-back
reg01: base=0x100000000 ( 4096MB), size= 1024MB, count=1: write-back
reg02: base=0x0c0000000 ( 3072MB), size= 1024MB, count=1: uncachable
No crashes or random lockups here, at least when using only the notebook's display. When I also use an external display, I sometimes get mouse pointer corruption first and then a hard lockup.
I also have this:
dmesg | grep -i write-combining
mtrr: type mismatch for d0000000,8000000 old: write-back new: write-combining
mtrr: type mismatch for d0000000,8000000 old: write-back new: write-combining
mtrr: type mismatch for d0000000,8000000 old: write-back new: write-combining
mtrr: type mismatch for d0000000,8000000 old: write-back new: write-combining
Maybe the problem is something else, or maybe someone else has a clue about it. You may also want to try a new PSU; I've seen and experienced many strange lockups that stopped happening after changing the PSU.
R00KIE
Tm90aGluZyB0byBzZWUgaGVyZSwgbW92ZSBhbG9uZy4K
IIRC the nvidia driver uses PAT instead of MTRR anyway, so don't worry about the MTRR.
$ cat /proc/driver/nvidia/registry
...
UsePageAttributeTable: 1
Edit: Added URL.
Last edited by brebs (2011-01-10 17:07:19)
Thanks for the info, perhaps I should look elsewhere for the causes of my problem.
I left my computer running last night without X and it didn't crash, so I'm pretty sure it is something graphical which is causing the problem, but I just don't know what :s
Kernel modules: nvidia, nvidiafb
Try without nvidiafb - it only increases the chances of crashes, in my experience.
You have 2 nvidia driver versions to try - the unstable and stable versions. See nvidia driver current versions.
I'm not sure how to disable nvidiafb. It is already in /etc/modprobe.d/framebuffer_blacklist.conf and doesn't appear to be loaded as a module. Am I missing something here?
Show your config:
zgrep -i nvidia /proc/config.gz
> zgrep -i nvidia /proc/config.gz
CONFIG_AGP_NVIDIA=m
CONFIG_FB_NVIDIA=m
CONFIG_FB_NVIDIA_I2C=y
# CONFIG_FB_NVIDIA_DEBUG is not set
CONFIG_FB_NVIDIA_BACKLIGHT=y
CONFIG_BACKLIGHT_MBP_NVIDIA=m
Does this mean nvidiafb is being compiled as a module into the kernel and the answer is to recompile the kernel with CONFIG_FB_NVIDIA not set?
I imagine that if I'm doing this I should unset CONFIG_AGP_NVIDIA since I don't even have an AGP slot on my mobo?
Last edited by clanger (2010-02-18 13:14:25)
CONFIG_FB_NVIDIA=m
Then lsmod must be showing it, yes?
$ lsmod | grep nvidia
I assume that lsmod shows nvidiafb, in which case try e.g.:
grep nvidia /etc/rc.d/*
Anyway, you can of course do this simple test:
stop Xorg
$ modprobe -r nvidiafb
$ modprobe -r nvidia
startx (as your normal user)
Last edited by brebs (2010-02-18 13:22:27)
> lsmod | grep nvidia
nvidia 8795767 38
i2c_core 15369 1 nvidia
agpgart 23331 2 nvidia,ati_agp
No sign of nvidiafb, yet it is still reported in lspci -v.
I can modprobe nvidiafb and it shows up in lsmod, but by default it is not there.
I guess the framebuffer_blacklist.conf is working?
If I leave X, remove both modules, and reload X, the nvidia module gets reloaded but nvidiafb does not.
Thanks for your timely help on this.
Last edited by clanger (2010-02-18 13:33:40)
I can modprobe nvidiafb and it shows up in lsmod, but by default it is not there.
OK, nvidiafb was a false alarm. lspci is just listing it as a module that could drive the card, I suppose. Ignore it.
agpgart 23331 2 nvidia,ati_agp
What is loading ati_agp??
Add to /etc/modprobe.d/whatever.conf:
blacklist ati_agp
00:00.0 Host bridge: ATI Technologies Inc RX780/RX790 Chipset Host Bridge
Subsystem: ATI Technologies Inc RX780/RX790 Chipset Host Bridge
Flags: bus master, 66MHz, medium devsel, latency 32
Memory at <ignored> (64-bit, non-prefetchable)
Capabilities: <access denied>
Kernel modules: ati-agp
Not sure exactly what it is doing. I rmmod'd it and everything still seems to be working. I will blacklist it and try out the substrate screensaver, which seems to cause a lockup pretty consistently.
Argh, still no luck. Had a lockup, with ati-agp unloaded, when loading a VirtualBox image. Nothing unusual in the logs; had to turn off with the power button.
Edit: VirtualBox will crash <5 minutes after starting if the vboxnetadp or vboxnetflt modules are loaded. I can only guess that my kernel is configured wrong for my hardware and certain modules trigger this in some way. Removing ati-agp seems to have stopped lockups related to the GPU (i.e. when a graphics-heavy screensaver like pixelcity or substrate is running), but the problem still comes up in other ways.
Still, progress is being made, and that's something!
Thanks for your help brebs.
Last edited by clanger (2010-02-18 15:36:16)
Try recompiling the virtualbox modules and check if it still crashes.
A kernel update might have changed something that makes vbox crash the system. It might also have something to do with (kernel?) performance counters. I don't see anything in dmesg now, but with vbox 3.1.2, loading vboxdrv would issue a warning to disable them, otherwise they might cause trouble.
R00KIE
nvidia card is sharing an IRQ with usb controllers, which could possibly be the cause of my crashes
So this is worth a try:
In /etc/modprobe.d/whatever.conf
options nvidia NVreg_EnableMSI=1 NVreg_UsePageAttributeTable=1
Then stop xorg, modprobe -r nvidia, and restart xorg, and check with:
cat /proc/driver/nvidia/registry
cat /proc/interrupts
You will see something like:
32: 1680505 1491409 PCI-MSI-edge nvidia
Which shows that MSI is being used.
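If you want to script that check, something along these lines should do. It is only a sketch: uses_msi is a name made up here, and it simply looks for a PCI-MSI line that mentions the driver, in the /proc/interrupts layout shown above.

```shell
# uses_msi DRIVER
# Reads text in /proc/interrupts format on stdin and succeeds (exit status 0)
# if some PCI-MSI interrupt line mentions DRIVER. Hypothetical helper name.
uses_msi() {
    awk -v drv="$1" '/PCI-MSI/ && index($0, drv) { found = 1 } END { exit !found }'
}

# Example:
# uses_msi nvidia < /proc/interrupts && echo "nvidia is using MSI"
```

A line where nvidia still shares a legacy IO-APIC interrupt with the USB controllers would not match, which is the case this change is meant to get away from.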
I have a junky /proc/mtrr using an intel GMA on a new Dell Inspiron 1440.
[djolk@tunamelt Desktop]$ cat /proc/mtrr
reg00: base=0x000000000 ( 0MB), size=32768MB, count=1: write-back
reg01: base=0x0e0000000 ( 3584MB), size= 512MB, count=1: uncachable
reg02: base=0x0dde00000 ( 3550MB), size= 2MB, count=1: uncachable
reg03: base=0x0de000000 ( 3552MB), size= 32MB, count=1: uncachable
reg00 is way too enormous to be anything useful and my vid card doesn't show up at all. I get a warning at boot that there is no write combining and graphics may suffer, and they definitely do. I know the intel driver itself isn't super hot, but KDE is downright clunky and DVD playback is grainy, out of sync and tears.
I've read a few posts regarding rewriting your own mtrr table, but you have to be careful because you get a lot of hard locks.
I left the computer running overnight with various programs running (including VirtualBox). It reset itself 4 hours after I left, and the login prompt was waiting for me when I arrived this morning. There was nothing in the logs relating to the lockup; I only know when it crashed because of the usual startup logs. I experienced a hard lockup shortly after logging in.
I have followed brebs' instructions and nvidia is using MSI now. I will see how it performs throughout the day. Fingers crossed.
Thanks for the help.
Still getting hard lockups. In fact, I think I am getting more now than I was previously. Perhaps the MSI/PageAttributeTable change is responsible. Off to read some more about exactly what these options do.
> cat /proc/driver/nvidia/registry
EnableVia4x: 0
EnableALiAGP: 0
NvAGP: 0
ReqAGPRate: 15
EnableAGPSBA: 0
EnableAGPFW: 0
Mobile: 4294967295
ResmanDebugLevel: 4294967295
RmLogonRC: 1
ModifyDeviceFiles: 1
DeviceFileUID: 0
DeviceFileGID: 0
DeviceFileMode: 438
RemapLimit: 0
UpdateMemoryTypes: 4294967295
UseVBios: 1
RMEdgeIntrCheck: 1
UsePageAttributeTable: 1
EnableMSI: 1
MapRegistersEarly: 0
RegistryDwords: ""
I have a junky /proc/mtrr using an intel GMA on a new Dell Inspiron 1440.
[djolk@tunamelt Desktop]$ cat /proc/mtrr
reg00: base=0x000000000 ( 0MB), size=32768MB, count=1: write-back
reg01: base=0x0e0000000 ( 3584MB), size= 512MB, count=1: uncachable
reg02: base=0x0dde00000 ( 3550MB), size= 2MB, count=1: uncachable
reg03: base=0x0de000000 ( 3552MB), size= 32MB, count=1: uncachable
reg00 is way too enormous to be anything useful and my vid card doesn't show up at all. I get a warning at boot that there is no write combining and graphics may suffer, and they definitely do. I know the intel driver itself isn't super hot, but KDE is downright clunky and DVD playback is grainy, out of sync and tears.
I've read a few posts regarding rewriting your own mtrr table, but you have to be careful because you get a lot of hard locks.
Add the following to the end of your kernel command line in /boot/grub/menu.lst:
enable_mtrr_cleanup
I use this on my Fujitsu laptop and it corrects the entire MTRR table automatically at boot.
Using the enable_mtrr_cleanup option changes my mtrr (see first post for original) to this:
> cat /proc/mtrr
reg00: base=0x000000000 ( 0MB), size= 2048MB, count=1: write-back
reg01: base=0x080000000 ( 2048MB), size= 1024MB, count=1: write-back
reg02: base=0x0c0000000 ( 3072MB), size= 256MB, count=1: write-back
reg03: base=0x0cff00000 ( 3327MB), size= 1MB, count=1: uncachable
Which seems to be just removing the last 2 registers. Still no write-combining sections, and it appears to be missing 768MB of RAM. free confirms that around 768MB is not being found. If it fixes my stability problems I will be happy to make the sacrifice though.
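The 768MB figure matches what the two tables say if you add up the write-back regions. A quick sketch of that arithmetic follows. mtrr_total_mb is a made-up helper and it assumes sizes are printed in MB; that the missing RAM is exactly the memory no longer covered by a write-back entry is an inference from the numbers, not something the thread confirms.

```shell
# mtrr_total_mb [TYPE]
# Sums the size (in MB) of all entries of caching type TYPE (default: write-back)
# in a /proc/mtrr-format table read from stdin. Hypothetical helper;
# assumes every size is printed in MB.
mtrr_total_mb() {
    sed -n 's/^reg[0-9]*: .*size= *\([0-9]*\)MB, count=[0-9]*: *\(.*\)$/\2 \1/p' |
    awk -v type="${1:-write-back}" '$1 == type { sum += $2 } END { print sum + 0 }'
}

# The two entries that enable_mtrr_cleanup dropped, copied from the first post:
dropped='reg04: base=0x100000000 ( 4096MB), size= 512MB, count=1: write-back
reg05: base=0x120000000 ( 4608MB), size= 256MB, count=1: write-back'

printf '%s\n' "$dropped" | mtrr_total_mb    # 512 + 256
```

This prints 768. Run against the full tables (mtrr_total_mb < /proc/mtrr before and after adding the option) it should give 4096 and 3328, the same 768MB apart.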
enable_mtrr_ doesn't have any effect on my system. I've also tried compiling the git kernel from AUR with it compiled in, with no effect either.
I can't remember the error, but I believe it was something to the effect of incorrect gran/chunk size.
I'm also 'missing' some RAM.
This is how I solved mine, lspci -v:
Intel GM965/GL960: Memory at d0000000 (64-bit, prefetchable) [size=256M]
Then added to /etc/rc.local:
# Fix MTRR on Intel graphics
/bin/echo "base=0xD0000000 size=0x10000000 type=write-combining" >| /proc/mtrr
Resulting /proc/mtrr
reg00: base=0x000000000 ( 0MB), size= 1024MB, count=1: write-back
reg01: base=0x03f700000 ( 1015MB), size= 1MB, count=1: uncachable
reg02: base=0x03f800000 ( 1016MB), size= 8MB, count=1: uncachable
reg03: base=0x0d0000000 ( 3328MB), size= 256MB, count=2: write-combining
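Before echoing a region into /proc/mtrr it may be worth sanity-checking the numbers. As far as I know the kernel wants the size to be a power of two and the base to be a multiple of the size; the sketch below checks just that. mtrr_region_ok is a hypothetical name, and passing this check does not guarantee the kernel will accept the region (an overlapping entry of a different type can still be refused, as in the "type mismatch" messages earlier in the thread).

```shell
# mtrr_region_ok BASE SIZE
# Succeeds if SIZE is a power of two and BASE is a multiple of SIZE.
# BASE and SIZE are byte values (hex or decimal). Hypothetical helper.
mtrr_region_ok() {
    base=$(( $1 ))
    size=$(( $2 ))
    [ "$size" -gt 0 ] || return 1
    [ $(( $size & ($size - 1) )) -eq 0 ] || return 1   # power of two?
    [ $(( $base % $size )) -eq 0 ] || return 1          # base aligned to size?
}

# The values used in the rc.local line above:
if mtrr_region_ok 0xD0000000 0x10000000; then
    echo "base/size look sane"
fi
```

The base and size would come from the prefetchable "Memory at ..." line of your own lspci -v output, as in the post above.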
You can find a lot of discussion, and help, about this on Linux distribution forums.
You need to install an RTFM interface.
I have been working on a script to add a write-combining line; however, I believe the first problem that needs to be solved is:
reg00: base=0x000000000 ( 0MB), size=32768MB, count=1: write-back
Reg00 should not be thirty-two thousand megabytes! It overlaps all the other mtrr regions and I can't add a write-combining line.
There is a solution (ish) from Ubuntu here: http://ubuntuforums.org/showthread.php?t=1285176 but setting incorrect mtrrs causes lots of hard freezes, so I have yet to get a working mtrr setup.
I am curious why it's happened. I've read in several places that it occurs frequently on computers made by Dell. I've also read that mtrr isn't used anymore and PAT replaces it, but if I touch my mtrr table my system tells me it's using it.
You could try compiling mtrr-uncover, to see if it helps.
I believe I've tried that as well. I'll have to take another look at it though. I've looked at a bunch of stuff regarding this and I really think the only way to do it is to manually rewrite the mtrr table. I would like to find out WHY this happens, and I have no idea what my /proc/mtrr should look like. I'm also wondering if it counts as a bug...