Since yesterday I've suddenly been getting constant kernel panics, both during boot and during normal work such as web browsing. I managed to grab a picture of the screen during one of the panics:
http://postimage.org/image/os13aypix/
It looks somewhat different from other panics I've seen. journalctl doesn't show anything interesting. I've tested the RAM with memtest86+ and it passed.
Any advice is highly appreciated.
EDIT: It was a faulty motherboard. I'm using a new one and everything is OK.
Last edited by Demon (2013-03-08 12:17:00)
Probably a microcode issue. Try adding "clocksource=hpet" to the kernel command line in GRUB and boot.
If that doesn't work, enable "microcode updation" in the BIOS, keep clocksource=hpet and try again. Once booted, install the microcode package for your processor: search with "pacman -Ss microcode", pick the right one and install it.
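A minimal sketch for checking, after rebooting, whether clocksource=hpet actually took effect. It only reads the kernel's clocksource sysfs files; CLOCKSOURCE_DIR is a hypothetical override added so the function can be tried against a test directory:

```shell
#!/bin/sh
# Sketch: report the clocksource the kernel is using and the ones it offers,
# e.g. to confirm that clocksource=hpet on the kernel command line was accepted.
# CLOCKSOURCE_DIR is a hypothetical override, not a kernel setting.
show_clocksource() {
    dir=${CLOCKSOURCE_DIR:-/sys/devices/system/clocksource/clocksource0}
    # Both files come from the kernel's clocksource sysfs interface.
    echo "current:   $(cat "$dir/current_clocksource")"
    echo "available: $(cat "$dir/available_clocksource")"
}
```

Call show_clocksource after boot; if "current" shows hpet, the parameter was honored.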
I can boot, but only after a few unsuccessful tries. I've read somewhere that this could be caused by dust in the case, so I'm going to try that first: cleaning the case and the CPU cooler and resetting the BIOS. Will report back.
I doubt that microcode or the use of HPET would cause "ECC error during data access from L2". This sounds like a broken CPU or overheating.
Did you perform any kernel updates recently?
Don't overclock (if you do).
Clean the dust out of the CPU cooler.
Try disabling SpeedStep/Cool'n'Quiet: somebody recently reported that CPUFreq appeared to overclock his CPU to 5 GHz (???) and crash the system.
Check if you can boot a live CD, preferably one known to have worked on this computer in the past.
Check the CPU temperature.
As a last resort, downclocking the CPU may increase the chance that it boots.
Last edited by mich41 (2013-02-19 21:32:00)
I've cleaned the case and reset the BIOS, and nothing has changed. I can boot with Cool'n'Quiet off, but only if the CPU is underclocked. I really don't know what else to try apart from a BIOS upgrade (and I already have the latest version).
I've also installed amd-ucode, but I honestly don't know what to do with it.
If underclocking helps, then it's almost certainly a hardware problem. Is it always an L2 ECC error on CPU1?
Install lm_sensors. Run sensors-detect and load the driver for the Super I/O chip it finds. Then run sensors.
If it reports temperatures above 60°C, it's possible that your cooling is insufficient and the CPU misbehaves because of the high temperature.
Otherwise, it's damage to core1's L2 cache or some extremely weird bug.
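A rough sketch of the same check done directly against sysfs, assuming sensors-detect has already loaded the right driver and that readings appear in the kernel's hwmon interface (temp*_input files in millidegrees Celsius). HWMON_ROOT and LIMIT_C are hypothetical knobs for this sketch, and 60 is just the threshold suggested above. It does not know which sensor is the CPU; sensors shows the labels.

```shell
#!/bin/sh
# Sketch: flag hwmon temperature readings above a limit (default 60 C).
# temp*_input files report millidegrees Celsius. Some older drivers put them
# under hwmonN/device/, so both locations are scanned.
# HWMON_ROOT and LIMIT_C are hypothetical overrides, not lm_sensors options.
check_temps() {
    root=${HWMON_ROOT:-/sys/class/hwmon}
    limit_c=${LIMIT_C:-60}
    hot=0
    for f in "$root"/hwmon*/temp*_input "$root"/hwmon*/device/temp*_input; do
        # Skip unmatched glob patterns and unreadable files.
        [ -r "$f" ] || continue
        milli=$(cat "$f") || continue
        # Skip empty or non-numeric readings.
        case $milli in ''|*[!0-9-]*) continue ;; esac
        c=$((milli / 1000))   # integer degrees, fraction truncated
        if [ "$c" -gt "$limit_c" ]; then
            echo "HOT: $f = ${c} C"
            hot=1
        else
            echo "ok:  $f = ${c} C"
        fi
    done
    # Status 1 if any sensor is above the limit, 0 otherwise.
    return "$hot"
}
```

Running check_temps while the machine is under load (e.g. web browsing, which triggered panics here) is the interesting case.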
Last edited by mich41 (2013-03-02 12:05:10)
No, I'm monitoring the temperature constantly and it's not overheating. I'm going to try the microcode update with kernel 3.8. If that doesn't help, I guess I'll have to buy a new CPU.
You can try some old kernels just to be sure: old live CDs, linux-lts (it's v3.0), etc.
Another option is disabling the faulting core. This should help with K10, which has a separate L2 cache for each core. Not sure about K8, though.
Last edited by mich41 (2013-03-02 12:18:06)
I've already tried old live CDs, no luck. It also happens with Windows XP (BSOD). How can I disable the faulting core?
Some BIOSes have an option to hide cores or "unlock" factory-hidden ones.
Another way:
echo 0 >/sys/devices/system/cpu/cpu1/online
Run this before the system crashes, e.g. in /etc/rc.local or its systemd equivalent if you use systemd.
Yet another: add nosmp to the kernel command line if it's a dual-core CPU.
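A slightly more defensive version of the echo above, as a sketch: it refuses to touch a CPU that has no writable online file (on x86 the boot CPU, cpu0, usually has none) and does nothing if the core is already offline. CPU_SYSFS is a hypothetical override for testing. It has to run as root, early in boot, and it cannot help if the crash happens before userspace starts.

```shell
#!/bin/sh
# Sketch: take one CPU core offline through the CPU hotplug sysfs interface,
# equivalent to: echo 0 >/sys/devices/system/cpu/cpuN/online
# Must run as root. CPU_SYSFS is a hypothetical override, not a kernel setting.
offline_cpu() {
    cpu=${1:?usage: offline_cpu N}
    ctl=${CPU_SYSFS:-/sys/devices/system/cpu}/cpu$cpu/online
    if [ ! -w "$ctl" ]; then
        # On x86 cpu0 usually has no "online" file: it cannot be unplugged.
        echo "cannot control cpu$cpu via $ctl" >&2
        return 1
    fi
    if [ "$(cat "$ctl")" = 0 ]; then
        echo "cpu$cpu already offline"
        return 0
    fi
    # Writing 0 asks the kernel to stop scheduling on this core.
    echo 0 > "$ctl"
}
```

For the case in this thread, offline_cpu 1 would be called from rc.local or a oneshot systemd service.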
Last edited by mich41 (2013-03-02 12:23:29)
OK, thank you very much for your help.
echo 0 >/sys/devices/system/cpu/cpu1/online
This doesn't help, as the failure occurs before it gets a chance to run. Also, dmesg reports this:
microcode: AMD CPU family 0xf not supported
so that last hope is gone. For now it works if I disable Cool'n'Quiet (which I can live with) and downclock the CPU (which is not acceptable).
Any other ideas?
Edit: I've already updated the BIOS; still the same.
Last edited by Demon (2013-03-02 15:44:55)
I also see this in the kernel panic log:
Tag Snoop Error
Family 0xf is K8, so it must be dual-core. Since core1 is failing, simply add maxcpus=1 to the boot parameters and Linux will run on core0 exclusively.
The tag snoop error points to the cache again.
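To confirm after a reboot that maxcpus=1 (or nosmp) was actually passed and honored, something like this sketch can be used. It only reads /proc/cmdline and /sys/devices/system/cpu/online (which lists online CPUs as ranges such as 0 or 0-1); PROC_CMDLINE and CPU_ONLINE are hypothetical overrides for testing.

```shell
#!/bin/sh
# Sketch: show CPU-limiting boot parameters and which CPUs are online.
# PROC_CMDLINE and CPU_ONLINE are hypothetical overrides for testing.
show_cpu_limits() {
    cmdline=$(cat "${PROC_CMDLINE:-/proc/cmdline}")
    online=$(cat "${CPU_ONLINE:-/sys/devices/system/cpu/online}")
    found=0
    set -f   # no globbing while splitting the command line into words
    for word in $cmdline; do
        case $word in
            maxcpus=*|nosmp)
                echo "boot parameter: $word"
                found=1
                ;;
        esac
    done
    set +f
    [ "$found" -eq 1 ] || echo "boot parameter: no maxcpus/nosmp found"
    # e.g. "0" with maxcpus=1 on a dual-core, "0-1" without it
    echo "online CPUs: $online"
}
```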
No luck. The same errors happen, only this time on CPU0. The only way I can boot and use my computer is to disable AMD Cool'n'Quiet in the BIOS and downclock to 1500-1600 MHz.
It was a faulty motherboard, after all.