I found this thread while researching hardware errors on my recently built Ryzen 3590x system. There seem to be some similarities with the case discussed here.
I am running the following:
CPU: AMD Ryzen 9 3590x
GPU: Nvidia GTX 1070 Ti
MB: Asus ROG Crosshair VIII Hero (WI-FI)
Memory: G.SKILL Trident Z Neo 32GB 3600MHz CL16 DDR4 KIT OF 2 F4-3600C16D-32GTZN
OS: Fedora 31, KDE spin (kernel 5.5.9-200.fc31.x86_64)
Once in a while I encounter the following messages from the kernel:
Message from syslogd@grizzly at Mar 21 19:33:53 ...
kernel:[Hardware Error]: Corrected error, no action required.
Message from syslogd@grizzly at Mar 21 19:33:53 ...
kernel:[Hardware Error]: CPU:0 (17:71:0) MC27_STATUS[-|CE|MiscV|-|-|-|SyndV|-|-|-]: 0x982000000002080b
Message from syslogd@grizzly at Mar 21 19:33:53 ...
kernel:[Hardware Error]: IPID: 0x0001002e00000500, Syndrome: 0x000000005a020001
Message from syslogd@grizzly at Mar 21 19:33:53 ...
kernel:[Hardware Error]: Power, Interrupts, etc. Ext. Error Code: 2, Link Error.
Message from syslogd@grizzly at Mar 21 19:33:53 ...
kernel:[Hardware Error]: cache level: L3/GEN, mem/io: IO, mem-tx: GEN, part-proc: SRC (no timeout)
I have run MemTest86 and it found no errors. I haven't seen any hard crashes so far; a program might have crashed once, but I'm not sure. The BIOS is updated to the latest version currently available (Version 1302 2020/03/03). I am trying to understand whether this is faulty hardware or something else.
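For anyone trying to read the status word by hand, here is a rough sketch that unpacks the MC27_STATUS value from the log above. The bit positions and the lookup tables are assumptions based on my reading of the Linux kernel's x86 MCE decoding, not something taken from this thread, so treat it as a starting point only. The function names are made up for the example.

```python
STATUS = 0x982000000002080B  # MC27_STATUS value copied from the log above

# Assumed flag bit positions (from my reading of the kernel's MCE headers).
FLAG_BITS = {"Val": 63, "Over": 62, "UC": 61, "En": 60,
             "MiscV": 59, "AddrV": 58, "PCC": 57, "SyndV": 53}

# Assumed lookup tables for a bus error code laid out as 0000 1PPT RRRR IILL.
PP = ["SRC", "RES", "OBS", "GEN"]
II = ["MEM", "RESV", "IO", "GEN"]
LL = ["RESV", "L1", "L2", "L3/GEN"]
RRRR = ["GEN", "RD", "WR", "DRD", "DWR", "IRD", "PRF", "EV", "SNP"]


def decode_flags(status):
    """Return each named flag bit as a boolean."""
    return {name: bool((status >> bit) & 1) for name, bit in FLAG_BITS.items()}


def extended_error_code(status):
    """Extended error code, assumed to live in bits 21:16."""
    return (status >> 16) & 0x3F


def decode_bus_error(status):
    """Decode the low 16 bits if they match the bus error pattern, else None."""
    code = status & 0xFFFF
    if (code & 0xF800) != 0x0800:
        return None
    rrrr = (code >> 4) & 0xF
    return {
        "part-proc": PP[(code >> 9) & 0x3],
        "timeout": bool((code >> 8) & 1),
        "mem-tx": RRRR[rrrr] if rrrr < len(RRRR) else "unknown",
        "mem/io": II[(code >> 2) & 0x3],
        "cache level": LL[code & 0x3],
    }


if __name__ == "__main__":
    print(decode_flags(STATUS))
    print("Ext. Error Code:", extended_error_code(STATUS))
    print(decode_bus_error(STATUS))
```

With UC clear the error counts as corrected, which agrees with the "Corrected error, no action required" line, and the bus error fields line up with the last log line (L3/GEN, IO, GEN, SRC, no timeout).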
Offline
Still have the issue with the F12e BIOS. GPU-related tools like rocminfo still crash the system hard on the latest kernel.
The hard crashes actually corrupted some of my files. It seems to mostly affect files with open file descriptors, like service logs or the shell history file: the last couple of bytes there are filled with zeros.
Happened to a bunch of video files as well where multiple sections of >= 1MB were all zero, causing skips and artefacts in video playback. Maybe ffmpegthumbnailer was running during the crashes with open file descriptors. Luckily no files of much importance affected...
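To check other files for the kind of damage described above, a small scanner along these lines could help. This is only a sketch: `zero_runs` and its parameters are names invented for the example, and the 1 MiB default threshold simply mirrors the ">= 1MB" figure mentioned above. Long zero runs can also occur legitimately (sparse files, padding), so a hit is a hint and not proof of corruption.

```python
import re

ZERO_RUN = re.compile(rb"\x00+")


def zero_runs(path, min_len=1 << 20, chunk_size=1 << 20):
    """Return (offset, length) for every run of zero bytes of at least min_len.

    The file is read in chunks; a run that crosses a chunk boundary is
    merged into a single run before its length is checked.
    """
    runs = []
    cur_start = cur_end = None  # the most recent run, which may still grow
    offset = 0

    def flush():
        if cur_start is not None and cur_end - cur_start >= min_len:
            runs.append((cur_start, cur_end - cur_start))

    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            for m in ZERO_RUN.finditer(chunk):
                start, end = offset + m.start(), offset + m.end()
                if cur_end == start:
                    # Continues the run that ended at the previous chunk's edge.
                    cur_end = end
                else:
                    flush()
                    cur_start, cur_end = start, end
            offset += len(chunk)
    flush()
    return runs


if __name__ == "__main__":
    import sys
    for name in sys.argv[1:]:
        for start, length in zero_runs(name):
            print(f"{name}: {length} zero bytes at offset {start}")
```

Run it with the suspect files as arguments; it prints nothing for files without long zero runs.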
If I test this in future, I'll make sure to unmount a couple of drives and probably reduce the runlevel to avoid damage.
Should I file a bug report for the linux package for Arch and/or for the upstream kernel?
In that case can anyone tell me where to report a bug for the upstream kernel?
Last edited by daren_k (2020-04-01 12:06:13)
Offline
Are things also crashing when running the RAM without overclock and at its default speed? The default speed will be something low like 2400MHz. I looked through the old posts in this thread and I couldn't understand if you tested the RAM at low speed or not. I know that it feels bad to run the RAM at 2400MHz instead of its advertised 3600MHz, but you should still test that because the 3600MHz speed is an overclock. If things run fine without overclock, making a bug report won't help you because there would be no way to replicate things on a different computer.
If you find out that things run fine when not using the 3600MHz RAM speed, you could try asking Gigabyte for help. They might send you an alternative BIOS to try, or they might point you to a BIOS setting you can tweak. The other thing to do would be to try to overclock the memory manually instead of using the XMP profile.
@adrians: I would try playing around with different UEFI/BIOS settings to see if there's a way to make those errors go away. I could always find something to make those kinds of errors/warnings go away. I don't know whether a computer can run stably while still logging those warnings, so I don't know how important it is to get rid of them. Personally, I wouldn't leave it like this.
Offline
Happens with clocks at optimized defaults as well (RAM@2133 MHz).
In short:
F1 BIOS + linux (5.5.x) & linux-lts (5.4.x) = all good except RDRAND issue and slightly lower memory bandwidth, rocminfo and Furmark benchmark work
More recent BIOS + linux (5.5.x) = kernel crashes with rocminfo and when starting Furmark benchmark for example, file corruption possible when kernel crashes
More recent BIOS + linux-lts (5.4.x) = all good, running this for weeks with RAM@3600 MHz, rocminfo and Furmark benchmark work
It seems to be related to the amdgpu kernel driver in 5.5.x to me tbh. Something not functioning correctly with more recent BIOSes there.
I already filed a bug report about it at Gigabyte, got a reply that it's being looked at.
Offline
Can you force the crash reliably and fast? If you don't have to wait for it to show up, you could "bisect" the 5.5 kernel commits to find the one commit which introduced this crashing behavior.
I have never done a bisect, so I don't know how it actually works. There's an ArchWiki article here:
https://wiki.archlinux.org/index.php/Bi … s_with_Git
I think this git-bisect thing doesn't work with the normal Arch kernel package as a base. The git repository in there only has tags like "v5.5.1" and "v5.4.2" etc. so you can't target the development commits that lead up to the release of 5.5. The ArchWiki article says something about using one of the "...-git" packages from the AUR to work on this. I don't know if that's really the way to do this for the kernel. I remember the user "loqs" writing a post with a guide on how to do a git-bisect on the kernel but I can't find that post.
EDIT:
I found a post by loqs that explains git-bisect for the kernel. See post #8 in this thread here:
https://bbs.archlinux.org/viewtopic.php … 9#p1875819
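For what it's worth, the idea behind a bisect is just a binary search over the commit history. Here is a toy sketch of that idea only; it is not how you would drive git itself. The commit names and the `is_bad` check are made up, and in a real kernel bisect each "check" means building, booting and testing that commit by hand.

```python
def first_bad(commits, is_bad):
    """Find the first bad commit by binary search.

    commits is ordered oldest to newest, commits[0] is known good and
    commits[-1] is known bad. Returns (culprit, number_of_tests).
    """
    lo, hi = 0, len(commits) - 1  # lo is always good, hi is always bad
    tests = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        tests += 1
        if is_bad(commits[mid]):
            hi = mid  # the culprit is mid or something earlier
        else:
            lo = mid  # the culprit is later than mid
    return commits[hi], tests


if __name__ == "__main__":
    # Hypothetical history of 1000 commits where commit 637 broke things.
    history = [f"c{i}" for i in range(1000)]
    culprit, tests = first_bad(history, lambda c: int(c[1:]) >= 637)
    print(culprit, "found after", tests, "tests")
```

The point is that the number of builds grows with the logarithm of the number of commits, so even a range of a thousand commits needs only about ten test cycles, which is why being able to trigger the crash quickly matters so much.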
Last edited by Ropid (2020-04-02 12:31:35)
Offline
I lowered the RAM speed to 3200MHz and haven't seen the error since then (I have been running this setup for about two weeks now). In my case this seems to be an overclocking issue, as suggested by @Ropid. I falsely assumed that the XMP profile for 3600MHz RAM speed would just work out of the box. It might require some manual tuning to get stability.
Offline
I can instantly trigger the crash by calling rocminfo once or a couple of times with linux 5.5.x.
It segfaults only with recent BIOS + linux 5.5.x.
On F1 BIOS and any BIOS with linux-lts 5.4.x I can call it with "while true; do rocminfo; done" and nothing breaks.
I will do a bit of testing once linux 5.6.x is released for arch. Maybe it is fixed then...
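A bounded variant of the loop above that stops at the first failure and reports how the command died could look like the sketch below. `run_until_failure` and `describe` are names invented for this example. It only helps in the cases where the command itself fails (a segfault, for instance); if the whole machine locks up, nothing gets reported.

```python
import signal
import subprocess


def run_until_failure(cmd, max_runs=100):
    """Run cmd repeatedly; return (attempt, returncode) for the first failure.

    Returns None if all max_runs attempts exit with status 0.
    """
    for attempt in range(1, max_runs + 1):
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        if result.returncode != 0:
            return attempt, result.returncode
    return None


def describe(returncode):
    """On POSIX, subprocess reports death by signal N as returncode -N."""
    if returncode < 0:
        return f"killed by {signal.Signals(-returncode).name}"
    return f"exit code {returncode}"


if __name__ == "__main__":
    failure = run_until_failure(["rocminfo"], max_runs=50)
    if failure is None:
        print("no failure in 50 runs")
    else:
        attempt, rc = failure
        print(f"run {attempt} failed: {describe(rc)}")
```

A return code of -11 would show up as "killed by SIGSEGV", which makes it easy to record a clear good/bad result for each kernel being tested.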
Last edited by daren_k (2020-04-06 16:35:38)
Offline
Seems to be fixed with the current linux 5.6.3, can call rocminfo and GPU benchmarks all I want with no issues.
Offline
Have the exact same issue that Darren originally had with Chrome tabs and pacman segfaults. Yay also segfaults. Steam webhelper segfaults like crazy.
This is with a Ryzen 3950x and an ASRock X570 Taichi motherboard. Still working the problem.
Last edited by Cxpher (2020-06-23 15:40:15)
Offline
Did you update the BIOS?
I am on the F12e BIOS, which has been stable ever since Arch moved to kernel 5.6+ (now 5.7.4).
There is F12f and F20a (with a lower AGESA version for some reason) that I didn't try out.
Edit:
Oh yeah, those BIOS version numbers are for the GIGABYTE board, never mind.
Last edited by daren_k (2020-06-23 15:49:31)
Offline
Yeah. The versions for the X570 are different.
https://www.asrock.com/mb/amd/x570%20taichi/#BIOS
I cold flashed to 3.00 earlier but haven't tested it. Ran into some issues before flashing where I couldn't get the system to POST.
So now working on that (CMOS battery and the works). It just randomly stopped POSTing upon a reboot.
This has not been overclocked. Just stock.
Last edited by Cxpher (2020-06-23 15:54:41)
Offline
If anybody runs into this like me:
I had the same error messages with my Ryzen 9 5900X and the Gigabyte Aorus Master.
If the recommended XMP profile was active, I could trigger these errors with the memtester tool.
https://archlinux.org/packages/communit … memtester/
The latest BIOS update solved the issue for me.
Offline
My Ryzen 3700X had the same error after upgrading to AGESA ComboAM4PIV2 1.1.9.0. It worked well before this upgrade.
Offline
tuxzz wrote: My Ryzen 3700X had the same error after upgrading to AGESA ComboAM4PIV2 1.1.9.0. It worked well before this upgrade.
I have an ASUS X570-PRO board with a 3600X, and I only started seeing these errors after the 3405 BIOS with AGESA 1.2.0.0, so I might give the system a few more updates before winding the memory back. It runs the memory with the 3600 XMP profile, but the default profile is 2666.
Offline
tuxzz wrote: My Ryzen 3700X had the same error after upgrading to AGESA ComboAM4PIV2 1.1.9.0. It worked well before this upgrade.
I have an ASUS X570-PRO board with a 3600X that I only started seeing these after the 3405 BIOS with AGESA 1.2.0.0 so might give the system a few more updates before winding memory back. It runs memory with 3600 XMP profile in use but the default profile is 2666.
Fixed in BIOS 3602 so worth waiting for an update.
Offline