#1 2024-05-29 04:05:18

jamdox
Member
Registered: 2015-05-02
Posts: 48

ECC RAM errors, crashes, changes with different kernel versions

I have a Ryzen 5900 with 64GB of ECC DRAM in a 2x32GB config on an X570 motherboard.  I've parts cannon'd everything, but the problem recurred.  Certain kernel versions seem to lead to fewer crashes, but after one crash I updated the kernel, and now I'm having even more crashes!

I'm also running rasdaemon and ras-mc-ctl.  In journalctl, I get messages like

[Hardware Error]: CPU:0 (19:21:2) MC17_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
...
rasdaemon[764]:            <...>-27606 [000] .....     0.000918 mce_record 2024-05-28 07:31:11 -0600 Unified Memory Controller (bank=17), status= dc2040000000011b, 

etc.  This happened to be a "corrected" error, but the machine still seems to have crashed.  Any help would be appreciated.
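
For anyone reading the status value by hand, here is a rough sketch of how the flags in the `MC17_STATUS[...]` line map onto bits of the 64-bit status word. The bit positions are my understanding of the AMD MCA status layout as the kernel prints it for this CPU family. The function name `decode_mca_status` is made up for illustration, so treat this as a reading aid and not a replacement for rasdaemon's own decoding.

```python
# Bit positions of the MCA status flags (assumed layout for AMD family 19h).
# UC clear on a valid record is what the kernel prints as "CE".
FLAGS = {
    63: "Val",       # record is valid
    62: "Over",      # overflow: at least one earlier error was overwritten
    61: "UC",        # uncorrected error
    60: "En",        # error reporting enabled for this bank
    59: "MiscV",     # MISC register holds valid data
    58: "AddrV",     # ADDR register holds valid data
    57: "PCC",       # processor context corrupt
    55: "TCC",       # task context corrupt
    53: "SyndV",     # syndrome register holds valid data
    46: "CECC",      # corrected by ECC
    45: "UECC",      # uncorrectable ECC error
    44: "Deferred",  # deferred error
    43: "Poison",    # poisoned data consumed
}


def decode_mca_status(status: int) -> list[str]:
    """Return the names of the flags set in `status`, highest bit first."""
    return [name for bit, name in sorted(FLAGS.items(), reverse=True)
            if (status >> bit) & 1]


if __name__ == "__main__":
    status = 0xDC2040000000011B  # value from the log line above
    print(decode_mca_status(status))
    # → ['Val', 'Over', 'En', 'MiscV', 'AddrV', 'SyndV', 'CECC']
    print("corrected:", "UC" not in decode_mca_status(status))
```

The `Over` flag is worth noticing: it means more than one error landed in this bank before the record was read, so the log line may describe only the most recent one.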

Last edited by jamdox (2024-05-29 04:05:56)

#2 2024-05-30 07:20:03

jamdox
Member
Registered: 2015-05-02
Posts: 48

Re: ECC RAM errors, crashes, changes with different kernel versions

Disabling rasdaemon and ras-mc-ctl didn't help.

#3 2024-05-30 11:42:38

xerxes_
Member
Registered: 2018-04-29
Posts: 1,056

Re: ECC RAM errors, crashes, changes with different kernel versions

Do these crashes happen under high CPU/memory load, or before/after suspend (a power state change)?
You may want to experiment with CPU power (higher) and memory frequency (lower) settings in the BIOS, disable power-saving states (C-states in the BIOS), etc., or try kernel power parameters...
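
To judge whether a given BIOS change actually helps, it may be useful to note the corrected/uncorrected error counters before and after each change. Below is a minimal sketch, assuming the EDAC driver is loaded and exposes `ce_count` and `ue_count` under `/sys/devices/system/edac/mc/`. The function name `read_edac_counts` is made up for illustration.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller_name: (corrected, uncorrected)} from EDAC sysfs."""
    counts = {}
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text())
            ue = int((mc / "ue_count").read_text())
        except (OSError, ValueError):
            continue  # skip controllers whose counters are missing or unreadable
        counts[mc.name] = (ce, ue)
    return counts


if __name__ == "__main__":
    for name, (ce, ue) in read_edac_counts().items():
        print(f"{name}: corrected={ce} uncorrected={ue}")
```

Run it once after boot and again after some uptime under each setting. The counters reset on reboot, so compare the counts over similar uptimes.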
