Resume from standby failed, mce: [Hardware Error]

OpusOne · 2023-06-26 19:35:45

Hi,

I've been using standby/resume on my workstation for a few weeks now, without any issue.
Today, for the first time, resuming failed: the computer just hung, and I had to force a reset.

After a successful reboot, I looked at dmesg and found messages like these for each CPU core:

 mce: [Hardware Error]: TSC 0 ADDR be000000 MISC bebb1898 
[   11.912644] mce: [Hardware Error]: PROCESSOR 0:406f1 TIME 1687804938 SOCKET 0 APIC 0 microcode b000040
[   11.912648] mce: [Hardware Error]: Machine check events logged
[   11.912649] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4: be00000000800400

From what I've read, I assume these messages report the errors that occurred during the failed resume.
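
To make the "Bank 4" line easier to read, the status word can be split into its flag bits. This is only a minimal sketch in Python, assuming the IA32_MCi_STATUS layout described in the Intel SDM (bit 63 VAL, 62 OVER, 61 UC, 60 EN, 59 MISCV, 58 ADDRV, 57 PCC, bits 31:16 model-specific error code, bits 15:0 MCA error code). The name decode_mci_status is made up for this post and is not a kernel or tool API.

```python
# Flag positions are an assumption taken from the Intel SDM description
# of the IA32_MCi_STATUS register; they are not read from the kernel.
FLAG_BITS = {
    "valid": 63,        # VAL: the register holds a logged error
    "overflow": 62,     # OVER: an earlier error was overwritten
    "uncorrected": 61,  # UC: the hardware did not correct the error
    "enabled": 60,      # EN: reporting was enabled for this error
    "misc_valid": 59,   # MISCV: the MISC register holds extra data
    "addr_valid": 58,   # ADDRV: the ADDR register holds an address
    "pcc": 57,          # PCC: processor context may be corrupted
}


def decode_mci_status(status: int) -> dict:
    """Split a 64-bit machine check status word into named fields."""
    fields = {name: bool((status >> bit) & 1) for name, bit in FLAG_BITS.items()}
    fields["model_code"] = (status >> 16) & 0xFFFF  # model-specific error code
    fields["mca_code"] = status & 0xFFFF            # architectural MCA error code
    return fields


# Status word copied from the "Bank 4" line in the dmesg output above.
decoded = decode_mci_status(0xBE00000000800400)
for name, value in decoded.items():
    if isinstance(value, bool):
        print(f"{name}: {value}")
    else:
        print(f"{name}: {value:#06x}")
```

For the value in my log, this shows VAL, UC, EN, MISCV, ADDRV and PCC set, OVER clear, and an MCA error code of 0x0400, which the SDM lists as an internal timer error if I am reading the table correctly.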

Kernel: 6.3.9-arch1-1
Processor: Xeon E5-2690 v4
GPU: AMD RX 6650 XT
RAM: 64GB non-ECC

My kernel boot line includes the following options:

pci=noaer threadirqs quiet loglevel=3 systemd.show_status=auto libahci.ignore_sss=1 splash

(Note: I use the 'pci=noaer' option because without it I get tons of PCI AER errors in dmesg, due to an apparently well-known issue with the NVMe controller of the Samsung 960 Pro. That is one of the NVMe drives in my machine, but not the one I boot Linux from.)

Any idea what could have suddenly happened? Could it be related to the latest kernel version? Are there any other clues or things I can investigate?

Arch Linux
