
#1 2022-04-17 03:39:37

matt-3
Member
From: California
Registered: 2020-06-09
Posts: 16

Help diagnosing machine check error

I have an AMD Ryzen 9 5950X. I occasionally have stability issues: the machine crashes and reboots, and after the reboot I get the following messages:

mce: [Hardware Error]: CPU 15: Machine Check: 0 Bank 5: bea0000000000108
mce: [Hardware Error]: TSC 0 ADDR 8a3a72 MISC d012000100000000 SYND 4d000000 IPID 500b000000000
mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1649997706 SOCKET 0 APIC 1e microcode a201016
mce: [Hardware Error]: CPU 19: Machine Check: 0 Bank 5: bea0000000000108
mce: [Hardware Error]: TSC 0 ADDR 8a3a96 MISC d012000100000000 SYND 4d000000 IPID 500b000000000

Is there some way I can decode these error codes? I've tried searching on the internet but found nothing helpful in this regard.

The crashes are very rare (once every few months at most) and don't seem to be triggered by any specific behaviors. They sometimes happen at idle, sometimes under load, etc.


“In science it often happens that scientists say, ‘You know that's a really good argument; my position is mistaken,’ and then they actually change their minds and you never hear that old view from them again… I cannot recall the last time something like that happened in politics or religion.” —Carl Sagan, 1987 CSICOP keynote address

Offline

#2 2022-04-17 11:20:22

Lone_Wolf
Member
From: Netherlands, Europe
Registered: 2005-10-04
Posts: 11,868

Re: Help diagnosing machine check error

https://wiki.archlinux.org/title/Machin … _exception should help to get started troubleshooting this.


Disliking systemd intensely, but not satisfied with alternatives so focusing on taming systemd.


(A works at time B)  && (time C > time B ) ≠  (A works at time C)

Offline

#3 2022-04-18 04:33:37

matt-3
Member
From: California
Registered: 2020-06-09
Posts: 16

Re: Help diagnosing machine check error

I read that page before making this post, but I read it again to make sure I didn't miss anything. rasdaemon only monitors new events, so I guess I'll run it for next time, but I'm surprised there's no way to analyze an existing machine check error.
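Short of a proper tool, the status word of an error that is already in the log can at least be split into its flag bits by hand. The sketch below is not what rasdaemon or mcelog do internally. It assumes the bit positions of the Linux kernel's MCI_STATUS_* definitions for AMD CPUs and an IPID layout with McaType in bits 63:48 and HardwareID in bits 43:32, as I understand them from the kernel's SMCA code. The names decode_status and decode_ipid are made up for this example.

```python
# Assumed bit positions (Linux kernel MCI_STATUS_* for AMD); verify against
# the AMD PPR for the exact processor before relying on them.
STATUS_FLAGS = {
    63: "Val",       # register contents are valid
    62: "Overflow",  # an earlier error was overwritten
    61: "UC",        # uncorrected error
    60: "En",        # error reporting enabled
    59: "MiscV",     # MISC register valid
    58: "AddrV",     # ADDR register valid
    57: "PCC",       # processor context corrupt
    55: "TCC",       # task context corrupt
    53: "SyndV",     # SYND register valid
    46: "CECC",      # correctable ECC error
    45: "UECC",      # uncorrectable ECC error
    44: "Deferred",
    43: "Poison",
}


def decode_status(status: int) -> dict:
    """Split a 64-bit MCA status word into flag names and error codes."""
    flags = [
        name
        for bit, name in sorted(STATUS_FLAGS.items(), reverse=True)
        if (status >> bit) & 1
    ]
    return {
        "flags": flags,
        "error_code_ext": (status >> 16) & 0x3F,  # bits 21:16
        "error_code": status & 0xFFFF,            # bits 15:0
    }


def decode_ipid(ipid: int) -> dict:
    """Split the IPID value into McaType, HardwareID and InstanceId (assumed layout)."""
    return {
        "mca_type": (ipid >> 48) & 0xFFFF,
        "hardware_id": (ipid >> 32) & 0xFFF,
        "instance_id": ipid & 0xFFFFFFFF,
    }


if __name__ == "__main__":
    print(decode_status(0xBEA0000000000108))
    print(decode_ipid(0x500B000000000))
```

For the values in the first post this reports Val, UC, En, MiscV, AddrV, PCC, TCC and SyndV set, extended error code 0, error code 0x108, McaType 0x5 and HardwareID 0xb0. If I read the kernel's SMCA tables correctly that pair is the execution unit, but that is worth double-checking against the AMD PPR.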


Offline

#4 2022-04-19 11:19:21

Lone_Wolf
Member
From: Netherlands, Europe
Registered: 2005-10-04
Posts: 11,868

Re: Help diagnosing machine check error

I do remember mcelog had that functionality, and its man page confirms that its --ascii option does allow that.

There's https://aur.archlinux.org/packages/mcelog in the AUR, but you'll have to compile your own kernel with the necessary flag.

An alternative might be to look for a distro / machine that has mcelog and use that to interpret the existing reports.


Offline

#5 2022-04-19 21:12:47

matt-3
Member
From: California
Registered: 2020-06-09
Posts: 16

Re: Help diagnosing machine check error

mcelog does not know how to decode for my CPU.

mcelog: Unknown CPU type vendor 2 family 25 model 1

Offline

#6 2022-04-20 16:21:08

Lone_Wolf
Member
From: Netherlands, Europe
Registered: 2005-10-04
Posts: 11,868

Re: Help diagnosing machine check error

Is that from running mcelog on another distro / machine?

Anyway, mcelog --help should show a list of valid CPUs.

Assuming the CPU of the machine where the errors occur is in the list, you can then use the --cpu cputype option to force mcelog to decode for that CPU.


Offline

#7 2022-04-22 00:37:44

matt-3
Member
From: California
Registered: 2020-06-09
Posts: 16

Re: Help diagnosing machine check error

I'm not sure how I can get my CPU type in the same format that mcelog uses. I don't think it's there, anyway.
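One place to start is the PROCESSOR field in the log itself. A minimal sketch, assuming that field is `vendor:signature` with the signature being the hex CPUID value (which is how the kernel appears to print it) and using the standard CPUID family/model encoding. decode_cpuid_signature is a hypothetical helper name, and this only produces family/model numbers, not the CPU names that mcelog lists.

```python
def decode_cpuid_signature(field: str) -> dict:
    """Decode a 'vendor:signature' string such as '2:a20f10' (assumed format)."""
    vendor_text, signature_text = field.split(":")
    eax = int(signature_text, 16)

    stepping = eax & 0xF
    base_model = (eax >> 4) & 0xF
    base_family = (eax >> 8) & 0xF
    ext_model = (eax >> 16) & 0xF
    ext_family = (eax >> 20) & 0xFF

    # Standard CPUID rule: the extended family is added only when the base
    # family is 0xF; the extended model is used for base family 0x6 or 0xF.
    family = base_family + ext_family if base_family == 0xF else base_family
    if base_family in (0x6, 0xF):
        model = (ext_model << 4) | base_model
    else:
        model = base_model

    return {
        "vendor": int(vendor_text),  # kernel-internal vendor number
        "family": family,
        "model": model,
        "base_model": base_model,
        "stepping": stepping,
    }


if __name__ == "__main__":
    print(decode_cpuid_signature("2:a20f10"))
```

For 2:a20f10 this gives family 25, model 0x21 (base model 1) and stepping 0. The family and the base model line up with the "family 25 model 1" in mcelog's complaint.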


Offline

#8 2022-04-23 12:54:19

Lone_Wolf
Member
From: Netherlands, Europe
Registered: 2005-10-04
Posts: 11,868

Re: Help diagnosing machine check error

Then I'm out of ideas. Looks like you'll have to wait until new errors occur and rasdaemon stores them / sends them to the journal.
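Until then, the lines already in the journal can at least be pulled into structured records, so they are easy to compare with whatever shows up later. A rough sketch, assuming the line format shown in the first post, with all values in hex except CPU, Bank, TIME and SOCKET (treated as decimal) and TIME being Unix epoch seconds. parse_mce_lines and the record keys are invented for this example.

```python
import re
from datetime import datetime, timezone

# Sample text copied from the first post in this thread.
SAMPLE = """\
mce: [Hardware Error]: CPU 15: Machine Check: 0 Bank 5: bea0000000000108
mce: [Hardware Error]: TSC 0 ADDR 8a3a72 MISC d012000100000000 SYND 4d000000 IPID 500b000000000
mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1649997706 SOCKET 0 APIC 1e microcode a201016
mce: [Hardware Error]: CPU 19: Machine Check: 0 Bank 5: bea0000000000108
mce: [Hardware Error]: TSC 0 ADDR 8a3a96 MISC d012000100000000 SYND 4d000000 IPID 500b000000000
"""

HEADER = re.compile(r"CPU (\d+): Machine Check: ([0-9a-f]+) Bank (\d+): ([0-9a-f]+)")
FIELD = re.compile(r"\b(TSC|ADDR|MISC|SYND|IPID|TIME|SOCKET|APIC|microcode) ([0-9a-f]+)")
DECIMAL_FIELDS = {"TIME", "SOCKET"}  # assumption: everything else is hex


def parse_mce_lines(text: str) -> list:
    """Group '[Hardware Error]' lines into one dict per machine check event."""
    records = []
    for line in text.splitlines():
        if "[Hardware Error]" not in line:
            continue
        header = HEADER.search(line)
        if header:
            # A 'CPU n: Machine Check' line starts a new record.
            records.append({
                "cpu": int(header[1]),
                "mcgstatus": int(header[2], 16),
                "bank": int(header[3]),
                "status": int(header[4], 16),
            })
            continue
        if not records:
            continue  # detail line without a preceding header line
        for key, value in FIELD.findall(line):
            base = 10 if key in DECIMAL_FIELDS else 16
            records[-1][key.lower()] = int(value, base)
    for record in records:
        if "time" in record:
            record["time"] = datetime.fromtimestamp(record["time"], tz=timezone.utc)
    return records


if __name__ == "__main__":
    for record in parse_mce_lines(SAMPLE):
        print(record)
```

On the sample this yields two records, one for CPU 15 and one for CPU 19, both on bank 5 with the same status word and ADDR values that differ only slightly (8a3a72 and 8a3a96).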


Offline
