#1 2021-03-28 15:45:25

6rJ27GKfu
Member
Registered: 2020-06-17
Posts: 13

Sudden reboots under load and kernel error mce: [Hardware Error]?

My system sometimes reboots suddenly while I'm playing games, e.g. Counter-Strike: Global Offensive or GTA Online, and afterwards I find the following error in journalctl:

kernel: mce: [Hardware Error]: CPU 1: Machine Check: 0 Bank 5: bea0000000000108
kernel: mce: [Hardware Error]: TSC 0 ADDR 6a35d4a8 MISC d012000100000000 SYND 4d000000 IPID 500b000000000 
kernel: mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1616930179 SOCKET 0 APIC 2 microcode 8701021
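
For reference, the status word printed after "Bank 5:" can be split into its flag bits. What follows is only a minimal Python sketch, assuming the architectural MCi_STATUS layout shared by x86 CPUs (bits 63 to 57 as listed in the table below, low 16 bits = MCA error code). The function name decode_mci_status is made up for this example, and the vendor-specific bits are deliberately left undecoded.

```python
# Flag bits of the x86 MCi_STATUS register (architectural part only;
# vendor-specific bits are ignored in this sketch).
STATUS_FLAGS = {
    63: "VAL",    # record is valid
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting was enabled
    59: "MISCV",  # MISC register holds valid data
    58: "ADDRV",  # ADDR register holds valid data
    57: "PCC",    # processor context corrupt
}


def decode_mci_status(status_hex):
    """Split a hex MCi_STATUS value into its set flags and 16-bit error code."""
    status = int(status_hex, 16)
    flags = [name for bit, name in STATUS_FLAGS.items() if (status >> bit) & 1]
    return {"flags": flags, "error_code": status & 0xFFFF}


if __name__ == "__main__":
    # Status value copied from the journal lines above.
    print(decode_mci_status("bea0000000000108"))
```

For the value in the log this reports VAL, UC, EN, MISCV, ADDRV and PCC, i.e. an uncorrected error with corrupted processor context, which would fit a hard reset rather than a clean shutdown.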

I have already searched for this error and found a lot of threads, but none of them helped me identify the cause or offered a solution. I have installed rasdaemon, but no error has occurred since.

Is there anything I can do to prevent this?

Here are my system's specifications:

System:    Kernel: 5.11.10-zen1-1-zen x86_64 bits: 64 compiler: gcc v: 10.2.0 
           parameters: initrd=\amd-ucode.img initrd=\initramfs-linux-zen.img cryptdevice=/dev/nvme0n1p2:main:allow-discards 
           root=/dev/mapper/main-root rw lang=de init=/usr/lib/systemd/systemd locale=de_DE.UTF-8 amd_iommu=on iommu=pt 
           module_blacklist=nouveau vfio-pci.ids=10de:13c2,10de:0fbb pcie_acs_override=downstream,multifunction 
           Desktop: KDE Plasma 5.21.3 tk: Qt 5.15.2 wm: kwin_x11 vt: 1 dm: SDDM Distro: Arch Linux 
Machine:   Type: Desktop Mobo: Micro-Star model: B450 TOMAHAWK MAX (MS-7C02) v: 1.0 serial: <filter> 
           UEFI: American Megatrends LLC. v: 3.A0 date: 02/03/2021 
CPU:       Info: 6-Core model: AMD Ryzen 5 3600 bits: 64 type: MT MCP arch: Zen 2 family: 17 (23) model-id: 71 (113) 
           stepping: N/A microcode: 8701021 cache: L2: 3 MiB 
           flags: avx avx2 lm nx pae sse sse2 sse3 sse4_1 sse4_2 sse4a ssse3 svm bogomips: 100800 
           Speed: 4200 MHz min/max: N/A Core speeds (MHz): 1: 4200 2: 4200 3: 4200 4: 4200 5: 4200 6: 4200 7: 4200 8: 4200 
           9: 4200 10: 4200 11: 4200 12: 4200 
           Vulnerabilities: Type: itlb_multihit status: Not affected 
           Type: l1tf status: Not affected 
           Type: mds status: Not affected 
           Type: meltdown status: Not affected 
           Type: spec_store_bypass mitigation: Speculative Store Bypass disabled via prctl and seccomp 
           Type: spectre_v1 mitigation: usercopy/swapgs barriers and __user pointer sanitization 
           Type: spectre_v2 mitigation: Full AMD retpoline, IBPB: conditional, STIBP: conditional, RSB filling 
           Type: srbds status: Not affected 
           Type: tsx_async_abort status: Not affected 
Graphics:  Device-1: NVIDIA GM204 [GeForce GTX 970] vendor: Micro-Star MSI driver: vfio-pci v: 0.2 alternate: nouveau 
           bus-ID: 25:00.0 chip-ID: 10de:13c2 class-ID: 0300 
           Device-2: Advanced Micro Devices [AMD/ATI] Navi 10 [Radeon RX 5600 OEM/5600 XT / 5700/5700 XT] 
           vendor: Micro-Star MSI driver: amdgpu v: kernel bus-ID: 28:00.0 chip-ID: 1002:731f class-ID: 0300 
           Display: x11 server: X.Org 1.20.10 compositor: kwin_x11 driver: loaded: amdgpu alternate: radeon display-ID: :0 
           screens: 1 
           Screen-1: 0 s-res: 2560x1440 s-dpi: 96 s-size: 677x381mm (26.7x15.0") s-diag: 777mm (30.6") 
           Monitor-1: DisplayPort-0 res: 2560x1440 hz: 60 dpi: 118 size: 553x311mm (21.8x12.2") diag: 634mm (25") 
           OpenGL: renderer: AMD Radeon RX 5700 (NAVI10 DRM 3.40.0 5.11.10-zen1-1-zen LLVM 13.0.0) 
           v: 4.6 Mesa 21.1.0-devel (git-b95c62545a) direct render: Yes 
Audio:     Device-1: NVIDIA GM204 High Definition Audio vendor: Micro-Star MSI driver: vfio-pci v: 0.2 
           alternate: snd_hda_intel bus-ID: 25:00.1 chip-ID: 10de:0fbb class-ID: 0403 
           Device-2: Advanced Micro Devices [AMD/ATI] Navi 10 HDMI Audio driver: snd_hda_intel v: kernel bus-ID: 28:00.1 
           chip-ID: 1002:ab38 class-ID: 0403 
           Device-3: Advanced Micro Devices [AMD] Starship/Matisse HD Audio vendor: Micro-Star MSI driver: snd_hda_intel 
           v: kernel bus-ID: 2a:00.4 chip-ID: 1022:1487 class-ID: 0403 
           Device-4: Yamaha Steinberg UR22mkII type: USB driver: snd-usb-audio bus-ID: 3-1.1:4 chip-ID: 0499:170f 
           class-ID: 0103 
           Sound Server-1: ALSA v: k5.11.10-zen1-1-zen running: yes 
           Sound Server-2: JACK v: 0.125.0 running: no 
           Sound Server-3: PulseAudio v: 14.2 running: yes 
           Sound Server-4: PipeWire v: 0.3.24 running: no 
Network:   Device-1: Realtek RTL8111/8168/8411 PCI Express Gigabit Ethernet vendor: Micro-Star MSI driver: r8169 v: kernel 
           port: e000 bus-ID: 22:00.0 chip-ID: 10ec:8168 class-ID: 0200 
           IF: enp34s0 state: up speed: 1000 Mbps duplex: full mac: <filter> 
           IP v4: <filter> type: dynamic noprefixroute scope: global broadcast: <filter> 
           IP v6: <filter> type: dynamic noprefixroute scope: global 
           IP v6: <filter> type: noprefixroute scope: link 
           Device-2: Microsoft Xbox 360 Wireless Adapter type: USB driver: xpad bus-ID: 1-8:3 chip-ID: 045e:0719 
           class-ID: ff00 serial: <filter> 
           IF-ID-1: virbr0 state: down mac: <filter> 
           IP v4: <filter> scope: global broadcast: <filter> 
           WAN IP: <filter> 
Drives:    Local Storage: total: 10.01 TiB used: 7.15 TiB (71.5%) 
           SMART Message: Unable to run smartctl. Root privileges required. 
           ID-1: /dev/nvme0n1 maj-min: 259:0 vendor: Samsung model: SSD 970 EVO 1TB size: 931.51 GiB block-size: 
           physical: 512 B logical: 512 B speed: 31.6 Gb/s lanes: 4 rotation: SSD serial: <filter> rev: 2B2QEXE7 temp: 43.9 C 
           scheme: GPT 
           ID-2: /dev/sda maj-min: 8:0 vendor: Western Digital model: WD30EZRX-00D8PB0 size: 2.73 TiB block-size: 
           physical: 4096 B logical: 512 B speed: 6.0 Gb/s rotation: 5400 rpm serial: <filter> rev: 0A80 scheme: GPT 
           ID-3: /dev/sdb maj-min: 8:16 vendor: Seagate model: ST2000DM008-2FR102 size: 1.82 TiB block-size: physical: 4096 B 
           logical: 512 B speed: 6.0 Gb/s rotation: 7200 rpm serial: <filter> rev: 0001 scheme: GPT 
           ID-4: /dev/sdc maj-min: 8:32 vendor: Seagate model: ST4000DM004-2CV104 size: 3.64 TiB block-size: physical: 4096 B 
           logical: 512 B speed: 6.0 Gb/s rotation: 5425 rpm serial: <filter> rev: 0001 scheme: GPT 
           ID-5: /dev/sdd maj-min: 8:48 vendor: Western Digital model: WD10EADS-00M2B0 size: 931.51 GiB block-size: 
           physical: 512 B logical: 512 B speed: 3.0 Gb/s serial: <filter> rev: 0A01 scheme: GPT 
Partition: ID-1: / raw-size: 100 GiB size: 97.93 GiB (97.93%) used: 28.5 GiB (29.1%) fs: ext4 dev: /dev/dm-1 maj-min: 254:1 
           mapped: main-root 
           ID-2: /boot raw-size: 512 MiB size: 511 MiB (99.80%) used: 149.6 MiB (29.3%) fs: vfat dev: /dev/nvme0n1p1 
           maj-min: 259:1 
           ID-3: /home raw-size: 830.99 GiB size: 816.95 GiB (98.31%) used: 404.62 GiB (49.5%) fs: ext4 dev: /dev/dm-2 
           maj-min: 254:2 mapped: main-home 
Swap:      Kernel: swappiness: 60 (default) cache-pressure: 100 (default) 
           ID-1: swap-1 type: zram size: 667.2 MiB used: 0 KiB (0.0%) priority: 32767 dev: /dev/zram0 
           ID-2: swap-2 type: zram size: 667.2 MiB used: 0 KiB (0.0%) priority: 32767 dev: /dev/zram1 
           ID-3: swap-3 type: zram size: 667.2 MiB used: 0 KiB (0.0%) priority: 32767 dev: /dev/zram2 
           ID-4: swap-4 type: zram size: 667.2 MiB used: 0 KiB (0.0%) priority: 32767 dev: /dev/zram3 
           ID-5: swap-5 type: zram size: 667.2 MiB used: 0 KiB (0.0%) priority: 32767 dev: /dev/zram4 
           ID-6: swap-6 type: zram size: 667.2 MiB used: 0 KiB (0.0%) priority: 32767 dev: /dev/zram5 
           ID-7: swap-7 type: zram size: 667.2 MiB used: 0 KiB (0.0%) priority: 32767 dev: /dev/zram6 
           ID-8: swap-8 type: zram size: 667.2 MiB used: 0 KiB (0.0%) priority: 32767 dev: /dev/zram7 
           ID-9: swap-9 type: zram size: 667.2 MiB used: 0 KiB (0.0%) priority: 32767 dev: /dev/zram8 
           ID-10: swap-10 type: zram size: 667.2 MiB used: 0 KiB (0.0%) priority: 32767 dev: /dev/zram9 
           ID-11: swap-11 type: zram size: 667.2 MiB used: 0 KiB (0.0%) priority: 32767 dev: /dev/zram10 
           ID-12: swap-12 type: zram size: 667.2 MiB used: 0 KiB (0.0%) priority: 32767 dev: /dev/zram11 
Sensors:   System Temperatures: cpu: 51.1 C mobo: 38.0 C gpu: amdgpu temp: 58.0 C mem: 56.0 C 
           Fan Speeds (RPM): fan-1: 0 fan-2: 957 fan-3: 0 fan-4: 0 fan-5: 0 fan-6: 0 gpu: amdgpu fan: 0 
Info:      Processes: 337 Uptime: 4h 24m wakeups: 0 Memory: 31.28 GiB used: 5.35 GiB (17.1%) Init: systemd v: 248 
           tool: systemctl Compilers: gcc: 10.2.0 clang: 11.1.0 Packages: pacman: 1200 lib: 455 Shell: Bash v: 5.1.4 
           running-in: konsole inxi: 3.3.03

#2 2021-03-29 02:13:11

ponyrider
Member
Registered: 2014-11-18
Posts: 112

Re: Sudden reboots under load and kernel error mce: [Hardware Error]?

First things to check would be

1. overheating - check for dust
2. faulty ram stick - use memtest

#3 2021-03-29 05:35:43

liara
Member
Registered: 2013-04-10
Posts: 32

Re: Sudden reboots under load and kernel error mce: [Hardware Error]?

Adding to @ponyrider's list of things to check:

3. upgrade the firmware (https://bugs.archlinux.org/task/53112)

#4 2021-03-29 06:31:36

ponyrider
Member
Registered: 2014-11-18
Posts: 112

Re: Sudden reboots under load and kernel error mce: [Hardware Error]?

@liara

Thanks for posting a 4-year-old 'bug' which was closed for a reason:

Closed by Doug Newgard (Scimmia)
Saturday, 08 April 2017, 21:13 GMT
Reason for closing: None
Additional comments about closing: Either a hardware or firmware problem, nothing related to Arch

#5 2021-03-29 07:06:26

seth
Member
Registered: 2012-09-03
Posts: 51,299

#6 2021-03-30 20:19:32

6rJ27GKfu
Member
Registered: 2020-06-17
Posts: 13

Re: Sudden reboots under load and kernel error mce: [Hardware Error]?

1. Overheating could be an issue, but when I compile things I get normal temperatures under heavy load, around 80-90 °C on my CPU (the group of Tctl, Tdie and Tccd1). I just ran mprime for a few minutes, though, and my CPU went above 100 °C, so I aborted the stress test.
2. RAM is fine: I ran a full memtest86 pass and it returned no errors. CPU temperature was fine during the test (min/max/avg = 50 °C/79 °C/64 °C).
3. Never mind.
4. I've got amd-ucode installed and added to systemd-boot.
5. Since my system doesn't freeze on shutdown or reboot, I guess that's not the issue. Also, C-state problems show up at idle, which is not the case for me.
6. I've read somewhere that the issue could be connected to hardware acceleration. I'll try to reproduce this by stress testing the GPU.
Edit: I've run Geeks3D GPUTest (FurMark OpenGL 2.1/3.0) for Windows through Lutris, with Lutris 6.0 and Proton 5.13, for a minute each. GPU temperature was around 90 °C. This might be completely useless, but the issues have always occurred while gaming. Maybe someone has a better idea for this.
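
To watch the temperatures mentioned above during a stress test, the sensors can be read straight from the kernel's hwmon sysfs interface. This is a rough Python sketch, not a replacement for lm_sensors. It assumes the usual layout of /sys/class/hwmon (one hwmonN directory per chip, tempN_input files in millidegrees Celsius, optional tempN_label files); the helper names are made up for this example.

```python
from pathlib import Path


def _read(path, default=None):
    """Read a small sysfs file, returning `default` if it cannot be read."""
    try:
        return path.read_text().strip()
    except OSError:
        return default


def read_hwmon_temps(root="/sys/class/hwmon"):
    """Return {(chip, label): degrees Celsius} for every readable temp*_input."""
    temps = {}
    for chip_dir in sorted(Path(root).glob("hwmon*")):
        chip = _read(chip_dir / "name", chip_dir.name)
        for input_file in sorted(chip_dir.glob("temp*_input")):
            label_file = input_file.with_name(
                input_file.name.replace("_input", "_label"))
            label = _read(label_file, input_file.name)
            raw = _read(input_file)
            try:
                # hwmon reports temperatures in millidegrees Celsius
                temps[(chip, label)] = int(raw) / 1000.0
            except (TypeError, ValueError):
                continue  # unreadable or malformed sensor value; skip it
    return temps


if __name__ == "__main__":
    for (chip, label), celsius in read_hwmon_temps().items():
        print(f"{chip:12s} {label:14s} {celsius:6.1f} °C")
```

Running this in a loop next to mprime or a game would show whether the CPU sensors (on AMD systems typically reported by the k10temp driver) or the amdgpu sensors climb right before a reboot.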

Last edited by 6rJ27GKfu (2021-03-30 20:42:34)

#7 2021-03-31 04:18:57

liara
Member
Registered: 2013-04-10
Posts: 32

Re: Sudden reboots under load and kernel error mce: [Hardware Error]?

@ponyrider oops

#8 2021-03-31 05:59:53

orlfman
Member
Registered: 2007-11-20
Posts: 141

Re: Sudden reboots under load and kernel error mce: [Hardware Error]?

6rJ27GKfu wrote:

1. Overheating could be an issue, but when I compile things I get normal temperatures under heavy load, around 80-90 °C on my CPU (the group of Tctl, Tdie and Tccd1). I just ran mprime for a few minutes, though, and my CPU went above 100 °C, so I aborted the stress test.
2. RAM is fine: I ran a full memtest86 pass and it returned no errors. CPU temperature was fine during the test (min/max/avg = 50 °C/79 °C/64 °C).
3. Never mind.
4. I've got amd-ucode installed and added to systemd-boot.
5. Since my system doesn't freeze on shutdown or reboot, I guess that's not the issue. Also, C-state problems show up at idle, which is not the case for me.
6. I've read somewhere that the issue could be connected to hardware acceleration. I'll try to reproduce this by stress testing the GPU.
Edit: I've run Geeks3D GPUTest (FurMark OpenGL 2.1/3.0) for Windows through Lutris, with Lutris 6.0 and Proton 5.13, for a minute each. GPU temperature was around 90 °C. This might be completely useless, but the issues have always occurred while gaming. Maybe someone has a better idea for this.

1.
Overheating usually won't actually shut the system down. AMD Ryzen processors, like Intel's, are designed to throttle thermally. The worst case is a thermal shutdown, but that shouldn't cause an MCE unless thermal throttling and thermal shutdown are both disabled.
2.
Even if it "passes" memtest, RAM can still be the problem. When overclocking, it's recommended to test with several different memory testers plus some memory-intensive software. Even though you are not overclocking, that recommendation still applies, because memory works in mysterious ways when it comes to instability. Compared to Intel, AMD's Ryzen generations have historically been very sensitive to memory and prone to instability because of it. Unfortunately, Linux doesn't have most of the more popular memory testers and general system stability monitoring and testing software. You would have to use Windows if you really want to test the memory.
Ultimately, it is easier to test on Linux at the JEDEC standard. I would set your RAM to the JEDEC standard (i.e. disable XMP) and run your system for a while in that configuration. If that fixes it, then the problem is your memory / XMP profile. If not, it could be something else.

If it is the memory / XMP profile, then the actual culprit could be your Ryzen's memory controller, since the silicon quality of the memory controller has historically varied from chip to chip on Ryzen. The next suspect would be the silicon quality of the Infinity Fabric, i.e. whether it can run at those speeds.
4.
The ucode package probably won't do much if you are running the latest BIOS. Since AGESA updates from AMD ship the latest microcode, loading microcode from Linux is only worth doing if you don't update your BIOS.
5.
C-states have historically been a problem on AMD Ryzen because some power supplies don't provide adequate voltage for the low sleep states, notably C6. I know there is an option in the BIOS about voltage handling, with the settings "Low Current Idle" and "Typical Current Idle". Low Current Idle is for newer-spec power supplies that are certified for extremely low voltage idle states, and Typical Current Idle is for power supplies that can't handle the low voltage idle.
But if you are not getting issues at idle, then C-states are most likely not the problem you are facing.
6.
Since the GPU runs off the PCI Express lanes in the Ryzen CPU's I/O die, a hardware fault in the GPU can cause a CPU fault, i.e. an MCE. 90 °C on the GPU is fairly high. You could try taking the GPU apart, applying fresh thermal paste and new pads of the appropriate thickness, and giving it a good dusting. Another possible problem is your power supply not providing enough capacity for your system. But seeing that you have an NVIDIA GTX 970, which isn't typically known as a power hog with its roughly 150 W TDP rating, that seems less likely.

Personally, I'm leaning towards it being caused by either the GPU or the memory. Since it's a 970, it's old and probably needs new thermal paste and pads. And memory has always been a problem on Ryzen. I would try the JEDEC standard for a while, followed by a nice deep clean of the GPU.

Last edited by orlfman (2021-03-31 06:13:49)

#9 2021-03-31 11:35:06

6rJ27GKfu
Member
Registered: 2020-06-17
Posts: 13

Re: Sudden reboots under load and kernel error mce: [Hardware Error]?

2. I really don't want to do a thorough memory test, but it could be XMP. I'll disable it and give it another go.
4. BIOS is up to date.
6. My primary GPU is a RX 5700, the GTX 970 is for my VM and blacklisted.
Yeah, maybe dusting off my PC will help.

EDIT:
I really don't get when and why the reboots happen! I've been playing GTA Online for hours over the past days without any issues, but today it rebooted twice within two hours. Here are the error logs:

Apr 03 14:52:36 desktop kernel: mce: [Hardware Error]: CPU 1: Machine Check: 0 Bank 5: bea0000000000108
Apr 03 14:52:36 desktop kernel: mce: [Hardware Error]: TSC 0 ADDR 6a3533f0 MISC d012000100000000 SYND 4d000000 IPID 500b000000000 
Apr 03 14:52:36 desktop kernel: mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1617454345 SOCKET 0 APIC 2 microcode 8701021
Apr 03 14:52:36 desktop kernel: mce: [Hardware Error]: CPU 5: Machine Check: 0 Bank 5: bea0000000000108
Apr 03 14:52:36 desktop kernel: mce: [Hardware Error]: TSC 0 ADDR 1ffff8d8d4320 MISC d012000100000000 SYND 4d000000 IPID 500b000000000 
Apr 03 14:52:36 desktop kernel: mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1617454345 SOCKET 0 APIC c microcode 8701021
Apr 03 15:28:48 desktop kernel: mce: [Hardware Error]: CPU 1: Machine Check: 0 Bank 5: bea0000000000108
Apr 03 15:28:48 desktop kernel: mce: [Hardware Error]: TSC 0 ADDR 6a3bf118 MISC d012000200000000 SYND 4d000000 IPID 500b000000000 
Apr 03 15:28:48 desktop kernel: mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1617456509 SOCKET 0 APIC 2 microcode 8701021
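
When several of these records pile up, it can help to group the three lines of each event and compare CPU, bank, status and address side by side. Below is a small Python sketch for that. It assumes the line format shown in the log above and that the TIME field is a Unix timestamp; parse_mce and the regular expressions are invented for this example and are not part of rasdaemon or any other tool.

```python
import re
from datetime import datetime, timezone

HEADER = re.compile(r"CPU (\d+): Machine Check: \d+ Bank (\d+): ([0-9a-f]+)")
ADDR = re.compile(r"ADDR ([0-9a-f]+)")
TIME = re.compile(r"TIME (\d+)")


def parse_mce(lines):
    """Group 'mce: [Hardware Error]' journal lines into one dict per event."""
    events = []
    for line in lines:
        if "mce: [Hardware Error]" not in line:
            continue
        header = HEADER.search(line)
        if header:
            # A "CPU n: Machine Check" line starts a new event.
            events.append({"cpu": int(header[1]),
                           "bank": int(header[2]),
                           "status": header[3]})
            continue
        if not events:
            continue  # detail line without a preceding header; ignore it
        addr = ADDR.search(line)
        if addr:
            events[-1]["addr"] = addr[1]
        stamp = TIME.search(line)
        if stamp:
            # Assumed to be seconds since the Unix epoch.
            events[-1]["time"] = datetime.fromtimestamp(int(stamp[1]),
                                                        tz=timezone.utc)
    return events


if __name__ == "__main__":
    import sys
    # Example: journalctl -k -b -1 | python this_script.py
    for event in parse_mce(sys.stdin):
        print(event)
```

In the log above, every event carries the same bank and status value and only the CPU and ADDR fields differ, which this kind of grouping makes easy to see.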

EDIT #2:
I don't get how to use rasdaemon, btw. It only returns the following:

$ ras-mc-ctl --error-count
ras-mc-ctl: Error: No DIMMs found in /sys or new sysfs EDAC interface not found.

$ ras-mc-ctl --summary
No Memory errors.
No PCIe AER errors.
No Extlog errors.
No devlink errors.
Disk errors summary:
        0:0 has 6 errors
        0:2112 has 1 errors
        0:66304 has 1307 errors
No MCE errors.

Last edited by 6rJ27GKfu (2021-04-03 17:00:55)

#10 2021-06-27 05:37:35

maaxx100
Member
Registered: 2019-03-21
Posts: 1

Re: Sudden reboots under load and kernel error mce: [Hardware Error]?

I don't know if you have had the same crashes lately, but I previously experienced them when I had XMP enabled. After disabling XMP, I did not experience any crashes. Recently, however, after enabling SVM, my computer has crashed while I was just navigating. These crashes have been very frequent, a few times per day, yet they only started some time after I enabled SVM. For now I have disabled SVM as well and will post an update on how things go.

Update:
It crashed again after a couple of days of running normally. :\

Last edited by maaxx100 (2021-07-02 06:27:00)

#11 2021-12-25 12:54:52

zesko
Member
Registered: 2021-08-22
Posts: 10

Re: Sudden reboots under load and kernel error mce: [Hardware Error]?

I have the same issue

$ journalctl --no-pager -p 3 -b -1
Dez 25 13:36:07 zesko kernel: mce: [Hardware Error]: CPU 4: Machine Check: 0 Bank 5: bea0000000000108
Dez 25 13:36:07 zesko kernel: mce: [Hardware Error]: TSC 0 ADDR 1ffffc0f15742 MISC d012000100000000 SYND 4d000000 IPID 500b000000000 
Dez 25 13:36:07 zesko kernel: mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1640439365 SOCKET 0 APIC a microcode 8701021
Dez 25 13:36:07 zesko kernel: mce: [Hardware Error]: CPU 8: Machine Check: 0 Bank 5: bea0000000000108
Dez 25 13:36:07 zesko kernel: mce: [Hardware Error]: TSC 0 ADDR 1ffffa46e7fd0 MISC d012000100000000 SYND 4d000000 IPID 500b000000000 
Dez 25 13:36:07 zesko kernel: mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1640439365 SOCKET 0 APIC 5 microcode 8701021

I think it is related to the AMD RX 5700 GPU: the PC was forced to reboot after I watched YouTube in the Vivaldi browser with hardware acceleration enabled. But the error log suggests that the CPU is the problem, not the GPU?
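
On the question of which component the log points at: the IPID field identifies the hardware block that logged the record. Here is a hedged Python sketch that pulls out the two relevant fields, assuming the AMD scalable MCA layout (hardware ID in bits 43:32, MCA type in bits 63:48). The one-entry lookup table is an assumption based on the Linux kernel's SMCA bank type table and should be checked against the kernel source; decode_ipid is a made-up name.

```python
# Assumed mapping (hardware ID, MCA type) -> bank type, modelled on the
# Linux kernel's SMCA table; only the entry relevant to this thread is listed.
KNOWN_BANK_TYPES = {(0x0B0, 0x5): "EX (execution unit)"}


def decode_ipid(ipid_hex):
    """Extract hardware ID and MCA type from an MCA_IPID value given as hex."""
    ipid = int(ipid_hex, 16)
    hwid = (ipid >> 32) & 0xFFF       # bits 43:32
    mca_type = (ipid >> 48) & 0xFFFF  # bits 63:48
    bank_type = KNOWN_BANK_TYPES.get((hwid, mca_type), "unknown")
    return hwid, mca_type, bank_type


if __name__ == "__main__":
    hwid, mca_type, bank_type = decode_ipid("500b000000000")
    print(f"hwid={hwid:#x} mca_type={mca_type:#x} bank_type={bank_type}")
```

For IPID 500b000000000 this gives hardware ID 0xb0 and MCA type 0x5, i.e. the record was logged by a bank inside the CPU core. That only says where the error was detected, not what triggered it, so it does not rule out a device attached to the CPU's PCIe lanes.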

Last edited by zesko (2021-12-25 12:55:31)

#12 2022-01-05 17:34:35

zesko
Member
Registered: 2021-08-22
Posts: 10

Re: Sudden reboots under load and kernel error mce: [Hardware Error]?

https://gitlab.freedesktop.org/drm/amd/-/issues/1481

People there say that upgrading to an RX 6000 series card resolves this problem.
