Hardware Error: Machine check events

ought · 2023-07-23 09:04:14

Hello. A few days ago, after updating packages, I started getting freezes, and sometimes complete hangs, of my PC when I run CPU-intensive jobs such as playing video games or compiling applications.
journalctl shows these errors:

kernel: mce: [Hardware Error]: Machine check events logged
kernel: mce: [Hardware Error]: CPU 3: Machine Check: 0 Bank 0: d400004000040150
kernel: mce: [Hardware Error]: TSC 3d7fc4204c4 ADDR 1ffff9121d656 
kernel: mce: [Hardware Error]: PROCESSOR 0:a0655 TIME 1689933476 SOCKET 0 APIC 6
kernel: BUG: unable to handle page fault for address: ffffffffef6e6576
kernel: #PF: supervisor read access in kernel mode
kernel: #PF: error_code(0x0000) - not-present page
kernel: PGD 227025067 P4D 227025067 PUD 227027067 PMD 0 
kernel: Oops: 0000 [#1] PREEMPT SMP NOPTI
kernel: CPU: 11 PID: 52085 Comm: cc1 Tainted: G           OE      6.4.4-arch1-1 #1
kernel: Hardware name: Micro-Star International Co., Ltd. MS-7C75/Z490-A PRO (MS-7C75),
kernel: RIP: 0010:do_user_addr_fault+0x78/0x640
kernel: Code: ff 7f 0f 84 09 02 00 00 65 48 8b 05 0a b9 18 6f 48 85 c0 74 15 be 0e 00
kernel: RSP: 0000:ffffb9a960f5fef0 EFLAGS: 00010006
kernel: RAX: 0000000000000004 RBX: 0000000000000006 RCX: 0000000000000000
kernel: RDX: 00007f87865f8010 RSI: 0000000000000006 RDI: ffffb9a960f5ff58
kernel: RBP: ffffb9a960f5ff58 R08: 0000000000000000 R09: 0000000000000000
kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 00007f87865f8010
kernel: R13: 0000000000000006 R14: ffff952844af1b80 R15: 0000000000000000
kernel: FS:  00007f878c0bd840(0000) GS:ffff952f7dcc0000(0000) knlGS:0000000000000000
kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: ffffffffef6e6576 CR3: 0000000349f0a005 CR4: 00000000007706e0
kernel: PKRU: 55555554
kernel: Call Trace:
kernel:  <TASK>
kernel:  ? __die+0x23/0x70
kernel:  ? page_fault_oops+0x171/0x4e0
kernel:  ? exc_page_fault+0x175/0x180
kernel:  ? asm_exc_page_fault+0x26/0x30
kernel:  ? do_user_addr_fault+0x78/0x640
kernel:  exc_page_fault+0x7f/0x180
kernel:  asm_exc_page_fault+0x26/0x30
kernel: RIP: 0033:0x107e030
kernel: Code: 28 f5 ff 49 89 c0 48 83 fb 08 73 1e f6 c3 04 75 39 48 85 db 74 2d c6 00
kernel: RSP: 002b:00007ffd15edac50 EFLAGS: 00010202
kernel: RAX: 00007f87865f8000 RBX: 0000000000000018 RCX: 000000000000000c
kernel: RDX: 0000000000000006 RSI: 0000000000004000 RDI: 0000000000000001
kernel: RBP: 00007f8786b4eab0 R08: 00007f87865f8000 R09: 000000000418a350
kernel: R10: 00000000040678f0 R11: 0000000000000001 R12: 0000000000000000
kernel: R13: 00007f878977ec00 R14: 00000000040e5f58 R15: 0000000000000025
kernel:  </TASK>
kernel: Modules linked in: veth tun snd_seq_dummy snd_hrtimer snd_seq rfcomm
kernel:  snd_compress snd_seq_device ac97_bus ledtrig_audio joydev snd_hda_codec_hdmi
kernel:  crc32c_generic crc16 mbcache jbd2 usbhid amdgpu i2c_algo_bit drm_ttm_helper
kernel: CR2: ffffffffef6e6576
kernel: ---[ end trace 0000000000000000 ]---
kernel: RIP: 0010:do_user_addr_fault+0x78/0x640
kernel: Code: ff 7f 0f 84 09 02 00 00 65 48 8b 05 0a b9 18 6f 48 85 c0 74 15 be 0e 00
kernel: RSP: 0000:ffffb9a960f5fef0 EFLAGS: 00010006
kernel: RAX: 0000000000000004 RBX: 0000000000000006 RCX: 0000000000000000
kernel: RDX: 00007f87865f8010 RSI: 0000000000000006 RDI: ffffb9a960f5ff58
kernel: RBP: ffffb9a960f5ff58 R08: 0000000000000000 R09: 0000000000000000
kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 00007f87865f8010
kernel: R13: 0000000000000006 R14: ffff952844af1b80 R15: 0000000000000000
kernel: FS:  00007f878c0bd840(0000) GS:ffff952f7dcc0000(0000) knlGS:0000000000000000
kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: ffffffffef6e6576 CR3: 0000000349f0a005 CR4: 00000000007706e0
kernel: PKRU: 55555554
kernel: note: cc1[52085] exited with irqs disabled

If I receive such an error, but Linux doesn't freeze, then on shutdown I get endless spam of these lines, and the PC doesn't turn off:

[ ... ] tasks_rcu_exit_srcu_stall: rcu_tasks grace period number 17 (since boot) gp_state: RTGS_POST_SCAN_TASKLIST is ... jiffies old.
[ ... ] Please check any exiting tasks stuck between calls to exit_tasks_rcu_start() and exit_tasks_rcu_finish()
[ ... ] systemd-journald[341]: Failed to send WATCHDOG=1 notification message: Connection refused

I tried rebuilding the initramfs images with mkinitcpio, but it did not help. I (for some reason) thought that switching the kernel from the default linux to linux-zen might help, but that did not work either.
Can somebody give me an idea of how to solve this?

Last edited by ought (2023-07-23 09:11:15)

Lone_Wolf · 2023-07-23 10:52:20

https://wiki.archlinux.org/title/Machin … _exception

Install rasdaemon from the AUR to get more info about the error.

Running a memtest on bank 0 would also be a good idea; memtest86+ is included in the Arch Linux install ISO.
That test takes time, so you may want to run it with only memory bank 0 present.
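
For anyone who wants to read the logged status word by hand before a daemon is set up, here is a minimal Python sketch. It assumes the architectural IA32_MCi_STATUS bit layout described in the Intel SDM machine-check chapter; the function name and the returned dictionary keys are made up for illustration, and the model-specific field is left undecoded because its meaning depends on the CPU.

```python
# Architectural flag bits of IA32_MCi_STATUS (assumed layout, per the Intel SDM).
FLAGS = {
    63: "VAL",    # register holds a valid error record
    62: "OVER",   # an earlier record was overwritten
    61: "UC",     # error was not corrected by hardware
    60: "EN",     # reporting for this error was enabled
    59: "MISCV",  # the MISC register holds valid data
    58: "ADDRV",  # the ADDR register holds valid data
    57: "PCC",    # processor context may be corrupted
}


def decode_mci_status(status: int) -> dict:
    """Split a 64-bit machine-check status word into its generic fields."""
    set_flags = [name for bit, name in FLAGS.items() if (status >> bit) & 1]
    return {
        "flags": set_flags,
        # Bits 15:0 hold the architectural MCA error code.
        "mca_error_code": status & 0xFFFF,
        # Bits 31:16 hold a model-specific error code (not decoded here).
        "model_specific_code": (status >> 16) & 0xFFFF,
    }


if __name__ == "__main__":
    # Status value taken from the "Machine Check: 0 Bank 0" line in the first post.
    info = decode_mci_status(0xD400004000040150)
    print("flags:", ", ".join(info["flags"]))
    print("MCA error code: 0x%04x" % info["mca_error_code"])
    print("model-specific code: 0x%04x" % info["model_specific_code"])
```

For the value in the log this reports VAL, OVER, EN and ADDRV set, with UC and PCC clear. Under that layout, the hardware did not flag the event as uncorrected.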

ought · 2023-07-24 04:55:41

Thank you for your reply!
The memtest did not give me any errors. I also tried moving the RAM to different slots, in case the one I was using is broken, but it didn't help.
The ras-daemon does not work for me:

ras-mc-ctl --error-count        
ras-mc-ctl: Error: No DIMMs found in /sys or new sysfs EDAC interface not found.
ras-mc-ctl --status      
ras-mc-ctl: drivers not loaded.

How do I force it to work?
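
The "drivers not loaded" message suggests that no EDAC memory-controller driver has registered itself with the kernel. A minimal Python sketch to check for that, assuming the usual sysfs location /sys/devices/system/edac/mc (the function name is made up for illustration, and nothing in this thread establishes whether an EDAC driver exists for this board and CPU):

```python
import re
from pathlib import Path


def list_memory_controllers(edac_root="/sys/devices/system/edac/mc"):
    """Return the names of EDAC memory controllers (mc0, mc1, ...) found in sysfs.

    An empty list means no EDAC driver has registered a controller,
    or the directory does not exist at all.
    """
    root = Path(edac_root)
    if not root.is_dir():
        return []
    return sorted(
        entry.name
        for entry in root.iterdir()
        if entry.is_dir() and re.fullmatch(r"mc\d+", entry.name)
    )


if __name__ == "__main__":
    controllers = list_memory_controllers()
    if controllers:
        print("EDAC memory controllers:", ", ".join(controllers))
    else:
        print("No EDAC memory controllers registered.")
```

If the list comes back empty, ras-mc-ctl has nothing to report on, whatever options it is given.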

P.S. I'm gonna try stressapptest instead of memtest to check whether the problem is really with the RAM; I'll post here when I finish.

Last edited by ought (2023-07-24 04:56:03)

seth · 2023-07-24 06:59:55

Do you get this w/ the LTS kernel as well?
https://bbs.archlinux.org/viewtopic.php?id=287226

The MCE might be a red herring; can you post a complete system journal covering the error(s)?

ought · 2023-07-25 13:54:58

Thank you for the suggestion!
The timeline, for clarity: after I swapped my RAM around, the log entries still appeared while compiling, but there were no crashes or freezes; the log entries were the only symptom. Then I tried stressapptest, and it did not generate any entries in the log.
And now that I've tried linux-lts, I don't get any entries in the log at all =-)

I'm not really sure what happened. After all, if the problem was with the kernel, then why did moving the RAM to other slots fix the crashes? So the problem is gone for now, but I'm not sure it will stay that way.

seth · 2023-07-25 15:05:34

Do any errors return w/ the regular kernel?
Otherwise I'd say re-seating the DIMMs has fixed the problem, either through the new arrangement (notably if not all DIMMs are from the same production batch) or simply through better contacts/fit/seating/something physical.
