#1 2020-03-24 15:58:03

Wampie
Member
Registered: 2020-03-24
Posts: 4

[SOLVED] System freeze and reboot (mce: [Hardware Error])

Hello everyone!

I am pretty new to Arch and got a wonderful guide from a friend of mine (Xaekai). He helped me set up my machine so it would be running correctly.

Now everything is done and all is up and running. Nothing seems to be going wrong, except...
I have run into a problem when I play a game (CS:GO on Steam). The problem does not seem to be the game itself; playing it simply puts the machine under load, and that load makes it freeze and crash.
That's the reason I am posting this here (Kernel & Hardware) and not in the Multimedia and Games section.

So what exactly is happening:
Once I am in the game, everything goes well at first. After a little while my whole system freezes, and shortly afterwards it reboots on its own. After the reboot, everything starts up again without a problem.

We took crash logs to see what happened before the reboot, and apparently the problems already start at a very early stage. The stress the game puts on the machine then becomes too much and makes it freeze and reboot.
I have included an overview of my hardware and the crash logs at the end of this post.

How could this be fixed? Does anyone have a clue?

Thanks!

Pastebin of hardware info
Link to the crash log

Last edited by Wampie (2020-03-29 16:23:08)

#2 2020-03-24 17:40:12

Ropid
Member
Registered: 2015-03-09
Posts: 722

Re: [SOLVED] System freeze and reboot (mce: [Hardware Error])

Are you overclocking? I had those MCE messages in the log when I was overclocking and didn't use enough voltage. The messages went away with reduced speed or higher voltage.

What's your CPU temperature while you are gaming? I remember in my overclocking adventures, it wasn't good to use settings that caused more than around 85°C CPU temperature. It was possible to have settings that normally ran stable but failed with bad cooling (like on hot summer days).

Another idea about temperatures: did you ever clean your PC? If you never opened it, it might be stuffed with dust and overheating. It might then run fine again after you clean it.

#3 2020-03-24 20:30:57

Wampie
Member
Registered: 2020-03-24
Posts: 4

Re: [SOLVED] System freeze and reboot (mce: [Hardware Error])

Thanks for the suggestion, Ropid. I am not overclocking though, so that shouldn't be it.

I recently opened up my PC (to add an extra SSD) and removed some of the dust I saw. It was not a very thorough cleaning, but I think it was good enough.

Last time I checked, my CPU's temperature was around 40~50°C. I was not gaming at that moment though, so that figure may not be reliable in this case.

How could I check it whilst in a game?
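One way to get readings from under load is to log the temperatures to a file in the background while the game runs, and look at the last lines after the crash. A minimal sketch, assuming the kernel exposes the CPU sensor through the standard hwmon sysfs files (`/sys/class/hwmon/hwmon*/temp*_input`, values in millidegrees Celsius). The function names `read_temps` and `log_temps` and the log file name are made up for illustration.

```python
import glob
import os
import time


def read_temps(root="/sys/class/hwmon"):
    """Return {"chip/sensor": degrees_celsius} for every temp*_input under root."""
    temps = {}
    for path in sorted(glob.glob(os.path.join(root, "hwmon*", "temp*_input"))):
        chip_dir = os.path.dirname(path)
        try:
            # The "name" file holds the driver name, e.g. the CPU sensor driver.
            with open(os.path.join(chip_dir, "name")) as f:
                chip = f.read().strip()
        except OSError:
            chip = os.path.basename(chip_dir)
        try:
            with open(path) as f:
                millidegrees = int(f.read().strip())
        except (OSError, ValueError):
            continue  # sensor not readable right now; skip it
        sensor = os.path.basename(path)[: -len("_input")]
        temps[f"{chip}/{sensor}"] = millidegrees / 1000.0
    return temps


def log_temps(logfile, interval=2.0, samples=None, root="/sys/class/hwmon"):
    """Append one timestamped line per sample (samples=None means run until killed).

    Each line is flushed and fsync'ed so the last reading before a
    freeze has a good chance of surviving on disk.
    """
    count = 0
    with open(logfile, "a") as out:
        while samples is None or count < samples:
            reading = read_temps(root)
            fields = " ".join(f"{k}={v:.1f}" for k, v in sorted(reading.items()))
            out.write(f"{time.strftime('%H:%M:%S')} {fields}\n")
            out.flush()
            os.fsync(out.fileno())
            count += 1
            if samples is None or count < samples:
                time.sleep(interval)
```

Calling `log_temps("cpu-temps.log")` in a terminal before starting the game keeps logging every two seconds until it is interrupted; the tail of the file then shows how hot the CPU was just before the freeze.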

#4 2020-03-24 21:42:21

xerxes_
Member
Registered: 2018-04-29
Posts: 255

Re: [SOLVED] System freeze and reboot (mce: [Hardware Error])

#5 2020-03-25 08:38:15

Wampie
Member
Registered: 2020-03-24
Posts: 4

Re: [SOLVED] System freeze and reboot (mce: [Hardware Error])

Installed, enabled and started those. I rebooted my system and made it crash again.
Here are the logs from reboot till crash: Logs



Later today I was compiling something and my system crashed again. The crash log: crash log

Last edited by Wampie (2020-03-25 15:03:52)

#6 2020-03-25 15:17:27

Xaekai
Member
From: Portland
Registered: 2014-12-26
Posts: 9

Re: [SOLVED] System freeze and reboot (mce: [Hardware Error])

I did some digging on those MCEs. I believe they appear only because his CPU is vulnerable to Spectre, and nothing else.

I'm curious about this:

Mar 25 09:30:30 aerie kernel: ------------[ cut here ]------------
Mar 25 09:30:30 aerie kernel: refcount_t: underflow; use-after-free.
Mar 25 09:30:30 aerie kernel: WARNING: CPU: 1 PID: 534 at lib/refcount.c:28 refcount_warn_saturate+0xa6/0xf0
Mar 25 09:30:30 aerie kernel: Modules linked in: snd_usb_audio snd_usbmidi_lib snd_rawmidi snd_seq_device joydev input_leds intel_rapl_msr mc mousedev intel_rapl_common x86_pkg_temp_thermal intel_powerclamp core>
Mar 25 09:30:30 aerie kernel:  nvidia_uvm(OE) nvidia_modeset(POE) nvidia(POE) ipmi_devintf ipmi_msghandler
Mar 25 09:30:30 aerie kernel: CPU: 1 PID: 534 Comm: Xorg.wrap Tainted: P           OE     5.5.11-zen1-1-zen #1
Mar 25 09:30:30 aerie kernel: Hardware name: ASUS All Series/Z97-A, BIOS 2801 11/11/2015
Mar 25 09:30:30 aerie kernel: RIP: 0010:refcount_warn_saturate+0xa6/0xf0
Mar 25 09:30:30 aerie kernel: Code: 05 b9 52 1d 01 01 e8 70 e2 b5 ff 0f 0b c3 80 3d a7 52 1d 01 00 75 95 48 c7 c7 c8 88 b3 8e c6 05 97 52 1d 01 01 e8 51 e2 b5 ff <0f> 0b c3 80 3d 86 52 1d 01 00 0f 85 72 ff ff ff>
Mar 25 09:30:30 aerie kernel: RSP: 0018:ffffbc09426dbda0 EFLAGS: 00010286
Mar 25 09:30:30 aerie kernel: RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
Mar 25 09:30:30 aerie kernel: RDX: 0000000000000001 RSI: 0000000000000092 RDI: 00000000ffffffff
Mar 25 09:30:30 aerie kernel: RBP: ffff9eda31e2c4e8 R08: 00000000000003c2 R09: 0000000000000001
Mar 25 09:30:30 aerie kernel: R10: 0000000000000001 R11: 000000000000bc8c R12: ffff9eda355a2ae8
Mar 25 09:30:30 aerie kernel: R13: ffff9eda355a2800 R14: 0000000000000008 R15: 0000000000000000
Mar 25 09:30:30 aerie kernel: FS:  00007f8be051c540(0000) GS:ffff9eda36c40000(0000) knlGS:0000000000000000
Mar 25 09:30:30 aerie kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 25 09:30:30 aerie kernel: CR2: 00007f8be04e3ba0 CR3: 0000000230820006 CR4: 00000000001606e0
Mar 25 09:30:30 aerie kernel: Call Trace:
Mar 25 09:30:30 aerie kernel:  nv_drm_atomic_helper_disable_all+0xec/0x290 [nvidia_drm]
Mar 25 09:30:30 aerie kernel:  nv_drm_master_drop+0x22/0x60 [nvidia_drm]
Mar 25 09:30:30 aerie kernel:  drm_master_release+0xc6/0xe0 [drm]
Mar 25 09:30:30 aerie kernel:  drm_file_free.part.0+0x1fe/0x260 [drm]
Mar 25 09:30:30 aerie kernel:  drm_release+0x9a/0x110 [drm]
Mar 25 09:30:30 aerie kernel:  __fput+0xb3/0x230
Mar 25 09:30:30 aerie kernel:  task_work_run+0x93/0xb0
Mar 25 09:30:30 aerie kernel:  exit_to_usermode_loop+0xf6/0x120
Mar 25 09:30:30 aerie kernel:  do_syscall_64+0x11f/0x150
Mar 25 09:30:30 aerie kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Mar 25 09:30:30 aerie kernel: RIP: 0033:0x7f8be0445d07
Mar 25 09:30:30 aerie kernel: Code: ff ff e8 3c e3 01 00 66 2e 0f 1f 84 00 00 00 00 00 66 90 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 41 c3 48 83 ec 18 89 7c 24>
Mar 25 09:30:30 aerie kernel: RSP: 002b:00007ffddf4ca9d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
Mar 25 09:30:30 aerie kernel: RAX: 0000000000000000 RBX: 0000000000000001 RCX: 00007f8be0445d07
Mar 25 09:30:30 aerie kernel: RDX: 00007ffddf4ca9f0 RSI: 00000000c04064a0 RDI: 0000000000000003
Mar 25 09:30:30 aerie kernel: RBP: 00007ffddf4caa40 R08: 0000000000000000 R09: 00007ffddf4ca860
Mar 25 09:30:30 aerie kernel: R10: 000055a12334a64b R11: 0000000000000246 R12: 0000000000000003
Mar 25 09:30:30 aerie kernel: R13: 0000000000000001 R14: 0000000000000000 R15: 00007ffddf4ca9f0
Mar 25 09:30:30 aerie kernel: ---[ end trace dc40da3557904cfa ]---

raw paste here

It shows up on every boot as well, at about the same time, 1 second in after the initial kernel spam. But since his last crash happened during a compile, I doubt that an nVidia use-after-free driver bug is involved, which is what most of my web searches on it have led to.
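To check which boots contain the warning, and whether the MCE lines appear in the same boots, the kernel log of each boot can be scanned for both messages. A rough sketch, assuming a systemd journal that still holds the earlier boots; the marker strings are taken from the thread title and the trace above, and the function names `count_markers` and `kernel_log` are made up for illustration.

```python
import subprocess

# Substrings to look for in kernel log lines.
MARKERS = (
    "mce: [Hardware Error]",
    "refcount_t: underflow; use-after-free.",
)


def count_markers(lines, markers=MARKERS):
    """Count how many of the given log lines contain each marker."""
    counts = {marker: 0 for marker in markers}
    for line in lines:
        for marker in markers:
            if marker in line:
                counts[marker] += 1
    return counts


def kernel_log(boot=0):
    """Return the kernel messages of one boot (0 = current, -1 = previous)."""
    result = subprocess.run(
        ["journalctl", "--dmesg", f"--boot={boot}", "--no-pager"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()
```

For example, `count_markers(kernel_log(-1))` gives the counts for the boot that ended in the crash. A boot with MCE lines but no refcount warning (or the reverse) would suggest the two are unrelated.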

#7 2020-03-25 18:20:41

xerxes_
Member
Registered: 2018-04-29
Posts: 255

Re: [SOLVED] System freeze and reboot (mce: [Hardware Error])

Post output of 'lspci -nnk' and Xorg.log if you use X server.

#8 2020-03-27 13:36:11

Wampie
Member
Registered: 2020-03-24
Posts: 4

Re: [SOLVED] System freeze and reboot (mce: [Hardware Error])

xerxes_ wrote:

Post output of 'lspci -nnk' and Xorg.log if you use X server.

Got a new HSF and applied new thermal paste. CPU temperature is way down now and while I was in a game, my system didn't crash/reboot. I also stopped getting the hardware errors, so I think it's fixed now (not sure).

Thanks for the help!

Last edited by Wampie (2020-03-27 13:59:21)

#9 2020-03-27 14:07:00

Xaekai
Member
From: Portland
Registered: 2014-12-26
Posts: 9

Re: [SOLVED] System freeze and reboot (mce: [Hardware Error])

I believe we have determined his problem to be overheating. After he shared a photo showing his CPU temperature at cold boot to be ~51°C, I advised him to disassemble the machine, check for dust build-up and clean it, check the HSF mounting, and examine (and probably replace) the thermal compound. He reported that the compound was dry, cracked, and flaky. He then showed photos of the inside of his machine, and I took note of how small his HSF was: the stock Intel one that came with the CPU. Most Intel models look alike and some are better than others. Searching for its particular model number online led me to numerous complaints of overheating.

Upon my advice, OP ordered a Noctua HSF to replace the underperforming stock cooler. It arrived today, and with it installed, the usual activity that was crashing his machine before no longer does.

I am relieved because this is OP's first time on Linux, and it was on my advice that he chose to try it. Had it been some software package causing the instabilities, it would have been rather embarrassing.

Last edited by Xaekai (2020-03-27 14:12:09)

#10 2020-03-27 15:24:52

xerxes_
Member
Registered: 2018-04-29
Posts: 255

Re: [SOLVED] System freeze and reboot (mce: [Hardware Error])

I didn't think about overheating because 40~50°C didn't look that high for a CPU to me (even at idle). Glad to hear that it resolved the problem. Please mark this thread as solved by editing the title of the first post.

Last edited by xerxes_ (2020-03-27 15:25:55)
