#1 2021-10-07 18:55:27

Jeff_WuYo
Member
Registered: 2020-12-14
Posts: 31

Occasionally system hard crash with new cpu

I recently upgraded my CPU from an i5-10400 (6 cores) to an i7-10700 (8 cores), and since then the system occasionally hard-crashes and becomes completely unresponsive. In the 4 days since the new CPU was installed there have been 6 boots and 4 crashes. The timing is inconsistent: it has happened while I was watching a YouTube live stream, and once while compiling code from the AUR. I have a 650 W power supply and no discrete GPU. On Windows, the new system passes Cinebench R23 (a CPU stress benchmark) for 30 minutes, staying under 95 °C and drawing about 130 W.
Here are my system journal and system info.
The journal captured some hardware errors; can someone help diagnose them? Sometimes the crash was so hard that nothing was captured at all.

Last edited by Jeff_WuYo (2021-10-07 19:38:22)


#2 2021-10-09 13:46:37

Lone_Wolf
Member
From: Netherlands, Europe
Registered: 2005-10-04
Posts: 9,515

Re: Occasionally system hard crash with new cpu

Machine Check Exception (MCE) errors can be very tricky.

Since you changed hardware, I'd start by verifying that everything is seated correctly.
If that doesn't help, check https://wiki.archlinux.org/title/Machin … _exception and install rasdaemon.

Hopefully that will give more info about future crashes.



#3 2021-10-09 16:21:48

Jeff_WuYo
Member
Registered: 2020-12-14
Posts: 31

Re: Occasionally system hard crash with new cpu

Lone_Wolf wrote:

I'd start with verifying everything is seated correctly.

Yes, I double-checked every connection; everything is installed properly.
I haven't looked into rasdaemon yet; I'll take some time to test it.

While searching, I found this thread, which describes an issue similar to mine. The author replaced the motherboard and CPU, and the issue was resolved. I assume the cause in that case was hardware quality.

As you said, MCE errors can be very tricky, and they are. The MCE error doesn't occur consistently, and not on every boot. Here's the output from the current boot, which has been powered on for over 12 hours. I kept a YouTube live stream running as a test, which is a fairly light workload. Although an error was logged, the system didn't crash and still functions normally.

journalctl -b -p 4
-- Journal begins at Sun 2021-04-25 15:05:28 CST, ends at Sun 2021-10-10 00:00:02 CST. --
Oct 09 08:37:51 archz490-i kernel: x86/cpu: SGX disabled by BIOS.
Oct 09 08:37:51 archz490-i kernel: ENERGY_PERF_BIAS: Set to 'normal', was 'performance'
Oct 09 08:37:51 archz490-i kernel: nvme nvme1: missing or invalid SUBNQN field.
Oct 09 08:37:51 archz490-i kernel: nvme nvme0: missing or invalid SUBNQN field.
Oct 09 08:37:51 archz490-i kernel: usb: port power management may be unreliable
Oct 09 08:37:51 archz490-i systemd-udevd[346]: /etc/udev/rules.d/50-qmk.rules:65 Unknown group 'plugdev', ignoring
Oct 09 08:37:51 archz490-i kernel: acpi PNP0C14:01: duplicate WMI GUID 05901221-D566-11D1-B2F0-00A0C9062910 (first instance was on PNP0C14:00)
Oct 09 08:37:51 archz490-i kernel: acpi PNP0C14:02: duplicate WMI GUID 05901221-D566-11D1-B2F0-00A0C9062910 (first instance was on PNP0C14:00)
Oct 09 08:37:51 archz490-i kernel: acpi PNP0C14:03: duplicate WMI GUID 05901221-D566-11D1-B2F0-00A0C9062910 (first instance was on PNP0C14:00)
Oct 09 08:37:51 archz490-i kernel: acpi PNP0C14:04: duplicate WMI GUID 05901221-D566-11D1-B2F0-00A0C9062910 (first instance was on PNP0C14:00)
Oct 09 08:37:51 archz490-i kernel: usb 1-2: config 1 has an invalid interface number: 2 but max is 1
Oct 09 08:37:51 archz490-i kernel: usb 1-2: config 1 has no interface number 1
Oct 09 08:37:51 archz490-i kernel: platform regulatory.0: Direct firmware load for regulatory.db failed with error -2
Oct 09 08:37:51 archz490-i kernel: iwlwifi 0000:00:14.3: Direct firmware load for iwlwifi-QuZ-a0-hr-b0-64.ucode failed with error -2
Oct 09 08:37:51 archz490-i kernel: iwlwifi 0000:00:14.3: api flags index 2 larger than supported by driver
Oct 09 08:37:51 archz490-i kernel: ACPI Warning: SystemIO range 0x0000000000000295-0x0000000000000296 conflicts with OpRegion 0x0000000000000290-0x0000000000000299 (\AMW0.SHWM) (20210604/utaddress-204)
Oct 09 08:37:51 archz490-i kernel: snd_hda_codec_hdmi hdaudioC0D2: Monitor plugged-in, Failed to power up codec ret=[-13]
Oct 09 08:37:51 archz490-i kernel: snd_hda_codec_hdmi hdaudioC0D2: Monitor plugged-in, Failed to power up codec ret=[-13]
Oct 09 08:37:51 archz490-i kernel: thermal thermal_zone2: failed to read out thermal zone (-61)
Oct 09 08:37:51 archz490-i kernel: ------------[ cut here ]------------
Oct 09 08:37:51 archz490-i kernel: WARNING: CPU: 14 PID: 478 at net/wireless/nl80211.c:7771 nl80211_get_reg_do+0x23c/0x2b0 [cfg80211]
Oct 09 08:37:51 archz490-i kernel: Modules linked in: nf_tables nfnetlink bridge stp llc ccm snd_sof_pci_intel_cnl algif_aead snd_sof_intel_hda_common cbc iwlmvm soundwire_intel soundwire_generic_allocation des_generic soundwire_cadence libdes snd_sof_intel_hda ecb snd_sof_pci snd_sof_xtensa_dsp snd_sof snd_hda_codec_hdmi algif_skcipher soundwire_bus intel_tcc_cooling x86_pkg_temp_thermal snd_soc_skl cmac snd_hda_codec_realtek intel_powerclamp mac80211 snd_hda_codec_generic md4 snd_soc_hdac_hda eeepc_wmi iTCO_wdt ledtrig_audio algif_hash asus_wmi snd_hda_ext_core intel_pmc_bxt kvm_intel snd_soc_sst_ipc nct6775 snd_soc_sst_dsp hwmon_vid snd_soc_acpi_intel_match snd_soc_acpi af_alg mei_hdcp coretemp iTCO_vendor_support ee1004 sparse_keymap wmi_bmof libarc4 snd_soc_core kvm snd_compress ac97_bus iwlwifi snd_pcm_dmaengine snd_hda_intel snd_intel_dspcfg irqbypass crct10dif_pclmul snd_intel_sdw_acpi crc32_pclmul ghash_clmulni_intel snd_hda_codec aesni_intel crypto_simd vfat cryptd fat rapl snd_hda_core cfg80211
Oct 09 08:37:51 archz490-i kernel:  intel_cstate intel_uncore snd_hwdep snd_pcm intel_spi_pci intel_spi snd_timer spi_nor mei_me pcspkr snd i2c_i801 mtd igc soundcore i2c_smbus mei rfkill wmi mac_hid acpi_tad acpi_pad pkcs8_key_parser crypto_user fuse bpf_preload ip_tables x_tables xhci_pci xhci_pci_renesas i915 i2c_algo_bit intel_gtt ttm agpgart video drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops cec drm btrfs blake2b_generic libcrc32c crc32c_generic crc32c_intel xor raid6_pq
Oct 09 08:37:51 archz490-i kernel: CPU: 14 PID: 478 Comm: iwd Tainted: G     U            5.14.9-arch2-1 #1 3d250f0857a0255dbbcb433ce1895c81c4740764
Oct 09 08:37:51 archz490-i kernel: Hardware name: ASUS System Product Name/ROG STRIX Z490-I GAMING, BIOS 2301 07/13/2021
Oct 09 08:37:51 archz490-i kernel: RIP: 0010:nl80211_get_reg_do+0x23c/0x2b0 [cfg80211]
Oct 09 08:37:51 archz490-i kernel: Code: 0c 01 00 00 00 e8 f4 1a 87 db 85 c0 74 cc e9 f5 fe ff ff 48 89 ef 48 89 04 24 e8 1f f5 ba db e8 7a ae bd db 48 8b 04 24 eb 84 <0f> 0b 48 89 ef e8 0a f5 ba db e8 65 ae bd db b8 ea ff ff ff e9 6b
Oct 09 08:37:51 archz490-i kernel: RSP: 0018:ffffbbdcc160fb08 EFLAGS: 00010202
Oct 09 08:37:51 archz490-i kernel: RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
Oct 09 08:37:51 archz490-i kernel: RDX: ffff908b9e810008 RSI: 0000000000000000 RDI: ffff908b9e8102e0
Oct 09 08:37:51 archz490-i kernel: RBP: ffff908b8f5ed000 R08: 0000000000000004 R09: ffff908b8f406014
Oct 09 08:37:51 archz490-i kernel: R10: 0000000000000021 R11: ffff908b84fd0000 R12: ffffbbdcc160fb68
Oct 09 08:37:51 archz490-i kernel: R13: ffff908b8f406014 R14: 0000000000000000 R15: ffff908b9e8102e0
Oct 09 08:37:51 archz490-i kernel: FS:  00007f67b5e89740(0000) GS:ffff9092cf580000(0000) knlGS:0000000000000000
Oct 09 08:37:51 archz490-i kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 09 08:37:51 archz490-i kernel: CR2: 00007fffe9f43e70 CR3: 000000011dd1a003 CR4: 00000000007706e0
Oct 09 08:37:51 archz490-i kernel: PKRU: 55555554
Oct 09 08:37:51 archz490-i kernel: Call Trace:
Oct 09 08:37:51 archz490-i kernel:  genl_family_rcv_msg_doit+0xfa/0x160
Oct 09 08:37:51 archz490-i kernel:  genl_rcv_msg+0xeb/0x1e0
Oct 09 08:37:51 archz490-i kernel:  ? __cfg80211_rdev_from_attrs+0x1d0/0x1d0 [cfg80211 d5bff8db2aed3ed94635f64964734d1d042b36ae]
Oct 09 08:37:51 archz490-i kernel:  ? nl80211_send_regdom.constprop.0+0x1b0/0x1b0 [cfg80211 d5bff8db2aed3ed94635f64964734d1d042b36ae]
Oct 09 08:37:51 archz490-i kernel:  ? genl_get_cmd+0xd0/0xd0
Oct 09 08:37:51 archz490-i kernel:  netlink_rcv_skb+0x59/0x100
Oct 09 08:37:51 archz490-i kernel:  genl_rcv+0x24/0x40
Oct 09 08:37:51 archz490-i kernel:  netlink_unicast+0x23b/0x350
Oct 09 08:37:51 archz490-i kernel:  netlink_sendmsg+0x23b/0x480
Oct 09 08:37:51 archz490-i kernel:  sock_sendmsg+0x5b/0x60
Oct 09 08:37:51 archz490-i kernel:  __sys_sendto+0x124/0x190
Oct 09 08:37:51 archz490-i kernel:  __x64_sys_sendto+0x20/0x30
Oct 09 08:37:51 archz490-i kernel:  do_syscall_64+0x59/0x80
Oct 09 08:37:51 archz490-i kernel:  ? syscall_exit_to_user_mode+0x23/0x40
Oct 09 08:37:51 archz490-i kernel:  ? do_syscall_64+0x69/0x80
Oct 09 08:37:51 archz490-i kernel:  ? do_syscall_64+0x69/0x80
Oct 09 08:37:51 archz490-i kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xae
Oct 09 08:37:51 archz490-i kernel: RIP: 0033:0x7f67b5f8bc10
Oct 09 08:37:51 archz490-i kernel: Code: c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 1d 45 31 c9 45 31 c0 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 68 c3 0f 1f 80 00 00 00 00 55 48 83 ec 20 48
Oct 09 08:37:51 archz490-i kernel: RSP: 002b:00007fffe9f48b48 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
Oct 09 08:37:51 archz490-i kernel: RAX: ffffffffffffffda RBX: 000055e9ba6d8870 RCX: 00007f67b5f8bc10
Oct 09 08:37:51 archz490-i kernel: RDX: 000000000000001c RSI: 000055e9ba6e44b0 RDI: 0000000000000004
Oct 09 08:37:51 archz490-i kernel: RBP: 000055e9ba6eda70 R08: 0000000000000000 R09: 0000000000000000
Oct 09 08:37:51 archz490-i kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 00007fffe9f48bb0
Oct 09 08:37:51 archz490-i kernel: R13: 00007fffe9f48bac R14: 000055e9ba6e2500 R15: 000055e9b915c3fd
Oct 09 08:37:51 archz490-i kernel: ---[ end trace 324483557dbba1ae ]---
Oct 09 08:37:52 archz490-i libvirtd[562]: Cannot find 'dmidecode' in path: No such file or directory
Oct 09 08:37:52 archz490-i libvirtd[562]: internal error: Missing udev property 'ID_VENDOR_ID' on 'usb1'
Oct 09 08:37:52 archz490-i libvirtd[562]: internal error: Missing udev property 'ID_VENDOR_ID' on '1-11'
Oct 09 08:37:52 archz490-i libvirtd[562]: internal error: Missing udev property 'ID_VENDOR_ID' on '1-12'
Oct 09 08:37:52 archz490-i libvirtd[562]: internal error: Missing udev property 'ID_VENDOR_ID' on '1-13'
Oct 09 08:37:52 archz490-i libvirtd[562]: internal error: Missing udev property 'ID_VENDOR_ID' on '1-2'
Oct 09 08:37:52 archz490-i libvirtd[562]: internal error: Missing udev property 'ID_VENDOR_ID' on '1-9'
Oct 09 08:37:52 archz490-i libvirtd[562]: Cannot find 'dmidecode' in path: No such file or directory
Oct 09 08:37:52 archz490-i kernel: Bluetooth: hci0: MSFT filter_enable is already on
Oct 09 08:37:57 archz490-i kernel: kauditd_printk_skb: 177 callbacks suppressed
Oct 09 08:38:08 archz490-i kernel: kauditd_printk_skb: 11 callbacks suppressed
Oct 09 08:38:45 archz490-i pulseaudio[1075]: Failed to find a working profile.
Oct 09 08:38:45 archz490-i pulseaudio[1075]: Failed to load module "module-alsa-card" (argument: "device_id="1" name="usb-ErgoKB_Phoenix-v1-02" card_name="alsa_card.usb-ErgoKB_Phoenix-v1-02" namereg_fail=false tsched=yes fixed_latency_range=no ignore_dB=no deferred_volume=yes use_ucm=yes avoid_resampling=no card_properties="module-udev-detect.discovered=1""): initialization failed.
Oct 09 08:38:45 archz490-i pulseaudio[1075]: stat('/etc/pulse/default.pa.d'): No such file or directory
Oct 09 16:37:21 archz490-i kernel: i915 0000:00:02.0: [drm] *ERROR* Atomic update failure on pipe B (start=57047 end=57048) time 243 us, min 1073, max 1079, scanline start 1068, end 1085
Oct 09 17:37:14 archz490-i kernel: i915 0000:00:02.0: [drm] *ERROR* Atomic update failure on pipe A (start=272650 end=272651) time 169 us, min 1073, max 1079, scanline start 1069, end 1080
Oct 09 17:37:55 archz490-i kernel: i915 0000:00:02.0: [drm] *ERROR* Atomic update failure on pipe A (start=275078 end=275079) time 288 us, min 1073, max 1079, scanline start 1072, end 1092
Oct 09 17:37:59 archz490-i kernel: i915 0000:00:02.0: [drm] *ERROR* Atomic update failure on pipe A (start=275324 end=275325) time 277 us, min 1073, max 1079, scanline start 1062, end 1081
Oct 09 19:36:55 archz490-i kernel: i915 0000:00:02.0: [drm] *ERROR* Atomic update failure on pipe A (start=703510 end=703511) time 233 us, min 1073, max 1079, scanline start 1069, end 1085
Oct 09 20:27:45 archz490-i kernel: mce: [Hardware Error]: CPU 6: Machine Check: 0 Bank 0: 9400004000040150
Oct 09 20:27:45 archz490-i kernel: mce: [Hardware Error]: TSC 708b428aff82 ADDR 1ffff9ce01310 
Oct 09 20:27:45 archz490-i kernel: mce: [Hardware Error]: PROCESSOR 0:a0655 TIME 1633782465 SOCKET 0 APIC c microcode ec
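
Since the MCE lines are easy to lose among the other warnings, a small script can tally them per CPU, bank and status value to show whether the same core keeps reporting. This is only a rough sketch: the function name `tally_mce` and the regular expression are illustrative assumptions based on the line format seen in this journal, and the sample text simply reuses lines from this thread.

```python
import re
from collections import Counter

# Assumed line format, taken from the journal above:
#   mce: [Hardware Error]: CPU 6: Machine Check: 0 Bank 0: 9400004000040150
MCE_LINE = re.compile(
    r"mce: \[Hardware Error\]: CPU (\d+): Machine Check: \d+ Bank (\d+): ([0-9a-fA-F]+)"
)

def tally_mce(journal_text):
    """Count MCE records per (cpu, bank, status) triple (hypothetical helper)."""
    counts = Counter()
    for line in journal_text.splitlines():
        match = MCE_LINE.search(line)
        if match:
            cpu, bank, status = match.groups()
            counts[(int(cpu), int(bank), status.lower())] += 1
    return counts

# Sample input made of lines that appear in this thread.
sample = """\
Oct 07 17:58:08 archz490-i kernel: mce: [Hardware Error]: CPU 6: Machine Check: 0 Bank 0: 9400004000040150
Oct 09 08:37:52 archz490-i kernel: Bluetooth: hci0: MSFT filter_enable is already on
Oct 09 20:27:45 archz490-i kernel: mce: [Hardware Error]: CPU 6: Machine Check: 0 Bank 0: 9400004000040150
"""

for (cpu, bank, status), n in tally_mce(sample).items():
    print(f"CPU {cpu} bank {bank} status {status}: {n} event(s)")
```

The same function could be fed the full text saved from the journal; if every record names the same CPU and bank, that at least narrows down where the errors are being reported.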

Last edited by Jeff_WuYo (2021-10-09 16:23:25)


#4 2021-10-09 16:32:21

Ropid
Member
Registered: 2015-03-09
Posts: 1,066

Re: Occasionally system hard crash with new cpu

The following log entries are the "machine check exception" (MCE) errors that Lone_Wolf mentions:

Oct 07 17:58:08 archz490-i kernel: mce: [Hardware Error]: Machine check events logged
Oct 07 17:58:08 archz490-i kernel: mce: [Hardware Error]: CPU 6: Machine Check: 0 Bank 0: 9400004000040150
Oct 07 17:58:08 archz490-i kernel: mce: [Hardware Error]: TSC 1c2374d185c1 ADDR 1ffffabc9970e 
Oct 07 17:58:08 archz490-i kernel: mce: [Hardware Error]: PROCESSOR 0:a0655 TIME 1633600688 SOCKET 0 APIC c microcode ec

You have many entries of this kind in your log.
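
For anyone curious what the status value in those lines encodes, here is a minimal sketch that splits it into its flag bits. It assumes the architectural IA32_MCi_STATUS layout as I understand it from Intel's documentation (VAL in bit 63, UC in bit 61, ADDRV in bit 58, MCA error code in bits 15:0, and so on); the helper names are made up for illustration, the corrected-error count is only meaningful on CPUs that implement that field, and rasdaemon remains the proper tool for a full decode.

```python
def bits(value, high, low):
    """Return the integer held in bits high..low (inclusive) of value."""
    return (value >> low) & ((1 << (high - low + 1)) - 1)

def decode_mci_status(status_hex):
    """Split an MCi_STATUS value into fields (assumed layout, see lead-in)."""
    status = int(status_hex, 16)
    return {
        "valid": bool(bits(status, 63, 63)),            # VAL
        "overflow": bool(bits(status, 62, 62)),         # OVER
        "uncorrected": bool(bits(status, 61, 61)),      # UC
        "enabled": bool(bits(status, 60, 60)),          # EN
        "misc_valid": bool(bits(status, 59, 59)),       # MISCV
        "addr_valid": bool(bits(status, 58, 58)),       # ADDRV
        "context_corrupt": bool(bits(status, 57, 57)),  # PCC
        "corrected_count": bits(status, 52, 38),        # only if supported
        "model_specific_code": bits(status, 31, 16),
        "mca_error_code": bits(status, 15, 0),
    }

# Status value copied from the log lines above.
for name, value in decode_mci_status("9400004000040150").items():
    print(f"{name}: {value}")
```

Under those assumptions, the value from this log has the "uncorrected" flag clear and the "addr_valid" flag set, which would fit an error that the hardware corrected and logged rather than one that brought the system down by itself.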

On Windows, you can find the same kind of log entries in the Event Viewer tool under the "Administrative Events" section. Windows users call these entries "WHEA errors/warnings", after the "Windows Hardware Error Architecture". Search for "WHEA" to find discussions of this in Windows PC forums.

MCE errors of this kind shouldn't happen on their own unless you are overclocking. If you aren't overclocking, I'd double-check all cable connections, reseat the CPU in its socket and the RAM in its slots, and reset the BIOS to defaults. If that doesn't help, I'd guess that something is wrong with the CPU, or at least that it doesn't work well with your other parts (board, RAM). In that case, the thing to do would be to return the CPU. If you are overclocking, you might be able to fix the issue by tweaking voltages; PC overclocking forums may have discussions of which BIOS settings to adjust for WHEA warnings/errors.

