#1 2024-07-07 09:15:42

kalachastan
Member
Registered: 2019-05-03
Posts: 4

[SOLVED] What logs should I gather to investigate system freezes?

Hi everyone!

I have been using Arch on my Lenovo Thinkpad T470p for many years without any major problems. However, for about 2 years now, I have been experiencing rare, random freezes where the entire system just hangs and I have to perform a hard reboot by pressing the power button long enough (see below for a more detailed description, as well as some log files). Recently, these freezes have become more frequent (1-2 times a week) and I would like to try to fix this problem.

Unfortunately, I cannot reproduce the problem deterministically, and since it can take up to a week for a freeze to occur, I think it would be best for me to collect some extensive logs/configs/diagnostics for multiple freezes and then come back with my results at some point in the future. So all I'm asking for in this post is some guidance on what log/config files I should collect before/after a freeze, and what diagnostic checks I could run to find potential problems based on my error description below. I will then create a new post in a few weeks with all the requested log and config files if I fail to fix the problem myself.

Thank you so much for your help!

Description of the issue

The random freezes happen predominantly in the following situations:

  1. After closing my laptop lid and then re-opening it at some later point (~60% of freezes). In this situation, the screen just stays black, and the system becomes completely unresponsive to any keyboard and/or mouse input.

  2. In situations with high memory utilization, i.e. more than 80% of the RAM being used (~20% of freezes).

  3. When trying to shut down the system using the command shutdown now (~20% of freezes).

Note that most of the time, the situations described above don't result in freezes. Also, situation 3 is especially interesting because, during this failure case, I'm already on tty1 and can see some error message (that looks like a kernel panic to me). However, when I later checked the system logs the message is not present there, so maybe the freeze also prevented the log files from being updated.
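For situation 2, one thing that could be logged periodically is the share of RAM in use, so that the last value before a freeze survives in a file. The following is only a minimal sketch: the function name is hypothetical, and it assumes that "used" means MemTotal minus MemAvailable as reported in /proc/meminfo.

```python
def memory_used_fraction(meminfo_text: str) -> float:
    """Return the fraction of RAM in use, given the text of /proc/meminfo.

    Assumption: "in use" is MemTotal - MemAvailable (both reported in kB).
    """
    fields = {}
    for line in meminfo_text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip anything that is not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    total = fields["MemTotal"]
    available = fields["MemAvailable"]
    return (total - available) / total
```

On a running system the input would be the content of /proc/meminfo, for example `memory_used_fraction(open("/proc/meminfo").read())`, and a result above 0.8 would correspond to the 80% threshold mentioned above.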

Some log files

Over the past few months, I gathered system logs using journalctl, after rebooting my laptop after a freeze. See this Google Drive folder. The content of the files is sorted from the newest log message to the oldest log message. You can find the reboot event by searching for the last occurrence of the string "-- Boot" (without quotation marks). The lines right above that line should correspond to logs right before/during the freeze.
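The manual search for the last "-- Boot" line could also be scripted. This is a rough sketch rather than a finished tool: the function name is made up, and it assumes the log file is in chronological order (oldest message first) and that boot separator lines start with "-- Boot".

```python
def lines_before_last_boot(journal_text: str, count: int = 50) -> list[str]:
    """Return up to `count` lines that precede the last boot separator.

    Assumptions: the text is in chronological order (oldest first) and
    boot separators are lines starting with "-- Boot".
    """
    lines = journal_text.splitlines()
    marker_indexes = [
        i for i, line in enumerate(lines) if line.startswith("-- Boot")
    ]
    if not marker_indexes:
        return []  # no reboot recorded in this file
    last = marker_indexes[-1]
    return lines[max(0, last - count):last]
```

For a file that is sorted newest first, the lines would have to be reversed before being passed in.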

Some things I tried

During a freeze, I tried to get any reaction from the system, for example by exiting my graphical user interface or directly switching to tty2. However, my laptop doesn't respond to any keyboard/mouse input at all.

I recently also ran a memory stress tester overnight (MemTest86+ to be precise), and it managed to finish 10 passes without any errors. But since the freezes only happen 1-2 times a week, could it be that these tests are too short to catch any issues?

Last edited by kalachastan (2024-07-08 08:18:59)

#2 2024-07-07 20:29:36

V1del
Forum Moderator
Registered: 2012-10-16
Posts: 25,223

Re: [SOLVED] What logs should I gather to investigate system freezes?

Journalctl is basically your best bet. To get more telling logs on a freeze, you might be able to get more context via the REISUB sysrq sequence: https://wiki.archlinux.org/title/Keyboa … el_(SysRq) as that might allow more data to get flushed.
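Whether the individual REISUB keys do anything depends on the value of kernel.sysrq (readable from /proc/sys/kernel/sysrq). Below is a small sketch, with a hypothetical function name, that decodes that value using the bitmask from the kernel's sysrq documentation. Many systemd-based setups reportedly default to 16 (sync only), but treat that as an assumption and check the value on your own machine.

```python
# Bit of kernel.sysrq that each REISUB key depends on
# (per the kernel's sysrq documentation).
REISUB_BITS = {
    "R": 4,    # unraw: keyboard control
    "E": 64,   # SIGTERM to all processes: signalling of processes
    "I": 64,   # SIGKILL to all processes: signalling of processes
    "S": 16,   # sync filesystems
    "U": 32,   # remount filesystems read-only
    "B": 128,  # reboot
}


def reisub_allowed(sysrq_value: int) -> dict[str, bool]:
    """Map each REISUB key to whether the given kernel.sysrq value permits it."""
    if sysrq_value == 1:
        # The value 1 enables every sysrq function.
        return {key: True for key in REISUB_BITS}
    return {key: bool(sysrq_value & bit) for key, bit in REISUB_BITS.items()}
```

With a value of 16 only the S (sync) step would be permitted, while 244 (4 + 16 + 32 + 64 + 128) or 1 would permit the whole sequence.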

Something I can preempt here: the nvidia driver as of 550x and above is somewhat known to cause freezes when systemd updates, due to a bug that corrupts kernel memory. This "might" be fixed by the current 555 series, but I haven't yet seen anything conclusive. It's pretty much guaranteed not to happen on the 535 series of the nvidia driver, which you can get from: https://aur.archlinux.org/packages/nvidia-535xx-dkms (make sure you have linux-headers installed)

If you opt for these, you'll also have to disable systemd user session freezing to have working suspend/hibernate: https://bbs.archlinux.org/viewtopic.php … 7#p2181817 -- see the last post there as well as the one I'm linking.

#3 2024-07-07 21:56:48

seth
Member
From: Won't reply 2 private help req
Registered: 2012-09-03
Posts: 75,939

Re: [SOLVED] What logs should I gather to investigate system freezes?

Note that most of the time, the situations described above don't result in freezes. Also, situation 3 is especially interesting because, during this failure case, I'm already on tty1 and can see some error message (that looks like a kernel panic to me). However, when I later checked the system logs the message is not present there, so maybe the freeze also prevented the log files from being updated.

The panic might have happened after the journal synced and stopped, or you rebooted with the power button, which prevents the former. Do you maybe have a picture of the panic?

See this Google Drive folder.

Because it's like top-posting.
Why not?
Please never use "-r" with journalctl.

May 04 16:11:24 lukas-pc kernel: CPU: 3 PID: 39437 Comm: kworker/3:1 Tainted: P           OE      6.8.4-arch1-1 #1 7ea0d8fced45b5f098eb034690645970f116c34c
May 04 16:11:24 lukas-pc kernel: Hardware name: LENOVO 20J7S01X00/20J7S01X00, BIOS R0FET34W (1.14 ) 06/30/2017
May 04 16:11:24 lukas-pc kernel: Workqueue: cgroup_destroy css_free_rwork_fn
May 04 16:11:24 lukas-pc kernel: RIP: 0010:rb_first+0xf/0x30
May 04 16:11:24 lukas-pc kernel: Code: 10 c3 cc cc cc cc 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 48 8b 07 48 85 c0 74 14 48 89 c2 <48> 8b 40 10 48 85 c0 75 f4 48 89 d0 c3 cc cc cc cc 31 d2 eb f4 66
May 04 16:11:24 lukas-pc kernel: RSP: 0018:ffff9f82e2937dd8 EFLAGS: 00010206
May 04 16:11:24 lukas-pc kernel: RAX: 0000000000001336 RBX: ffff8bd54bc89180 RCX: 0000000000400035
May 04 16:11:24 lukas-pc kernel: RDX: 0000000000001336 RSI: 0000000000000000 RDI: ffff8bd5440bf0d8
May 04 16:11:24 lukas-pc kernel: RBP: ffff8bd55803a080 R08: ffff8bd548ab1e80 R09: 0000000000400035
May 04 16:11:24 lukas-pc kernel: R10: 0000000000400035 R11: fefefefefefefeff R12: 0000000000000000
May 04 16:11:24 lukas-pc kernel: R13: ffff8bd5440bf0d8 R14: ffff8bd557127000 R15: ffff8bd557127090
May 04 16:11:24 lukas-pc kernel: FS:  0000000000000000(0000) GS:ffff8bd87f4c0000(0000) knlGS:0000000000000000
May 04 16:11:24 lukas-pc kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 04 16:11:24 lukas-pc kernel: CR2: 0000000000001346 CR3: 0000000165020005 CR4: 00000000003706f0
May 04 16:11:24 lukas-pc kernel: Call Trace:
May 04 16:11:24 lukas-pc kernel:  <TASK>
May 04 16:11:24 lukas-pc kernel:  ? __die+0x23/0x70
May 04 16:11:24 lukas-pc kernel:  ? page_fault_oops+0x171/0x4e0
May 04 16:11:24 lukas-pc kernel:  ? __update_load_avg_cfs_rq+0x26c/0x2e0
May 04 16:11:24 lukas-pc kernel:  ? exc_page_fault+0x7f/0x180
May 04 16:11:24 lukas-pc kernel:  ? asm_exc_page_fault+0x26/0x30
May 04 16:11:24 lukas-pc kernel:  ? rb_first+0xf/0x30
May 04 16:11:24 lukas-pc kernel:  simple_xattrs_free+0x29/0x90
May 04 16:11:24 lukas-pc kernel:  kernfs_put.part.0+0x60/0x150
May 04 16:11:24 lukas-pc kernel:  css_free_rwork_fn+0x131/0x440
May 04 16:11:24 lukas-pc kernel:  process_one_work+0x178/0x350
May 04 16:11:24 lukas-pc kernel:  worker_thread+0x30f/0x450
May 04 16:11:24 lukas-pc kernel:  ? __pfx_worker_thread+0x10/0x10
May 04 16:11:24 lukas-pc kernel:  kthread+0xe5/0x120
May 04 16:11:24 lukas-pc kernel:  ? __pfx_kthread+0x10/0x10
May 04 16:11:24 lukas-pc kernel:  ret_from_fork+0x31/0x50
May 04 16:11:24 lukas-pc kernel:  ? __pfx_kthread+0x10/0x10
May 04 16:11:24 lukas-pc kernel:  ret_from_fork_asm+0x1b/0x30
May 04 16:11:24 lukas-pc kernel:  </TASK>
May 04 16:11:24 lukas-pc kernel: Modules linked in: hid_logitech_hidpp hid_logitech_dj uinput rfcomm uhid algif_hash algif_skcipher af_alg bnep hid_generic usbhid cmac ccm ip6table_filter ip6_tables iptable_filter snd_hda_codec_hdmi snd_ctl_led snd_hda_codec_realtek snd_hda_codec_generic tun vboxnetflt(OE) vboxnetadp(OE) vboxdrv(OE) nvidia_uvm(POE) nvidia_drm(POE) nvidia_modeset(POE) intel_rapl_msr intel_rapl_common intel_uncore_frequency snd_soc_avs intel_uncore_frequency_common intel_pmc_core_pltdrv snd_soc_hda_codec intel_pmc_core snd_hda_ext_core intel_vsec pmt_telemetry snd_soc_core pmt_class iwlmvm intel_tcc_cooling snd_compress ac97_bus snd_pcm_dmaengine x86_pkg_temp_thermal joydev intel_powerclamp mousedev coretemp nvidia(POE) snd_hda_intel mac80211 kvm_intel snd_intel_dspcfg i915 snd_intel_sdw_acpi libarc4 kvm snd_hda_codec btusb uvcvideo snd_hda_core drm_buddy btrtl snd_hwdep i2c_algo_bit videobuf2_vmalloc irqbypass uvc iwlwifi btintel videobuf2_memops snd_pcm ttm thinkpad_acpi btbcm rapl videobuf2_v4l2 iTCO_wdt intel_pmc_bxt
May 04 16:11:24 lukas-pc kernel:  btmtk ledtrig_audio videodev wmi_bmof mei_pxp mei_hdcp ee1004 iTCO_vendor_support e1000e drm_display_helper think_lmi snd_timer intel_cstate intel_wmi_thunderbolt platform_profile bluetooth firmware_attributes_class cfg80211 cec snd videobuf2_common mei_me ptp i2c_i801 intel_gtt ecdh_generic mc intel_uncore psmouse pcspkr pps_core soundcore rfkill mei video i2c_smbus intel_pch_thermal wmi acpi_pad mac_hid crypto_user fuse nfnetlink ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 uas usb_storage dm_crypt cbc encrypted_keys trusted asn1_encoder tee dm_mod crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni polyval_generic gf128mul ghash_clmulni_intel sha512_ssse3 sha256_ssse3 serio_raw rtsx_pci_sdmmc atkbd sha1_ssse3 libps2 mmc_core aesni_intel vivaldi_fmap nvme crypto_simd nvme_core cryptd xhci_pci rtsx_pci nvme_auth xhci_pci_renesas i8042 serio loop
May 04 16:11:24 lukas-pc kernel: CR2: 0000000000001346
May 04 16:11:24 lukas-pc kernel: ---[ end trace 0000000000000000 ]---
May 04 16:11:24 lukas-pc kernel: RIP: 0010:rb_first+0xf/0x30
May 04 16:11:24 lukas-pc kernel: Code: 10 c3 cc cc cc cc 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 48 8b 07 48 85 c0 74 14 48 89 c2 <48> 8b 40 10 48 85 c0 75 f4 48 89 d0 c3 cc cc cc cc 31 d2 eb f4 66

That fits

V1del wrote:

the nvidia driver as of 550x and above is somewhat known to cause freezes

but not "for about 2 years"
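Two details of the trace above can be checked mechanically. The sketch below uses hypothetical helper names and reflects one reading of the oops, not a definitive analysis. The first helper lists the modules that carry taint flags in the "Modules linked in:" line (P = proprietary, O = out-of-tree, E = unsigned). The second computes the distance between the faulting address in CR2 and the value in RAX. The marked bytes `<48> 8b 40 10` decode to a load from RAX + 0x10, so an offset of 0x10 would mean rb_first followed a pointer with the bogus value 0x1336.

```python
import re

# Meaning of the taint letters seen in this trace.
TAINT_FLAGS = {
    "P": "proprietary module loaded",
    "O": "out-of-tree module loaded",
    "E": "unsigned module loaded",
}


def tainting_modules(modules_line: str) -> dict[str, str]:
    """Return {module: flags} for modules annotated like 'nvidia(POE)'."""
    # An optional trailing + or - marks a module being loaded or unloaded.
    return dict(re.findall(r"(\w+)\(([A-Z]+)[+-]?\)", modules_line))


def fault_offset(cr2_hex: str, rax_hex: str) -> int:
    """Return CR2 - RAX, i.e. the displacement of the faulting access."""
    return int(cr2_hex, 16) - int(rax_hex, 16)
```

Applied to the log above, the first helper would report the vbox* modules with "OE" and the nvidia* modules with "POE", and the second would give 0x10 for CR2 = 0000000000001346 and RAX = 0000000000001336.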

As for the wakeup "freezes", see https://wiki.archlinux.org/title/NVIDIA … er_suspend but you have a hybrid graphics system and it doesn't look like the nvidia chip is a CRTC.
https://wiki.archlinux.org/title/Solid_ … leshooting might be an issue, and https://wiki.archlinux.org/title/VMware … _hugepages has shown up in the past as a source of spurious instability.

#4 2024-07-08 08:15:58

kalachastan
Member
Registered: 2019-05-03
Posts: 4

Re: [SOLVED] What logs should I gather to investigate system freezes?

Thank you guys so much for your answers! Since the goal of this post was to get some more info about debugging techniques + some potential hints at the underlying issue, your answers fulfill this request and I will therefore mark the post as solved.

Some short replies to some of your comments:

seth wrote:

Because it's like top-posting.
Why not?
Please never use "-r" with journalctl.

I almost didn't get what you were saying here, which I guess reinforces the point you're making. :D I reversed the lines of each file, in case anyone wants to have a look at them in the future.

seth wrote:

The panic might have happened after the journal synced and stopped, or you rebooted with the power button, which prevents the former. Do you maybe have a picture of the panic?

Yes, I did take a picture at some point. I added the file 'panic_upon_shutdown.jpg' in the Google Drive Folder. I wasn't able to get the top part of the error message as the system was frozen at that point and wouldn't react to any of my inputs.
