#1 2024-05-27 15:03:36

Ailurus
Member
Registered: 2011-05-17
Posts: 26

[SOLVED] Variety of kernel bugs on Lenovo Legion Pro 7 w/ NVIDIA

I recently installed Arch on a new laptop (Lenovo Legion Pro 7 w/ NVIDIA RTX 4070, 32G RAM, no swap, latest kernel, latest nvidia package, running XFCE). Initially the system appeared solid, but lately I've been encountering a variety of kernel bugs. Here's a sample from the last couple of days:

journalctl | grep -A 4 'BUG'
May 25 18:25:24 worklaptop kernel: BUG: kernel NULL pointer dereference, address: 000000000000040d
May 25 18:25:24 worklaptop kernel: #PF: supervisor read access in kernel mode
May 25 18:25:24 worklaptop kernel: #PF: error_code(0x0000) - not-present page
May 25 18:25:24 worklaptop kernel: PGD 0 P4D 0 
May 25 18:25:24 worklaptop kernel: Oops: 0000 [#1] PREEMPT SMP NOPTI
--
May 25 21:32:49 worklaptop kernel: BUG: unable to handle page fault for address: 0000000000004350
May 25 21:32:49 worklaptop kernel: #PF: supervisor read access in kernel mode
May 25 21:32:49 worklaptop kernel: #PF: error_code(0x0000) - not-present page
May 25 21:32:49 worklaptop kernel: PGD 0 P4D 0 
May 25 21:32:49 worklaptop kernel: Oops: 0000 [#1] PREEMPT SMP NOPTI
--
May 25 23:01:26 worklaptop kernel: BUG: unable to handle page fault for address: 0000000000004350
May 25 23:01:26 worklaptop kernel: #PF: supervisor read access in kernel mode
May 25 23:01:26 worklaptop kernel: #PF: error_code(0x0000) - not-present page
May 25 23:01:26 worklaptop kernel: PGD 0 P4D 0 
May 25 23:01:26 worklaptop kernel: Oops: 0000 [#2] PREEMPT SMP NOPTI
--
May 25 23:26:49 worklaptop kernel: BUG: unable to handle page fault for address: 0000000000004350
May 25 23:26:49 worklaptop kernel: #PF: supervisor read access in kernel mode
May 25 23:26:49 worklaptop kernel: #PF: error_code(0x0000) - not-present page
May 25 23:26:49 worklaptop kernel: PGD 0 P4D 0 
May 25 23:26:49 worklaptop kernel: Oops: 0000 [#3] PREEMPT SMP NOPTI
--
May 25 23:44:04 worklaptop kernel: BUG: unable to handle page fault for address: 0000000000004350
May 25 23:44:04 worklaptop kernel: #PF: supervisor read access in kernel mode
May 25 23:44:04 worklaptop kernel: #PF: error_code(0x0000) - not-present page
May 25 23:44:04 worklaptop kernel: PGD 0 P4D 0 
May 25 23:44:04 worklaptop kernel: Oops: 0000 [#4] PREEMPT SMP NOPTI
--
May 26 10:28:06 worklaptop kernel: BUG: KFENCE: out-of-bounds write in _nv044102rm+0x10/0x30 [nvidia]
May 26 10:28:06 worklaptop kernel: Out-of-bounds write at 0x0000000013472aa8 (24B left of kfence-#254):
May 26 10:28:06 worklaptop kernel:  _nv044102rm+0x10/0x30 [nvidia]
May 26 10:28:06 worklaptop kernel:  _nv014568rm+0x4d/0x90 [nvidia]
May 26 10:28:06 worklaptop kernel:  _nv049800rm+0x18/0x60 [nvidia]
--
May 26 11:17:59 worklaptop kernel: BUG: KFENCE: memory corruption in acpi_os_release_object+0xe/0x20
May 26 11:17:59 worklaptop kernel: Corrupted memory at 0x0000000057b1c249 [ ! ! ! ! ! ! ! ! . . . . . . . . ] (in kfence-#102):
May 26 11:17:59 worklaptop kernel:  acpi_os_release_object+0xe/0x20
May 26 11:17:59 worklaptop kernel:  acpi_os_execute_deferred+0x17/0x30
May 26 11:17:59 worklaptop kernel:  process_one_work+0x18b/0x350
--
May 26 12:13:52 worklaptop kernel: BUG: KFENCE: memory corruption in acpi_os_release_object+0xe/0x20
May 26 12:13:52 worklaptop kernel: Corrupted memory at 0x0000000097caad82 [ ! ! ! ! ! ! ! ! . . . . . . . . ] (in kfence-#202):
May 26 12:13:52 worklaptop kernel:  acpi_os_release_object+0xe/0x20
May 26 12:13:52 worklaptop kernel:  acpi_os_execute_deferred+0x17/0x30
May 26 12:13:52 worklaptop kernel:  process_one_work+0x18b/0x350
--
May 26 16:44:36 worklaptop kernel: BUG: unable to handle page fault for address: 000000000000c34d
May 26 16:44:36 worklaptop kernel: #PF: supervisor read access in kernel mode
May 26 16:44:36 worklaptop kernel: #PF: error_code(0x0000) - not-present page
May 26 16:44:36 worklaptop kernel: PGD 0 P4D 0 
May 26 16:44:36 worklaptop kernel: Oops: 0000 [#1] PREEMPT SMP NOPTI
--
May 26 16:44:44 worklaptop kernel: BUG: unable to handle page fault for address: 000000000000c34d
May 26 16:44:44 worklaptop kernel: #PF: supervisor read access in kernel mode
May 26 16:44:44 worklaptop kernel: #PF: error_code(0x0000) - not-present page
May 26 16:44:44 worklaptop kernel: PGD 0 P4D 0 
May 26 16:44:44 worklaptop kernel: Oops: 0000 [#2] PREEMPT SMP NOPTI
--
May 26 16:49:33 worklaptop kernel: BUG: unable to handle page fault for address: 000000000000c34d
May 26 16:49:33 worklaptop kernel: #PF: supervisor read access in kernel mode
May 26 16:49:33 worklaptop kernel: #PF: error_code(0x0000) - not-present page
May 26 16:49:33 worklaptop kernel: PGD 0 P4D 0 
May 26 16:49:33 worklaptop kernel: Oops: 0000 [#3] PREEMPT SMP NOPTI
--
May 27 12:51:18 worklaptop kernel: BUG: KFENCE: memory corruption in acpi_os_release_object+0xe/0x20
May 27 12:51:18 worklaptop kernel: Corrupted memory at 0x00000000f6fe84e4 [ ! ! ! ! ! ! ! ! . . . . . . . . ] (in kfence-#142):
May 27 12:51:18 worklaptop kernel:  acpi_os_release_object+0xe/0x20
May 27 12:51:18 worklaptop kernel:  acpi_os_execute_deferred+0x17/0x30
May 27 12:51:18 worklaptop kernel:  process_one_work+0x18b/0x350
--
May 27 14:06:49 worklaptop kernel: BUG: KFENCE: out-of-bounds write in _nv044102rm+0x10/0x30 [nvidia]
May 27 14:06:49 worklaptop kernel: Out-of-bounds write at 0x00000000da5737df (24B left of kfence-#204):
May 27 14:06:49 worklaptop kernel:  _nv044102rm+0x10/0x30 [nvidia]
May 27 14:06:49 worklaptop kernel:  _nv014568rm+0x4d/0x90 [nvidia]
May 27 14:06:49 worklaptop kernel:  _nv049800rm+0x18/0x60 [nvidia]
--
May 27 14:38:55 worklaptop kernel: BUG: Bad rss-counter state mm:00000000a7335af2 type:MM_ANONPAGES val:256
May 27 14:38:55 worklaptop kernel: BUG: non-zero pgtables_bytes on freeing mm: 8192
May 27 14:40:05 worklaptop rtkit-daemon[1147]: Supervising 8 threads of 4 processes of 1 users.
May 27 14:40:05 worklaptop rtkit-daemon[1147]: Supervising 8 threads of 4 processes of 1 users.
May 27 14:40:43 worklaptop rtkit-daemon[1147]: Supervising 8 threads of 4 processes of 1 users.
May 27 14:40:43 worklaptop rtkit-daemon[1147]: Supervising 8 threads of 4 processes of 1 users.
--
May 27 15:56:34 worklaptop kernel: BUG: KFENCE: memory corruption in acpi_os_release_object+0xe/0x20
May 27 15:56:34 worklaptop kernel: Corrupted memory at 0x0000000018b280c4 [ ! ! ! ! ! ! ! ! . . . . . . . . ] (in kfence-#59):
May 27 15:56:34 worklaptop kernel:  acpi_os_release_object+0xe/0x20
May 27 15:56:34 worklaptop kernel:  acpi_os_execute_deferred+0x17/0x30
May 27 15:56:34 worklaptop kernel:  process_one_work+0x18b/0x350

Most of these point to a memory problem, but an overnight MemTest run showed no errors. The system usually "survives" these kernel bugs for a while, but it typically hangs when starting or closing an application, when running pacman (which has already cost me some time-consuming repairs), or when powering off or rebooting. I don't see any blinking lights when a crash happens, but the system is completely unresponsive (it does not even react to SysRq combinations).

So far I've only found this topic on the Manjaro forums, which sounds very similar to what I'm seeing. Unfortunately, it didn't get resolved.

I've now hooked up a second screen to the laptop, running

watch -n 10 "journalctl -b | grep -A 4 'BUG'"

in the hope of discovering some pattern. No luck so far... Any ideas would be greatly appreciated!
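
One way to look for a pattern is to tally the BUG lines by message instead of reading them in time order. The following is only a sketch: summarize_bugs is a made-up helper name, and it assumes the kernel lines keep the "kernel: BUG: " prefix seen in the output above.

```shell
# summarize_bugs: hypothetical helper, not part of journalctl or systemd.
# Reads journal text on stdin, keeps only the "kernel: BUG:" lines,
# strips the timestamp/host prefix, and prints "count<TAB>message"
# for every distinct message, most frequent first.
summarize_bugs() {
    sed -n 's/^.*kernel: BUG: //p' |
        awk '{ count[$0]++ } END { for (m in count) printf "%d\t%s\n", count[m], m }' |
        sort -rn
}

# Example (kernel messages of the current boot):
#   journalctl -k -b | summarize_bugs
```

Run against the excerpt above, this would group the repeated page faults at 0000000000004350 and the acpi_os_release_object corruption reports (four each) instead of listing them one by one.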

Last edited by Ailurus (2024-05-31 12:01:43)

#2 2024-05-27 15:23:38

loqs
Member
Registered: 2014-03-06
Posts: 18,872

Re: [SOLVED] Variety of kernel bugs on Lenovo Legion Pro 7 w/ NVIDIA

Please post the full journal for a boot with the issue without the grepping that removes context.

Last edited by loqs (2024-05-27 15:23:52)

#3 2024-05-27 15:30:36

ewaller
Administrator
From: Pasadena, CA
Registered: 2009-07-13
Posts: 20,623

Re: [SOLVED] Variety of kernel bugs on Lenovo Legion Pro 7 w/ NVIDIA

loqs wrote:

Please post the full journal for a boot with the issue without the grepping that removes context.

A good way to do that is with journalctl -b | curl -F 'file=@-' 0x0.st
Then provide us the link it returns.

#4 2024-05-27 15:48:32

CuriousRubick
Member
Registered: 2023-11-19
Posts: 6

Re: [SOLVED] Variety of kernel bugs on Lenovo Legion Pro 7 w/ NVIDIA

I had similar problems with a Legion laptop. I installed nvidia 535 from the AUR after reading the following thread: https://bbs.archlinux.org/viewtopic.php?id=293400.
That fixed it for me.

#5 2024-05-27 16:03:59

Ailurus
Member
Registered: 2011-05-17
Posts: 26

Re: [SOLVED] Variety of kernel bugs on Lenovo Legion Pro 7 w/ NVIDIA

My bad. The full output of journalctl -b is uploaded to http://0x0.st/XZCp.txt. Thanks @ewaller for suggesting 0x0.

#6 2024-05-27 17:47:06

V1del
Forum Moderator
Registered: 2012-10-16
Posts: 25,156

Re: [SOLVED] Variety of kernel bugs on Lenovo Legion Pro 7 w/ NVIDIA

Chances are high this is related to the current driver troubles. Test nvidia-535 as linked above, or nvidia-open, or nvidia 555 via nvidia-beta from the AUR.

Last edited by V1del (2024-05-27 17:49:23)

#7 2024-05-27 19:25:31

CuriousRubick
Member
Registered: 2023-11-19
Posts: 6

Re: [SOLVED] Variety of kernel bugs on Lenovo Legion Pro 7 w/ NVIDIA

V1del wrote:

Chances are high this is related to the current driver troubles. Test nvidia-535 as linked above, or nvidia-open, or nvidia 555 via nvidia-beta from the AUR.

Tried 555 beta before settling on 535. The former didn't help, although I never tried blacklisting nvidia_uvm. I've read that can fix some random lockups with 555.

*edit*

Can't say for sure I was experiencing the same issue as OP, but I did see this message repeatedly: "kernel: BUG: kernel NULL pointer dereference". Been getting random lockups and blank screens on shutdown since upgrading to 550.

Last edited by CuriousRubick (2024-05-27 19:29:24)

#8 2024-05-27 21:59:38

seth
Member
From: Won't reply 2 private help req
Registered: 2012-09-03
Posts: 75,287

Re: [SOLVED] Variety of kernel bugs on Lenovo Legion Pro 7 w/ NVIDIA

http://0x0.st/XZCp.txt seems to hold 4 copies of the same journal

May 27 14:06:49 worklaptop kernel: ==================================================================
May 27 14:06:49 worklaptop kernel: BUG: KFENCE: out-of-bounds write in _nv044102rm+0x10/0x30 [nvidia]
May 27 14:06:49 worklaptop kernel: Out-of-bounds write at 0x00000000da5737df (24B left of kfence-#204):
May 27 14:06:49 worklaptop kernel:  _nv044102rm+0x10/0x30 [nvidia]
May 27 14:06:49 worklaptop kernel:  _nv014568rm+0x4d/0x90 [nvidia]
May 27 14:06:49 worklaptop kernel:  _nv049800rm+0x18/0x60 [nvidia]
May 27 14:06:49 worklaptop kernel:  _nv026842rm+0x61/0x90 [nvidia]
May 27 14:06:49 worklaptop kernel:  rm_acpi_nvpcf_notify+0x1c/0xe0 [nvidia]
May 27 14:06:49 worklaptop kernel:  acpi_ev_notify_dispatch+0x4b/0x70
May 27 14:06:49 worklaptop kernel:  acpi_os_execute_deferred+0x17/0x30
May 27 14:06:49 worklaptop kernel:  process_one_work+0x18b/0x350
May 27 14:06:49 worklaptop kernel:  worker_thread+0x2eb/0x410
May 27 14:06:49 worklaptop kernel:  kthread+0xcf/0x100
May 27 14:06:49 worklaptop kernel:  ret_from_fork+0x31/0x50
May 27 14:06:49 worklaptop kernel:  ret_from_fork_asm+0x1a/0x30
May 27 15:56:34 worklaptop kernel: ==================================================================
May 27 15:56:34 worklaptop kernel: BUG: KFENCE: memory corruption in acpi_os_release_object+0xe/0x20
May 27 15:56:34 worklaptop kernel: Corrupted memory at 0x0000000018b280c4 [ ! ! ! ! ! ! ! ! . . . . . . . . ] (in kfence-#59):
May 27 15:56:34 worklaptop kernel:  acpi_os_release_object+0xe/0x20
May 27 15:56:34 worklaptop kernel:  acpi_os_execute_deferred+0x17/0x30
May 27 15:56:34 worklaptop kernel:  process_one_work+0x18b/0x350
May 27 15:56:34 worklaptop kernel:  worker_thread+0x2eb/0x410
May 27 15:56:34 worklaptop kernel:  kthread+0xcf/0x100
May 27 15:56:34 worklaptop kernel:  ret_from_fork+0x31/0x50
May 27 15:56:34 worklaptop kernel:  ret_from_fork_asm+0x1a/0x30
May 27 14:38:45 worklaptop kernel: mm/pgtable-generic.c:42: bad pud 00000000d1d4b729(0000000000002a20)
May 27 14:38:45 worklaptop kernel: Isolated Web Co[1751]: segfault at 0 ip 000071ceb1716365 sp 00007fff199a85f0 error 4 in libxul.so[71ceb12bd000+61f5000] likely on CPU 12 (core 24, socket 0)
May 27 14:38:45 worklaptop kernel: Code: 8b 00 48 8b 00 48 8b 40 10 48 85 c0 0f 85 f6 1a 00 00 49 8b 04 24 48 8b 00 4c 8b 60 10 4d 85 e4 0f 84 bc a5 00 00 49 8b 04 24 <48> 8b 08 48 8b 09 48 8b 51 28 48 85 d2 0f 85 25 6a 00 00 31 d2 48
May 27 14:38:45 worklaptop systemd[1]: Created slice Slice /system/systemd-coredump.
May 27 14:38:46 worklaptop systemd[1]: Started Process Core Dump (PID 11655/UID 0).
May 27 14:38:46 worklaptop systemd-coredump[11656]: Process 1751 (Isolated Web Co) of user 1000 dumped core.
                                                    
                                                    Stack trace of thread 1751:
                                                    #0  0x000071ceb1716365 n/a (libxul.so + 0x2f16365)
                                                    #1  0x000071ceb1731b09 n/a (libxul.so + 0x2f31b09)
                                                    #2  0x0000310a53457bc0 n/a (n/a + 0x0)
                                                    #3  0x0000310a53798767 n/a (n/a + 0x0)
                                                    #4  0x0000310a534554e6 n/a (n/a + 0x0)
                                                    #5  0x000071ceb19bd67e n/a (libxul.so + 0x31bd67e)
                                                    #6  0x000071ceb3a28ec3 n/a (libxul.so + 0x5228ec3)
                                                    #7  0x000071ceb30a6017 n/a (libxul.so + 0x48a6017)
                                                    #8  0x000071ceb30a37b9 n/a (libxul.so + 0x48a37b9)
                                                    #9  0x000071ceb30a525a n/a (libxul.so + 0x48a525a)
                                                    #10 0x000071ceb179cf3d n/a (libxul.so + 0x2f9cf3d)
                                                    #11 0x000071ceb229adf2 n/a (libxul.so + 0x3a9adf2)
                                                    #12 0x000071ceb1394dce n/a (libxul.so + 0x2b94dce)
                                                    #13 0x000071ceb13767e6 n/a (libxul.so + 0x2b767e6)
                                                    #14 0x000071ceb1391634 n/a (libxul.so + 0x2b91634)
                                                    #15 0x000071ceb13913ae n/a (libxul.so + 0x2b913ae)
                                                    #16 0x000071ceb13938d3 n/a (libxul.so + 0x2b938d3)
                                                    #17 0x000071ceb139382f n/a (libxul.so + 0x2b9382f)
                                                    #18 0x000071ceb12ce589 n/a (libxul.so + 0x2ace589)
                                                    #19 0x000058114677cbc8 n/a (firefox + 0x3ebc8)
                                                    #20 0x000071cebd34ec88 n/a (libc.so.6 + 0x25c88)
                                                    #21 0x000071cebd34ed4c __libc_start_main (libc.so.6 + 0x25d4c)
                                                    #22 0x00005811467dd875 _start (firefox + 0x9f875)

Could be the nvidia driver, but given the userspace bug, please add a swap file: https://wiki.archlinux.org/title/Swap#Swap_file
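
For reference, creating a swap file by hand boils down to roughly the steps below. This is a sketch and not a replacement for the wiki page: make_swapfile is a hypothetical helper name, the path and size in the usage comment are placeholders, and some file systems (Btrfs, for example) need extra steps that are not covered here.

```shell
# make_swapfile PATH SIZE_MIB
# Hypothetical helper: allocates a swap file and formats it.
# It does not enable the file; see the commented commands below.
make_swapfile() {
    path=$1
    size_mib=$2
    # Write real zero blocks; a swap file must not be sparse.
    dd if=/dev/zero of="$path" bs=1M count="$size_mib" status=none || return 1
    # Only the owner should be able to read swapped-out memory.
    chmod 600 "$path" || return 1
    # Write the swap signature into the file.
    mkswap "$path" >/dev/null || return 1
}

# Example, as root (path and size are placeholders):
#   make_swapfile /swapfile 4096
#   swapon /swapfile
# To keep it across reboots, an /etc/fstab line such as:
#   /swapfile none swap defaults 0 0
```

Whether the swap file is active can be checked afterwards with swapon --show.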

#9 2024-05-31 12:00:23

Ailurus
Member
Registered: 2011-05-17
Posts: 26

Re: [SOLVED] Variety of kernel bugs on Lenovo Legion Pro 7 w/ NVIDIA

I haven't encountered any more kernel bugs after switching to aur/nvidia-535xx-dkms a couple of days ago, so I'll mark this as solved. Thanks all!
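
After a switch like this it can be worth confirming which driver branch the installed kernel module actually belongs to. A minimal sketch, assuming the version string has the usual dotted form; driver_branch is a hypothetical helper name:

```shell
# driver_branch VERSION
# Hypothetical helper: prints the major branch of an NVIDIA driver
# version string, i.e. the part before the first dot.
driver_branch() {
    printf '%s\n' "${1%%.*}"
}

# Example: query the version of the installed nvidia kernel module.
#   driver_branch "$(modinfo -F version nvidia)"
```

If the DKMS build worked and the old module is gone, this should print 535 rather than the branch that was installed before.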

Last edited by Ailurus (2024-05-31 12:02:03)

#10 2024-06-02 15:36:15

CuriousRubick
Member
Registered: 2023-11-19
Posts: 6

Re: [SOLVED] Variety of kernel bugs on Lenovo Legion Pro 7 w/ NVIDIA

Ailurus wrote:

I haven't encountered any more kernel bugs after switching to aur/nvidia-535xx-dkms a couple of days ago, so I'll mark this as solved. Thanks all!

I had to switch from 535 to nvidia-open-beta-dkms (555), which is also in the AUR. I also installed nvidia-utils-beta and lib32-nvidia-utils-beta. Doing this required pacman's -d flag to ignore dependencies.

I was getting KWin crashes with 535. No issues with nvidia-open so far, other than that my laptop wakes immediately when I try to put it to sleep.
