#1 2020-12-11 07:45:27

jypma
Member
Registered: 2015-04-21
Posts: 13

Kernel null pointer dereference in ext4 code

After a recent kernel upgrade, my whole system frequently freezes. This happens every few days, with no observable pattern or correlation with load or activity.

The last systemd message has the following stack trace:

[...(no related messages before)...]
Dec 10 19:20:58 jinkpad kernel: BUG: kernel NULL pointer dereference, address: 0000000000000108
Dec 10 19:20:58 jinkpad kernel: #PF: supervisor read access in kernel mode
Dec 10 19:20:58 jinkpad kernel: #PF: error_code(0x0000) - not-present page
Dec 10 19:20:58 jinkpad kernel: PGD 0 P4D 0
Dec 10 19:20:58 jinkpad kernel: Oops: 0000 [#1] PREEMPT SMP PTI
Dec 10 19:20:58 jinkpad kernel: CPU: 1 PID: 99 Comm: kswapd0 Tainted: P        W  OE     5.9.12-arch1-1 #1
Dec 10 19:20:58 jinkpad kernel: Hardware name: LENOVO 20BECTO1WW/20BECTO1WW, BIOS GMET62WW (2.10 ) 03/19/2014
Dec 10 19:20:58 jinkpad kernel: RIP: 0010:fsverity_cleanup_inode+0x16/0x40
Dec 10 19:20:58 jinkpad kernel: Code: a8 48 c7 c7 08 c8 f4 a8 e8 47 a9 19 00 b8 ff ff ff ff c3 90 0f 1f 44 00 00 55 53 48 8b af 50 02 00 00 48 89 fb 48 85 ed 74 18 <48> 8b 7d 08 e8 e1 3c f4 ff 48 8b 3d 1a 95 c4 01 48 89 ee e8 62 76
Dec 10 19:20:58 jinkpad kernel: RSP: 0018:ffffb5c680237b08 EFLAGS: 00010206
Dec 10 19:20:58 jinkpad kernel: RAX: ffff9deab80b7ce8 RBX: ffff9deab80b79a8 RCX: 0000000000000000
Dec 10 19:20:58 jinkpad kernel: RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff9deab80b79a8
Dec 10 19:20:58 jinkpad kernel: RBP: 0000000000000100 R08: 0000000000000000 R09: ffffb5c680237cb0
Dec 10 19:20:58 jinkpad kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff9deab80b7b20
Dec 10 19:20:58 jinkpad kernel: R13: ffff9dee1447b000 R14: ffff9dee1447b070 R15: 0000000000000260
Dec 10 19:20:58 jinkpad kernel: FS:  0000000000000000(0000) GS:ffff9dee2e240000(0000) knlGS:0000000000000000
Dec 10 19:20:58 jinkpad kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 10 19:20:58 jinkpad kernel: CR2: 0000000000000108 CR3: 000000006e60e004 CR4: 00000000001706e0
Dec 10 19:20:58 jinkpad kernel: Call Trace:
Dec 10 19:20:58 jinkpad kernel:  ext4_evict_inode+0x58/0x560 [ext4]
Dec 10 19:20:58 jinkpad kernel:  evict+0xc3/0x1f0
Dec 10 19:20:58 jinkpad kernel:  prune_icache_sb+0x7e/0xb0
Dec 10 19:20:58 jinkpad kernel:  super_cache_scan+0x161/0x1e0
Dec 10 19:20:58 jinkpad kernel:  do_shrink_slab+0x149/0x2f0
Dec 10 19:20:58 jinkpad kernel:  shrink_slab+0x1cb/0x2d0
Dec 10 19:20:58 jinkpad kernel:  shrink_node+0x2b2/0x6d0
Dec 10 19:20:58 jinkpad kernel:  balance_pgdat+0x2f7/0x610
Dec 10 19:20:58 jinkpad kernel:  kswapd+0x1fd/0x3e0
Dec 10 19:20:58 jinkpad kernel:  ? wait_woken+0x80/0x80
Dec 10 19:20:58 jinkpad kernel:  ? balance_pgdat+0x610/0x610
Dec 10 19:20:58 jinkpad kernel:  kthread+0x142/0x160
Dec 10 19:20:58 jinkpad kernel:  ? __kthread_bind_mask+0x60/0x60
Dec 10 19:20:58 jinkpad kernel:  ret_from_fork+0x22/0x30
Dec 10 19:20:58 jinkpad kernel: Modules linked in: veth nvidia_uvm(POE) dm_crypt cbc encrypted_keys trusted tpm rng_core dm_mod loop nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype br_netfilter ccm xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJ>
Dec 10 19:20:58 jinkpad kernel:  ghash_clmulni_intel mac80211 aesni_intel snd_hda_codec_realtek crypto_simd snd_hda_codec_generic cryptd libarc4 glue_helper i2c_algo_bit snd_hda_intel rapl snd_intel_dspcfg intel_cstate iwlwifi snd_hda_codec intel_uncore input>
Dec 10 19:20:58 jinkpad kernel: CR2: 0000000000000108
Dec 10 19:20:58 jinkpad kernel: ---[ end trace 7bd16245d5ddbe6a ]---
Dec 10 19:20:58 jinkpad kernel: RIP: 0010:fsverity_cleanup_inode+0x16/0x40
Dec 10 19:20:58 jinkpad kernel: Code: a8 48 c7 c7 08 c8 f4 a8 e8 47 a9 19 00 b8 ff ff ff ff c3 90 0f 1f 44 00 00 55 53 48 8b af 50 02 00 00 48 89 fb 48 85 ed 74 18 <48> 8b 7d 08 e8 e1 3c f4 ff 48 8b 3d 1a 95 c4 01 48 89 ee e8 62 76
Dec 10 19:20:58 jinkpad kernel: RSP: 0018:ffffb5c680237b08 EFLAGS: 00010206
Dec 10 19:20:58 jinkpad kernel: RAX: ffff9deab80b7ce8 RBX: ffff9deab80b79a8 RCX: 0000000000000000
Dec 10 19:20:58 jinkpad kernel: RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff9deab80b79a8
Dec 10 19:20:58 jinkpad kernel: RBP: 0000000000000100 R08: 0000000000000000 R09: ffffb5c680237cb0
Dec 10 19:20:58 jinkpad kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff9deab80b7b20
Dec 10 19:20:58 jinkpad kernel: R13: ffff9dee1447b000 R14: ffff9dee1447b070 R15: 0000000000000260
Dec 10 19:20:58 jinkpad kernel: FS:  0000000000000000(0000) GS:ffff9dee2e240000(0000) knlGS:0000000000000000
Dec 10 19:20:58 jinkpad kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 10 19:20:58 jinkpad kernel: CR2: 0000000000000108 CR3: 000000006e60e004 CR4: 00000000001706e0

This didn't bring down the system (so I didn't even notice), but later a long list of further errors appeared, freezing the system:

Dec 10 20:03:03 jinkpad kernel: general protection fault, probably for non-canonical address 0x80ea9b8f932d2da: 0000 [#2] PREEMPT SMP PTI
Dec 10 20:03:03 jinkpad kernel: CPU: 1 PID: 164464 Comm: SyncEngine_Thre Tainted: P      D W  OE     5.9.12-arch1-1 #1
Dec 10 20:03:03 jinkpad kernel: Hardware name: LENOVO 20BECTO1WW/20BECTO1WW, BIOS GMET62WW (2.10 ) 03/19/2014
Dec 10 20:03:03 jinkpad kernel: RIP: 0010:kmem_cache_alloc+0x9f/0x250
Dec 10 20:03:03 jinkpad kernel: Code: f2 75 e7 48 8b 01 48 83 79 10 00 48 89 04 24 0f 84 91 01 00 00 48 85 c0 0f 84 88 01 00 00 41 8b 4c 24 28 49 8b 3c 24 48 01 c1 <48> 8b 19 48 89 ce 49 33 9c 24 b8 00 00 00 48 8d 8a 00 02 00 00 48
Dec 10 20:03:03 jinkpad kernel: RSP: 0018:ffffb5c6839dfa78 EFLAGS: 00010202
Dec 10 20:03:03 jinkpad kernel: RAX: 080ea9b8f932ce92 RBX: ffffffffc03fc900 RCX: 080ea9b8f932d2da
Dec 10 20:03:03 jinkpad kernel: RDX: 000000000835a201 RSI: 000000000835a201 RDI: 000037d851a01090
Dec 10 20:03:03 jinkpad kernel: RBP: 0000000000000c40 R08: 00000000000006ab R09: ffffffffc03fe801
Dec 10 20:03:03 jinkpad kernel: R10: 00000000000006ab R11: ffffffffc03faea0 R12: ffff9dee28e4b900
Dec 10 20:03:03 jinkpad kernel: R13: ffffffffc03e1278 R14: ffff9dee28e4b900 R15: 0000000000000008
Dec 10 20:03:03 jinkpad kernel: FS:  00007f6c16bfd640(0000) GS:ffff9dee2e240000(0000) knlGS:0000000000000000
Dec 10 20:03:03 jinkpad kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 10 20:03:03 jinkpad kernel: CR2: 00007f663b721000 CR3: 000000040ef52004 CR4: 00000000001706e0
Dec 10 20:03:03 jinkpad kernel: Call Trace:
Dec 10 20:03:03 jinkpad kernel:  ext4_alloc_inode+0x18/0x190 [ext4]
Dec 10 20:03:03 jinkpad kernel:  alloc_inode+0x1d/0xb0
Dec 10 20:03:03 jinkpad kernel:  iget_locked+0xd9/0x250
Dec 10 20:03:03 jinkpad kernel:  __ext4_iget+0x111/0xd90 [ext4]
Dec 10 20:03:03 jinkpad kernel:  ext4_lookup+0x123/0x260 [ext4]
Dec 10 20:03:03 jinkpad kernel:  __lookup_slow+0x85/0x140
Dec 10 20:03:03 jinkpad kernel:  walk_component+0x141/0x1b0
Dec 10 20:03:03 jinkpad kernel:  path_lookupat+0x5b/0x190
Dec 10 20:03:03 jinkpad kernel:  filename_lookup+0xbe/0x1d0
Dec 10 20:03:03 jinkpad kernel:  vfs_statx+0x8f/0x140
Dec 10 20:03:03 jinkpad kernel:  __do_sys_newlstat+0x47/0x80
Dec 10 20:03:03 jinkpad kernel:  do_syscall_64+0x33/0x40
Dec 10 20:03:03 jinkpad kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Dec 10 20:03:03 jinkpad kernel: RIP: 0033:0x7f6c5f11431a
Dec 10 20:03:03 jinkpad kernel: Code: ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 f3 0f 1e fa 41 89 f8 48 89 f7 48 89 d6 41 83 f8 01 77 2d b8 06 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 06 c3 0f 1f 44 00 00 48 8b 15 21 1b 0d 00 f7
Dec 10 20:03:03 jinkpad kernel: RSP: 002b:00007f6c16bfbd68 EFLAGS: 00000246 ORIG_RAX: 0000000000000006
Dec 10 20:03:03 jinkpad kernel: RAX: ffffffffffffffda RBX: 00007f6c1c13e080 RCX: 00007f6c5f11431a
Dec 10 20:03:03 jinkpad kernel: RDX: 00007f6c16bfbd70 RSI: 00007f6c16bfbd70 RDI: 00007f6c1c13dd48
Dec 10 20:03:03 jinkpad kernel: RBP: 00007f6c1c146e23 R08: 0000000000000001 R09: 0000000000000000
Dec 10 20:03:03 jinkpad kernel: R10: 00007f6c16bfbe80 R11: 0000000000000246 R12: 00007f6c1c014ed0
Dec 10 20:03:03 jinkpad kernel: R13: 00007f6c16bfbff0 R14: 00007f6c1c13e080 R15: 00007f6c16bfbea0
Dec 10 20:03:03 jinkpad kernel: Modules linked in: veth nvidia_uvm(POE) dm_crypt cbc encrypted_keys trusted tpm rng_core dm_mod loop nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype br_netfilter ccm xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJ>
Dec 10 20:03:03 jinkpad kernel:  ghash_clmulni_intel mac80211 aesni_intel snd_hda_codec_realtek crypto_simd snd_hda_codec_generic cryptd libarc4 glue_helper i2c_algo_bit snd_hda_intel rapl snd_intel_dspcfg intel_cstate iwlwifi snd_hda_codec intel_uncore input>
Dec 10 20:03:03 jinkpad kernel: ---[ end trace 7bd16245d5ddbe6b ]---
Dec 10 20:03:03 jinkpad kernel: RIP: 0010:fsverity_cleanup_inode+0x16/0x40
Dec 10 20:03:03 jinkpad kernel: Code: a8 48 c7 c7 08 c8 f4 a8 e8 47 a9 19 00 b8 ff ff ff ff c3 90 0f 1f 44 00 00 55 53 48 8b af 50 02 00 00 48 89 fb 48 85 ed 74 18 <48> 8b 7d 08 e8 e1 3c f4 ff 48 8b 3d 1a 95 c4 01 48 89 ee e8 62 76
Dec 10 20:03:03 jinkpad kernel: RSP: 0018:ffffb5c680237b08 EFLAGS: 00010206
Dec 10 20:03:03 jinkpad kernel: RAX: ffff9deab80b7ce8 RBX: ffff9deab80b79a8 RCX: 0000000000000000
Dec 10 20:03:03 jinkpad kernel: RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff9deab80b79a8
Dec 10 20:03:03 jinkpad kernel: RBP: 0000000000000100 R08: 0000000000000000 R09: ffffb5c680237cb0
Dec 10 20:03:03 jinkpad kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff9deab80b7b20
Dec 10 20:03:03 jinkpad kernel: R13: ffff9dee1447b000 R14: ffff9dee1447b070 R15: 0000000000000260
Dec 10 20:03:03 jinkpad kernel: FS:  00007f6c16bfd640(0000) GS:ffff9dee2e240000(0000) knlGS:0000000000000000
Dec 10 20:03:03 jinkpad kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 10 20:03:03 jinkpad kernel: CR2: 00007f663b721000 CR3: 000000040ef52004 CR4: 00000000001706e0
[...repeated lots of times from here...]

This is on kernel `5.9.12-arch1-1`.

A reboot got systemd to check my filesystems (mostly ext4), with no observable errors. This is the filesystem setup (sdb is an SSD, sda is a hard disk):

/dev/sdb1 on / type ext4 (rw,relatime)
/dev/sda4 on /boot type ext4 (rw,relatime,stripe=4,data=ordered)
/dev/sda6 on /home type ext4 (rw,relatime,data=ordered)
/dev/sda5 on /var type ext4 (rw,relatime,data=ordered)
/dev/sda2 on /boot/efi type vfat (rw,relatime,fmask=0022,dmask=0022,codepage=437,iocharset=iso8859-1,shortname=mixed,utf8,errors=remount-ro)

I also have a swapfile on /var/swapfile.

In addition, I have a LUKS-encrypted file loop-mounted as ext4 (in case that is relevant).

Any tips on what to try?


#2 2020-12-12 04:12:01

jdoggsc
Member
Registered: 2011-08-15
Posts: 98

Re: Kernel null pointer dereference in ext4 code

No real solution here, but I'm having a similar issue:

Dec 09 10:21:17 ryzen kernel: BUG: kernel NULL pointer dereference, address: 0000000000000020
Dec 09 10:21:17 ryzen kernel: #PF: supervisor read access in kernel mode
Dec 09 10:21:17 ryzen kernel: #PF: error_code(0x0000) - not-present page
Dec 09 10:21:17 ryzen kernel: PGD 31b136067 P4D 31b136067 PUD 31b135067 PMD 0 
Dec 09 10:21:17 ryzen kernel: Oops: 0000 [#1] PREEMPT SMP NOPTI
Dec 09 10:21:17 ryzen kernel: CPU: 0 PID: 698 Comm: irq/70-nvidia Tainted: P           OE     5.9.1-arch1-1 #1
Dec 09 10:21:17 ryzen kernel: Hardware name: System manufacturer System Product Name/PRIME X370-PRO, BIOS 5603 07/28/2020
Dec 09 10:21:17 ryzen kernel: RIP: 0010:_nv027470rm+0x9/0x90 [nvidia]
Dec 09 10:21:17 ryzen kernel: Code: 90 ff e8 ea b0 00 00 31 c0 48 83 c4 08 c3 31 c0 eb bf 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 83 ec 08 48 85 ff 74 57 <48> 8b 17 31 c0 48 85 d2 75 0e eb 2b 0f 1f 00 48 8b 52 10 48 85 d2
Dec 09 10:21:17 ryzen kernel: RSP: 0018:ffffb98b411cbbe0 EFLAGS: 00010202
Dec 09 10:21:17 ryzen kernel: RAX: 0000000000000020 RBX: 0000000000000020 RCX: 0000000000000010
Dec 09 10:21:17 ryzen kernel: RDX: ffff9c6ac3df3b08 RSI: ffffffffffffffff RDI: 0000000000000020
Dec 09 10:21:17 ryzen kernel: RBP: ffff9c6ae51bd940 R08: ffffffffc317fb90 R09: ffff9c6ae51bd920
Dec 09 10:21:17 ryzen kernel: R10: ffffffffc1e557d0 R11: ffff9c6b0c014008 R12: 0000000000000020
Dec 09 10:21:17 ryzen kernel: R13: 0000000000000000 R14: ffff9c6ae51bdaa8 R15: ffff9c6ae51bdbb0
Dec 09 10:21:17 ryzen kernel: FS:  0000000000000000(0000) GS:ffff9c6b0e800000(0000) knlGS:0000000000000000
Dec 09 10:21:17 ryzen kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 09 10:21:17 ryzen kernel: CR2: 0000000000000020 CR3: 00000002f72cc000 CR4: 00000000003506f0
Dec 09 10:21:17 ryzen kernel: Call Trace:
Dec 09 10:21:17 ryzen kernel:  ? _nv029878rm+0x1b/0x90 [nvidia]
Dec 09 10:21:17 ryzen kernel:  ? _nv025417rm+0x18/0x60 [nvidia]
Dec 09 10:21:17 ryzen kernel:  ? _nv011659rm+0x13d/0x1c0 [nvidia]
Dec 09 10:21:17 ryzen kernel:  ? _nv000083rm+0x12f/0x1a0 [nvidia]
Dec 09 10:21:17 ryzen kernel:  ? _nv011587rm+0xff/0x180 [nvidia]
Dec 09 10:21:17 ryzen kernel:  ? _nv018400rm+0x1af/0x210 [nvidia]
Dec 09 10:21:17 ryzen kernel:  ? _nv018340rm+0xd9a/0xe90 [nvidia]
Dec 09 10:21:17 ryzen kernel:  ? _nv018341rm+0xde/0x260 [nvidia]
Dec 09 10:21:17 ryzen kernel:  ? _nv018307rm+0x72/0xc0 [nvidia]
Dec 09 10:21:17 ryzen kernel:  ? _nv018321rm+0x235/0x2d0 [nvidia]
Dec 09 10:21:17 ryzen kernel:  ? _nv026019rm+0x10/0x10 [nvidia]
Dec 09 10:21:17 ryzen kernel:  ? _nv018354rm+0xac/0xe0 [nvidia]
Dec 09 10:21:17 ryzen kernel:  ? _nv027677rm+0x820/0xdc0 [nvidia]
Dec 09 10:21:17 ryzen kernel:  ? _nv007561rm+0x155/0x270 [nvidia]
Dec 09 10:21:17 ryzen kernel:  ? _nv027685rm+0x8d/0x180 [nvidia]
Dec 09 10:21:17 ryzen kernel:  ? _nv000711rm+0xa9/0x200 [nvidia]
Dec 09 10:21:17 ryzen kernel:  ? disable_irq_nosync+0x10/0x10
Dec 09 10:21:17 ryzen kernel:  ? rm_isr_bh+0x1c/0x60 [nvidia]
Dec 09 10:21:17 ryzen kernel:  ? nvidia_isr_kthread_bh+0x1b/0x40 [nvidia]
Dec 09 10:21:17 ryzen kernel:  ? irq_thread_fn+0x20/0x60
Dec 09 10:21:17 ryzen kernel:  ? irq_thread+0xf5/0x1a0
Dec 09 10:21:17 ryzen kernel:  ? irq_finalize_oneshot.part.0+0xe0/0xe0
Dec 09 10:21:17 ryzen kernel:  ? irq_thread_check_affinity+0xd0/0xd0
Dec 09 10:21:17 ryzen kernel:  ? kthread+0x142/0x160
Dec 09 10:21:17 ryzen kernel:  ? __kthread_bind_mask+0x60/0x60
Dec 09 10:21:17 ryzen kernel:  ? ret_from_fork+0x22/0x30
Dec 09 10:21:17 ryzen kernel: Modules linked in: udp_diag tcp_diag inet_diag fuse rfcomm cfg80211 8021q garp mrp stp llc cmac algif_hash algif_skcipher af_alg bnep vboxnetflt(OE) vboxnetadp(OE) vboxdrv(OE) nvidia_drm(POE) nvidia_modese>
Dec 09 10:21:17 ryzen kernel:  crypto_user agpgart ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 xhci_pci crc32c_intel xhci_pci_renesas xhci_hcd
Dec 09 10:21:17 ryzen kernel: CR2: 0000000000000020
Dec 09 10:21:17 ryzen kernel: ---[ end trace 94f1672408441b65 ]---
Dec 09 10:21:17 ryzen kernel: RIP: 0010:_nv027470rm+0x9/0x90 [nvidia]
Dec 09 10:21:17 ryzen kernel: Code: 90 ff e8 ea b0 00 00 31 c0 48 83 c4 08 c3 31 c0 eb bf 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 83 ec 08 48 85 ff 74 57 <48> 8b 17 31 c0 48 85 d2 75 0e eb 2b 0f 1f 00 48 8b 52 10 48 85 d2
Dec 09 10:21:17 ryzen kernel: RSP: 0018:ffffb98b411cbbe0 EFLAGS: 00010202
Dec 09 10:21:17 ryzen kernel: RAX: 0000000000000020 RBX: 0000000000000020 RCX: 0000000000000010
Dec 09 10:21:17 ryzen kernel: RDX: ffff9c6ac3df3b08 RSI: ffffffffffffffff RDI: 0000000000000020
Dec 09 10:21:17 ryzen kernel: RBP: ffff9c6ae51bd940 R08: ffffffffc317fb90 R09: ffff9c6ae51bd920
Dec 09 10:21:17 ryzen kernel: R10: ffffffffc1e557d0 R11: ffff9c6b0c014008 R12: 0000000000000020
Dec 09 10:21:17 ryzen kernel: R13: 0000000000000000 R14: ffff9c6ae51bdaa8 R15: ffff9c6ae51bdbb0
Dec 09 10:21:17 ryzen kernel: FS:  0000000000000000(0000) GS:ffff9c6b0e800000(0000) knlGS:0000000000000000
Dec 09 10:21:17 ryzen kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 09 10:21:17 ryzen kernel: CR2: 0000000000000020 CR3: 00000002f72cc000 CR4: 00000000003506f0
Dec 09 10:21:17 ryzen kernel: BUG: kernel NULL pointer dereference, address: 0000000000000928
Dec 09 10:21:17 ryzen kernel: #PF: supervisor write access in kernel mode
Dec 09 10:21:17 ryzen kernel: #PF: error_code(0x0002) - not-present page
Dec 09 10:21:17 ryzen kernel: PGD 31b136067 P4D 31b136067 PUD 31b135067 PMD 0 
Dec 09 10:21:17 ryzen kernel: Oops: 0002 [#2] PREEMPT SMP NOPTI
Dec 09 10:21:17 ryzen kernel: CPU: 0 PID: 698 Comm: irq/70-nvidia Tainted: P      D    OE     5.9.1-arch1-1 #1
Dec 09 10:21:17 ryzen kernel: Hardware name: System manufacturer System Product Name/PRIME X370-PRO, BIOS 5603 07/28/2020
Dec 09 10:21:17 ryzen kernel: RIP: 0010:mutex_lock+0x10/0x20
Dec 09 10:21:17 ryzen kernel: Code: 03 31 c0 c3 eb d4 0f 1f 40 00 0f 1f 44 00 00 be 02 00 00 00 e9 61 fa ff ff 90 0f 1f 44 00 00 31 c0 65 48 8b 14 25 c0 7b 01 00 <f0> 48 0f b1 17 75 01 c3 eb d6 66 0f 1f 44 00 00 0f 1f 44 00 00 41
Dec 09 10:21:17 ryzen kernel: RSP: 0018:ffffb98b411cbe30 EFLAGS: 00010246
Dec 09 10:21:17 ryzen kernel: RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
Dec 09 10:21:17 ryzen kernel: RDX: ffff9c6afd800000 RSI: 0000000000000000 RDI: 0000000000000928
Dec 09 10:21:17 ryzen kernel: RBP: 0000000000000928 R08: 000000000000001f R09: 0000000000000000
Dec 09 10:21:17 ryzen kernel: R10: ffff9c6b06053000 R11: ffffb98b411cb801 R12: ffff9c6afd8007c4
Dec 09 10:21:17 ryzen kernel: R13: 0000000000000000 R14: 0000000000000001 R15: ffff9c6afd800000
Dec 09 10:21:17 ryzen kernel: FS:  0000000000000000(0000) GS:ffff9c6b0e800000(0000) knlGS:0000000000000000
Dec 09 10:21:17 ryzen kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 09 10:21:17 ryzen kernel: CR2: 0000000000000928 CR3: 00000002f72cc000 CR4: 00000000003506f0
Dec 09 10:21:17 ryzen kernel: Call Trace:
Dec 09 10:21:17 ryzen kernel:  perf_event_exit_task+0x30/0x440
Dec 09 10:21:17 ryzen kernel:  ? preempt_count_add+0x49/0xa0
Dec 09 10:21:17 ryzen kernel:  ? put_cpu_partial+0x92/0x140
Dec 09 10:21:17 ryzen kernel:  do_exit+0x379/0xa90
Dec 09 10:21:17 ryzen kernel:  ? task_work_run+0x5c/0x90
Dec 09 10:21:17 ryzen kernel:  ? do_exit+0x369/0xa90
Dec 09 10:21:17 ryzen kernel:  ? kthread+0x142/0x160
Dec 09 10:21:17 ryzen kernel:  ? rewind_stack_do_exit+0x17/0x17
Dec 09 10:21:17 ryzen kernel: Modules linked in: udp_diag tcp_diag inet_diag fuse rfcomm cfg80211 8021q garp mrp stp llc cmac algif_hash algif_skcipher af_alg bnep vboxnetflt(OE) vboxnetadp(OE) vboxdrv(OE) nvidia_drm(POE) nvidia_modese>
Dec 09 10:21:17 ryzen kernel:  crypto_user agpgart ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 xhci_pci crc32c_intel xhci_pci_renesas xhci_hcd
Dec 09 10:21:17 ryzen kernel: CR2: 0000000000000928
Dec 09 10:21:17 ryzen kernel: ---[ end trace 94f1672408441b66 ]---
Dec 09 10:21:17 ryzen kernel: RIP: 0010:_nv027470rm+0x9/0x90 [nvidia]
Dec 09 10:21:17 ryzen kernel: Code: 90 ff e8 ea b0 00 00 31 c0 48 83 c4 08 c3 31 c0 eb bf 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 83 ec 08 48 85 ff 74 57 <48> 8b 17 31 c0 48 85 d2 75 0e eb 2b 0f 1f 00 48 8b 52 10 48 85 d2
Dec 09 10:21:17 ryzen kernel: RSP: 0018:ffffb98b411cbbe0 EFLAGS: 00010202
Dec 09 10:21:17 ryzen kernel: RAX: 0000000000000020 RBX: 0000000000000020 RCX: 0000000000000010
Dec 09 10:21:17 ryzen kernel: RDX: ffff9c6ac3df3b08 RSI: ffffffffffffffff RDI: 0000000000000020
Dec 09 10:21:17 ryzen kernel: RBP: ffff9c6ae51bd940 R08: ffffffffc317fb90 R09: ffff9c6ae51bd920
Dec 09 10:21:17 ryzen kernel: R10: ffffffffc1e557d0 R11: ffff9c6b0c014008 R12: 0000000000000020
Dec 09 10:21:17 ryzen kernel: R13: 0000000000000000 R14: ffff9c6ae51bdaa8 R15: ffff9c6ae51bdbb0
Dec 09 10:21:17 ryzen kernel: FS:  0000000000000000(0000) GS:ffff9c6b0e800000(0000) knlGS:0000000000000000
Dec 09 10:21:17 ryzen kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 09 10:21:17 ryzen kernel: CR2: 0000000000000928 CR3: 00000002f72cc000 CR4: 00000000003506f0
Dec 09 10:21:17 ryzen kernel: Fixing recursive fault but reboot is needed!
Dec 09 10:21:21 ryzen /usr/lib/gdm-x-session[1009]: (EE) NVIDIA(GPU-0): WAIT (2, 8, 0x8000, 0x00004844, 0x00006520)
Dec 09 10:21:28 ryzen /usr/lib/gdm-x-session[1009]: (EE) NVIDIA(GPU-0): WAIT (1, 8, 0x8000, 0x00004844, 0x00006520)
Dec 09 10:21:31 ryzen /usr/lib/gdm-x-session[1009]: (EE) NVIDIA(GPU-0): WAIT (2, 8, 0x8000, 0x00004844, 0x00009ebc)
Dec 09 10:21:38 ryzen /usr/lib/gdm-x-session[1009]: (EE) NVIDIA(GPU-0): WAIT (1, 8, 0x8000, 0x00004844, 0x00009ebc)
Dec 09 10:21:38 ryzen /usr/lib/gdm-x-session[3144]: [12/09/20, 10:21:38:182] info: Store: SET_SYSTEM_IDLE idle_taskbar
Dec 09 10:21:38 ryzen /usr/lib/gdm-x-session[3144]: [12/09/20, 10:21:38:186] info: [DESKTOP-SIDE-EFFECT] Update from desktop for keys  ["app"]
Dec 09 10:21:41 ryzen /usr/lib/gdm-x-session[1009]: (EE) NVIDIA(GPU-0): WAIT (2, 8, 0x8000, 0x00004844, 0x0000d858)
Dec 09 10:21:48 ryzen /usr/lib/gdm-x-session[1009]: (EE) NVIDIA(GPU-0): WAIT (1, 8, 0x8000, 0x00004844, 0x0000d858)
Dec 09 10:21:51 ryzen /usr/lib/gdm-x-session[1009]: (EE) NVIDIA(GPU-0): WAIT (2, 8, 0x8000, 0x00004844, 0x00001c48)
Dec 09 10:21:58 ryzen /usr/lib/gdm-x-session[1009]: (EE) NVIDIA(GPU-0): WAIT (1, 8, 0x8000, 0x00004844, 0x00001c48)

Just as you describe, it randomly freezes, and I have the same kernel NULL pointer dereference line at the start.
I downloaded memtest-efi from AUR and ran it overnight. All tests passed without issues. I then remembered that this BIOS had some overclocking settings applied, which had been in place for over 2 years (about a 3-5% speed improvement, nothing crazy). Since I set clock speeds and voltages back to their defaults, it has run all day today without problems, but it's too early to tell whether the problems were caused by hardware instability.

I just thought I'd mention that in case your computer is overclocked you might have a related problem.


#3 2020-12-12 07:35:20

dietzi96
Member
Registered: 2015-07-04
Posts: 17

Re: Kernel null pointer dereference in ext4 code

The log shows the kernel as tainted, and the nvidia module is loaded. Do you use an nvidia card, and if so, which driver is installed?

I haven't seen OC-related kernel crashes for a long time, so maybe that's worth digging into.
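
The taint letters in the `Tainted:` field of the oops header can be decoded mechanically. The following is a minimal sketch, assuming the letter meanings given in the kernel's tainted-kernels documentation; `decode_taint` and the partial `TAINT_FLAGS` table are hypothetical names made up for this illustration, not part of any existing tool.

```python
import re

# Meanings of the taint letters that appear in this thread. The table is
# deliberately partial; letters not listed here are reported as unknown.
TAINT_FLAGS = {
    "P": "proprietary module was loaded",
    "D": "kernel died recently (an Oops or BUG happened)",
    "W": "kernel issued a warning earlier",
    "O": "out-of-tree module was loaded",
    "E": "unsigned module was loaded",
}


def decode_taint(line):
    """Return (letter, meaning) pairs for the 'Tainted:' field of an oops line."""
    # The field is a run of capital letters and spaces; the kernel version
    # that follows starts with a digit, which ends the match.
    match = re.search(r"Tainted: ([A-Z ]+)", line)
    if match is None:
        return []  # e.g. a "Not tainted" kernel
    letters = match.group(1).replace(" ", "")
    return [(c, TAINT_FLAGS.get(c, "unknown flag")) for c in letters]


if __name__ == "__main__":
    header = ("CPU: 1 PID: 99 Comm: kswapd0 Tainted: P        W  OE     "
              "5.9.12-arch1-1 #1")
    for letter, meaning in decode_taint(header):
        print(f"{letter}: {meaning}")
```

The `P`, `O` and `E` letters are what a loaded proprietary, out-of-tree module such as nvidia typically produces; on their own they say nothing about which code actually faulted.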


#4 2020-12-12 10:04:08

V1del
Forum Moderator
Registered: 2012-10-16
Posts: 21,626

Re: Kernel null pointer dereference in ext4 code

The nvidia issue is well known and I don't believe it is related to the OP's problem (... though it might be, with the stack being weirdly populated by ext4 errors).

While for a different underlying issue, the nvidia-related null pointer hasn't happened to me yet after applying Aaron's small kmalloc patch: https://bbs.archlinux.org/viewtopic.php … 3#p1938113


#5 2020-12-13 13:00:53

jypma
Member
Registered: 2015-04-21
Posts: 13

Re: Kernel null pointer dereference in ext4 code

Yes, I'm using the nvidia proprietary driver (nvidia 455.45.01-5); it's a Thinkpad T540p with the following graphics:

00:02.0 VGA compatible controller: Intel Corporation 4th Gen Core Processor Integrated Graphics Controller (rev 06)
01:00.0 VGA compatible controller: NVIDIA Corporation GK208M [GeForce GT 730M] (rev a1)

Would you recommend trying to switch to nouveau?

I'm not sure though how the nvidia driver would cause issues with the ext4 filesystem handling... :-)

It might be worth noting that I'm using prime-run to switch the nvidia card on and off; although the crashes don't really occur while the card is on (I only use it for games).

Last edited by jypma (2020-12-13 13:02:37)


#6 2020-12-13 13:42:29

V1del
Forum Moderator
Registered: 2012-10-16
Posts: 21,626

Re: Kernel null pointer dereference in ext4 code

I really don't think nvidia is related in your case. The initial failure seems to be an issue in kswapd; do you use a swapfile? Maybe that has some corruption. Try fsck'ing your drives at the least, and maybe even run a long SMART check and post the output of smartctl -a once the time it says it needs has elapsed.
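
When reading `smartctl -a` output, the attribute table is the part that usually matters. Below is a minimal sketch of how that table could be screened, assuming the ten-column layout smartctl prints for ATA drives; the function names and the choice of watched attribute IDs are assumptions made for this illustration, and the sample rows are copied from the /dev/sda output in this thread.

```python
# Attribute IDs whose raw value is commonly read as a count of problems.
# Which IDs to watch is a judgment call made for this sketch.
WATCHED_RAW = {5, 197, 198, 199}

SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
"""


def parse_attributes(text):
    """Parse the attribute table of `smartctl -a` output into {id: row}."""
    rows = {}
    for line in text.splitlines():
        fields = line.split(None, 9)
        # Attribute rows have ten columns and a hexadecimal FLAG column.
        if len(fields) != 10 or not fields[2].startswith("0x"):
            continue
        try:
            rows[int(fields[0])] = {
                "name": fields[1],
                "value": int(fields[3]),
                "thresh": int(fields[5]),
                "raw": fields[9].strip(),
            }
        except ValueError:
            continue  # not a numeric attribute row
    return rows


def findings(rows):
    """Return human-readable notes for attributes that deserve a closer look."""
    notes = []
    for attr_id in sorted(rows):
        row = rows[attr_id]
        raw = row["raw"].split()[0]
        if attr_id in WATCHED_RAW and raw.isdigit() and int(raw) > 0:
            notes.append(f"{row['name']}: raw value {raw}")
        # smartctl treats an attribute as failed when VALUE <= THRESH.
        if row["thresh"] > 0 and row["value"] <= row["thresh"]:
            notes.append(f"{row['name']}: value {row['value']} "
                         f"at or below threshold {row['thresh']}")
    return notes


if __name__ == "__main__":
    print(findings(parse_attributes(SAMPLE)) or "nothing flagged")
```

An empty result only means that none of these particular counters are raised; it does not replace the long self-test or a filesystem check.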


#7 2020-12-13 14:00:13

seth
Member
Registered: 2012-09-03
Posts: 50,927

Re: Kernel null pointer dereference in ext4 code

Is there context in the log (HW messages, IO or PCI errors)?
What does the output of "smartctl -a" for the device look like? (even without running a test)
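
Looking for context around an oops mostly means scanning the kernel log for a few kinds of lines. Here is a minimal sketch of such a scan; the substrings in `PATTERNS` are illustrative assumptions about how oops, I/O, machine-check and PCIe error lines are usually worded, not an exhaustive or authoritative list, and `classify` is a hypothetical helper name.

```python
import re

# (label, pattern) pairs, checked in order; the first match wins.
# The substrings are assumptions about typical kernel message wording.
PATTERNS = [
    ("oops", re.compile(r"BUG: |Oops: |general protection fault")),
    ("io", re.compile(r"I/O error")),
    ("mce", re.compile(r"mce: |Hardware Error")),
    ("pcie", re.compile(r"AER: ")),
]


def classify(lines):
    """Return (label, line) for every log line matching one of PATTERNS."""
    hits = []
    for line in lines:
        for label, pattern in PATTERNS:
            if pattern.search(line):
                hits.append((label, line.rstrip("\n")))
                break
    return hits


if __name__ == "__main__":
    sample = [
        "Dec 10 19:20:58 jinkpad kernel: BUG: kernel NULL pointer "
        "dereference, address: 0000000000000108",
        "Dec 10 19:20:58 jinkpad kernel: Call Trace:",
        "Dec 10 20:03:03 jinkpad kernel: general protection fault, probably "
        "for non-canonical address 0x80ea9b8f932d2da: 0000 [#2] PREEMPT SMP PTI",
    ]
    for label, line in classify(sample):
        print(f"[{label}] {line}")
```

To run it on a real log, the lines of `journalctl -k -b -1` (kernel messages from the previous boot) could be passed to `classify` instead of the sample list.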


#8 2020-12-13 14:17:29

jypma
Member
Registered: 2015-04-21
Posts: 13

Re: Kernel null pointer dereference in ext4 code

Yes, I have a swapfile; I'll disable it for now and see what that brings.

smartctl -a for /dev/sda:

smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.4.81-1-lts] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Blue
Device Model:     WDC WD20SPZX-00UA7T0
Serial Number:    WD-WXF1E882Z11E
LU WWN Device Id: 5 0014ee 6b3dc7f6d
Firmware Version: 01.01A01
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Form Factor:      2.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun Dec 13 15:15:18 2020 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
					was never started.
					Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever
					been run.
Total time to complete Offline
data collection: 		(21660) seconds.
Offline data collection
capabilities: 			 (0x71) SMART execute Offline immediate.
					No Auto Offline data collection support.
					Suspend Offline collection upon new
					command.
					No Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 350) minutes.
Conveyance self-test routine
recommended polling time: 	 (   4) minutes.
SCT capabilities: 	       (0x3035)	SCT Status supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   219   218   021    Pre-fail  Always       -       2050
  4 Start_Stop_Count        0x0032   099   099   000    Old_age   Always       -       1143
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   089   089   000    Old_age   Always       -       8183
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       1143
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       45
193 Load_Cycle_Count        0x0032   197   197   000    Old_age   Always       -       11178
194 Temperature_Celsius     0x0022   103   090   000    Old_age   Always       -       44
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%         6         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

smartctl -a for /dev/sdb:

smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.4.81-1-lts] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     TS240GMTS420S
Serial Number:    G022060218
LU WWN Device Id: 5 7c3548 1a28c22ba
Firmware Version: S0130B0
User Capacity:    240,057,409,536 bytes [240 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      M.2
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-2 T13/2015-D revision 3
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun Dec 13 15:16:21 2020 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
					was never started.
					Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever
					been run.
Total time to complete Offline
data collection: 		(  120) seconds.
Offline data collection
capabilities: 			 (0x11) SMART execute Offline immediate.
					No Auto Offline data collection support.
					Suspend Offline collection upon new
					command.
					No Offline surface scan supported.
					Self-test supported.
					No Conveyance Self-test supported.
					No Selective Self-test supported.
SMART capabilities:            (0x0002)	Does not save SMART data before
					entering power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 (  10) minutes.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x0032   100   100   050    Old_age   Always       -       0
  5 Reallocated_Sector_Ct   0x0032   100   100   050    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   050    Old_age   Always       -       510
 12 Power_Cycle_Count       0x0032   100   100   050    Old_age   Always       -       86
160 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       0
161 Unknown_Attribute       0x0033   100   100   050    Pre-fail  Always       -       100
163 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       34
164 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       3577
165 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       8
166 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       1
167 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       1
168 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       1000
169 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       100
175 Program_Fail_Count_Chip 0x0032   100   100   050    Old_age   Always       -       0
176 Erase_Fail_Count_Chip   0x0032   100   100   050    Old_age   Always       -       0
177 Wear_Leveling_Count     0x0032   100   100   050    Old_age   Always       -       0
178 Used_Rsvd_Blk_Cnt_Chip  0x0032   100   100   050    Old_age   Always       -       0
181 Program_Fail_Cnt_Total  0x0032   100   100   050    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   050    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   050    Old_age   Always       -       7
194 Temperature_Celsius     0x0022   100   100   050    Old_age   Always       -       39
195 Hardware_ECC_Recovered  0x0032   100   100   050    Old_age   Always       -       0
196 Reallocated_Event_Count 0x0032   100   100   050    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   050    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0032   100   100   050    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   050    Old_age   Always       -       0
232 Available_Reservd_Space 0x0032   100   100   050    Old_age   Always       -       100
241 Total_LBAs_Written      0x0030   100   100   050    Old_age   Offline      -       2973
242 Total_LBAs_Read         0x0030   100   100   050    Old_age   Offline      -       1860
245 Unknown_Attribute       0x0032   100   100   050    Old_age   Always       -       420

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

Selective Self-tests/Logging not supported


#9 2020-12-13 14:46:26

seth
Member
Registered: 2012-09-03
Posts: 50,927

Re: Kernel null pointer dereference in ext4 code

Neither looks particularly unhealthy.
It's more likely RAM than ext4, though. Do you have more GPF logs?

After the smart test, see https://wiki.archlinux.org/index.php/St … ing_memory
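
The register dumps already in the thread say a little more about the bad pointers. What follows is a minimal sketch, assuming 48-bit virtual addresses (4-level paging) and handling only the two simple `mov r64, [base]` and `mov r64, [base+disp8]` encodings that occur at the marked bytes in these logs; `faulting_load`, `parse_registers` and `is_canonical` are hypothetical helper names, and this is no substitute for a real disassembler.

```python
import re

# ModRM register numbers when no REX extension bits are set
# (the REX prefix 0x48 sets only the W bit).
REGS64 = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"]
REG_RE = re.compile(r"\b(R[A-Z0-9]{2}|CR2): ([0-9a-f]{16})\b")


def faulting_load(code_line):
    """Decode the <marked> instruction of a 'Code:' line if it has the form
    48 8b /r (64-bit load). Returns (dest, base, disp) or None otherwise."""
    if "Code:" not in code_line or "<" not in code_line:
        return None
    tail = code_line.split("<", 1)[1].replace(">", " ").split()
    code = [int(byte, 16) for byte in tail]
    if len(code) < 3 or code[0] != 0x48 or code[1] != 0x8B:
        return None
    mod, reg, rm = code[2] >> 6, (code[2] >> 3) & 7, code[2] & 7
    if rm == 4:  # a SIB byte follows; not handled in this sketch
        return None
    if mod == 0 and rm != 5:  # mod 0 with rm 5 would be RIP-relative
        return REGS64[reg], REGS64[rm], 0
    if mod == 1 and len(code) >= 4:
        disp = code[3] - 0x100 if code[3] >= 0x80 else code[3]
        return REGS64[reg], REGS64[rm], disp
    return None


def parse_registers(lines):
    """Collect general-purpose registers and CR2 from oops lines."""
    regs = {}
    for line in lines:
        for name, value in REG_RE.findall(line):
            regs[name] = int(value, 16)
    return regs


def is_canonical(addr):
    """True if bits 63..47 are all equal (48-bit address assumption)."""
    top = addr >> 47
    return top == 0 or top == 0x1FFFF


if __name__ == "__main__":
    oops = [
        "Code: a8 48 c7 c7 08 c8 f4 a8 e8 47 a9 19 00 b8 ff ff ff ff c3 90 "
        "0f 1f 44 00 00 55 53 48 8b af 50 02 00 00 48 89 fb 48 85 ed 74 18 "
        "<48> 8b 7d 08 e8 e1 3c f4 ff 48 8b 3d 1a 95 c4 01 48 89 ee e8 62 76",
        "RBP: 0000000000000100 R08: 0000000000000000 R09: ffffb5c680237cb0",
        "CR2: 0000000000000108 CR3: 000000006e60e004 CR4: 00000000001706e0",
    ]
    dest, base, disp = faulting_load(oops[0])
    regs = parse_registers(oops)
    address = regs[base.upper()] + disp
    print(f"mov {dest}, [{base}{disp:+#x}] reads {address:#x}, "
          f"CR2 is {regs['CR2']:#x}")
    print("GPF address canonical:", is_canonical(0x080EA9B8F932D2DA))
```

For the first oops, the decoded base register plus displacement equals CR2, so the pointer that `fsverity_cleanup_inode` loaded was 0x100 rather than NULL, which is why the NULL test just before the marked instruction did not catch it. That is compatible with corrupted memory, but it does not show whether the corruption came from RAM or from a kernel bug.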


#10 2021-01-02 16:58:01

jypma
Member
Registered: 2015-04-21
Posts: 13

Re: Kernel null pointer dereference in ext4 code

I've been running kernel 5.9.14-arch1-1 for a while now, with an uptime of 6 days and counting. No crashes yet; another kernel upgrade is available though, so I'll reboot soon. The original issue seems to be gone, which means it could indeed have been a bug in that particular kernel version. I'll add [SOLVED] if there are no crashes for a while.

