Kernel NULL pointer dereference

Bidski · 2020-04-02 00:18:32

I am experiencing regular lock-ups on my system. During the most recent lock-up I managed to ssh in and retrieve this from journalctl -xb

Apr 02 10:41:14 bidski-beast kernel: general protection fault: 0000 [#1] PREEMPT SMP NOPTI
Apr 02 10:41:14 bidski-beast kernel: CPU: 4 PID: 1009 Comm: irq/115-nvidia Tainted: P           OE     5.5.13-zen2-1-zen #1
Apr 02 10:41:14 bidski-beast kernel: Hardware name: Micro-Star International Co., Ltd. MS-7B79/X470 GAMING PLUS (MS-7B79), BIOS A.G0 11/11/2019
Apr 02 10:41:14 bidski-beast kernel: RIP: 0010:_nv010253rm+0x4b/0x60 [nvidia]
Apr 02 10:41:14 bidski-beast kernel: Code: 48 8b 52 10 48 3b 50 08 73 0b eb 12 0f 1f 00 48 39 50 08 77 09 48 8b 40 20 48 85 c0 75 f1 48 89 07 31 c0 c3 0f 1f 00 48 89 d0 <48> 8b 50 28 48 85 d2 75 f4 48 89 07 31 c0 c3 66 0f 1f 44 00 00 48
Apr 02 10:41:14 bidski-beast kernel: RSP: 0018:ffffae890393fc28 EFLAGS: 00010286
Apr 02 10:41:14 bidski-beast kernel: RAX: ff0f9a028411b690 RBX: ffff9a02e7ebac50 RCX: ffffffffc2944fc0
Apr 02 10:41:14 bidski-beast kernel: RDX: ff0f9a028411b690 RSI: ffff9a028411ee90 RDI: ffff9a02e7ebac50
Apr 02 10:41:14 bidski-beast kernel: RBP: ffff9a02e7ebac30 R08: 0000000000000000 R09: ffff9a02e7ebaad0
Apr 02 10:41:14 bidski-beast kernel: R10: ffffffffc1c4e080 R11: ffffffffffffffff R12: 00000000c1d00020
Apr 02 10:41:14 bidski-beast kernel: R13: ffff9a01dd2ecc08 R14: 00000000beef0003 R15: ffff9a028411ec10
Apr 02 10:41:14 bidski-beast kernel: FS:  0000000000000000(0000) GS:ffff9a031e900000(0000) knlGS:0000000000000000
Apr 02 10:41:14 bidski-beast kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 02 10:41:14 bidski-beast kernel: CR2: 00007fb3e1d85ec0 CR3: 000000067aaf4000 CR4: 00000000003406e0
Apr 02 10:41:14 bidski-beast kernel: Call Trace:
Apr 02 10:41:14 bidski-beast kernel:  ? _nv000073rm+0x8f/0x1b0 [nvidia]
Apr 02 10:41:14 bidski-beast kernel:  ? _nv011154rm+0xfc/0x180 [nvidia]
Apr 02 10:41:14 bidski-beast kernel:  ? _nv018848rm+0x1af/0x210 [nvidia]
Apr 02 10:41:14 bidski-beast kernel:  ? _nv018774rm+0xdad/0xef0 [nvidia]
Apr 02 10:41:14 bidski-beast kernel:  ? _nv018775rm+0xd9/0x240 [nvidia]
Apr 02 10:41:14 bidski-beast kernel:  ? _nv018749rm+0x72/0xd0 [nvidia]
Apr 02 10:41:14 bidski-beast kernel:  ? _nv018760rm+0x20c/0x2b0 [nvidia]
Apr 02 10:41:14 bidski-beast kernel:  ? _nv018790rm+0xa7/0xd0 [nvidia]
Apr 02 10:41:14 bidski-beast kernel:  ? _nv027074rm+0x51e/0x9f0 [nvidia]
Apr 02 10:41:14 bidski-beast kernel:  ? _nv007421rm+0x17d/0x260 [nvidia]
Apr 02 10:41:14 bidski-beast kernel:  ? _nv027081rm+0x8d/0x180 [nvidia]
Apr 02 10:41:14 bidski-beast kernel:  ? _nv000911rm+0x10e/0x180 [nvidia]
Apr 02 10:41:14 bidski-beast kernel:  ? irq_forced_thread_fn+0x80/0x80
Apr 02 10:41:14 bidski-beast kernel:  ? rm_isr_bh+0x1c/0x60 [nvidia]
Apr 02 10:41:14 bidski-beast kernel:  ? nvidia_isr_kthread_bh+0x1b/0x40 [nvidia]
Apr 02 10:41:14 bidski-beast kernel:  ? irq_thread_fn+0x20/0x60
Apr 02 10:41:14 bidski-beast kernel:  ? irq_thread+0x209/0x2b0
Apr 02 10:41:14 bidski-beast kernel:  ? irq_thread_fn+0x60/0x60
Apr 02 10:41:14 bidski-beast kernel:  ? kthread+0x131/0x170
Apr 02 10:41:14 bidski-beast kernel:  ? irq_set_irq_wake+0x1b0/0x1b0
Apr 02 10:41:14 bidski-beast kernel:  ? kthread_park+0x90/0x90
Apr 02 10:41:14 bidski-beast kernel:  ? ret_from_fork+0x22/0x40
Apr 02 10:41:14 bidski-beast kernel: Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace sunrpc fscache xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype iptable_nat n>
Apr 02 10:41:14 bidski-beast kernel:  libcrc32c videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videobuf2_common cryptd glue_helper videodev md_mod mc snd_timer syscopyarea snd realtek sysfillrect libphy rfkill soundcore sysimgblt gpio>
Apr 02 10:41:14 bidski-beast kernel: ---[ end trace 80272e4c098a8505 ]---
Apr 02 10:41:14 bidski-beast kernel: RIP: 0010:_nv010253rm+0x4b/0x60 [nvidia]
Apr 02 10:41:14 bidski-beast kernel: Code: 48 8b 52 10 48 3b 50 08 73 0b eb 12 0f 1f 00 48 39 50 08 77 09 48 8b 40 20 48 85 c0 75 f1 48 89 07 31 c0 c3 0f 1f 00 48 89 d0 <48> 8b 50 28 48 85 d2 75 f4 48 89 07 31 c0 c3 66 0f 1f 44 00 00 48
Apr 02 10:41:14 bidski-beast kernel: RSP: 0018:ffffae890393fc28 EFLAGS: 00010286
Apr 02 10:41:14 bidski-beast kernel: RAX: ff0f9a028411b690 RBX: ffff9a02e7ebac50 RCX: ffffffffc2944fc0
Apr 02 10:41:14 bidski-beast kernel: RDX: ff0f9a028411b690 RSI: ffff9a028411ee90 RDI: ffff9a02e7ebac50
Apr 02 10:41:14 bidski-beast kernel: RBP: ffff9a02e7ebac30 R08: 0000000000000000 R09: ffff9a02e7ebaad0
Apr 02 10:41:14 bidski-beast kernel: R10: ffffffffc1c4e080 R11: ffffffffffffffff R12: 00000000c1d00020
Apr 02 10:41:14 bidski-beast kernel: R13: ffff9a01dd2ecc08 R14: 00000000beef0003 R15: ffff9a028411ec10
Apr 02 10:41:14 bidski-beast kernel: FS:  0000000000000000(0000) GS:ffff9a031e900000(0000) knlGS:0000000000000000
Apr 02 10:41:14 bidski-beast kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 02 10:41:14 bidski-beast kernel: CR2: 00007fb3e1d85ec0 CR3: 000000067aaf4000 CR4: 00000000003406e0
Apr 02 10:41:14 bidski-beast kernel: BUG: kernel NULL pointer dereference, address: 0000000000000000
Apr 02 10:41:14 bidski-beast kernel: #PF: supervisor instruction fetch in kernel mode
Apr 02 10:41:14 bidski-beast kernel: #PF: error_code(0x0010) - not-present page
Apr 02 10:41:14 bidski-beast kernel: PGD 68fc15067 P4D 68fc15067 PUD 0
Apr 02 10:41:14 bidski-beast kernel: Oops: 0010 [#2] PREEMPT SMP NOPTI
Apr 02 10:41:14 bidski-beast kernel: CPU: 4 PID: 1009 Comm: irq/115-nvidia Tainted: P      D    OE     5.5.13-zen2-1-zen #1
Apr 02 10:41:14 bidski-beast kernel: Hardware name: Micro-Star International Co., Ltd. MS-7B79/X470 GAMING PLUS (MS-7B79), BIOS A.G0 11/11/2019
Apr 02 10:41:14 bidski-beast kernel: RIP: 0010:0x0
Apr 02 10:41:14 bidski-beast kernel: Code: Bad RIP value.
Apr 02 10:41:14 bidski-beast kernel: RSP: 0018:ffffae890393fe90 EFLAGS: 00010286
Apr 02 10:41:14 bidski-beast kernel: RAX: 0000000000000000 RBX: ffff9a0302dcbc80 RCX: 0000000000000000
Apr 02 10:41:14 bidski-beast kernel: RDX: ffffae890393fea0 RSI: 0000000000000000 RDI: ffffae890393fea0
Apr 02 10:41:14 bidski-beast kernel: RBP: ffff9a0302dcc3f0 R08: 0000000000000001 R09: 0000000000000001
Apr 02 10:41:14 bidski-beast kernel: R10: ffffffff9371dec8 R11: ffffae890393f900 R12: ffff9a0302dcbc80
Apr 02 10:41:14 bidski-beast kernel: R13: ffffffff93716dd0 R14: 0000000000000000 R15: ffff9a0302dcc424
Apr 02 10:41:14 bidski-beast kernel: FS:  0000000000000000(0000) GS:ffff9a031e900000(0000) knlGS:0000000000000000
Apr 02 10:41:14 bidski-beast kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 02 10:41:14 bidski-beast kernel: CR2: ffffffffffffffd6 CR3: 000000067aaf4000 CR4: 00000000003406e0
Apr 02 10:41:14 bidski-beast kernel: Call Trace:
Apr 02 10:41:14 bidski-beast kernel:  task_work_run+0x93/0xb0
Apr 02 10:41:14 bidski-beast kernel:  do_exit+0x36b/0xb60
Apr 02 10:41:14 bidski-beast kernel:  ? kthread+0x131/0x170
Apr 02 10:41:14 bidski-beast kernel:  rewind_stack_do_exit+0x17/0x20
Apr 02 10:41:14 bidski-beast kernel: Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace sunrpc fscache xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype iptable_nat n>
Apr 02 10:41:14 bidski-beast kernel:  libcrc32c videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videobuf2_common cryptd glue_helper videodev md_mod mc snd_timer syscopyarea snd realtek sysfillrect libphy rfkill soundcore sysimgblt gpio>
Apr 02 10:41:14 bidski-beast kernel: CR2: 0000000000000000
Apr 02 10:41:14 bidski-beast kernel: ---[ end trace 80272e4c098a8506 ]---
Apr 02 10:41:14 bidski-beast kernel: RIP: 0010:_nv010253rm+0x4b/0x60 [nvidia]
Apr 02 10:41:14 bidski-beast kernel: Code: 48 8b 52 10 48 3b 50 08 73 0b eb 12 0f 1f 00 48 39 50 08 77 09 48 8b 40 20 48 85 c0 75 f1 48 89 07 31 c0 c3 0f 1f 00 48 89 d0 <48> 8b 50 28 48 85 d2 75 f4 48 89 07 31 c0 c3 66 0f 1f 44 00 00 48
Apr 02 10:41:14 bidski-beast kernel: RSP: 0018:ffffae890393fc28 EFLAGS: 00010286
Apr 02 10:41:14 bidski-beast kernel: RAX: ff0f9a028411b690 RBX: ffff9a02e7ebac50 RCX: ffffffffc2944fc0
Apr 02 10:41:14 bidski-beast kernel: RDX: ff0f9a028411b690 RSI: ffff9a028411ee90 RDI: ffff9a02e7ebac50
Apr 02 10:41:14 bidski-beast kernel: RBP: ffff9a02e7ebac30 R08: 0000000000000000 R09: ffff9a02e7ebaad0
Apr 02 10:41:14 bidski-beast kernel: R10: ffffffffc1c4e080 R11: ffffffffffffffff R12: 00000000c1d00020
Apr 02 10:41:14 bidski-beast kernel: R13: ffff9a01dd2ecc08 R14: 00000000beef0003 R15: ffff9a028411ec10
Apr 02 10:41:14 bidski-beast kernel: FS:  0000000000000000(0000) GS:ffff9a031e900000(0000) knlGS:0000000000000000
Apr 02 10:41:14 bidski-beast kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 02 10:41:14 bidski-beast kernel: CR2: ffffffffffffffd6 CR3: 000000067aaf4000 CR4: 00000000003406e0
Apr 02 10:41:14 bidski-beast kernel: Fixing recursive fault but reboot is needed!

Apr 02 10:42:47 bidski-beast kernel: audit: type=1101 audit(1585784567.793:166): pid=7225 uid=0 auid=4294967295 ses=4294967295 msg='op=PAM:accounting grantors=pam_tally2,pam_access,pam_unix,pam_permit,pam_time acct="bidski" exe="/usr/bin>

The full log includes some crashes in other apps (gnome-shell is one). Here is a link to the full output from journalctl -xb
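A note on reading the first trace: the faulting instruction bytes `<48> 8b 50 28` decode to a load from RAX+0x28, and RAX holds ff0f9a028411b690. With 48-bit virtual addresses (see "address sizes" in the cpuinfo output), bits 63..47 of a valid pointer must all be equal, and a load through an address that breaks this rule raises a general protection fault rather than a page fault. The sketch below only checks that rule for two register values from the log; the function name is made up for illustration, and the result says nothing about whether the bad value came from the driver or from hardware.

```python
VIRTUAL_BITS = 48  # "48 bits virtual" from /proc/cpuinfo in this post


def is_canonical(addr: int, bits: int = VIRTUAL_BITS) -> bool:
    """Return True if bits 63..(bits-1) of addr are all 0 or all 1."""
    upper = addr >> (bits - 1)                 # sign bit plus everything above it
    all_ones = (1 << (64 - bits + 1)) - 1      # 17 one-bits for 48-bit addressing
    return upper == 0 or upper == all_ones


rax = 0xFF0F9A028411B690  # RAX/RDX at the time of the fault
rsi = 0xFFFF9A028411EE90  # RSI, a nearby kernel pointer from the same dump

for name, value in (("RAX", rax), ("RSI", rsi)):
    print(f"{name} {value:#018x} canonical={is_canonical(value)}")
```

RAX fails the check while RSI passes, which is consistent with the first report being a general protection fault and not a NULL pointer dereference.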

System information

$ uname -a 
Linux bidski-beast 5.5.13-zen2-1-zen #1 ZEN SMP PREEMPT Mon, 30 Mar 2020 20:45:45 +0000 x86_64 GNU/Linux

$ cat /proc/cpuinfo 
processor	: 0
vendor_id	: AuthenticAMD
cpu family	: 23
model		: 8
model name	: AMD Ryzen 7 2700X Eight-Core Processor
stepping	: 2
microcode	: 0x800820d
cpu MHz		: 2484.691
cache size	: 512 KB
physical id	: 0
siblings	: 16
core id		: 0
cpu cores	: 8
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb hw_pstate sme ssbd sev ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif overflow_recov succor smca
bugs		: sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass
bogomips	: 7399.10
TLB size	: 2560 4K pages
clflush size	: 64
cache_alignment	: 64
address sizes	: 43 bits physical, 48 bits virtual
power management: ts ttp tm hwpstate cpb eff_freq_ro [13] [14]
[ --- TRIMMED --- ]

$ dmesg | head
[    0.000000] Linux version 5.5.13-zen2-1-zen (linux-zen@archlinux) (gcc version 9.3.0 (Arch Linux 9.3.0-1)) #1 ZEN SMP PREEMPT Mon, 30 Mar 2020 20:45:45 +0000
[    0.000000] Command line: BOOT_IMAGE=/vmlinuz-linux-zen root=UUID=871d06cd-e1db-473b-bedf-2a4ffd3aa98e rw resume=UUID=18787b13-2746-401d-95b1-1ea1dee6bcfb rcu_nocbs=0-15 processor.max_cstate=5 idle=nomwait

The parameters "rcu_nocbs=0-15 processor.max_cstate=5 idle=nomwait" were added because of potential issues with the AMD Ryzen 2700X CPU

$ dkms status
broadcom-wl, 6.30.223.271, 5.5.13-arch2-1, x86_64: installed
broadcom-wl, 6.30.223.271, 5.5.13-zen2-1-zen, x86_64: installed
nvidia, 440.64, 5.5.13-arch2-1, x86_64: installed
nvidia, 440.64, 5.5.13-zen2-1-zen, x86_64: installed
vboxhost, 6.1.4_OSE, 5.5.13-arch2-1, x86_64: installed
vboxhost, 6.1.4_OSE, 5.5.13-zen2-1-zen, x86_64: installed

Has anyone else experienced this (or a similar) issue or knows how to debug it further?

Bidski · 2020-04-07 00:18:41

I tried to boot up again today with the hope of trying the linux-lts kernel. Both linux and linux-zen kernels are reporting a NULL pointer dereference during boot and I am unable to reach a login screen (tty or otherwise).

I loaded up a LiveUSB instead, updated my system to the latest linux and linux-zen versions (5.6.2). I still see the NULL pointer dereference for both kernels.

I had to install linux-lts 5.4.30 twice, because during the first attempt the LiveUSB experienced a general protection fault.

Here are some links to some images showing the kernel faults. Sorry for the image quality, a skilled photographer I am not.
linux-zen boot time fault
linux-lts boot time fault
LiveUSB general protection fault

It seems that the nvidia driver may have nothing to do with the issues I am experiencing?

Does anyone have any ideas about how I can fix this?

V1del · 2020-04-07 06:44:46

Looks like RAM issues, run a memtest over night.

Bidski · 2020-04-07 21:52:14

Ran memtest (memtest86-efi package from the AUR). No errors.

Bidski · 2020-04-10 06:19:07

I ran mprime from a LiveUSB for ~24 hours and got 3 or 4 errors. Is there any way to narrow down what is causing the errors?

Bidski · 2020-04-10 06:25:53

I am dual booting with Windows and Windows seems perfectly happy to boot. Is Windows just more tolerant of RAM issues or is there something specific about linux going on here?

Lone_Wolf · 2020-04-10 11:35:37

In my experience Windows tries aggressively to keep used memory as low as possible, while Linux tries to use all available memory.

Example: a system with 8 GiB memory (2x 4 GiB) with an error in the highest 512 MiB of one stick.
Windows ran for years without memory errors because it never used that part of memory.

Archlinux crashed several times in the first week of installation.
checking the sticks one at a time revealed which one had problems, memtest confirmed one stick was unreliable.

I've tried to look at the full log you posted even though the -x parameter makes it hard to read.
I'm not sure this is a memory issue though.

The 2 boot pics show the error after starting the journal service.
The liveUSB GPF also occurs when the system is trying to write something to disk.

Boot a recent liveusb*, run lsblk -f and check whether your root partition still has room.
Also use smartctl to verify health of your drives.

*The one shown in the pic looks like it may be several months old.

Last edited by Lone_Wolf (2020-04-10 11:36:43)

Bidski · 2020-04-11 00:30:36

Now running the 2020-04-01 liveusb.

Appears to be plenty of free space on my root and boot partitions

$ lsblk -f
NAME        FSTYPE            FSVER LABEL          UUID                                 FSAVAIL FSUSE% MOUNTPOINT
loop0       squashfs          4.0                                                             0   100% /run/archiso/sfs/airootfs
sda         linux_raid_member 1.2   bidski-beast:0 e4385ee6-216f-baa2-8df6-bf33c136cff1                
└─md127     ext4              1.0   RAID           b2bedd2d-384e-4663-907f-94bd3bfe4a24                
sdb         linux_raid_member 1.2   bidski-beast:0 e4385ee6-216f-baa2-8df6-bf33c136cff1                
└─md127     ext4              1.0   RAID           b2bedd2d-384e-4663-907f-94bd3bfe4a24                
sdc         linux_raid_member 1.2   bidski-beast:0 e4385ee6-216f-baa2-8df6-bf33c136cff1                
└─md127     ext4              1.0   RAID           b2bedd2d-384e-4663-907f-94bd3bfe4a24                
sdd         linux_raid_member 1.2   bidski-beast:0 e4385ee6-216f-baa2-8df6-bf33c136cff1                
└─md127     ext4              1.0   RAID           b2bedd2d-384e-4663-907f-94bd3bfe4a24                
sde                                                                                                    
├─sde1                                                                                                 
├─sde2      ntfs                                   9234DA4D34DA33C7                                    
└─sde3      ntfs                                   6234ADE634ADBD83                                    
sdf         iso9660                 ARCH_202004    2020-04-01-03-37-13-00                              
├─sdf1      iso9660                 ARCH_202004    2020-04-01-03-37-13-00                     0   100% /run/archiso/bootmnt
└─sdf2      vfat              FAT16 ARCHISO_EFI    C3DB-1B0D                                           
sr0                                                                                                    
nvme0n1                                                                                                
├─nvme0n1p1 vfat              FAT32                1266-B556                             362.6M    34% /mnt/boot
├─nvme0n1p2 ext4              1.0                  871d06cd-e1db-473b-bedf-2a4ffd3aa98e     42G    52% /mnt
├─nvme0n1p3 swap              1                    18787b13-2746-401d-95b1-1ea1dee6bcfb                
└─nvme0n1p4 ext4              1.0                  e8afd074-9161-40ab-85ba-f5c5cf216bd8  201.1G    36% /mnt/home

Root partition is on an SSD drive that doesn't appear to have any SMART support

$ smartctl --info /dev/nvme0n1 
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.5.13-arch2-1] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       Samsung SSD 970 EVO 500GB
Serial Number:                      S466NX0KC11385E
Firmware Version:                   2B2QEXE7
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 500,107,862,016 [500 GB]
Unallocated NVM Capacity:           0
Controller ID:                      4
Number of Namespaces:               1
Namespace 1 Size/Capacity:          500,107,862,016 [500 GB]
Namespace 1 Utilization:            496,132,259,840 [496 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            002538 5c81b03193
Local Time is:                      Sat Apr 11 00:23:17 2020 UTC

I will run some SMART tests on the other drives (part of a RAID that I use for data storage), but I suspect that these drives shouldn't be causing any issues (I would guess that they wouldn't be mounted until later in the boot process).

Bidski · 2020-04-11 06:52:30

I ran the following on the SSD

# smartctl -l selftest /dev/nvme0n1
# smartctl -a /dev/nvme0n1

and I ran

# smartctl -t long /dev/<device>

for all other attached HDDs. All reports show

SMART overall-health self-assessment test result: PASSED

seth · 2020-04-11 07:21:00

Did you try passing "iommu=soft" to the kernel?

Bidski · 2020-04-11 07:32:39

I haven't tried that yet, but I was just playing around with kernel boot parameters. I mentioned earlier that I had the following kernel parameters

rcu_nocbs=0-15 processor.max_cstate=5 idle=nomwait

but I just realised that I also have these ones enabled

systemd.log_level=debug acpi=off systemd.journald.forward_to_console=1 console=tty1

I can't remember at what point I added these, must have been just after I first reported issues in an attempt to figure out what was happening. Turns out, if I remove "acpi=off" from the kernel boot parameters then I can boot into arch and log in.

Should I still try adding "iommu=soft"? It's too soon to tell if I am still suffering from instability issues with my Ryzen 2700X. Would the iommu option potentially help with this? (I do vaguely recall reading about people saying that changing IOMMU and OpCache in the BIOS could help.)

seth · 2020-04-11 12:30:58

Well, see whether the nullptr derefs re-occur.
If yes, try the software iommu.
If not, don't fix it if it ain't broken ;-)

Bidski · 2020-04-12 22:44:28

Is it a bug for the kernel to have a null pointer dereference when "acpi=off" is provided as a kernel command line parameter?

Ropid · 2020-04-12 23:11:42

Bidski wrote:

Is it a bug for the kernel to have a null pointer dereference when "acpi=off" is provided as a kernel command line parameter?

Perhaps the kernel is using an area of RAM that's also used by a hardware device, and because you disabled ACPI the kernel can't know that the hardware device has reserved that area of RAM?

I noticed you use "processor.max_cstate=5". That's not doing what you think it does on a 2700X. You probably want "processor.max_cstate=1". On Ryzen, the OS can just see two C-states (or three if you count "C0"). The one that Linux thinks is "C2" seems to be what AMD calls C6.
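To see which idle states the kernel actually exposes on a given machine, the cpuidle sysfs directory can be listed. This is a minimal sketch, assuming the standard `/sys/devices/system/cpu/cpu0/cpuidle/stateN/name` and `desc` files are present; the helper name `list_cstates` is made up for illustration, and the names it prints depend on the idle driver in use.

```python
from pathlib import Path


def list_cstates(cpuidle_dir="/sys/devices/system/cpu/cpu0/cpuidle"):
    """Return (directory, name, description) for each idle state the kernel exposes."""
    base = Path(cpuidle_dir)
    if not base.is_dir():
        return []  # no cpuidle driver loaded, or not running on Linux
    states = []
    # Sort numerically so state10 does not land before state2.
    for state in sorted(base.glob("state[0-9]*"), key=lambda p: int(p.name[5:])):
        name = (state / "name").read_text().strip()
        desc_file = state / "desc"
        desc = desc_file.read_text().strip() if desc_file.exists() else ""
        states.append((state.name, name, desc))
    return states


if __name__ == "__main__":
    for directory, name, desc in list_cstates():
        print(f"{directory}: {name} ({desc})")
```

If the list stops at a state named C2, a limit of "processor.max_cstate=5" has nothing left to restrict, which is the point made above.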

Bidski · 2020-04-12 23:40:48

Ropid wrote:

Perhaps the kernel is using an area of RAM that's also used by a hardware device, and because you disabled ACPI the kernel can't know that the hardware device has reserved that area of RAM?

What is the best forum to use to ask kernel devs about this? Do any kernel devs frequent this forum?

Ropid wrote:

I noticed you use "processor.max_cstate=5". That's not doing what you think it does on a 2700X. You probably want "processor.max_cstate=1". On Ryzen, the OS can just see two C-states (or three if you count "C0"). The one that Linux thinks is "C2" seems to be what AMD calls C6.

Thanks for this. I'll be sure to try it if my system ever freezes on me again. I only added it because "rcu_nocbs=0-15" didn't seem to be working, but I found out recently that the linux-zen kernel wasn't compiled to make use of the rcu_nocbs parameter, so that may have been why it wasn't doing it.

EDIT: Although looking at it more closely it appears that the linux kernel package is also ignoring that kernel parameter. Are there any kernels in the main package repositories that recognise the rcu_nocbs kernel parameter?
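For anyone wanting to check this on their own kernel: rcu_nocbs= only has an effect when the kernel was built with CONFIG_RCU_NOCB_CPU=y. A minimal sketch of that check follows, assuming the running kernel exposes its configuration at /proc/config.gz (this needs CONFIG_IKCONFIG_PROC, which not every kernel enables); the helper names are made up for illustration.

```python
import gzip


def config_value(config_text, option):
    """Return the value of a kernel config option, 'n' if unset, None if absent."""
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith(option + "="):
            return line.split("=", 1)[1]
        if line == f"# {option} is not set":
            return "n"
    return None


def read_running_config(path="/proc/config.gz"):
    """Read the running kernel's build configuration as text."""
    with gzip.open(path, "rt") as fh:
        return fh.read()


if __name__ == "__main__":
    try:
        text = read_running_config()
    except OSError:
        print("/proc/config.gz is not available on this kernel")
    else:
        print("CONFIG_RCU_NOCB_CPU =", config_value(text, "CONFIG_RCU_NOCB_CPU"))
```

A result of "y" means the parameter is honoured; "n" or None means the kernel ignores it.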

Last edited by Bidski (2020-04-12 23:59:38)

loqs · 2020-04-12 23:57:26

See https://www.kernel.org/doc/html/latest/ … -bugs.html if you are confident it is not an issue with your motherboard's firmware.

Bidski · 2020-04-13 00:01:56

loqs wrote:

See https://www.kernel.org/doc/html/latest/ … -bugs.html if you are confident it is not an issue with your motherboard's firmware.

That is the problem though, I am not sure where the issue lies. I felt it would be better to hit a forum before I report a bug. How would I go about confirming if it is a firmware issue, a kernel issue, or some other issue?

seth · 2020-04-13 06:33:50

You could use acpi=off but disable the otherwise claimed (see dmesg, the memory map is pretty much on the top) memory using the memmap kernel parameters, https://raw.githubusercontent.com/torva … meters.txt (ACPI NVS is probably the only critical range but idk that for sure at all)
If the system remains stable, you probably encountered a memory collision.

lkml will btw. probably tell you to first remove the nvidia blob from the equation

Bidski · 2020-04-13 06:48:43

seth wrote:

You could use acpi=off but disable the otherwise claimed (see dmesg, the memory map is pretty much on the top) memory using the memmap kernel parameters, https://raw.githubusercontent.com/torva … meters.txt (ACPI NVS is probably the only critical range but idk that for sure at all)
If the system remains stable, you probably encountered a memory collision.

I am a little lost here. Here is the current memory map for my system (acpi left unspecified on the command line, so whatever the default is). What are you suggesting I do with the memmap kernel parameter?

BIOS-provided physical RAM map:
BIOS-e820: [mem 0x0000000000000000-0x000000000009ffff] usable
BIOS-e820: [mem 0x00000000000a0000-0x00000000000fffff] reserved
BIOS-e820: [mem 0x0000000000100000-0x0000000009d7ffff] usable
BIOS-e820: [mem 0x0000000009d80000-0x0000000009ffffff] reserved
BIOS-e820: [mem 0x000000000a000000-0x000000000a1fffff] usable
BIOS-e820: [mem 0x000000000a200000-0x000000000a20afff] ACPI NVS
BIOS-e820: [mem 0x000000000a20b000-0x000000000affffff] usable
BIOS-e820: [mem 0x000000000b000000-0x000000000b01ffff] reserved
BIOS-e820: [mem 0x000000000b020000-0x00000000bba65fff] usable
BIOS-e820: [mem 0x00000000bba66000-0x00000000bbbd4fff] reserved
BIOS-e820: [mem 0x00000000bbbd5000-0x00000000bbd57fff] usable
BIOS-e820: [mem 0x00000000bbd58000-0x00000000bc16ffff] ACPI NVS
BIOS-e820: [mem 0x00000000bc170000-0x00000000bcfaafff] reserved
BIOS-e820: [mem 0x00000000bcfab000-0x00000000bd04bfff] type 20
BIOS-e820: [mem 0x00000000bd04c000-0x00000000beffffff] usable
BIOS-e820: [mem 0x00000000bf000000-0x00000000bfffffff] reserved
BIOS-e820: [mem 0x00000000f8000000-0x00000000fbffffff] reserved
BIOS-e820: [mem 0x00000000fd000000-0x00000000ffffffff] reserved
BIOS-e820: [mem 0x0000000100000000-0x000000083f37ffff] usable

seth wrote:

lkml will btw. probably tell you to first remove the nvidia blob from the equation

lkml??

I assume I would be aiming to boot to tty rather than a graphical target?

seth · 2020-04-13 07:44:57

https://en.wikipedia.org/wiki/Linux_kernel_mailing_list
The target doesn't matter, if you want to avoid the nvidia blob, you'd have to uninstall it (just blacklisting the nvidia module will be a problem because the nvidia package blacklists nouveau)

BIOS-e820: [mem 0x000000000a200000-0x000000000a20afff] ACPI NVS
…
BIOS-e820: [mem 0x00000000bbd58000-0x00000000bc16ffff] ACPI NVS

*Assuming* those are the only relevant lines for your issue

memmap=0xb000#0xa200000,0x418000#0xbbd58000

Pass this to the boot loader command line (in case anything goes wrong - typo, miscalculation - this might simply fail the boot, so you don't want it on your disk where you have to fix it offline)
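Since a miscalculation can fail the boot, it may be easier to derive the nn#ss pairs mechanically from the BIOS-e820 lines. The sketch below is only an illustration: the helper name `memmap_for` is made up, and it assumes the inclusive end addresses that the e820 printout uses, so each length is end - start + 1.

```python
import re

# Matches e.g. "BIOS-e820: [mem 0x000000000a200000-0x000000000a20afff] ACPI NVS"
E820_LINE = re.compile(r"\[mem (0x[0-9a-f]+)-(0x[0-9a-f]+)\]\s+(.+)$")


def memmap_for(e820_text, region_type="ACPI NVS", marker="#"):
    """Build a memmap= parameter covering every e820 range of the given type."""
    parts = []
    for line in e820_text.splitlines():
        match = E820_LINE.search(line)
        if not match or match.group(3).strip() != region_type:
            continue
        start = int(match.group(1), 16)
        end = int(match.group(2), 16)
        size = end - start + 1  # e820 end addresses are inclusive
        parts.append(f"{size:#x}{marker}{start:#x}")
    return "memmap=" + ",".join(parts)


e820 = """\
BIOS-e820: [mem 0x000000000a000000-0x000000000a1fffff] usable
BIOS-e820: [mem 0x000000000a200000-0x000000000a20afff] ACPI NVS
BIOS-e820: [mem 0x00000000bbbd5000-0x00000000bbd57fff] usable
BIOS-e820: [mem 0x00000000bbd58000-0x00000000bc16ffff] ACPI NVS
"""
print(memmap_for(e820))  # memmap=0xb000#0xa200000,0x418000#0xbbd58000
```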

Bidski · 2020-04-13 22:59:22

seth wrote:

BIOS-e820: [mem 0x000000000a200000-0x000000000a20afff] ACPI NVS
…
BIOS-e820: [mem 0x00000000bbd58000-0x00000000bc16ffff] ACPI NVS

memmap=0xb000#0xa200000,0x418000#0xbbd58000

I haven't managed to test it yet, but this seems wrong?

From https://raw.githubusercontent.com/torva … meters.txt

	memmap=nn[KMG]@ss[KMG]
			[KNL] Force usage of a specific region of memory.
			Region of memory to be used is from ss to ss+nn.
			If @ss[KMG] is omitted, it is equivalent to mem=nn[KMG],
			which limits max address to nn[KMG].
			Multiple different regions can be specified,
			comma delimited.
			Example:
				memmap=100M@2G,100M#3G,1G!1024G

	memmap=nn[KMG]#ss[KMG]
			[KNL,ACPI] Mark specific memory as ACPI data.
			Region of memory to be marked is from ss to ss+nn.

This makes me feel that I would need to convert the hexadecimal ranges to decimal values of KiB, MiB, or GiB?

So the range

BIOS-e820: [mem 0x000000000a200000-0x000000000a20afff] ACPI NVS

would seem to be equivalent to 162M-162.042967796M, and

BIOS-e820: [mem 0x00000000bbd58000-0x00000000bc16ffff] ACPI NVS

would be 2.934906006G-2.938903808G, so would the kernel parameter be this then?

memmap=162M#44K,3G#4M

seth · 2020-04-14 07:51:10

No, memmap should™ take the hex values just fine ("[KMG]" is optional by "[]") and I'd always go with those numbers rather than some rounding errors (the KMG notation is just more convenient for other scenarios where you want to limit your RAM etc.)
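To make the equivalence concrete, here is a rough Python imitation of how such a value is read: an integer in any base, optionally followed by K, M or G as a power-of-two multiplier. It is a sketch only (the real parser lives in the kernel and also accepts larger suffixes), and the function name is borrowed for illustration.

```python
SUFFIX_SHIFT = {"K": 10, "M": 20, "G": 30}


def memparse(text):
    """Parse '44K', '162M', '0xb000' and similar into a byte count."""
    text = text.strip()
    shift = 0
    if text and text[-1].upper() in SUFFIX_SHIFT:
        shift = SUFFIX_SHIFT[text[-1].upper()]
        text = text[:-1]
    # Base 0 lets int() accept both decimal and 0x-prefixed hex.
    return int(text, 0) << shift


print(memparse("162M") == 0xA200000)       # True: the first start is exactly 162M
print(memparse("44K") == 0xB000)           # True: its length is exactly 44K
print(memparse("3G") == 0xBBD58000)        # False: rounding to 3G misses the range
print(memparse("3077472K") == 0xBBD58000)  # True, but hex is easier to get right
```

Note also the order in the documentation quoted above: the length comes first and the start address second (nn#ss), so the first range in suffix notation would be 44K#162M.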
