#1 2025-01-30 23:53:55

gen2arch
Member
Registered: 2013-05-16
Posts: 201

Zen kernel regression

Hello,

I am trying to debug a problem with my Arch installation on a Threadripper Pro system with Nvidia GPUs.
Kernel and driver (nvidia-open-dkms) versions are: nvidia/565.77, 6.12.10-arch1-1, x86_64 and nvidia/565.77, 6.12.10-zen1-1-zen.
The system boots via SecureBoot signed UKIs.
This is a fully LUKS encrypted root.
The Zen UKIs are built from the same config files as the other kernels (/etc/mkinitcpio.conf and /etc/kernel/cmdline). /etc/mkinitcpio.d/linux-zen.preset points to the correct kernel image (/boot/vmlinuz-linux-zen).
There are no errors during mkinitcpio's generation of the arch-linux-zen UKI.
I examined the resulting arch-linux-zen.efi with objcopy / cpio: everything necessary seems to be included (the four nvidia drivers, crypto / dm stuff, nvme driver). Can't see what might be missing in the UKI.

The boot fails immediately (right after "Finished virtual console setup"). Unfortunately, there is not much to see on the screen other than the kernel panic code. Afterwards I see "Failed to start Load Kernel Modules". There is no trace of these failed boots in the journal (journalctl -k --boot=-1).
I tried disabling everything nvidia-related in the "MODULES=" line of /etc/mkinitcpio.conf and rebuilt the zen UKI, but the boot still fails.
Is there something essentially different about arch-linux-zen? Does it need different kernel parameters, firmware, or modules?

Any help debugging this is greatly appreciated.

thanks

Last edited by gen2arch (2025-02-09 17:15:25)

#2 2025-01-31 01:11:30

V1del
Forum Moderator
Registered: 2012-10-16
Posts: 23,963

Re: Zen kernel regression

Are linux-zen-headers installed so the dkms build can actually happen? Output of

dkms status

?

#3 2025-01-31 09:18:15

gen2arch
Member
Registered: 2013-05-16
Posts: 201

Re: Zen kernel regression

Good morning, yes: the header packages of all three kernels are installed (linux-headers, linux-lts-headers, linux-zen-headers).

dkms status
nvidia/565.77, 6.12.10-arch1-1, x86_64: installed
nvidia/565.77, 6.12.10-zen1-1-zen, x86_64: installed
nvidia/565.77, 6.6.72-1-lts, x86_64: installed

As mentioned: a minimal initrd with just

MODULES=(nls_cp437 nls_ascii vfat ext2)

also fails to boot.
I guess the boot fails in the "Cryptography Setup" phase. This is weird, as the zen UKI uses the exact same kernel command line file (containing the crypto stuff for sd-encrypt) as the other kernels.
The "/etc/kernel/cmdline" looks like this:

rd.luks.name=d1f25416-5ee6-4155-ae07-28a27ade0db8=cryptroot rd.luks.key=d1f25416-5ee6-4155-ae07-28a27ade0db8=/keyfile:UUID=082b3d74-e0e2-41d3-bdbd-359e3562861e rd.luks.options=discard,no_read_workqueue,no_write_workqueue,keyfile-timeout=5s root=/dev/mapper/cryptroot rootflags=subvol=@ rw nomodeset nouveau.modeset=0  nvidia_drm.modeset=1 nvidia.NVreg_PreserveVideoMemoryAllocations=1

I forgot to mention that the EFI system directory is not "/boot". Instead, the existing EFI directory from a Windows install on the same machine is mounted at "/efi/EFI", which in turn contains the "Linux" directory where my UKIs reside.
Is the Zen kernel doing something different with regard to the kernel cmd line?

Last edited by gen2arch (2025-01-31 09:21:38)

#4 2025-01-31 12:46:09

gen2arch
Member
Registered: 2013-05-16
Posts: 201

Re: Zen kernel regression

In the meantime I booted with "loglevel=7" on the kernel command line (and with all nvidia stuff disabled), which gives slightly more info:
it seems that the Zen kernel is unable to decrypt and mount the encrypted root filesystem, and that is where the boot fails. It has nothing to do with the nvidia drivers, which I suspected in the first place.
"Starting Cryptography Setup for cryptroot" is the step where it fails. The last messages are "Key Type trusted registered" and "Key Type encrypted registered".
Immediately afterwards things go south with: "Oops: general protection fault, maybe for address 0x0"

#5 2025-01-31 13:27:11

gen2arch
Member
Registered: 2013-05-16
Posts: 201

Re: Zen kernel regression

I can now confirm that there is a regression in the Zen kernel: I installed 6.11.6-zen1-1-zen from 2024-11-01 (I chose this one as it was about the date of a previous Arch install on the same machine), and the system boots without any issues.
So something happened between zen 6.11.6 and 6.12.10 that makes the boot fail where it didn't before.

#6 2025-01-31 14:20:06

loqs
Member
Registered: 2014-03-06
Posts: 18,277

Re: Zen kernel regression

Can you reproduce the issue with the linux package? Can you post a screenshot of the oops? Please edit your last post instead of creating a new one if yours is the last post in a thread.

#7 2025-02-01 08:58:52

gen2arch
Member
Registered: 2013-05-16
Posts: 201

Re: Zen kernel regression

loqs wrote:

Can you reproduce the issue with the linux package?  Can you post a screenshot of the oops?

No, there never was an issue with the regular Arch linux package! Boot failures only occurred with linux-zen.
I kind of "bisected" the linux-zen and linux-zen-headers packages from https://archive.archlinux.org/packages/l/:
the last working version is 6.12.7-zen1-1-zen dating from 27-Dec-2024 14:49
the first failing version is 6.12.8-zen1-1-zen dating from 03-Jan-2025 05:12 (hangover? :-)
Unfortunately I cannot really assess what was introduced in 6.12.8, but it clearly leads to problems on certain machines.
As to screenshots: it is obviously not possible from the failing machine itself, but even with a slo-mo video of the boot one can hardly read anything as it happens so fast; I can try again though. What would be a good site to upload?
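The package "bisection" described above can be sketched as a binary search over an ordered list of builds. This is only an illustration: the `build` struct, the `first_failing` function, and the `tested` table are hypothetical names, and the table holds only the results reported so far in this thread, not a complete list of linux-zen releases.

```c
#include <stdbool.h>
#include <stddef.h>

/* One tested package version and whether that kernel booted. */
struct build {
	const char *version;
	bool boots;
};

/*
 * Binary search for the first failing build. Assumes builds[] is ordered
 * oldest to newest, builds[0] boots, builds[n - 1] fails, and the failure
 * never disappears again once it has been introduced.
 * Returns the index of the first failing build.
 */
size_t first_failing(const struct build *builds, size_t n)
{
	size_t good = 0;     /* newest index known to boot */
	size_t bad = n - 1;  /* oldest index known to fail */

	while (bad - good > 1) {
		size_t mid = good + (bad - good) / 2;

		if (builds[mid].boots)
			good = mid;
		else
			bad = mid;
	}
	return bad;
}

/* Results reported in this thread only; not a complete release list. */
const struct build tested[] = {
	{ "6.11.6-zen1-1-zen",  true  },
	{ "6.12.7-zen1-1-zen",  true  },
	{ "6.12.8-zen1-1-zen",  false },
	{ "6.12.10-zen1-1-zen", false },
};
```

In practice each "boots" entry is the result of installing that package from the archive and rebooting, so the search is done by hand; the sketch only shows why a handful of reboots is enough to narrow down the first bad release.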

#8 2025-02-01 22:13:36

loqs
Member
Registered: 2014-03-06
Posts: 18,277

Re: Zen kernel regression

Comparing v6.12.7-zen1...v6.12.8-zen1: as the issue is not triggered in the linux package, I would look at the merge commit https://github.com/zen-kernel/zen-kerne … 67740a46a5 and the 12 commits it includes, or at the last 5 commits, as all the others should have been included in 6.12.8-arch1-1.

#9 2025-02-02 00:26:14

gen2arch
Member
Registered: 2013-05-16
Posts: 201

Re: Zen kernel regression

loqs wrote:

Comparing v6.12.7-zen1...v6.12.8-zen1: as the issue is not triggered in the linux package, I would look at the merge commit https://github.com/zen-kernel/zen-kerne … 67740a46a5 and the 12 commits it includes, or at the last 5 commits, as all the others should have been included in 6.12.8-arch1-1.

Thanks, that's interesting and very likely the culprit. Here is why: this commit is a patch for Zen3+ processors, which is exactly what we have (AMD Ryzen Threadripper PRO 5975WX)! In addition, we installed Arch with exactly the same procedure, including the Zen kernel, on an Intel notebook, and on that machine the zen kernel caused no disruption whatsoever.

So it is very likely that this cache-optimizing merge leads to the boot failure. But it is very unlikely that nobody else using the Zen kernel on an AMD Zen3+ system has run into this! There must be special conditions that prevent our system from booting correctly, and I cannot see what they might be.

Last edited by gen2arch (2025-02-02 00:27:01)

#10 2025-02-02 10:41:37

BS86
Member
Registered: 2022-11-03
Posts: 39

Re: Zen kernel regression

gen2arch wrote:

Thanks, that's interesting and very likely the culprit. Here is why: this commit is a patch for Zen3+ processors, which is exactly what we have (AMD Ryzen Threadripper PRO 5975WX)! In addition, we installed Arch with exactly the same procedure, including the Zen kernel, on an Intel notebook, and on that machine the zen kernel caused no disruption whatsoever.

So it is very likely that this cache-optimizing merge leads to the boot failure. But it is very unlikely that nobody else using the Zen kernel on an AMD Zen3+ system has run into this! There must be special conditions that prevent our system from booting correctly, and I cannot see what they might be.

Just to chime in: I am almost always using the Zen kernel on a 7700X (so no 3D cache and no multi-CCD) and haven't run into any issues. Can you disable CCDs in your BIOS and check whether the bug also triggers with only one CCD active? Your 5975WX has 4 CCDs with 8 cores each.

Last edited by BS86 (2025-02-02 10:45:20)

#11 2025-02-02 15:11:57

loqs
Member
Registered: 2014-03-06
Posts: 18,277

#12 2025-02-02 15:52:24

gen2arch
Member
Registered: 2013-05-16
Posts: 201

Re: Zen kernel regression

BS86 wrote:
gen2arch wrote:

Thanks, that's interesting and very likely the culprit. Here is why: this commit is a patch for Zen3+ processors, which is exactly what we have (AMD Ryzen Threadripper PRO 5975WX)! In addition, we installed Arch with exactly the same procedure, including the Zen kernel, on an Intel notebook, and on that machine the zen kernel caused no disruption whatsoever.

So it is very likely that this cache-optimizing merge leads to the boot failure. But it is very unlikely that nobody else using the Zen kernel on an AMD Zen3+ system has run into this! There must be special conditions that prevent our system from booting correctly, and I cannot see what they might be.

Just to chime in: I am almost always using the Zen kernel on a 7700X (so no 3D cache and no multi-CCD) and haven't run into any issues. Can you disable CCDs in your BIOS and check whether the bug also triggers with only one CCD active? Your 5975WX has 4 CCDs with 8 cores each.

Thanks for the input, but I could not find anything in the BIOS that looks like a CCD setting. Then again, this mainboard's BIOS and its manual are so grotesquely under-documented that the setting may well go by a different name.

Last edited by gen2arch (2025-02-02 16:20:59)

#13 2025-02-02 15:59:59

gen2arch
Member
Registered: 2013-05-16
Posts: 201

Re: Zen kernel regression

Wow, can't believe it! In fact, with this commit reverted, the kernel boots without any issues! Thanks for tracking this down.
Can you say in somewhat more plain English what this commit does? I couldn't make much sense of it from the Phoronix coverage, other than that it brings considerable speed gains on certain workloads.
And in particular: how might this commit affect the decryption / mount of the encrypted root, because that is where the kernel fails?
Our machine is uniquely dedicated to ML workloads, is "invlpg" relevant for that?
Thanks

#14 2025-02-02 19:30:13

loqs
Member
Registered: 2014-03-06
Posts: 18,277

Re: Zen kernel regression

gen2arch wrote:

Can you say in somewhat more plain English what this commit does?

It allows one CPU to tell the other CPUs to invalidate their cached address translations (TLB entries) for some memory and reload them, and then to continue without waiting for the other CPUs to respond. Currently the CPU waits for acknowledgement.

gen2arch wrote:

And in particular: how might this commit affect the decryption / mount of the encrypted root, because that is where the kernel fails?
Our machine is uniquely dedicated to ML workloads, is "invlpg" relevant for that?

So no idea on either of those specific points.

Can you still reproduce the issue with linux-zen 6.13.1.zen1-1, which is currently in extra-testing? If so, the next step would be to add the "AMD broadcast TLB invalidation" patch series to the non-zen linux package, see whether that introduces the issue, and if so report it upstream.

#15 2025-02-02 23:25:48

gen2arch
Member
Registered: 2013-05-16
Posts: 201

Re: Zen kernel regression

loqs wrote:

Can you still reproduce the issue with linux-zen 6.13.1.zen1-1, which is currently in extra-testing? If so, the next step would be to add the "AMD broadcast TLB invalidation" patch series to the non-zen linux package, see whether that introduces the issue, and if so report it upstream.

Unfortunately yes for the first of these tests: 6.13.1.zen1-1 fails to boot, like every other zen kernel since 6.12.8!
As for the second, I will report back as soon as I get a chance to recompile the vanilla kernel with the TLB invalidation series applied.

#16 2025-02-03 01:25:25

loqs
Member
Registered: 2014-03-06
Posts: 18,277

Re: Zen kernel regression

6.13.1-arch1 with https://github.com/zen-kernel/zen-kerne … 78a65e6ccc applied:
linux-6.13.1.arch1-1.1-x86_64.pkg.tar.zst/linux-headers-6.13.1.arch1-1.1-x86_64.pkg.tar.zst.
Edit:
If the above is broken please try the following which applies my attempt to port v7 of the series to 6.13.1:
linux-6.13.1.arch1-1.2-x86_64.pkg.tar.zst/linux-headers-6.13.1.arch1-1.2-x86_64.pkg.tar.zst.

Last edited by loqs (2025-02-03 02:28:56)

#17 2025-02-03 11:27:59

gen2arch
Member
Registered: 2013-05-16
Posts: 201

Re: Zen kernel regression

Thanks loqs for providing these.
I installed both versions: both fail, impossible to boot with either.
Still, the failure may provide further valuable info. Now that I know what to look for, I took a screenshot of the last messages following the kernel panic. It reads:

RIP:  0010:flush_tlb_kernel_range+0xba/0x140 

So this seems to indicate that the TLB commit is the cause of the boot failure.
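For readers unfamiliar with the format of that line: after the code segment selector (0010), the kernel prints the function the CPU was executing, the offset into that function, and the function's total size, all in hex. The small parser below is only a sketch to make the three fields explicit; `rip_location` and `parse_rip_location` are hypothetical names, not part of any kernel tool.

```c
#include <stdio.h>

/* Parsed form of "symbol+0xOFFSET/0xSIZE" as printed after "RIP: 0010:". */
struct rip_location {
	char symbol[64];
	unsigned long offset; /* bytes from the start of the function */
	unsigned long size;   /* total size of the function in bytes */
};

/* Returns 0 on success, -1 if the text does not match the expected shape. */
int parse_rip_location(const char *text, struct rip_location *out)
{
	/* %63[^+] reads the symbol name up to the '+'; %lx accepts a 0x prefix. */
	int fields = sscanf(text, "%63[^+]+%lx/%lx",
			    out->symbol, &out->offset, &out->size);

	return fields == 3 ? 0 : -1;
}
```

Applied to "flush_tlb_kernel_range+0xba/0x140", this yields the symbol flush_tlb_kernel_range with offset 0xba and size 0x140, i.e. the fault happened 186 bytes into a 320-byte function.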

#18 2025-02-03 14:32:13

loqs
Member
Registered: 2014-03-06
Posts: 18,277

Re: Zen kernel regression

I would suggest you open an issue on the Arch gitlab instance against the linux-zen package.

linux-6.13.1.arch1 with the first 5 parts of https://lore.kernel.org/all/20250123042 … rriel.com/ which are all preparation patches for the TLB changes.  Does this trigger the issue?
linux-6.13.1.arch1-1.3-x86_64.pkg.tar.zst/linux-headers-6.13.1.arch1-1.3-x86_64.pkg.tar.zst

#19 2025-02-03 16:29:54

gen2arch
Member
Registered: 2013-05-16
Posts: 201

Re: Zen kernel regression

loqs wrote:

linux-6.13.1.arch1 with the first 5 parts of https://lore.kernel.org/all/20250123042 … rriel.com/ which are all preparation patches for the TLB changes.  Does this trigger the issue?
linux-6.13.1.arch1-1.3-x86_64.pkg.tar.zst/linux-headers-6.13.1.arch1-1.3-x86_64.pkg.tar.zst

No, this kernel boots normally:

uname -a
Linux lab 6.13.1-arch1-1.3 #1 SMP PREEMPT_DYNAMIC Mon, 03 Feb 2025 14:06:27 +0000 x86_64 GNU/Linux

I opened an issue on the Zen kernel GitHub.

Last edited by gen2arch (2025-02-03 17:15:28)

#20 2025-02-03 19:12:27

loqs
Member
Registered: 2014-03-06
Posts: 18,277

Re: Zen kernel regression

linux-6.13.1.arch1 with the first 6 parts of https://lore.kernel.org/all/20250123042 … rriel.com/; the sixth part touches flush_tlb_kernel_range, which was reported in the RIP. Does this trigger the issue?
linux-6.13.1.arch1-1.4-x86_64.pkg.tar.zst/linux-headers-6.13.1.arch1-1.4-x86_64.pkg.tar.zst.

#21 2025-02-03 21:30:46

gen2arch
Member
Registered: 2013-05-16
Posts: 201

Re: Zen kernel regression

loqs wrote:

linux-6.13.1.arch1 with the first 6 parts of https://lore.kernel.org/all/20250123042 … rriel.com/; the sixth part touches flush_tlb_kernel_range, which was reported in the RIP. Does this trigger the issue?
linux-6.13.1.arch1-1.4-x86_64.pkg.tar.zst/linux-headers-6.13.1.arch1-1.4-x86_64.pkg.tar.zst.

Yep, this one fails! With about the same error messages as during the previous boot failures.

#22 2025-02-04 01:50:33

loqs
Member
Registered: 2014-03-06
Posts: 18,277

Re: Zen kernel regression

Same as the previous build, with the following from https://github.com/zen-kernel/zen-kerne … 2632522151 applied:

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 1bd1947ef..fde4324ac 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -1066,6 +1066,10 @@ static bool broadcast_kernel_range_flush(struct flush_tlb_info *info)
 
 	for (addr = info->start; addr < info->end; addr += nr << PAGE_SHIFT) {
 		nr = min((info->end - addr) >> PAGE_SHIFT, invlpgb_count_max);
+		if (!nr) {
+			WARN_ONCE(1, "zero length flush remaining: start %lu, end %lu\n", info->start, info->end);
+			break;
+		}
 		invlpgb_flush_addr_nosync(addr, nr);
 	}
 	tlbsync();

linux-6.13.1.arch1-1.5-x86_64.pkg.tar.zst/linux-headers-6.13.1.arch1-1.5-x86_64.pkg.tar.zst
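To illustrate what the added guard protects against, here is a minimal user-space model of the loop arithmetic only; it is not kernel code and issues no flushes. The 4 KiB page size, the `flush_walk` struct, and the `simulate_broadcast_flush` function are assumptions made for this sketch, and `count_max` merely stands in for `invlpgb_count_max`. When less than one full page remains, `(end - addr) >> PAGE_SHIFT` is 0, so `nr` is 0 and `addr` would never advance.

```c
#include <stdbool.h>
#include <stddef.h>

#define PAGE_SHIFT 12 /* assumption: 4 KiB pages */

/* Result of walking a range the way the loop in the patch does. */
struct flush_walk {
	size_t batches;   /* number of batched flushes that would be issued */
	bool zero_length; /* true if a step with nr == 0 was hit */
};

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/*
 * Hypothetical model of the loop arithmetic; count_max stands in for
 * invlpgb_count_max. No TLB flush is performed here.
 */
struct flush_walk simulate_broadcast_flush(unsigned long start,
					   unsigned long end,
					   unsigned long count_max)
{
	struct flush_walk w = { 0, false };
	unsigned long addr, nr = 0;

	for (addr = start; addr < end; addr += nr << PAGE_SHIFT) {
		nr = min_ul((end - addr) >> PAGE_SHIFT, count_max);
		if (!nr) {
			/* Less than one full page left: without this break
			 * the loop would never advance addr. */
			w.zero_length = true;
			break;
		}
		w.batches++;
	}
	return w;
}
```

With a page-aligned range such as 0x1000 to 0x5000 the walk finishes in one batch. With an end of 0x5800, one batch covers the four full pages and the remaining half page produces a zero-length step, which is the case the WARN_ONCE in the patch is meant to catch.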

#23 2025-02-05 07:04:21

loqs
Member
Registered: 2014-03-06
Posts: 18,277

Re: Zen kernel regression

This build replaces the patch from the previous build with:

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 1bd1947ef..8c462db58 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -973,8 +973,13 @@ static struct flush_tlb_info *get_flush_tlb_info(struct mm_struct *mm,
 	BUG_ON(this_cpu_inc_return(flush_tlb_info_idx) != 1);
 #endif
 
-	info->start		= start;
-	info->end		= end;
+	/*
+	* Round the start and end addresses to the page size specified
+	* by the stride shift. This ensures partial pages at the end of
+	* a range get fully invalidated.
+	*/
+	info->start             = round_down(start, 1 << stride_shift);
+	info->end               = round_up(end, 1 << stride_shift);
 	info->mm		= mm;
 	info->stride_shift	= stride_shift;
 	info->freed_tables	= freed_tables;

linux-6.13.1.arch1-1.6-x86_64.pkg.tar.zst/linux-headers-6.13.1.arch1-1.6-x86_64.pkg.tar.zst
Edit:
linux-6.13.1.arch1 with AMD-broadcast-TLB-invalidation V8 applied:
linux-6.13.1.arch1-1.7-x86_64.pkg.tar.zst/linux-headers-6.13.1.arch1-1.7-x86_64.pkg.tar.zst
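The effect of the rounding in the 1.6 patch above can be shown with a small user-space sketch. The `ROUND_DOWN`/`ROUND_UP` macros here are stand-ins for the kernel's `round_down()`/`round_up()` and, like them, assume a power-of-two alignment; `flush_range`, `round_flush_range`, and `whole_strides` are hypothetical names used only for illustration.

```c
/* User-space stand-ins for the kernel's round_down()/round_up();
 * both require align to be a power of two. */
#define ROUND_DOWN(x, align) ((x) & ~((align) - 1UL))
#define ROUND_UP(x, align)   (((x) + (align) - 1UL) & ~((align) - 1UL))

struct flush_range {
	unsigned long start;
	unsigned long end;
};

/* Widen [start, end) so that both ends sit on a stride boundary. */
struct flush_range round_flush_range(unsigned long start, unsigned long end,
				     unsigned int stride_shift)
{
	unsigned long stride = 1UL << stride_shift;
	struct flush_range r = {
		.start = ROUND_DOWN(start, stride),
		.end   = ROUND_UP(end, stride),
	};
	return r;
}

/* Number of whole strides in an already rounded range. */
unsigned long whole_strides(struct flush_range r, unsigned int stride_shift)
{
	return (r.end - r.start) >> stride_shift;
}
```

For example, with a stride shift of 12 the range 0x1234 to 0x5800 becomes 0x1000 to 0x6000, which is exactly five pages, so a loop that advances in whole pages no longer ends on a partial page.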

Last edited by loqs (2025-02-05 07:34:51)

#24 2025-02-05 10:12:38

gen2arch
Member
Registered: 2013-05-16
Posts: 201

Re: Zen kernel regression

Thanks again loqs: your 1.7 is the first version that results in perfectly normal boots!

#25 2025-02-09 17:19:17

gen2arch
Member
Registered: 2013-05-16
Posts: 201

Re: Zen kernel regression

Unfortunately, with 6.13.1-zen3-2-zen the problem persists, although it no longer results in a boot failure; see the discussion on GitHub.

Last edited by gen2arch (2025-02-09 17:21:27)
