After upgrading to linux-lts 6.6.39-1 today I started encountering watchdog lockups at boot:
: running early hook [udev]
: starting systemd-udevd version 256.2-1-arch
: running hook [udev]
: Triggering uevents...
[ 34.055126] watchdog: Watchdog detected hard LOCKUP on cpu 10
[ 34.524386] watchdog: Watchdog detected hard LOCKUP on cpu 12
[ 35.580115] watchdog: Watchdog detected hard LOCKUP on cpu 4
[ 38.755618] watchdog: Watchdog detected hard LOCKUP on cpu 1
[ 66.433586] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 66.433602] rcu:     10-...0: (1 GPs behind) idle=ad24/1/0x4000000000000000 softirq=113/114 fqs=5994
[ 66.433622] rcu:     12-...0: (1 GPs behind) idle=bb84/1/0x4000000000000002 softirq=115/115 fqs=5994
[ 66.433642] rcu:     (detected by 5, t=18002 jiffies, g=-875, q=1394 ncpus=16)
[Warning: OCRd from a photo, but all looks correct]
I was able to isolate this to the linux-lts update: I can reproduce it by taking my running system on linux-lts 6.6.38-1 and upgrading only the kernel package to 6.6.39-1 (via ZFS rootfs snapshots). Trying the mainline linux 6.9 kernel is not currently an option, as I rely on zfs-dkms for root.
I did try noacpi, nomodeset, iommu=off to no avail.
Ryzen 2700X, ASRock X370 Taichi - been running Linux perfectly fine since 2018 on this hardware until this kernel version. Two other Intel based systems upgraded fine with an otherwise similar configuration/installation.
I'll try again when 6.6.40 is out, but figured I'd post this in the off chance someone else has seen something similar.
Last edited by ScottE (2024-07-15 22:17:17)
This sounds like a possible regression, which should be bisected and reported to the stable team (which maintains the tree for linux-lts).
Are you confident in doing that on your own or should I supply you some prebuilt images?
The bisection will be roughly 6 steps, so it's not a lot of stuff to test.
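The step estimate comes from how git bisect works: each test boot checks the midpoint of the remaining suspect range and halves it, so N candidate commits need about ceil(log2 N) test boots (git itself reports this as "roughly N steps" while bisecting). A minimal sketch of that arithmetic, assuming you already know the number of candidate commits; the function name `bisect_steps` is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: estimate how many test boots a git bisect needs,
# i.e. ceil(log2(N)) for N candidate commits.
bisect_steps() {
  n=$1
  steps=0
  span=1
  # Double the span covered by "steps" halvings until it reaches N.
  while [ "$span" -lt "$n" ]; do
    span=$((span * 2))
    steps=$((steps + 1))
  done
  echo "$steps"
}
```

In a checkout of the stable tree, the candidate count could be taken from `git rev-list --count v6.6.38..v6.6.39` and passed to the function; the result is only an approximation of what git reports.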
Are you confident in doing that on your own or should I supply you some prebuilt images?
Thanks - I appreciate the offer. It's been close to a decade since I've built my own kernels, it's just not something I've had to do recently, so I'd appreciate some help with images to help bisect. Thanks again!
Please test the following image and report whether it works or not:
sudo pacman -U https://pkgbuild.com/~gromit/linux-bisection-kernels/linux-lts-v6.6.38.r68.ge536e6e-1-x86_64.pkg.tar.zst
I had some issues getting dkms to build, but I rebooted anyway. I knew the system wouldn't boot without a root filesystem, but this hang happens early, before filesystems are mounted, so I think that's OK, if not ideal.
Result: The system hung on v6.6.38.r68.ge536e6e-1 with a hard lockup.
I had some issues getting dkms to build, but I rebooted anyway, knowing the system won't boot without a root filesystem,
You can install the matching headers with:
sudo pacman -U https://pkgbuild.com/~gromit/linux-bisection-kernels/linux-lts-v6.6.38.r68.ge536e6e-1-x86_64.pkg.tar.zst https://pkgbuild.com/~gromit/linux-bisection-kernels/linux-lts-headers-v6.6.38.r68.ge536e6e-1-x86_64.pkg.tar.zst
Please try the following:
sudo pacman -U https://pkgbuild.com/~gromit/linux-bisection-kernels/linux-lts-v6.6.38.r34.g3f25b5f-1-x86_64.pkg.tar.zst
Thanks for the headers, loqs - I should have guessed that.
Result for v6.6.38.r34.g3f25b5f-1: Hard lockup.
Please test the following:
sudo pacman -U https://pkgbuild.com/~gromit/linux-bisection-kernels/linux-lts-v6.6.38.r17.g855ae72-1-x86_64.pkg.tar.zst https://pkgbuild.com/~gromit/linux-bisection-kernels/linux-lts-headers-v6.6.38.r17.g855ae72-1-x86_64.pkg.tar.zst
Result for v6.6.38.r17.g855ae72-1: Good boot - Linux 6.6.38-1-lts-00017-g855ae72c2031-dirty #1 SMP PREEMPT_DYNAMIC Sat, 13 Jul 2024 23:54:19 +0000 x86_64 GNU/Linux
Please try the following:
sudo pacman -U https://pkgbuild.com/~gromit/linux-bisection-kernels/linux-lts-v6.6.38.r25.gaf19067-1-x86_64.pkg.tar.zst https://pkgbuild.com/~gromit/linux-bisection-kernels/linux-lts-headers-v6.6.38.r25.gaf19067-1-x86_64.pkg.tar.zst
Looks like there's an issue with the linux-lts-headers signature for v6.6.38.r25.gaf19067-1:
error: failed to read signature file: /var/cache/pacman/pkg/linux-lts-headers-v6.6.38.r25.gaf19067-1-x86_64.pkg.tar.zst.sig
error: '/var/cache/pacman/pkg/linux-lts-headers-v6.6.38.r25.gaf19067-1-x86_64.pkg.tar.zst': unexpected error
Signature size is 0:
-rw-r--r-- 1 root root 25758640 Jul 13 17:24 /var/cache/pacman/pkg/linux-lts-headers-v6.6.38.r25.gaf19067-1-x86_64.pkg.tar.zst
-rw-r--r-- 1 root root 0 Jul 13 23:57 /var/cache/pacman/pkg/linux-lts-headers-v6.6.38.r25.gaf19067-1-x86_64.pkg.tar.zst.sig
-rw-r--r-- 1 root root 134429059 Jul 13 17:25 /var/cache/pacman/pkg/linux-lts-v6.6.38.r25.gaf19067-1-x86_64.pkg.tar.zst
-rw-r--r-- 1 root root 566 Jul 13 23:56 /var/cache/pacman/pkg/linux-lts-v6.6.38.r25.gaf19067-1-x86_64.pkg.tar.zst.sig
Yeah, you are right... It seems something went wrong when creating the signature. I have now fixed the issue, but you may need to delete the packages from the cache (something like "sudo rm /var/cache/pacman/pkg/linux-lts-headers-v6.6.38.r25.gaf19067-1-x86_64.pkg.tar.zst{,.sig}"). Afterwards you can just re-use the pacman command from above.
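A zero-byte .sig file like the one in the ls output above cannot be verified, which is why pacman reports "failed to read signature file". A minimal sketch for spotting such files in the cache before re-running pacman, assuming the default cache directory /var/cache/pacman/pkg; the function name `list_empty_sigs` is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: list zero-byte detached signatures in a pacman
# package cache directory (defaults to /var/cache/pacman/pkg).
list_empty_sigs() {
  cache_dir=${1:-/var/cache/pacman/pkg}
  # -size 0 matches only files with no data at all, i.e. exactly zero bytes;
  # -maxdepth 1 keeps the search to the cache directory itself.
  find "$cache_dir" -maxdepth 1 -type f -name '*.sig' -size 0
}
```

To clean up the same way as the `{,.sig}` suggestion (removing both the signature and its package), the output could be fed to something like `list_empty_sigs | while IFS= read -r sig; do sudo rm -- "$sig" "${sig%.sig}"; done`; review the list before deleting anything.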
Sorry for the lack of details (I haven't been able to get more information; if someone gives me some guidelines, I could help), but I think this is also related to linux-lts 6.6.39-1: when I plug in an external hard drive over USB (an encrypted drive [1]), the system freezes completely. If the drive is plugged in while booting, SDDM never loads. If I unplug the drive and boot from scratch, I'm able to log in, but as soon as I connect the drive, the system freezes and becomes totally unusable. After removing linux-lts and installing linux (currently 6.9.9-arch1-1), the system was usable again.
v6.6.38.r25.gaf19067-1: Good boot
If the drive is plugged in while booting, SDDM never gets loaded. If I unplug the drive and boot from scratch, I'm able to login, but if I connect the drive, the system gets freezed and totally unusable.
This is very interesting. The system where I'm experiencing the lockup has 2 external USB drives. Once I've completed the bisect, unplugging the external drives will be a good test to confirm the same issue. Thanks for replying with a potential lead.
Last edited by ScottE (2024-07-14 15:46:03)
You're welcome! I'm glad it was useful!
Trying mainline linux 6.9 kernel is not currently an option as I rely on zfs-dkms for root.
Is https://github.com/archzfs/archzfs an option for you? I used to use the archzfs repo, but as the auto-build often lags behind, I started building ZFS myself; it works without issues for my pool.
Is https://github.com/archzfs/archzfs an option for you?
The honest answer is that I don't know. I tend to stick with LTS kernels as a general preference for stability, especially on my servers, double especially with ZFS root.
I tried to figure out how to do bisect builds of the kernel myself, but couldn't find the right, current, source tree for the arch version of linux-lts, and after a couple of hours of trying different things it exceeded my effort:reward ratio. :-)
If this is related to the USB disk issue, I expect it will be resolved soon anyway: https://bugzilla.kernel.org/show_bug.cgi?id=219039 - I think I'll re-apply 6.6.39 and unplug the USB drives to see what happens - if it boots then I don't think there's much point in completing the bisect search, given this being a known issue.
Please test:
sudo pacman -U https://pkgbuild.com/~gromit/linux-bisection-kernels/linux-lts-v6.6.38.r29.gc727e46-1-x86_64.pkg.tar.zst https://pkgbuild.com/~gromit/linux-bisection-kernels/linux-lts-headers-v6.6.38.r29.gc727e46-1-x86_64.pkg.tar.zst
I'm certain that this is the USB issue: after unplugging the drives, 6.6.39-1 boots just fine.
gromit - I greatly appreciate your help in working through this bisect, but I think I'm going to call this one for now - given all the evidence that this is related to a known issue. Thank you so much for your time in building packages for me!
[Edit: Let me see if I can reproduce this on my test mule, rather than keep disrupting my server, to continue the bisect. I did see your comment, gromit, in the other thread about finishing the bisect on this one for confirmation.]
Last edited by ScottE (2024-07-14 19:50:28)
cryptearth wrote: Is https://github.com/archzfs/archzfs an option for you?
The honest answer is that I don't know. I tend to stick with LTS kernels as a general preference for stability, especially on my servers, double especially with ZFS root.
Oh, I see - good point then.
For me, I use ZFS for an 8x 3 TB raidz2 pool but a regular single NVMe SSD for root and home (there's not much stored in home that isn't either easily obtainable by downloading from official sources or has at least one copy on the ZFS pool, so there's no point in moving home onto the ZFS pool).
As for root on ZFS: the instructions have changed quite a lot, and relying on another distribution that ships ZFS on its install media isn't the true Arch way for me.
As for DKMS vs. version-specific packages: I guess from a technical point of view it's the same whether you use DKMS or a package built for the current kernel version.
Important: ZFS currently only supports kernels up to 6.8; 6.9 and the upcoming 6.10 are still working through issues, so I guess sticking to LTS is a good idea on Arch.
This is the bisection log to reach 9a24eb8010c2dc6a2eba56e3eb9fc07d14ffe00a which matches your results so far:
$ git bisect log
git bisect start
# status: waiting for both good and bad commits
# bad: [2ced7518a03d002284999ed8336ffac462a358ec] Linux 6.6.39
git bisect bad 2ced7518a03d002284999ed8336ffac462a358ec
# status: waiting for good commit(s), bad commit known
# good: [2928631d5304b8fec48bad4c7254ebf230b6cc51] Linux 6.6.38
git bisect good 2928631d5304b8fec48bad4c7254ebf230b6cc51
# bad: [e536e6efa65f447a7611b4fb07ede1a9c895f8ea] e1000e: Fix S0ix residency on corporate systems
git bisect bad e536e6efa65f447a7611b4fb07ede1a9c895f8ea
# bad: [3f25b5f1635449036692a44b771f39f772190c1d] net: dsa: mv88e6xxx: Correct check for empty list
git bisect bad 3f25b5f1635449036692a44b771f39f772190c1d
# good: [855ae72c20310e5402b2317fc537d911e87537ef] drm/amdgpu: Using uninitialized value *size when calling amdgpu_vce_cs_reloc
git bisect good 855ae72c20310e5402b2317fc537d911e87537ef
# good: [af19067bd58f0f6f90eb6c604babffb55c2d6a00] media: dw2102: Don't translate i2c read into write
git bisect good af19067bd58f0f6f90eb6c604babffb55c2d6a00
# good: [c727e46f0cc8bd81788bb29dac9a0a45f2dfa2eb] Input: ff-core - prefer struct_size over open coded arithmetic
git bisect good c727e46f0cc8bd81788bb29dac9a0a45f2dfa2eb
# bad: [ff6b26be13032c5fbd6b6a0b24358f8eaac4f3af] wifi: mt76: replace skb_put with skb_put_zero
git bisect bad ff6b26be13032c5fbd6b6a0b24358f8eaac4f3af
# bad: [9a24eb8010c2dc6a2eba56e3eb9fc07d14ffe00a] usb: xhci: prevent potential failure in handle_tx_event() for Transfer events without TRB
git bisect bad 9a24eb8010c2dc6a2eba56e3eb9fc07d14ffe00a
# first bad commit: [9a24eb8010c2dc6a2eba56e3eb9fc07d14ffe00a] usb: xhci: prevent potential failure in handle_tx_event() for Transfer events without TRB
linux-lts-6.6.39-1 with 9a24eb8010c2dc6a2eba56e3eb9fc07d14ffe00a reverted:
linux-lts-6.6.39-1.1-x86_64.pkg.tar.zst/linux-lts-headers-6.6.39-1.1-x86_64.pkg.tar.zst
Edit:
linux-lts-6.6.39-1 with the proposed fix from https://bugzilla.kernel.org/show_bug.cgi?id=219039#c6 applied:
linux-lts-6.6.39-1.2-x86_64.pkg.tar.zst/linux-lts-headers-6.6.39-1.2-x86_64.pkg.tar.zst.
Last edited by loqs (2024-07-14 22:10:07)
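Once a bisection finishes, the culprit hash sits on the "# first bad commit:" line of the git bisect log, which is what a revert build like the 6.6.39-1.1 package is based on. A minimal sketch for pulling that hash out of a saved log, assuming the log was written to a file (the name bisect.log and the function name `first_bad_commit` are hypothetical):

```shell
#!/bin/sh
# Hypothetical helper: print the 40-character hash from the
# "# first bad commit: [<hash>] <subject>" line of a saved git bisect log.
first_bad_commit() {
  sed -n 's/^# first bad commit: \[\([0-9a-f]\{40\}\)].*/\1/p' "$1"
}
```

In a stable-tree checkout, `git bisect replay bisect.log` re-runs the recorded good/bad decisions, and something like `git checkout v6.6.39 && git revert --no-edit "$(first_bad_commit bisect.log)"` would produce a tree with the culprit reverted; this is an assumption about the workflow, not necessarily how the test packages here were built.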
linux-lts-6.6.39-1 with the proposed fix from https://bugzilla.kernel.org/show_bug.cgi?id=219039#c6 applied:
linux-lts-6.6.39-1.2-x86_64.pkg.tar.zst/linux-lts-headers-6.6.39-1.2-x86_64.pkg.tar.zst.
Good news! System where I've been having this issue boots fine with this 6.6.39-1.2-lts version and with USB disks plugged back in. I'm comfortable enough with this proof that I don't see a reason to continue down the bisect tree (which is good as I was unable to repro the issue on my test mule system and testing on my home server was disruptive). Thank you for providing a test build with the proposed fix, I appreciate the time and effort!
I retested with 6.6.40.1-lts from the testing repository and all is good there too.
This topic can be marked SOLVED with 6.6.40.1-lts moving to release repos. Thank you!
Last edited by ScottE (2024-07-15 20:15:02)