Hello, I have installed Arch Linux on a Lenovo ThinkPad T14 Gen 4 with a WD Blue SN580 1TB NVMe drive. (I decided to post here instead of in Laptop Issues because I think this is a more general kernel/hardware issue.) When I bought the laptop, I changed the native sector size of the NVMe drive from 512 bytes (the factory default) to 4096 bytes and sanitized it. At the time of installation I was on Arch Linux with kernel 6.7.x; now I am on 6.8.1. The drive has 2 partitions: one FAT32 and one LUKS container holding a BTRFS filesystem with 3 subvolumes. Every now and then during boot I encounter the following error:
The same issue, captured different day:
This problem has been happening occasionally from the beginning, even on a freshly installed OS. When this error occurs, the system fails to boot. It is not possible to reboot via Ctrl+Alt+Delete; I can only forcibly shut the laptop down with the power button. I even tried to re-sanitize the NVMe drive, reformat it, and install Arch Linux from scratch, but the issue still occurs. When the error does not occur, the system boots just fine, and other than this I am not seeing any issues with the drive on a day-to-day basis.
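Since the machine has to be power-cycled, the error messages from the failed boot can usually be recovered afterwards from the persistent journal. A sketch (the grep pattern is just my guess at relevant keywords, not taken from the actual error):

```shell
# Make sure the journal survives reboots (it is persistent by default
# when /var/log/journal exists; create it if necessary):
sudo mkdir -p /var/log/journal
sudo systemctl restart systemd-journald

# After the next failed boot, inspect the kernel messages of the
# previous boot (-b -1) for NVMe/block-layer errors:
sudo journalctl -k -b -1 | grep -iE 'nvme|i/o error|critical'
```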
SMART self-tests don't report any errors. smartctl --log=selftest /dev/nvme0:
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.8.1-arch1-1] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF SMART DATA SECTION ===
Self-test Log (NVMe Log 0x06)
Self-test status: No self-test in progress
Num Test_Description Status Power_on_Hours Failing_LBA NSID Seg SCT Code
0 Extended Completed without error 7 - - - - -
1 Short Completed without error 7 - - - - -
Kernel command line parameters for mounting the root FS:
rd.systemd.gpt_auto=0 rd.luks.name=7b55554b-68a0-4472-8445-1a39375cfd1b=cryptroot rd.luks.options=discard,no-read-workqueue,no-write-workqueue,password-echo=no,tpm2-device=auto root=/dev/mapper/cryptroot rootfstype=btrfs rootflags=defaults,rw,noatime,subvol=@,commit=60,compress=zstd:1,space_cache=v2 rootwait
Contents of fstab:
UUID=43C8-A5B2 /boot/efi vfat defaults,noatime,fmask=0022,dmask=0022,codepage=437,iocharset=ascii,shortname=mixed,utf8,errors=remount-ro,x-systemd.automount 0 2
UUID=d587a045-3030-4417-9e0b-73179c751492 /home btrfs defaults,noatime,compress=zstd:1,space_cache=v2,subvol=@home,x-systemd.automount 0 0
UUID=d587a045-3030-4417-9e0b-73179c751492 /swap btrfs defaults,noatime,compress=no,space_cache=v2,subvol=@swap,x-systemd.automount 0 0
/swap/swapfile none swap defaults,noatime,nodev,noexec,nosuid,x-systemd.automount 0 0
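For anyone debugging a similar setup, it may be worth confirming which sector size each layer actually ended up with. A sketch (device and mapping names taken from this thread's setup; `nvme id-ns` needs the nvme-cli package):

```shell
# Logical and physical sector size the kernel sees on the raw device:
blockdev --getss --getpbsz /dev/nvme0n1

# LBA format currently selected on the namespace:
nvme id-ns -H /dev/nvme0n1 | grep 'in use'

# Sector size of the dm-crypt mapping:
cryptsetup status cryptroot | grep 'sector size'

# Sector size the BTRFS filesystem was created with:
btrfs inspect-internal dump-super /dev/mapper/cryptroot | grep sectorsize
```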
What I tried:
- Booting with intel_iommu=off command line parameter.
- Booting with pcie_aspm=off command line parameter.
- Booting with nvme_core.default_ps_max_latency_us=0 command line parameter.
- Booting with rootwait rootdelay=5 command line parameter.
However, none of this helped. I am running out of ideas about what is causing this problem. Do you have any idea what the issue could be?
Thank you very much for any ideas and have a nice day.
Last edited by armandleg (2024-04-04 22:17:00)
Offline
Given that the LBA/sector numbers are identical (266500 and 2132000) on different days, that's probably a hardware failure.
Check for logged PCIe errors with dmesg or journalctl. You could also try starting over with the drive in 512e mode and see if that changes anything.
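Incidentally (my observation, not something stated in the posts above): the two numbers are consistent with being the same on-disk location reported in different units, 266500 counted in 4096-byte LBAs and 2132000 counted in 512-byte sectors, since both land on the same byte offset:

```shell
# Same location expressed in 4K LBAs vs. 512-byte sectors:
echo $((266500 * 4096))   # byte offset assuming 4096-byte sectors
echo $((2132000 * 512))   # byte offset assuming 512-byte sectors
# Both print 1091584000, i.e. 2132000 = 266500 * 8.
```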
Offline
https://wiki.archlinux.org/title/Badblocks
Though, have all/both installations been luks+btrfs?
Tried "iommu=soft"?
memtest86+?
The device temperature is fine?
Any pattern to the failure (i.e. "fails on reboots after the system was compiling stuff for 5 hours, but is usually fine the next morning; I live in the Arctic, btw")?
Offline
Yes, both installations were LUKS+BTRFS. I also tried iommu=soft; it didn't help. Device temperature was fine, always in roughly the 30-45 °C range. No pattern to the failure. Yesterday I replaced the drive with a Samsung NVMe and installed the system with the same configuration. Since then I haven't encountered any such issues. It seems that either 1. the original drive was faulty, 2. it was somehow incompatible with Linux, or 3. it didn't work properly in 4096-byte sector mode. When I have more time I'll experiment with the original drive a bit more, but for now the replacement solved the issue. Thank you very much for your assistance.
Offline
This is the only other instance I could find on the internet of someone with the exact same issue as me. Googling those BTRFS error messages got me ~4 irrelevant results.
I have a WD_BLACK SN770 1TB whose sector size I also changed from 512n to 4Kn. The drive clearly reported that it supports this configuration, and the "nvme format" seemed to work. I then (coincidentally?) created the same configuration as you: GPT partitioning with an EFI system partition and a single LUKS-encrypted BTRFS volume. FAT32, LUKS, and BTRFS were all using 4K sector sizes, and everything appeared to be going well until I hit this issue on the first reboot.

I found that rebuilding my UKI via chroot would sometimes let me boot, but after spending enough time booted (a few hours?) my PC would suddenly hang: the browser stopped working, Ctrl+Fx led to a black screen, I couldn't return to the GUI from there, and the Caps Lock light stopped working (does that mean kernel panic?). "btrfs check" and "btrfs scrub" did not find any issues with the filesystem. Also, it wasn't just BTRFS that was failing here; the EFI system partition was also failing to mount on startup. Weirdly, I never had trouble mounting it manually.

I'm making this post now on a new install with the SSD back to 512n. This time FAT32 and LUKS use 512-byte sectors while BTRFS uses 4K sectors (BTRFS always defaults to 4K sectors regardless of logical sector size). I have not experienced the issue since changing back to 512n. The install was done using kernel 6.14.4-zen1-1-zen.
It seems some drives (potentially WD consumer drives in particular?) do not work well with 4Kn sectors despite reporting it as a valid configuration.
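For anyone else tempted to try 4Kn, the supported and in-use LBA formats can be checked before experimenting, and the drive reverted afterwards. A sketch assuming nvme-cli; the --lbaf index for 512-byte sectors must be read off your drive's own table, 0 is only a common convention:

```shell
# List the LBA formats the namespace advertises; the active one is
# marked "(in use)":
nvme id-ns -H /dev/nvme0n1 | grep 'LBA Format'

# Revert to 512-byte sectors. DESTROYS ALL DATA on the namespace;
# the --lbaf index is taken from the table printed above:
# nvme format /dev/nvme0n1 --lbaf=0
```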
Last edited by ppp7032 (2025-04-27 11:31:55)
Offline
Alas! I've found someone else document the same issue here: https://halestrom.net/darksleep/blog/054_nvme/.
This person identified that their NVMe controller would lock up under heavy random read loads. The OP and I were actually lucky to be using LUKS, because I'm pretty sure using it in general (e.g. booting from it) inherently generates heavy random reads. That made the issue, which would have been there regardless, reveal itself immediately.
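If someone wants to test a drive for this without going through a full install, the "heavy random read" workload can be approximated with fio against a scratch file. A sketch; the size, queue depth, and runtime are arbitrary values I picked, and --direct=1 needs a real filesystem underneath (not tmpfs):

```shell
# 4K random reads at some queue depth, roughly the access pattern a
# LUKS-encrypted boot would generate; uses a 1 GiB scratch file rather
# than the raw device, so no existing data is touched:
fio --name=randread-test --filename=./fio-scratch --size=1G \
    --rw=randread --bs=4k --iodepth=32 --ioengine=libaio --direct=1 \
    --time_based --runtime=120
rm ./fio-scratch
```

If the controller is affected, the symptom would presumably be the same lockup (I/O errors or a hang) rather than merely poor numbers.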
3 people with the issue found so far and all 3 were using a Western Digital Color NVMe SSD... Hmm...
I've updated my drive's firmware and will (eventually) check if that fixed the issue for me, but I'm sceptical it'll help because apparently WD doesn't "believe" in firmware updates for consumer drives.
Edit: Firmware update does not fix the issue...
I'll be sticking to 512 for the foreseeable future on this drive.
Last edited by ppp7032 (2025-04-27 16:15:47)
Offline
I /think/ I am experiencing the same issue, but on somewhat different hardware!
I have a T14 Gen 2 AMD with an Inland TN470 2TB SSD (Phison E27T controller). I sometimes get these I/O errors when booting, right after entering my LUKS password, and I generally have to hard reboot. I also think it sometimes happens when my computer is waking from sleep: occasionally the ThinkPad light turns solid red (meaning the computer is awake) but the machine is completely unresponsive, with no display or anything whatsoever, which I think could reasonably be caused by the SSD locking up (nobody else reports issues waking from sleep on this laptop on modern kernels). I'm using 512-byte sectors (I didn't mess with the sector sizes).
I'm on NixOS with LUKS + EXT4.
Offline
Systematic STR (suspend-to-RAM) + NVMe issues are documented and addressed in https://wiki.archlinux.org/title/Solid_ … leshooting (disable APST, ASPM, and the IOMMU, see whether that stabilizes the system, then look for the crucial parameter).
Offline