I'm on kernel 6.11.5.zen1-1 on a Lenovo laptop with a Ryzen 7 5800U CPU. Lately (for the last three weeks or so) I have been having weekly system crashes. I suspect it has something to do with 'fstrim': the last few times this happened, the journalctl boot logs were cut off abruptly while fstrim was running (I have 'fstrim.timer' set up to run 'fstrim.service' weekly).
`journalctl -u fstrim` shows this (I've posted only entries from the last four weeks):
-- Boot 71718fe1e53947b1be70b217b04facff --
Sep 30 11:42:25 jazz systemd[1]: Starting Discard unused blocks on filesystems from /etc/fstab...
Sep 30 11:46:56 jazz fstrim[1893837]: /efi: 548.7 MiB (575307776 bytes) trimmed on /dev/nvme0n1p1
Sep 30 11:46:56 jazz fstrim[1893837]: /: 278.5 GiB (299079884800 bytes) trimmed on /dev/mapper/vg-root
Sep 30 11:46:56 jazz systemd[1]: fstrim.service: Deactivated successfully.
Sep 30 11:46:56 jazz systemd[1]: Finished Discard unused blocks on filesystems from /etc/fstab.
Sep 30 11:46:56 jazz systemd[1]: fstrim.service: Consumed 2.128s CPU time, 88.1M memory peak.
-- Boot 0ef302d2806e4abbb4cb31f15117fcc9 --
Oct 07 00:03:53 jazz systemd[1]: Starting Discard unused blocks on filesystems from /etc/fstab...
Oct 07 00:08:28 jazz fstrim[2536603]: /efi: 548.7 MiB (575307776 bytes) trimmed on /dev/nvme0n1p1
Oct 07 00:08:28 jazz fstrim[2536603]: /: 269.6 GiB (289504501760 bytes) trimmed on /dev/mapper/vg-root
Oct 07 00:08:28 jazz systemd[1]: fstrim.service: Deactivated successfully.
Oct 07 00:08:28 jazz systemd[1]: Finished Discard unused blocks on filesystems from /etc/fstab.
Oct 07 00:08:28 jazz systemd[1]: fstrim.service: Consumed 2.949s CPU time, 83.5M memory peak.
-- Boot ad30f489255d4cd7a49bbfaf06744143 --
Oct 14 12:44:02 jazz systemd[1]: Starting Discard unused blocks on filesystems from /etc/fstab...
-- Boot 9c68f7e93e0f4b56bccac9a481b8f595 --
Oct 21 00:38:17 jazz systemd[1]: Starting Discard unused blocks on filesystems from /etc/fstab...
Oct 21 00:43:20 jazz fstrim[3037488]: /efi: 548.7 MiB (575307776 bytes) trimmed on /dev/nvme0n1p1
Oct 21 00:43:20 jazz fstrim[3037488]: /: 231.1 GiB (248178024448 bytes) trimmed on /dev/mapper/vg-root
Oct 21 00:43:20 jazz systemd[1]: fstrim.service: Deactivated successfully.
Oct 21 00:43:20 jazz systemd[1]: Finished Discard unused blocks on filesystems from /etc/fstab.
Oct 21 00:43:20 jazz systemd[1]: fstrim.service: Consumed 6.440s CPU time, 87.6M memory peak.
-- Boot 6976f6bb6b77479cb6cdec4b081328d9 --
Oct 28 00:58:26 jazz systemd[1]: Starting Discard unused blocks on filesystems from /etc/fstab...
Oct 28 01:03:09 jazz fstrim[336635]: /efi: 548.7 MiB (575307776 bytes) trimmed on /dev/nvme0n1p1
Oct 28 01:03:09 jazz fstrim[336635]: /: 225.9 GiB (242595250176 bytes) trimmed on /dev/mapper/vg-root
Oct 28 01:03:09 jazz systemd[1]: fstrim.service: Deactivated successfully.
Oct 28 01:03:09 jazz systemd[1]: Finished Discard unused blocks on filesystems from /etc/fstab.
On Oct 14, Oct 21 and Oct 28 I had complete system freezes while fstrim was running and had to hard reboot (REISUB didn't always work completely; it got stuck at 'emergency remount complete'). On reboot, GRUB complained about "failure reading from sector 0x113840 from 'hd0'" and dumped me into rescue mode, but after two or three tries (i.e. holding the power button) it did let me boot successfully. I suspected an NVMe failure. However, neither smartctl nor nvme-cli shows any critical warnings. Here's the output of
`sudo smartctl -a /dev/nvme0n1`:
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.11.5-zen1-1-zen] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Number: SAMSUNG MZVLB1T0HBLR-000L2
Serial Number: S4DZNX0R681274
Firmware Version: 3L1QEXF7
PCI Vendor/Subsystem ID: 0x144d
IEEE OUI Identifier: 0x002538
Total NVM Capacity: 1,024,209,543,168 [1.02 TB]
Unallocated NVM Capacity: 0
Controller ID: 4
NVMe Version: 1.3
Number of Namespaces: 1
Namespace 1 Size/Capacity: 1,024,209,543,168 [1.02 TB]
Namespace 1 Utilization: 781,778,247,680 [781 GB]
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64: 002538 8611b9414b
Local Time is: Mon Oct 28 02:15:57 2024 IST
Firmware Updates (0x16): 3 Slots, no Reset required
Optional Admin Commands (0x0017): Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005f): Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x03): S/H_per_NS Cmd_Eff_Lg
Maximum Data Transfer Size: 512 Pages
Warning Comp. Temp. Threshold: 84 Celsius
Critical Comp. Temp. Threshold: 85 Celsius
Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 8.00W - - 0 0 0 0 0 0
1 + 6.30W - - 1 1 1 1 0 0
2 + 3.50W - - 2 2 2 2 0 0
3 - 0.0760W - - 3 3 3 3 210 1200
4 - 0.0050W - - 4 4 4 4 2000 8000
Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
0 + 512 0 0
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 39 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 0%
Data Units Read: 26,495,310 [13.5 TB]
Data Units Written: 21,203,317 [10.8 TB]
Host Read Commands: 319,847,077
Host Write Commands: 256,428,855
Controller Busy Time: 1,544
Power Cycles: 3,046
Power On Hours: 658
Unsafe Shutdowns: 156
Media and Data Integrity Errors: 0
Error Information Log Entries: 4,536
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 39 Celsius
Temperature Sensor 2: 35 Celsius
Error Information (NVMe Log 0x01, 16 of 64 entries)
Num ErrCount SQId CmdId Status PELoc LBA NSID VS Message
0 4536 0 0x000c 0x4004 - 0 0 - Invalid Field in Command
Self-test Log (NVMe Log 0x06)
Self-test status: No self-test in progress
No Self-tests Logged
Any ideas what's going on here? Can upload the 'journalctl -b' logs for these boots if needed. Thanks.
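Side note for anyone reading along: in the `journalctl -u fstrim` output earlier in this post, a boot that has a "Starting Discard unused blocks..." line but no matching "Finished Discard unused blocks..." line is one where the journal stopped mid-trim. A minimal sketch of a filter for that, assuming the log format shown above (the function name `incomplete_fstrim_runs` is made up for illustration):

```shell
#!/bin/sh
# Hypothetical helper: reads `journalctl -u fstrim` style output on stdin
# and prints "<boot id>: <start time>" for every fstrim run that has a
# "Starting Discard" entry but no "Finished Discard" entry after it.
incomplete_fstrim_runs() {
    awk '
        # Print the pending run, if any, that never reached "Finished".
        function report() { if (pending) print boot ": " start }
        /^-- Boot /                      { report(); pending = 0; boot = $3; next }
        /Starting Discard unused blocks/ { report(); pending = 1; start = $1 " " $2 " " $3; next }
        /Finished Discard unused blocks/ { pending = 0 }
        END                              { report() }
    '
}
```

It could be used as `journalctl -u fstrim | incomplete_fstrim_runs`. On the excerpt posted above it would only flag the Oct 14 boot, since the other boots shown all reach the "Finished" line.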
Last edited by ssm@arch (2024-10-27 22:05:18)
Did you check journalctl / dmesg as well? (for kernel messages - in general)
fstrim itself does very little; it mostly tells the filesystem what to do, so the actual work happens in the kernel. So if there is any problem, the kernel should spit out some messages.
Of course, if the storage is the problem, those messages might not have made it into the journal / logs. You can keep a terminal with `dmesg -w` open or try network logging.
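One possible variation on the `dmesg -w` idea, as a rough sketch: pipe the output through a small loop that appends each line to a file and syncs straight away, so the last messages before a freeze have a chance of surviving. This assumes the log file lives on a different device than the suspect drive (a USB stick, say); `log_synced` and the path in the usage line are made-up names.

```shell
#!/bin/sh
# Hypothetical helper: append every line from stdin to the given file and
# flush it to disk immediately. Point it at a device other than the one
# under suspicion, otherwise the writes may fail along with the drive.
log_synced() {
    logfile=$1
    while IFS= read -r line; do
        printf '%s\n' "$line" >> "$logfile"
        sync    # force the write out now rather than leaving it in the page cache
    done
}
```

Usage would be something like `dmesg -w | log_synced /mnt/usb/dmesg.log`, left running until the next freeze.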
Before that, however, you should back up all your data if you don't have backups already. Misbehavior with fstrim could mean either faulty media or a kernel bug; either way, data loss is a possibility.
fstrim can be very slow depending on filesystem size and fragmentation; specifying a generous minimum size (fstrim --minimum) might help in some cases.
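To try the minimum-size idea by hand instead of waiting for the weekly timer, something along these lines might do. It is only a sketch: the wrapper name `trim_with_minimum`, the `FSTRIM` override and the 1MiB default are assumptions for illustration, not recommended values.

```shell
#!/bin/sh
# Hypothetical wrapper: trim one mount point, skipping free extents
# smaller than the given minimum size.
# Usage: trim_with_minimum <mountpoint> [minimum-size]
trim_with_minimum() {
    mountpoint=$1
    minimum=${2:-1MiB}    # assumed default; tune for the filesystem in question
    # FSTRIM can be overridden (e.g. FSTRIM=echo) to print the arguments
    # instead of actually trimming.
    ${FSTRIM:-fstrim} --verbose --minimum "$minimum" "$mountpoint"
}
```

From a root shell, `trim_with_minimum / 16MiB` would trim the root filesystem; `FSTRIM=echo trim_with_minimum / 16MiB` just prints the arguments that would be passed to fstrim.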
Are you monitoring temperatures under load?
Last edited by frostschutz (2024-10-27 22:36:51)
So if there is any problem, kernel should spit out some messages.
and then see https://wiki.archlinux.org/title/Solid_ … leshooting
Offline
Thank you both. I did check temps under load, but they never went above 65 °C. I kept `dmesg -w` open and got 'controller is down' and 'disabling device after reset failure', followed by a lot of I/O errors. Following seth's pointer, I have disabled APST (this suggestion also showed up in `dmesg -w`). Will report back if I encounter further crashes.
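For reference, the post does not say how APST was disabled; the usual way is to add `nvme_core.default_ps_max_latency_us=0` to the kernel command line, and that is what this sketch assumes. After a reboot, the value the kernel actually picked up can be read back from sysfs (`apst_status` is a made-up helper name):

```shell
#!/bin/sh
# Hypothetical helper: report whether APST is disabled according to the
# nvme_core module parameter. A value of 0 means APST is not enabled.
# An alternative file can be passed as the first argument.
apst_status() {
    param=${1:-/sys/module/nvme_core/parameters/default_ps_max_latency_us}
    if [ ! -r "$param" ]; then
        echo "unknown: $param not readable"
        return 1
    fi
    latency=$(cat "$param")
    if [ "$latency" -eq 0 ]; then
        echo "APST disabled (default_ps_max_latency_us=0)"
    else
        echo "APST allowed, latency limit ${latency} us"
    fi
}
```

Calling `apst_status` with no arguments reads the live value; anything other than 0 suggests the kernel parameter did not take effect.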
Last edited by ssm@arch (2024-10-28 09:01:13)