My computer keeps randomly freezing. The frequency is increasing to the point where it's time to take drastic measures. When the computer freezes, and I use sysrq-e, I see a bunch of disk related errors:
ext4-fs (dm-2): I/O error while writing to superblock
inode <different inodes>: error -5 reading directory block
So: either the filesystem is corrupted or the underlying drive is bad (or both), and I want to figure out which it is. The filesystem is encrypted ext4 on top of LVM.
I am willing to wipe the drive and run read/write tests to check its health, but I haven't found anything like "memtest" for SSDs. Does such a thing exist? Otherwise I'll just reformat and see if the errors crop up again.
fsck reports no errors. Here's my smartctl dump:
sudo smartctl -a -H /dev/nvme0n1
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.5.8-arch1-1] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Number: SHPP41-2000GM
Serial Number: REDACTED
Firmware Version: 51060A20
PCI Vendor/Subsystem ID: 0x1c5c
IEEE OUI Identifier: 0xace42e
Controller ID: 0
NVMe Version: 1.4
Number of Namespaces: 1
Namespace 1 Size/Capacity: 2,000,398,934,016 [2.00 TB]
Namespace 1 Formatted LBA Size: 4096
Local Time is: Tue Oct 24 18:41:37 2023 PDT
Firmware Updates (0x16): 3 Slots, no Reset required
Optional Admin Commands (0x0017): Security Format Frmw_DL Self_Test
Optional NVM Commands (0x00df): Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp Verify
Log Page Attributes (0x1e): Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg Pers_Ev_Lg
Maximum Data Transfer Size: 64 Pages
Warning Comp. Temp. Threshold: 86 Celsius
Critical Comp. Temp. Threshold: 87 Celsius
Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     7.50W       -        -    0  0  0  0        5     305
 1 +   3.9000W       -        -    1  1  1  1       30     330
 2 +   1.5000W       -        -    2  2  2  2      100     400
 3 -   0.0500W       -        -    3  3  3  3      500    1500
 4 -   0.0050W       -        -    4  4  4  4     1000    9000
Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 -     512       0         0
 1 +    4096       0         0
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 57 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 0%
Data Units Read: 108,401,115 [55.5 TB]
Data Units Written: 24,698,772 [12.6 TB]
Host Read Commands: 1,286,741,812
Host Write Commands: 329,871,941
Controller Busy Time: 11,058
Power Cycles: 501
Power On Hours: 4,501
Unsafe Shutdowns: 148
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 51 Celsius
Temperature Sensor 2: 61 Celsius
Error Information (NVMe Log 0x01, 16 of 256 entries)
No Errors Logged
Read Self-test Log failed: Invalid Field in Command (0x002)
Last edited by eternal_floof (2023-10-25 01:50:03)
And of course only after I post something do I discover badblocks. I'll give that a go unless there are other suggestions: https://wiki.archlinux.org/title/Badblocks
The destructive mode is gonna be much faster than the non-destructive one, but you're losing all data.
Besides the disk itself, other suspects are the "encrypted EXT … on top of LVM" stack and, of course, cables (or in your case, the "seating" of the drive).
It might also be power saving: NVMe drives tend to struggle with IOMMU and APST.
https://wiki.archlinux.org/title/Solid_ … leshooting
ASPM might also be a factor in this (pcie_aspm=off)
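A rough way to check whether these workarounds are active on the running kernel. This is only a sketch: `check_ps_params` is a made-up helper name, and `nvme_core.default_ps_max_latency_us=0` is the parameter the Arch wiki lists for disabling APST (it is not spelled out in this post, so treat it as an assumption); `pcie_aspm=off` is the one mentioned above.

```shell
# Report which power-saving workaround parameters are on a kernel command line.
# Reads /proc/cmdline unless another file is passed as $1 (handy for testing).
check_ps_params() {
    cmdline_file="${1:-/proc/cmdline}"
    for p in nvme_core.default_ps_max_latency_us=0 pcie_aspm=off; do
        # Split the command line into one token per line, then match exactly.
        if tr ' ' '\n' < "$cmdline_file" | grep -qxF -- "$p"; then
            echo "set: $p"
        else
            echo "missing: $p"
        fi
    done
}

# Usage on the affected machine:
#   check_ps_params
```

If a parameter shows up as missing, it has to be added to the boot loader's kernel command line and the machine rebooted before it takes effect.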
Last edited by seth (2023-10-25 07:22:04)
> Read Self-test Log failed: Invalid Field in Command (0x002)
For smartctl, use /dev/nvme0 instead of nvme0n1. If your NVMe drive supports self-test logs, they should show up there, and you can run smartctl -t long /dev/nvme0 to start a self-test.
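As a small sketch of that step: the helper below (the name `nvme_ctrl_dev` is made up for illustration) turns a namespace or partition node into the controller node, assuming the usual nvmeXnY[pZ] naming. The smartctl calls are left as comments because they need root and real hardware.

```shell
# Map an NVMe namespace/partition node to its controller node,
# e.g. /dev/nvme0n1 or /dev/nvme0n1p2 -> /dev/nvme0.
# Assumes the standard nvmeXnY[pZ] device naming.
nvme_ctrl_dev() {
    printf '%s\n' "$1" | sed -E 's/n[0-9]+(p[0-9]+)?$//'
}

# Usage on the affected machine:
#   ctrl="$(nvme_ctrl_dev /dev/nvme0n1)"
#   sudo smartctl -t long "$ctrl"        # start the extended self-test
#   sudo smartctl -l selftest "$ctrl"    # read the self-test log afterwards
```

Whether the self-test log is available still depends on the drive and on the smartmontools version.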
For I/O errors it's important to check dmesg at the point where the errors started, not somewhere in the middle. I/O errors on dm-x (device mapper) can also be soft errors (e.g. integrity errors for a dm-crypt/integrity device, or errors after the backing device was disconnected). So the actual error that kicked these off might be further up the log...
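One possible way to jump to the first storage-related message instead of scrolling. This is a sketch only: `first_storage_error` is a hypothetical helper, and the search pattern is my guess at what is relevant for this setup, so adjust it as needed.

```shell
# Print the first kernel log line (with its line number) that mentions
# NVMe, device-mapper, ext4 or an I/O error. Reads the log from stdin.
# The pattern is an assumption, not an exhaustive list.
first_storage_error() {
    grep -n -m1 -iE 'nvme|device-mapper|EXT4-fs|I/O error'
}

# Usage on the affected machine:
#   sudo dmesg | first_storage_error                          # current boot
#   journalctl -k -b -1 --no-pager | first_storage_error      # previous boot
```

The line number tells you where to start reading in the full log; the lines just before it are usually the interesting ones.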
Last edited by frostschutz (2023-10-25 09:10:15)
@frostschutz:
Thanks for the tip! After running against /dev/nvme0 I was able to run the self test. If I run the long self-test when booted into my laptop, it causes the freezing behavior. Instead, if I boot into a live USB and run the command, it completes successfully without any errors.
dmesg doesn't show anything; I suspect that's because the entire drive goes into read-only mode or something similar. There is never anything in journalctl or any other log after rebooting. The only way I'm able to see the errors is by killing everything except PID 1 (sysrq+e) and then watching the console on the screen. I'll play around with ASPM next, to see if that might be the cause.
Given this info, I'm starting to think the problem is on the LVM or ext4 side?
Okay, badblocks returned no errors. I strongly suspect LVM issues based on previous errors on my computer. I've reformatted the machine to use standard partitions and btrfs instead of LVM + ext4. We'll see how it goes *crosses fingers*
NB: ASPM/APST/trimming related issues will *NOT* show up with badblocks. It only tests whether there are unusable sectors on the disk, while the problems mentioned above are bus and firmware related.