SSD errors, how to determine SSD health?

eternal_floof · 2023-10-25 01:49:14

My computer keeps randomly freezing. The frequency is increasing to the point where it's time to take drastic measures. When the computer freezes and I use sysrq-e, I see a bunch of disk-related errors:

EXT4-fs (dm-2): I/O error while writing to superblock
inode <different nodes>: error -5 reading directory block

So either the filesystem is corrupted, the underlying drive is bad, or both, and I want to figure out which it is. The filesystem is encrypted EXT-4 on top of LVM.
I am willing to wipe the drive and run read/write commands to check its health, but I haven't found something like "memtest" for SSDs. Does such a thing exist? Otherwise I'll just reformat and see if errors crop up again.

fsck reports no errors. Here's my smartctl dump:

sudo smartctl -a -H /dev/nvme0n1
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.5.8-arch1-1] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       SHPP41-2000GM
Serial Number:                      REDACTED
Firmware Version:                   51060A20
PCI Vendor/Subsystem ID:            0x1c5c
IEEE OUI Identifier:                0xace42e
Controller ID:                      0
NVMe Version:                       1.4
Number of Namespaces:               1
Namespace 1 Size/Capacity:          2,000,398,934,016 [2.00 TB]
Namespace 1 Formatted LBA Size:     4096
Local Time is:                      Tue Oct 24 18:41:37 2023 PDT
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x00df):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp Verify
Log Page Attributes (0x1e):         Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg Pers_Ev_Lg
Maximum Data Transfer Size:         64 Pages
Warning  Comp. Temp. Threshold:     86 Celsius
Critical Comp. Temp. Threshold:     87 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     7.50W       -        -    0  0  0  0        5     305
 1 +   3.9000W       -        -    1  1  1  1       30     330
 2 +   1.5000W       -        -    2  2  2  2      100     400
 3 -   0.0500W       -        -    3  3  3  3      500    1500
 4 -   0.0050W       -        -    4  4  4  4     1000    9000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 -     512       0         0
 1 +    4096       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        57 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    108,401,115 [55.5 TB]
Data Units Written:                 24,698,772 [12.6 TB]
Host Read Commands:                 1,286,741,812
Host Write Commands:                329,871,941
Controller Busy Time:               11,058
Power Cycles:                       501
Power On Hours:                     4,501
Unsafe Shutdowns:                   148
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               51 Celsius
Temperature Sensor 2:               61 Celsius

Error Information (NVMe Log 0x01, 16 of 256 entries)
No Errors Logged

Read Self-test Log failed: Invalid Field in Command (0x002)

Last edited by eternal_floof (2023-10-25 01:50:03)

eternal_floof · 2023-10-25 01:52:25

And of course only after I post something do I discover badblocks. I'll give that a go unless there are other suggestions: https://wiki.archlinux.org/title/Badblocks

seth · 2023-10-25 07:21:51

The destructive mode is gonna be much faster than the non-destructive one, but you're losing all data.
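For reference, a dry-run sketch of the two modes. It only prints the commands and never runs badblocks. The helper name badblocks_cmd is made up for illustration, /dev/nvme0n1 is taken from the first post, and -b 4096 is an assumption based on the 4096-byte formatted LBA size in the smartctl dump.

```shell
#!/bin/sh
# Hypothetical dry-run helper: print (never run) the badblocks command for a mode.
#   destructive     -> -w  write-mode test, overwrites everything on the device
#   non-destructive -> -n  read-write test that preserves the existing data
badblocks_cmd() {
    if [ "$#" -ne 2 ]; then
        echo "usage: badblocks_cmd destructive|non-destructive DEVICE" >&2
        return 1
    fi
    case "$1" in
        destructive)     flag=-w ;;
        non-destructive) flag=-n ;;
        *) echo "unknown mode: $1" >&2; return 1 ;;
    esac
    # -s shows progress, -v is verbose, -b sets the block size in bytes
    echo "badblocks ${flag}sv -b 4096 $2"
}

badblocks_cmd destructive /dev/nvme0n1
badblocks_cmd non-destructive /dev/nvme0n1
```

Whichever command you pick has to be run as root on an unmounted device, and the -w variant wipes it.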

Besides the disk itself, the other suspects are the "encrypted EXT … on top of LVM" stack and, of course, cables (or in your case: "seating").
It might also be power saving; NVMe drives tend to struggle w/ IOMMU and APST.
https://wiki.archlinux.org/title/Solid_ … leshooting

ASPM might also be a factor in this (pcie_aspm=off)
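A small sketch for checking what the running kernel currently uses before changing anything. The helper name show_param is made up. The APST parameter shown (nvme_core.default_ps_max_latency_us) is my assumption about which knob is meant; only pcie_aspm=off is named above.

```shell
#!/bin/sh
# Hypothetical helper: print a kernel tunable from sysfs, or a note if absent.
show_param() {
    if [ -r "$1" ]; then
        printf '%s = %s\n' "$1" "$(cat "$1")"
    else
        printf '%s = (not available)\n' "$1"
    fi
}

# Current PCIe ASPM policy; the active one is shown in [brackets].
show_param /sys/module/pcie_aspm/parameters/policy
# Maximum APST latency the nvme driver accepts; 0 means APST is disabled.
show_param /sys/module/nvme_core/parameters/default_ps_max_latency_us

# Parameters to try on the kernel command line (set in the bootloader config):
#   pcie_aspm=off
#   nvme_core.default_ps_max_latency_us=0   (assumed APST switch, see above)
```

Testing one parameter at a time makes it easier to tell which of them, if any, stops the freezes.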

Last edited by seth (2023-10-25 07:22:04)

frostschutz · 2023-10-25 09:09:29

> Read Self-test Log failed: Invalid Field in Command (0x002)

For smartctl, use /dev/nvme0 instead of /dev/nvme0n1. If your NVMe supports self-test logs, they should show up there, and you can run smartctl -t long /dev/nvme0 for a self-test.
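Roughly like this, assuming the drive from the first post and a smartctl recent enough to support NVMe self-tests (the 7.4 shown there should be). The helper nvme_ctrl_dev is a made-up name that only strips the namespace suffix; the smartctl calls are left commented so nothing runs on paste.

```shell
#!/bin/sh
# Hypothetical helper: map an NVMe namespace node (/dev/nvme0n1) or a
# partition (/dev/nvme0n1p2) to its controller node (/dev/nvme0).
# Anything that does not look like a namespace path is printed unchanged.
nvme_ctrl_dev() {
    printf '%s\n' "$1" | sed -E 's#^(/dev/nvme[0-9]+)n[0-9]+(p[0-9]+)?$#\1#'
}

ctrl=$(nvme_ctrl_dev /dev/nvme0n1)
echo "controller device: $ctrl"

# Then, as root:
#   smartctl -t long "$ctrl"       # start the extended self-test
#   smartctl -l selftest "$ctrl"   # read the self-test log once it has finished
```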

For I/O errors it's important to check dmesg, and not somewhere in the middle but at the point where the errors started. I/O errors on dm-x (device mapper) can also be soft errors (e.g. integrity errors for a dm-crypt/integrity device, or when the backing device was disconnected). So the actual error that kicked these off might be further up the log...
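One way to look for that starting point, as a sketch: the helper name first_storage_error and the search pattern are my own guesses at what such messages tend to contain, so adjust them to whatever your log actually shows. It also only helps if the kernel log of the affected boot made it to disk.

```shell
#!/bin/sh
# Hypothetical helper: read a kernel log on stdin and print the first line
# (prefixed with its line number) that looks like a storage problem.
# The pattern is an assumption, not an exhaustive list of kernel messages.
first_storage_error() {
    grep -n -m 1 -iE 'i/o error|nvme.*(timeout|reset)|ext4-fs (error|warning)'
}

# Usage, for the kernel log of the previous boot:
#   journalctl -k -b -1 --no-pager | first_storage_error
# or for the running system:
#   dmesg | first_storage_error
```

The reported line number tells you where to start reading; the lines just before it are usually the interesting ones.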

Last edited by frostschutz (2023-10-25 09:10:15)

eternal_floof · 2023-10-26 00:01:08

@frostschutz:

Thanks for the tip! After running against /dev/nvme0 I was able to run the self-test. If I run the long self-test when booted into my laptop, it causes the freezing behavior. Instead, if I boot into a live USB and run the command, it completes successfully without any errors.

dmesg doesn't show anything, I suspect because the entire drive goes into read-only mode or something. I never have any journalctl or other errors after rebooting. The only way I'm able to see things is by killing everything except PID 1 (sysrq+e) and then watching the terminal on the screen. I'll play around with ASPM next, to see if that might be the cause.

Given this info, I'm starting to think the problem is on the LVM or ext4 side?

eternal_floof · 2023-11-04 07:55:16

Okay, badblocks returned no errors. I strongly suspect LVM issues based on previous errors on my computer. I've reformatted the machine to use standard partitions and btrfs instead of LVM + ext4. We'll see how it goes *crosses fingers*

seth · 2023-11-04 08:27:28

nb. that ASPM/APST/trimming related issues will *NOT* show up w/ badblocks - that one tests whether there are unusable sectors on the disk, while the mentioned problems are bus and firmware related.

Arch Linux
