[Solved] Is my SSD dying?

quiqueck · 2024-12-08 10:34:24

After booting (and only after booting) my system sometimes hangs when opening e.g. a browser or when clicking on a mail for the first time in Thunderbird. "Hang" means I can move the mouse pointer around but can do nothing else: I can't click on program starters and keyboard shortcuts don't work. This lasts a few seconds. After that the system works normally, without any noticeable delay. Looking at dmesg after the hang:

[   80.212749] ata3.00: failed command: WRITE FPDMA QUEUED
[   80.212750] ata3.00: cmd 61/e8:f8:20:87:8a/00:00:05:00:00/40 tag 31 ncq dma 118784 out
                        res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[   80.212752] ata3.00: status: { DRDY }
[... some more of this ...]
[   80.212754] ata3: hard resetting link
[   80.675968] ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[   80.683338] ata3.00: configured for UDMA/133
[   80.693637] ahci 0000:01:00.1: port does not support device sleep
[   80.693748] sd 2:0:0:0: [sdb] tag#5 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=34s
[   80.693751] sd 2:0:0:0: [sdb] tag#5 CDB: Read(10) 28 00 02 f2 38 00 00 06 78 00
[   80.693753] I/O error, dev sdb, sector 49428480 op 0x0:(READ) flags 0x80700 phys_seg 19 prio class 0
[   80.693766] sd 2:0:0:0: [sdb] tag#6 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=34s
[   80.693767] sd 2:0:0:0: [sdb] tag#6 CDB: Read(10) 28 00 02 f2 3f 78 00 02 88 00
[   80.693768] I/O error, dev sdb, sector 49430392 op 0x0:(READ) flags 0x80700 phys_seg 6 prio class 0
[... some more of this ...]
[   80.693790] ata3: EH complete
[   87.019492] logitech-hidpp-device 0003:046D:2008.0006: HID++ 1.0 device connected.
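
As a cross-check on the log above, the Read(10) CDB bytes can be decoded by hand. In a SCSI READ(10) command, bytes 2-5 hold the big-endian logical block address and bytes 7-8 the transfer length in blocks. A minimal Python sketch (the function name `decode_read10` is made up for illustration) shows that the decoded address matches the sector reported in the I/O error line that follows each CDB:

```python
def decode_read10(cdb_hex: str) -> tuple[int, int]:
    """Return (lba, block_count) for a SCSI READ(10) CDB given as hex bytes."""
    cdb = bytes.fromhex(cdb_hex.replace(" ", ""))
    if len(cdb) != 10 or cdb[0] != 0x28:  # 0x28 is the READ(10) opcode
        raise ValueError("not a 10-byte READ(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")    # bytes 2-5: logical block address
    count = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: number of blocks
    return lba, count


# CDB of tag#5 from the dmesg output above
lba, count = decode_read10("28 00 02 f2 38 00 00 06 78 00")
print(lba, count)  # 49428480 1656
```

So the failed request is an ordinary read starting at sector 49428480, which is the same number the kernel prints in the "I/O error, dev sdb" line.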

I ran a long smartctl self-test, but at the end it says:

$:> sudo smartctl -H /dev/sdb
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.12.1-arch1-1] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

The output of "smartctl -x" can be found here

lsblk:

$:> lsblk
NAME   LABEL         FSTYPE   SIZE FSAVAIL MOUNTPOINT       UUID
sda                         232,9G                          
├─sda1 Windows       ntfs    39,1G                          D4104CE9104CD460
├─sda2                          1K                          
├─sda5 VMs           ext3    94,2G   69,1G /run/media/VMs   43baff62-585e-4802-b71b-3598be0bb2fb
├─sda6               swap     2,5G                          a6691214-03d7-4792-a31e-e6f49d7d35ab
└─sda7 Daten         ntfs      97G   24,6G /run/media/Daten 4688B73734B455D1
sdb                         447,1G                          
├─sdb1 arch_root     ext4      40G   19,2G /                dfb61aac-71a9-412e-9528-730c63750354
├─sdb2 arch_swap     swap       4G         [SWAP]           676f657f-55cc-4423-a2ae-6d6b2bda8f62
├─sdb3 arch_home     ext4     100G   18,6G /home            e46de322-93a7-43d2-adeb-49031a288e28
└─sdb4 daten_neu     ext4   303,1G                          1bbf4bfb-d079-438f-a9fa-7ba935123c9f
sdc                         119,2G                          
├─sdc1 arch_root_new ext4    19,6G                          f4633465-f70c-4b86-9e95-88ab9cac5773
└─sdc2 arch_home_new ext4    58,6G                          cc75b52d-0ac0-4240-b459-80cc905ae795
sdd                             0B                          
sde                             0B                          
sr0                          1024M

fstab:

$:> cat /etc/fstab
# 
# /etc/fstab: static file system information
#
# <file system> <dir>   <type>  <options>       <dump>  <pass>
# root /dev/sdb1
UUID=dfb61aac-71a9-412e-9528-730c63750354               /               ext4            rw,noatime,discard   0 1

# /dev/sdb2
UUID=676f657f-55cc-4423-a2ae-6d6b2bda8f62               none            swap            defaults        0 0

# Datenpartition /dev/sda7
UUID=4688B73734B455D1   /run/media/Daten        ntfs    defaults,uid=1000,utf8  0 0

# home /dev/sdb3
UUID=e46de322-93a7-43d2-adeb-49031a288e28       /home           ext4    defaults,noatime,discard             0 2

# Virtuelle Mashinen /dev/sda5
UUID=43baff62-585e-4802-b71b-3598be0bb2fb /run/media/VMs ext3  defaults  0 2

Should I trust smartctl?

Last edited by quiqueck (2025-06-01 08:33:34)

John-Something-Something · 2024-12-08 11:34:52

Man, your output looks very bad. The device stops answering after a while, and the worst part is that the journal confirms the driver is OK:

[ 80.693748] sd 2:0:0:0: [sdb] tag#5 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=34s

So, I'm not a hard drive specialist, but I would not trust SMART in this case, especially if you bought this drive for a strangely low price.
I once bought a very cheap SSD from an online shop. It was supposed to have 2 TB but didn't even have 128 GB. After 2 months it stopped working. Not even Windows SMART tools like DiskGenius could tell whether the SSD was dying or completely OK; even after tests and benchmarks, the disk looked fine. But then one day I left it doing a very large download (about 100 GB) and the next day it didn't turn on. It wasn't recognized by Linux, or by Windows on another computer; I changed the SATA port and still nothing.
If you still have time to look at these outputs, it's better to make a backup now.

But again, it's better to look into this with more experienced people; I'm not sure I'm right. Maybe something in the kernel is wrong; try testing with another OS or another installation.
Another time, my system started showing a lot of I/O errors and I had to install Arch again to get it working. Later I discovered that the installation ISO on my live USB was corrupted. Maybe it's something like this.

Last edited by John-Something-Something (2024-12-08 12:23:13)

Lone_Wolf · 2024-12-08 12:10:24

Below are the important numbers from the smartctl output.

I'm not great at interpreting those numbers, but Total_Bad_Block & Unexpect_Power_Loss_Ct look worrisome.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  5 Reallocated_Sector_Ct   -O--CK   100   100   000    -    0
  9 Power_On_Hours          -O--CK   100   100   000    -    14351
 12 Power_Cycle_Count       -O--CK   100   100   000    -    4044
165 Total_Write/Erase_Count -O--CK   100   100   000    -    2751
166 Min_W/E_Cycle           -O--CK   100   100   ---    -    8
167 Min_Bad_Block/Die       -O--CK   100   100   ---    -    0
168 Maximum_Erase_Cycle     -O--CK   100   100   ---    -    19
169 Total_Bad_Block         -O--CK   100   100   ---    -    787
170 Unknown_Marvell_Attr    -O--CK   100   100   ---    -    0
171 Program_Fail_Count      -O--CK   100   100   000    -    0
172 Erase_Fail_Count        -O--CK   100   100   000    -    0
173 Avg_Write/Erase_Count   -O--CK   100   100   000    -    8
174 Unexpect_Power_Loss_Ct  -O--CK   100   100   000    -    132
184 End-to-End_Error        -O--CK   100   100   ---    -    0
187 Reported_Uncorrect      -O--CK   100   100   000    -    0
188 Command_Timeout         -O--CK   100   100   ---    -    0
194 Temperature_Celsius     -O---K   080   054   000    -    20 (Min/Max 7/54)
199 SATA_CRC_Error          -O--CK   100   100   ---    -    0
230 Perc_Write/Erase_Count  -O--CK   100   100   000    -    1878 316 1878
232 Perc_Avail_Resrvd_Space PO--CK   100   100   005    -    100
233 Total_NAND_Writes_GiB   -O--CK   100   100   ---    -    3847
234 Perc_Write/Erase_Ct_BC  -O--CK   100   100   000    -    30943
241 Total_Writes_GiB        ----CK   100   100   000    -    12319
242 Total_Reads_GiB         ----CK   100   100   000    -    9492
244 Thermal_Throttle        -O--CK   000   100   ---    -    0

seth · 2024-12-08 15:00:24

[   80.212750] ata3.00: cmd 61/e8:f8:20:87:8a/00:00:05:00:00/40 tag 31 ncq dma 118784 out

https://wiki.archlinux.org/title/Solid_ … NCQ_errors
Also see https://wiki.archlinux.org/title/Power_ … Management - set that to "medium_power"

Then see whether the issue returns.

In the meantime, post the actual "[... some more of this ...]"

but at the end it says:

That assertion in smartctl and an empty bag are worth slightly less than the bag.

Total bad blocks looks scary, but you've only written ~4TB (Total_NAND_Writes_GiB), and while SanDisk doesn't seem to guarantee any TBW for that model (only 3 years), their other 3-year models range from 40TBW to 100TBW.
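
To put the ~4TB figure in perspective, here is a rough back-of-the-envelope calculation. It assumes the raw value of Total_NAND_Writes_GiB (3847) really is in GiB, and it uses the 40-100 TBW range of the other models only as a yardstick, not as a rating for this drive:

```python
GIB = 2**30   # bytes per GiB, the unit smartctl reports
TB = 10**12   # bytes per TB, the unit used in TBW ratings


def nand_writes_tb(raw_gib: int) -> float:
    """Convert a Total_NAND_Writes_GiB raw value to decimal terabytes."""
    return raw_gib * GIB / TB


written = nand_writes_tb(3847)  # raw value from the SMART table above
for tbw in (40, 100):           # yardstick range taken from other models
    print(f"{written:.2f} TB written = {written / tbw:.1%} of {tbw} TBW")
# 4.13 TB written = 10.3% of 40 TBW
# 4.13 TB written = 4.1% of 100 TBW
```

Even against the lower yardstick, the drive would be at roughly a tenth of its write endurance, so plain wear-out is not the most obvious explanation.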

quiqueck · 2024-12-08 16:54:22

All errors from the new boot are here

For the NCQ errors I have added the kernel option libata.force=3.00:noncq. Hopefully I interpreted the values correctly.
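
For reference, the ID part of libata.force is PORT[.DEVICE] and corresponds to the ATA ID string that libata prints in dmesg, so "ata3.00" becomes "3.00". A small Python sketch of that mapping (the helper name `noncq_parameter` is hypothetical):

```python
import re


def noncq_parameter(dmesg_line: str) -> str:
    """Build a libata.force=...:noncq option from a libata dmesg line."""
    # libata prefixes device messages with "ata<PORT>.<DEVICE>:"
    match = re.search(r"\bata(\d+)\.(\d+):", dmesg_line)
    if match is None:
        raise ValueError("no ataPORT.DEVICE id found in line")
    port, device = match.groups()
    return f"libata.force={port}.{device}:noncq"


line = "[   80.212749] ata3.00: failed command: WRITE FPDMA QUEUED"
print(noncq_parameter(line))  # libata.force=3.00:noncq
```

After a reboot, a line such as "ata3.00: FORCE: modified (noncq)" in dmesg would indicate that the parameter was picked up.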

Added a udev rule as suggested in https://wiki.archlinux.org/title/Power_ … Management

In the meantime I updated my rescue USB stick, and yes, I made a backup just yesterday.

Will running fsck manually make a difference?

The SSD has an unused partition of 300 GB, will it be safe to use this partition for my /home?

seth · 2024-12-08 19:26:23

https://pastebin.mozilla.org/zh7OTmDb is 404 - "libata.force=3.00:noncq" looks ok.
Did the errors return despite that value?

Will running fsck manually make a difference?

fsck can sanitize the filesystem if it has been corrupted as a consequence of the IO errors; it cannot do anything about the IO errors themselves.
Those happen on a lower level.

The SSD has an unused partition of 300 GB, will it be safe to use this partition for my /home?

Impossible to say.
1. Is there a structural problem with the drive in the first place, or is this just bogus ncq/alpm?
2. If yes, is it sector degradation or does the controller give up?

A mild hint would be whether always the same sectors are affected (then you're probably facing bad sectors/blocks indeed).
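
One way to check that hint is to pull the sector numbers out of the kernel's "I/O error" lines and count repeats. This is only a sketch: the regular expression is modeled on the two error lines posted in this thread, and `summarize_io_errors` is a made-up name.

```python
import re
from collections import Counter

# Modeled on: "I/O error, dev sdb, sector 49428480 op 0x0:(READ) flags ..."
SECTOR_RE = re.compile(r"I/O error, dev (\w+), sector (\d+) op \S+:\((\w+)\)")


def summarize_io_errors(lines):
    """Return (sectors seen more than once, error count per operation)."""
    sectors = Counter()
    ops = Counter()
    for line in lines:
        m = SECTOR_RE.search(line)
        if m:
            dev, sector, op = m.groups()
            sectors[(dev, int(sector))] += 1
            ops[op] += 1
    repeated = {key: n for key, n in sectors.items() if n > 1}
    return repeated, ops


log = [
    "[   80.693753] I/O error, dev sdb, sector 49428480 op 0x0:(READ) flags 0x80700 phys_seg 19 prio class 0",
    "[   80.693768] I/O error, dev sdb, sector 49430392 op 0x0:(READ) flags 0x80700 phys_seg 6 prio class 0",
]
repeated, ops = summarize_io_errors(log)
print(repeated, dict(ops))  # {} {'READ': 2}
```

Fed with the errors of several boots, an empty `repeated` result would point away from specific bad blocks, while the same sectors showing up again and again would point towards them.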

quiqueck · 2024-12-08 19:44:35

seth wrote:

https://pastebin.mozilla.org/zh7OTmDb is 404

Sorry, forgot to change the time, new pastebin: https://pastebin.mozilla.org/cZnN1MaH

Edit:
rsynced the old /home to the other partition. I'll see whether I use this partition in fstab as /home.

Last edited by quiqueck (2024-12-08 19:46:38)

seth · 2024-12-08 20:07:05

The sectors are all over the place, errors are mostly read (NAND typically loses the ability to write, not read) - but ncq is still active.
This was before you disabled it?

quiqueck · 2024-12-09 09:40:55

seth wrote:

The sectors are all over the place, errors are mostly read (NAND typically loses the ability to write, not read) - but ncq is still active.
This was before you disabled it?

Yes, you asked for that

seth wrote:

In the meantime, post the actual "[... some more of this ...]"

After booting this morning (NCQ not used) all seems fine:

[    1.903897] ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[    1.904368] ata3.00: FORCE: modified (noncq)
[    1.904584] ata3.00: ATA-9: SanDisk SSD PLUS 480GB, UG5000RL, max UDMA/133
[    1.904588] ata3.00: 937721856 sectors, multi 1: LBA48 NCQ (not used)
[    1.905412] ata3.00: Features: Dev-Sleep
[    1.908922] ata3.00: configured for UDMA/133

and:

$:> sudo dmesg | grep sdb
[    1.920173] sd 2:0:0:0: [sdb] 937721856 512-byte logical blocks: (480 GB/447 GiB)
[    1.920192] sd 2:0:0:0: [sdb] Write Protect is off
[    1.920194] sd 2:0:0:0: [sdb] Mode Sense: 00 3a 00 00
[    1.920208] sd 2:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[    1.920229] sd 2:0:0:0: [sdb] Preferred minimum I/O size 512 bytes
[    1.972498]  sdb: sdb1 sdb2 sdb3 sdb4
[    1.972666] sd 2:0:0:0: [sdb] Attached SCSI disk
[    5.705156] EXT4-fs (sdb1): mounted filesystem dfb61aac-71a9-412e-9528-730c63750354 r/w with ordered data mode. Quota mode: none.
[    6.666982] EXT4-fs (sdb1): re-mounted dfb61aac-71a9-412e-9528-730c63750354 r/w. Quota mode: none.
[    7.785018] Adding 4194300k swap on /dev/sdb2.  Priority:-2 extents:1 across:4194300k SS
[    7.987072] EXT4-fs (sdb3): mounted filesystem e46de322-93a7-43d2-adeb-49031a288e28 r/w with ordered data mode. Quota mode: none.

Many thanks, seth, for the hints! I guess I have to keep an eye on this...

Checking my journal, I found that those errors had already appeared in September:

Sep 06 20:02:10 kaputtnik-arch kernel: ata3.00: exception Emask 0x0 SAct 0x80c007fe SErr 0x40000 action 0x6 frozen
Sep 06 20:02:10 kaputtnik-arch kernel: ata3: SError: { CommWake }
Sep 06 20:02:10 kaputtnik-arch kernel: ata3.00: failed command: READ FPDMA QUEUED
Sep 06 20:02:10 kaputtnik-arch kernel: ata3.00: cmd 60/08:08:f8:1f:e6/00:00:08:00:00/40 tag 1 ncq dma 4096 in
                                                res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Sep 06 20:02:10 kaputtnik-arch kernel: ata3.00: status: { DRDY }

The closest system upgrade was on the same day, about 8 hours before this happened for the first time. Could the kernel upgrade have caused this?

[2024-09-06T11:40:05+0200] [ALPM] upgraded linux (6.10.7.arch1-1 -> 6.10.8.arch1-1)
[2024-09-06T11:40:07+0200] [ALPM] upgraded linux-headers (6.10.7.arch1-1 -> 6.10.8.arch1-1)

Not sure if I should mark this as solved right now.

quiqueck · 2024-12-09 09:54:37

Forgot to mention that I still use the old /home partition.

seth · 2024-12-09 12:51:04

Yes, you asked for that

Yes, just wanted to make sure the ncq-disabling parameter actually applied.

I guess I have to keep an eye on this...

Jup. You could set up a cron job or systemd timer that scans the journal of the past 24h once a day and uses notify-send to send you a message in case something shows up.
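
A minimal sketch of such a check in Python, under a few assumptions: the patterns are taken from the error lines posted in this thread and are not exhaustive, the function names are made up, and `_TRANSPORT=kernel` is used instead of `journalctl -k` because `-k` implies the current boot only.

```python
import re
import subprocess

# Patterns modeled on the messages seen in this thread (not exhaustive).
PATTERNS = re.compile(
    r"ata\d+(\.\d+)?: (failed command|exception Emask|hard resetting link)"
    r"|I/O error, dev \w+"
)


def suspicious_lines(lines):
    """Return the journal lines that look like ATA or block I/O errors."""
    return [line for line in lines if PATTERNS.search(line)]


def read_kernel_journal(since="24 hours ago"):
    """Read kernel messages from the journal, across boots."""
    result = subprocess.run(
        ["journalctl", "_TRANSPORT=kernel", "--since", since, "--no-pager"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()


def main():
    """Entry point for a daily cron job or systemd timer."""
    hits = suspicious_lines(read_kernel_journal())
    if hits:
        # Show at most ten lines so the notification stays readable.
        subprocess.run(
            ["notify-send", "Disk errors in journal", "\n".join(hits[:10])],
            check=False,
        )
```

A script would call `main()` at the end. Since notify-send needs access to the desktop session, running it from a user-level timer (systemctl --user) is probably easier than from a system-wide cron job.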

quiqueck · 2025-06-01 08:33:02

Although I haven't noticed any errors since applying the changes above, a second hard drive also showed some strange smartctl results, so I bought a new 1TB SSD and replaced the questionable drives. I also removed the NCQ-related kernel parameter and the power management settings.

All seems fine now.

Thanks again for the comprehensive help!

Solved
