
#1 2020-03-31 18:36:03

daanjderuiter
Member
Registered: 2016-12-28
Posts: 44

NVMe SSD I/O errors, not sure if hardware failure

A few days ago, I returned to my PC to find it had reached some failed state in which my root partition was remounted read-only. Not thinking too much of it, I force-rebooted. Today it happened a second time, and I'm starting to suspect a hardware failure of the offending disk, a Kingston A2000 500GB NVMe SSD. The dmesg output can be found here (sorry for posting it as a picture; I couldn't write it anywhere, as pretty much any operation in my terminal just gave I/O errors). The disk holds the EFI system partition as well as a LUKS container with LVM on top, which contains two ext4 logical volumes. Is this a known bug in recent kernels for this specific combination of hardware and software? Or is it more likely an indicator of hardware failure? In the latter case, I'll just get the drive RMA'd.

Offline

#2 2020-03-31 19:25:12

loqs
Member
Registered: 2014-03-06
Posts: 11,239

Re: NVMe SSD I/O errors, not sure if hardware failure

Please boot from live media, run a S.M.A.R.T. self-test on the device, then post the result.  See the tip box on the pastebin wiki page for posting from the console.  You could also try posting the system's journal from the live media.
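For reference, the live-media sequence might look roughly like this (device name, LUKS partition, and LV names are assumptions from the setup described above; adjust to your layout):

```shell
# Start a short NVMe self-test (needs smartmontools and a drive that
# implements the optional Self_Test admin command):
smartctl -t short /dev/nvme0n1
# After a minute or two, read back the self-test log:
smartctl -l selftest /dev/nvme0n1

# To post the installed system's journal, open the LUKS container,
# mount the root LV, and point journalctl at its journal directory:
cryptsetup open /dev/nvme0n1p2 cryptroot
mount /dev/mapper/vg0-root /mnt
journalctl -D /mnt/var/log/journal -b -1 -p 3
```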

Offline

#3 2020-03-31 20:14:23

daanjderuiter
Member
Registered: 2016-12-28
Posts: 44

Re: NVMe SSD I/O errors, not sure if hardware failure

I am not sure if this is normal for NVMe SSDs, but my device doesn't seem to support any of the standard SMART self-tests:

# smartctl -c /dev/nvme0n1
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.4.8-arch1-1] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Firmware Updates (0x14):            2 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Maximum Data Transfer Size:         32 Pages
Warning  Comp. Temp. Threshold:     75 Celsius
Critical Comp. Temp. Threshold:     80 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     9.00W       -        -    0  0  0  0        0       0
 1 +     4.60W       -        -    1  1  1  1        0       0
 2 +     3.80W       -        -    2  2  2  2        0       0
 3 -   0.0450W       -        -    3  3  3  3     2000    2000
 4 -   0.0040W       -        -    4  4  4  4    15000   15000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

Insofar as it's of any interest, it does report as healthy:

# smartctl -a /dev/nvme0n1
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.4.8-arch1-1] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       KINGSTON SA2000M8500G
Serial Number:                      50026B72824BD850
Firmware Version:                   S5Z42102
PCI Vendor/Subsystem ID:            0x2646
IEEE OUI Identifier:                0x000000
Controller ID:                      1
Number of Namespaces:               1
Namespace 1 Size/Capacity:          500,107,862,016 [500 GB]
Namespace 1 Utilization:            177,420,300,288 [177 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            0026b7 2824bd8505
Local Time is:                      Tue Mar 31 22:10:59 2020 CEST
Firmware Updates (0x14):            2 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Maximum Data Transfer Size:         32 Pages
Warning  Comp. Temp. Threshold:     75 Celsius
Critical Comp. Temp. Threshold:     80 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     9.00W       -        -    0  0  0  0        0       0
 1 +     4.60W       -        -    1  1  1  1        0       0
 2 +     3.80W       -        -    2  2  2  2        0       0
 3 -   0.0450W       -        -    3  3  3  3     2000    2000
 4 -   0.0040W       -        -    4  4  4  4    15000   15000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        27 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    3,012,278 [1.54 TB]
Data Units Written:                 2,737,931 [1.40 TB]
Host Read Commands:                 27,915,108
Host Write Commands:                49,944,984
Controller Busy Time:               458
Power Cycles:                       340
Power On Hours:                     601
Unsafe Shutdowns:                   52
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0

Error Information (NVMe Log 0x01, max 256 entries)
No Errors Logged

The dmesg errors shown before are all the logged errors available: once the root partition is remounted read-only, journald can no longer write to the journal.
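For next time, one way to capture the errors even with a read-only root (assuming /tmp is a tmpfs, as on a default Arch setup) might be:

```shell
# tmpfs mounts stay writable when the root filesystem flips read-only,
# so dmesg can still be saved there:
dmesg > /tmp/dmesg.txt
# then copy it to another machine before the forced reboot
# (host and user are placeholders):
scp /tmp/dmesg.txt user@otherhost:
```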

Offline

#4 2020-03-31 20:32:35

loqs
Member
Registered: 2014-03-06
Posts: 11,239

Re: NVMe SSD I/O errors, not sure if hardware failure

Please post the journal or dmesg.  The screenshot you linked shows errors from the ext4 filesystem and the JBD2 journaling layer on the device-mapper device, but none from the NVMe driver, so it may not be a hardware issue.  However, the NVMe errors may simply not have been captured in that screenshot.

Offline

#5 2020-03-31 20:39:25

daanjderuiter
Member
Registered: 2016-12-28
Posts: 44

Re: NVMe SSD I/O errors, not sure if hardware failure

The thing is, I cannot reliably reproduce the issue: the smartctl output above was obtained from a live USB, and the failure appeared seemingly out of the blue, at which point I took a crappy photograph and force-stopped my PC. I will keep using the machine until I run into the issue again and post the output once I do (thanks for the pastebin tip, by the way!).

Offline

#6 2020-04-01 06:47:43

seth
Member
Registered: 2012-09-03
Posts: 15,750

Re: NVMe SSD I/O errors, not sure if hardware failure

a) do you access the FS from multiple OSes (e.g. dual booting)?
b) did you fsck it?
c) badblocks can probably trigger the problem: https://wiki.archlinux.org/index.php/Badblocks
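For (c), a non-destructive pass over the whole device might look like this (device name assumed; run it from live media so nothing on the drive is mounted):

```shell
# -s: show progress, -v: verbose; read-only mode is the default,
# so this does not write to the drive:
badblocks -sv /dev/nvme0n1
```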

Offline

#7 2020-07-27 09:15:43

fedev
Member
Registered: 2017-09-21
Posts: 7

Re: NVMe SSD I/O errors, not sure if hardware failure

daanjderuiter wrote:

A few days ago, I returned to my PC to find it had reached some failed state in which my root partition was remounted read-only. Not thinking too much of it, I force-rebooted. Today it happened a second time, and I'm starting to suspect a hardware failure of the offending disk, a Kingston A2000 500GB NVMe SSD. The dmesg output can be found here (sorry for posting it as a picture; I couldn't write it anywhere, as pretty much any operation in my terminal just gave I/O errors). The disk holds the EFI system partition as well as a LUKS container with LVM on top, which contains two ext4 logical volumes. Is this a known bug in recent kernels for this specific combination of hardware and software? Or is it more likely an indicator of hardware failure? In the latter case, I'll just get the drive RMA'd.

Were you ever able to figure it out? I have a Kingston A2000 as well (brand new), and I've also noticed Arch remounting it read-only every now and then (I'm still trying to figure out what triggers it). I'm using Btrfs rather than ext4, but likewise inside a LUKS-encrypted partition with LVM on top.

One thing I noticed is that in my case the drive disappears from the device list in the BIOS and only comes back after a reboot; I can only see this if I go into the UEFI settings right away.

When the issue happens, rather than powering off or doing a hard shutdown, I use the magic SysRq keys to reboot.
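For anyone unfamiliar, the SysRq route looks something like this (value 1 enables all functions; a narrower bitmask works too):

```shell
# Check whether magic SysRq is enabled; 0 = disabled:
cat /proc/sys/kernel/sysrq
# Enable it for the current boot (this still works with a read-only
# root, since /proc is a separate virtual filesystem):
echo 1 > /proc/sys/kernel/sysrq
# Then the usual REISUB sequence on the keyboard:
#   Alt+SysRq+r e i s u b  (unraw, term, kill, sync, umount, reboot)
```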

The nvme-cli tool doesn't show any errors other than unsafe_shutdowns (expected, since the drive disappears).
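The checks I mean are roughly these (nvme-cli package; device name assumed):

```shell
# Health counters, including unsafe_shutdowns and media_errors:
nvme smart-log /dev/nvme0
# Controller error log, which should list any logged NVMe errors:
nvme error-log /dev/nvme0
```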

Offline

#8 2020-08-07 00:45:45

Brayden.Dean
Member
Registered: 2015-09-18
Posts: 6

Re: NVMe SSD I/O errors, not sure if hardware failure

fedev wrote:

Were you ever able to figure it out? I have a Kingston A2000 as well (brand new) and I also noticed Arch trying to mount it read-only every now and then (I'm trying to still figure out what triggers it). I have it with BTRFS and not ext4 but also in a luks encrypted partition with lvm in it.

One thing I noticed is that in my case, the drive disappears from the device list in the BIOS and only comes back once I reboot. I can only see this if I go into the UEFI settings right away.

When the issue happens rather than powering off or a hard shutdown, I use the magic sysrq to reboot.

The nvme-cli tool doesn't show any errors other than unsafe_shutdowns (expected since the drive disappears).

Happens to me too. I have a Kingston SA400S3 (according to fdisk; I don't remember whether that's actually the model).

I always have to reboot and run fsck on the drive manually to get it back. No idea what causes it or how to fix it; it has persisted through updates on this SSD.

Offline

#9 2020-08-07 03:35:51

fedev
Member
Registered: 2017-09-21
Posts: 7

Re: NVMe SSD I/O errors, not sure if hardware failure

Brayden.Dean wrote:

Happens to me too, I have a Kingston SA400S3 (according to fdisk, don't remember what it is if that's actually the model)

I always have to reboot and fsck the drive manually to force it. No idea what causes it and how to fix it. Has persisted through updates on this SSD.

This happened frequently, but it seems to have improved after adding:

nvme_core.default_ps_max_latency_us=0

as a kernel parameter in GRUB. It has to do with a power-saving feature (APST) which either the drive or my laptop does not support. There is a mention of it here:

https://wiki.archlinux.org/index.php/So … drive/NVMe

Almost a week without the issue so far (since I added the parameter). It could be luck, but you might want to try it as well.
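For anyone wanting to try the same thing, the GRUB side might look like this (the sed pattern assumes the stock Arch /etc/default/grub layout; run as root):

```shell
# Append the parameter to GRUB's default kernel command line:
sed -i 's/^GRUB_CMDLINE_LINUX_DEFAULT="\([^"]*\)"/GRUB_CMDLINE_LINUX_DEFAULT="\1 nvme_core.default_ps_max_latency_us=0"/' /etc/default/grub
# Regenerate the config and reboot:
grub-mkconfig -o /boot/grub/grub.cfg
# After the reboot, confirm the kernel actually saw it:
grep -o 'nvme_core.default_ps_max_latency_us=0' /proc/cmdline
```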

Offline
