Understanding NVMe drive health

ashic · 2020-12-05 13:02:52

I've been using my trusty Dell Precision 5510 for ages (over 5 years now). The machine's still running well, but the one thing that keeps me a bit worried is the NVMe drive. I understand that more recent SSDs are built for extended usage, but I'm not sure whether a 5-year-old one counts as "modern" these days. The laptop is my workhorse (it was a maxed-out configuration in 2015) and runs pretty much all day, every day. I ran smartctl, and the results are given below. I'm new to smartctl and am trying to understand the results. I see that the drive "passed" and that the vendor reckons I've used only 4% of the drive's lifespan. Is my interpretation correct? Does anything look concerning? I have force-killed the power from time to time due to lock-ups when resuming from sleep, etc.

Reading up on smartctl, I see that the expected run time for a -t long is 20-30 minutes. However, it's returning instantly for me. Does that mean the information is not accurate?

>smartctl -t long -a /dev/nvme0n1
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.9.10-arch1-1] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       PM951 NVMe SAMSUNG 1024GB
Serial Number:                      S2FZNXAGA09940
Firmware Version:                   BXV77D0Q
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Controller ID:                      1
Number of Namespaces:               1
Namespace 1 Size/Capacity:          1,024,209,543,168 [1.02 TB]
Namespace 1 Utilization:            881,679,765,504 [881 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            002538 455a1b26d4
Local Time is:                      Sat Dec  5 13:01:43 2020 GMT
Firmware Updates (0x06):            3 Slots
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x001f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat
Maximum Data Transfer Size:         32 Pages

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     6.00W       -        -    0  0  0  0        5       5
 1 +     4.20W       -        -    1  1  1  1       30      30
 2 +     3.10W       -        -    2  2  2  2      100     100
 3 -   0.0700W       -        -    3  3  3  3      500    5000
 4 -   0.0050W       -        -    4  4  4  4     2000   22000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        31 Celsius
Available Spare:                    100%
Available Spare Threshold:          50%
Percentage Used:                    4%
Data Units Read:                    33,049,228 [16.9 TB]
Data Units Written:                 59,137,420 [30.2 TB]
Host Read Commands:                 593,366,012
Host Write Commands:                1,210,173,947
Controller Busy Time:               12,192
Power Cycles:                       6,451
Power On Hours:                     19,872
Unsafe Shutdowns:                   729
Media and Data Integrity Errors:    0
Error Information Log Entries:      8,351

Error Information (NVMe Log 0x01, max 64 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
  0       8351     0  0x000a  0x4004  0x000            0     1     -
  1       8350     0  0x0007  0x4004  0x000            0     -     -
  2       8349     0  0x0002  0x4004  0x000            0     0     -
  3       8348     0  0x0006  0x4004  0x000            0     0     -
  4       8347     0  0x000c  0x4004  0x000            0     0     -
  5       8346     0  0x000e  0x4004  0x000            0     0     -
  6       8345     0  0x0016  0x4004  0x000            0     0     -
  7       8344     0  0x0017  0x4004  0x000            0     0     -
  8       8343     0  0x001a  0x4004  0x000            0     0     -
  9       8342     0  0x001b  0x4004  0x000            0     0     -
 10       8341     0  0x000c  0x4004  0x000            0     0     -
 11       8340     0  0x0014  0x4004  0x000            0     0     -
 12       8339     0  0x0007  0x4004  0x000            0     0     -
 13       8338     0  0x000e  0x4004  0x000            0     0     -
 14       8337     0  0x0016  0x4004  0x000            0     0     -
 15       8336     0  0x0006  0x4004  0x000            0     0     -
... (48 entries not shown)

merlock · 2020-12-05 15:45:47

~20K power-on hours is getting up there. I have a spinner from 2011 that has 29K power-on hours (it's not used for anything critical anymore).

You might want to start thinking about a replacement strategy; you've got 30TB of data written to that drive.

You might also want to look into nvme-cli.

ashic · 2020-12-05 16:24:55

Thanks. So, what things should I be looking at with nvme-cli?

I see that the Samsung drive is rated for an MTBF of 1.5 million hours and 300 TB of writes. Of course, I'm not expecting it to last that long. But where can I find information on what a safe range is before replacing it? I don't want to replace it unnecessarily if it's at 4-10% of its actual life, especially since I might get a new machine next year or the year after, depending on how Intel responds to the Apple M1. The drive crashing wouldn't be much more than a small inconvenience, as I have things regularly backed up offline. But I'm curious to know what warning signs I can look out for.
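As a rough check on the figures above, here is a minimal Python sketch that converts the "Data Units Written" counter into terabytes and compares it with the endurance rating. The counter and power-on hours are copied from the smartctl output in the first post; the 300 TB figure is the rating quoted in this post and is treated as an assumption, as is the idea that the write rate stays constant. The function names are made up for illustration.

```python
# Figures copied from the smartctl output in the first post.
DATA_UNITS_WRITTEN = 59_137_420
POWER_ON_HOURS = 19_872

# The NVMe specification defines one data unit as 1000 units of 512 bytes.
BYTES_PER_DATA_UNIT = 1000 * 512

# Assumption: the 300 TB endurance rating quoted above, in decimal terabytes.
RATED_TB_WRITTEN = 300.0


def data_units_to_tb(units: int) -> float:
    """Convert an NVMe data-unit counter to decimal terabytes."""
    return units * BYTES_PER_DATA_UNIT / 1e12


def endurance_summary(units: int, hours: int, rated_tb: float) -> dict:
    """Summarise writes so far against a rated endurance figure."""
    written_tb = data_units_to_tb(units)
    tb_per_hour = written_tb / hours
    return {
        "written_tb": written_tb,
        "fraction_of_rating": written_tb / rated_tb,
        # Naive extrapolation: assumes the past write rate simply continues.
        "hours_left_at_same_rate": (rated_tb - written_tb) / tb_per_hour,
    }


if __name__ == "__main__":
    s = endurance_summary(DATA_UNITS_WRITTEN, POWER_ON_HOURS, RATED_TB_WRITTEN)
    print(f"written so far: {s['written_tb']:.1f} TB")
    print(f"share of rating: {s['fraction_of_rating']:.1%}")
    print(f"power-on hours left at this rate: {s['hours_left_at_same_rate']:,.0f}")
```

By this count the drive is at roughly a tenth of the quoted rating, which is higher than the 4% the drive reports as "Percentage Used" but still far from the limit. The two numbers need not agree, since "Percentage Used" is the vendor's own estimate.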

graysky · 2020-12-05 16:39:20

20K hours isn't shit...

# smartctl -a /dev/sdb | grep Power_On_H
  9 Power_On_Hours          0x0032   001   001   000    Old_age   Always       -       86451

Not even an enterprise-level drive:

smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.4.80-2-lts] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Green (AF)
Device Model:     WDC WD20EARS-00MVWB0
...

I would think SSDs have way more life in them as far as power-on hours go... my understanding is that I/O is what wears them.

Last edited by graysky (2020-12-05 16:39:58)

V1del · 2020-12-05 17:49:31

To answer the initial question: a -t long always returns instantly (even on spinners where the long test takes 6 hours). The command only starts the test, which then runs in the background on the drive. You can keep using everything normally and check the result with smartctl -a once the expected test time has elapsed.

That said, I also don't see any reason for concern in these outputs.
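For anyone who wants to watch for warning signs over time, the following is a minimal Python sketch that parses the "Name: value" lines of smartctl's NVMe health section and flags a few conditions. The sample text is trimmed from the output in the first post. The function names and the choice of which fields to flag are assumptions made for illustration, not an official rule set; the sketch deliberately leaves out unsafe shutdowns and error log entries.

```python
import re

# Trimmed copy of the SMART/Health section from the first post.
SAMPLE = """\
Critical Warning:                   0x00
Available Spare:                    100%
Available Spare Threshold:          50%
Percentage Used:                    4%
Unsafe Shutdowns:                   729
Media and Data Integrity Errors:    0
Error Information Log Entries:      8,351
"""


def parse_health(text: str) -> dict:
    """Map each 'Name: value' line to its leading integer (decimal or 0x hex)."""
    fields = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        match = re.match(r"\s*(0x[0-9a-fA-F]+|\d[\d,]*)", value)
        if sep and match:
            raw = match.group(1).replace(",", "")
            base = 16 if raw.lower().startswith("0x") else 10
            fields[name.strip()] = int(raw, base)
    return fields


def warning_signs(h: dict) -> list:
    """Return human-readable warnings; an empty list means nothing was flagged."""
    signs = []
    if h.get("Critical Warning", 0) != 0:
        signs.append("critical warning flags are set")
    if h.get("Available Spare", 100) < h.get("Available Spare Threshold", 0):
        signs.append("available spare is below its threshold")
    if h.get("Media and Data Integrity Errors", 0) > 0:
        signs.append("media and data integrity errors recorded")
    if h.get("Percentage Used", 0) >= 100:
        signs.append("drive has reached its estimated endurance")
    return signs


if __name__ == "__main__":
    print(warning_signs(parse_health(SAMPLE)) or "nothing flagged")
```

Run against the sample above, none of the four checks fires.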
