
#1 2022-01-03 03:45:19

Xyne
Forum Fellow
Registered: 2008-08-03
Posts: 6,965
Website

Interpretation of SMART attributes for failing disk with passing tests

My new year started with my home partition suddenly getting remounted read-only. This is an LVM logical volume in a volume group on a single physical partition of an internal SATA HDD in a laptop. That prompted me to search my error logs, where I found several WRITE_FPDMA_QUEUED errors going back as far as mid-December:

Jan  2 00:41:24 foo kernel: ata6.00: failed command: WRITE FPDMA QUEUED
Jan  2 00:41:24 foo kernel: ata6.00: cmd 61/20:80:b0:91:44/00:00:0c:00:00/40 tag 16 ncq dma 16384 out
         res 40/00:80:b0:91:44/00:00:0c:00:00/40 Emask 0x10 (ATA bus error)
Jan  2 00:41:24 foo kernel: ata6.00: status: { DRDY }

I also have several instances of the following error, but it seems to be related to the NVMe disk and a bug in the 5.15 kernel when resuming from suspend, which has already been patched for the upcoming 5.16 release (https://github.com/cybik/tk-quanta-elukxvi/issues/6).

Jan  1 22:45:17 foo kernel: PM: dpm_run_callback(): usb_dev_resume+0x0/0x10 returns -5
Jan  1 22:45:17 foo kernel: usb 1-10: PM: failed to resume async: error -5

My data is already backed up because I regularly sync to several other systems and backup repos. I plan to replace the disk tomorrow as soon as I can get my hands on a new one.

However, I'm taking this as a learning experience to devise preventative measures to catch such errors before they possibly cause data loss. I already had smartd configured to run regular short and long tests and notify me of any failures. The "problem" is that all of the tests passed. After the ro-remount, I booted with a live medium and reran a long test that also passed.

I understand from this thread that the tests themselves are not enough and that one has to interpret the reported attributes, which I had never looked at before now. However, in that thread it was obvious that several critical disk attributes indicated pending disk failure. Can someone tell me which attributes in the smartctl -a output below are most indicative of pending failure in my case?

smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.12-arch1-1] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 2.5 5400
Device Model:     ST1000LM048-2E7172
Serial Number:    WDEYT38D
LU WWN Device Id: 5 000c50 0ad25148d
Firmware Version: SDM1
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Form Factor:      2.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 3b
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Jan  3 04:20:16 2022 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
					was never started.
					Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(    0) seconds.
Offline data collection
capabilities: 			 (0x71) SMART execute Offline immediate.
					No Auto Offline data collection support.
					Suspend Offline collection upon new
					command.
					No Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   1) minutes.
Extended self-test routine
recommended polling time: 	 ( 163) minutes.
Conveyance self-test routine
recommended polling time: 	 (   2) minutes.
SCT capabilities: 	       (0x3035)	SCT Status supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   081   064   006    Pre-fail  Always       -       140218683
  3 Spin_Up_Time            0x0003   099   099   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       999
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   074   060   045    Pre-fail  Always       -       25978183
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       4393 (40 140 0)
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       996
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   098   000    Old_age   Always       -       3
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   060   050   040    Old_age   Always       -       40 (Min/Max 20/45)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       7
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       26
193 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always       -       214140
194 Temperature_Celsius     0x0022   040   050   000    Old_age   Always       -       40 (0 17 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   194   000    Old_age   Always       -       286
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       1524 (160 22 0)
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       2191066965
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       3567900969
254 Free_Fall_Sensor        0x0032   100   100   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      4392         -
# 2  Extended offline    Completed without error       00%      4385         -
# 3  Extended offline    Completed without error       00%      4377         -
# 4  Extended offline    Completed without error       00%      4372         -
# 5  Extended offline    Completed without error       00%      4368         -
# 6  Short offline       Completed without error       00%      4365         -
# 7  Extended offline    Completed without error       00%      4360         -
# 8  Short offline       Completed without error       00%      4305         -
# 9  Short offline       Completed without error       00%      4245         -
#10  Short offline       Completed without error       00%      4170         -
#11  Short offline       Completed without error       00%      4072         -
#12  Extended offline    Completed without error       00%      4017         -
#13  Short offline       Completed without error       00%      4006         -
#14  Short offline       Completed without error       00%      3993         -
#15  Short offline       Completed without error       00%      3927         -
#16  Short offline       Completed without error       00%      3853         -
#17  Extended offline    Completed without error       00%      3807         -
#18  Short offline       Completed without error       00%      3753         -
#19  Short offline       Completed without error       00%      3729         -
#20  Short offline       Completed without error       00%      3671         -
#21  Short offline       Completed without error       00%      3605         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

I would guess that Raw_Read_Error_Rate and Seek_Error_Rate are the main indicators but the descriptions of these attributes indicate that they're not directly interpretable. They seem to increment continuously regardless of disk usage.
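From what I've read on various forums (unverified folklore, not vendor documentation, so treat this as an assumption), Seagate packs these raw values as 48-bit fields: the upper 16 bits hold the actual error count and the lower 32 bits a rolling operation count, which would explain why the number climbs with use even when nothing is wrong. A minimal sketch of that interpretation:

```python
def decode_seagate_rate(raw: int) -> tuple[int, int]:
    """Split a Seagate Raw_Read_Error_Rate / Seek_Error_Rate raw value.

    Assumes the folklore layout: upper 16 bits of a 48-bit value are the
    error count, lower 32 bits are the number of operations performed.
    """
    errors = raw >> 32             # upper bits: actual error count
    operations = raw & 0xFFFFFFFF  # lower 32 bits: total seeks/reads
    return errors, operations

# The Seek_Error_Rate raw value from the output above:
print(decode_seagate_rate(25978183))  # -> (0, 25978183)
```

Under that reading, both of my counters decode to zero actual errors, which would at least be consistent with the passing self-tests.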

My plan is to write a service that periodically checks the disk attributes and notifies me of any values that indicate poor disk health despite the passing SMART tests. I've also written a service to periodically check for system errors via journalctl and notify me when any occur to avoid these little surprises in the future.
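As a rough sketch of what the attribute check might look like (the logic here is my own guess, nothing authoritative): parse the `smartctl -A` table and flag any attribute whose normalized VALUE has fallen to or below its THRESH, which is the condition SMART itself uses to fail a drive.

```python
def weak_attributes(smartctl_output: str) -> list[tuple[int, str, int, int]]:
    """Return (id, name, value, thresh) for attributes in `smartctl -A`
    output whose normalized VALUE is at or below the failure THRESH."""
    rows = []
    for line in smartctl_output.splitlines():
        f = line.split()
        # Attribute rows start with a numeric ID and have at least
        # ten columns: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED ...
        if len(f) >= 10 and f[0].isdigit():
            value, thresh = int(f[3]), int(f[5])
            if thresh > 0 and value <= thresh:
                rows.append((int(f[0]), f[1], value, thresh))
    return rows
```

A real service would run `smartctl -A /dev/sda` via subprocess on a timer and push any hits through the notification channel.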

Incidentally, I read that the errors may be due to a faulty connection or electrical problem, so I opened up the laptop, took out the disk, cleaned the connectors and put it back in. Since then I haven't had a single error. It's a laptop with an Nvidia card and Bumblebee, so I wonder if it could have been a transient electrical error from power-cycling the Nvidia card. I'm still going to change the disk, but I wonder if it's really borked already or if it still has a little life left in it for non-critical uses.


My Arch Linux Stuff · Forum Etiquette · Community Ethos - Arch is not for everyone


#2 2022-01-03 04:17:16

merlock
Member
Registered: 2018-10-30
Posts: 262

Re: Interpretation of SMART attributes for failing disk with passing tests

Xyne wrote:

I would guess that Raw_Read_Error_Rate and Seek_Error_Rate are the main indicators but the descriptions of these attributes indicate that they're not directly interpretable. They seem to increment continuously regardless of disk usage.

I wonder about 'increment continuously'.  It would only increment on an error, right? 

I would also factor in the power-on time as well (components do just wear out).  That said, how old is this drive?  4393 hours is not a lot.  My spinner has 38724 hours, and no Raw_Read_Error_Rate or Seek_Error_Rate errors.

Xyne wrote:

My plan is to write a service that periodically checks the disk attributes and notifies me of any values that indicate poor disk health despite the passing SMART tests. I've also written a service to periodically check for system errors via journalctl and notify me when any occur to avoid these little surprises in the future.

That's a neat idea.  Post it on Community Contributions when you get it ready!

Xyne wrote:

Incidentally, I read that the errors may be due to a faulty connection or electrical problem so I opened up the laptop, took out the disk, cleaned the connectors and put it back in. Since then I haven't had a single error. It's a laptop with an nvidia card and bumblebee so I wonder if it could have been a transient electrical error from power-cycling the nvidia card. I'm still going to change the disk, but I wonder if it's really borked already or if it still has a little life left in it for non-critical uses.

Would probably be ok for unimportant messing around until it dies.  But, if the root cause is the nvidia power-cycling, be careful with the new drive - that would tell me that your power supply/voltage regulation might not be up to snuff.


Eenie meenie, chili beanie, the spirits are about to speak -- Bullwinkle J. Moose
It's a big club...and you ain't in it -- George Carlin
Registered Linux user #149839
perl -e 'print$i=pack(c5,(41*2),sqrt(7056),(unpack(c,H)-2),oct(115),10); '


#3 2022-01-03 05:40:45

Xyne
Forum Fellow
Registered: 2008-08-03
Posts: 6,965
Website

Re: Interpretation of SMART attributes for failing disk with passing tests

merlock wrote:

I wonder about 'increment continuously'.  It would only increment on an error, right? 

I would also factor in the power-on time as well (components do just wear out).  That said, how old is this drive?  4393 hours is not a lot.  My spinner has 38724 hours, and no Raw_Read_Error_Rate or Seek_Error_Rate errors.

Basically every time I run smartctl -a, the values have increased, regardless of any activity on my end, at least while the disk is mounted. The system currently reports no errors through journalctl, dmesg or syslog, and I've not noticed any unexpected behavior when I actually do use the disk since physically reconnecting it.

The system is about 3 years old. I have no idea how long the errors have been there.

merlock wrote:

That's a neat idea.  Post it on Community Contributions when you get it ready!

I will. I have a little list of things that I plan to release soon. I'm in the process of reworking my site and my release methods. Once that's up I'll share everything along with some HOWTOs for setting up things like centralized notifications with gotify and secure remote encrypted backups with iSCSI. ;-)

merlock wrote:

Would probably be ok for unimportant messing around until it dies.  But, if the root cause is the nvidia power-cycling, be careful with the new drive - that would tell me that your power supply/voltage regulation might not be up to snuff.

I'll have to monitor the new disk's attributes and see what happens. I'm holding off on replacing it as long as I can in the hope that prices will drop back down to something close to sane.




#4 2022-01-03 07:43:57

seth
Member
From: Don't DM me only for attention
Registered: 2012-09-03
Posts: 74,307

Re: Interpretation of SMART attributes for failing disk with passing tests

It's a bus error, there are no reallocated or pending sectors, test after test passes, and the disk isn't overly old … Nothing there really suggests that the disk is falling apart.
Either the firmware is so broken that the SMART data is now useless, or it's more a problem w/ the disk attachment (cable, badly/loosely seated), or it is indeed a power management problem related to the bus - it might also be the 5.15 bug you mentioned.

You could try to reproduce the problem w/ the lts kernel or some live distro or (very wild and random guess) passing "pcie_aspm=off" to the kernel and if you get the same behavior w/ all of that, open the case and check the connection.


#5 2022-01-03 08:59:40

Xyne
Forum Fellow
Registered: 2008-08-03
Posts: 6,965
Website

Re: Interpretation of SMART attributes for failing disk with passing tests

You may be right about a faulty connection. I pulled out the disk yesterday and re-inserted it because I saw it mentioned elsewhere. It's a 2.5" internal laptop drive so there are no cables, just a fixed connector. Since then I haven't had any more errors (I have "journalctl -p err..alert -f" open and in view). I'm still heading out to get a new disk to have it on hand but I'll keep this one in for now to see if the errors pop up again. I've tried powering up the Nvidia card a few times to see if that triggers anything but so far it's behaving perfectly. Given that I can't reproduce the errors right now I'll hold off on trying with the lts kernel or some other non-5.15 live medium. In the meantime I'll pop open the back again and double-check all of the connectors that I can find.

I also plan to buy a SATA-to-USB adapter so I can rescue and install internal disks from other systems. If the errors don't appear with that, it would support the theory that this is a bus or software error.

Incidentally I had problems with the same computer over 8 months ago when it would randomly lock up when resuming from suspend. I was convinced that it was an electrical problem then but eventually an update fixed it. It's worked without issue since early last year until these errors showed up in mid-December. I don't have any error logs dating back to the end of October but it may well coincide with the release of 5.15 which leaves me hopeful that it will be fixed with 5.16.

I still don't know what to make of the changing error rates:

  1 Raw_Read_Error_Rate     0x000f   077   064   006    Pre-fail  Always       -       47681160
  7 Seek_Error_Rate         0x000f   074   060   045    Pre-fail  Always       -       26149032

Of course, if the firmware is ****ed then we're in unreliable narrator territory and those values may truly mean absolutely nothing. Or maybe they're just TOTP codes for getting vendor support.

The uncertainty is driving me crazy. At least with all of the clenching I've got a head start on getting back in shape after the holidays. :P

Last edited by Xyne (2022-01-03 09:02:42)




#6 2022-01-03 11:02:44

Ferdinand
Member
From: Norway
Registered: 2020-01-02
Posts: 338

Re: Interpretation of SMART attributes for failing disk with passing tests

For what it's worth, here are some insights from Backblaze:
https://www.backblaze.com/blog/hard-drive-smart-stats/

They state that "With nearly 40,000 hard drives and over 100,000,000GB of data stored for customers, we have a lot of hard-won experience", and they have found the following 5 metrics to indicate impending disk drive failure:

    SMART 5: Reallocated_Sector_Count.
    SMART 187: Reported_Uncorrectable_Errors.
    SMART 188: Command_Timeout.
    SMART 197: Current_Pending_Sector_Count.
    SMART 198: Offline_Uncorrectable.

They further state: "We chose these five stats based on our experience and input from others in the industry because they are consistent across manufacturers and they are good predictors of failure."

I think I'll add them to my .xprofile startup-status-info :)
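For the startup check, something like this might do (a quick sketch; the function name is mine, and the column positions assume the `smartctl -A` table format shown earlier in the thread):

```shell
# Filter the five Backblaze failure indicators out of `smartctl -A`
# output and warn when any has a nonzero raw value. Column 10 is the
# RAW_VALUE field in the attribute table.
smart_check() {
    awk '$1 ~ /^(5|187|188|197|198)$/ && $10 + 0 > 0 {
        printf "WARN: %s (id %s) raw=%s\n", $2, $1, $10
    }'
}

# Intended use in a startup script:
#   smartctl -A /dev/sda | smart_check
```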

Last edited by Ferdinand (2022-01-03 11:05:04)

