My new year started with my home partition suddenly getting remounted read-only. This is an LVM logical volume in a volume group on a single physical partition of an internal SATA HDD in a laptop. This led me to search my error logs, where I found several WRITE_FPDMA_QUEUED errors going back as far as mid-December:
Jan 2 00:41:24 foo kernel: ata6.00: failed command: WRITE FPDMA QUEUED
Jan 2 00:41:24 foo kernel: ata6.00: cmd 61/20:80:b0:91:44/00:00:0c:00:00/40 tag 16 ncq dma 16384 out
res 40/00:80:b0:91:44/00:00:0c:00:00/40 Emask 0x10 (ATA bus error)
Jan 2 00:41:24 foo kernel: ata6.00: status: { DRDY }
I also have several instances of the following error, but it seems to be related to the NVMe disk and a bug in the 5.15 kernel when resuming from suspend that has already been patched in the upcoming 5.16 release (https://github.com/cybik/tk-quanta-elukxvi/issues/6).
Jan 1 22:45:17 foo kernel: PM: dpm_run_callback(): usb_dev_resume+0x0/0x10 returns -5
Jan 1 22:45:17 foo kernel: usb 1-10: PM: failed to resume async: error -5
My data is already backed up because I regularly sync to several other systems and backup repos. I plan to replace the disk tomorrow as soon as I can get my hands on a new one.
However, I'm taking this as a learning experience to devise preventative measures to catch such errors before they possibly cause data loss. I already had smartd configured to run regular short and long tests and notify me of any failures. The "problem" is that all of the tests passed. After the ro-remount, I booted with a live medium and reran a long test that also passed.
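(For reference, the smartd.conf directive behind such a schedule looks roughly like the one below; the device path, test times and mail address are only placeholders here.)

# /etc/smartd.conf - -a enables the default set of checks, -s schedules a short
# self-test every day at 02:00 and a long self-test every Saturday at 03:00,
# and -m mails warnings to the given address.
/dev/sda -a -s (S/../.././02|L/../../6/03) -m admin@example.org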
I understand from this thread that the tests themselves are not enough and that one has to interpret the reported attributes, which I've never looked at before now. However, in that thread it was obvious that several critical disk attributes indicated pending disk failure. Can someone tell me which attributes in the smartctl -a output below are most indicative of pending failure in my case?
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.12-arch1-1] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Seagate Barracuda 2.5 5400
Device Model: ST1000LM048-2E7172
Serial Number: WDEYT38D
LU WWN Device Id: 5 000c50 0ad25148d
Firmware Version: SDM1
User Capacity: 1,000,204,886,016 bytes [1.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5400 rpm
Form Factor: 2.5 inches
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-3 T13/2161-D revision 3b
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Mon Jan 3 04:20:16 2022 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 0) seconds.
Offline data collection
capabilities: (0x71) SMART execute Offline immediate.
No Auto Offline data collection support.
Suspend Offline collection upon new
command.
No Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 163) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x3035) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 081 064 006 Pre-fail Always - 140218683
3 Spin_Up_Time 0x0003 099 099 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 999
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 074 060 045 Pre-fail Always - 25978183
9 Power_On_Hours 0x0032 095 095 000 Old_age Always - 4393 (40 140 0)
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 996
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 098 000 Old_age Always - 3
189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 060 050 040 Old_age Always - 40 (Min/Max 20/45)
191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 7
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 26
193 Load_Cycle_Count 0x0032 001 001 000 Old_age Always - 214140
194 Temperature_Celsius 0x0022 040 050 000 Old_age Always - 40 (0 17 0 0 0)
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 194 000 Old_age Always - 286
240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 1524 (160 22 0)
241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 2191066965
242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 3567900969
254 Free_Fall_Sensor 0x0032 100 100 000 Old_age Always - 0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 4392 -
# 2 Extended offline Completed without error 00% 4385 -
# 3 Extended offline Completed without error 00% 4377 -
# 4 Extended offline Completed without error 00% 4372 -
# 5 Extended offline Completed without error 00% 4368 -
# 6 Short offline Completed without error 00% 4365 -
# 7 Extended offline Completed without error 00% 4360 -
# 8 Short offline Completed without error 00% 4305 -
# 9 Short offline Completed without error 00% 4245 -
#10 Short offline Completed without error 00% 4170 -
#11 Short offline Completed without error 00% 4072 -
#12 Extended offline Completed without error 00% 4017 -
#13 Short offline Completed without error 00% 4006 -
#14 Short offline Completed without error 00% 3993 -
#15 Short offline Completed without error 00% 3927 -
#16 Short offline Completed without error 00% 3853 -
#17 Extended offline Completed without error 00% 3807 -
#18 Short offline Completed without error 00% 3753 -
#19 Short offline Completed without error 00% 3729 -
#20 Short offline Completed without error 00% 3671 -
#21 Short offline Completed without error 00% 3605 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
I would guess that Raw_Read_Error_Rate and Seek_Error_Rate are the main indicators but the descriptions of these attributes indicate that they're not directly interpretable. They seem to increment continuously regardless of disk usage.
My plan is to write a service that periodically checks the disk attributes and notifies me of any values that indicate poor disk health despite the passing SMART tests. I've also written a service to periodically check for system errors via journalctl and notify me when any occur to avoid these little surprises in the future.
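A minimal sketch of the attribute-checking idea, assuming a single SATA disk at /dev/sda, run as root, and with notify-send standing in for whatever notification backend ends up being used (the attribute IDs and limits are likewise just assumptions):

#!/bin/bash
# Flag non-zero raw values on a few attributes that are meaningful as plain
# counters. Device, IDs, limits and the notifier are illustrative only.
dev=/dev/sda
declare -A limit=( [5]=0 [187]=0 [188]=0 [197]=0 [198]=0 [199]=0 )
smartctl -A "$dev" | awk '$1 ~ /^[0-9]+$/ {print $1, $2, $10}' |
while read -r id name raw; do
    [[ -n ${limit[$id]+x} ]] || continue
    if (( raw > ${limit[$id]} )); then
        # Stand-in for the real notification mechanism (gotify, mail, ...).
        notify-send "SMART warning: $dev" "$name (ID $id) raw value is $raw"
    fi
done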
Incidentally, I read that the errors may be due to a faulty connection or electrical problem so I opened up the laptop, took out the disk, cleaned the connectors and put it back in. Since then I haven't had a single error. It's a laptop with an nvidia card and bumblebee so I wonder if it could have been a transient electrical error from power-cycling the nvidia card. I'm still going to change the disk, but I wonder if it's really borked already or if it still has a little life left in it for non-critical uses.
I would guess that Raw_Read_Error_Rate and Seek_Error_Rate are the main indicators but the descriptions of these attributes indicate that they're not directly interpretable. They seem to increment continuously regardless of disk usage.
I wonder about 'increment continuously'. It would only increment on an error, right?
I would also factor in the power-on time (components do just wear out). That said, how old is this drive? 4393 hours is not a lot. My spinner has 38724 hours, and no Raw_Read_Error_Rate or Seek_Error_Rate errors.
My plan is to write a service that periodically checks the disk attributes and notifies me of any values that indicate poor disk health despite the passing SMART tests. I've also written a service to periodically check for system errors via journalctl and notify me when any occur to avoid these little surprises in the future.
That's a neat idea. Post it on Community Contributions when you get it ready!
Incidentally, I read that the errors may be due to a faulty connection or electrical problem so I opened up the laptop, took out the disk, cleaned the connectors and put it back in. Since then I haven't had a single error. It's a laptop with an nvidia card and bumblebee so I wonder if it could have been a transient electrical error from power-cycling the nvidia card. I'm still going to change the disk, but I wonder if it's really borked already or if it still has a little life left in it for non-critical uses.
Would probably be ok for unimportant messing around until it dies. But, if the root cause is the nvidia power-cycling, be careful with the new drive - that would tell me that your power supply/voltage regulation might not be up to snuff.
I wonder about 'increment continuously'. It would only increment on an error, right?
I would also factor in the power-on time (components do just wear out). That said, how old is this drive? 4393 hours is not a lot. My spinner has 38724 hours, and no Raw_Read_Error_Rate or Seek_Error_Rate errors.
Basically every time I run smartctl -a, the values have increased regardless of any activity on my end, at least while the disk is mounted. The system currently reports no errors through journalctl, dmesg or syslog, and I haven't noticed any unexpected behavior when I actually use the disk since physically reconnecting it.
The system is about 3 years old. I have no idea how long the errors have been there.
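To pin down whether they really grow independently of activity, one throwaway approach would be to log the two raw values over time, e.g. with something like this run as root (device, interval and log path are arbitrary):

# Append a timestamped sample of the two raw values every 10 minutes.
while sleep 600; do
    printf '%s %s\n' "$(date -Is)" \
        "$(smartctl -A /dev/sda | awk '$1==1 || $1==7 {printf "%s=%s ", $2, $10}')"
done >> /root/smart-rates.log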
That's a neat idea. Post it on Community Contributions when you get it ready!
I will. I have a little list of things that I plan to release soon. I'm in the process of reworking my site and my release methods. Once that's up I'll share everything along with some HOWTOs for setting up things like centralized notifications with gotify and secure remote encrypted backups with iSCSI.
Would probably be ok for unimportant messing around until it dies. But, if the root cause is the nvidia power-cycling, be careful with the new drive - that would tell me that your power supply/voltage regulation might not be up to snuff.
I'll have to monitor the new disk's attributes and see what happens. I'm holding off on replacing it as long as I can in the hope that prices will drop back down to something close to sane.
It's a bus error and there are no reallocated or pending sectors, test after test passes, and the disk isn't overly old … Nothing there really suggests that the disk is falling apart.
Either the firmware is so broken that the SMART data is now useless, or it's more a problem w/ the disk attachment (cable, badly/loosely seated), or it is indeed a power management problem related to the bus - it might also be the 5.15 bug you mentioned.
You could try to reproduce the problem w/ the lts kernel or some live distro, or (very wild and random guess) by passing "pcie_aspm=off" to the kernel, and if you get the same behavior w/ all of that, open the case and check the connection.
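For example, with GRUB that would mean appending it to the existing options and regenerating the config (systemd-boot users would add it to the entry's options line instead); the "loglevel=3 quiet" part below just stands for whatever options are already there:

# /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="loglevel=3 quiet pcie_aspm=off"
# then regenerate the config:
grub-mkconfig -o /boot/grub/grub.cfg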
You may be right about a faulty connection. I pulled out the disk yesterday and re-inserted it because I saw it mentioned elsewhere. It's a 2.5" internal laptop drive so there are no cables, just a fixed connector. Since then I haven't had any more errors (I have "journalctl -p err..alert -f" open and in view). I'm still heading out to get a new disk to have it on hand, but I'll keep this one in for now to see if the errors pop up again. I've tried powering up the Nvidia card a few times to see if that triggers anything, but so far it's behaving perfectly. Given that I can't reproduce the errors right now, I'll hold off on trying the lts kernel or some other non-5.15 live medium. In the meantime I'll pop open the back again and double-check all of the connectors that I can find.
I plan to buy a SATA-to-USB adapter too, in order to be able to rescue and install internal disks from other systems. If the errors don't appear with that, it supports the theory that it's a bus or software error.
Incidentally, I had problems with the same computer over 8 months ago, when it would randomly lock up when resuming from suspend. I was convinced that it was an electrical problem then, but eventually an update fixed it. It worked without issue from early last year until these errors showed up in mid-December. I don't have any error logs dating back to the end of October, but the onset may well coincide with the release of 5.15, which leaves me hopeful that it will be fixed with 5.16.
I still don't know what to make of the changing error rates:
1 Raw_Read_Error_Rate 0x000f 077 064 006 Pre-fail Always - 47681160
7 Seek_Error_Rate 0x000f 074 060 045 Pre-fail Always - 26149032
Of course, if the firmware is ****ed then we're in the unreliable narrator scenario and those values may truly mean absolutely nothing. Or maybe they're just TOTP codes for getting vendor support.
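One interpretation I've seen repeated for Seagate drives - and I can't vouch for it - is that these raw values are 48-bit fields with the actual error count in the upper 16 bits and the total operation count in the lower 32 bits. Under that (assumed) layout, both of my values decode to zero errors:

# Hypothetical decode, assuming the often-repeated Seagate raw-value layout.
printf 'errors=%d ops=%d\n' $(( 47681160 >> 32 )) $(( 47681160 & 0xFFFFFFFF ))
# -> errors=0 ops=47681160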
The uncertainty is driving me crazy. At least with all of the clenching I've got a head start on getting back in shape after the holidays.
For what it's worth, here are some insights from Backblaze:
https://www.backblaze.com/blog/hard-drive-smart-stats/
They state that "With nearly 40,000 hard drives and over 100,000,000GB of data stored for customers, we have a lot of hard-won experience", and they have found the following 5 metrics to indicate impending disk drive failure:
SMART 5: Reallocated_Sector_Count.
SMART 187: Reported_Uncorrectable_Errors.
SMART 188: Command_Timeout.
SMART 197: Current_Pending_Sector_Count.
SMART 198: Offline_Uncorrectable.
They further state that "We chose these five stats based on our experience and input from others in the industry because they are consistent across manufacturers and they are good predictors of failure".
I think I'll add them to my .xprofile startup-status-info.
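Something like this would do it, assuming a single disk at /dev/sda and that smartctl can be run with the needed privileges (it normally requires root, so a NOPASSWD sudo rule or similar would be needed from .xprofile):

# Hypothetical ~/.xprofile addition: print the five Backblaze stats at login.
sudo smartctl -A /dev/sda | awk '$1 ~ /^(5|187|188|197|198)$/ {printf "%-26s %s\n", $2, $10}'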