
#1 2019-04-14 20:25:38

Al.Piotrowicz
Member
Registered: 2017-08-07
Posts: 116

SMART issue

I recently saw the libata error below in the journal while attempting to mount an LVM volume:

kwi 14 21:29:52 pudlo kernel: ata3.00: exception Emask 0x0 SAct 0x300000 SErr 0x0 action 0x0
kwi 14 21:29:52 pudlo kernel: ata3.00: irq_stat 0x40000008
kwi 14 21:29:52 pudlo kernel: ata3.00: failed command: READ FPDMA QUEUED
kwi 14 21:29:52 pudlo kernel: ata3.00: cmd 60/40:a0:79:80:a2/05:00:0f:00:00/40 tag 20 ncq dma 688128 in
                                         res 41/40:00:08:81:a2/00:00:0f:00:00/40 Emask 0x409 (media error) <F>
kwi 14 21:29:52 pudlo kernel: ata3.00: status: { DRDY ERR }
kwi 14 21:29:52 pudlo kernel: ata3.00: error: { UNC }
kwi 14 21:29:52 pudlo kernel: ata3.00: configured for UDMA/133
kwi 14 21:29:52 pudlo kernel: sd 2:0:0:0: [sdc] tag#20 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
kwi 14 21:29:52 pudlo kernel: sd 2:0:0:0: [sdc] tag#20 Sense Key : Medium Error [current] 
kwi 14 21:29:52 pudlo kernel: sd 2:0:0:0: [sdc] tag#20 Add. Sense: Unrecovered read error - auto reallocate failed
kwi 14 21:29:52 pudlo kernel: sd 2:0:0:0: [sdc] tag#20 CDB: Read(10) 28 00 0f a2 80 79 00 05 40 00
kwi 14 21:29:52 pudlo kernel: print_req_error: I/O error, dev sdc, sector 262308104 flags 5000
kwi 14 21:29:52 pudlo kernel: ata3: EH complete
kwi 14 21:29:53 pudlo kernel: XFS (dm-11): metadata I/O error in "xlog_bread_noalign" at daddr 0xfa245f9 len 8192 error 5
kwi 14 21:29:53 pudlo kernel: XFS (dm-11): failed to find log head
kwi 14 21:29:53 pudlo kernel: XFS (dm-11): log mount/recovery failed: error -5
kwi 14 21:29:53 pudlo kernel: XFS (dm-11): log mount failed

At the same time, smartctl showed a Raw_Read_Error_Rate of 8 and a Current_Pending_Sector count of 1 (both had been 0 before):

$ sudo smartctl -a /dev/sdc
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-5.0.7-arch1-1-ARCH] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Green (AF)
Device Model:     WDC WD20EARS-00S8B1
Serial Number:    WD-WCAVY5975520
LU WWN Device Id: 5 0014ee 25a73fa6e
Firmware Version: 80.00A80
User Capacity:    2000398934016 bytes [2,00 TB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 2.6, 3.0 Gb/s
Local Time is:    Sun Apr 14 22:22:03 2019 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)	Offline data collection activity
					was completed without error.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(40260) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 459) minutes.
Conveyance self-test routine
recommended polling time: 	 (   5) minutes.
SCT capabilities: 	       (0x3031)	SCT Status supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       8
  3 Spin_Up_Time            0x0027   144   140   021    Pre-fail  Always       -       9783
  4 Start_Stop_Count        0x0032   098   098   000    Old_age   Always       -       2750
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   067   067   000    Old_age   Always       -       24253
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   098   098   000    Old_age   Always       -       2714
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       264
193 Load_Cycle_Count        0x0032   199   199   000    Old_age   Always       -       3723
194 Temperature_Celsius     0x0022   117   103   000    Old_age   Always       -       35
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Is the disk about to die and end up in the trash?

Of course I also tried to mount it again, and it unexpectedly succeeded. The affected sector can be read:

# hdparm --read-sector 262308104 /dev/sdc

/dev/sdc:
reading sector 262308104: succeeded
6342 60af afd1 08d4 3ef3 2a70 040e 0e7f
042c 73c4 cfbd c476 299d 9121 52b5 35bc
d64c 626d 6614 5cd2 de15 184e e3fc b567
0dc4 c666 7c45 09ae 021c 0e02 d84b 0a35
cc61 1e07 c8e1 7dbf 9acd 609e 05f6 6fe2
eb79 cd08 ae23 bdfd aa6b 8483 f1b6 f85f
94bb 5b0a 3cf9 f913 bf06 413c 1171 28f3
d9a5 8043 690b 1938 302b fee8 36d3 6956
5d45 aa82 8081 4acd 91c0 092f fde6 94cf
c26c df7c 7871 72e2 e4ff 8678 bd31 8af3
b58e faee 5630 ef8a 54f5 1045 20cf 3174
3b86 390b 1994 5806 402a 5dde 74c6 9fe4
6846 0856 5865 a0cc ce99 2a78 78d1 a1ed
ee46 37a0 2eed 5eaf 948f c3c2 a36f 0be5
e299 688e 55a7 6b89 2fe5 0d8a 91c2 653b
e858 2429 8b5c f1ca 127b 1a01 c624 0a10
6f15 e70f b7d2 a76f c0a4 56c7 cb52 db7a
a274 7e2c d48d 480c d8ac 4d0d 884a 554a
edcc d3c5 3635 af6b 784d da56 9765 ea24
6b2e bbd8 9294 563e 06df 914f be97 fd68
7207 1433 48ce 6cbb 6de4 31e9 5bc4 f623
8416 1914 d81b 4f00 d0c4 cf50 9b9d 481c
1b97 414d d328 7692 e3ec 5d1a 48d8 9bb9
3c5c 7194 81b9 8f4a 5a2e 5e95 f9fd 5167
9ed5 7dbe f320 62ff 4e5b 05de 6469 2976
bfe5 a763 3c9d 6629 628d 9ceb 1e15 77a6
0a2c 07a0 3bcb 09fd 4bd6 8fce 921a 3aa3
c263 0106 340b 7f3c 548e 5f7e f174 8cd2
490d aaa2 ea82 484a b3f2 604e 78fc 76f6
fc68 009f 0c24 32b3 940a 1e44 19cd aba8
71e7 0f82 8964 a59e a2bb d039 249a 831c
8186 cdf7 794a 0db0 5d8b 9aff 7892 42d4

Last edited by Al.Piotrowicz (2019-04-14 20:35:05)


#2 2019-04-14 21:24:10

Ropid
Member
Registered: 2015-03-09
Posts: 1,069

Re: SMART issue

In that table in smartctl's output, you should probably ignore the "1 Raw_Read_Error_Rate" line. You can't know what the "rate" attributes actually mean; only the manufacturer knows how those values are computed.

The interesting line in smartctl's output is "197 Current_Pending_Sector". It has a "1" in the "raw" column. That means there is one sector that the drive did not manage to read. When that sector is overwritten in the future, the "current pending sector" number should go back to zero, and if the drive has to remap the sector, "5 Reallocated_Sector_Ct" will change to "1".

The usual recommendation is to immediately stop using this kind of drive and replace it.

My personal experience with this: I have had this happen with several drives over the years. I kept using those drives for fun, and most of them were dead within weeks. There was one drive where the "reallocated" and "pending" sector counts stopped going up after a while; it then kept working fine for another five years or so, until one day it was suddenly dead. If you want to experiment with this drive of yours, the practical problem you'll run into is that the drive will hang each time it hits a broken sector that it can't read. This makes it a bit useless. You might want to start a sort of log where you collect the output of "smartctl -A" with a date, so you can see how the event counts change from day to day.
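
One possible way to keep such a log, as a rough Python sketch rather than a finished tool. It assumes the ten-column attribute table that smartctl prints in post #1. The function names, the list of watched attributes and the log line format are illustrative choices, not anything smartctl provides:

```python
import datetime
import subprocess

WATCHED = (5, 196, 197, 198)  # reallocated / pending / uncorrectable counters


def parse_attributes(text: str) -> dict:
    """Map attribute ID -> raw value from `smartctl -A` table rows."""
    raw = {}
    for line in text.splitlines():
        fields = line.split()
        # Table rows have at least 10 columns and start with a numeric ID.
        if len(fields) >= 10 and fields[0].isdigit():
            raw[int(fields[0])] = fields[9]
    return raw


def log_line(text: str, when: datetime.datetime) -> str:
    """One dated line with the raw values of the watched attributes."""
    raw = parse_attributes(text)
    values = " ".join(f"{i}={raw.get(i, '?')}" for i in WATCHED)
    return f"{when:%Y-%m-%d %H:%M} {values}"


def append_snapshot(device: str, logfile: str) -> None:
    """Run smartctl (needs root) and append one line to the log file."""
    out = subprocess.run(["smartctl", "-A", device],
                         capture_output=True, text=True).stdout
    with open(logfile, "a") as fh:
        fh.write(log_line(out, datetime.datetime.now()) + "\n")


# Two rows copied from the table in post #1, used as a dry run.
sample = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       1
"""
print(log_line(sample, datetime.datetime(2019, 4, 14, 22, 22)))
```

Calling append_snapshot("/dev/sdc", "smart.log") once a day (the log file name is hypothetical) would build up the dated history; attributes missing from the output show up as "?".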


#3 2019-04-14 21:31:16

Al.Piotrowicz
Member
Registered: 2017-08-07
Posts: 116

Re: SMART issue

Thanks for the quick reply. I treated the Raw_Read_Error_Rate as alarming because I have two other identical drives (damn Caviar Green junk), and on those the attribute reads 0. On this drive it went up right after the libata read error.


#4 2019-04-15 06:54:16

seth
Member
Registered: 2012-09-03
Posts: 51,050

Re: SMART issue

Check the SMART data again; you probably now have one "Reallocated_Event_Count"?
It's time to be cautious. If the single sector was damaged by an isolated, unfortunate event (a physical impact while the head was above it), things will stay as they are, but if the numbers start creeping up, it's time to replace the disk.

In any event: if you do not have a backup of valuable, irreplaceable data (you don't need to back up the OS, but your master's thesis or whatever), *NOW* is the time for that!


#5 2019-04-15 07:05:53

Al.Piotrowicz
Member
Registered: 2017-08-07
Posts: 116

Re: SMART issue

Hey Seth, the really weird thing for me is that (despite these drives being real ** crap and having almost 3 years of constant workload) the SMART values are still unchanged at the time of this post:

# smartctl -a /dev/sdc
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-5.0.7-arch1-1-ARCH] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Green (AF)
Device Model:     WDC WD20EARS-00S8B1
Serial Number:    WD-WCAVY5975520
LU WWN Device Id: 5 0014ee 25a73fa6e
Firmware Version: 80.00A80
User Capacity:    2000398934016 bytes [2,00 TB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 2.6, 3.0 Gb/s
Local Time is:    Mon Apr 15 09:02:50 2019 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)	Offline data collection activity
					was completed without error.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(40260) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 459) minutes.
Conveyance self-test routine
recommended polling time: 	 (   5) minutes.
SCT capabilities: 	       (0x3031)	SCT Status supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       8
  3 Spin_Up_Time            0x0027   144   140   021    Pre-fail  Always       -       9783
  4 Start_Stop_Count        0x0032   098   098   000    Old_age   Always       -       2751
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   067   067   000    Old_age   Always       -       24257
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   098   098   000    Old_age   Always       -       2715
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       264
193 Load_Cycle_Count        0x0032   199   199   000    Old_age   Always       -       3724
194 Temperature_Celsius     0x0022   115   103   000    Old_age   Always       -       37
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Moreover, could you explain what is going on with the affected sector? Ever since the libata read error occurred, it has been readable every time I ran the hdparm read command. Is that normal, expected behaviour, or am I just confused?

Thanks community.
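
The two smartctl dumps in this thread can also be compared mechanically. A small Python sketch of a "did the error counters creep up" check; the raw values are typed in by hand from posts #1 and #5, and the choice of attributes to watch and the function name creeping are assumptions made for illustration:

```python
WATCHED = {
    5: "Reallocated_Sector_Ct",
    196: "Reallocated_Event_Count",
    197: "Current_Pending_Sector",
    198: "Offline_Uncorrectable",
}


def creeping(before: dict, after: dict) -> dict:
    """Return {name: (old, new)} for watched raw counters that increased."""
    grown = {}
    for attr_id, name in WATCHED.items():
        old, new = before.get(attr_id, 0), after.get(attr_id, 0)
        if new > old:
            grown[name] = (old, new)
    return grown


# Raw values as posted in this thread (attribute ID -> raw value).
evening = {5: 0, 196: 0, 197: 1, 198: 0}  # post #1, 2019-04-14 22:22
morning = {5: 0, 196: 0, 197: 1, 198: 0}  # post #5, 2019-04-15 09:02

print(creeping(evening, morning) or "no growth in watched counters")
```

Only usage counters such as Power_On_Hours and Start_Stop_Count differ between the two dumps; the error-related counters are identical.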

Last edited by Al.Piotrowicz (2019-04-15 07:07:44)


#6 2019-04-15 07:25:15

seth
Member
Registered: 2012-09-03
Posts: 51,050

Re: SMART issue

The sector has been remapped to one from the reserve pool. This is done by the drive's firmware and is opaque to the OS.
The values will only grow when you happen to access a damaged sector (if there are more, or more develop).
You could trigger an extended self-test or a non-destructive badblocks run to hunt down already existing ones and get a full picture of the drive's status AS OF TODAY.

But backup stuff first. Really.
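
The two scan options mentioned above could look roughly like this. The Python sketch only builds and prints the command lines and does not run them. The device path is taken from the thread, and the helper name scan_commands is made up. Note that badblocks -n expects the device to be unmounted, and that the drive itself reports about 459 minutes for the extended self-test:

```python
import shlex


def scan_commands(device: str) -> list:
    """Build (not run) the commands for a full-surface check of `device`."""
    return [
        ["smartctl", "-t", "long", device],       # start the extended self-test
        ["smartctl", "-l", "selftest", device],   # read the self-test log afterwards
        ["badblocks", "-n", "-s", "-v", device],  # non-destructive read-write scan
    ]


for argv in scan_commands("/dev/sdc"):
    print(shlex.join(argv))
```

The self-test and the badblocks run are alternatives; either one reads the whole surface, so one of them is normally enough.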

