Btrfs filesystem corruption

Mathuin · 2016-01-23 13:04:14

Hello,

I've been running Arch on my Asus UX303 (Btrfs filesystem) without any problems since August. During the week, I don't use it a lot because I'm a bit overworked. Anyway, last weekend I updated, and I wanted to update again this morning. Pacman crashed with a bus error. I was a bit puzzled so I checked my hard drive, thinking that it was perhaps a hardware problem. When backing up my files, I noticed that cp couldn't copy a few files from my home directory, mostly music but also files from my Miniconda install, because of an I/O error. Other than pacman, everything works.

I booted from an ISO and ran btrfs check. It found 2 or so "unfixable" errors. I then ran a SMART check with -H, which went fine: the hard drive passed the checks.

I'm sorry for not providing more details: I left my charger at my dorm yesterday and I'm at my parents' until tomorrow evening. I only have 15% battery left and I want to save it. What do you think is the real source of those errors? A Btrfs problem or a hard drive failure?

I'm worried because, while the laptop is still under warranty, I don't have Windows on it anymore and I would have to reinstall Windows so that Asus customer service doesn't reject it, right? Furthermore, I need that laptop during the week because I've got a research project to do, and that project is part of the exams needed to join an engineering school.

Thanks in advance for your attention. If there is any further information you need, tell me and I'll post it.

Mathuin · 2016-01-24 22:16:16

Sorry for bumping the thread. I've got my charger back, so I can give more detail. Whenever I run **any** pacman command, it crashes with a bus error. To find out what caused this, I ran:

$ dmesg | tail
[ 2901.910544] ata1.00: status: { DRDY ERR }
[ 2901.910545] ata1.00: error: { UNC }
[ 2901.912846] ata1.00: configured for UDMA/133
[ 2901.912857] sd 0:0:0:0: [sda] tag#6 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
[ 2901.912859] sd 0:0:0:0: [sda] tag#6 Sense Key : 0x3 [current] [descriptor] 
[ 2901.912861] sd 0:0:0:0: [sda] tag#6 ASC=0x11 ASCQ=0x4 
[ 2901.912863] sd 0:0:0:0: [sda] tag#6 CDB: opcode=0x28 28 00 03 9a 02 18 00 00 08 00
[ 2901.912864] blk_update_request: I/O error, dev sda, sector 60424728
[ 2901.912868] BTRFS: bdev /dev/sda3 errs: wr 0, rd 227, flush 0, corrupt 0, gen 0
[ 2901.912884] ata1: EH complete
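
For anyone cross-checking the log: the CDB line appears to be a SCSI READ(10) command (opcode 0x28), in which bytes 2 to 5 hold the starting LBA and bytes 7 to 8 the transfer length. A minimal sketch that decodes those bytes, using only values from the dmesg output above (the function name is made up for illustration):

```shell
# Decode the LBA and length from the READ(10) CDB shown in dmesg:
#   28 00 03 9a 02 18 00 00 08 00
# hex_to_dec is a hypothetical helper, not part of any tool.
hex_to_dec() {
    printf '%d\n' "0x$1"
}

hex_to_dec 039a0218   # starting LBA; compare with "sector 60424728" in the log
hex_to_dec 0008       # number of 512-byte logical sectors requested
```

Eight 512-byte sectors is 4096 bytes, which matches the 4096-byte physical sector size the drive reports, so the failed read plausibly covers a single physical sector.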

I think that it's the hard drive itself that's broken. I ran a short SMART test:

$ smartctl -t short /dev/sda
[Irrelevant]
$ smartctl -a /dev/sda
smartctl 6.4 2015-06-04 r4109 [x86_64-linux-4.3.3-2-ARCH] (local build)
Copyright (C) 2002-15, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     HGST HTS545050A7E680
Serial Number:    RB050AM51DSJSJ
LU WWN Device Id: 5 000cca 7a3d3e453
Firmware Version: GR2OA3B0
User Capacity:    500,107,862,016 bytes [500 GB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Form Factor:      2.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ATA8-ACS T13/1699-D revision 6
SATA Version is:  SATA 2.6, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun Jan 24 22:55:24 2016 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)    Offline data collection activity
                  was never started.
                  Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                  without error or no self-test has ever 
                  been run.
Total time to complete Offline 
data collection:       (   45) seconds.
Offline data collection
capabilities:           (0x5b) SMART execute Offline immediate.
                  Auto Offline data collection on/off support.
                  Suspend Offline collection upon new
                  command.
                  Offline surface scan supported.
                  Self-test supported.
                  No Conveyance Self-test supported.
                  Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                  power-saving mode.
                  Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                  General Purpose Logging supported.
Short self-test routine 
recommended polling time:   (   2) minutes.
Extended self-test routine
recommended polling time:   ( 109) minutes.
SCT capabilities:         (0x003d) SCT Status supported.
                  SCT Error Recovery Control supported.
                  SCT Feature Control supported.
                  SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
 1 Raw_Read_Error_Rate     0x000b   096   096   062    Pre-fail  Always       -       589824
 2 Throughput_Performance  0x0005   100   100   040    Pre-fail  Offline      -       0
 3 Spin_Up_Time            0x0007   253   253   033    Pre-fail  Always       -       0
 4 Start_Stop_Count        0x0012   095   095   000    Old_age   Always       -       9319
 5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
 7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
 8 Seek_Time_Performance   0x0005   100   100   040    Pre-fail  Offline      -       0
 9 Power_On_Hours          0x0012   099   099   000    Old_age   Always       -       523
10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       422
191 G-Sense_Error_Rate      0x000a   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       9
193 Load_Cycle_Count        0x0012   091   091   000    Old_age   Always       -       99202
194 Temperature_Celsius     0x0002   193   193   000    Old_age   Always       -       31 (Min/Max 15/43)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       11
197 Current_Pending_Sector  0x0022   095   095   000    Old_age   Always       -       352
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0
223 Load_Retry_Count        0x000a   100   100   000    Old_age   Always       -       0

SMART Error Log Version: 1
ATA Error Count: 259 (device log contains only the most recent five errors)
  CR = Command Register [HEX]
  FR = Features Register [HEX]
  SC = Sector Count Register [HEX]
  SN = Sector Number Register [HEX]
  CL = Cylinder Low Register [HEX]
  CH = Cylinder High Register [HEX]
  DH = Device/Head Register [HEX]
  DC = Device Command Register [HEX]
  ER = Error register [HEX]
  ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 259 occurred at disk power-on lifetime: 523 hours (21 days + 19 hours)
 When the command that caused the error occurred, the device was active or idle.

 After command completion occurred, registers were:
 ER ST SC SN CL CH DH
 -- -- -- -- -- -- --
 40 51 08 50 5c e9 03  Error: UNC at LBA = 0x03e95c50 = 65625168

 Commands leading to the command that caused the error were:
 CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
 -- -- -- -- -- -- -- --  ----------------  --------------------
 60 08 88 50 5c e9 40 00      00:31:38.962  READ FPDMA QUEUED
 ea 00 00 00 00 00 a0 00      00:31:38.954  FLUSH CACHE EXT
 61 08 78 80 00 10 40 00      00:31:38.954  WRITE FPDMA QUEUED
 ea 00 00 00 00 00 a0 00      00:31:38.927  FLUSH CACHE EXT
 61 20 68 20 4f 43 40 00      00:31:38.927  WRITE FPDMA QUEUED

Error 258 occurred at disk power-on lifetime: 523 hours (21 days + 19 hours)
 When the command that caused the error occurred, the device was active or idle.

 After command completion occurred, registers were:
 ER ST SC SN CL CH DH
 -- -- -- -- -- -- --
 40 51 08 50 5c e9 03  Error: UNC at LBA = 0x03e95c50 = 65625168

 Commands leading to the command that caused the error were:
 CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
 -- -- -- -- -- -- -- --  ----------------  --------------------
 60 08 f0 50 5c e9 40 00      00:27:03.940  READ FPDMA QUEUED
 ea 00 00 00 00 00 a0 00      00:26:58.710  FLUSH CACHE EXT
 61 08 d0 00 00 10 40 00      00:26:58.710  WRITE FPDMA QUEUED
 61 08 c8 00 00 12 40 00      00:26:58.710  WRITE FPDMA QUEUED
 61 08 c0 80 00 10 40 00      00:26:58.710  WRITE FPDMA QUEUED

Error 257 occurred at disk power-on lifetime: 523 hours (21 days + 19 hours)
 When the command that caused the error occurred, the device was active or idle.

 After command completion occurred, registers were:
 ER ST SC SN CL CH DH
 -- -- -- -- -- -- --
 40 51 08 18 02 9a 03  Error: WP at LBA = 0x039a0218 = 60424728

 Commands leading to the command that caused the error were:
 CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
 -- -- -- -- -- -- -- --  ----------------  --------------------
 61 08 b0 80 00 10 40 00      00:25:41.546  WRITE FPDMA QUEUED
 60 08 a8 18 02 9a 40 00      00:25:41.546  READ FPDMA QUEUED
 ea 00 00 00 00 00 a0 00      00:25:41.449  FLUSH CACHE EXT
 61 20 98 40 75 42 40 00      00:25:41.448  WRITE FPDMA QUEUED
 61 20 90 40 72 42 40 00      00:25:41.448  WRITE FPDMA QUEUED

Error 256 occurred at disk power-on lifetime: 523 hours (21 days + 19 hours)
 When the command that caused the error occurred, the device was active or idle.

 After command completion occurred, registers were:
 ER ST SC SN CL CH DH
 -- -- -- -- -- -- --
 40 51 08 18 02 9a 03  Error: UNC at LBA = 0x039a0218 = 60424728

 Commands leading to the command that caused the error were:
 CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
 -- -- -- -- -- -- -- --  ----------------  --------------------
 60 08 f0 18 02 9a 40 00      00:24:44.522  READ FPDMA QUEUED
 61 10 e8 60 53 54 40 00      00:24:36.018  WRITE FPDMA QUEUED
 61 10 e0 50 92 51 40 00      00:24:16.843  WRITE FPDMA QUEUED
 ea 00 00 00 00 00 a0 00      00:24:06.139  FLUSH CACHE EXT
 61 08 c0 00 00 10 40 00      00:24:06.139  WRITE FPDMA QUEUED

Error 255 occurred at disk power-on lifetime: 522 hours (21 days + 18 hours)
 When the command that caused the error occurred, the device was active or idle.

 After command completion occurred, registers were:
 ER ST SC SN CL CH DH
 -- -- -- -- -- -- --
 40 51 08 50 5c e9 03  Error: UNC at LBA = 0x03e95c50 = 65625168

 Commands leading to the command that caused the error were:
 CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
 -- -- -- -- -- -- -- --  ----------------  --------------------
 60 08 f0 50 5c e9 40 00      00:20:20.835  READ FPDMA QUEUED
 60 08 e8 98 20 52 40 00      00:20:20.830  READ FPDMA QUEUED
 60 08 e0 c0 19 52 40 00      00:20:20.829  READ FPDMA QUEUED
 60 20 d8 40 33 16 40 00      00:20:20.821  READ FPDMA QUEUED
 61 08 d0 08 83 10 40 00      00:20:20.634  WRITE FPDMA QUEUED

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%       522         -
# 2  Extended offline    Aborted by host               90%       522         -
# 3  Short offline       Completed without error       00%       522         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
   1        0        0  Not_testing
   2        0        0  Not_testing
   3        0        0  Not_testing
   4        0        0  Not_testing
   5        0        0  Not_testing
Selective self-test flags (0x0):
 After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

I'm beginning to think that the hard drive is fucked. I can send it back as it's under warranty.

What do I do ? Is my diagnosis correct ?

R00KIE · 2016-01-25 00:02:53

The drive may not be broken, but the sure way to make things work would be to back up everything, clean the disk, copy things back and reinstall any packages for which you couldn't copy the files. You can most probably send it back under warranty, but if you haven't backed up your data don't expect to get it back.

Just out of curiosity, do you happen to move your laptop a lot while it is on?

Morn · 2016-01-25 00:44:45

Perhaps this is just a hardware failure, but I would still use ext4 next time. If btrfs fails, it usually fails catastrophically, as a forum search shows.

mich41 · 2016-01-25 00:56:47

UNC is "uncorrectable read error", i.e. data read from disk contain too many errors to recover with ECC.

It seems that you have 352 unreadable sectors (Current_Pending_Sector), which is a bit high. Normally UNC sectors can be "fixed" by overwriting them with new data (the old data are lost anyway), but a large number of them suggests that the disk may have some trouble writing data correctly, i.e. be fucked. Especially if new bad sectors continue to appear.

You may want to back up all data that are still readable (a sector is 0.5 kB, so you haven't lost much yet).

BTW, only long self test detects such errors.
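
To put a number on "didn't lose much yet", here is a rough sketch of the arithmetic. It assumes each pending sector counts as one 512-byte logical sector, which is the logical sector size the drive reports; the helper name is made up for illustration.

```shell
# pending_bytes is a hypothetical helper: pending sector count times sector size.
pending_bytes() {
    # $1 = raw Current_Pending_Sector value, $2 = logical sector size in bytes
    echo $(( $1 * $2 ))
}

bytes=$(pending_bytes 352 512)
echo "$bytes bytes, about $(( bytes / 1024 )) KiB"
```

This is only the raw amount of unreadable data. The practical loss can be larger, since a single unreadable sector can make a whole file or a filesystem metadata block unusable.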

graysky · 2016-01-25 01:38:17

Morn wrote:

Perhaps this is just a hardware failure, but I would still use ext4 next time. If btrfs fails, it usually fails catastrophically, as a forum search shows.

Yeah, a similar thing happened to me with btrfs but no data supporting the hardware failure. Use ext4 and be happy with a stable system.

Morn · 2016-01-25 12:00:51

graysky wrote:

Yeah, a similar thing happened to me with btrfs but no data supporting the hardware failure. Use ext4 and be happy with a stable system.

I think the strain created by the unusual disk usage pattern of B-tree-based file systems might actually cause these sudden HD hardware failures in the first place.

I was a big proponent of reiserFS3 at home and at university in the early 00's. Initially everything worked beautifully compared to ext2, but later most of those reiser disks failed suddenly and completely, while ext2-based disks were fine. It made me look like an idiot for advocating reiserfs, not to mention I lost my own data at home. Correlation or causation? IDK, but it seems to be a very similar story with btrfs now.

Beware Linux and its "next-gen" file systems! Caveat emptor. :-)

Mathuin · 2016-01-25 12:42:43

So, during the night I ran badblocks which found 961 bad blocks. I also ran:

$ smartctl -t long /dev/sda
[Irrelevant]
$ smartctl -H /dev/sda
smartctl 6.4 2015-06-04 r4109 [x86_64-linux-4.3.3-2-ARCH] (local build)
Copyright (C) 2002-15, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
$ smartctl -a /dev/sda
smartctl 6.4 2015-06-04 r4109 [x86_64-linux-4.3.3-2-ARCH] (local build)
Copyright (C) 2002-15, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     HGST HTS545050A7E680
Serial Number:    RB050AM51DSJSJ
LU WWN Device Id: 5 000cca 7a3d3e453
Firmware Version: GR2OA3B0
User Capacity:    500,107,862,016 bytes [500 GB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Form Factor:      2.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ATA8-ACS T13/1699-D revision 6
SATA Version is:  SATA 2.6, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Jan 25 13:38:32 2016 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
					was never started.
					Auto Offline Data Collection: Disabled.
Self-test execution status:      ( 121)	The previous self-test completed having
					the read element of the test failed.
Total time to complete Offline 
data collection: 		(   45) seconds.
Offline data collection
capabilities: 			 (0x5b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					No Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 109) minutes.
SCT capabilities: 	       (0x003d)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   097   097   062    Pre-fail  Always       -       458752
  2 Throughput_Performance  0x0005   100   100   040    Pre-fail  Offline      -       0
  3 Spin_Up_Time            0x0007   253   253   033    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0012   095   095   000    Old_age   Always       -       9350
  5 Reallocated_Sector_Ct   0x0033   098   098   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   100   100   040    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0012   099   099   000    Old_age   Always       -       537
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       423
191 G-Sense_Error_Rate      0x000a   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       9
193 Load_Cycle_Count        0x0012   091   091   000    Old_age   Always       -       99451
194 Temperature_Celsius     0x0002   214   214   000    Old_age   Always       -       28 (Min/Max 15/43)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       82
197 Current_Pending_Sector  0x0022   028   028   000    Old_age   Always       -       1856
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0
223 Load_Retry_Count        0x000a   100   100   000    Old_age   Always       -       0

SMART Error Log Version: 1
ATA Error Count: 1696 (device log contains only the most recent five errors)
	CR = Command Register [HEX]
	FR = Features Register [HEX]
	SC = Sector Count Register [HEX]
	SN = Sector Number Register [HEX]
	CL = Cylinder Low Register [HEX]
	CH = Cylinder High Register [HEX]
	DH = Device/Head Register [HEX]
	DC = Device Command Register [HEX]
	ER = Error register [HEX]
	ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 1696 occurred at disk power-on lifetime: 537 hours (22 days + 9 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 08 50 5c e9 03  Error: UNC at LBA = 0x03e95c50 = 65625168

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 20 40 c0 95 2a 40 00      04:56:08.250  READ FPDMA QUEUED
  61 40 38 20 46 2d 40 00      04:56:08.250  WRITE FPDMA QUEUED
  61 a0 30 60 45 2d 40 00      04:56:08.250  WRITE FPDMA QUEUED
  61 00 28 40 43 2d 40 00      04:56:08.250  WRITE FPDMA QUEUED
  61 c0 20 20 3e 2d 40 00      04:56:08.249  WRITE FPDMA QUEUED

Error 1695 occurred at disk power-on lifetime: 537 hours (22 days + 9 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 40 50 5c e9 03  Error: WP at LBA = 0x03e95c50 = 65625168

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 98 c8 d8 9a 52 40 00      04:56:03.980  WRITE FPDMA QUEUED
  60 20 c0 40 b8 2a 40 00      04:56:03.246  READ FPDMA QUEUED
  60 40 b8 50 5c e9 40 00      04:56:02.569  READ FPDMA QUEUED
  60 40 b0 10 25 75 40 00      04:56:02.531  READ FPDMA QUEUED
  60 40 a8 a8 85 95 40 00      04:56:02.514  READ FPDMA QUEUED

Error 1694 occurred at disk power-on lifetime: 526 hours (21 days + 22 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 08 e8 29 0b 08  Error: UNC at LBA = 0x080b29e8 = 134949352

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 08 90 e8 29 0b 40 00      03:59:47.066  READ FPDMA QUEUED
  61 08 88 a0 02 5d 40 00      03:59:47.066  WRITE FPDMA QUEUED
  61 40 80 c8 7a 85 40 00      03:59:47.064  WRITE FPDMA QUEUED
  61 08 78 b8 7a 85 40 00      03:59:47.063  WRITE FPDMA QUEUED
  61 08 70 a8 79 85 40 00      03:59:47.062  WRITE FPDMA QUEUED

Error 1693 occurred at disk power-on lifetime: 526 hours (21 days + 22 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 08 e8 29 0b 08  Error: UNC at LBA = 0x080b29e8 = 134949352

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 08 f0 e8 29 0b 40 00      03:59:43.647  READ FPDMA QUEUED
  ef 10 02 00 00 00 a0 00      03:59:43.647  SET FEATURES [Enable SATA feature]
  27 00 00 00 00 00 e0 00      03:59:43.647  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
  ec 00 00 00 00 00 a0 00      03:59:43.646  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00      03:59:43.646  SET FEATURES [Set transfer mode]

Error 1692 occurred at disk power-on lifetime: 526 hours (21 days + 22 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 08 e8 29 0b 08  Error: UNC at LBA = 0x080b29e8 = 134949352

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 08 80 e8 29 0b 40 00      03:59:40.280  READ FPDMA QUEUED
  ef 10 02 00 00 00 a0 00      03:59:40.280  SET FEATURES [Enable SATA feature]
  27 00 00 00 00 00 e0 00      03:59:40.280  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
  ec 00 00 00 00 00 a0 00      03:59:40.279  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00      03:59:40.279  SET FEATURES [Set transfer mode]

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%       532         36160528
# 2  Short offline       Completed without error       00%       522         -
# 3  Extended offline    Aborted by host               90%       522         -
# 4  Short offline       Completed without error       00%       522         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

It still says the disk is OK but the number of bad sectors kind of worries me. Everything is backed up. If I understand you correctly, what you advise is nuking the HDD, formatting it as ext4 and reinstalling? Could that fix the I/O errors? If so, I can do that this weekend.
If that causes issues again, I'll ditch the HDD and replace it with an SSD. And R00kie, yes I do sometimes move the laptop when it's still on but I don't do that that often, and when I do that it's to move it to the living room. I know that should be avoided, but seriously, can that cause that many bad sectors when done very seldom?

mich41 · 2016-01-25 12:46:58

Well, my old laptop is still fine after over 30k hours with reiserfs

Do you have more data? Were they really the same kind of disks?

nstgc · 2016-01-25 14:51:59

mich41 wrote:

UNC is "uncorrectable read error", i.e. data read from disk contain too many errors to recover with ECC.
It seems that you have 352 unreadable sectors (Current_Pending_Sector), which is a bit high. Normally UNC sectors can be "fixed" by overwriting with new data (old data are lost anyway), but big number of them suggests that the disk may have some trouble writing data correctly, i.e. be fucked. Especially if new bad sectors continue to appear.
You may want to backup all data that are still readable (sector=0.5kB so you didn't lose much yet)
BTW, only long self test detects such errors.

Some drives' SMART data isn't so straightforward to interpret. I'd contact Hitachi and ask them what that 352 really means. I flipped out a bit when reading the SMART data for my Seagates, thinking something was wrong, but they just report the data differently than my WDs did.

Also, yes. If you are trying for a surface test, either use the drive's long test or the utility "badblocks". Be careful with badblocks though since it can lead to data loss even with "-n".

It also could be an issue with the SATA bus, port, or controller. Last January I was having some issues which I think (but am not certain) are actually related to one of those three as opposed to the drive, cable, or FS (which for me is btrfs).

Have you tried running "btrfs scrub start -Bdr /path/to/volume"?

R00KIE · 2016-01-25 14:52:29

Mathuin wrote:

And R00kie, yes I do sometimes move the laptop when it's still on but I don't do that that often, and when I do that it's to move it to the living room. I know that should be avoided, but seriously, can that cause that many bad sectors when done very seldom?

Yes. Mechanical drives are sensitive to shock, vibration and a few other things. Moving the laptop while it is on is not a problem if you are careful, but it is always a risk. People tend to forget this, not know it, or plainly ignore the warning, and are then surprised when things go wrong.

It would be helpful to know how you ran badblocks. Given that you have backed up everything, you should run it with the -w option and let it do all the write+read passes. If you still get errors from badblocks after all the passes, or you still get pending sectors, then the drive may be starting to fail. Your Reallocated_Event_Count is also increasing, which is never a good sign. If Reallocated_Sector_Ct starts to rise, then it's better to change the drive.
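
As a rough illustration of how the counters moved, the sketch below subtracts the values in the first smartctl -a output from those in the second. All numbers are copied from the two outputs posted in this thread; the helper name is made up. Keep in mind that the overnight badblocks run and the long self-test read the whole disk in between, so part of the jump is probably latent bad sectors being discovered rather than new damage.

```shell
# delta is a hypothetical helper: later value minus earlier value.
delta() { echo $(( $2 - $1 )); }

hours=$(delta 523 537)            # Power_On_Hours
pending=$(delta 352 1856)         # Current_Pending_Sector (raw)
realloc_events=$(delta 11 82)     # Reallocated_Event_Count (raw)
ata_errors=$(delta 259 1696)      # ATA Error Count

printf 'over %s power-on hours: +%s pending, +%s reallocation events, +%s ATA errors\n' \
    "$hours" "$pending" "$realloc_events" "$ata_errors"
```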

Mathuin · 2016-01-25 15:16:12

I ran badblocks in non-destructive mode (i.e. with the -n flag only). If I use the destructive mode and then re-run badblocks, it should not show as many errors, right? If so, what does it do under the hood? I'm not sure I understand how destructive tests work.
Either way, I'd rather avoid destructive tests until next weekend, as I need to use the laptop.

R00KIE · 2016-01-25 16:05:11

Destructive tests write the whole disk with a pattern and then read the whole disk and check the data is correct. This is repeated for a few different patterns, see 'man badblocks'.

This will cause the pending sectors to be written and if the disk is not failing it will restore proper operation(1). I've seen this on some of my disks before and writing to the affected sector has solved the problem, although in my case it was always only a few pending sectors.

(1) According to internet knowledge, the sectors should get replaced (reallocated) when you write to them, although in my case writing to pending sectors never caused the sector to be reallocated.
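
The write-then-read idea described above can be illustrated on a scratch file. This is only a sketch of the principle, not what badblocks does internally: it works on a temporary file rather than a block device, it goes through the page cache, and the helper name is made up. On healthy storage every pattern should verify.

```shell
# verify_pattern is a hypothetical helper: fill a file with one byte value,
# then compare the file against the same pattern generated again.
verify_pattern() {
    # $1 = file, $2 = pattern byte as an octal escape for tr, $3 = size in bytes
    head -c "$3" /dev/zero | LC_ALL=C tr '\0' "$2" > "$1"
    head -c "$3" /dev/zero | LC_ALL=C tr '\0' "$2" | cmp -s - "$1"
}

scratch="$(mktemp)"
# Octal escapes for 0xaa, 0x55, 0xff and 0x00, the patterns listed in 'man badblocks'.
for pattern in '\252' '\125' '\377' '\000'; do
    if verify_pattern "$scratch" "$pattern" 4096; then
        printf 'pattern %s: ok\n' "$pattern"
    else
        printf 'pattern %s: MISMATCH\n' "$pattern"
    fi
done
rm -f "$scratch"
```

On a real disk the point of the write pass is that it forces the drive to rewrite every sector, including the pending ones, and the read pass then shows whether the rewritten sectors hold their data.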

mich41 · 2016-01-25 17:41:59

R00KIE wrote:

(1) According to internet knowledge, the sectors should get replaced (reallocated) when you write to them, although in my case writing to pending sectors never caused the sector to be reallocated.

I have two disks that have lots of bad sector issues which always go away after rewriting and don't increase Reallocated_Sector_Count.

I think it's caused by the disk occasionally making errors when writing to undamaged areas on the platter. Reallocation is only needed when part of the platter is damaged and cannot be written at all. If platter is undamaged but some data were incorrect due to write error, it suffices to simply write new data there and the problem is gone.

Of course an alternative explanation is that Reallocated_Sector_Count raw value is something else than the exact number of reallocated sectors, but one of my disks is so bad that it already "fixed" thousands of UNC sectors and yet the "normalized" value of Reallocated_Sector_Count is still 100.

Last edited by mich41 (2016-01-25 17:47:09)

R00KIE · 2016-01-25 20:14:21

@mich41
In the case of pending and reallocated sectors you want the raw values.

And yes, I do know that sometimes disks might not write properly for whatever reason(1) and the location on the platter is fine, but if the internet knowledge is correct, which it might not be, there can be cases where the pending sector really is reallocated, but this is most probably very dependent on the firmware and class of drives. This requires a write plus a confirmation read, which normally consumer drives don't do or even support, and I've had a disk on my hands that would write fine and wouldn't reallocate the sectors but pending sectors kept cropping up.

(1) Shock, vibration or voltages out of operational limits and most probably a slew of other conditions can cause this.
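
For reference, the raw values are the last column (RAW_VALUE) of the attribute table. A small sketch that pulls them out for attributes 5, 196 and 197, run here against rows copied from the second smartctl -a output in this thread (the function and variable names are made up for illustration):

```shell
# Sample rows copied from the second smartctl -a output in this thread.
smart_sample='  5 Reallocated_Sector_Ct   0x0033   098   098   005    Pre-fail  Always       -       0
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       82
197 Current_Pending_Sector  0x0022   028   028   000    Old_age   Always       -       1856'

# Print attribute name and RAW_VALUE (10th column) for IDs 5, 196 and 197.
raw_values() {
    awk '$1 == 5 || $1 == 196 || $1 == 197 { print $2 "=" $10 }'
}

printf '%s\n' "$smart_sample" | raw_values
```

On a live system the same filter could presumably be fed from `smartctl -A /dev/sda`, which prints only the attribute table.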

Mathuin · 2016-01-26 21:22:40

Thanks everybody for your replies. I don't trust this HDD, the number of bad blocks is way too high. I'm running a last non-destructive read-write badblocks pass, and then I'll re-test.

I've always wanted to put an SSD in this machine, so I think I'll ditch the hard drive.

It's going to be a pain to open and I'm going to need a set of Torx screwdrivers but the peace of mind and SSD speed are nice perks.

R00KIE · 2016-01-27 21:25:16

Mathuin wrote:

It's going to be a pain to open and I'm going to need a set of Torx screwdrivers but the peace of mind and SSD speed are nice perks.

SSDs are not problem-free either; make sure you know what you are getting into before you spend money on one.

Mathuin · 2016-01-27 21:31:21

From what I've seen, with SSDs the main problem is TRIM, right ? From what I've read on the wiki, trimming it once a week via a systemd service would be enough.

If there are more details to take into account, please do tell ! I'd prefer SSD over HDD mostly for performance reasons, the call of the new shiny thing and above all the lack of any moving parts.

Soukyuu · 2016-01-27 21:35:00

537 power on hours and already 11 reallocated sectors? On a Hitachi drive, that's a good sign it's dying, or maybe was defective from the start.
I have 4 or so of those old IDE disks that have much higher power on hours count and about the same amount of reallocated sectors.
All my old SATA disks are still free from any reallocated sectors. So yes, I'd say it's defective.

If the laptop is still under warranty, I'd get the drive exchanged if I were you - except for that research project you mentioned.

R00KIE · 2016-01-27 22:04:18

For SSDs you should trim them regularly if you want to keep top performance and extend service life. You may have to disable NCQ TRIM if the drive claims to support it but it doesn't work well, and more importantly you should be aware that you can lose everything on the SSD if you cut power suddenly. Some SSDs claim to have some form of protection against this, but I wouldn't feel tempted to confirm whether that is true.

That said, an ssd will make a big difference even for older machines and you will be free from the worry of damaging the HDD if you move your laptop around and happen to bump into something or if you drop it on a table too hard.
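
As a rough sketch of the regular trimming mentioned above: before enabling a periodic trim it is worth checking that the drive advertises discard support at all. The helper below reads `lsblk --discard` output and lists devices whose DISC-GRAN and DISC-MAX are non-zero. The function name `trim_capable` is invented for the example, and the column layout is an assumption about lsblk's output.

```shell
#!/bin/sh
# Hypothetical helper: read `lsblk --discard --nodeps` output on stdin and
# print the names of devices that advertise discard (TRIM) support.
# Assumption: columns are NAME DISC-ALN DISC-GRAN DISC-MAX DISC-ZERO, and a
# zero DISC-GRAN or DISC-MAX ("0B" or "0") means discard is not supported.
trim_capable() {
    awk 'NR > 1 && $3 !~ /^0B?$/ && $4 !~ /^0B?$/ { print $1 }'
}

# Typical use:
#   lsblk --discard --nodeps | trim_capable
# If the SSD is listed, the weekly timer shipped with util-linux can be
# enabled with:
#   sudo systemctl enable fstrim.timer
```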

mich41 · 2016-01-27 22:04:55

The main problems with SSDs are data corruption on power loss (unless equipped with supercapacitors) and buggy firmware (like this SSD going catatonic under heavy writes, a few threads below).

Mathuin · 2016-01-27 22:05:14

Thanks for your input. I really don't trust that drive anymore.

If I could get it replaced reasonably fast I would. But I got it from a French shop, and they redirect us to Asus' customer support. It's really annoying that they can't act as intermediaries: they could have replaced the drive and sent the old one back to Asus, and that would have been much faster.

I've also had a slew of problems with Asus' customer service in the past, they replaced a failing mobo on my parents' laptop after 2 months and after that it failed again. That's not an option.

Sorry, I just saw the rest of your replies. So if I get an SSD, I need to do some research to see how buggy the firmware is ? As to what both of you said about power loss, that means that the laptop must always be cleanly shut down and that I mustn't let the battery die on me while using the computer ? Sounds reasonable. Any brands/models you'd recommend ?

Last edited by Mathuin (2016-01-27 22:08:30)

severach · 2016-01-28 07:27:20

I use an Intel S3500. When someone runs another test like this and names another non-loser, I'll switch.

I use btrfs for urbackup. Good test with disposable data.

mich41 · 2016-01-28 11:22:43

Mathuin wrote:

As to what both of you said about power loss, that means that the laptop must always be cleanly shut down and that I mustn't let the battery die on me while using the computer ? Sounds reasonable.

Yep, something like that. For a laptop with a good battery I think it may be reasonable; for a desktop box I wouldn't feel very comfortable without power loss protection.

Mathuin wrote:

Any brands/models you'd recommend ?

Well, I can't remember people reporting problems with Intel here. But these are pricey, otoh.

Personally, I only own three old Intel 320s. All are OK after 10k hours and, in one case, 4k power losses caused by a bad cable.

I have no direct experience with current production stuff but it seems that people are still bitten by bugs.

Last edited by mich41 (2016-01-28 11:27:56)

R00KIE · 2016-01-28 14:08:33

Let's try not to get this thread TGNed [1]. The OP should do their own research and decide what to buy. Google is your friend and the right search words will probably tell you which brands to avoid, even if their SSDs look tempting from a performance/price point of view.

That said you might want to take a look at the kernel source code, specifically drivers/ata/libata-core.c around line 4222 as it might help guide your buying decision.

[1] https://wiki.archlinux.org/index.php/Fo … n_requests
