Hello,
I've been running Arch on my Asus UX303 (Btrfs filesystem) without any problems since August. During the week I don't use it much because I'm a bit overworked. Anyway, last weekend I updated, and I wanted to update again this morning. Pacman crashed with a bus error. I was a bit puzzled, so I checked my hard drive, thinking it was perhaps a hardware problem. While backing up my files, I noticed that cp couldn't copy a few files from my home directory, mostly music but also files from my Miniconda install, because of an I/O error. Other than pacman, everything works.
I booted an ISO and ran btrfs check, which found two or so "unfixable" errors. I then ran a SMART health check with -H, which went fine: the hard drive passed.
I'm sorry for not providing more details: I left my charger at my dorm yesterday and I'm at my parents' until tomorrow evening. I only have 15% battery left and I want to save it. What do you think is the real source of those errors? A Btrfs problem, or hard drive failure?
I'm worried because while the laptop is still under warranty, I don't have Windows on it anymore, and I would have to reinstall Windows so that Asus customer service doesn't reject it, right? Furthermore, I need the laptop during the week because I've got a research project to do, and that project is part of the exams needed to get into an engineering school.
Thanks in advance for your attention. If there is any further information you need, tell me and I'll post it.
Sorry for bumping the thread. I've got my charger back, so I can give more detail. Whenever I run **any** pacman command, it dies with a bus error. To find out what caused this, I ran:
$ dmesg | tail
[ 2901.910544] ata1.00: status: { DRDY ERR }
[ 2901.910545] ata1.00: error: { UNC }
[ 2901.912846] ata1.00: configured for UDMA/133
[ 2901.912857] sd 0:0:0:0: [sda] tag#6 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
[ 2901.912859] sd 0:0:0:0: [sda] tag#6 Sense Key : 0x3 [current] [descriptor]
[ 2901.912861] sd 0:0:0:0: [sda] tag#6 ASC=0x11 ASCQ=0x4
[ 2901.912863] sd 0:0:0:0: [sda] tag#6 CDB: opcode=0x28 28 00 03 9a 02 18 00 00 08 00
[ 2901.912864] blk_update_request: I/O error, dev sda, sector 60424728
[ 2901.912868] BTRFS: bdev /dev/sda3 errs: wr 0, rd 227, flush 0, corrupt 0, gen 0
[ 2901.912884] ata1: EH complete
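As a sanity check, the LBA in the SMART error log and the sector in blk_update_request can be compared directly; a small sketch (the partition start sector below is a made-up example — read the real one from fdisk -l /dev/sda):

```shell
# Convert the hex LBA from the SMART error log to the decimal sector
# number that blk_update_request prints.
lba=$(printf '%d' 0x039a0218)
echo "failing sector: $lba"       # 60424728, same sector dmesg complains about

# Byte offset of that sector inside /dev/sda3, assuming the partition
# starts at sector 1050624 (hypothetical; check fdisk -l /dev/sda).
part_start=1050624
echo "offset in /dev/sda3: $(( (lba - part_start) * 512 )) bytes"
```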
I think that it's the hard drive itself that's broken. I ran a short SMART test:
$ smartctl -t short /dev/sda
[Irrelevant]
$ smartctl -a /dev/sda
smartctl 6.4 2015-06-04 r4109 [x86_64-linux-4.3.3-2-ARCH] (local build)
Copyright (C) 2002-15, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Device Model: HGST HTS545050A7E680
Serial Number: RB050AM51DSJSJ
LU WWN Device Id: 5 000cca 7a3d3e453
Firmware Version: GR2OA3B0
User Capacity: 500,107,862,016 bytes [500 GB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5400 rpm
Form Factor: 2.5 inches
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ATA8-ACS T13/1699-D revision 6
SATA Version is: SATA 2.6, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Sun Jan 24 22:55:24 2016 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 45) seconds.
Offline data collection
capabilities: (0x5b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 109) minutes.
SCT capabilities: (0x003d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000b 096 096 062 Pre-fail Always - 589824
2 Throughput_Performance 0x0005 100 100 040 Pre-fail Offline - 0
3 Spin_Up_Time 0x0007 253 253 033 Pre-fail Always - 0
4 Start_Stop_Count 0x0012 095 095 000 Old_age Always - 9319
5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 0
7 Seek_Error_Rate 0x000b 100 100 067 Pre-fail Always - 0
8 Seek_Time_Performance 0x0005 100 100 040 Pre-fail Offline - 0
9 Power_On_Hours 0x0012 099 099 000 Old_age Always - 523
10 Spin_Retry_Count 0x0013 100 100 060 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 422
191 G-Sense_Error_Rate 0x000a 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 9
193 Load_Cycle_Count 0x0012 091 091 000 Old_age Always - 99202
194 Temperature_Celsius 0x0002 193 193 000 Old_age Always - 31 (Min/Max 15/43)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 11
197 Current_Pending_Sector 0x0022 095 095 000 Old_age Always - 352
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x000a 200 200 000 Old_age Always - 0
223 Load_Retry_Count 0x000a 100 100 000 Old_age Always - 0
SMART Error Log Version: 1
ATA Error Count: 259 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 259 occurred at disk power-on lifetime: 523 hours (21 days + 19 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 08 50 5c e9 03 Error: UNC at LBA = 0x03e95c50 = 65625168
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 08 88 50 5c e9 40 00 00:31:38.962 READ FPDMA QUEUED
ea 00 00 00 00 00 a0 00 00:31:38.954 FLUSH CACHE EXT
61 08 78 80 00 10 40 00 00:31:38.954 WRITE FPDMA QUEUED
ea 00 00 00 00 00 a0 00 00:31:38.927 FLUSH CACHE EXT
61 20 68 20 4f 43 40 00 00:31:38.927 WRITE FPDMA QUEUED
Error 258 occurred at disk power-on lifetime: 523 hours (21 days + 19 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 08 50 5c e9 03 Error: UNC at LBA = 0x03e95c50 = 65625168
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 08 f0 50 5c e9 40 00 00:27:03.940 READ FPDMA QUEUED
ea 00 00 00 00 00 a0 00 00:26:58.710 FLUSH CACHE EXT
61 08 d0 00 00 10 40 00 00:26:58.710 WRITE FPDMA QUEUED
61 08 c8 00 00 12 40 00 00:26:58.710 WRITE FPDMA QUEUED
61 08 c0 80 00 10 40 00 00:26:58.710 WRITE FPDMA QUEUED
Error 257 occurred at disk power-on lifetime: 523 hours (21 days + 19 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 08 18 02 9a 03 Error: WP at LBA = 0x039a0218 = 60424728
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
61 08 b0 80 00 10 40 00 00:25:41.546 WRITE FPDMA QUEUED
60 08 a8 18 02 9a 40 00 00:25:41.546 READ FPDMA QUEUED
ea 00 00 00 00 00 a0 00 00:25:41.449 FLUSH CACHE EXT
61 20 98 40 75 42 40 00 00:25:41.448 WRITE FPDMA QUEUED
61 20 90 40 72 42 40 00 00:25:41.448 WRITE FPDMA QUEUED
Error 256 occurred at disk power-on lifetime: 523 hours (21 days + 19 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 08 18 02 9a 03 Error: UNC at LBA = 0x039a0218 = 60424728
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 08 f0 18 02 9a 40 00 00:24:44.522 READ FPDMA QUEUED
61 10 e8 60 53 54 40 00 00:24:36.018 WRITE FPDMA QUEUED
61 10 e0 50 92 51 40 00 00:24:16.843 WRITE FPDMA QUEUED
ea 00 00 00 00 00 a0 00 00:24:06.139 FLUSH CACHE EXT
61 08 c0 00 00 10 40 00 00:24:06.139 WRITE FPDMA QUEUED
Error 255 occurred at disk power-on lifetime: 522 hours (21 days + 18 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 08 50 5c e9 03 Error: UNC at LBA = 0x03e95c50 = 65625168
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 08 f0 50 5c e9 40 00 00:20:20.835 READ FPDMA QUEUED
60 08 e8 98 20 52 40 00 00:20:20.830 READ FPDMA QUEUED
60 08 e0 c0 19 52 40 00 00:20:20.829 READ FPDMA QUEUED
60 20 d8 40 33 16 40 00 00:20:20.821 READ FPDMA QUEUED
61 08 d0 08 83 10 40 00 00:20:20.634 WRITE FPDMA QUEUED
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 522 -
# 2 Extended offline Aborted by host 90% 522 -
# 3 Short offline Completed without error 00% 522 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
I'm beginning to think that the hard drive is fucked. I can send it back, as it's under warranty.
What do I do? Is my diagnosis correct?
The drive may not be broken, but the sure way to make things work would be to back up everything, wipe the disk, copy things back, and reinstall any packages whose files you couldn't copy. You can most probably send it back under warranty, but if you haven't backed up your data, don't expect to get it back.
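A rough sketch of that last step, assuming you kept a list of the files cp failed on (the paths and the list location are invented for illustration):

```shell
# Example list of files that failed to copy (hypothetical paths).
cat > /tmp/unreadable.txt <<'EOF'
/usr/lib/python3.5/os.py
/usr/lib/python3.5/os.py
/usr/share/licenses/foo/LICENSE
EOF

# Deduplicate before mapping files to their owning packages.
sort -u /tmp/unreadable.txt | wc -l    # 2 distinct files

# Then let pacman name the owners and reinstall only those (not run here):
#   pacman -Qoq $(sort -u /tmp/unreadable.txt) | sort -u | pacman -S -
```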
Just out of curiosity, do you happen to move your laptop a lot while it is on?
R00KIE
Tm90aGluZyB0byBzZWUgaGVyZSwgbW92ZSBhbG9uZy4K
Perhaps this is just a hardware failure, but I would still use ext4 next time. If btrfs fails, it usually fails catastrophically, as a forum search shows.
UNC is "uncorrectable read error", i.e. data read from disk contain too many errors to recover with ECC.
It seems that you have 352 unreadable sectors (Current_Pending_Sector), which is a bit high. Normally UNC sectors can be "fixed" by overwriting them with new data (the old data are lost anyway), but such a large number suggests that the disk may have trouble writing data correctly, i.e. be fucked. Especially if new bad sectors keep appearing.
You may want to back up all data that are still readable (a sector is 0.5 kB, so you haven't lost much yet).
BTW, only the long self-test detects such errors.
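To make the overwrite idea concrete, here is a hedged sketch for one known-bad LBA from the error log; the hdparm step destroys those sectors' contents and is left commented out. The rounding matters because this drive has 512-byte logical but 4096-byte physical sectors:

```shell
# One unreadable 4K physical sector spans eight consecutive 512-byte LBAs,
# so round down to the start of the physical sector before overwriting.
lba=65625168                      # UNC LBA from the SMART error log
start=$(( lba / 8 * 8 ))
echo "overwrite LBAs $start-$(( start + 7 ))"   # 65625168-65625175

# DESTRUCTIVE, only after backing up -- zero-fill those sectors in place:
#   for s in $(seq $start $(( start + 7 ))); do
#       hdparm --yes-i-know-what-i-am-doing --write-sector "$s" /dev/sda
#   done
```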
Perhaps this is just a hardware failure, but I would still use ext4 next time. If btrfs fails, it usually fails catastrophically, as a forum search shows.
Yeah, a similar thing happened to me with btrfs, but with no data supporting a hardware failure. Use ext4 and be happy with a stable system.
CPU-optimized Linux-ck packages @ Repo-ck • AUR packages • Zsh and other configs
Yeah, a similar thing happened to me with btrfs, but with no data supporting a hardware failure. Use ext4 and be happy with a stable system.
I think the strain created by the unusual disk usage pattern of B-tree-based file systems might actually cause these sudden HD hardware failures in the first place.
I was a big proponent of ReiserFS 3 at home and at university in the early '00s. Initially everything worked beautifully compared to ext2, but later most of those Reiser disks failed suddenly and completely, while the ext2-based disks were fine. It made me look like an idiot for advocating ReiserFS, not to mention I lost my own data at home. Correlation or causation? IDK, but it seems to be a very similar story with btrfs now.
Beware Linux and its "next-gen" file systems! Caveat emptor. :-)
So, during the night I ran badblocks, which found 961 bad blocks. I also ran:
$ smartctl -t long /dev/sda
[Irrelevant]
$ smartctl -H /dev/sda
smartctl 6.4 2015-06-04 r4109 [x86_64-linux-4.3.3-2-ARCH] (local build)
Copyright (C) 2002-15, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
$ smartctl -a /dev/sda
smartctl 6.4 2015-06-04 r4109 [x86_64-linux-4.3.3-2-ARCH] (local build)
Copyright (C) 2002-15, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Device Model: HGST HTS545050A7E680
Serial Number: RB050AM51DSJSJ
LU WWN Device Id: 5 000cca 7a3d3e453
Firmware Version: GR2OA3B0
User Capacity: 500,107,862,016 bytes [500 GB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5400 rpm
Form Factor: 2.5 inches
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ATA8-ACS T13/1699-D revision 6
SATA Version is: SATA 2.6, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Mon Jan 25 13:38:32 2016 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 121) The previous self-test completed having
the read element of the test failed.
Total time to complete Offline
data collection: ( 45) seconds.
Offline data collection
capabilities: (0x5b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 109) minutes.
SCT capabilities: (0x003d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000b 097 097 062 Pre-fail Always - 458752
2 Throughput_Performance 0x0005 100 100 040 Pre-fail Offline - 0
3 Spin_Up_Time 0x0007 253 253 033 Pre-fail Always - 0
4 Start_Stop_Count 0x0012 095 095 000 Old_age Always - 9350
5 Reallocated_Sector_Ct 0x0033 098 098 005 Pre-fail Always - 0
7 Seek_Error_Rate 0x000b 100 100 067 Pre-fail Always - 0
8 Seek_Time_Performance 0x0005 100 100 040 Pre-fail Offline - 0
9 Power_On_Hours 0x0012 099 099 000 Old_age Always - 537
10 Spin_Retry_Count 0x0013 100 100 060 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 423
191 G-Sense_Error_Rate 0x000a 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 9
193 Load_Cycle_Count 0x0012 091 091 000 Old_age Always - 99451
194 Temperature_Celsius 0x0002 214 214 000 Old_age Always - 28 (Min/Max 15/43)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 82
197 Current_Pending_Sector 0x0022 028 028 000 Old_age Always - 1856
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x000a 200 200 000 Old_age Always - 0
223 Load_Retry_Count 0x000a 100 100 000 Old_age Always - 0
SMART Error Log Version: 1
ATA Error Count: 1696 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 1696 occurred at disk power-on lifetime: 537 hours (22 days + 9 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 08 50 5c e9 03 Error: UNC at LBA = 0x03e95c50 = 65625168
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 20 40 c0 95 2a 40 00 04:56:08.250 READ FPDMA QUEUED
61 40 38 20 46 2d 40 00 04:56:08.250 WRITE FPDMA QUEUED
61 a0 30 60 45 2d 40 00 04:56:08.250 WRITE FPDMA QUEUED
61 00 28 40 43 2d 40 00 04:56:08.250 WRITE FPDMA QUEUED
61 c0 20 20 3e 2d 40 00 04:56:08.249 WRITE FPDMA QUEUED
Error 1695 occurred at disk power-on lifetime: 537 hours (22 days + 9 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 40 50 5c e9 03 Error: WP at LBA = 0x03e95c50 = 65625168
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
61 98 c8 d8 9a 52 40 00 04:56:03.980 WRITE FPDMA QUEUED
60 20 c0 40 b8 2a 40 00 04:56:03.246 READ FPDMA QUEUED
60 40 b8 50 5c e9 40 00 04:56:02.569 READ FPDMA QUEUED
60 40 b0 10 25 75 40 00 04:56:02.531 READ FPDMA QUEUED
60 40 a8 a8 85 95 40 00 04:56:02.514 READ FPDMA QUEUED
Error 1694 occurred at disk power-on lifetime: 526 hours (21 days + 22 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 08 e8 29 0b 08 Error: UNC at LBA = 0x080b29e8 = 134949352
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 08 90 e8 29 0b 40 00 03:59:47.066 READ FPDMA QUEUED
61 08 88 a0 02 5d 40 00 03:59:47.066 WRITE FPDMA QUEUED
61 40 80 c8 7a 85 40 00 03:59:47.064 WRITE FPDMA QUEUED
61 08 78 b8 7a 85 40 00 03:59:47.063 WRITE FPDMA QUEUED
61 08 70 a8 79 85 40 00 03:59:47.062 WRITE FPDMA QUEUED
Error 1693 occurred at disk power-on lifetime: 526 hours (21 days + 22 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 08 e8 29 0b 08 Error: UNC at LBA = 0x080b29e8 = 134949352
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 08 f0 e8 29 0b 40 00 03:59:43.647 READ FPDMA QUEUED
ef 10 02 00 00 00 a0 00 03:59:43.647 SET FEATURES [Enable SATA feature]
27 00 00 00 00 00 e0 00 03:59:43.647 READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
ec 00 00 00 00 00 a0 00 03:59:43.646 IDENTIFY DEVICE
ef 03 46 00 00 00 a0 00 03:59:43.646 SET FEATURES [Set transfer mode]
Error 1692 occurred at disk power-on lifetime: 526 hours (21 days + 22 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 08 e8 29 0b 08 Error: UNC at LBA = 0x080b29e8 = 134949352
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 08 80 e8 29 0b 40 00 03:59:40.280 READ FPDMA QUEUED
ef 10 02 00 00 00 a0 00 03:59:40.280 SET FEATURES [Enable SATA feature]
27 00 00 00 00 00 e0 00 03:59:40.280 READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
ec 00 00 00 00 00 a0 00 03:59:40.279 IDENTIFY DEVICE
ef 03 46 00 00 00 a0 00 03:59:40.279 SET FEATURES [Set transfer mode]
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed: read failure 90% 532 36160528
# 2 Short offline Completed without error 00% 522 -
# 3 Extended offline Aborted by host 90% 522 -
# 4 Short offline Completed without error 00% 522 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
It still says the disk is OK, but the number of bad sectors kind of worries me. Everything is backed up. If I understand you correctly, you advise nuking the HDD, reformatting as ext4 and reinstalling? Could that fix the I/O errors? If so, I can do that this weekend.
If that causes issues again, I'll ditch the HDD and replace it with an SSD. And R00KIE, yes, I do sometimes move the laptop while it's on, but not that often, and when I do it's to carry it to the living room. I know that should be avoided, but seriously, can that cause this many bad sectors when done so seldom?
Well, my old laptop is still fine after over 30k hours with ReiserFS.
Do you have more data? Were they really the same kind of disks?
UNC is "uncorrectable read error", i.e. data read from disk contain too many errors to recover with ECC.
It seems that you have 352 unreadable sectors (Current_Pending_Sector), which is a bit high. Normally UNC sectors can be "fixed" by overwriting them with new data (the old data are lost anyway), but such a large number suggests that the disk may have trouble writing data correctly, i.e. be fucked. Especially if new bad sectors keep appearing.
You may want to back up all data that are still readable (a sector is 0.5 kB, so you haven't lost much yet).
BTW, only the long self-test detects such errors.
Some drives' SMART data isn't so straightforward to interpret. I'd contact Hitachi and ask what that 352 really means. I flipped out a bit when reading the SMART data for my Seagates, thinking something was wrong, but they just report the data differently than my WDs did.
Also, yes. If you want a surface test, either use the drive's long self-test or the badblocks utility. Be careful with badblocks, though, since it can lead to data loss even with "-n".
It could also be an issue with the SATA bus, port, or controller. Last January I was having some issues which I think (but am not certain) were actually related to one of those three, as opposed to the drive, cable, or FS (which for me is btrfs).
Have you tried running "btrfs scrub start -Bdr /path/to/volume"?
And R00KIE, yes, I do sometimes move the laptop while it's on, but not that often, and when I do it's to carry it to the living room. I know that should be avoided, but seriously, can that cause this many bad sectors when done so seldom?
Yes. Mechanical drives are sensitive to shock, vibration and a few other things. Moving the laptop while it is on is not a problem if you are careful, but it is always a risk. Everyone seems to forget it, not know it, or plainly ignore the warnings, and then they're surprised when the bad surprises arrive.
It would be helpful to know how you ran badblocks. Given that you have backed up everything, you should run it with the -w option and let it do all the write+read passes. If you still get errors from badblocks after all the passes, or you still get pending sectors, then the drive may be starting to fail. Your Reallocated_Event_Count is also increasing, which is never a good sign. If Reallocated_Sector_Ct starts climbing too, it's better to change the drive.
R00KIE
I ran badblocks in non-destructive mode (i.e. with the -n flag only). If I use destructive mode and then re-run badblocks, it should not show as many errors, right? If so, what does it do under the hood? I'm not sure I understand how destructive tests work.
Either way, I'd rather avoid destructive tests until next weekend; I need to use the laptop.
Destructive tests write a pattern over the whole disk, then read the whole disk back and check that the data is correct. This is repeated for a few different patterns; see 'man badblocks'.
This will cause the pending sectors to be written, and if the disk is not failing it will restore proper operation(1). I've seen this on some of my disks before, and writing to the affected sector solved the problem, although in my case it was always only a few pending sectors.
(1) According to internet knowledge, the sectors should get replaced (reallocated) when you write to them, although in my case writing to pending sectors never caused a reallocation.
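The write-then-verify idea can be miniaturized on a scratch file instead of a disk. badblocks -w does essentially this over the whole device with the patterns 0xaa, 0x55, 0xff and 0x00 (the real run, e.g. badblocks -wsv /dev/sda, erases everything on it):

```shell
# Miniature "destructive pass" on a throwaway file: write a known
# pattern, read it back, compare.
img=$(mktemp)
ref=$(mktemp)
head -c 8192 /dev/zero | tr '\0' '\252' > "$ref"   # 8 KiB of the 0xaa pattern
cp "$ref" "$img"                                   # the write pass
cmp -s "$ref" "$img" && echo "pattern verified" || echo "bad blocks found"
rm -f "$img" "$ref"
```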
R00KIE
(1) According to internet knowledge, the sectors should get replaced (reallocated) when you write to them, although in my case writing to pending sectors never caused a reallocation.
I have two disks with lots of bad-sector issues which always go away after rewriting and don't increase Reallocated_Sector_Ct.
I think it's caused by the disk occasionally making errors when writing to undamaged areas of the platter. Reallocation is only needed when part of the platter is damaged and cannot be written at all. If the platter is undamaged but some data were incorrect due to a write error, it suffices to simply write new data there and the problem is gone.
Of course, an alternative explanation is that the Reallocated_Sector_Ct raw value is something other than the exact number of reallocated sectors, but one of my disks is so bad that it has already "fixed" thousands of UNC sectors, and yet the normalized value of Reallocated_Sector_Ct is still 100.
Last edited by mich41 (2016-01-25 17:47:09)
@mich41
In the case of pending and reallocated sectors you want the raw values.
And yes, I do know that sometimes disks don't write properly for whatever reason(1) even though the location on the platter is fine. But if the internet knowledge is correct, which it might not be, there can be cases where the pending sector really is reallocated; that is most probably very dependent on the firmware and class of drive. It requires a write plus a confirmation read, which consumer drives normally don't do or even support, and I've had a disk on my hands that would write fine and wouldn't reallocate the sectors, yet pending sectors kept cropping up.
(1) Shock, vibration or voltages out of operational limits and most probably a slew of other conditions can cause this.
R00KIE
Thanks everybody for your replies. I don't trust this HDD; the number of bad blocks is way too high. I'm running one last non-destructive read-write badblocks pass, and then I'll re-test.
I've always wanted to put an SSD in this machine, so I think I'll ditch the hard drive.
It's going to be a pain to open and I'm going to need a set of Torx screwdrivers but the peace of mind and SSD speed are nice perks.
It's going to be a pain to open and I'm going to need a set of Torx screwdrivers but the peace of mind and SSD speed are nice perks.
SSDs are not problem-free either; make sure you know what you are getting into before you spend money on one.
R00KIE
From what I've seen, with SSDs the main thing to take care of is TRIM, right? From the wiki, it seems that trimming once a week via a systemd service would be enough.
If there are more details to take into account, please do tell! I'd prefer an SSD over an HDD mostly for performance reasons, the call of the shiny new thing, and above all the lack of any moving parts.
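For reference, the weekly trim the wiki describes boils down to a timer unit like the sketch below (recent util-linux already ships ready-made fstrim.timer/fstrim.service, in which case "systemctl enable fstrim.timer" is all that's needed; the unit contents here are a minimal hand-written equivalent):

```ini
# /etc/systemd/system/fstrim.service
[Unit]
Description=Discard unused blocks

[Service]
Type=oneshot
ExecStart=/usr/bin/fstrim -av

# /etc/systemd/system/fstrim.timer
[Unit]
Description=Discard unused blocks once a week

[Timer]
OnCalendar=weekly
Persistent=true

[Install]
WantedBy=timers.target
```

Persistent=true makes a missed run fire at the next boot, which suits a laptop that isn't always powered on.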
537 power-on hours and already 11 reallocation events? On a Hitachi drive, that's a good sign it's dying, or maybe it was defective from the start.
I have four or so old IDE disks with much higher power-on-hours counts and about the same number of reallocations.
All my old SATA disks are still free of any reallocated sectors. So yes, I'd say it's defective.
If the laptop is still under warranty, I'd get the drive exchanged if I were you, were it not for that research project you mentioned.
[ Arch x86_64 | linux | Framework 13 | AMD Ryzen™ 5 7640U | 32GB RAM | KDE Plasma Wayland ]
For SSDs, you should trim regularly if you want to keep top performance and extend service life. You may have to disable NCQ TRIM if the drive claims to support it but doesn't implement it well, and more importantly you should be aware that you can lose everything on the SSD if power is cut suddenly. Some SSDs claim to have some form of protection against this, but I wouldn't feel tempted to confirm whether that is true.
That said, an SSD will make a big difference even on older machines, and you will be free from the worry of damaging the HDD if you move your laptop around and happen to bump into something, or drop it on a table too hard.
R00KIE
The main problems with SSDs are data corruption on power loss (unless equipped with supercapacitors) and buggy firmware (like the SSD going catatonic under heavy writes, a few threads below).
Thanks for your input. I really don't trust that drive anymore.
If I could get it replaced reasonably fast, I would. But I bought it from a French shop, and they redirect customers to Asus' support. It's really annoying that they can't act as an intermediary; they could have replaced the drive and sent the old one back to Asus themselves, which would have been much faster.
I've also had a slew of problems with Asus' customer service in the past: they took two months to replace a failing motherboard on my parents' laptop, and after that it failed again. So that's not an option.
Sorry, I just saw the rest of your replies. So if I get an SSD, I need to do some research on how buggy the firmware is? As for what both of you said about power loss, that means the laptop must always be cleanly shut down and I mustn't let the battery die on me while using the computer? Sounds reasonable. Any brands/models you'd recommend?
Last edited by Mathuin (2016-01-27 22:08:30)
I use an Intel S3500. When someone runs another test like this and names another drive that isn't a loser, I'll switch.
I use btrfs for urbackup; it's a good test with disposable data.
As to what both of you said about power loss, that means that the laptop must always be cleanly shut down and that I mustn't let the battery die on me while using the computer? Sounds reasonable.
Yep, something like that. For a laptop with a good battery I think that's reasonable; for a desktop box I wouldn't feel very comfortable without power-loss protection.
Any brands/models you'd recommend?
Well, I can't remember people reporting problems with Intel here. But they are pricey, OTOH.
Personally, I only own three old Intel 320s. All are OK after 10k hours and, in one case, 4k power losses caused by a bad cable.
I have no direct experience with current production stuff but it seems that people are still bitten by bugs.
Last edited by mich41 (2016-01-28 11:27:56)
Let's try not to get this thread TGNed [1]. The OP should do their own research and decide what to buy. Google is your friend, and the right search words will probably tell you which brands to avoid even if their SSDs look tempting from a performance/price point of view.
That said, you might want to take a look at the kernel source code, specifically drivers/ata/libata-core.c around line 4222, as it might help guide your buying decision.
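That spot in libata-core.c is the ata_device_blacklist table; entries flagged with ATA_HORKAGE_NO_NCQ_TRIM are drives whose queued-TRIM implementation is known broken. A sketch of how you'd list them, run here against a saved excerpt (the two entries are from memory of the mainline table, so treat them as illustrative; in a real kernel tree you'd grep drivers/ata/libata-core.c directly):

```shell
# Illustrative excerpt of the blacklist table (entries from memory,
# not copied from a specific kernel version).
cat > /tmp/libata-excerpt.c <<'EOF'
{ "Micron_M500_*",    NULL, ATA_HORKAGE_NO_NCQ_TRIM | ATA_HORKAGE_ZERO_AFTER_TRIM },
{ "Crucial_CT*M500*", NULL, ATA_HORKAGE_NO_NCQ_TRIM | ATA_HORKAGE_ZERO_AFTER_TRIM },
EOF
grep -c 'ATA_HORKAGE_NO_NCQ_TRIM' /tmp/libata-excerpt.c    # 2 flagged models
```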
R00KIE