Hi,
I've been having issues for a while with my external HDD randomly becoming unresponsive. When it happens, the LEDs on the HDD case indicating disk activity stop blinking, and any application reading from or writing to the HDD will enter disk sleep. Usually the I/O will resume after a while (less than a minute), and the frozen applications will start working again; however, it has gotten worse recently, to the extent that I sometimes have to restart the external HDD.
The external HDD consists of two Seagate ST3000DM001 (3TB, 64MB cache, 7200RPM) disks running in RAID0, installed in an Icy Box IB-RD3219StU3-B case. I basically got the cheapest parts I could find, so I wouldn't be surprised if either the HDDs or the case have given out, but it would still be helpful to pinpoint where the problem is.
When unresponsive, even programs like hdparm and smartctl will freeze while trying to read from the HDD. I've tried disabling Advanced Power Management using hdparm as suggested in this thread, to no avail. However, running e2fsck (while the HDD is responsive) returns no errors, and I've had no issues using the HDD while responsive, so I'm relatively confident that the data is still intact.
Is there a way I can find out what the problem is, and if it lies with the Seagate disks, the case, or perhaps even with my PC's USB 3 controller? Also, would it be safe to take my RAID0 formatted disks and install them in my PC to test them using a software RAID setup?
Offline
Can you post smartctl -A from both disks and dmesg contents when this happens?
Offline
Since, due to the RAID0 setup, both disks are presented as a single USB block device, I don't know if there's a way to run smartctl separately on both disks, but here is the output from running it on said block device:
=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 099 086 006 Pre-fail Always - 19225633
3 Spin_Up_Time 0x0003 094 093 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 153
5 Reallocated_Sector_Ct 0x0033 099 099 010 Pre-fail Always - 552
7 Seek_Error_Rate 0x000f 043 041 030 Pre-fail Always - 1576259900467
9 Power_On_Hours 0x0032 094 094 000 Old_age Always - 5999
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 140
183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 0
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 001 001 000 Old_age Always - 540
188 Command_Timeout 0x0032 100 100 000 Old_age Always - 1 1 2
189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 074 036 045 Old_age Always In_the_past 26 (47 163 26 25 0)
191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 139
193 Load_Cycle_Count 0x0032 092 092 000 Old_age Always - 17641
194 Temperature_Celsius 0x0022 026 064 000 Old_age Always - 26 (0 17 0 0 0)
197 Current_Pending_Sector 0x0012 098 097 000 Old_age Always - 472
198 Offline_Uncorrectable 0x0010 098 097 000 Old_age Offline - 472
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 2761h+52m+39.103s
241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 3932536760
242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 4609712430
Here's what dmesg outputted when the external HDD became unresponsive (/dev/sdg is the HDD in question):
[ 653.545009] sd 13:0:0:0: [sdg] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x07 driverbyte=0x08
[ 653.545024] sd 13:0:0:0: [sdg] tag#0 Sense Key : 0x4 [current]
[ 653.545033] sd 13:0:0:0: [sdg] tag#0 ASC=0x0 ASCQ=0x0
[ 653.545041] sd 13:0:0:0: [sdg] tag#0 CDB: opcode=0x28 28 00 31 31 28 c0 00 00 1e 00
[ 653.545048] blk_update_request: I/O error, dev sdg, sector 6602442240
After a while the HDD became responsive again, and I got the following output from running dmesg (including the output above for context; I don't know exactly which lines were printed before the HDD became responsive again and which after):
[ 653.545009] sd 13:0:0:0: [sdg] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x07 driverbyte=0x08
[ 653.545024] sd 13:0:0:0: [sdg] tag#0 Sense Key : 0x4 [current]
[ 653.545033] sd 13:0:0:0: [sdg] tag#0 ASC=0x0 ASCQ=0x0
[ 653.545041] sd 13:0:0:0: [sdg] tag#0 CDB: opcode=0x28 28 00 31 31 28 c0 00 00 1e 00
[ 653.545048] blk_update_request: I/O error, dev sdg, sector 6602442240
[ 684.724398] usb 3-2: reset SuperSpeed USB device number 2 using xhci_hcd
[ 700.156455] sd 13:0:0:0: [sdg] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x07 driverbyte=0x08
[ 700.156463] sd 13:0:0:0: [sdg] tag#0 Sense Key : 0x4 [current]
[ 700.156467] sd 13:0:0:0: [sdg] tag#0 ASC=0x0 ASCQ=0x0
[ 700.156472] sd 13:0:0:0: [sdg] tag#0 CDB: opcode=0x28 28 00 31 31 28 e0 00 00 1e 00
[ 700.156476] blk_update_request: I/O error, dev sdg, sector 6602442496
[ 730.807830] usb 3-2: reset SuperSpeed USB device number 2 using xhci_hcd
[ 761.733445] usb 3-2: reset SuperSpeed USB device number 2 using xhci_hcd
[ 792.785810] usb 3-2: reset SuperSpeed USB device number 2 using xhci_hcd
[ 823.711366] usb 3-2: reset SuperSpeed USB device number 2 using xhci_hcd
[ 854.763631] usb 3-2: reset SuperSpeed USB device number 2 using xhci_hcd
[ 885.795953] usb 3-2: reset SuperSpeed USB device number 2 using xhci_hcd
[ 885.917332] sd 13:0:0:0: [sdg] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x05 driverbyte=0x00
[ 885.917346] sd 13:0:0:0: [sdg] tag#0 CDB: opcode=0x28 28 00 31 31 28 fe 00 00 02 00
[ 885.917354] blk_update_request: I/O error, dev sdg, sector 6602442736
I've also noticed that for every process waiting in disk sleep, one CPU core seems to be running at 100%.
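The CDB lines in the dmesg output can be decoded to see where on the device the failing reads landed. The following is a minimal sketch, not a diagnostic tool: opcode 0x28 is the SCSI READ(10) command, which carries a 4-byte logical block address and a 2-byte transfer length. The 4096-byte logical block size is an assumption, chosen because it makes the decoded addresses match the 512-byte sector numbers the kernel printed; the function names are made up for illustration.

```python
# Decode the READ(10) command blocks copied from the dmesg output.
# Assumption: the enclosure presents 4096-byte logical blocks, while the
# "sector" in the kernel's I/O error lines is counted in 512-byte units.

LOGICAL_BLOCK_SIZE = 4096  # assumed, inferred from the numbers in the log


def decode_read10(cdb_hex):
    """Return (lba, block_count) for a READ(10) CDB given as hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a READ(10) command block")
    lba = int.from_bytes(cdb[2:6], "big")    # bytes 2-5: logical block address
    count = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length
    return lba, count


def kernel_sector(lba):
    """Convert a logical block address to the kernel's 512-byte sector number."""
    return lba * LOGICAL_BLOCK_SIZE // 512


failed_reads = [
    "28 00 31 31 28 c0 00 00 1e 00",
    "28 00 31 31 28 e0 00 00 1e 00",
    "28 00 31 31 28 fe 00 00 02 00",
]

for cdb in failed_reads:
    lba, count = decode_read10(cdb)
    print(f"LBA {lba}, {count} blocks, kernel sector {kernel_sector(lba)}")
```

Under that assumption, all three failed reads fall within a span of 64 logical blocks, which would be consistent with a small cluster of unreadable sectors rather than failures spread across the device.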
Offline
I have no idea how this enclosure handles SMART with two disks, but since Reallocated_Sector_Ct, Reported_Uncorrect, Current_Pending_Sector and Offline_Uncorrectable report non-zero raw values and below-100 normalized values, it seems that at least one of the disks is having bad sector issues. Also, the temperature attribute says it failed "In_the_past", so the disks may have overheated. The normalized value appears to be 100 minus the temperature (current value 74 for a raw reading of 26°C), so the "worst" value of 36 suggests an actual peak temperature of 64°C, which is way too high.
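The checks described above can be sketched in a few lines. This is only an illustration under stated assumptions: the sample rows are copied from the smartctl -A output posted earlier, the column layout is assumed to be whitespace-separated as shown there, and the "100 minus temperature" reading of attribute 190 is an inference from the posted values, not a documented rule for every drive. The function names are made up.

```python
# Flag bad-sector attributes and estimate peak temperature from a few
# rows of the smartctl -A output posted above.

SAMPLE = """\
  5 Reallocated_Sector_Ct 0x0033 099 099 010 Pre-fail Always - 552
187 Reported_Uncorrect 0x0032 001 001 000 Old_age Always - 540
190 Airflow_Temperature_Cel 0x0022 074 036 045 Old_age Always In_the_past 26 (47 163 26 25 0)
197 Current_Pending_Sector 0x0012 098 097 000 Old_age Always - 472
198 Offline_Uncorrectable 0x0010 098 097 000 Old_age Offline - 472
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
"""

# Attribute IDs named in the post as bad-sector indicators.
BAD_SECTOR_IDS = (5, 187, 197, 198)


def parse_attributes(text):
    """Map attribute ID to its normalized values and raw string."""
    rows = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 10 or not fields[0].isdigit():
            continue  # skip headers and anything that is not an attribute row
        rows[int(fields[0])] = {
            "name": fields[1],
            "value": int(fields[3]),
            "worst": int(fields[4]),
            "thresh": int(fields[5]),
            "raw": " ".join(fields[9:]),  # raw may contain several tokens
        }
    return rows


def bad_sector_counts(rows):
    """Return {attribute name: raw count} for non-zero bad-sector attributes."""
    counts = {}
    for attr_id in BAD_SECTOR_IDS:
        if attr_id in rows:
            raw = int(rows[attr_id]["raw"].split()[0])
            if raw > 0:
                counts[rows[attr_id]["name"]] = raw
    return counts


def peak_and_limit_celsius(rows):
    """Assumes attribute 190 is normalized as 100 minus degrees Celsius."""
    airflow = rows[190]
    return 100 - airflow["worst"], 100 - airflow["thresh"]


rows = parse_attributes(SAMPLE)
print(bad_sector_counts(rows))
print(peak_and_limit_celsius(rows))
```

With the posted values this gives a peak of 64°C against a limit of 55°C.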
Test those disks one by one on normal SATA ports and see what's going on. Don't attempt to mount or fsck single disks, because it obviously can't work and you may get data corruption if you use too many "-f" or "-y" switches.
It's unlikely that you'll be able to assemble the array outside of the enclosure, because RAID implementations (HW and SW alike) usually put implementation-specific metadata on the disks.
Last edited by mich41 (2016-02-28 11:37:07)
Offline
When you say to test the disks one by one, do you mean simply running smartctl -A on them individually? Is there something else I should run while I'm at it to diagnose the problem without risking data loss?
Offline
Just run SMART on them to determine the physical condition of both disks.
Or, if you want to play it really safe, open the enclosure as much as possible to provide good airflow and back up the data somewhere (it may freeze for a moment a few times, because a few kB seem to be unreadable). Then run SMART on native SATA ports.
Offline
OK, I went ahead and opened up my computer and plugged in the disks, and this is the output I got from running smartctl -A on them:
First disk:
=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 110 099 006 Pre-fail Always - 231847032
3 Spin_Up_Time 0x0003 094 093 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 146
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 042 041 030 Pre-fail Always - 1687928784695
9 Power_On_Hours 0x0032 094 094 000 Old_age Always - 5952
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 142
183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 0
184 End-to-End_Error 0x0032 076 076 099 Old_age Always FAILING_NOW 24
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0 0 0
189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 075 035 045 Old_age Always In_the_past 25 (53 13 25 23 0)
191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 142
193 Load_Cycle_Count 0x0032 092 092 000 Old_age Always - 17593
194 Temperature_Celsius 0x0022 025 065 000 Old_age Always - 25 (0 18 0 0 0)
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 2810h+56m+58.665s
241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 3930902920
242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 4552233155
Second disk:
=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 095 086 006 Pre-fail Always - 27343731
3 Spin_Up_Time 0x0003 094 093 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 156
5 Reallocated_Sector_Ct 0x0033 099 099 010 Pre-fail Always - 552
7 Seek_Error_Rate 0x000f 043 041 030 Pre-fail Always - 1576259901197
9 Power_On_Hours 0x0032 094 094 000 Old_age Always - 5999
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 143
183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 0
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 001 001 000 Old_age Always - 558
188 Command_Timeout 0x0032 100 100 000 Old_age Always - 1 1 2
189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 076 036 045 Old_age Always In_the_past 24 (47 163 24 23 0)
191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 142
193 Load_Cycle_Count 0x0032 092 092 000 Old_age Always - 17646
194 Temperature_Celsius 0x0022 024 064 000 Old_age Always - 24 (0 17 0 0 0)
197 Current_Pending_Sector 0x0012 098 097 000 Old_age Always - 456
198 Offline_Uncorrectable 0x0010 098 097 000 Old_age Offline - 456
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 2761h+54m+45.120s
241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 3932536792
242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 4616548651
The second disk seems to have non-zero values for the attributes you mentioned, so is it safe to assume that this is the disk that's failing? The output for the first disk has "FAILING_NOW" for one of the attributes though, which sounds worrying. I also noticed that the second disk's PCB seems to be a bit discoloured: http://i.imgur.com/mxArtLG.jpg (sorry for the blurry image).
Offline
OK, so both have been cooked 10 degrees past the limit and both had errors.
First of all, improve cooling in this enclosure. Then back up the data, if you have spare storage available. Then test the hell out of them (the damage may be permanent) or replace them straight away.
End-to-End_Error seems to be a cache RAM error. I think it's possible to get DRAM errors at high temperature without permanent hardware damage, so this disk may possibly be OK. But bad sectors are another matter: I haven't yet seen a disk that recovered from 500 of those.
Offline
Yeah, I'm not really expecting to be able to continue using the disks. Any particular backup strategy you would recommend? Just plug them into the enclosure, mount, run rsync and hope that it completes? Or perhaps I could try to use dd to write the contents of an individual disk to another disk of the same size and hope that RAID0 still works as-is? Searching around a bit I'm also seeing mentions of a program called ddrescue, might that be appropriate to use here?
Also, you wouldn't happen to know if there's a way to find out what files on the file system are affected by the bad sectors? I assume the RAID setup might complicate things a bit though.
Edit: One more thing: The way the external HDD becomes completely unresponsive at times, is this common when dealing with bad sectors, or might it be an issue with the way this specific enclosure handles pending reads when waiting on bad sectors? Might it be better to do backups without using the enclosure to avoid this issue?
Last edited by BeholdMyGlory (2016-02-29 12:49:46)
Offline
ddrescue is expressly designed for bad disks. It even handles disks where the errors are so bad that the drive crashes and must be repowered. I'd first ddrescue to a same size drive then copy the files from the bad drive to a 3rd drive. Copying the files will tell you which of the files are damaged. Don't operate the drive until you're ready to recover. Do ddrescue first since damaged drives vary from "will limp along for a long time" to "you've only a few hours left before the drive is completely dead." You don't find out which until after it dies.
Unfortunately all my drive rescues have been ddrescue on Windows drives. I don't know how the Linux copy tools work under bad conditions.
So far I haven't found a USB to SATA controller that handles bad sectors well. The USB chips add their own error retries which rapidly multiply into intolerable wait times.
Unfortunately many modern external USB drives have saved a few pennies by doing away with the SATA controller. The board on the drive only has a USB connector. Data becomes unrecoverable with just a few bad sectors on these drives. I don't purchase branded external USB drives any more. You get a shorter warranty and you can't pull the drive to recover it, both because it will void your warranty and there's no plug for a SATA controller that handles bad sectors well. I'd rather have a drive with a longer warranty and a case I can open.
Offline
So if I'm getting you right, you're saying I should use ddrescue for the actual backup, and then copy the files simply to find out which files are affected? Would it be possible in that case to do the copying to /dev/null instead, and just keep track of the resulting errors?
Offline
cp -a /media/bad-disk /media/good-disk 2>&1 |tee /media/good-disk/error-log
This was always good enough for me. It copies all fully readable files and prints errors on broken ones. If and only if the error log indicates that some really important files had I/O errors, then you can try to recover as much as possible from those particular files with some tricky dd invocations. Or read them from backup, if possible.
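For the variant asked about earlier, reading everything and discarding the data just to learn which files are damaged, a minimal sketch could look like the following. It is an illustration, not a tested recovery tool: the function name is made up, and it assumes that an unreadable sector surfaces as an I/O error on read. On this enclosure a read may instead stall for a long time before failing.

```python
import os


def find_unreadable_files(root, chunk_size=1024 * 1024):
    """Read every regular file under root, discard the data, and
    return a list of (path, error message) for anything that failed."""
    failures = []

    def record_walk_error(error):
        # Directories that cannot be listed are recorded as well.
        failures.append((error.filename, str(error)))

    for dirpath, _dirnames, filenames in os.walk(root, onerror=record_walk_error):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue  # skip symlinks, sockets, device nodes
            try:
                with open(path, "rb") as handle:
                    while handle.read(chunk_size):
                        pass  # data is thrown away, only errors matter
            except OSError as exc:
                failures.append((path, str(exc)))
    return failures


if __name__ == "__main__":
    # "/media/bad-disk" is the mount point used in the cp example above.
    for path, message in find_unreadable_files("/media/bad-disk"):
        print(f"{path}: {message}")
```

Unlike the cp approach, this leaves no copy behind, so it puts extra read load on a failing disk without rescuing anything.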
Offline
It all depends on the value of the data and how much you want to spend on spare drives. The sequence I described is the optimal one; take shortcuts to save cost as necessary.
Offline
Is it possible to recover files from a ddrescue image without silently getting zeros in place of bad sectors?
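One possible starting point, offered as a sketch only: if ddrescue is run with a map file, that file records which byte ranges of the source were rescued and which were not, so the unreadable ranges in the image are at least known. The code below assumes the map file layout of GNU ddrescue as I understand it (comment lines starting with "#", one current-status line, then rows of position, size and status, with "+" meaning rescued); check this against the ddrescue manual for your version. The sample text is made up for illustration and is not output from the disks in this thread.

```python
# Hypothetical map file contents, for illustration only.
SAMPLE_MAPFILE = """\
# Mapfile. Created by GNU ddrescue
# current_pos  current_status  current_pass
0x00120000     +               1
#      pos        size  status
0x00000000  0x00117000  +
0x00117000  0x00000200  -
0x00117200  0x00008E00  +
"""


def unrescued_regions(mapfile_text):
    """Return [(offset, size, status), ...] for every block not marked '+'."""
    regions = []
    seen_status_line = False
    for line in mapfile_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if not seen_status_line:
            seen_status_line = True  # first data line is the current status
            continue
        pos, size, status = line.split()[:3]
        if status != "+":
            regions.append((int(pos, 16), int(size, 16), status))
    return regions


for offset, size, status in unrescued_regions(SAMPLE_MAPFILE):
    print(f"offset {offset}, {size} bytes, status {status!r}")
```

Mapping those byte offsets back to file names is a separate problem, and with RAID0 each image covers only one member disk, so the offsets would first have to be translated through the enclosure's striping layout, which is unknown here.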
Offline
Thank you both for all your help. I think what I'll end up doing is first trying to do a file-level copy to a new pair of disks; if that doesn't work (due to insufferable wait times or whatever), I'll attempt a block-level copy one disk at a time using ddrescue, and see what I can do from there. With some luck one of my old disks is still usable, in which case I might be able to plug my new disks into my old enclosure with the RAID setup hopefully working as-is, and back up the files to one of the old disks as well as one of my internal disks, after which I can redo the RAID setup with a new enclosure.
Offline