External USB 3 HDD becomes unresponsive intermittently

BeholdMyGlory · 2016-02-28 05:07:01

Hi,

I've been having issues for a while with my external HDD randomly becoming unresponsive. When it happens, the LEDs on the HDD case indicating disk activity stop blinking, and any application reading from or writing to the HDD will enter disk sleep. Usually the I/O will resume after a while (less than a minute), and the frozen applications will start working again; however, it has gotten worse recently, to the extent that I sometimes have to restart the external HDD.

The external HDD consists of two Seagate ST3000DM001 (3TB, 64MB cache, 7200RPM) disks running in RAID0, installed in an Icy Box IB-RD3219StU3-B case. I basically got the cheapest parts I could find, so I wouldn't be surprised if either the HDDs or the case have given out, but it would still be helpful to pinpoint where the problem is.

When unresponsive, even programs like hdparm and smartctl will freeze while trying to read from the HDD. I've tried disabling Advanced Power Management using hdparm as suggested in this thread, to no avail. However, running e2fsck (while the HDD is responsive) returns no errors, and I've had no issues using the HDD while responsive, so I'm relatively confident that the data is still intact.

Is there a way I can find out what the problem is, and if it lies with the Seagate disks, the case, or perhaps even with my PC's USB 3 controller? Also, would it be safe to take my RAID0 formatted disks and install them in my PC to test them using a software RAID setup?

mich41 · 2016-02-28 08:56:14

Can you post smartctl -A from both disks and dmesg contents when this happens?

BeholdMyGlory · 2016-02-28 09:47:17

Since, due to the RAID0 setup, both disks are presented as a single USB block device, I don't know if there's a way to run smartctl separately on both disks, but here is the output from running it on said block device:

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   099   086   006    Pre-fail  Always       -       19225633
  3 Spin_Up_Time            0x0003   094   093   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       153
  5 Reallocated_Sector_Ct   0x0033   099   099   010    Pre-fail  Always       -       552
  7 Seek_Error_Rate         0x000f   043   041   030    Pre-fail  Always       -       1576259900467
  9 Power_On_Hours          0x0032   094   094   000    Old_age   Always       -       5999
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       140
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always       -       540
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       1 1 2
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   074   036   045    Old_age   Always   In_the_past 26 (47 163 26 25 0)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       139
193 Load_Cycle_Count        0x0032   092   092   000    Old_age   Always       -       17641
194 Temperature_Celsius     0x0022   026   064   000    Old_age   Always       -       26 (0 17 0 0 0)
197 Current_Pending_Sector  0x0012   098   097   000    Old_age   Always       -       472
198 Offline_Uncorrectable   0x0010   098   097   000    Old_age   Offline      -       472
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       2761h+52m+39.103s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       3932536760
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       4609712430

Here's what dmesg outputted when the external HDD became unresponsive (/dev/sdg is the HDD in question):

[  653.545009] sd 13:0:0:0: [sdg] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x07 driverbyte=0x08
[  653.545024] sd 13:0:0:0: [sdg] tag#0 Sense Key : 0x4 [current] 
[  653.545033] sd 13:0:0:0: [sdg] tag#0 ASC=0x0 ASCQ=0x0 
[  653.545041] sd 13:0:0:0: [sdg] tag#0 CDB: opcode=0x28 28 00 31 31 28 c0 00 00 1e 00
[  653.545048] blk_update_request: I/O error, dev sdg, sector 6602442240

After a while the HDD became responsive again, and I got the following output from running dmesg (including the output above for context; I don't know what was printed exactly when relative to when the HDD became responsive again):

[  653.545009] sd 13:0:0:0: [sdg] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x07 driverbyte=0x08
[  653.545024] sd 13:0:0:0: [sdg] tag#0 Sense Key : 0x4 [current] 
[  653.545033] sd 13:0:0:0: [sdg] tag#0 ASC=0x0 ASCQ=0x0 
[  653.545041] sd 13:0:0:0: [sdg] tag#0 CDB: opcode=0x28 28 00 31 31 28 c0 00 00 1e 00
[  653.545048] blk_update_request: I/O error, dev sdg, sector 6602442240
[  684.724398] usb 3-2: reset SuperSpeed USB device number 2 using xhci_hcd
[  700.156455] sd 13:0:0:0: [sdg] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x07 driverbyte=0x08
[  700.156463] sd 13:0:0:0: [sdg] tag#0 Sense Key : 0x4 [current] 
[  700.156467] sd 13:0:0:0: [sdg] tag#0 ASC=0x0 ASCQ=0x0 
[  700.156472] sd 13:0:0:0: [sdg] tag#0 CDB: opcode=0x28 28 00 31 31 28 e0 00 00 1e 00
[  700.156476] blk_update_request: I/O error, dev sdg, sector 6602442496
[  730.807830] usb 3-2: reset SuperSpeed USB device number 2 using xhci_hcd
[  761.733445] usb 3-2: reset SuperSpeed USB device number 2 using xhci_hcd
[  792.785810] usb 3-2: reset SuperSpeed USB device number 2 using xhci_hcd
[  823.711366] usb 3-2: reset SuperSpeed USB device number 2 using xhci_hcd
[  854.763631] usb 3-2: reset SuperSpeed USB device number 2 using xhci_hcd
[  885.795953] usb 3-2: reset SuperSpeed USB device number 2 using xhci_hcd
[  885.917332] sd 13:0:0:0: [sdg] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x05 driverbyte=0x00
[  885.917346] sd 13:0:0:0: [sdg] tag#0 CDB: opcode=0x28 28 00 31 31 28 fe 00 00 02 00
[  885.917354] blk_update_request: I/O error, dev sdg, sector 6602442736

I've also noticed that for every process waiting in disk sleep, one CPU core seems to be running at 100%.
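
The CDB lines in the dmesg output above can be decoded by hand. A minimal sketch, assuming these are standard SCSI READ(10) commands (opcode 0x28), where bytes 2-5 hold the big-endian logical block address and bytes 7-8 the transfer length in blocks. The helper name `decode_read10` is made up for illustration:

```python
def decode_read10(cdb_hex):
    """Decode a SCSI READ(10) CDB given as space-separated hex bytes.

    Returns (lba, block_count). Hypothetical helper, not part of any tool.
    """
    cdb = bytes.fromhex(cdb_hex.replace(" ", ""))
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a READ(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")    # bytes 2-5: logical block address
    count = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length in blocks
    return lba, count


# CDBs and kernel sector numbers copied from the dmesg output above.
samples = [
    ("28 00 31 31 28 c0 00 00 1e 00", 6602442240),
    ("28 00 31 31 28 e0 00 00 1e 00", 6602442496),
    ("28 00 31 31 28 fe 00 00 02 00", 6602442736),
]
for cdb_hex, kernel_sector in samples:
    lba, count = decode_read10(cdb_hex)
    # The kernel reports sectors in 512-byte units, so the ratio hints at
    # the logical block size the enclosure presents.
    print(lba, count, kernel_sector / lba)
```

For all three failed reads the kernel sector number is exactly 8 times the decoded LBA, which is consistent with the enclosure exposing 4096-byte logical blocks (an inference from the ratio, not something the log states). The three reads also fall within a few dozen blocks of each other, so they point at one small region of the array.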

mich41 · 2016-02-28 11:36:25

I have no idea how this enclosure handles SMART with two disks, but since Reallocated_Sector_Ct, Reported_Uncorrect, Current_Pending_Sector and Offline_Uncorrectable report non-zero raw values and below-100 normalized values, it seems that at least one of the disks has bad sectors. Also, the airflow temperature attribute shows "In_the_past" under WHEN_FAILED, so the disks may have overheated. The "worst" value of 36 suggests an actual temperature of 64°C, which is way too high.
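
The check described above can be sketched in a few lines of Python. This is only an illustration: it assumes the column layout of the `smartctl -A` table posted earlier, and the names `SECTOR_ATTRS` and `flag_attributes` are made up here.

```python
SECTOR_ATTRS = {
    "Reallocated_Sector_Ct",
    "Reported_Uncorrect",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
}


def flag_attributes(smart_text):
    """Return (name, reason) pairs for rows of `smartctl -A` output worth a look."""
    flagged = []
    for line in smart_text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name, when_failed, raw = fields[1], fields[8], fields[9]
        if when_failed != "-":
            flagged.append((name, "WHEN_FAILED=" + when_failed))
        elif name in SECTOR_ATTRS and raw.isdigit() and int(raw) > 0:
            flagged.append((name, "raw=" + raw))
    return flagged


# A few rows copied from the output posted above.
sample = """\
  5 Reallocated_Sector_Ct   0x0033   099   099   010    Pre-fail  Always       -       552
187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always       -       540
190 Airflow_Temperature_Cel 0x0022   074   036   045    Old_age   Always   In_the_past 26 (47 163 26 25 0)
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
"""
for name, reason in flag_attributes(sample):
    print(name, reason)
```

Only the raw values of the four sector-related attributes are compared against zero, because the raw values of attributes such as Raw_Read_Error_Rate and Seek_Error_Rate are large on these disks even where nothing else looks wrong.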

Test those disks one by one on normal SATA ports and see what's going on. Don't attempt to mount or fsck the disks individually: it can't work, because each disk holds only part of the RAID0 data, and you may get data corruption if you use too many "-f" or "-y" switches.

It's unlikely that you'll be able to assemble the array outside of the enclosure, because RAID implementations (HW and SW alike) usually put implementation-specific metadata on the disks.

Last edited by mich41 (2016-02-28 11:37:07)

BeholdMyGlory · 2016-02-28 12:08:10

When you say to test the disks one by one, do you mean simply running smartctl -A on them individually? Is there something else I should run while I'm at it to diagnose the problem without risking data loss?

mich41 · 2016-02-28 12:23:58

Just run SMART on them to determine the physical condition of both disks.

Or, if you want to play it really safe, open the enclosure as much as possible to provide good airflow and back up the data somewhere (it may freeze for a moment a few times, because a few kB seem to be unreadable). Then run SMART on native SATA ports.

BeholdMyGlory · 2016-02-29 06:13:25

OK, I went ahead and opened up my computer and plugged in the disks, and this is the output I got from running smartctl -A on them:

First disk:

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   110   099   006    Pre-fail  Always       -       231847032
  3 Spin_Up_Time            0x0003   094   093   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       146
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   042   041   030    Pre-fail  Always       -       1687928784695
  9 Power_On_Hours          0x0032   094   094   000    Old_age   Always       -       5952
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       142
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   076   076   099    Old_age   Always   FAILING_NOW 24
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0 0 0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   075   035   045    Old_age   Always   In_the_past 25 (53 13 25 23 0)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       142
193 Load_Cycle_Count        0x0032   092   092   000    Old_age   Always       -       17593
194 Temperature_Celsius     0x0022   025   065   000    Old_age   Always       -       25 (0 18 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       2810h+56m+58.665s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       3930902920
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       4552233155

Second disk:

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   095   086   006    Pre-fail  Always       -       27343731
  3 Spin_Up_Time            0x0003   094   093   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       156
  5 Reallocated_Sector_Ct   0x0033   099   099   010    Pre-fail  Always       -       552
  7 Seek_Error_Rate         0x000f   043   041   030    Pre-fail  Always       -       1576259901197
  9 Power_On_Hours          0x0032   094   094   000    Old_age   Always       -       5999
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       143
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always       -       558
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       1 1 2
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   076   036   045    Old_age   Always   In_the_past 24 (47 163 24 23 0)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       142
193 Load_Cycle_Count        0x0032   092   092   000    Old_age   Always       -       17646
194 Temperature_Celsius     0x0022   024   064   000    Old_age   Always       -       24 (0 17 0 0 0)
197 Current_Pending_Sector  0x0012   098   097   000    Old_age   Always       -       456
198 Offline_Uncorrectable   0x0010   098   097   000    Old_age   Offline      -       456
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       2761h+54m+45.120s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       3932536792
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       4616548651

The second disk seems to have non-zero values for the attributes you mentioned, so is it safe to assume that this is the disk that's failing? The output for the first disk has "FAILING_NOW" for one of the attributes though, which sounds worrying. I also noticed that the second disk's PCB seems to be a bit discoloured: http://i.imgur.com/mxArtLG.jpg (sorry for the blurry image).

mich41 · 2016-02-29 10:29:02

OK, so both have been cooked 10 degrees past the limit and both had errors.
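
The "past the limit" figure can be reproduced from the Airflow_Temperature_Cel rows. A small sketch, assuming that the normalized value of attribute 190 is 100 minus the temperature in °C on these drives. That convention is an assumption, although it agrees with the worst readings of 065 and 064 recorded under attribute 194 above:

```python
def airflow_celsius(normalized):
    """Convert a normalized attribute 190 value to degrees Celsius.

    Assumes normalized = 100 - temperature, which is not guaranteed
    for every drive model.
    """
    return 100 - normalized


# (WORST, THRESH) of Airflow_Temperature_Cel, taken from the two outputs above.
disks = {"first": (35, 45), "second": (36, 45)}
for label, (worst, thresh) in disks.items():
    peak = airflow_celsius(worst)
    limit = airflow_celsius(thresh)
    print(f"{label} disk: peak {peak} C, limit {limit} C, {peak - limit} C over")
```

Under that assumption the threshold of 45 corresponds to 55°C, and the recorded peaks are 65°C and 64°C.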

First of all, improve cooling in this enclosure. Then back up the data, if you have spare storage available. Then test the hell out of them (the damage may be permanent) or replace them straight away.

End-to-End_Error seems to be a cache RAM error. I think it's possible to get DRAM errors at high temperature without permanent hardware damage, so this disk may possibly be OK. But bad sectors - I haven't yet seen a disk which recovered from 500 of those.

BeholdMyGlory · 2016-02-29 12:41:40

Yeah, I'm not really expecting to be able to continue using the disks. Any particular backup strategy you would recommend? Just plug them into the enclosure, mount, run rsync and hope that it completes? Or perhaps I could try to use dd to write the contents of an individual disk to another disk of the same size and hope that RAID0 still works as-is? Searching around a bit I'm also seeing mentions of a program called ddrescue, might that be appropriate to use here?

Also, you wouldn't happen to know if there's a way to find out what files on the file system are affected by the bad sectors? I assume the RAID setup might complicate things a bit though.
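
To illustrate why the RAID setup complicates this: a bad sector is reported at an offset on one disk, while the file system only knows offsets within the whole array. The sketch below shows the translation for plain round-robin striping. The chunk size, the absence of a metadata area and the function names are all assumptions for illustration; the layout this enclosure actually uses is not known from this thread.

```python
def member_to_array_offset(disk_index, disk_offset, chunk_size, n_disks=2):
    """Map a byte offset on one RAID0 member to a byte offset in the array.

    Assumes plain round-robin striping starting at offset 0 on every member,
    with no metadata area. chunk_size and the layout are guesses here.
    """
    stripe, within = divmod(disk_offset, chunk_size)
    return (stripe * n_disks + disk_index) * chunk_size + within


def array_to_member_offset(array_offset, chunk_size, n_disks=2):
    """Inverse mapping: array byte offset -> (disk_index, disk_offset)."""
    chunk, within = divmod(array_offset, chunk_size)
    stripe, disk_index = divmod(chunk, n_disks)
    return disk_index, stripe * chunk_size + within


# Hypothetical 64 KiB chunks: byte 100 of the second disk's second chunk.
offset = member_to_array_offset(1, 65536 + 100, 65536)
print(offset, array_to_member_offset(offset, 65536))
```

Dividing the resulting array offset by the file system block size gives a block number, which debugfs (icheck, then ncheck) can map to a file name on ext file systems, provided the guessed layout is right.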

Edit: One more thing: The way the external HDD becomes completely unresponsive at times, is this common when dealing with bad sectors, or might it be an issue with the way this specific enclosure handles pending reads when waiting on bad sectors? Might it be better to do backups without using the enclosure to avoid this issue?

Last edited by BeholdMyGlory (2016-02-29 12:49:46)

severach · 2016-03-03 06:14:34

ddrescue is expressly designed for bad disks. It even handles disks where the errors are so bad that the drive crashes and must be repowered. I'd first ddrescue to a same-size drive, then copy the files from the bad drive to a 3rd drive. Copying the files will tell you which of the files are damaged. Don't operate the drive until you're ready to recover. Do ddrescue first, since damaged drives vary from "will limp along for a long time" to "you've only a few hours left before the drive is completely dead." You don't find out which until after it dies.
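
As a rough sketch of what such a run can look like, the snippet below only builds the command lines for the two-pass approach shown in the GNU ddrescue manual (a fast first pass that skips scraping, then a retry pass with direct access). It does not execute anything, and the device names, mapfile name and function name are placeholders:

```python
import shlex


def ddrescue_passes(source, target, mapfile, retries=3):
    """Build argv lists for a two-pass GNU ddrescue run (nothing is executed)."""
    # Pass 1: copy everything that reads easily; -n skips the scraping phase.
    first = ["ddrescue", "-f", "-n", source, target, mapfile]
    # Pass 2: direct access to the input (-d), retry bad areas `retries` times.
    second = ["ddrescue", "-d", "-f", f"-r{retries}", source, target, mapfile]
    return [first, second]


# /dev/sdX, /dev/sdY and disk2.map are placeholders, not real device names.
for argv in ddrescue_passes("/dev/sdX", "/dev/sdY", "disk2.map"):
    print(shlex.join(argv))
```

The -f switch is what allows ddrescue to write to a block device, and it overwrites the target, so source and target deserve a double check. Both passes share one mapfile, which is how the second pass knows what the first one could not read.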

Unfortunately all my drive rescues have been ddrescue on Windows drives. I don't know how the Linux copy tools work under bad conditions.

So far I haven't found a USB to SATA controller that handles bad sectors well. The USB chips add their own error retries which rapidly multiply into intolerable wait times.

Unfortunately many modern external USB drives have saved a few pennies by doing away with the SATA interface. The board on the drive only has a USB connector. Data becomes unrecoverable with just a few bad sectors on these drives. I don't purchase branded external USB drives any more. You get a shorter warranty and you can't pull the drive to recover it, both because it would void your warranty and because there's no SATA plug to connect it to a controller that handles bad sectors well. I'd rather have a drive with a longer warranty and a case I can open.

BeholdMyGlory · 2016-03-03 06:46:14

So if I'm getting you right, you're saying I should use ddrescue for the actual backup, and then copy the files simply to find out which files are affected? Would it be possible in that case to do the copying to /dev/null instead, and just keep track of the resulting errors?

mich41 · 2016-03-03 10:30:48

cp -a /media/bad-disk /media/good-disk 2>&1 |tee /media/good-disk/error-log

This was always good enough for me. It copies all fully readable files and prints errors on broken ones. If and only if the error log indicates that some really important files had I/O errors, then you can try to recover as much as possible from those particular files with some tricky dd invocations. Or read them from backup, if possible.
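
The read-only variant asked about earlier (reading everything as if copying to /dev/null and only keeping track of the errors) could look roughly like this Python sketch. The function name `find_unreadable` and the mount point are made up for illustration:

```python
import os


def find_unreadable(root, chunk_size=1 << 20):
    """Read every regular file under root and return (path, error) pairs."""
    failures = []

    def on_walk_error(exc):
        # Directories that cannot be listed are reported as well.
        failures.append((exc.filename, str(exc)))

    for dirpath, _dirnames, filenames in os.walk(root, onerror=on_walk_error):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue  # skip symlinks, sockets, FIFOs and device nodes
            try:
                with open(path, "rb") as handle:
                    while handle.read(chunk_size):
                        pass  # data is discarded, as with a copy to /dev/null
            except OSError as exc:
                failures.append((path, str(exc)))
    return failures


if __name__ == "__main__":
    # /media/bad-disk is the mount point used in the cp example above.
    for path, error in find_unreadable("/media/bad-disk"):
        print(path, error)
```

This only reports files that the kernel fails to read. It copies nothing, so it is a way to get the list of damaged files, not a backup.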

severach · 2016-03-04 08:35:19

It all depends on the value of the data and how much you want to spend on spare drives. What I described is the optimal sequence; take shortcuts to save cost as necessary.

mich41 · 2016-03-04 10:19:24

Is it possible to recover files from a ddrescue image without silently getting zeros in place of bad sectors?
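
One angle on this, as far as I understand it: ddrescue records in its mapfile which ranges it could not read, so the holes in an image are at least known even if the image itself gives no hint. A minimal sketch that lists them, assuming the mapfile layout described in the GNU ddrescue manual ('#' comment lines, one status line, then "pos size status" rows where "+" means the data was read). The function name and the sample contents are made up:

```python
def unread_ranges(mapfile_text):
    """Return (start, size, status) for every block ddrescue did not finish.

    Assumes '#' comment lines, one status line, then 'pos size status' rows
    in which '+' marks successfully read data.
    """
    rows = [ln.split() for ln in mapfile_text.splitlines()
            if ln.strip() and not ln.lstrip().startswith("#")]
    ranges = []
    for fields in rows[1:]:  # rows[0] is the current-position status line
        if len(fields) < 3:
            continue
        pos, size, status = int(fields[0], 0), int(fields[1], 0), fields[2]
        if status != "+":
            ranges.append((pos, size, status))
    return ranges


# Made-up mapfile contents, for illustration only.
example = """\
# Mapfile. Created by GNU ddrescue
# current_pos  current_status
0x00120000     +
#      pos        size  status
0x00000000  0x00117000  +
0x00117000  0x00000200  -
0x00117200  0x00008E00  +
"""
for start, size, status in unread_ranges(example):
    print(hex(start), size, status)
```

The byte ranges printed this way are offsets into the image, so they still have to be mapped to files before one can tell which copies contain filler instead of data.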

BeholdMyGlory · 2016-03-04 15:39:47

Thank you both for all your help. I think what I'll end up doing is first trying to do a file-level copy to a new pair of disks; if that doesn't work (due to insufferable wait times or whatever), I'll attempt a block-level copy one disk at a time using ddrescue, and see what I can do from there. With some luck one of my old disks is still usable, in which case I might be able to plug my new disks into my old enclosure with the RAID setup hopefully working as-is, and back up the files to one of the old disks as well as to one of my internal disks, after which I can redo the RAID setup with a new enclosure.

Arch Linux
