Issues with multiple hard drives

OrakMoya · 2022-07-30 15:13:36

Hello,

A week or so ago one of my drives (a Toshiba) started randomly becoming read-only, which was a problem since it was part of a two-drive btrfs RAID0 array. I didn't think too much of it since it had a year and 5 months of power-on time. While it was read-only, if I unmounted it the partition would show up as "unknown" in GNOME Disk Utility. I copied the data off it, replaced it with a WD Blue drive and created a new btrfs RAID0 filesystem with the new drive and the still-working one (also a WD Blue). dmesg did report errors when the old drive became read-only:

[  376.157764] ata9.00: status: { DRDY SENSE ERR }
[  376.157766] ata9.00: error: { UNC }
[  376.158840] ata9.00: configured for UDMA/100
[  376.158856] sd 8:0:0:0: [sdc] tag#12 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=8s
[  376.158861] sd 8:0:0:0: [sdc] tag#12 Sense Key : Medium Error [current] 
[  376.158864] sd 8:0:0:0: [sdc] tag#12 Add. Sense: Unrecovered read error - auto reallocate failed
[  376.158868] sd 8:0:0:0: [sdc] tag#12 CDB: Read(10) 28 00 02 c6 8b 80 00 00 20 00
[  376.158870] I/O error, dev sdc, sector 46566272 op 0x0:(READ) flags 0x1000 phys_seg 2 prio class 0
[  376.158879] BTRFS error (device sda: state EA): bdev /dev/sdc errs: wr 0, rd 8, flush 1, corrupt 0, gen 0
[  376.158895] ata9: EH complete
[  384.714411] ata9.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[  384.714416] ata9.00: irq_stat 0x40000001
[  384.714418] ata9.00: failed command: READ DMA
[  384.714419] ata9.00: cmd c8/00:20:e0:8c:c6/00:00:00:00:00/e2 tag 13 dma 16384 in
                        res 53/40:20:e0:8c:c6/00:00:00:00:00/e2 Emask 0x9 (media error)
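
For anyone reading the log above: the CDB line can be decoded by hand to see which blocks the failed read touched. A minimal Python sketch, assuming the standard 10-byte READ(10) layout (opcode 0x28, big-endian LBA in bytes 2-5, transfer length in bytes 7-8); the function name is made up for illustration:

```python
def decode_read10(cdb_hex: str) -> tuple[int, int]:
    """Return (lba, block_count) from a READ(10) CDB given as hex bytes."""
    cdb = bytes.fromhex(cdb_hex)  # whitespace between bytes is skipped
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a 10-byte READ(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2..5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")  # bytes 7..8: transfer length in blocks
    return lba, blocks


# CDB copied from the "tag#12 CDB: Read(10)" line in the log
lba, blocks = decode_read10("28 00 02 c6 8b 80 00 00 20 00")
print(lba, blocks)  # 46566272 32
```

The decoded LBA is the same sector the kernel reports in the "I/O error, dev sdc, sector 46566272" line, so the medium error is tied to that specific 32-block read rather than to the link as a whole.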

While I was copying the data back, tar started complaining that it could not write to the filesystem. The new drive started giving the same errors, and it had less than half an hour of total power-on time. While it was running, you could hear a click and then the drive would become read-only. Unplugging the power cable and plugging it back in while the PC was on would make it show up again, but that's not really great.

At this point I was thinking I had a faulty SATA cable. I got a new one and plugged the old Toshiba drive in. I made a separate btrfs partition on it and it would fail after a while, and so would the new drive. I swapped the SATA cables between the drives, and I swapped the power cables in case one from my PSU was faulty, and those two drives always eventually failed.

Now I'm getting the same error for the third drive, the still-working one in the RAID0 array (22 days of power-on time). At this point I'm thinking there's a problem with my motherboard. I can't have three drives fail in a week, one of which is brand new and another relatively new. So far I've only formatted the drives as btrfs, so in case it's a kernel bug I'm going to try an ext4 partition, but I doubt it's a btrfs bug.

What do you lads think?

Last edited by OrakMoya (2022-07-30 15:17:11)

dimich · 2022-07-30 16:50:55

My guesses: the PSU or the motherboard's electrolytic capacitors.
What does smartctl report for the devices?
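
If it helps, the attribute tables can be collected and filtered in one go. A rough Python sketch, assuming smartmontools 7.0 or newer (for the `-j` JSON output) and ATA drives; the list of watched attribute names is only an assumption about what is worth looking at first, and the helper names are made up:

```python
import json
import subprocess
import sys

# Attribute names often associated with failing media. This set is an
# assumption for illustration, not an exhaustive or vendor-specific list.
WATCHED = {
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "Runtime_Bad_Block",
}


def nonzero_watched(report: dict) -> dict[str, int]:
    """Return watched SMART attributes whose raw value is non-zero."""
    table = report.get("ata_smart_attributes", {}).get("table", [])
    return {
        attr["name"]: attr["raw"]["value"]
        for attr in table
        if attr["name"] in WATCHED and attr["raw"]["value"] != 0
    }


def read_smart(device: str) -> dict:
    """Run `smartctl -A -j DEVICE` (usually needs root) and parse the JSON."""
    # smartctl's exit status is a bitmask and can be non-zero even when the
    # report was produced, so the return code is not treated as an error here.
    proc = subprocess.run(
        ["smartctl", "-A", "-j", device],
        capture_output=True, text=True, check=False,
    )
    return json.loads(proc.stdout)


if __name__ == "__main__":
    # Usage (as root): python smartcheck.py /dev/sda /dev/sdb /dev/sdc
    for dev in sys.argv[1:]:
        print(dev, nonzero_watched(read_smart(dev)))
```

An empty result does not prove a drive is healthy; the full `smartctl -a` output and the error log are still worth posting.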

OrakMoya · 2022-07-30 17:51:47

https://pastebin.com/6XaKgHvT

dimich · 2022-07-30 19:25:00

WD-WCC6Y4JE2S8R is definitely faulty; it has at least 1 bad block. The other two seem good. Were the same "UNC" errors in dmesg for all devices?
I'd suspect power supply quality first.

Last edited by dimich (2022-07-30 19:25:25)

OrakMoya · 2022-07-31 11:05:35

dimich wrote:

WD-WCC6Y4JE2S8R is definitely faulty; it has at least 1 bad block. The other two seem good. Were the same "UNC" errors in dmesg for all devices?
I'd suspect power supply quality first.

But 1 bad sector shouldn't make it fail completely. Hard drives have spare sectors for when they start to go bad, don't they?

And yes, IIRC all of the messages were the same for all the hard drives. It could be the PSU, but there is still one 2TB WD drive that has not failed even once.

OrakMoya · 2022-07-31 12:15:56

I'm starting to think it is the PSU. I ran a Blender render on my GPU and CPU and tried to format one of the drives. dmesg spat out errors instantly, the format hung, and a second drive that was already mounted became read-only. Once I quit Blender after a minute or two, the format finished a couple of seconds later and the drive could be mounted just fine. The other drive stayed read-only.

Here is dmesg with all the errors

dimich · 2022-07-31 14:08:14

OrakMoya wrote:

But 1 bad sector shouldn't make it fail completely. Hard drives have spare sectors for when they start to go bad, don't they?

The disk shouldn't fail completely, but the kernel remounts filesystems read-only to prevent potential data loss from inconsistent filesystem structures. However, the attribute is Runtime_Bad_Block, so maybe it's OK at the physical layer.
As I understand it, reallocations happen after an error during a write operation or after a corrected read error. But I'm not sure.
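
To see which filesystems the kernel currently has mounted read-only, the mount table can be checked directly. A small Python sketch, assuming a Linux system with /proc/mounts in the usual "device mountpoint fstype options dump pass" format; the function name is made up for illustration:

```python
from pathlib import Path


def read_only_mounts(mounts_text: str) -> list[tuple[str, str]]:
    """Return (device, mountpoint) pairs whose mount options include 'ro'."""
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mountpoint, _fstype, options = fields[:4]
        if "ro" in options.split(","):
            result.append((device, mountpoint))
    return result


if __name__ == "__main__":
    mounts = Path("/proc/mounts")
    text = mounts.read_text() if mounts.exists() else ""
    for device, mountpoint in read_only_mounts(text):
        # Only block devices are interesting here; some pseudo filesystems
        # are mounted read-only on purpose.
        if device.startswith("/dev/"):
            print(device, mountpoint)
```

A filesystem that was mounted read-write and shows up in this list after I/O errors has most likely been forced read-only by the kernel; dmesg should say why.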

OrakMoya · 2022-08-03 09:55:44

I unplugged the drive that didn't initially spit out errors, the one that later showed one bad sector, and so far neither of the other two drives has become read-only. It seems it was causing some kind of problem for all the drives while it was plugged in.

That drive is pretty much dead. It showed read speeds of 5 MiB/s and it was clicking.
