Two BTRFS SSDs corrupted in 2 weeks

noxnivi · 2024-09-17 16:43:53

Hello,

I just replaced a WD Green 480GB SSD (80 TBW) that was only a year and a half old (!!!!) with a Kingston DC600M 480GB SSD (876 TBW) on a work computer. The setup with the WD was:
- Arch Linux, up to date, with KDE Plasma
- BTRFS with the default filesystem layout chosen by 'archinstall'
- including 4 GB of swap.

This drive is installed in a machine with an AMD A8-5600K, 16GB RAM and a Radeon 560, and is used only for the OS and some temporary files, PDFs, file scanning... Important data is on other disks and a server.

I have 3 browsers (Firefox, Brave and Chromium) to isolate activities, and Chromium always runs inside firejail.

Last week I had to replace the WD Green because of extreme corruption on the drive that rendered the system unbootable. I haven't had time to attempt to recover data from that drive yet. The SSD had been showing some issues lately: sometimes it wouldn't boot properly and needed a reboot to complete the boot process and get into Plasma. The last time that happened I decided to stop, take it out and replace it ASAP before things got worse. I then got the Kingston and plugged it in last week, on September 11th. I made a clean install, again with BTRFS but this time with no swap.

Today I started to get errors again, as shown in dmesg (the drive, now the Kingston, is on sata3):

[    8.605766] platform regulatory.0: Direct firmware load for regulatory.db failed with error -2
[    8.605774] cfg80211: failed to load regulatory.db
[    8.710984] ata3.00: exception Emask 0x10 SAct 0x100080 SErr 0x2d0100 action 0x6 frozen
[    8.711642] ata3.00: irq_stat 0x08000000, interface fatal error
[    8.712221] ata3: SError: { UnrecovData PHYRdyChg CommWake 10B8B BadCRC }
[    8.712769] ata3.00: failed command: READ FPDMA QUEUED
[    8.713320] ata3.00: cmd 60/00:38:a0:0c:00/06:00:00:00:00/40 tag 7 ncq dma 786432 in
                        res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
[    8.719452] ata3.00: status: { DRDY }
[    8.720193] ata3.00: failed command: READ FPDMA QUEUED
[    8.721040] ata3.00: cmd 60/80:a0:a0:12:00/03:00:00:00:00/40 tag 20 ncq dma 458752 in
                        res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA bus error)
[    8.722827] ata3.00: status: { DRDY }
[    8.723759] ata3: hard resetting link
[    8.990413] ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[    8.990647] ata3.00: configured for UDMA/133
[    8.990697] sd 2:0:0:0: [sdc] tag#7 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
[    8.990701] sd 2:0:0:0: [sdc] tag#7 Sense Key : Illegal Request [current]
[    8.990703] sd 2:0:0:0: [sdc] tag#7 Add. Sense: Unaligned write command
[    8.990706] sd 2:0:0:0: [sdc] tag#7 CDB: Read(10) 28 00 00 00 0c a0 00 06 00 00
[    8.990708] I/O error, dev sdc, sector 3232 op 0x0:(READ) flags 0x80700 phys_seg 84 prio class 0
[    8.993686] sd 2:0:0:0: [sdc] tag#20 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
[    8.993695] sd 2:0:0:0: [sdc] tag#20 Sense Key : Illegal Request [current]
[    8.993697] sd 2:0:0:0: [sdc] tag#20 Add. Sense: Unaligned write command
[    8.993701] sd 2:0:0:0: [sdc] tag#20 CDB: Read(10) 28 00 00 00 12 a0 00 03 80 00
[    8.993704] I/O error, dev sdc, sector 4768 op 0x0:(READ) flags 0x80700 phys_seg 43 prio class 0
[    8.994464] ata3: EH complete

And journalctl (just the last boot, related to ata3):

Sep 17 16:39:15 atlantis kernel: ata3.00: exception Emask 0x10 SAct 0x100080 SErr 0x2d0100 action 0>
Sep 17 16:39:15 atlantis kernel: ata3.00: irq_stat 0x08000000, interface fatal error
Sep 17 16:39:15 atlantis kernel: ata3: SError: { UnrecovData PHYRdyChg CommWake 10B8B BadCRC }
Sep 17 16:39:15 atlantis kernel: ata3.00: failed command: READ FPDMA QUEUED
Sep 17 16:39:15 atlantis kernel: ata3.00: cmd 60/00:38:a0:0c:00/06:00:00:00:00/40 tag 7 ncq dma 786>
                                          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA b>
Sep 17 16:39:15 atlantis kernel: ata3.00: status: { DRDY }
Sep 17 16:39:15 atlantis kernel: ata3.00: failed command: READ FPDMA QUEUED
Sep 17 16:39:15 atlantis kernel: ata3.00: cmd 60/80:a0:a0:12:00/03:00:00:00:00/40 tag 20 ncq dma 45>
                                          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x10 (ATA b>
Sep 17 16:39:15 atlantis kernel: ata3.00: status: { DRDY }
Sep 17 16:39:15 atlantis kernel: ata3: hard resetting link
Sep 17 16:39:15 atlantis systemd[1]: Starting Enable Persistent Storage in systemd-networkd...
Sep 17 16:39:15 atlantis systemd[1]: Found device ST2000LM015-2E8174 Group.
Sep 17 16:39:15 atlantis kernel: ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Sep 17 16:39:15 atlantis kernel: ata3.00: configured for UDMA/133
Sep 17 16:39:15 atlantis kernel: sd 2:0:0:0: [sdc] tag#7 FAILED Result: hostbyte=DID_OK driverbyte=>
Sep 17 16:39:15 atlantis kernel: sd 2:0:0:0: [sdc] tag#7 Sense Key : Illegal Request [current]
Sep 17 16:39:15 atlantis kernel: sd 2:0:0:0: [sdc] tag#7 Add. Sense: Unaligned write command
Sep 17 16:39:15 atlantis kernel: sd 2:0:0:0: [sdc] tag#7 CDB: Read(10) 28 00 00 00 0c a0 00 06 00 00
Sep 17 16:39:15 atlantis kernel: I/O error, dev sdc, sector 3232 op 0x0:(READ) flags 0x80700 phys_s>
Sep 17 16:39:15 atlantis kernel: sd 2:0:0:0: [sdc] tag#20 FAILED Result: hostbyte=DID_OK driverbyte>
Sep 17 16:39:15 atlantis kernel: sd 2:0:0:0: [sdc] tag#20 Sense Key : Illegal Request [current]
Sep 17 16:39:15 atlantis kernel: sd 2:0:0:0: [sdc] tag#20 Add. Sense: Unaligned write command
Sep 17 16:39:15 atlantis kernel: sd 2:0:0:0: [sdc] tag#20 CDB: Read(10) 28 00 00 00 12 a0 00 03 80 >
Sep 17 16:39:15 atlantis kernel: I/O error, dev sdc, sector 4768 op 0x0:(READ) flags 0x80700 phys_s>
Sep 17 16:39:15 atlantis kernel: ata3: EH complete

I understand that the WD is the "el cheapo" drive and, to a certain degree, "prone" to issues. The Kingston, on the other hand, should be fine for the lifetime of this PC, which with Arch is still very capable of all the tasks I need it for.

With the new drive there haven't been any power failures. While running on the WD, several power failures took place: storms took down the office electrical supply, shutting everything down.

I'm puzzled, mostly because I'm starting to think that BTRFS is not the right choice for storage, as there seem to be quite a few reports of BTRFS corruption and "it’s essential to understand its (btrfs) limitations and potential pitfalls, such as unreliable repair tools and the importance of regular backups and hardware monitoring". Also because I have never had a hardware failure other than the internal IDE drive of an old laptop. It's also true that this is the first time I've had an issue with Arch that I can't solve (yet, and without asking). Besides, as I've already read in the forum, if a drive with BTRFS shows corruption symptoms, stop and replace asap. That would mean I need to get yet another drive, ASAP, right after installing the new one, and think about removing BTRFS from every device and using another filesystem.

Enough with the philosophical thoughts. I certainly won't run btrfs check, since the wiki and some other posts describe it as unreliable. Besides, I first have to run ddrescue on the WD to get an image and see what I can recover from it.

So the questions here are:
- What is causing this corruption? I'd like to avoid the cause and prevent issues like this in the future.
- Is there any way to check the level of corruption on the disk, to evaluate whether (and to what extent) it can still be relied on, or do I have to replace the disk again?
- Is this somehow related to SSDs? Are HDDs less prone to this kind of issue, or is it just BTRFS?

I would appreciate any hint on these.
Thanks in advance.
NN

Last edited by noxnivi (2024-09-17 16:44:44)

gromit · 2024-09-17 16:47:22

Which kernel version is this on (uname -a)?

noxnivi · 2024-09-17 17:11:20

Hello @gromit,

uname -a:

Linux atlantis 6.10.10-arch1-1 #1 SMP PREEMPT_DYNAMIC Thu, 12 Sep 2024 17:21:02 +0000 x86_64 GNU/Linux

Forgot to add it to the post.
regards,
NN

Head_on_a_Stick · 2024-09-17 17:16:24

noxnivi wrote:

if a drive with BTRFS shows corruption symptoms, stop and replace asap

I find it highly unlikely that filesystem problems would render the hardware inoperable. Do you have any links to support that assertion?

Schlaefer01 · 2024-09-18 12:53:25

Nothing in these logs points to btrfs. btrfs will probably complain or even go into read-only mode, but what can it do if the underlying hardware errors out?

I would reseat the SATA cables, maybe even change the port and cable. Try limiting the link speed to SATA2 (3.0 Gbps) and see if that helps.

Also, the second log seems to mention an ST2000LM015, which is a 2 TB HDD and not a Kingston SSD?

Last edited by Schlaefer01 (2024-09-18 12:53:45)
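
As a cross-check on the point that these are link-level errors rather than anything btrfs-specific, the raw fields in the posted dmesg output can be decoded by hand. What follows is a minimal Python sketch, not something posted in the thread. The bit names follow the SATA SError register as libata prints it, the byte offsets follow the SCSI Read(10) layout, the helper names `decode_serror` and `decode_read10` are made up for illustration, and a 512-byte logical sector size is assumed.

```python
# Illustrative helpers (hypothetical names) for decoding fields from the dmesg output.

# SError bit positions, with the short names libata prints in "SError: { ... }".
SERROR_BITS = {
    0: "RecovData",
    1: "RecovComm",
    8: "UnrecovData",
    9: "Persist",
    10: "Proto",
    11: "HostInt",
    16: "PHYRdyChg",
    17: "PHYInt",
    18: "CommWake",
    19: "10B8B",
    20: "Dispar",
    21: "BadCRC",
    22: "Handshk",
    23: "LinkSeq",
    24: "TrStaTrns",
    25: "UnrecFIS",
    26: "DevExch",
}


def decode_serror(value):
    """Return the names of the flags set in an SError value, lowest bit first."""
    return [name for bit, name in sorted(SERROR_BITS.items()) if value & (1 << bit)]


def decode_read10(cdb_hex, sector_size=512):
    """Decode a Read(10) CDB given as space-separated hex bytes.

    sector_size is an assumption (512-byte logical sectors).
    """
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a Read(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")      # bytes 2-5: starting LBA
    sectors = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length
    return {"lba": lba, "sectors": sectors, "bytes": sectors * sector_size}


print(decode_serror(0x2D0100))
print(decode_read10("28 00 00 00 0c a0 00 06 00 00"))
print(decode_read10("28 00 00 00 12 a0 00 03 80 00"))
```

The decoded values agree with what the kernel already printed: the same five SError flags, starting sectors 3232 and 4768, and transfer sizes of 786432 and 458752 bytes. Both failed commands are plain reads, even though the sense text says "Unaligned write command". PHY and CRC flags like these are reported by the SATA link itself, which is why cable, port and power are reasonable first suspects, although the log alone does not prove which one is at fault.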

xerxes_ · 2024-09-18 19:40:57

Can you post the output of the smartctl command for these two failing drives? For example: 'smartctl -x /dev/sdc'. If your system doesn't work, you can run it from archiso.

I think drive errors can be produced when the system is switched off too abruptly by a power cut, and a faulty power supply or cable (power or data) can sometimes break drives. So can a faulty motherboard. You may also have problems with the electrical installation. And if your internet comes in over an overhead copper line and you mentioned storms... you know what may happen.
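
Since the kernel log reports BadCRC on the link, one value worth looking for in that smartctl output is the interface CRC error counter (commonly attribute 199, UDMA_CRC_Error_Count, though not every drive reports it). Below is a minimal Python sketch for pulling it out, assuming a smartctl version with `--json` support and the usual `ata_smart_attributes.table` layout of its JSON output. The function names and the sample data are made up for illustration.

```python
import json
import subprocess

# Assumption: the drive exposes the interface CRC counter as attribute 199.
CRC_ATTR_ID = 199


def crc_error_count(report):
    """Return the raw value of attribute 199 from a parsed smartctl JSON report.

    Returns None if the drive does not report that attribute.
    """
    table = report.get("ata_smart_attributes", {}).get("table", [])
    for attr in table:
        if attr.get("id") == CRC_ATTR_ID:
            return attr.get("raw", {}).get("value")
    return None


def read_smart_report(device):
    """Run 'smartctl --json -x <device>' (needs root and smartmontools) and parse it."""
    # smartctl's exit status is a bitmask, so a non-zero code is not treated as fatal.
    proc = subprocess.run(
        ["smartctl", "--json", "-x", device],
        capture_output=True, text=True, check=False,
    )
    return json.loads(proc.stdout)


# Made-up sample in the same shape, so the parsing can be tried without a drive.
sample = {
    "ata_smart_attributes": {
        "table": [
            {"id": 9, "name": "Power_On_Hours", "raw": {"value": 120}},
            {"id": 199, "name": "UDMA_CRC_Error_Count", "raw": {"value": 37}},
        ]
    }
}
print(crc_error_count(sample))
```

On a real system this would be called as `crc_error_count(read_smart_report("/dev/sdc"))`. A counter that keeps climbing between boots tends to point at the cable, port or power rather than at the flash itself, but that is a rule of thumb and not a diagnosis.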

noxnivi · 2025-01-31 18:23:48

Head_on_a_Stick wrote:

Do you have any links to support that assertion?

I found it here: https://discussion.fedoraproject.org/t/ … pted/77909, and also in some other places that I can't recall now.

@Schlaefer01, I removed the drive, because that is an office work computer, put in the Kingston, reinstalled Arch, configured it and went back to work. I needed to get back to work ASAP. The WD is still lying around waiting until I have time to check it. Regarding the other disk, my understanding is that systemd had simply found that device at that point in the boot. As I understand it, the sdc errors relate to the same SATA connector, so the drive shows up as sdc in both cases, and they have nothing to do with the 2TB HDD. Am I right on this one?

@xerxes_: I just got an adapter cable to connect the SSD to a USB port. As soon as I have time I'll run the smartctl command and see what it reports.
Yes, I agree with the possible causes you point out. I just turn off the computer through KDE's Shutdown button and let it finish, so I presume shutdown is not an issue. The storm thing, well, that could have contributed: when we have storms, some parts of the office building (depending on the storm's intensity) sometimes lose power, and that ends in a complete shutdown in the middle of whatever I was doing. I count this as one of the most likely causes. The mobo still seems fine, at least for now, and so does the PSU.
I think the internet is now on fiber, though I'm not sure, as it doesn't depend on me. It was copper wire before. The line goes into a router, then to a building switch (rack mounted), then to the switch my machine is connected to. That said, back in 2003 I did have an old iMac burned by a storm that sent a high spike through the phone line. The mobo is old, that is true, and all these storms and power outages might have affected it. I haven't seen anything burned inside, I mean capacitors. But I'll keep an eye on it for failures.

I did a clean install again on the Kingston back then and went back to work. There seem to have been no issues with the Kingston since then. Fingers crossed.
I will check the journalctl logs again to see if there is any issue, just for peace of mind.

I'm wondering if maybe btrfs is not such a good option for a filesystem??

Thanks all for your replies.
I'll come back with the smartctl outcome.

Regards,
NN

Head_on_a_Stick · 2025-01-31 18:59:33

noxnivi wrote:

Head_on_a_Stick wrote:
Do you have any links to support that assertion?
I found it here: https://discussion.fedoraproject.org/t/ … pted/77909

No drives were killed in that thread. I really do think this is nonsense.

xerxes_ · 2025-01-31 21:28:24

You may try to rescue files from the btrfs partition(s) with these commands:

btrfs restore -D <device> <path>      # see what can be restored, but do nothing (dry run)
btrfs restore -D -u 2 /dev/sdc1 /path/to/save/files_to      # example: dry run using superblock mirror 2, to compare with the run above
btrfs restore -u <mirror> <device> <path>   # use the given superblock mirror; <mirror> can be 0, 1 or 2

It doesn't change anything on the filesystem, works on an unmounted partition, and can also be tried on a filesystem image.
