#1 2016-07-17 14:11:31

Democracy Bringer
Member
Registered: 2016-02-16
Posts: 1

Haunting BTRFS checksum error on root partition

My adventure with Arch started 2 years ago, and since then I've used BTRFS for my / partition.

This is now the second time the filesystem has become corrupted, with the kernel reporting checksum mismatches. The first time, the entire partition was unmountable; this time it mounts correctly, both from the on-disk OS and from the Arch ISO.

The partition mounts with multiple errors like these:

BTRFS warning (device dm-2): checksum error at logical 62590240 on dev /dev/mapper/ssd-sysvol, sector 143904: metadata leaf (level 0) in tree 136216576
BTRFS warning (device dm-2): checksum error at logical 62590240 on dev /dev/mapper/ssd-sysvol, sector 143904: metadata leaf (level 0) in tree 136216576
BTRFS error (device dm-2): bdev /dev/mapper/ssd-sysvol errs: wr 0, rd 0, flush 0, corrupt 21, gen 0

Some of the messages concern metadata leaves, others nodes, for a total of 10 warnings and 4 errors.
Despite these, the FS mounts and seems to work fine, until I try to execute some commands inside the broken OS, which reveals what look like corrupted binaries.
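For anyone who wants to tally such messages from a saved kernel log instead of counting by eye, here is a minimal Python sketch. The function name summarize_btrfs_log and both regular expressions are assumptions derived only from the three lines quoted above; they are not an official log format and may need adjusting for other kernel versions.

```python
import re
from collections import Counter

# Both patterns are guesses based on the log lines quoted in this post.
LEVEL_RE = re.compile(r"BTRFS (warning|error) \(device ([^)]+)\)")
STATS_RE = re.compile(
    r"errs: wr (\d+), rd (\d+), flush (\d+), corrupt (\d+), gen (\d+)"
)
STAT_NAMES = ("wr", "rd", "flush", "corrupt", "gen")


def summarize_btrfs_log(text):
    """Count BTRFS warnings/errors and return the last device counters seen.

    Returns (levels, stats): levels maps "warning"/"error" to a count,
    stats is a dict of the per-device counters or None if none was found.
    """
    levels = Counter()
    stats = None
    for line in text.splitlines():
        level = LEVEL_RE.search(line)
        if level is None:
            continue  # not a BTRFS message in the assumed format
        levels[level.group(1)] += 1
        counters = STATS_RE.search(line)
        if counters:
            stats = dict(zip(STAT_NAMES, map(int, counters.groups())))
    return dict(levels), stats
```

Feeding it the output of dmesg saved to a file would give the warning and error totals plus the latest "corrupt" counter in one place, which makes it easier to see whether the numbers grow between mounts.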

I ran a read-only btrfs scrub, which reported the same 4 errors and ended with this output:

scrub done for b5d2fe61-f400-40ab-807b-77e0f21a15c4
         scrub started at Sat Jul 16 15:57:54 2016 and finished after 00:00:07
         total bytes scrubbed: 3.65GiB with 4 errors
         error details: csum=4
         corrected errors: 0, uncorrectable errors: 0, unverified errors: 0

Then, I ran a btrfs restore in dry-run mode, resulting in a failure:

checksum verify failed on 339296256 found 8A02712F wanted 00000000
checksum verify failed on 339296256 found 8A02712F wanted 00000000
checksum verify failed on 339296256 found 8A02712F wanted 00000000
bytenr mismatch, want 339296256, have 0
Error searching -5
Error searching /tmp/usr/include/X11/ICE
Error searching /tmp/usr/include/X11/ICE
Error searching /tmp/usr/include/X11/ICE
Error searching /tmp/usr/include/X11/ICE

To get a diagnostic copy of the metadata I ran btrfs-image, without any luck:

checksum verify failed on 65290240 found 2FDEBACF wanted E4C47028
checksum verify failed on 65290240 found 2FDEBACF wanted E4C47028
checksum verify failed on 65323008 found B7225A2E wanted 00000000
checksum verify failed on 65323008 found B7225A2E wanted 00000000
checksum verify failed on 339296256 found 8A02712F wanted 00000000
checksum verify failed on 339296256 found 8A02712F wanted 00000000
checksum verify failed on 339296256 found 8A02712F wanted 00000000
checksum verify failed on 339296256 found 8A02712F wanted 00000000
bytenr mismatch, want 339296256, have 0
Error reading metadata block
Error adding block -5
checksum verify failed on 339296256 found 8A02712F wanted 00000000
checksum verify failed on 339296256 found 8A02712F wanted 00000000
checksum verify failed on 339296256 found 8A02712F wanted 00000000
checksum verify failed on 339296256 found 8A02712F wanted 00000000
bytenr mismatch, want 339296256, have 0
Error reading metadata block
Error flushing pending -5
create failed (Success)
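Because the same failure is printed many times, it can help to reduce the output to the distinct failing blocks. The following is a minimal Python sketch of that idea. The function name distinct_csum_failures and the regular expression are assumptions based only on the "checksum verify failed" lines pasted above, and the sketch makes no claim about what the values mean.

```python
import re

# Assumed line format, taken from the btrfs-image output quoted above.
CSUM_RE = re.compile(
    r"checksum verify failed on (\d+) found ([0-9A-Fa-f]{8}) wanted ([0-9A-Fa-f]{8})"
)


def distinct_csum_failures(text):
    """Group repeated 'checksum verify failed' lines by block address.

    Returns a dict keyed by bytenr (in order of first appearance) whose
    values hold the 'found' and 'wanted' strings and a repeat count.
    """
    blocks = {}
    for match in CSUM_RE.finditer(text):
        bytenr = int(match.group(1))
        entry = blocks.setdefault(
            bytenr,
            {"found": match.group(2), "wanted": match.group(3), "count": 0},
        )
        entry["count"] += 1
    return blocks
```

Applied to the btrfs-image output above, this reduces the listing to three blocks (65290240, 65323008 and 339296256), two of which show a "wanted" value of 00000000.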

Finally, I ran a read-only btrfs check, which produced a ton of output on the terminal and ended with this:

Checking filesystem on /dev/ssd/sysvol
UUID: b5d2fe61-f400-40ab-807b-77e0f21a15c4
The following tree block(s) is corrupted in tree 257:
	tree block bytenr: 38895616, level: 2, node key: (103800, 1, 0)
The following data extent is lost in tree 257:
	inode: 110837, offset:655360, disk_bytenr: 1738903552, disk_len: 102400
found 3566469120 bytes used err is 1
total csum bytes: 3133040
total tree bytes: 356892672
total fs tree bytes: 338821120
total extent tree bytes: 13664256b
tree space waste bytes: 57739479
file data blocks allocated: 10591703040 referenced 15198257152

Since the damaged FS was mountable, I managed to copy the precious /etc directory to a backup disk, so that I can "restore" my system once the catastrophe has passed.

At this point, I have many questions:
- Is the damaged root partition recoverable, maybe by rewriting some metadata? Do I have to run a read-write btrfs check?
- In case it isn't, can I easily check if the fresh copy of /etc is consistent, without manually checking every config file?
- Now that this has happened twice, should I replace the underlying block device? Could setting up a virtual RAID 1 help prevent this from happening again? Or should I abandon BTRFS for the time being?
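On the second question, one possible approach is to compare the freshly recovered /etc against an older, trusted backup by content hash, so that only the files that differ need a manual look. Below is a minimal Python sketch under that assumption. The names file_digest and compare_trees and the two directory arguments are hypothetical, and a reported difference does not prove corruption, since the file may simply have been edited after the older backup was taken.

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_trees(recovered_root, reference_root):
    """Compare regular files under recovered_root with reference_root.

    Returns a dict of sorted relative paths:
      differs              - present in both trees, different content
      missing_in_reference - no counterpart in the reference tree
      unreadable           - reading either copy raised OSError
    Symlinks and special files are skipped.
    """
    report = {"differs": [], "missing_in_reference": [], "unreadable": []}
    for dirpath, _dirnames, filenames in os.walk(recovered_root):
        for name in filenames:
            recovered = os.path.join(dirpath, name)
            if os.path.islink(recovered) or not os.path.isfile(recovered):
                continue
            relative = os.path.relpath(recovered, recovered_root)
            reference = os.path.join(reference_root, relative)
            if not os.path.isfile(reference):
                report["missing_in_reference"].append(relative)
                continue
            try:
                if file_digest(recovered) != file_digest(reference):
                    report["differs"].append(relative)
            except OSError:
                report["unreadable"].append(relative)
    for paths in report.values():
        paths.sort()
    return report
```

A call such as compare_trees("/mnt/backup/etc-recovered", "/mnt/backup/etc-old"), with both paths being placeholders, would return the short list of files worth inspecting by hand. Files listed as unreadable are the ones where the read itself failed, which on a filesystem with checksum errors is worth noting separately.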

I should mention that my filesystem layout is heavily compartmentalized and I have "mildly warm" backups of /etc, /var and /home, in addition to various automatically generated snapshots of / and /usr. Both of the latter directories are in separate subvolumes under the BTRFS volume called "sysvol".
So, even if I have to nuke the main partition and the recovered /etc is damaged, I can light-heartedly say that most of my data is safe anyway.
