Hi!
A little bit of background:
I have bcache set up on my box because I wanted a little extra speed. One partition (well, most of my SSD) is set up as the cache partition (see below). There is a 4TB HDD with the Arch Linux root and a data partition, both running as bcache backing devices with the same cache.
The story: After running this setup for almost two years, something went wrong while I was not attending the computer. I did a regular upgrade with pacman (IIRC there was no kernel upgrade) and rebooted. When I came back (a few minutes after the reboot!), mounting root had failed and the boot process had bailed out to the emergency shell. I hastily ran fsck, assuming it was some kind of trivial error ... after realizing I was being bombarded with seemingly never-ending errors from fsck trying to rescue the filesystem, I interrupted it and shut down the computer. I presumed there was a hard disk failure or something, and thought that running fsck would not make things any better.
This is the situation I am currently in. After checking from BIOS that there are no clear HW failures and rebooting to a live installation from USB (the next day, which is today), all HDDs and the SSD seem to be in order according to smartctl. Except that ... well, the data partition is recognized after attaching to bcache but is still full of errors. Moreover, the Arch root partition is not recognized by bcache; instead the kernel logs claim that it has no bcache superblock (which it definitely should have).
TL;DR: Root was installed on a bcache device, but the bcache superblock is not recognized anymore because of some unknown error. Is there any way to recover the FS on the broken bcache device, so I can check the logs for hints about what went wrong?
Here is the (relevant part of) the partition scheme:
Model: ATA Samsung SSD 850 (scsi)
Disk /dev/sdc: 250GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags:

Number  Start   End     Size    File system     Name     Flags
 1      1049kB  537MB   536MB   fat16           EFI      boot, esp
 2      537MB   16.9GB  16.4GB  linux-swap(v1)  swapssd
 3      16.9GB  250GB   233GB                   bcache

Model: ATA WDC WD40EFRX-68W (scsi)
Disk /dev/sdb: 4001GB
Sector size (logical/physical): 512B/4096B
Partition Table: gpt
Disk Flags:

Number  Start   End     Size    File system  Name          Flags
 1      1049kB  1000GB  1000GB               bcache-arch
 2      1000GB  4001GB  3001GB               bcache-pool2
Here: bcache = cache (on the SSD), bcache-arch = root, bcache-pool2 = data partition. EDIT: Both bcache-arch and bcache-pool2 were formatted as ext4; I can only fsck the latter since the former is not recognized!
There are other HDDs too, but those are not running with bcache.
Sorry for the lack of information (such as the fstab) - but since the root FS is not accessible ATM, I cannot read it and post it here. FWIW all filesystems were referenced by their FS UUIDs.
Thanks!
(good thing my data is intact)
Last edited by Wild Penguin (2017-11-20 15:17:49)
Ok, a little bit of something I've found while googling and reading documentation:
The filesystem should be available at offset 8192 (bytes) in the backing device node, even if the bcache superblock is corrupted (in case someone else finds themselves in the same boat).
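For anyone who wants to verify this before trying a mount, here is a minimal sketch that only reads the device and looks for the ext2/3/4 magic number behind that offset. It assumes the 8192-byte offset quoted above applies to your backing device, and relies on the ext superblock starting 1024 bytes into the filesystem with its magic 0xEF53 stored little-endian at byte 56. The function name has_ext_magic is made up for this example.

```python
import struct

BCACHE_DATA_OFFSET = 8192  # bytes; the offset quoted above (assumed to apply)
EXT_SB_OFFSET = 1024       # ext2/3/4 superblock starts 1024 bytes into the FS
EXT_MAGIC_OFFSET = 56      # position of the s_magic field inside the superblock
EXT_MAGIC = 0xEF53         # ext2/3/4 magic number, stored little-endian


def has_ext_magic(path, data_offset=BCACHE_DATA_OFFSET):
    """Return True if an ext2/3/4 magic number sits behind data_offset.

    The device (or image file) is opened read-only and nothing is written.
    """
    with open(path, "rb") as dev:
        dev.seek(data_offset + EXT_SB_OFFSET + EXT_MAGIC_OFFSET)
        raw = dev.read(2)
    # A short read means the device/image ends before the magic field.
    return len(raw) == 2 and struct.unpack("<H", raw)[0] == EXT_MAGIC
```

Run as root from the live system, has_ext_magic("/dev/sdb1") would check the root backing partition from the listing above. A True result only says that an ext superblock is where it is expected; it says nothing about how damaged the rest of the filesystem is.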
I've also noticed that there have been some changes to the bcache code recently. I doubt a serious bug would make it into a released kernel, but I'm using the testing repositories. It is possible something is broken in the current kernel, perhaps only in some obscure situation. But I still haven't ruled out a hardware malfunction (although everything seems to be working in the live installation and the SMART data is OK).
But it's too late today to check whether my root FS is still there and intact. I will try to mount it RO tomorrow.
Cheers!
Ok,
There's a bug in kernel 4.14(.1) affecting bcache and other things too (it is actually somewhere in the block layer, which is quite confusing for someone who is not a developer - it can cause other breakage, too!).
A bug report is in the works!
In the meantime I'd advise against using the kernel in testing! See this Gentoo bug report: https://bugs.gentoo.org/638206
You could file a bug report here asking for https://git.kernel.org/pub/scm/linux/ke … rtno.patch to be pulled before 4.14.2 as the issue results in data loss.
I don't use kernel .0 releases for this kind of reason but a data corruption issue in a kernel .1? Haven't seen that in a while. Sh:t happens.
Good luck on the recovery.
I'd presume there is little I can report upstream (the bug is already known there, as are the data losses affecting some users, and a patch is already queued for 4.14).
Now it is in the Arch bug tracker, too, so the Arch maintainers can take whatever steps they want / can / see appropriate...
Whether I've lost some data remains to be seen - and the loss will be quite minimal in any case. Luckily I had some backups in place, and restoring the installation is trivial (but time-consuming). Originally I just wanted to see the logs, since I was perplexed about what was wrong - HW or SW - and how to prevent it from happening again. I've got the answers I was after now...
EDIT: some typos and minor clarifications
Last edited by Wild Penguin (2017-11-21 21:16:49)