FSCK Fails on Boot

aviator07 · 2024-05-10 15:53:24

I'm having a major problem with my system that I hope is not a hardware problem. I don't think it is, but I am at a loss about what to do to resolve the issue. Here is the situation and what I have tried so far. I appreciate any help!

Yesterday I came to my computer (I was already logged into a KDE session) and when I tried to unlock it, it wouldn't take my password. I tried switching to a different TTY just to test my password, and instead of giving me TTY2, I got a black screen with a pulsing underscore. So, I did a physical hard shutdown. When I tried to power back up, it was failing FSCK on the boot partition.

Specifically I get:

boot: recovering journal
boot: Superblock needs_recovery flag is clear, but journal has data.
boot: Run journal anyway

boot: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
           (i.e., without -a or -p options)
ERROR: Bailing out. Run 'fsck UUID=<uuid> manually
********** FILESYSTEM CHECK FAILED **********
*                                           *
*  Please run fsck manually. After leaving  *
*  this maintenance shell, the system will  *
*  reboot automatically.                    *
*                                           *
*********************************************
sh: can't access tty; job control turned off
[rootfs ~]#

I have tried to run fsck on the given UUID above, and I get:

fsck from util-linux 2.39.3
e2fsck 1.47.0 (5-Feb-2023)
boot: recovering journal
Superblock needs_recovery flag is clear, but journal has data.
Run journal anyway<y>? yes
fsck.ext4: Input/output error while recovering journal of boot
fsck.ext4: unable to set superblock flags on boot


boot: ********** WARNING: Filesystem still has errors **********

I have also booted from a GRML live USB to troubleshoot. Honestly, I can't remember all of the different things I have tried, but I tried running fsck with the -b option, specifying certain blocks, and I get no different results. There is an I/O error. I have read about drives physically failing (this is an M.2 drive), but when I run diagnostics on the drive with the -H option (can't remember the command...if it's relevant, I can try to look it up) it shows drive health is good. But I get about 16 R/W I/O errors. I don't think my drive is physically failing and going into read-only mode, but I guess that's a possibility.

Part of me thinks that since there is a clear event (the hard shutdown) that seems to have caused this, it should be recoverable. Part of me thinks that the password and TTY thing is pretty weird, and that has me concerned....

I appreciate any advice or help on this issue. I don't really know where to go from here. I hope I can repair my boot drive and just keep on going, but if I can't I want to be able to preserve as much as possible.

Regarding my partition structure:

boot
swap
/home

I have other drives mounted further down under /home, which seem to be unaffected by this.

d_fajardo · 2024-05-10 16:15:37

It sounds like a drive problem. Since this is an M.2 drive, have you tried nvme-cli to see if you can get further information regarding the drive?
Also, back up your most important files immediately if you can.
https://wiki.archlinux.org/title/Solid_state_drive/NVMe

Last edited by d_fajardo (2024-05-10 16:31:55)

aviator07 · 2024-05-10 16:44:20

I appreciate the response. I am able to run nvme commands from a live USB. Running

nvme smart-log

I get:

Smart Log for NVME device:nvme0n1 namespace-id:ffffffff
critical_warning                        : 0
temperature                             : 47 °C (320 K)
available_spare                         : 100%
available_spare_threshold               : 50%
percentage_used                         : 8%
endurance group critical warning summary: 0
Data Units Read                         : 5711302 (2.92 TB)
Data Units Written                      : 28737783 (14.71 TB)
host_read_commands                      : 99402283
host_write_commands                     : 719524111
controller_busy_time                    : 2085
power_cycles                            : 439
power_on_hours                          : 12085
unsafe_shutdowns                        : 43
media_errors                            : 0
num_err_log_entries                     : 24012
Warning Temperature Time                : 0
Critical Composite Temperature Time     : 0
Temperature Sensor 2                    : 47 °C (320 K)
Thermal Management T1 Trans Count       : 0
Thermal Management T2 Trans Count       : 0
Thermal Management T1 Total Time        : 0
Thermal Management T2 Total Time        : 0

When I run

 nvme error-log

I get a long output, but there are lots of errors like the following:

0x182(Attempted Write to Read Only Range: The LBA range specified contains read-only blocks)

I know drives can go into RO mode when they are failing, but this drive apparently isn't in RO mode, according to other tests I have run....I guess I have done 720M write operations to it. Perhaps that is close to the allowable lifetime for it.

Regarding recovering data. How can I do that? I can't mount the drive anywhere. Could I try to dd it to a .iso file? Would that take the errors with it (assuming it worked)?

seth · 2024-05-10 20:23:07

Check the dmesg for what the IO error actually is - could be the bus (i.e. the drive is badly seated or got under tension - they're typically secured by a screw that might be too loose or too tight)

d_fajardo · 2024-05-11 09:13:02

Regarding recovering data. How can I do that?

You can try disk cloning. It does seem the drive is intact, so the problem could be the bus, as seth said, or the controller, in which case the data could still be unaffected.
The simplest is dd, but just be very careful with dd. It's been called 'disk destroyer'. In any case, if you have an exact clone then at least you have a copy of the drive and have another copy to work with.
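
A minimal sketch of the cloning step described above, in POSIX shell. The function name image_disk and the rule of refusing any output path that already exists are illustrative assumptions, not something from the thread; the point of the rule is that a swapped if/of cannot land on a real disk, since device nodes always exist. The device name in the usage comment is the one reported later in the thread.

```shell
#!/bin/sh
# image_disk SRC DST
# Copy SRC (for example a whole-disk device) into a NEW file DST.
# Hypothetical helper: name and safety rule are assumptions for illustration.
image_disk() {
    src=$1
    dst=$2
    # Refuse to write onto anything that already exists. Block devices under
    # /dev always exist, so swapped arguments cannot overwrite a disk.
    if [ -e "$dst" ]; then
        echo "refusing to overwrite existing path: $dst" >&2
        return 1
    fi
    # Check the source before creating any output file.
    if [ ! -r "$src" ]; then
        echo "cannot read source: $src" >&2
        return 1
    fi
    # dd only reads from SRC; bs=1048576 (1 MiB) just speeds up the copy.
    # sync flushes the finished image to stable storage.
    dd if="$src" of="$dst" bs=1048576 && sync
}

# Possible use from a live system, writing to a different, healthy disk:
#   image_disk /dev/nvme0n1 /mnt/backup/nvme0n1.img
```

Plain dd stops at the first read error, so on a drive that returns read errors a tool such as ddrescue is usually the better fit.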

aviator07 · 2024-05-13 16:21:28

Thanks for the replies. When I run dmesg, I get a big dump, but this pair of lines keeps showing up:

nvme0n1: Flush(0x0) @ LBA 18446744073709551615, 0 blocks, Attempted to Write to Read Only Range (sct 0x1 / sc 0x82)
critical medium error, dev nvme0n1, sector 0 op 0x1:(WRITE) flags 0x800 phys_seg 0 prio class 2

Is this definitely a drive failure? Or could it still be a recoverable fault?

seth · 2024-05-13 16:55:43

I get a big dump

A big massive dump? Late at night? Forcing you to flush 10, 15 times?

dmesg | curl -F 'file=@-' 0x0.st

This certainly isn't some bus error, but the smart log also doesn't indicate that the drive has moved itself to RO (maybe also post the "smartctl -a" output), so it's not clear why it's RO at this point.
Maybe there's a hint buried in that big dump.
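
When the full dmesg output is too long to read through, a filter along these lines can pull out the lines that matter here. This is a sketch only: the function name nvme_errors and the choice of search patterns are assumptions, picked to match the two message types quoted in this thread.

```shell
#!/bin/sh
# nvme_errors: read kernel log text on stdin and keep only lines that mention
# an NVMe device or a block-layer I/O / medium error (case-insensitive).
# Hypothetical helper; the pattern list is an assumption, extend as needed.
nvme_errors() {
    grep -iE 'nvme|I/O error|medium error'
}

# Possible use on the affected machine (dmesg may need root):
#   dmesg | nvme_errors | less
```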

aviator07 · 2024-05-13 17:29:21

seth wrote:

I get a big dump
A big massive dump? Late at night? Forcing you to flush 10, 15 times?

This has been causing all sorts of crazy problems

I ran

 dmesg | curl -F 'file=@-' 0x0.st

and the output is at http://0x0.st/XKy0.txt.

I also ran

smartctl -a /dev/nvme0n1

and got this:

smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.6.15-amd64] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       Patriot Scorch M2
Serial Number:                      0385078C135D52085766
Firmware Version:                   E8FM11.5
PCI Vendor/Subsystem ID:            0x1987
IEEE OUI Identifier:                0x6479a7
Total NVM Capacity:                 256,060,514,304 [256 GB]
Unallocated NVM Capacity:           0
Controller ID:                      0
NVMe Version:                       1.2
Number of Namespaces:               1
Namespace 1 Size/Capacity:          256,060,514,304 [256 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            6479a7 15f0834f07
Local Time is:                      Mon May 13 17:03:59 2024 UTC
Firmware Updates (0x02):            1 Slot
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x001e):     Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat
Log Page Attributes (0x04):         Ext_Get_Lg
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     84 Celsius
Critical Comp. Temp. Threshold:     88 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     3.00W       -        -    0  0  0  0        0       0
 1 +     2.00W       -        -    1  1  1  1        0       0
 2 +     2.00W       -        -    2  2  2  2        0       0
 3 -   0.1000W       -        -    3  3  3  3     1000    1000
 4 -   0.0050W       -        -    4  4  4  4   400000   90000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         1
 1 -    4096       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        47 Celsius
Available Spare:                    100%
Available Spare Threshold:          50%
Percentage Used:                    8%
Data Units Read:                    5,710,747 [2.92 TB]
Data Units Written:                 28,737,774 [14.7 TB]
Host Read Commands:                 99,362,017
Host Write Commands:                719,523,591
Controller Busy Time:               2,085
Power Cycles:                       439
Power On Hours:                     12,084
Unsafe Shutdowns:                   43
Media and Data Integrity Errors:    0
Error Information Log Entries:      23,777
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 2:               47 Celsius

Error Information (NVMe Log 0x01, 16 of 16 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS  Message
  0      23777     0  0x000d  0x0017      -      9175046     1     -  Invalid Namespace or Format
  1      23776     0  0x0018  0x0005      -            6     0     -  Invalid Field in Command
  2      23775     8  0xb141  0x0305      -            0     1     -  Attempted Write to Read Only Range
  3      23774     8  0xa141  0x0305      -            0     1     -  Attempted Write to Read Only Range
  4      23773     8  0x9141  0x0305      -            0     1     -  Attempted Write to Read Only Range
  5      23772     8  0x8141  0x0305      -            0     1     -  Attempted Write to Read Only Range
  6      23771     8  0x7141  0x0305      -            0     1     -  Attempted Write to Read Only Range
  7      23770     8  0x6141  0x0305      -            0     1     -  Attempted Write to Read Only Range
  8      23769     4  0xb140  0x0304      -            0     1     -  Attempted Write to Read Only Range
  9      23768     4  0xa140  0x0304      -            0     1     -  Attempted Write to Read Only Range
 10      23767     4  0x9140  0x0304      -            0     1     -  Attempted Write to Read Only Range
 11      23766     4  0x8140  0x0304      -            0     1     -  Attempted Write to Read Only Range
 12      23765     4  0x7140  0x0304      -            0     1     -  Attempted Write to Read Only Range
 13      23764     4  0x6140  0x0304      -            0     1     -  Attempted Write to Read Only Range
 14      23763     4  0x5141  0x0304      -            0     1     -  Attempted Write to Read Only Range
 15      23762     4  0x4141  0x0304      -            0     1     -  Attempted Write to Read Only Range

Read Self-test Log failed: Invalid Namespace or Format (0x00b)

I'm hoping there are some good nuggets in that. It seems like the "Attempted to Write to Read Only Range" is confined to 16 addresses. That might be a bad sector? If so though, I am hoping it is recoverable.
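
The Status column in the smartctl table can be tied back to the other outputs with a little bit arithmetic. The sketch below assumes the usual NVMe layout for that 16-bit field (bit 0 is the phase tag, bits 1-8 the status code, bits 9-11 the status code type); the function name is made up for illustration. Under that assumption both 0x0305 and 0x0304 decode to sct 0x1 / sc 0x82, the same pair dmesg prints, and together they form the 0x182 shown by nvme error-log. Note also that the table holds 16 log entries rather than 16 distinct addresses: the 14 read-only entries all report LBA 0.

```shell
#!/bin/sh
# decode_nvme_status STATUS
# STATUS is the 16-bit value from smartctl's "Status" column, e.g. 0x0305.
# Assumed layout: bit 0 = phase tag, bits 1-8 = status code (SC),
# bits 9-11 = status code type (SCT).
decode_nvme_status() {
    raw=$(( $1 ))
    sc=$(( (raw >> 1) & 0xff ))    # drop the phase bit, keep 8 bits
    sct=$(( (raw >> 9) & 0x7 ))    # next 3 bits
    printf 'sct 0x%x / sc 0x%02x\n' "$sct" "$sc"
}

decode_nvme_status 0x0305   # one of the "Attempted Write to Read Only Range" rows
decode_nvme_status 0x0017   # the "Invalid Namespace or Format" row
```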

seth · 2024-05-13 21:55:07

https://wiki.archlinux.org/title/Badblo … ad_sectors
https://wiki.archlinux.org/title/Badblo … ad_sectors

It seems to only complain about one LBA but there's also 100% free spare and in that case the drive would re-allocate the bad block
Ideally that's gonna happen when you test for them.

Sidebar:

Data Units Read:                    5,710,747 [2.92 TB]
Data Units Written:                 28,737,774 [14.7 TB]

Is this correct? You've > 6 times more write than read access?
(And way beyond the drive size, so it's not that you just copied a lot of data there once)
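
For reference, the figures in brackets can be reproduced by hand. NVMe reports Data Units in thousands of 512-byte units, so one unit is 512,000 bytes. The helper names below are made up for this sketch. With the smartctl numbers above, the written/read ratio comes out at roughly 5.0 by data volume and roughly 7.2 by command count.

```shell
#!/bin/sh
# data_units_to_bytes UNITS
# One NVMe SMART data unit = 1000 x 512 bytes = 512000 bytes.
data_units_to_bytes() {
    echo $(( $1 * 512000 ))
}

# ratio_x100 A B
# A/B in integer hundredths (shell arithmetic has no fractions): 503 means 5.03
ratio_x100() {
    echo $(( $1 * 100 / $2 ))
}

data_units_to_bytes 5710747            # Data Units Read, in bytes
data_units_to_bytes 28737774           # Data Units Written, in bytes
ratio_x100 28737774 5710747            # written vs read, by data units
ratio_x100 719523591 99362017          # written vs read, by host commands
```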

aviator07 · 2024-05-13 22:21:54

The Data Units Read/Written kinda surprised me too. Here is the history of this drive:

It was new 5 years ago when I first installed Arch on it. It might be getting a touch geriatric as an M.2 drive.... About 3 years into it, I ran out of space in my root partition, and I had to make a dd image of home, save it off, reformat and repartition the M.2 drive, reimage it, and then resize the file system. Maybe that is why those numbers are the way they are. Also, I tend to leave the machine running for months at a time between power cycles sometimes. Not sure how problematic that is really...

Let me try running badblocks. (I believe I did in my initial testing...) I will report back. So a bad sector ought to be automatically repaired, right? Is there any way to force that?

seth · 2024-05-14 06:40:04

Bad sectors cannot be "repaired" but the disk is supposed to replace them with a spare.
You cannot enforce that (except MAYBE with a dedicated low level tool by the disk vendor and most likely not) but you can tell the FS to skip that block (2nd link I posted in #9)

aviator07 · 2024-05-14 14:17:19

Well I got the results back from running

 badblocks -nsv /dev/nvme0n1

And I got:

done
Pass completed, 250059096 bad blocks found. (0/0/250059096 errors)
badblocks -nsv /dev/nvme0n1 196.15s user 49429.80s system 94% cpu 14:32:11.8 total

I guess I'll be getting a new drive...

seth · 2024-05-14 14:32:52

That's ~120GB (or 128GB in SI units and exactly half the drive) - this might be only one module and the other one is still fine, but iff this is "fixable" you'll need to investigate for the vendor.
Did you only recently use up more than 128GB on the drive?
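
The size behind the badblocks count depends on the block size badblocks used. A quick check, assuming the run above used badblocks' default of 1024 bytes per block (the command shown has no -b option); the helper name is made up for this sketch. Under that assumption the count matches the 256,060,514,304 bytes that smartctl lists as Total NVM Capacity, which would mean every block on the drive was flagged, while the half-drive figure would follow from 512-byte blocks. The triple in parentheses appears to be read/write/comparison errors, so 0/0/250059096 would fit a drive that still reads but does not retain the test writes.

```shell
#!/bin/sh
# badblocks_bytes COUNT [BLOCK_SIZE]
# Bytes covered by COUNT blocks. BLOCK_SIZE defaults to 1024, the default
# badblocks uses when -b is not given (assumed to apply to the run above).
badblocks_bytes() {
    echo $(( $1 * ${2:-1024} ))
}

badblocks_bytes 250059096        # with 1024-byte blocks
badblocks_bytes 250059096 512    # with 512-byte blocks, for comparison
```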

aviator07 · 2024-05-14 14:46:13

Good catch. I have only been using this in normal operation lately. There shouldn't have been any new big files or high volume of files or anything like that.

Iff, this is fixable....what might a fix look like? How might I pull that off? At this point, if it is possible to make an image of the drive, or failing that, grab files off of the drive, I'd call it a mitigated success. But right now, I can't mount the drive, even as read-only. The file system itself is buffaloed so I can't see files on it.

I appreciate all your help by the way.

seth · 2024-05-14 15:25:43

Firmware patch (in case that's a device bug) or vendor-specific reset tool.
There's no way to fix this on the OS level.
Also, you might want to inspect the NVMe to see if anything looks loose or fused.

aviator07 · 2024-05-14 15:39:16

I'm in the process of running ddrescue on it right now, and creating a *.iso of the drive. Not sure if it will succeed or not, but it is 60% through with no errors now. Supposing I do get a good image, I should be able to just dd this to a new M2 and be good right?

seth · 2024-05-14 16:29:31

Right.
NAND typically only dies on writing, the data remains very stable for reading.
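
A rough sketch of the rescue-and-restore flow discussed above, assuming GNU ddrescue is installed. The function names and the target device /dev/nvme1n1 in the comments are hypothetical; only /dev/nvme0n1 comes from the thread. The output is a raw disk image, so the file extension (.iso, .img) is only a name. The second function is one way to check a restored copy: it compares the first image-sized stretch of the target against the image.

```shell
#!/bin/sh
# rescue_image DEV IMG MAP
# Read DEV into IMG with GNU ddrescue. MAP is ddrescue's mapfile, which
# records progress so an interrupted run can be resumed. -d requests direct
# access to the input, bypassing the kernel cache.
rescue_image() {
    ddrescue -d "$1" "$2" "$3"
}

# verify_restore IMG TARGET
# After IMG has been written to TARGET (which may be larger than IMG),
# succeed only if the first size-of-IMG bytes of TARGET are identical to IMG.
verify_restore() {
    size=$(wc -c < "$1" | tr -d ' ')
    head -c "$size" "$2" | cmp -s - "$1"
}

# Possible sequence (device names other than nvme0n1 are assumptions):
#   rescue_image /dev/nvme0n1 nvme0n1.img nvme0n1.map
#   dd if=nvme0n1.img of=/dev/nvme1n1 bs=1048576 && sync
#   verify_restore nvme0n1.img /dev/nvme1n1 && echo "restore matches image"
```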

Arch Linux
