HDD appears to have suddenly corrupted a second time.

DamiDoop · 2024-11-10 20:09:10

I've been having trouble with one of the HDDs in my desktop. Once before, the drive appeared to have suddenly become corrupted and could no longer be mounted. When this initially happened, I couldn't find a clear solution, so I recovered whatever files I could and then reformatted the drive, which made it function correctly again. I also replaced the SATA cable connected to the drive, as it was the only potential point of failure I could identify that might have caused the corruption.

Unfortunately, after about 2 months of the drive working correctly again, the exact same thing has happened, and the drive is now failing to mount again with the same errors, which makes me concerned that something more serious may be wrong here.

Logs from dmesg that appeared after I tried to mount the drive:

[51574.020492] EXT4-fs warning (device sda1): ext4_clear_journal_err:6271: Filesystem error recorded from previous mount: IO failure
[51574.094258] EXT4-fs warning (device sda1): ext4_clear_journal_err:6279: Marked fs in need of filesystem check.
[51574.124451] EXT4-fs (sda1): warning: mounting fs with errors, running e2fsck is recommended
[51574.469415] EXT4-fs (sda1): recovery complete
[51574.469459] EXT4-fs (sda1): mounted filesystem 5864842b-0bc1-400b-a8c5-f06acf53c1d3 r/w with ordered data mode. Quota mode: none.
[51574.982438] EXT4-fs error (device sda1): ext4_validate_block_bitmap:423: comm ext4lazyinit: bg 40: bad block bitmap checksum
[51574.982447] Aborting journal on device sda1-8.
[51575.083164] EXT4-fs (sda1): Remounting filesystem read-only
[51881.614547] EXT4-fs (sda1): error count since last fsck: 2
[51881.614552] EXT4-fs (sda1): initial error at time 1731101486: ext4_validate_block_bitmap:423
[51881.614556] EXT4-fs (sda1): last error at time 1731262821: ext4_validate_block_bitmap:423
[55891.088498] EXT4-fs (sda1): unmounting filesystem 5864842b-0bc1-400b-a8c5-f06acf53c1d3.
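The "initial error at time" and "last error at time" values in the log appear to be Unix epoch seconds, so converting them shows when the filesystem first recorded the problem. A minimal sketch, assuming that interpretation; the function name `epoch_to_utc` is only illustrative:

```python
from datetime import datetime, timezone

# Epoch values copied from the "initial error" and "last error" dmesg lines.
# Assumption: they are seconds since the Unix epoch, in UTC.
ERROR_TIMES = {
    "initial error": 1731101486,
    "last error": 1731262821,
}


def epoch_to_utc(seconds: int) -> str:
    """Format a Unix timestamp as an ISO 8601 string in UTC."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc).isoformat()


for label, seconds in ERROR_TIMES.items():
    print(f"{label}: {epoch_to_utc(seconds)}")
```

If that reading is right, the first recorded error falls on 2024-11-08 and the last on 2024-11-10 (both UTC), so the damage was first noticed about two days before the failed mount shown here.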

Since dmesg advised running a filesystem check, I attempted to run one with sudo fsck /dev/sda, but got the following output:

fsck from util-linux 2.40.2
e2fsck 1.47.1 (20-May-2024)
ext2fs_open2: Bad magic number in super-block
fsck.ext2: Superblock invalid, trying backup blocks...
fsck.ext2: Bad magic number in super-block while trying to open /dev/sda

The superblock could not be read or does not describe a valid ext2/ext3/ext4
filesystem.  If the device is valid and it really contains an ext2/ext3/ext4
filesystem (and not swap or ufs or something else), then the superblock
is corrupt, and you might try running e2fsck with an alternate superblock:
    e2fsck -b 8193 <device>
 or
    e2fsck -b 32768 <device>

/dev/sda contains `DOS/MBR boot sector MS-MBR Windows 7 english at offset 0x163 "Invalid partition table" at offset 0x17b "Error loading operating system" at offset 0x19a "Missing operating system"; partition 1 : ID=0xee, start-CHS (0x0,0,2), end-CHS (0x3ff,255,63), startsector 1, 4294967295 sectors' data

Trying the alternative superblocks suggested by the output produces the exact same error.
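One thing worth noting about the output above: it describes /dev/sda as having a single partition of type 0xee, which is the type used for a GPT protective MBR. That suggests /dev/sda is the whole disk, while the ext4 filesystem named in dmesg lives on sda1, so a check pointed at /dev/sda would probably not find an ext4 superblock whatever state the filesystem is in. A rough sketch of reading the partition type out of that description; the helper names are hypothetical and the string is only an excerpt of the fsck output:

```python
import re

# Excerpt of the device description printed by fsck for /dev/sda.
DESCRIPTION = (
    "partition 1 : ID=0xee, start-CHS (0x0,0,2), end-CHS (0x3ff,255,63), "
    "startsector 1, 4294967295 sectors"
)

GPT_PROTECTIVE_TYPE = 0xEE  # MBR partition type reserved for a GPT protective entry


def partition_ids(description: str) -> list[int]:
    """Return every MBR partition type ID found in the description."""
    return [int(m, 16) for m in re.findall(r"ID=0x([0-9a-fA-F]+)", description)]


def looks_like_protective_mbr(description: str) -> bool:
    """True when the only partition entry listed has type 0xee."""
    return partition_ids(description) == [GPT_PROTECTIVE_TYPE]


if looks_like_protective_mbr(DESCRIPTION):
    print("Whole-disk device with a GPT protective MBR; the filesystem is on a partition.")
```

Under that assumption, the filesystem check would need to target the partition (sda1 in the dmesg lines) rather than the disk device.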

I've also checked the SMART stats for the drive, and they are as follows:

Device Statistics (GP Log 0x04)
Page  Offset Size        Value Flags Description
0x01  =====  =               =  ===  == General Statistics (rev 1) ==
0x01  0x008  4          390556  ---  Lifetime Power-On Resets
0x01  0x010  4           11463  ---  Power-on Hours
0x01  0x018  6     55443095249  ---  Logical Sectors Written
0x01  0x020  6       255264557  ---  Number of Write Commands
0x01  0x028  6     76132885507  ---  Logical Sectors Read
0x01  0x030  6       256803643  ---  Number of Read Commands
0x01  0x038  6      2612094336  ---  Date and Time TimeStamp
0x03  =====  =               =  ===  == Rotating Media Statistics (rev 1) ==
0x03  0x008  4           11247  ---  Spindle Motor Power-on Hours
0x03  0x010  4            7721  ---  Head Flying Hours
0x03  0x018  4          406235  ---  Head Load Events
0x03  0x020  4               0  ---  Number of Reallocated Logical Sectors
0x03  0x028  4               1  ---  Read Recovery Attempts
0x03  0x030  4               5  ---  Number of Mechanical Start Failures
0x03  0x038  4               0  ---  Number of Realloc. Candidate Logical Sectors
0x03  0x040  4          389846  ---  Number of High Priority Unload Events
0x04  =====  =               =  ===  == General Errors Statistics (rev 1) ==
0x04  0x008  4               0  ---  Number of Reported Uncorrectable Errors
0x04  0x010  4              11  ---  Resets Between Cmd Acceptance and Completion
0x05  =====  =               =  ===  == Temperature Statistics (rev 1) ==
0x05  0x008  1              26  ---  Current Temperature
0x05  0x010  1              24  ---  Average Short Term Temperature
0x05  0x018  1              27  ---  Average Long Term Temperature
0x05  0x020  1              44  ---  Highest Temperature
0x05  0x028  1              15  ---  Lowest Temperature
0x05  0x030  1              40  ---  Highest Average Short Term Temperature
0x05  0x038  1              21  ---  Lowest Average Short Term Temperature
0x05  0x040  1              34  ---  Highest Average Long Term Temperature
0x05  0x048  1              25  ---  Lowest Average Long Term Temperature
0x05  0x050  4               0  ---  Time in Over-Temperature
0x05  0x058  1              60  ---  Specified Maximum Operating Temperature
0x05  0x060  4               0  ---  Time in Under-Temperature
0x05  0x068  1               0  ---  Specified Minimum Operating Temperature
0x06  =====  =               =  ===  == Transport Statistics (rev 1) ==
0x06  0x008  4          403495  ---  Number of Hardware Resets
0x06  0x010  4           11977  ---  Number of ASR Events
0x06  0x018  4               0  ---  Number of Interface CRC Errors
0xff  =====  =               =  ===  == Vendor Specific Statistics (rev 1) ==
0xff  0x008  7               0  ---  Vendor Specific
0xff  0x010  7               0  ---  Vendor Specific
0xff  0x018  7               0  ---  Vendor Specific
                                |||_ C monitored condition met
                                ||__ D supports DSN
                                |___ N normalized value

and there are no warnings from smartctl about values being bad. Running a SMART test also returned no errors.

I'm not really sure what to do from here, given that clearly the cause of the drive's corruption is still present, or it wouldn't have corrupted for a second time. The one other HDD and my 2 SSDs have had no issues whatsoever. Any suggestions for what to look at?

xerxes_ · 2024-11-11 12:23:39

Post the full output of these commands (if you can't boot, run them from archiso; if you're in a root terminal, omit sudo):

lsblk -f
sudo file -s /dev/sda{,1,2,3}
sudo fdisk -l /dev/sda
sudo parted -l
sudo smartctl -x /dev/sda
sudo journalctl -b

How to upload text?

Last edited by xerxes_ (2024-11-11 12:28:11)

Arisa · 2024-11-11 15:41:33

0x01 0x008 4 390556 --- Lifetime Power-On Resets
0x06 0x008 4 403495 --- Number of Hardware Resets

These are insanely high values, and this is your issue. Why is the drive losing power so often?

edit:
Don't use the disk anymore. Either the disk itself is faulty, or the SATA/power port or the SATA/power cable is.

edit2: It could also be a bug in some power-saving feature, such as TLP, but that's unlikely.

Last edited by Arisa (2024-11-11 15:49:57)
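To put the reset counts in perspective, they can be divided by the drive's power-on hours from the same SMART table. This is only a rough normalisation, assuming all counters cover the same lifetime; the helper name `events_per_hour` is illustrative:

```python
# Values copied from the Device Statistics table posted earlier in the thread.
POWER_ON_HOURS = 11463
COUNTERS = {
    "Lifetime Power-On Resets": 390556,
    "Number of Hardware Resets": 403495,
    "Head Load Events": 406235,
    "Number of High Priority Unload Events": 389846,
}


def events_per_hour(count: int, hours: int) -> float:
    """Average number of events per power-on hour."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    return count / hours


for name, count in COUNTERS.items():
    rate = events_per_hour(count, POWER_ON_HOURS)
    print(f"{name}: {rate:.1f} per power-on hour")
```

Each of these works out to roughly 34 to 35 events per power-on hour, i.e. about one reset every two minutes on average.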
