Truncated files after upgrade. Everything's broken. Faulty BTRFS? SSD?

nze · 2022-09-03 08:13:37

The arch install on my laptop is badly broken, and I can't seem to find the source of the problem. I have observed the following issues:

(1) Installing upgrades via yay or pacman leads to some files being truncated. For instance, upgrading curl two days ago led to /usr/lib/libcurl.so.4 missing its ELF header. As a result, any program relying on that dynamic library (such as pacman itself) died with an error. I restored the file from a backup and pacman worked again. I reinstalled curl, and this time it installed correctly. However, many other files seem to be affected, some having size 0 and others just being invalid.

(2) Emacs recently showed a "write error: input/output failure" when I tried to save a file.
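To get a rough list of libraries damaged the way libcurl.so.4 was in (1), something like the following could be run. This is only a sketch: the function name `find_suspect_libraries` and the filename filter are my own assumptions, and it only flags shared objects that are empty or do not start with the ELF magic bytes. Some legitimate `.so` files are plain-text linker scripts, so hits are hints, not proof of corruption.

```python
import os

ELF_MAGIC = b"\x7fELF"  # first four bytes of every valid ELF file


def find_suspect_libraries(root):
    """Return (path, reason) pairs for shared objects that look damaged.

    A file is reported if it is empty, unreadable, or does not start
    with the ELF magic. Symlinks are skipped; their targets are checked
    when the walk reaches them.
    """
    suspects = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            # Assumption: shared objects are named foo.so or foo.so.N[.M...]
            if not (name.endswith(".so") or ".so." in name):
                continue
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            try:
                if os.path.getsize(path) == 0:
                    suspects.append((path, "empty"))
                    continue
                with open(path, "rb") as fh:
                    head = fh.read(4)
            except OSError as exc:
                # An I/O error while reading is itself worth knowing about.
                suspects.append((path, f"unreadable: {exc}"))
                continue
            if head != ELF_MAGIC:
                suspects.append((path, "no ELF magic"))
    return suspects
```

Calling `find_suspect_libraries("/usr/lib")` and printing the result gives a starting list of files to compare against the backup or to reinstall.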

As of now, I'm not using the broken system for fear of further data corruption. I'd much rather locate and fix the issue than do a full reinstall, but I'm not sure how to debug the situation.

My first suspicion was that there's something wrong with BTRFS, but the kernel logs did not show any errors. However, `btrfs device stats /` reports 270 "corruption errors" on `/dev/mapper/mure`. My system is installed on BTRFS on top of LUKS. To check for errors, I ran `btrfs scrub` but it reported no errors.

The good folks over at #btrfs@libera.chat recommended I check my laptop's RAM, but after running memtest86 all night (all 12 tests, 4 passes), there seem to be no RAM errors. I also did Lenovo's UEFI extended hardware checks, and both the CPU and the SSD claim to be working fine.

I don't know how to further debug this. I suppose it could be a hardware issue. I have some free disk space on the SSD, so I can make a fresh (btrfs? ext4?) partition and see if I can recreate these write errors there. I do however suspect that the issue is a corrupt btrfs partition, but I don't know how to test for this. Is there a disk test utility that I could use for this?

Addendum: `btrfs check` does report errors:

sudo btrfs check --progress --readonly /dev/mapper/broken
Opening filesystem to check...
Checking filesystem on /dev/mapper/broken
UUID: 8c32182c-8756-467f-82fa-bae3cdc96d17
[1/7] checking root items                      (0:00:02 elapsed, 3222796 items checked)
[2/7] checking extents                         (0:00:20 elapsed, 328669 items checked)
[3/7] checking free space cache                (0:00:01 elapsed, 408 items checked)
root 473 inode 55308395 errors 200, dir isize wrong
root 473 inode 79287178 errors 200, dir isize wrong
root 473 inode 104276240 errors 200, dir isize wrong
        unresolved ref dir 104276240 index 3 namelen 44 name {761acf31-4dbb-4c10-9553-44cece8009a9}.final filetype 1 errors 2, no dir index
root 473 inode 104276263 errors 200, dir isize wrong
        unresolved ref dir 104276263 index 3 namelen 44 name {fae6ec81-e0c0-44eb-966c-c5eab502e0b9}.final filetype 1 errors 2, no dir index
        unresolved ref dir 79287178 index 7 namelen 44 name {69e7c713-45fd-4fec-89c0-0a60de73b31a}.final filetype 1 errors 2, no dir index
        unresolved ref dir 55308395 index 11 namelen 44 name {7b47a5fd-0e26-4a3d-81ee-fb772f3d246d}.final filetype 1 errors 2, no dir index
root 1415 inode 55308395 errors 200, dir isize wrong
root 1415 inode 79287178 errors 200, dir isize wrong
root 1415 inode 104276240 errors 200, dir isize wrong
        unresolved ref dir 104276240 index 3 namelen 44 name {761acf31-4dbb-4c10-9553-44cece8009a9}.final filetype 1 errors 2, no dir index
root 1415 inode 104276263 errors 200, dir isize wrong
        unresolved ref dir 104276263 index 3 namelen 44 name {fae6ec81-e0c0-44eb-966c-c5eab502e0b9}.final filetype 1 errors 2, no dir index
        unresolved ref dir 79287178 index 7 namelen 44 name {69e7c713-45fd-4fec-89c0-0a60de73b31a}.final filetype 1 errors 2, no dir index
        unresolved ref dir 55308395 index 11 namelen 44 name {7b47a5fd-0e26-4a3d-81ee-fb772f3d246d}.final filetype 1 errors 2, no dir index
[4/7] checking fs roots                        (0:00:18 elapsed, 278926 items checked)
ERROR: errors found in fs roots
found 414572195840 bytes used, error(s) found
total csum bytes: 344833296
total tree bytes: 5384552448
total fs tree bytes: 4586602496
total extent tree bytes: 349339648
btree space waste bytes: 917060606
file data blocks allocated: 21566957367296
 referenced 510951297024

Unfortunately the output is rather unintelligible to me, and btrfs-check(8) comes with a big fat warning:

Do not use --repair unless you are advised to do so by a developer or an experienced user.

Last edited by nze (2022-09-03 09:24:35)

jasonwryan · 2022-09-03 08:23:36

https://wiki.archlinux.org/title/S.M.A.R.T.#smartctl

nze · 2022-09-03 09:29:37

jasonwryan wrote:

https://wiki.archlinux.org/title/S.M.A.R.T.#smartctl

Looks fine to my uneducated eyes:

philipp@mure ~ % sudo smartctl -t long /dev/nvme0
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-5.17.1-arch1-1] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

NVMe device successfully opened

Use 'smartctl -a' (or '-x') to print SMART (and more) information

philipp@mure ~ % sudo smartctl -a /dev/nvme0
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-5.17.1-arch1-1] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       SAMSUNG MZVL22T0HBLB-00BL7
Serial Number:                      S64SNF0R310450
Firmware Version:                   8L2QGXB7
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 2,048,408,248,320 [2.04 TB]
Unallocated NVM Capacity:           0
Controller ID:                      6
NVMe Version:                       1.3
Number of Namespaces:               1
Namespace 1 Size/Capacity:          2,048,408,248,320 [2.04 TB]
Namespace 1 Utilization:            1,499,718,754,304 [1.49 TB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            002538 b311b70ca6
Local Time is:                      Sat Sep  3 11:27:26 2022 CEST
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x0057):     Comp Wr_Unc DS_Mngmt Sav/Sel_Feat Timestmp
Log Page Attributes (0x0e):         Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg
Maximum Data Transfer Size:         128 Pages
Warning  Comp. Temp. Threshold:     76 Celsius
Critical Comp. Temp. Threshold:     87 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     8.37W       -        -    0  0  0  0        0       0
 1 +     8.37W       -        -    1  1  1  1        0     200
 2 +     8.37W       -        -    2  2  2  2        0     200
 3 -   0.0500W       -        -    3  3  3  3     2000    1200
 4 -   0.0050W       -        -    4  4  4  4      500    9500

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        41 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    25,410,977 [13.0 TB]
Data Units Written:                 47,329,799 [24.2 TB]
Host Read Commands:                 154,211,372
Host Write Commands:                447,024,602
Controller Busy Time:               2,808
Power Cycles:                       720
Power On Hours:                     483
Unsafe Shutdowns:                   225
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               41 Celsius
Temperature Sensor 2:               38 Celsius

Error Information (NVMe Log 0x01, 16 of 64 entries)
No Errors Logged

Rem.: I added the output of `btrfs check` to the original post.

graysky · 2022-09-03 09:35:35

The fact that memtest86 did not detect anything is good, but did journalctl or dmesg indicate any hardware issues? If not, I would say ditch btrfs. It was nothing but trouble for me a few years ago: data loss after an unclean shutdown, etc. I went back to ext4.

nze · 2022-09-03 11:17:36

No hardware errors in dmesg/journalctl. What do I do now? Create a new partition and copy everything over, and hope for the best? Surely there must be a way to test if the fault comes from the partition, by doing a bunch of writes and integrity checks that I could run on a new partition before copying everything?
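The kind of write-and-verify test I have in mind could be sketched roughly like this. It is not an established tool; the function names, the `probe_NNNN.bin` file naming and the default sizes are all made up for illustration. It writes files of random bytes to a directory on the partition under test, records a SHA-256 for each, and later re-reads them to look for truncation or corruption.

```python
import hashlib
import os


def write_test_files(directory, count=16, size=1 << 20):
    """Write `count` files of `size` random bytes; return {path: sha256 hex}."""
    os.makedirs(directory, exist_ok=True)
    expected = {}
    for i in range(count):
        path = os.path.join(directory, f"probe_{i:04d}.bin")  # hypothetical naming
        data = os.urandom(size)
        with open(path, "wb") as fh:
            fh.write(data)
            fh.flush()
            os.fsync(fh.fileno())  # ask the kernel to push the data to the device
        expected[path] = hashlib.sha256(data).hexdigest()
    return expected


def verify_test_files(expected):
    """Re-read every file and return a list of (path, problem) pairs."""
    failures = []
    for path, digest in expected.items():
        try:
            with open(path, "rb") as fh:
                actual = hashlib.sha256(fh.read()).hexdigest()
        except OSError as exc:
            failures.append((path, f"read error: {exc}"))
            continue
        if actual != digest:
            failures.append((path, "checksum mismatch"))
    return failures
```

One caveat: verifying right after writing will mostly read from the page cache rather than the SSD, so for a meaningful result the checksums would have to be saved somewhere safe and the verification run after a remount or reboot.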
