[SOLVED] xfs_repair doesn't work

L1ghtmareI · 2017-02-20 18:19:47

Today, out of the blue, my login manager wouldn't start and I got these messages in the system log:

-- Logs begin at Sat 2016-12-31 18:52:12 MSK, end at Mon 2017-02-20 21:12:19 MSK. --
Feb 20 21:10:47  kernel: ffff88040e8bc030: 58 67 db ca 2a 3a dd b8 00 00 00 00 00 00 00 00  Xg..*:..........
Feb 20 21:10:47  kernel: XFS (sda1): Internal error xfs_iread at line 514 of file fs/xfs/libxfs/xfs_inode_buf.c.  Caller xfs_iget+0x2b1/0x940 [xfs]
Feb 20 21:10:47  kernel: XFS (sda1): Corruption detected. Unmount and run xfs_repair
Feb 20 21:10:47  kernel: XFS (sda1): xfs_iread: validation failed for inode 34110192 failed
Feb 20 21:10:47  kernel: ffff88040e8bc000: 49 4e a1 ff 03 01 00 00 00 00 00 00 00 00 00 00  IN..............
Feb 20 21:10:47  kernel: ffff88040e8bc010: 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 00  ................
Feb 20 21:10:47  kernel: ffff88040e8bc020: 58 aa 04 b8 2e e3 65 3a 57 41 fe 12 00 00 00 00  X.....e:WA......
Feb 20 21:10:47  kernel: ffff88040e8bc030: 58 67 db ca 2a 3a dd b8 00 00 00 00 00 00 00 00  Xg..*:..........
Feb 20 21:10:47  kernel: XFS (sda1): Internal error xfs_iread at line 514 of file fs/xfs/libxfs/xfs_inode_buf.c.  Caller xfs_iget+0x2b1/0x940 [xfs]
Feb 20 21:10:47  kernel: XFS (sda1): Corruption detected. Unmount and run xfs_repair

I tried running xfs_repair -v after booting from an Arch live USB, and xfs_repair -d after remounting /dev/sda1 read-only, to no avail: neither run seems to find any issues.

/dev/sda1 is my root partition.

Thanks in advance.

Edit: the drive in question is an SSD, by the way, and I have fstrim enabled.

Edit 2: a patch for xfsprogs has been devised; see the last post.

Last edited by L1ghtmareI (2017-02-26 12:41:50)

mich41 · 2017-02-20 19:00:53

There is some chance that it was a random bit flip in RAM (overnight memcheck perhaps?) or that the corrupted inode has since then been deleted. And of course, triple check that you are running xfs_repair on the right disk.

Maybe mount it ro and run

find /mnt/whatever -type f -exec dd if={} of=/dev/null status=none \;

This should locate any unreadable files, if any still exist, and make the same message show up in dmesg again.

And maybe post the disk's SMART parameters. I don't really believe any modern disk would respond to a read request with corrupted data (they should be able to detect and report errors instead), but there's no harm in checking.

L1ghtmareI · 2017-02-20 20:12:23

I'm no expert, but this command has been running for 50 minutes now.

This is an SSD, by the way.

L1ghtmareI · 2017-02-20 20:20:49

SMART test:

smartctl 6.5 2016-05-07 r4318 [x86_64-linux-4.9.8-1-ARCH] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     SPCC Solid State Disk
Serial Number:    004270011E
LU WWN Device Id: 0 000000 000000000
Firmware Version: 560ABBF0
User Capacity:    240,057,409,536 bytes [240 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ATA8-ACS, ACS-2 T13/2015-D revision 3
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Feb 20 23:19:59 2017 MSK
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x02)	Offline data collection activity
					was completed without error.
					Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(    0) seconds.
Offline data collection
capabilities: 			 (0x7d) SMART execute Offline immediate.
					No Auto Offline data collection support.
					Abort Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   1) minutes.
Extended self-test routine
recommended polling time: 	 (  48) minutes.
Conveyance self-test routine
recommended polling time: 	 (   2) minutes.
SCT capabilities: 	       (0x0025)	SCT Status supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x0032   095   095   050    Old_age   Always       -       171001485
  5 Reallocated_Sector_Ct   0x0033   099   099   003    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   089   089   000    Old_age   Always       -       10386 (169 82 0)
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       1250
171 Unknown_Attribute       0x000a   100   100   000    Old_age   Always       -       0
172 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
174 Unknown_Attribute       0x0030   000   000   000    Old_age   Offline      -       183
177 Wear_Leveling_Count     0x0000   000   000   000    Old_age   Offline      -       99
181 Program_Fail_Cnt_Total  0x000a   100   100   000    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0012   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   030   030   000    Old_age   Always       -       30 (Min/Max 30/30)
195 Hardware_ECC_Recovered  0x001c   120   120   000    Old_age   Offline      -       171001485
196 Reallocated_Event_Count 0x0033   099   099   003    Pre-fail  Always       -       0
201 Unknown_SSD_Attribute   0x001c   120   120   000    Old_age   Offline      -       171001485
204 Soft_ECC_Correction     0x001c   120   120   000    Old_age   Offline      -       171001485
230 Unknown_SSD_Attribute   0x0013   100   100   000    Pre-fail  Always       -       100
231 Temperature_Celsius     0x0013   100   100   010    Pre-fail  Always       -       8589934592
233 Media_Wearout_Indicator 0x0032   000   000   000    Old_age   Always       -       7967
234 Unknown_Attribute       0x0032   000   000   000    Old_age   Always       -       6905
241 Total_LBAs_Written      0x0032   000   000   000    Old_age   Always       -       6905
242 Total_LBAs_Read         0x0032   000   000   000    Old_age   Always       -       7108

SMART Error Log not supported

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     10386         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

L1ghtmareI · 2017-02-20 21:58:31

Memtester came back all green, though it couldn't test the 800 MB the OS itself was running in.

mich41 · 2017-02-21 16:39:27

L1ghtmareI wrote:

I'm no expert, but this command has been running for 50 minutes now.

Has it finished yet?

Yes, it's slow because it reads all files - that's how I thought we could find the corrupt file (if any).

SMART seems OK, no obvious problems here.

L1ghtmareI · 2017-02-21 17:47:17

It's actually been running this entire time (~20 hours) because I remounted back to root.

It threw a bunch of messages like

 dd: error reading '/sys/kernel/*': Function not implemented/Invalid argument/No such device

and also

INFO: task kworker/0:2:2440 blocked for more than 20 seconds.
Tainted: P    0    4.9.8-1-ARCH #1

Should I stop it now and paste the complete output?

mich41 · 2017-02-21 18:41:08

Well, I thought about running it from the live USB on a filesystem mounted somewhere in /mnt. If you ran it on / and it goes through /sys, /proc and everything, then hell knows how long it will take, and I bet it'll encounter many errors or just hang. Only errors on files from the suspicious XFS filesystem are relevant. And, in case this isn't obvious, it needs to run as root or there will be many "permission denied" errors.

Not sure what this blocked task is about, but check dmesg for noise from the XFS driver.

If you want to run it on / of the currently running OS, use

find / -xdev -type f ... and so on
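Spelled out in full, that command could look like the sketch below. It assumes GNU find and dd, and the function name scan_unreadable is made up for illustration. It reads every regular file on a single filesystem and prints only the paths that could not be read:

```shell
# Read every regular file under "$1" and print the paths that fail to read.
# -xdev keeps find on one filesystem, so /proc, /sys and other mounts are skipped.
scan_unreadable() {
    find "$1" -xdev -type f -exec sh -c '
        for f in "$@"; do
            # dd exits non-zero on a read error; its own messages are hidden
            dd if="$f" of=/dev/null bs=1M status=none 2>/dev/null \
                || printf "%s\n" "$f"
        done
    ' sh {} +
}
```

Run it as root, for example `scan_unreadable /mnt/whatever` from the live USB or `scan_unreadable /` on the running system, then look through dmesg for fresh XFS messages.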

Last edited by mich41 (2017-02-21 18:43:02)

L1ghtmareI · 2017-02-21 18:51:07

I just did it the way you explained; it took 2 minutes and found nothing.

mich41 · 2017-02-21 19:07:51

And no XFS errors in dmesg? Then one more thing to try is downgrading xfsprogs to the oldest version you can find in /var/cache/pacman/pkg and also trying the latest version available in repo. If neither version of xfs_repair can find the corruption and all files are cleanly readable (as determined above) then I guess the corruption is somehow gone. Dunno what the problem was.
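A rough sketch of picking the oldest cached package, assuming GNU sort (for -V) and the usual pacman file naming; the helper name oldest_cached_pkg is hypothetical, and sort -V only approximates pacman's own version comparison:

```shell
# Print the oldest cached package file for a given package name.
# $1: package name, $2: cache directory (defaults to pacman's usual cache)
oldest_cached_pkg() {
    find "${2:-/var/cache/pacman/pkg}" -maxdepth 1 -type f \
        -name "$1-[0-9]*.pkg.tar.*" ! -name '*.sig' \
        | sort -V | head -n 1
}
```

As root, `pacman -U "$(oldest_cached_pkg xfsprogs)"` would then install that version, and `pacman -S xfsprogs` brings back the one from the repo.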

L1ghtmareI · 2017-02-21 19:46:02

It still throws the messages from the OP every time I try to upgrade something with pacman, for example.

It seems kinda strange that, despite them popping up, it's only the login manager that fails to start, supposedly after the boot process has completed, and yet I'm able to log in on another tty just fine. It also says that the startx command can't be found. Perhaps some of the X-related files were corrupted (and missed by 'find' for some reason)?
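One way to chase that suspicion is to start from the inode number the kernel prints and search for it by number. Note that the find command used so far has -type f, which visits regular files only, so symlinks and directories are never read. The sketch below assumes the log format shown in the first post; the function name corrupt_inodes is made up for illustration:

```shell
# Read kernel log text on stdin and print each inode number reported
# by "xfs_iread: validation failed", once per inode.
corrupt_inodes() {
    sed -n 's/.*xfs_iread: validation failed for inode \([0-9][0-9]*\).*/\1/p' \
        | sort -u
}
```

Usage would be `dmesg | corrupt_inodes`, followed by something like `find /mnt/whatever -xdev -inum 34110192` on the read-only mount. Whether find can report the name without tripping over the damaged inode itself is not guaranteed; an error at that point may still hint at the directory involved.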

mich41 · 2017-02-21 20:03:53

And xfs_repair still says everything is right? Now that looks like an XFS bug indeed. I guess all you can do is go upstream - see here and email the linux-xfs group.

It may be a bug in xfs_repair or in the kernel driver. I presume the live USB uses some older kernel. What happens if you boot it, chroot into this system and run pacman inside the chroot, i.e. under the live USB kernel?

Oh, and what kernel version do you have so I know to avoid it for now?

Last edited by mich41 (2017-02-21 20:04:31)

L1ghtmareI · 2017-02-21 20:07:12

4.9.8-1
Doing a full upgrade right now, will report back in a few minutes.
xfsprogs wasn't updated though, so there's little hope.

L1ghtmareI · 2017-02-21 22:24:24

OK, it didn't work.

How do I report this as a bug? Do I just message the guy and link this thread?

Also, regarding restoring the system: will it work if I back the files up, remake the filesystem and load them back in?

mich41 · 2017-02-22 08:02:56

L1ghtmareI wrote:

OK, it didn't work.

Full upgrade? So what exactly happened? I thought you said pacman doesn't work at all.

L1ghtmareI wrote:

How do I report this as a bug? Do I just message the guy and link this thread?

Not a guy but a mailing list. The email address is in my link, along with an archive if you'd like to see how things are normally done there. As for reporting bugs, nobody really enjoys reading long forum threads, so you'd do better to just describe what the problem is: namely, your kernel reports corruption but xfs_repair fails to find it. Paste this log and make it clear that the problem is persistent because it happens every time you run the package manager. Hopefully someone will figure out how to find what's wrong.

L1ghtmareI wrote:

Also, regarding restoring the system: will it work if I back the files up, remake the filesystem and load them back in?

In principle - yes, this should produce a clean filesystem. But if there are bugs, they may screw things up again.

Last edited by mich41 (2017-02-22 08:04:06)

L1ghtmareI · 2017-02-22 09:33:45

Pacman works; it just throws messages similar to what's seen in the boot log, with no apparent consequences.

L1ghtmareI · 2017-02-26 12:40:29

It's been solved upstream, here's the patch:

diff --git a/repair/dinode.c b/repair/dinode.c
index 8d01409..d664f87 100644
--- a/repair/dinode.c
+++ b/repair/dinode.c
@@ -1385,6 +1385,11 @@ process_symlink(
                return(1);
        }

+       if (be64_to_cpu(dino->di_size) == 0) {
+               do_warn(_("zero size symlink in inode %" PRIu64 "\n"), lino);
+               return 1;
+       }
+
        /*
         * have to check symlink component by component.
         * get symlink contents into data area

Applying it required passing the `--ignore-whitespace` option to patch. After that I just had to run `xfs_repair -d`, in accordance with the wiki.
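For anyone repeating this, here is a minimal sketch of the patching step, assuming GNU patch, an unpacked xfsprogs source tree, and the diff above saved to a file. The function name apply_patch_loose and the paths in the usage line are hypothetical:

```shell
# Apply a patch whose whitespace was mangled (e.g. tabs turned into spaces
# when copied from a web page).
# $1: top of the source tree, $2: patch file
apply_patch_loose() {
    # -l (--ignore-whitespace) lets runs of blanks in the patch match any
    # run of blanks in the file; the dry run checks it applies before
    # anything is written.
    patch -d "$1" -p1 -l --dry-run < "$2" > /dev/null || return 1
    patch -d "$1" -p1 -l < "$2"
}
```

For example, `apply_patch_loose ~/src/xfsprogs ~/zero-size-symlink.patch`, after which xfsprogs has to be rebuilt so that the patched xfs_repair is the one being run.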

The patch will likely make it into the next xfsprogs release.

Since the file in question for me was /usr/lib/libxcb-randr.so.0, restoring functionality was a matter of reinstalling libxcb.

Kudos to mich41 for huge help.
