so today out of the blue my login manager wouldn't start and i got these messages in system log:
-- Logs begin at Sat 2016-12-31 18:52:12 MSK, end at Mon 2017-02-20 21:12:19 MSK. --
Feb 20 21:10:47 kernel: ffff88040e8bc030: 58 67 db ca 2a 3a dd b8 00 00 00 00 00 00 00 00 Xg..*:..........
Feb 20 21:10:47 kernel: XFS (sda1): Internal error xfs_iread at line 514 of file fs/xfs/libxfs/xfs_inode_buf.c. Caller xfs_iget+0x2b1/0x940 [xfs]
Feb 20 21:10:47 kernel: XFS (sda1): Corruption detected. Unmount and run xfs_repair
Feb 20 21:10:47 kernel: XFS (sda1): xfs_iread: validation failed for inode 34110192 failed
Feb 20 21:10:47 kernel: ffff88040e8bc000: 49 4e a1 ff 03 01 00 00 00 00 00 00 00 00 00 00 IN..............
Feb 20 21:10:47 kernel: ffff88040e8bc010: 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 00 ................
Feb 20 21:10:47 kernel: ffff88040e8bc020: 58 aa 04 b8 2e e3 65 3a 57 41 fe 12 00 00 00 00 X.....e:WA......
Feb 20 21:10:47 kernel: ffff88040e8bc030: 58 67 db ca 2a 3a dd b8 00 00 00 00 00 00 00 00 Xg..*:..........
Feb 20 21:10:47 kernel: XFS (sda1): Internal error xfs_iread at line 514 of file fs/xfs/libxfs/xfs_inode_buf.c. Caller xfs_iget+0x2b1/0x940 [xfs]
Feb 20 21:10:47 kernel: XFS (sda1): Corruption detected. Unmount and run xfs_repair
i tried running xfs_repair -v after booting from an arch LiveUSB, and xfs_repair -d after remounting /dev/sda1 as ro, to no avail: neither seems to find any issues.
/dev/sda1 is my root partition.
Thanks in advance.
edit: the drive in question is an ssd btw, i have fstrim enabled
edit 2: the patch for xfsprogs has been devised, see the last post.
Last edited by L1ghtmareI (2017-02-26 12:41:50)
Offline
There is some chance that it was a random bit flip in RAM (overnight memcheck perhaps?) or that the corrupted inode has since then been deleted. And of course, triple check that you are running xfs_repair on the right disk.
Maybe mount it ro and run
find /mnt/whatever -type f -exec dd if={} of=/dev/null status=none \;
This should locate any unreadable files, if any still exist, and print the same message in dmesg again.
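A minimal sketch of the read scan above, wrapped in a shell function so that the failing paths end up in one list. The function name `scan_unreadable` is hypothetical and not from this thread, and `-xdev` is added here as an assumption to keep `find` on the one filesystem being checked.

```shell
# Hypothetical helper: read every regular file under a mount point and
# print the paths that could not be read in full.
scan_unreadable() {
    root=$1
    # -xdev keeps find on the filesystem that holds "$root"
    find "$root" -xdev -type f -exec sh -c '
        for f in "$@"; do
            # dd reads the whole file; a non-zero exit means the read failed
            dd if="$f" of=/dev/null status=none 2>/dev/null || printf "%s\n" "$f"
        done
    ' sh {} +
}

# Example (assumes the suspect filesystem is mounted read-only at /mnt/whatever):
#   scan_unreadable /mnt/whatever
#   dmesg | grep 'XFS'
```

An empty listing only means every file could be read; the kernel log is still the place to look for new XFS messages.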
And maybe post the disk's SMART parameters. I don't really believe any modern disk would respond to a read request with corrupted data (it should detect and report the error instead), but there's no harm in checking.
Offline
i'm no expert but this command has been running for 50 minutes now
this is an ssd btw
Offline
SMART test:
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-4.9.8-1-ARCH] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Device Model: SPCC Solid State Disk
Serial Number: 004270011E
LU WWN Device Id: 0 000000 000000000
Firmware Version: 560ABBF0
User Capacity: 240,057,409,536 bytes [240 GB]
Sector Size: 512 bytes logical/physical
Rotation Rate: Solid State Device
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ATA8-ACS, ACS-2 T13/2015-D revision 3
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Mon Feb 20 23:19:59 2017 MSK
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x02) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 0) seconds.
Offline data collection
capabilities: (0x7d) SMART execute Offline immediate.
No Auto Offline data collection support.
Abort Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 48) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x0025) SCT Status supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x0032   095   095   050    Old_age   Always   -           171001485
  5 Reallocated_Sector_Ct   0x0033   099   099   003    Pre-fail  Always   -           0
  9 Power_On_Hours          0x0032   089   089   000    Old_age   Always   -           10386 (169 82 0)
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always   -           1250
171 Unknown_Attribute       0x000a   100   100   000    Old_age   Always   -           0
172 Unknown_Attribute       0x0032   100   100   000    Old_age   Always   -           0
174 Unknown_Attribute       0x0030   000   000   000    Old_age   Offline  -           183
177 Wear_Leveling_Count     0x0000   000   000   000    Old_age   Offline  -           99
181 Program_Fail_Cnt_Total  0x000a   100   100   000    Old_age   Always   -           0
182 Erase_Fail_Count_Total  0x0032   100   100   000    Old_age   Always   -           0
187 Reported_Uncorrect      0x0012   100   100   000    Old_age   Always   -           0
194 Temperature_Celsius     0x0022   030   030   000    Old_age   Always   -           30 (Min/Max 30/30)
195 Hardware_ECC_Recovered  0x001c   120   120   000    Old_age   Offline  -           171001485
196 Reallocated_Event_Count 0x0033   099   099   003    Pre-fail  Always   -           0
201 Unknown_SSD_Attribute   0x001c   120   120   000    Old_age   Offline  -           171001485
204 Soft_ECC_Correction     0x001c   120   120   000    Old_age   Offline  -           171001485
230 Unknown_SSD_Attribute   0x0013   100   100   000    Pre-fail  Always   -           100
231 Temperature_Celsius     0x0013   100   100   010    Pre-fail  Always   -           8589934592
233 Media_Wearout_Indicator 0x0032   000   000   000    Old_age   Always   -           7967
234 Unknown_Attribute       0x0032   000   000   000    Old_age   Always   -           6905
241 Total_LBAs_Written      0x0032   000   000   000    Old_age   Always   -           6905
242 Total_LBAs_Read         0x0032   000   000   000    Old_age   Always   -           7108
SMART Error Log not supported
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 10386 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
Offline
Memtester came back all green, although it didn't cover the 800 MB the OS was running on.
Offline
i'm no expert but this command has been running for 50 minutes now
Has it finished yet?
Yes, it's slow because it reads all files - that's how I thought we could find the corrupt file (if any).
SMART seems OK, no obvious problems here.
Offline
its actually been running this entire time (~20 hours) because i remounted back to root.
it threw a bunch of messages like
dd: error reading '/sys/kernel/*': Function not implemented/Invalid argument/No such device
and also
INFO: task kworker/0:2:2440 blocked for more than 20 seconds.
Tainted: P 0 4.9.8-1-ARCH #1
Should i stop it now and paste the complete output?
Offline
Well, I thought about running it from a live USB on a filesystem mounted somewhere in /mnt. If you ran it on / and it goes through /sys, /proc and everything else, then hell knows how long it will take, and I bet it'll encounter many errors or just hang. Only errors on files from the suspicious XFS filesystem are relevant. And, in case this isn't obvious, it needs to run as root or there will be many "permission denied" errors.
Not sure what this blocked task is about, but check dmesg for noise from the XFS driver.
If you want to run it on / of currently running OS, use
find / -xdev -type f ... and so on
Last edited by mich41 (2017-02-21 18:43:02)
Offline
i just did it how you explained, took 2 minutes and found nothing.
Offline
And no XFS errors in dmesg? Then one more thing to try is downgrading xfsprogs to the oldest version you can find in /var/cache/pacman/pkg and also trying the latest version available in repo. If neither version of xfs_repair can find the corruption and all files are cleanly readable (as determined above) then I guess the corruption is somehow gone. Dunno what the problem was.
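A rough sketch of how the cached xfsprogs builds could be listed before picking one to downgrade to. The helper name `list_cached_pkgs` is hypothetical; it assumes GNU `sort -V`, which only approximates pacman's own version ordering, and the default pacman cache directory named above.

```shell
# Hypothetical helper: list cached package files for one package name,
# oldest version first (as ordered by GNU sort -V).
list_cached_pkgs() {
    pkg=$1
    cache=${2:-/var/cache/pacman/pkg}
    # Match "<pkg>-<version starting with a digit>..." and skip signature files
    for f in "$cache/$pkg"-[0-9]*.pkg.tar.*; do
        [ -e "$f" ] || continue
        case $f in *.sig) continue ;; esac
        printf '%s\n' "$f"
    done | sort -V
}

# Example (run as root; pick a file from the listing yourself):
#   list_cached_pkgs xfsprogs
#   pacman -U <one of the files listed above>
```

`pacman -U` installs a package from a local file, which is how a cached older build gets put back in place.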
Offline
it still throws the messages from the OP every time i try to upgrade something with pacman for example.
It seems kinda strange that, despite these messages popping up, only the login manager fails to start (supposedly after the boot process has completed), and yet i am able to log in on another tty just fine. It also says that the startx command can't be found. Perhaps some of the X related files were corrupted (and missed by 'find' for some reason)?
Offline
And xfs_repair still says everything is right? Now that looks like an XFS bug indeed. I guess all you can do is go upstream - see here and email the linux-xfs group.
It may be a bug in xfs_repair or in the kernel driver. I presume live USB uses some older kernel, what happens if you boot it, chroot to this system and start pacman inside chroot while running the live USB kernel?
Oh, and what kernel version do you have so I know to avoid it for now?
Last edited by mich41 (2017-02-21 20:04:31)
Offline
4.9.8-1
doing a full upgrade rn, will report back in a few minutes
xfsprogs wasnt updated though so theres little hope
Offline
OK it didn't work.
How do i report this as a bug? do i just message the guy and link this thread?
Also regarding restoring the system - will it work if i back the files up, remake the fs and load them back in?
Offline
OK it didn't work.
Full upgrade? So what exactly happened? I thought you said pacman doesn't work at all.
How do i report this as a bug? do i just message the guy and link this thread?
Not a guy but a mailing list. The email address is in my link, along with an archive if you'd like to see how things are normally done there. As for reporting bugs, nobody really enjoys reading long forum threads, so you'd better just describe what the problem is: namely, your kernel reports corruption but xfs_repair fails to find it. Paste this log and make it clear that the problem is persistent because it happens every time you run the package manager. Hopefully someone will figure out how to find what's wrong.
Also regarding restoring the system - will it work if i back the files up, remake the fs and load them back in?
In principle - yes, this should produce a clean filesystem. But if there are bugs, they may screw things up again.
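For the bug report itself, a small sketch of how the relevant facts could be pulled together. The helper `xfs_bad_inodes` is hypothetical; it just extracts the inode numbers from "validation failed" lines like the ones in the first post, so the report can name them directly.

```shell
# Hypothetical helper: read a kernel log on stdin and print each distinct
# inode number named in an "xfs_iread: validation failed" message.
xfs_bad_inodes() {
    sed -n 's/.*validation failed for inode \([0-9][0-9]*\).*/\1/p' | sort -u
}

# Example: details worth including in the report.
#   uname -r                        # running kernel version
#   xfs_repair -V                   # xfsprogs version
#   journalctl -k | xfs_bad_inodes  # inodes the kernel complained about
```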
Last edited by mich41 (2017-02-22 08:04:06)
Offline
Pacman works, it just throws messages similar to what's seen in boot log with no apparent consequences.
Offline
It's been solved upstream, here's the patch:
diff --git a/repair/dinode.c b/repair/dinode.c
index 8d01409..d664f87 100644
--- a/repair/dinode.c
+++ b/repair/dinode.c
@@ -1385,6 +1385,11 @@ process_symlink(
return(1);
}
+ if (be64_to_cpu(dino->di_size) == 0) {
+ do_warn(_("zero size symlink in inode %" PRIu64 "\n"), lino);
+ return 1;
+ }
+
/*
* have to check symlink component by component.
* get symlink contents into data area
it required me to use the `--ignore-whitespace` option with patch. After that i just had to run `xfs_repair -d` in accordance with the wiki.
Likely the patch is gonna make it into the next xfsprogs.
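A self-contained sketch of why the whitespace option matters when a diff is copied from a web page. GNU patch spells the long option `--ignore-whitespace` (short form `-l`). The function name and the tiny C file below are made up for illustration and have nothing to do with the real xfsprogs sources.

```shell
# Hypothetical demo: a diff whose tabs were turned into spaces still
# applies when patch is told to match whitespace loosely.
demo_patch_ignore_whitespace() {
    dir=$(mktemp -d) || return 1
    # Original file, indented with a tab
    printf 'int f(void)\n{\n\treturn 1;\n}\n' > "$dir/orig.c"
    # Desired result, with one extra line
    printf 'int f(void)\n{\n\tint x = 0;\n\treturn 1;\n}\n' > "$dir/new.c"
    # diff exits 1 when the files differ, which is expected here
    diff -u "$dir/orig.c" "$dir/new.c" > "$dir/fix.diff" || true
    # Simulate a copy-paste from a web page: every tab becomes a space
    tr '\t' ' ' < "$dir/fix.diff" > "$dir/munged.diff"
    # -l is the short form of --ignore-whitespace
    if ! patch -l "$dir/orig.c" < "$dir/munged.diff" >/dev/null; then
        rm -rf "$dir"
        return 1
    fi
    # Print how many lines of the patched file contain the added statement
    grep -c 'int x = 0;' "$dir/orig.c"
    rm -rf "$dir"
}
```

For a git-style diff like the one above, the equivalent would be something along the lines of `patch -p1 -l < saved.diff` run from the top of the source tree, where `saved.diff` is a placeholder name and `-p1` strips the leading `a/` and `b/` path components.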
Since the file in question for me was /usr/lib/libxcb-randr.so.0, restoring functionality was a matter of reinstalling libxcb.
Kudos to mich41 for huge help.
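The thread doesn't say how the affected file was tracked down, so the following is only one possible route, sketched under that assumption: look up the inode number from the kernel log with `find -inum`, then ask pacman which package owns the path. The helper name `path_for_inode` is hypothetical, and the lookup may not succeed while the kernel still refuses to read the inode.

```shell
# Hypothetical helper: print the path(s) that carry a given inode number
# on the filesystem mounted at "$1".
path_for_inode() {
    mnt=$1
    ino=$2
    # -xdev stays on one filesystem, where inode numbers are unique
    find "$mnt" -xdev -inum "$ino" 2>/dev/null
}

# Example, using the inode number from the kernel log in the first post:
#   path_for_inode / 34110192
#   pacman -Qo /usr/lib/libxcb-randr.so.0   # which package owns the file
#   pacman -S libxcb                        # reinstall that package
```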
Offline