Last night, I updated Linux to 4.0.2. That same night, weird things started happening; in the end, my system would not even boot properly (systemd got stuck starting the login service, and the other virtual terminals did not work either). I chrooted into my / using a live USB stick to get the logs. According to them, several shared libraries were missing or corrupted.
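Roughly what I did from the live environment, for reference (the device names below are just examples, adjust for your own layout; this also assumes a persistent journal):
    mdadm --assemble --scan        # assemble the existing array from the live system
    mount /dev/md127p3 /mnt        # root partition of the installed system
    arch-chroot /mnt
    journalctl -b -1 -p err        # errors from the previous boot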
I fscked my ext4 partitions; there were some corrupted files and directories on my root partition. badblocks did not find anything.
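For reference, the checks looked something like this (again, device names are examples):
    fsck.ext4 -f /dev/md127p3      # force a full check even if the filesystem is marked clean
    badblocks -sv /dev/sda         # read-only surface scan, repeated for each drive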
I reinstalled the affected packages from the chroot and rebooted. After using my machine for some time, I noticed data loss on /home. I went back to the live USB, and now fsck found corrupted directories on my /home partition. From the content I was able to salvage, all of the affected files seemed "recent" (browser cache entries, Nautilus thumbnails, and a text file which I had created during the last session).
After another reboot, there were some new problems, again related to corrupted shared libraries. I decided to go back to the live USB environment and reinstall all my packages, just to make sure everything is OK. After this, I was able to use my machine without problems for several hours. Then I noticed that an image on /home which I had created during this session was corrupted.
I don't think it's a hardware problem. I have two SSDs in RAID0 (mdadm). Windows is set up for dual-boot on the same drives, and there aren't any problems on the NTFS partitions. Also, the drives are only one and a half years old, with slightly above-average usage but far from anything extreme (just a normal development machine), so they should be nowhere near the end of their lifetimes.
I downgraded Linux to 4.0.1 to see whether 4.0.2 caused the problems. As Linux 4.0.3 has already been released, I checked the changelog for relevant entries. This one seems interesting.
My questions:
- Is anybody else affected by similar problems after updating to linux-4.0.2?
- Does the fix linked above in Linux 4.0.3 seem related to my problem?
- If it were a hardware problem, badblocks should have detected bad sectors, right?
EDIT:
This was the most likely reason: https://bugzilla.kernel.org/show_bug.cgi?id=98501
Last edited by zozi56 (2015-05-21 08:00:56)
Offline
I have the same problem.
Asus u500v notebook with 2xADATA XM11 256GB-V2
Gentoo, kernel 4.0.2, fakeraid RAID0 with mdadm.
Offline
Gentoo, kernel 4.0.2, fakeraid RAID0 with mdadm.
Please ask for help on the Gentoo boards; we only support Arch:
https://wiki.archlinux.org/index.php/Fo … pport_ONLY
Offline
After spending about 20 hours on 4.0.1 again, I did not encounter any new data corruption. I now strongly suspect that 4.0.2 caused it, specifically the problem described in the commit message I linked to before.
Last edited by zozi56 (2015-05-16 08:39:39)
Offline
I'm having what seems like the same issue. On May 12th, I decided to try linux-ck-bulldozer and installed v4.0.2 (coming from linux-ck 3.19.6). I restarted and the struggle ensued. I could write chapters about what I've done since then; I finally got back to a semi-normal state on May 15th. I've experienced a lot of data corruption but haven't been able to deduce the source. I have ext4 LVM logical volumes for root, /var, and /home on top of LUKS on a RAID0 (mdadm) striped pair of Samsung 830 SSDs. I've had to fsck each of those filesystems more than once. I haven't run badblocks, but mdadm reports a healthy RAID array and smartctl reports that both drives are healthy.
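For reference, the health checks were along these lines (device names are examples):
    cat /proc/mdstat               # array state at a glance
    mdadm --detail /dev/md0        # should report "State : clean"
    smartctl -H /dev/sda           # overall SMART verdict, repeated for each drive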
My journal was destroyed, and pacman.log has garbage littered throughout, which makes it difficult to analyze the situation. At different times, I had to reinstall pacman from a live environment and generally reinstall many other packages because of invalid ELF headers and wrong magic bytes. After lots of reading and trying different things, I was almost prepared to wipe everything and start over (although I definitely would have inquired here first). Before resorting to that, however, I figured I would try reinstalling all currently installed packages. I rsynced a backup of the entire system, purged the pacman cache, and did pacman -Qenq | pacman -S --force -. I restarted and, lo and behold, there was GDM and then GNOME.
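Spelled out, the backup and cache purge were roughly this (the backup path is just an example):
    rsync -aAX / /mnt/backup/ --exclude={/dev/*,/proc/*,/sys/*,/run/*,/tmp/*,/mnt/*}
    pacman -Scc        # purge the package cache so everything gets re-downloaded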
Since then, I've had some residual issues, including more wrong magic bytes and /home mounting read-only. When I fscked /home it came back clean, but then I used -f to force a check anyway and it found several errors. I also updated to linux-ck-bulldozer 4.0.3; I'm not sure where in this sequence I took that update, but I think I'm still having issues. I use yaourt at times, and I used it yesterday to update google-chrome-beta and then had several problems with Chrome. I then realized that I had reinstalled all packages except those from the AUR. I downloaded PKGBUILDs for yaourt and its dependencies and manually built/installed them. Then I used yaourt to reinstall all AUR packages and they now seem to be working fine -- including Chrome.
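In case it saves someone the same realization: the repo-only reinstall skips foreign packages, which pacman can list directly:
    pacman -Qmq        # foreign (AUR) packages, i.e. everything not found in the sync repos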
I just upgraded to linux-ck-bulldozer 4.0.4, so I'll see how it goes. My concern now is that even if I get no more corruption, how will I be able to find and fix everything that is already trashed? I'd really prefer not to start over if I can avoid it; I've been running this particular installation since 2012.
Sorry for rambling on your thread. I sure appreciate your OP. I have never experienced data corruption resulting from a kernel update, so it wasn't even on my radar. Now you have me wondering.
Last edited by matthew02 (2015-05-19 04:18:04)
Offline
Good to hear I'm not alone in this.
I think using pacman -Qenq in this situation is not a good idea, because it lists only those packages which were installed explicitly, and does not include the implicit dependencies. I used pacman -Qnq, without the -e option.
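A quick sketch of the difference:
    pacman -Qenq    # only explicitly installed native packages
    pacman -Qnq     # all native packages, dependencies included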
Offline
You're right. Those are actually the options I used. Most of my problem files were from dependency packages, so the other way would not have done the trick.
Unfortunately, I had to go through the entire process twice more this morning. I had trouble with 4.0.2, 4.0.3, and 4.0.4. I never ran 4.0.1 because of some problem with the ck kernel. I'm back at 3.19.8 now and everything seems to be going smoothly.
Considering that the symptoms persisted with 4.0.3 and 4.0.4, it seems unlikely that my situation is related to the patch you referred to in the OP. I just found this thread that could definitely explain my problems. I guess I'm going to disable NCQ and give it another go with all my fingers crossed. You don't happen to use encryption, do you?
Offline
No, I don't use encryption.
The thread you mention concerns kernels >= 4.0; however, 4.0.1 is working perfectly for me. Maybe RAID has something to do with it? All three of us use mdadm (including the second poster).
Offline
I have the exact same problem: kernel 4.0.2 (or greater) generates filesystem corruption while 4.0.1 works perfectly. I have a fake RAID0 managed by mdadm. I had to reinstall the system from scratch and now I simply avoid upgrading the kernel until the issue gets fixed.
Offline
Theodore Ts'o answered a question on LKML about the patch mentioned in the original post. The patch is probably not related to the bug we encountered.
Offline
Are you using AHCI/NCQ? I plan to disable NCQ later today and give it a shot. I'll let you know how it goes.
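In case anyone wants to try the same, the two ways I know of to disable it (untested on my side so far; the device name is an example):
    # at runtime, per device (as root); a queue depth of 1 effectively disables NCQ
    echo 1 > /sys/block/sda/device/queue_depth
    # or for all ports, add this to the kernel command line:
    #   libata.force=noncq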
Offline
After a quick look at the patchset between Linux 4.0.1 and 4.0.2, I found only two patches that actually touch ext4 or RAID:
http://git.kernel.org/cgit/linux/kernel … f60667fa21
http://git.kernel.org/cgit/linux/kernel … cafe9a464c
I don't know if they are relevant.
Offline
Are you using AHCI/NCQ? I plan to disable NCQ later today and give it a shot. I'll let you know how it goes.
Yes, actually I do use NCQ. I'm looking forward to your results.
Offline
Please correct me if I'm wrong but from what I've read it only seems to be affecting RAID0 setups. I've had no problems with my mdadm RAID1 ext4 setup, although I've downgraded the kernel now just to be safe.
Offline
Please correct me if I'm wrong but from what I've read it only seems to be affecting RAID0 setups.
It's affecting almost all RAID setups, some non-RAID setups, and multiple distros (Arch, Fedora, Debian). It appears to have been introduced with kernel 4.0.2, but I also noticed that a new mdadm was pushed to us this morning, so I'm curious to see whether that may have fixed it. The upstream kernel developers appear to be aware of it, and I'm going to assume that someone there is having an 'oh, shit' moment trying to figure out what they did as Linus points his baleful eye upon them.
Offline
fabertawe wrote: Please correct me if I'm wrong but from what I've read it only seems to be affecting RAID0 setups.
It's affecting almost all RAID setups, some non-RAID setups, and multiple distros (Arch, Fedora, Debian). It appears to have been introduced with kernel 4.0.2, but I also noticed that a new mdadm was pushed to us this morning, so I'm curious to see whether that may have fixed it. The upstream kernel developers appear to be aware of it, and I'm going to assume that someone there is having an 'oh, shit' moment trying to figure out what they did as Linus points his baleful eye upon them.
Quite interesting, but the mdadm update is unlikely to fix the problem.
If only RAID0 is affected, maybe someone can revert this patch from linux-4.0.2 or newer to see whether the problem still occurs on RAID systems.
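For anyone building their own kernel, that would be something like the following (the commit hash is a placeholder, since the link above is truncated):
    cd linux-stable
    git checkout -b test-revert v4.0.2
    git revert <hash-of-the-raid0-patch>   # placeholder for the commit linked above
    make olddefconfig && make -j"$(nproc)"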
Offline
It sure sounds more like this [1] than something related to the ext4 corruption patch. Mind that the commit I link to was only added in 4.1-rc4; it will probably be backported soon.
[1] https://github.com/torvalds/linux/commi … 8ae67acf81
R00KIE
Offline
If only RAID0 is affected, maybe someone can revert this patch from linux-4.0.2 or newer to see if the problem still occurs on raid systems.
I agree, this seems like a good idea.
It sure sounds more like this [1] than something related to the ext4 corruption patch. Mind you that the commit I link to was added to 4.1-rc4 and it will probably be backported soon.
[1] https://github.com/torvalds/linux/commi … 8ae67acf81
Maybe you're right; however:
- libata does not seem to have changed between 4.0.1 and 4.0.2, yet the bug was clearly introduced in 4.0.2
- I have this bug on SanDisk drives, not Samsung ones
Offline
Concerning... has anyone NOT using mdadm had this issue? I am on 4.0.3 (although I did run 4.0.2 for a while) and have not noticed anything. I will be doing an fsck and a downgrade later (unless it is confirmed to affect RAID systems only).
Edit (since the more information, the better):
Now on kernel 4.0.4 - still no issues.
Root is on an Intel SSD; the discard mount option is enabled.
Additional storage is a 2TB hard disk; mount points are defined in fstab (by UUID).
Nothing fancy like LVM.
Both SSD & HDD fsck'd from a live CD - both clean.
UEFI BIOS, storage is in 'RAID' mode (Intel Matrix Storage Manager version 12.9). Pure AHCI mode messes up my port numbers so they don't correspond to what is written on the board - that's the only reason RAID mode is selected. NO RAID VOLUMES DEFINED.
mdadm is installed & the mkinitcpio hook for mdadm is also enabled. Again, this is not used for anything day-to-day; I just like to have it available in case I need to manually hook up any of the disks from my Synology NAS (i.e. disaster recovery).
So in summary: my BIOS is in RAID mode and the mdadm hook is enabled, but no RAID volumes are defined or in use. I have NO issues on an Intel SSD or a normal HDD. *touches wood!*
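For completeness, the relevant line of my /etc/mkinitcpio.conf looks something like this (from memory, so treat it as approximate):
    HOOKS="base udev autodetect modconf block mdadm_udev filesystems keyboard fsck"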
Hope this helps someone who is trying to figure out the cause.
Last edited by rlees85 (2015-05-20 22:23:58)
Offline
Interestingly enough, I have a home FTP server using mdadm for a RAID1 array. All partitions are ext4. This system uses HDDs instead of SSDs. No encryption either. I was using kernel version 4.0.2, and this morning I updated to 4.0.4-1.
Link below shows Kernel updates for this system:
http://ix.io/iFL
Just as fabertawe reported, I also have not observed any of the issues (as of this post). Not enough data has been collected yet, but this doesn't appear to affect everyone.
My condolences to the owners of all of the missing 1's & 0's.
Offline
R00KIE wrote: It sure sounds more like this [1] than something related to the ext4 corruption patch. Mind that the commit I link to was only added in 4.1-rc4; it will probably be backported soon.
[1] https://github.com/torvalds/linux/commi … 8ae67acf81
Maybe you're right; however:
- libata does not seem to have changed between 4.0.1 and 4.0.2, yet the bug was clearly introduced in 4.0.2
- I have this bug on SanDisk drives, not Samsung ones
Oh, but it's not only Samsung drives that have problems; depending on the firmware version, some Crucial and Micron drives are also listed as having problems. I would say that more drives might be affected but no one has noticed or complained yet.
On another note, this could be caused by a change at any one of many levels. Is the corruption easy to trigger? Is there any specific pattern of usage that leads to it? I guess this needs to be determined first so a bug can be filed on the kernel bug tracker and/or the kernel can be bisected to find the offending commit.
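If someone who can reproduce the corruption reliably wants to try, a bisect between the two tags would go something like:
    git bisect start
    git bisect bad v4.0.2
    git bisect good v4.0.1
    # build, boot, try to trigger the corruption, then mark the result:
    git bisect good    # or "git bisect bad"; repeat until the offending commit is found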
R00KIE
Offline
Hi, I have just encountered this bug on my root partition. I am using Linux 4.0.3, single drive (no RAID). The drive is, however, a hybrid HDD with SSD cache. Could this be related?
Offline
Up until 20 minutes ago I was running 4.0.4 with a RAID1 (HDDs, 3TB). Thankfully, no corruption on disk. I downgraded to 4.0.1 to be safe. It looks like there may be a combination of things that triggers this?
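For anyone else rolling back, the cached package makes it painless (the version and path below are an example):
    pacman -U /var/cache/pacman/pkg/linux-4.0.1-1-x86_64.pkg.tar.xz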
Offline
I believe I may have also encountered this issue; however, I do not use RAID. I use a single Samsung mSATA SSD formatted with ext4. Ever since upgrading to linux-pf version 4.0.4-1 last Sunday, I have had issues booting into Arch, and even when I am able to boot, many programs run into errors or freeze the system entirely. I have also witnessed numerous kernel oopses and one kernel panic. I investigated the journalctl logs, but the reasons for the problems varied, and sometimes there was no error to speak of. Some of the errors were due to the pamac service failing or GNOME crashing on start-up; other times, nothing was reported at all. After restarting from the crashes, if I was able to boot, some programs that I had updated had been reverted to their previous versions as well. I have attempted to reinstall some of the programs listed in the logs, to no avail. I could be wrong about this, but I wanted to bring it up after I saw a post about this issue on Softpedia.
Offline