#1 2021-01-19 16:53:16

Rohrschacht
Member
Registered: 2017-10-24
Posts: 2

[SOLVED] MD raid5: kernel BUG at drivers/md/raid5.c:5426!

I am using Arch on some old hardware of mine as a NAS. I have been running btrfs raid5 with 4 disks for about a year. Recently, I wanted to change my setup to MD raid5 with LVM and btrfs on top, in order to transition away from btrfs raid5. I set up a degraded raid5 with two disks and was rsyncing my datasets over to it when rsync suddenly stopped in the middle of a file. From that point on, every action involving the MD raid5 array just froze, while the rest of the system seems to work normally. I first experienced this on kernel 5.9.14, but have since updated to 5.10.7, where I still seem to have the same problem.
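
For context, the new stack was created roughly like this (a rough sketch with placeholder device and volume names, not my exact command history; mainpool is the real array name):

	# Degraded 3-disk raid5 with only two members; "missing" leaves the third slot empty
	mdadm --create /dev/md/mainpool --level=5 --raid-devices=3 \
	      /dev/mapper/diskA /dev/mapper/diskB missing

	# LVM and btrfs on top of the MD device
	pvcreate /dev/md/mainpool
	vgcreate mainvg /dev/md/mainpool
	lvcreate -l 100%FREE -n data mainvg
	mkfs.btrfs /dev/mainvg/data
	mount /dev/mainvg/data /mnt/nas

	# Copy the datasets over from the old btrfs raid5
	rsync -aHAX --info=progress2 /old-pool/ /mnt/nas/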

First, here is the data from when I initially experienced this, running kernel version 5.9.14:

dmesg shows this output:

[31774.969377] kernel BUG at drivers/md/raid5.c:5426!
[31774.969387] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[31774.969506] CPU: 1 PID: 405 Comm: md127_raid5 Tainted: P           OE     5.9.14-arch1-1 #1
[31774.969579] Hardware name: MICRO-STAR INTERNATIONAL CO.,LTD MS-7501/K9A2VM (MS-7501), BIOS V1.11B1 10/08/2009
[31774.969579] RIP: 0010:handle_active_stripes.constprop.0+0x52b/0x560 [raid456]
[31774.969579] Code: 3c 00 00 48 8d 5c 24 28 e9 44 fe ff ff 48 8b 45 28 48 8b 40 28 a8 40 0f 84 69 fc ff ff 48 89 ef e8 8a 76 00 00 e9 5c fc ff ff <0f> 0b e8 2e 32 86 ee 4c 8b 24 24 4d 89 fd 4d 85 f6 0f 84 99 fd ff
[31774.969579] RSP: 0018:ffffa899c0bafd58 EFLAGS: 00010086
[31774.969579] RAX: 00000000ffffffff RBX: ffff96db0d342088 RCX: ffff96db0d342088
[31774.969579] RDX: ffff96da022d4f38 RSI: ffff96da0287cf48 RDI: ffff96db08d6ba80
[31774.970604] RBP: ffff96db0d342000 R08: 0000000000000000 R09: ffffa899c0bafcf0
[31774.970604] R10: 0000000000000000 R11: 0000000000000005 R12: 0000000000000005
[31774.970604] R13: ffff96db0d3420a8 R14: 0000000000000000 R15: ffff96db0d3420a8
[31774.970604] FS:  0000000000000000(0000) GS:ffff96db1c080000(0000) knlGS:0000000000000000
[31774.970604] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[31774.970604] CR2: 00007f091bf96020 CR3: 00000000a877c000 CR4: 00000000000006e0
[31774.970604] Call Trace:
[31774.970604]  ? __wake_up_common_lock+0x8b/0xc0
[31774.970604]  raid5d+0x364/0x550 [raid456]
[31774.970604]  md_thread+0xb7/0x180 [md_mod]
[31774.970604]  ? wait_woken+0x80/0x80
[31774.970604]  ? md_super_write.part.0+0x160/0x160 [md_mod]
[31774.970604]  kthread+0x142/0x160
[31774.970604]  ? __kthread_bind_mask+0x60/0x60
[31774.970604]  ret_from_fork+0x22/0x30
[31774.970604] Modules linked in: raid456 async_raid6_recov async_memcpy async_pq md_mod xxhash_generic radeon dm_integrity async_xor async_tx dm_bufio dm_mod edac_mce_amd f71882fg input_leds snd_hda_codec_realtek snd_hda_codec_generic ccp ledtrig_audio rng_core snd_hda_intel snd_intel_dspcfg hid_generic snd_hda_codec kvm i2c_algo_bit snd_hda_core ttm btrfs snd_hwdep snd_pcm drm_kms_helper blake2b_generic xor snd_timer usbhid snd irqbypass cec hid pcspkr soundcore r8169 k8temp rc_core raid6_pq realtek libcrc32c mdio_devres of_mdio fixed_phy syscopyarea sysfillrect libphy sp5100_tco sysimgblt fb_sys_fops evdev i2c_piix4 zfs(POE) mac_hid zunicode(POE) zzstd(OE) zlua(OE) zavl(POE) icp(POE) drm zcommon(POE) znvpair(POE) fuse spl(OE) agpgart ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 ohci_pci ata_generic pata_acpi pata_atiixp ehci_pci ehci_hcd ohci_hcd xhci_pci xhci_pci_renesas xhci_hcd
[31774.972947] ---[ end trace 548670d6bbe90a3a ]---
[31774.972947] RIP: 0010:handle_active_stripes.constprop.0+0x52b/0x560 [raid456]
[31774.972947] Code: 3c 00 00 48 8d 5c 24 28 e9 44 fe ff ff 48 8b 45 28 48 8b 40 28 a8 40 0f 84 69 fc ff ff 48 89 ef e8 8a 76 00 00 e9 5c fc ff ff <0f> 0b e8 2e 32 86 ee 4c 8b 24 24 4d 89 fd 4d 85 f6 0f 84 99 fd ff
[31774.972947] RSP: 0018:ffffa899c0bafd58 EFLAGS: 00010086
[31774.972947] RAX: 00000000ffffffff RBX: ffff96db0d342088 RCX: ffff96db0d342088
[31774.972947] RDX: ffff96da022d4f38 RSI: ffff96da0287cf48 RDI: ffff96db08d6ba80
[31774.972947] RBP: ffff96db0d342000 R08: 0000000000000000 R09: ffffa899c0bafcf0
[31774.972947] R10: 0000000000000000 R11: 0000000000000005 R12: 0000000000000005
[31774.972947] R13: ffff96db0d3420a8 R14: 0000000000000000 R15: ffff96db0d3420a8
[31774.972947] FS:  0000000000000000(0000) GS:ffff96db1c080000(0000) knlGS:0000000000000000
[31774.972947] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[31774.972947] CR2: 00007f091bf96020 CR3: 00000000a877c000 CR4: 00000000000006e0
[31774.972947] note: md127_raid5[405] exited with preempt_count 1

Output from uname:

Linux 5.9.14-arch1-1 #1 SMP PREEMPT Sat, 12 Dec 2020 14:37:12 +0000 x86_64 GNU/Linux

Output from mdadm -D /dev/md/mainpool:

/dev/md/mainpool:
           Version : 1.2
     Creation Time : Sat Jan 16 14:21:04 2021
        Raid Level : raid5
        Array Size : 5814474752 (5545.12 GiB 5954.02 GB)
     Used Dev Size : 2907237376 (2772.56 GiB 2977.01 GB)
      Raid Devices : 3
     Total Devices : 2
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Sat Jan 16 19:25:20 2021
             State : active, degraded
    Active Devices : 2
   Working Devices : 2
    Failed Devices : 0
     Spare Devices : 0

            Layout : left-symmetric
        Chunk Size : 512K

Consistency Policy : bitmap

              Name : Horizon:mainpool  (local to host Horizon)
              UUID : 46301b20:72652d31:af46c6bd:10c647f9
            Events : 171

    Number   Major   Minor   RaidDevice State
       0     254        0        0      active sync   /dev/dm-0
       1     254        1        1      active sync   /dev/dm-1
       -       0        0        2      removed

After this first encounter, I rebooted my system, updated everything using pacman -Syu, and re-created the degraded MD raid5 array using the two disks. LVM promptly found its physical volume again and all my data was still there. I resumed the data transfer from where it had left off, but the same problem appeared again after some hours, with exactly the same behaviour as the first time.
This time, I uploaded the entire dmesg output to pastebin: https://pastebin.com/D1U71ks1
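
Bringing the stack back up after the reboot amounted to roughly the following (again a sketch with placeholder names; the usual route would be to assemble the existing array, whereas I re-created it degraded with exactly the same parameters as before, which left the on-disk data intact):

	# Re-assemble the degraded array from its existing superblocks
	mdadm --assemble /dev/md/mainpool /dev/mapper/diskA /dev/mapper/diskB

	# Re-activate LVM, mount, and resume the copy; rsync skips files that
	# were already transferred completely
	vgchange -ay mainvg
	mount /dev/mainvg/data /mnt/nas
	rsync -aHAX --info=progress2 /old-pool/ /mnt/nas/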

I already tried to find out whether someone had run into the same problem over on reddit. Someone there suggested that I run a memory test to check whether my hardware is okay. It has now been running for over 18 hours (11 passes) without any errors, and I will leave it running to be even more sure.
Here is my original post on reddit: https://www.reddit.com/r/archlinux/comm … aid5c5426/

Has anyone encountered something like this recently? What should I do?

Last edited by Rohrschacht (2021-03-21 00:11:34)


#2 2021-01-19 22:50:30

loqs
Member
Registered: 2014-03-06
Posts: 17,366

Re: [SOLVED] MD raid5: kernel BUG at drivers/md/raid5.c:5426!

https://git.kernel.org/pub/scm/linux/ke … 9.14#n5426

	BUG_ON(atomic_inc_return(&sh->count) != 1);

You may have to contact upstream to find out whether the BUG is reached because of an issue in the code or because of a problem with the RAID array itself.
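
As far as I can tell that assertion sits where raid5d takes the next stripe_head off its queue and expects to be its only user. Conceptually it boils down to something like this (a simplified sketch, not the actual surrounding kernel source):

	/* The stripe taken from the list is supposed to be idle (count 0).
	 * atomic_inc_return() bumps the count and returns the new value; if
	 * it is anything other than 1, someone else still holds a reference,
	 * and BUG() raises the "invalid opcode" oops shown in the dmesg. */
	if (atomic_inc_return(&sh->count) != 1)
		BUG();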


#3 2021-03-21 00:11:15

Rohrschacht
Member
Registered: 2017-10-24
Posts: 2

Re: [SOLVED] MD raid5: kernel BUG at drivers/md/raid5.c:5426!

I did not follow through with reporting this bug elsewhere. However, quite shortly after opening this thread, the problem disappeared after a kernel update; I think it must have been around kernel version 5.10.10. I have been using the linux-lts kernel since it was updated to 5.10 (currently 5.10.23-lts), and my NAS has been working reliably ever since.
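
For anyone wanting to do the same, switching to the LTS kernel on Arch is roughly the following (a sketch; the bootloader step assumes GRUB, adjust it for whatever you use):

	pacman -S linux-lts linux-lts-headers
	grub-mkconfig -o /boot/grub/grub.cfg   # regenerate boot entries (GRUB assumed)
	# optionally remove the mainline kernel afterwards:
	# pacman -Rs linux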

