RAID 1 setup unusable after switch to systemd: sync/copy hangs [SOLVED]

ezacaria · 2012-11-17 11:58:43

Hi,

I switched my installation to systemd, following the Arch Wiki entry. Booting is fine. I have systemd-sysvcompat installed, uninstalled the
initscripts package, and removed the init=... options from the kernel line in GRUB.

I have two disks in RAID1 with two volumes. Both volumes mounted and worked fine before the switch (they were declared in fstab but not automounted).
I don't think I was using udisks before, or at least I was not aware of it.

After the switch, I can mount one of the volumes, and the other cannot be mounted.
Furthermore, the volume that can be mounted cannot be used normally.
Trying to copy a file will get stuck after a while or will not really complete: it seems that data from buffers cannot be flushed to the disk.

I get this type of error (here the volume is accessed as /dev/md125p2):

Nov 17 11:56:16 localhost kernel: [  600.393371] INFO: task flush-9:125:880 blocked for more than 120 seconds.
Nov 17 11:56:16 localhost kernel: [  600.393377] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 17 11:56:16 localhost kernel: [  600.393380] flush-9:125     D 0000000000000001     0   880      2 0x00000000
Nov 17 11:56:16 localhost kernel: [  600.393387]  ffff88040c67d680 0000000000000046 ffff88042824e0c0 ffff88040c67dfd8
Nov 17 11:56:16 localhost kernel: [  600.393393]  ffff88040c67dfd8 ffff88040c67dfd8 ffff88042949d8b0 ffff88042824e0c0
Nov 17 11:56:16 localhost kernel: [  600.393398]  ffff88042184fb98 ffff88042184fb80 0000000000000000 0000000000000003
Nov 17 11:56:16 localhost kernel: [  600.393403] Call Trace:
Nov 17 11:56:16 localhost kernel: [  600.393415]  [<ffffffff8108bbe2>] ? default_wake_function+0x12/0x20
Nov 17 11:56:16 localhost kernel: [  600.393422]  [<ffffffff8107a4e6>] ? autoremove_wake_function+0x16/0x40
Nov 17 11:56:16 localhost kernel: [  600.393427]  [<ffffffff81082995>] ? __wake_up_common+0x55/0x90
Nov 17 11:56:16 localhost kernel: [  600.393435]  [<ffffffff81491b79>] schedule+0x29/0x70
Nov 17 11:56:16 localhost kernel: [  600.393446]  [<ffffffffa0f13eb5>] md_write_start+0xb5/0x1a0 [md_mod]
Nov 17 11:56:16 localhost kernel: [  600.393452]  [<ffffffff8107a4d0>] ? abort_exclusive_wait+0xb0/0xb0
Nov 17 11:56:16 localhost kernel: [  600.393458]  [<ffffffffa0514131>] make_request+0x41/0xbf0 [raid1]
Nov 17 11:56:16 localhost kernel: [  600.393464]  [<ffffffff81180480>] ? get_max_files+0x20/0x20
Nov 17 11:56:16 localhost kernel: [  600.393470]  [<ffffffff8116a839>] ? kmem_cache_alloc_node+0x179/0x190
Nov 17 11:56:16 localhost kernel: [  600.393477]  [<ffffffff8123eb97>] ? create_task_io_context+0x27/0x110
Nov 17 11:56:16 localhost kernel: [  600.393484]  [<ffffffffa0f0ebcc>] md_make_request+0xfc/0x240 [md_mod]
Nov 17 11:56:16 localhost kernel: [  600.393492]  [<ffffffff8111f225>] ? mempool_alloc_slab+0x15/0x20
Nov 17 11:56:16 localhost kernel: [  600.393497]  [<ffffffff812399d2>] generic_make_request+0xc2/0x110
Nov 17 11:56:16 localhost kernel: [  600.393501]  [<ffffffff81239aa7>] submit_bio+0x87/0x110
Nov 17 11:56:16 localhost kernel: [  600.393507]  [<ffffffff811b585a>] ? bio_alloc_bioset+0x5a/0xf0
Nov 17 11:56:16 localhost kernel: [  600.393514]  [<ffffffff811afd94>] submit_bh+0xf4/0x130
Nov 17 11:56:16 localhost kernel: [  600.393519]  [<ffffffff811b3318>] __block_write_full_page+0x208/0x3a0
Nov 17 11:56:16 localhost kernel: [  600.393523]  [<ffffffff811b0af0>] ? end_buffer_async_read+0x200/0x200
Nov 17 11:56:16 localhost kernel: [  600.393528]  [<ffffffff811b79b0>] ? blkdev_get_blocks+0xd0/0xd0
Nov 17 11:56:16 localhost kernel: [  600.393533]  [<ffffffff811b79b0>] ? blkdev_get_blocks+0xd0/0xd0
Nov 17 11:56:16 localhost kernel: [  600.393536]  [<ffffffff811b0af0>] ? end_buffer_async_read+0x200/0x200
Nov 17 11:56:16 localhost kernel: [  600.393541]  [<ffffffff811b3596>] block_write_full_page_endio+0xe6/0x130
Nov 17 11:56:16 localhost kernel: [  600.393545]  [<ffffffff811b35f5>] block_write_full_page+0x15/0x20
Nov 17 11:56:16 localhost kernel: [  600.393550]  [<ffffffff811b6f08>] blkdev_writepage+0x18/0x20
Nov 17 11:56:16 localhost kernel: [  600.393555]  [<ffffffff811268ba>] __writepage+0x1a/0x50
Nov 17 11:56:16 localhost kernel: [  600.393560]  [<ffffffff81126d82>] write_cache_pages+0x1f2/0x4e0
Nov 17 11:56:16 localhost kernel: [  600.393564]  [<ffffffff811268a0>] ? global_dirtyable_memory+0x40/0x40
Nov 17 11:56:16 localhost kernel: [  600.393570]  [<ffffffff811270bd>] generic_writepages+0x4d/0x70
Nov 17 11:56:16 localhost kernel: [  600.393575]  [<ffffffff811288d1>] do_writepages+0x21/0x50
Nov 17 11:56:16 localhost kernel: [  600.393580]  [<ffffffff811a8ecb>] __writeback_single_inode.isra.31+0x3b/0x190
Nov 17 11:56:16 localhost kernel: [  600.393585]  [<ffffffff811a93ba>] writeback_sb_inodes+0x2ba/0x4d0
Nov 17 11:56:16 localhost kernel: [  600.393591]  [<ffffffff811a966f>] __writeback_inodes_wb+0x9f/0xd0
Nov 17 11:56:16 localhost kernel: [  600.393595]  [<ffffffff811a99b3>] wb_writeback+0x313/0x340
Nov 17 11:56:16 localhost kernel: [  600.393601]  [<ffffffff811aa746>] wb_do_writeback+0x1c6/0x270
Nov 17 11:56:16 localhost kernel: [  600.393606]  [<ffffffff811aa883>] bdi_writeback_thread+0x93/0x2d0
Nov 17 11:56:16 localhost kernel: [  600.393611]  [<ffffffff811aa7f0>] ? wb_do_writeback+0x270/0x270
Nov 17 11:56:16 localhost kernel: [  600.393616]  [<ffffffff81079a03>] kthread+0x93/0xa0
Nov 17 11:56:16 localhost kernel: [  600.393623]  [<ffffffff8149b144>] kernel_thread_helper+0x4/0x10
Nov 17 11:56:16 localhost kernel: [  600.393629]  [<ffffffff81079970>] ? kthread_freezable_should_stop+0x70/0x70
Nov 17 11:56:16 localhost kernel: [  600.393634]  [<ffffffff8149b140>] ? gs_change+0x13/0x13

I tried to run sync, but it freezes with a similar traceback.

At this point, umount and lsof also freeze, and I cannot stop the array with "mdadm --stop /dev/md125".
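A minimal sketch for spotting the stuck processes, assuming a procps-style `ps`: the hung-task reports above show the tasks in state "D" (uninterruptible sleep), so filtering the process list for that state shows what is currently blocked. The helper name `filter_d_state` is made up for illustration.

```shell
# Print "PID COMMAND" for every task in uninterruptible sleep (state "D").
# Expects `ps -eo pid,stat,comm` style lines on stdin; the header line is
# ignored because its second column ("STAT") does not start with "D".
filter_d_state() {
    awk '$2 ~ /^D/ { print $1, $3 }'
}

# Usage on a live system:
#   ps -eo pid,stat,comm | filter_d_state
```

Only the first word of the command name is printed, which is enough to recognise entries such as flush-9:125, mount or sync.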
If I boot into Windows, both volumes are reported to be OK. Sometimes I need to do a hard shutdown in Linux. Reboot/shutdown works fine as long as I don't try to do anything with the RAID volumes.

I was blaming udisks, but it seems I was wrong. I uninstalled udisks and tried again. Here is the traceback for the hanging mount for the other volume:

Nov 17 13:52:02 localhost kernel: [  240.390076] mount           D ffff880426534800     0   683    675 0x00000004
Nov 17 13:52:02 localhost kernel: [  240.390083]  ffff880426e83868 0000000000000086 ffff8804145ce0c0 ffff880426e83fd8
Nov 17 13:52:02 localhost kernel: [  240.390089]  ffff880426e83fd8 ffff880426e83fd8 ffffffff81814420 ffff8804145ce0c0
Nov 17 13:52:02 localhost kernel: [  240.390094]  ffff880427494298 ffff880427494280 0000000000000000 0000000000000003
Nov 17 13:52:02 localhost kernel: [  240.390099] Call Trace:
Nov 17 13:52:02 localhost kernel: [  240.390111]  [<ffffffff8108bbe2>] ? default_wake_function+0x12/0x20
Nov 17 13:52:02 localhost kernel: [  240.390118]  [<ffffffff8107a4e6>] ? autoremove_wake_function+0x16/0x40
Nov 17 13:52:02 localhost kernel: [  240.390123]  [<ffffffff81082995>] ? __wake_up_common+0x55/0x90
Nov 17 13:52:02 localhost kernel: [  240.390131]  [<ffffffff81491b79>] schedule+0x29/0x70
Nov 17 13:52:02 localhost kernel: [  240.390146]  [<ffffffffa0fa2eb5>] md_write_start+0xb5/0x1a0 [md_mod]
Nov 17 13:52:02 localhost kernel: [  240.390151]  [<ffffffff8107a4d0>] ? abort_exclusive_wait+0xb0/0xb0
Nov 17 13:52:02 localhost kernel: [  240.390158]  [<ffffffffa00cd131>] make_request+0x41/0xbf0 [raid1]
Nov 17 13:52:02 localhost kernel: [  240.390163]  [<ffffffff81491b79>] ? schedule+0x29/0x70
Nov 17 13:52:02 localhost kernel: [  240.390169]  [<ffffffff810919c8>] ? enqueue_task_fair+0xa8/0xf0
Nov 17 13:52:02 localhost kernel: [  240.390177]  [<ffffffffa0f9dbcc>] md_make_request+0xfc/0x240 [md_mod]
Nov 17 13:52:02 localhost kernel: [  240.390185]  [<ffffffff8111f225>] ? mempool_alloc_slab+0x15/0x20
Nov 17 13:52:02 localhost kernel: [  240.390191]  [<ffffffff812399d2>] generic_make_request+0xc2/0x110
Nov 17 13:52:02 localhost kernel: [  240.390196]  [<ffffffff81239aa7>] submit_bio+0x87/0x110
Nov 17 13:52:02 localhost kernel: [  240.390201]  [<ffffffff811b585a>] ? bio_alloc_bioset+0x5a/0xf0
Nov 17 13:52:02 localhost kernel: [  240.390208]  [<ffffffff811afd94>] submit_bh+0xf4/0x130
Nov 17 13:52:02 localhost kernel: [  240.390212]  [<ffffffff811b1842>] __sync_dirty_buffer+0x52/0xd0
Nov 17 13:52:02 localhost kernel: [  240.390216]  [<ffffffff811b18d3>] sync_dirty_buffer+0x13/0x20
Nov 17 13:52:02 localhost kernel: [  240.390241]  [<ffffffffa0264550>] ext4_commit_super+0x1e0/0x250 [ext4]
Nov 17 13:52:02 localhost kernel: [  240.390255]  [<ffffffffa0266909>] ext4_setup_super+0x129/0x1a0 [ext4]
Nov 17 13:52:02 localhost kernel: [  240.390267]  [<ffffffffa0269a30>] ext4_fill_super+0x2560/0x2c30 [ext4]
Nov 17 13:52:02 localhost kernel: [  240.390275]  [<ffffffff811826c0>] mount_bdev+0x1d0/0x210
Nov 17 13:52:02 localhost kernel: [  240.390287]  [<ffffffffa02674d0>] ? ext4_calculate_overhead+0x430/0x430 [ext4]
Nov 17 13:52:02 localhost kernel: [  240.390302]  [<ffffffffa025d665>] ext4_mount+0x15/0x20 [ext4]
Nov 17 13:52:02 localhost kernel: [  240.390307]  [<ffffffff81183113>] mount_fs+0x43/0x1b0
Nov 17 13:52:02 localhost kernel: [  240.390314]  [<ffffffff8113ee30>] ? __alloc_percpu+0x10/0x20
Nov 17 13:52:02 localhost kernel: [  240.390322]  [<ffffffff8119da84>] vfs_kern_mount+0x74/0x110
Nov 17 13:52:02 localhost kernel: [  240.390329]  [<ffffffff8119dfe4>] do_kern_mount+0x54/0x110
Nov 17 13:52:02 localhost kernel: [  240.390334]  [<ffffffff8119fc75>] do_mount+0x315/0x8e0
Nov 17 13:52:02 localhost kernel: [  240.390339]  [<ffffffff811396b6>] ? memdup_user+0x46/0x80
Nov 17 13:52:02 localhost kernel: [  240.390344]  [<ffffffff8113974b>] ? strndup_user+0x5b/0x80
Nov 17 13:52:02 localhost kernel: [  240.390350]  [<ffffffff811a02cd>] sys_mount+0x8d/0xe0
Nov 17 13:52:02 localhost kernel: [  240.390356]  [<ffffffff81499f2d>] system_call_fastpath+0x1a/0x1f

I'm not sure what is wrong here. Any ideas would be much appreciated.

Thanks!

Eduardo

Last edited by ezacaria (2013-10-29 20:18:35)

teekay · 2012-11-18 09:32:05

Please try to manually do the setup step by step, like
mdadm --examine --scan (see what arrays are detected)
mdadm --assemble --scan (see if detected arrays can be assembled properly)
cat /proc/mdstat (verify above)
fsck.ext4
mount

and between each step check journalctl for errors.
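The steps above can be sketched as a small script that runs each command and then prints what the journal logged while it ran. This is only an illustration: the function names, the default device /dev/md125p2 (taken from the first post) and the /mnt mount point are assumptions, and `journalctl --since` needs a reasonably recent systemd.

```shell
# Run one step, then show what the journal logged while it ran.
# With DRY_RUN set to a non-empty value the command is only printed.
step() {
    since=$(date '+%Y-%m-%d %H:%M:%S')
    echo "+ $*"
    [ -n "${DRY_RUN:-}" ] && return 0
    "$@"
    journalctl --no-pager --since "$since"
}

# $1: the volume to check (assumed default: /dev/md125p2).
check_array() {
    dev=${1:-/dev/md125p2}
    step mdadm --examine --scan     # which arrays are detected
    step mdadm --assemble --scan    # can they be assembled
    step cat /proc/mdstat           # verify the above
    step fsck.ext4 -n "$dev"        # -n: check only, change nothing
    step mount "$dev" /mnt          # /mnt is a placeholder mount point
}

# Usage (as root):  check_array /dev/md125p2
# Preview only:     DRY_RUN=1 check_array /dev/md125p2
```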

I guess TheRealWTF is "If I boot into Windows, both volumes are reported to be OK". What does that mean exactly?

ezacaria · 2012-11-18 12:20:58

Hi teekay,

Thanks for the answer!

Just for the record, about the Windows boot:

"If I boot into Windows, both volumes are reported to be OK"

I meant that the Intel utility handling the RAID in Windows reports both volumes as fine, and the NTFS-formatted volume can be used normally.
I have been using the NTFS-formatted volume without any problems, accessing it from both Windows and Linux.

Meanwhile, I did a bit of troubleshooting.
I stopped the arrays and unloaded md_mod, then activated the arrays via "dmraid -ay".
After that, everything seems to work fine. All partitions can be mounted and used.

I tried the manual assembly with and without ARRAY lines in /etc/mdadm.conf. Without the lines I get a bit farther but no luck:

mdadm --examine --scan
ARRAY metadata=imsm UUID=326555f9:6b642086:3aea8a4c:8a1408a4
ARRAY /dev/md/data container=326555f9:6b642086:3aea8a4c:8a1408a4 member=0 UUID=9078550e:639c4cb3:1d3f55e9:2055e548
ARRAY /dev/md/bulk container=326555f9:6b642086:3aea8a4c:8a1408a4 member=1 UUID=957cd0b7:ce6ffb7f:b845f634:0dafa562

mdadm -v --assemble --scan
mdadm: looking for devices for further assembly
mdadm: Cannot assemble mbr metadata on /dev/sdb2
mdadm: no recogniseable superblock on /dev/sdb1
mdadm: no RAID superblock on /dev/sda2
mdadm: no RAID superblock on /dev/sda1
mdadm: /dev/sdb is identified as a member of /dev/md/imsm0, slot -1.
mdadm: /dev/sda is identified as a member of /dev/md/imsm0, slot -1.
mdadm: added /dev/sda to /dev/md/imsm0 as -1
mdadm: added /dev/sdb to /dev/md/imsm0 as -1
mdadm: Container /dev/md/imsm0 has been assembled with 2 drives
mdadm: looking for devices for further assembly
mdadm: looking for devices for further assembly
mdadm: /dev/sdb is busy - skipping
mdadm: /dev/sda is busy - skipping

And the mdstat shows only the container:

cat /proc/mdstat
Personalities : 
md127 : inactive sdb[1](S) sda[0](S)
      6306 blocks super external:imsm
       
unused devices: <none>

The volumes md125 and md126 are not there:

mdadm --misc --detail /dev/md127 
/dev/md127:
        Version : imsm
     Raid Level : container
  Total Devices : 2

Working Devices : 2


           UUID : 326555f9:6b642086:3aea8a4c:8a1408a4
  Member Arrays :

    Number   Major   Minor   RaidDevice

       0       8        0        -        /dev/sda
       1       8       16        -        /dev/sdb

I did not see anything strange in journalctl.
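A small sketch for checking this state from a script: it reads mdstat-formatted text and prints every md device reported as inactive. The function name `inactive_md` is made up for illustration; the line format is the one shown in the /proc/mdstat output above.

```shell
# Print the names of md devices that /proc/mdstat reports as inactive.
# Reads mdstat-formatted text on stdin.
inactive_md() {
    awk '$2 == ":" && $3 == "inactive" { print $1 }'
}

# Usage: inactive_md < /proc/mdstat
```

For the mdstat output quoted above this prints md127, the IMSM container; assembled member arrays would appear as "active" and are not listed.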

Oddly enough, if I allow md-raid.rules to start the arrays at startup, both md125 and md126 show up in /proc/mdstat as clean and "active sync", but I cannot mount any filesystem (and mount/sync get stuck as described in the first post).

My current workaround: I moved the /lib/udev/rules.d/64-md-raid.rules file away to prevent mdadm from being called. Then I can use dmraid.
But this is just a workaround. I don't know what changed in mdadm all of a sudden...
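A hedged alternative to moving the packaged file, which a package upgrade would put back: udev lets a file of the same name in /etc/udev/rules.d take precedence over the one in /lib/udev/rules.d, and a symlink to /dev/null disables the rule. This is a sketch of that approach, not what was actually done here; the function name `mask_udev_rule` is made up for illustration.

```shell
# Mask a packaged udev rule by shadowing it with a symlink to /dev/null.
# $1: rule file name
# $2: override directory (default: /etc/udev/rules.d)
mask_udev_rule() {
    rule=$1
    dir=${2:-/etc/udev/rules.d}
    ln -sfn /dev/null "$dir/$rule"
}

# Usage (as root):  mask_udev_rule 64-md-raid.rules
# Undo:             rm /etc/udev/rules.d/64-md-raid.rules
```

If the rule is also copied into the initramfs, that copy is not affected by this override.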

Thanks again!

Eduardo

teekay · 2012-11-19 12:25:56

It's sort of hard to understand which RAID implementation you're actually using. You're talking about dmraid and an Intel Windows app, which suggests you're using onboard FakeRAID with dmraid; on the other hand, you try to set it up as mdraid with mdadm, which is purely Linux kernel software RAID.

If mdraid is what you'd like to use, try adding the "nodmraid" kernel boot parameter. If you want Windows to be able to access the RAID too, then dmraid is the way to go. Currently you seem to have mdraid on top of dmraid - pretty weird.

"mdadm: /dev/sdb is busy - skipping" most likely means that dmraid has control over it.

ezacaria · 2012-11-19 12:51:49

Hi teekay,

Oops, I did not mean to create more confusion, but thanks for the hints and the good analysis.

- mdadm was working until a couple of days ago. I was not using dmraid at all in Linux because it was not working with my motherboard.
On Windows, I use Intel's application, which I suppose is some sort of Windows fake RAID.

There was always only one thing at a time, or at least I thought so.

Currently you seem to have mdraid on top of dmraid - pretty weird.
"mdadm: /dev/sdb is busy - skipping" most likely means that dmraid has control over it.

Do you mean that "ARRAY /dev/md/..." suggests mdraid has already taken control of the disks?

If you want Windows to be able to access the RAID too, then dmraid is the way to go

I was convinced that mdraid/mdadm and the Windows implementation were coexisting peacefully.
But maybe this was only possible because I created the arrays in Windows.

I can only think that I had blacklisted dmraid somehow earlier, and that is why I could use mdadm normally. This could have been lost somewhere in the migration to systemd.

I'll give the nodmraid kernel parameter a try and post again.

Thanks!

Best regards,

Eduardo

georgem · 2012-11-20 13:32:01

I upgraded my system to systemd yesterday. I boot off an SSD and have /home on a RAID1. I had the same problems that you described. On every boot the RAID would attempt to resync, and any attempt to fsck, mount, etc. would lock up the process. Eventually I got fed up and just changed GRUB back to use /sbin/init instead of systemd. All the problems went away! systemd must be bringing up the RAIDs differently or something. I'm sure we'll figure it out, but at least my system is back up.

ezacaria · 2012-11-20 16:23:20

A pity, really

I tried the nodmraid suggestion from teekay, but I could still not assemble the arrays with mdadm.

I think I will also take a practical approach and go with the workaround (systemd+dmraid) for the time being.
Meanwhile, if somebody is interested in doing further analysis, I can provide traces/logs on demand.

Thanks!

teekay · 2012-11-20 19:26:35

If you haven't done that already, make sure you have mdadm in the HOOKS array in /etc/mkinitcpio.conf

HOOKS="... mdadm"

In case you just added it, run

mkinitcpio -p linux

and reboot.
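A quick way to check the first point before rebuilding the image, as a sketch: the helper below reports whether a given hook appears on the HOOKS line of a mkinitcpio config. The function name `has_hook` is made up for illustration; it accepts both the quoted HOOKS="..." form shown above and a parenthesised HOOKS=(...) form.

```shell
# Succeed if hook $2 is listed on the (last) HOOKS line of config file $1.
has_hook() {
    awk -v hook="$2" '
        /^[[:space:]]*HOOKS=/ {
            line = $0
            sub(/^[[:space:]]*HOOKS=/, "", line)   # drop the variable name
            gsub(/["()]/, " ", line)               # drop quotes and parentheses
            found = 0
            n = split(line, hooks, " ")
            for (i = 1; i <= n; i++)
                if (hooks[i] == hook) found = 1
        }
        END { exit !found }
    ' "$1"
}

# Usage:
#   has_hook /etc/mkinitcpio.conf mdadm || echo "mdadm hook missing"
```

The match is on whole words, so a line containing only mdadm_udev does not count as containing mdadm.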

My arrays all come up fine; in fact, I even boot off md RAID1.

PS: I didn't look closely enough before. You're using Intel IMSM RAID, that's why it works from Windows, too. D'oh.
https://bbs.archlinux.org/viewtopic.php … 8#p1175458

Last edited by teekay (2012-11-20 19:51:57)

ezacaria · 2012-11-21 06:55:28

No luck, I'm afraid

Following the suggestion, I added mdraid to the hooks and ran mkinitcpio.
I uninstalled dmraid just to make sure there was no conflict.

If I comment out the three lines in md-raid.rules as in the post cited by teekay, then I can run mdadm -vAs (same as -v --assemble --scan), but only the container is assembled. The drives are still busy (as before).

If I do not comment out the lines in the md-raid.rules file, then /proc/mdstat shows md125 and md126, but any mount operation gets stuck, with a similar traceback as before.
It does not seem to block the shutdown/reboot, though.

Thanks!

Jasa · 2012-11-21 08:07:45

I had issues with onboard RAID0 and GRUB2: it could not access the partitions even though the installation worked, and an additional "dmraid -ay" was required to make the partitions inside the RAID area (GPT) visible. Perhaps an alternative way of using GRUB2 could work there, instead of using that 2MB partition, or one of the other solutions available.

With mdadm, things work as intended, or so it appears. There was also a guide linked in the installation section that suggests using the "mdadm_udev" hook in the mkinitcpio configuration file.
The only slight oddity I have noticed is that /dev/md0, as used during installation, changes into /dev/md127 once the system is installed, even when fstab or the like points to another device. I suspect the remounting of the filesystem at boot could cause a similar issue, and it might need fixing when some important package is upgraded.

mash · 2012-12-08 02:32:21

I have a very similar issue.
Using an Intel Matrix RAID5 with mdadm and systemd, after mounting the NTFS-formatted device it first gets slower and slower, then it's totally stuck. Unmounting or remounting read-only doesn't work either.

journalctl:

# mounting
Dec 08 02:23:35 s0nne ntfs-3g[1276]: Version 2012.1.15 external FUSE 29
Dec 08 02:23:35 s0nne ntfs-3g[1276]: Mounted /dev/md126p2 (Read-Write, label "sol1", NTFS 3.1)
Dec 08 02:23:35 s0nne ntfs-3g[1276]: Cmdline options: rw,nodev,nosuid,uid=1000,gid=100,dmask=0077,fmask=0177,uhelper=udisks2
Dec 08 02:23:35 s0nne ntfs-3g[1276]: Mount options: rw,nodev,nosuid,uhelper=udisks2,allow_other,nonempty,relatime,fsname=/dev/md126p2,blkdev,blksize=4096,default_permissions
Dec 08 02:23:35 s0nne ntfs-3g[1276]: Global ownership and permissions enforced, configuration type 7
Dec 08 02:23:35 s0nne udisksd[791]: Mounted /dev/md126p2 at /run/media/mash/sol1 on behalf of uid 1000
Dec 08 02:24:05 s0nne kernel: xhci_hcd 0000:00:14.0: WARN Event TRB for slot 1 ep 4 with no TDs queued? # Maybe this is relevant?
Dec 08 02:24:05 s0nne kernel: xhci_hcd 0000:00:14.0: WARN Event TRB for slot 1 ep 1 with no TDs queued?
Dec 08 02:24:05 s0nne kernel: xhci_hcd 0000:00:14.0: WARN Event TRB for slot 1 ep 0 with no TDs queued?

# the error message
Dec 08 02:26:55 s0nne kernel: INFO: task flush-9:126:1290 blocked for more than 120 seconds.
Dec 08 02:26:55 s0nne kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 08 02:26:55 s0nne kernel: flush-9:126     D ffff880810f15d80     0  1290      2 0x00000000
Dec 08 02:26:55 s0nne kernel:  ffff8807db40f6b0 0000000000000046 ffff88080fe90810 ffff8807db40ffd8
Dec 08 02:26:55 s0nne kernel:  ffff8807db40ffd8 ffff8807db40ffd8 ffff880812cbd8b0 ffff88080fe90810
Dec 08 02:26:55 s0nne kernel:  ffff8808094b8fd8 ffff8808094b8fc0 0000000000000000 0000000000000003
Dec 08 02:26:55 s0nne kernel: Call Trace:
Dec 08 02:26:55 s0nne kernel:  [<ffffffff8108bbe2>] ? default_wake_function+0x12/0x20
Dec 08 02:26:55 s0nne kernel:  [<ffffffff8107a4e6>] ? autoremove_wake_function+0x16/0x40
Dec 08 02:26:55 s0nne kernel:  [<ffffffff81082995>] ? __wake_up_common+0x55/0x90
Dec 08 02:26:55 s0nne kernel:  [<ffffffff8125a4c3>] ? cpumask_next_and+0x23/0x40
Dec 08 02:26:55 s0nne kernel:  [<ffffffff81491de9>] schedule+0x29/0x70
Dec 08 02:26:55 s0nne kernel:  [<ffffffffa05a8ed5>] md_write_start+0xb5/0x1a0 [md_mod]
Dec 08 02:26:55 s0nne kernel:  [<ffffffff8107a4d0>] ? abort_exclusive_wait+0xb0/0xb0
Dec 08 02:26:55 s0nne kernel:  [<ffffffffa03499ab>] make_request+0x3b/0x490 [raid456]
Dec 08 02:26:55 s0nne kernel:  [<ffffffff8123ed57>] ? create_task_io_context+0x27/0x110
Dec 08 02:26:55 s0nne kernel:  [<ffffffffa05a3bcc>] md_make_request+0xfc/0x240 [md_mod]
Dec 08 02:26:55 s0nne kernel:  [<ffffffff8111f2d5>] ? mempool_alloc_slab+0x15/0x20
Dec 08 02:26:55 s0nne kernel:  [<ffffffff81239b92>] generic_make_request+0xc2/0x110
Dec 08 02:26:55 s0nne kernel:  [<ffffffff81239c67>] submit_bio+0x87/0x110
Dec 08 02:26:55 s0nne kernel:  [<ffffffff811b59ea>] ? bio_alloc_bioset+0x5a/0xf0
Dec 08 02:26:55 s0nne kernel:  [<ffffffff811aff24>] submit_bh+0xf4/0x130
Dec 08 02:26:55 s0nne kernel:  [<ffffffff811b34a8>] __block_write_full_page+0x208/0x3a0
Dec 08 02:26:55 s0nne kernel:  [<ffffffff811b0c80>] ? end_buffer_async_read+0x200/0x200
Dec 08 02:26:55 s0nne kernel:  [<ffffffff811b7b40>] ? blkdev_get_blocks+0xd0/0xd0
Dec 08 02:26:55 s0nne kernel:  [<ffffffff811b7b40>] ? blkdev_get_blocks+0xd0/0xd0
Dec 08 02:26:55 s0nne kernel:  [<ffffffff811b0c80>] ? end_buffer_async_read+0x200/0x200
Dec 08 02:26:55 s0nne kernel:  [<ffffffff811b3726>] block_write_full_page_endio+0xe6/0x130
Dec 08 02:26:55 s0nne kernel:  [<ffffffff811b3785>] block_write_full_page+0x15/0x20
Dec 08 02:26:55 s0nne kernel:  [<ffffffff811b7098>] blkdev_writepage+0x18/0x20
Dec 08 02:26:55 s0nne kernel:  [<ffffffff8112696a>] __writepage+0x1a/0x50
Dec 08 02:26:55 s0nne kernel:  [<ffffffff81126e32>] write_cache_pages+0x1f2/0x4e0
Dec 08 02:26:55 s0nne kernel:  [<ffffffff81126950>] ? global_dirtyable_memory+0x40/0x40
Dec 08 02:26:55 s0nne kernel:  [<ffffffff8112716d>] generic_writepages+0x4d/0x70
Dec 08 02:26:55 s0nne kernel:  [<ffffffff81128981>] do_writepages+0x21/0x50
Dec 08 02:26:55 s0nne kernel:  [<ffffffff811a90bb>] __writeback_single_inode.isra.31+0x3b/0x190
Dec 08 02:26:55 s0nne kernel:  [<ffffffff811a989a>] writeback_sb_inodes+0x2ba/0x4a0
Dec 08 02:26:55 s0nne kernel:  [<ffffffff811a9b1f>] __writeback_inodes_wb+0x9f/0xd0
Dec 08 02:26:55 s0nne kernel:  [<ffffffff811a9e63>] wb_writeback+0x313/0x340
Dec 08 02:26:55 s0nne kernel:  [<ffffffff811aa978>] wb_do_writeback+0x258/0x260
Dec 08 02:26:55 s0nne kernel:  [<ffffffff811aaa13>] bdi_writeback_thread+0x93/0x2d0
Dec 08 02:26:55 s0nne kernel:  [<ffffffff811aa980>] ? wb_do_writeback+0x260/0x260
Dec 08 02:26:55 s0nne kernel:  [<ffffffff81079a03>] kthread+0x93/0xa0
Dec 08 02:26:55 s0nne kernel:  [<ffffffff8149b3c4>] kernel_thread_helper+0x4/0x10
Dec 08 02:26:55 s0nne kernel:  [<ffffffff81079970>] ? kthread_freezable_should_stop+0x70/0x70
Dec 08 02:26:55 s0nne kernel:  [<ffffffff8149b3c0>] ? gs_change+0x13/0x13

mash · 2012-12-09 22:14:13

Which configuration parameters should I check to be sure that every last bit of dmraid was removed?

The only other solution I can think of is to install Arch again from scratch...

mogwai · 2013-10-26 12:06:04

Hi,

It's about a year later, and I seem to be having exactly the same problem as the others. After a new install of Arch, I cannot mount the NTFS partition which is on an Intel Matrix RAID1 array (using 2 disks). Mounting through 'mount' or 'ntfs-3g' gives similar errors to what was already reported by others:

okt 21 18:56:27 ldmos kernel: INFO: task mount.ntfs-3g:4602 blocked for more than 120 seconds.
okt 21 18:56:27 ldmos kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
okt 21 18:56:27 ldmos kernel: mount.ntfs-3g   D ffff880128af5540     0  4602      1 0x00000004
okt 21 18:56:27 ldmos kernel:  ffff880110703cb8 0000000000000082 0000000000014500 ffff880110703fd8
okt 21 18:56:27 ldmos kernel:  ffff880110703fd8 0000000000014500 ffff8800ab4a4380 ffff8801274dc9d8
okt 21 18:56:27 ldmos kernel:  ffff8801274dc9c0 0000000000000000 0000000000000003 ffff880110703c18
okt 21 18:56:27 ldmos kernel: Call Trace:
okt 21 18:56:27 ldmos kernel:  [<ffffffff81093b52>] ? default_wake_function+0x12/0x20
okt 21 18:56:27 ldmos kernel:  [<ffffffff810847b2>] ? autoremove_wake_function+0x12/0x40
okt 21 18:56:27 ldmos kernel:  [<ffffffff8108c5e8>] ? __wake_up_common+0x58/0x90
okt 21 18:56:27 ldmos kernel:  [<ffffffff8108ee44>] ? __wake_up+0x44/0x50
okt 21 18:56:27 ldmos kernel:  [<ffffffff814e0f29>] schedule+0x29/0x70
okt 21 18:56:27 ldmos kernel:  [<ffffffffa07b9d75>] md_write_start+0xb5/0x1a0 [md_mod]
okt 21 18:56:27 ldmos kernel:  [<ffffffff810847a0>] ? wake_up_atomic_t+0x30/0x30
okt 21 18:56:27 ldmos kernel:  [<ffffffffa007ece6>] make_request+0x46/0xbf0 [raid1]
okt 21 18:56:27 ldmos kernel:  [<ffffffff8113cf1a>] ? write_cache_pages+0x16a/0x510
okt 21 18:56:27 ldmos kernel:  [<ffffffff81132c0a>] ? find_get_pages_tag+0xea/0x180
okt 21 18:56:27 ldmos kernel:  [<ffffffffa07b599c>] md_make_request+0xec/0x290 [md_mod]
okt 21 18:56:27 ldmos kernel:  [<ffffffff811351b5>] ? mempool_alloc_slab+0x15/0x20
okt 21 18:56:27 ldmos kernel:  [<ffffffff81263a82>] generic_make_request+0xc2/0x110
okt 21 18:56:27 ldmos kernel:  [<ffffffff81263b43>] submit_bio+0x73/0x160
okt 21 18:56:27 ldmos kernel:  [<ffffffff811d5866>] ? bio_alloc_bioset+0x196/0x2a0
okt 21 18:56:27 ldmos kernel:  [<ffffffff81266c57>] blkdev_issue_flush+0x97/0xe0
okt 21 18:56:27 ldmos kernel:  [<ffffffff811d6e55>] blkdev_fsync+0x35/0x50
okt 21 18:56:27 ldmos kernel:  [<ffffffff811cde16>] do_fsync+0x56/0x80
okt 21 18:56:27 ldmos kernel:  [<ffffffff811a0519>] ? SyS_write+0x49/0xa0
okt 21 18:56:27 ldmos kernel:  [<ffffffff811ce0a0>] SyS_fsync+0x10/0x20
okt 21 18:56:27 ldmos kernel:  [<ffffffff814ea4dd>] system_call_fastpath+0x1a/0x1f

When I try to reboot or halt after attempting this mount, my system hangs and I have to do a hard reset.

ezacaria · 2013-10-28 16:15:12

Unfortunately, I cannot comment on this anymore, as I decided to disassemble the RAID 1 array.
For several months I was getting by with the old dmraid. One annoyance is that dmeventd started to block the shutdown at some point, for no apparent reason (even when everything was synced and unmounted). However, killing it before sending the shutdown command seemed to do no harm and worked.

I finally decided that whatever benefit I get from the disk mirroring protection is not worth the hassle, partly because the Intel controller on my motherboard is not particularly high-performing. Now I am using the disks separately, and taking care to make backups to an external RAID 5 array. One day I will probably replace the disks with server-grade drives to decrease the chances of drive failure. Or maybe I will buy a real RAID controller that has stable drivers on both sides. Anyway, that is goodbye to both mdadm and dmraid for me, at least for the time being.

Hopefully someone will have a solution for this, in case you really need Windows compatibility in the array.

mogwai · 2013-10-29 09:56:52

After a lot of trying and searching I found the solution to this problem!

The background is this:
mdadm (upstream versions before 3.3-1) forks a process called mdmon, which it uses to get the metadata information of the RAID array. The 'problem' is that systemd/udev consider this to be a 'rogue' process (i.e. it was not spawned by systemd itself) and therefore kill it immediately. When the RAID array is subsequently accessed, the kernel driver attempts to communicate with the mdmon process that mdadm spawned. Since systemd killed it off, the kernel driver keeps on waiting, causing the reported hangs.
See also these external links in which the problem is discussed:
https://bugzilla.redhat.com/show_bug.cgi?id=873576
http://www.spinics.net/lists/raid/msg42709.html
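To see whether a system is in the situation described above, a rough sketch: list the md devices whose metadata is handled from user space (the kernel's sysfs attribute metadata_version starts with "external:" for those, as with the "super external:imsm" container shown earlier in the thread), then check whether an mdmon process is running. The function name `external_md` is made up for illustration.

```shell
# List md devices with externally managed metadata ("external:..."),
# i.e. the ones that depend on a running mdmon.
# $1: sysfs root (default: /sys)
external_md() {
    root=${1:-/sys}
    for f in "$root"/block/md*/md/metadata_version; do
        [ -r "$f" ] || continue
        case $(cat "$f") in
            external:*) basename "$(dirname "$(dirname "$f")")" ;;
        esac
    done
}

# Usage:
#   external_md          # e.g. the IMSM container and its member arrays
#   pgrep -l mdmon       # no output here suggests mdmon is not running
```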

In the upstream version of mdadm 3.3-1 this problem is solved by using systemd to spawn the mdmon helper process (instead of forking). See git-commit: https://github.com/neilbrown/mdadm/comm … 96655450c5.
However, in order to use systemd spawning instead of forking, the package needs to be built with the extra make target 'install-systemd'. This is currently not included in the Arch package mdadm_3.3-1. I have filed a bug report to include this extra target: see https://bugs.archlinux.org/task/37537.

BTW, this extra make target installs only one file, mdmon@.service, into /lib/systemd/system. As a temporary workaround, you can manually copy this file from the mdadm source code into that directory. This should solve all the problems mentioned above. It did in my case!

Last edited by mogwai (2013-10-29 10:14:36)

graysky · 2013-10-29 10:20:33

You should open a ticket against the affected Arch packages so the Arch maintainers are aware of this and can fix it.

mogwai · 2013-10-29 11:41:39

As mentioned in my previous post, I have already filed a bug report: https://bugs.archlinux.org/task/37537

graysky · 2013-10-29 13:27:28

Sorry, reading this on a phone so I missed that. Thanks for giving back to the community by getting this into the hands of those empowered to change it.

ezacaria · 2013-10-29 20:12:29

Excellent! Thank you very much for the solid work.
I will mark the thread as solved.
