Hello archlinux community!
I have been running Arch on my laptop for quite some time, but I'm still improving the configuration here and there. The root and home partitions are on the primary disk (SSD). I have another backup partition on a secondary disk in the HDD caddy that I don't use very often. I want the secondary HDD to spin down when I'm not using it, so I read about APM levels and hdparm here. The issue I have now is that no matter what I try, some processes get stuck waiting on IO after wake-up from suspend. Of course, I tried to find solutions to this problem, but I couldn't find anything IO-specific.
Several things happened. First, I noticed that the APM level seems to reset after wake-up from suspend. The hdparm ArchWiki page suggests creating a systemd unit to run hdparm on wake-up, but that didn't work very well. At this point I noticed a huge load average (my CPU has 4 cores, and the load average is currently 8) even though the CPU is idle, memory usage is low, and the system is not using swap. The reason for such a high load average seems to be the "zombie" processes (strictly speaking, processes stuck in uninterruptible sleep, which Linux counts toward the load average).
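To check whether a high load average comes from blocked tasks rather than CPU work, one option is to read /proc/loadavg and compare the 1-minute load with the number of currently runnable tasks. The sketch below assumes the usual five-field layout of /proc/loadavg; the function names `parse_loadavg` and `load_per_core` are made up for illustration.

```python
import os


def parse_loadavg(text):
    """Parse the contents of /proc/loadavg into a dict.

    Expected layout: "load1 load5 load15 runnable/total last_pid".
    """
    fields = text.split()
    runnable, total = fields[3].split("/")
    return {
        "load1": float(fields[0]),
        "load5": float(fields[1]),
        "load15": float(fields[2]),
        "runnable": int(runnable),  # tasks runnable right now
        "total": int(total),        # all scheduling entities
    }


def load_per_core(load1, cores):
    """Return the 1-minute load divided by the number of cores."""
    return load1 / cores


# Only touch /proc when it is actually available (Linux).
if __name__ == "__main__" and os.path.exists("/proc/loadavg"):
    with open("/proc/loadavg") as f:
        info = parse_loadavg(f.read())
    cores = os.cpu_count() or 1
    print("1-minute load per core:", load_per_core(info["load1"], cores))
    print("runnable right now:", info["runnable"])
```

If the load stays around 8 while only one or two tasks are runnable, the rest is most likely tasks in uninterruptible sleep, which would match an idle CPU with a high load average.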
The first zombie that I found was laptop-mode. I tried disabling my APM systemd unit and letting only laptop-mode manage the APM level, but that didn't help. Currently both my systemd unit and the laptop-mode-tools HDD power management are disabled, but I still get the zombies on wake-up. Now it's the udisks2 daemon. I can hear the secondary HDD spinning, but when I try to mount any of its partitions, run dd, or issue another hdparm command manually, all these processes again get stuck waiting on IO (the D process state in htop), and the load average rises with each of them.
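The D state that htop shows can also be read directly from /proc, which makes it easy to log which processes are stuck after each wake-up. This is a minimal sketch, assuming the standard /proc/<pid>/stat layout; `parse_stat` and `blocked_tasks` are hypothetical helper names, and only the main thread of each process is checked (individual threads live under /proc/<pid>/task/ and are not scanned here).

```python
import os


def parse_stat(line):
    """Return (pid, comm, state) from one /proc/<pid>/stat line.

    comm is wrapped in parentheses and may itself contain spaces or
    parentheses, so split on the LAST ')' instead of on whitespace.
    """
    left, right = line.rsplit(")", 1)
    pid_text, comm = left.split("(", 1)
    state = right.split()[0]
    return int(pid_text), comm, state


def blocked_tasks(proc_root="/proc"):
    """List (pid, comm) for processes whose main thread is in state D."""
    found = []
    if not os.path.isdir(proc_root):
        return found
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError):
            continue  # process exited or stat was unreadable
        if state == "D":
            found.append((pid, comm))
    return found


if __name__ == "__main__":
    for pid, comm in blocked_tasks():
        print(pid, comm)
```

Running this right after resume, and again after a failed mount or hdparm call, should show the list growing by one entry per stuck command.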
This issue has been with me for about two weeks. I can't even go back to the previous state where the HDD doesn't spin down and udisks doesn't get stuck. A few things changed around the time the problem appeared. I installed hdparm, so laptop-mode possibly started doing HDD power management only after that. There were a few (at least 2, I think) kernel updates. I am running linux-lts. I am willing to spend some time figuring this out, but I am running out of ideas. I will also leave a warning from dmesg here, but it just confirms that the issue is IO-related. Maybe you will see something interesting in it. Anyway, I hope we can figure out together how to fix this. Thanks!
[47782.876882] INFO: task pool:17503 blocked for more than 120 seconds.
[47782.876890] Tainted: G O 4.14.22-1-lts #1
[47782.876892] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[47782.876895] pool D 0 17503 1 0x00000000
[47782.876900] Call Trace:
[47782.876913] ? __schedule+0x294/0x8a0
[47782.876920] ? work_busy+0xa0/0xa0
[47782.876923] schedule+0x28/0x80
[47782.876927] schedule_timeout+0x1f6/0x360
[47782.876932] ? check_preempt_wakeup+0x102/0x230
[47782.876936] ? work_busy+0xa0/0xa0
[47782.876939] ? wait_for_completion+0xba/0x140
[47782.876943] wait_for_completion+0xba/0x140
[47782.876947] ? wake_up_q+0x70/0x70
[47782.876950] flush_work+0x13a/0x1c0
[47782.876954] ? worker_detach_from_pool+0xa0/0xa0
[47782.876959] __cancel_work_timer+0x123/0x1b0
[47782.876963] ? disk_map_sector_rcu+0x70/0x70
[47782.876969] ? kobj_lookup+0x113/0x160
[47782.876972] disk_block_events+0x78/0x90
[47782.876979] __blkdev_get+0x63/0x440
[47782.876982] blkdev_get+0x11d/0x300
[47782.876986] ? bd_acquire+0xd0/0xd0
[47782.876991] do_dentry_open+0x1b0/0x2d0
[47782.876995] ? __inode_permission+0x85/0xc0
[47782.876998] path_openat+0x4f9/0x12f0
[47782.877001] ? enqueue_entity+0x740/0x780
[47782.877004] ? select_idle_sibling+0x26/0x410
[47782.877007] ? check_preempt_curr+0x7e/0x90
[47782.877010] ? try_to_wake_up+0x54/0x480
[47782.877013] do_filp_open+0x9b/0x110
[47782.877017] ? __blkdev_put+0x1c0/0x1f0
[47782.877021] ? __check_object_size+0xaf/0x1b0
[47782.877025] ? do_sys_open+0x1bd/0x250
[47782.877028] do_sys_open+0x1bd/0x250
[47782.877033] do_syscall_64+0x67/0x120
[47782.877037] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[47782.877040] RIP: 0033:0x7ff16ce43390
[47782.877042] RSP: 002b:00007ff169e7aac0 EFLAGS: 00000293 ORIG_RAX: 0000000000000101
[47782.877045] RAX: ffffffffffffffda RBX: 0000557f0dd32120 RCX: 00007ff16ce43390
[47782.877046] RDX: 0000000000000800 RSI: 0000557f0dd5d040 RDI: ffffffffffffff9c
[47782.877048] RBP: 00007ff160007730 R08: 0000000000000000 R09: 00646975752f6d64
[47782.877049] R10: 0000000000000000 R11: 0000000000000293 R12: 00007ff169e7acc0
[47782.877051] R13: 00007ff169e7abe8 R14: 0000000000000000 R15: 00007ff160007730
[51346.632475] pci_bus 0000:01: Allocating resources
[51346.632518] pci_bus 0000:02: Allocating resources
[51346.632601] pci_bus 0000:03: Allocating resources
[51346.632734] pci_bus 0000:08: Allocating resources
[51346.649596] PM: suspend entry (deep)
[51346.649598] PM: Syncing filesystems ...
[55606.089030] pci_bus 0000:01: Allocating resources
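In the trace above, the blocked task appears to be waiting inside an open() of a block device (do_sys_open, then blkdev_get, disk_block_events and flush_work). To keep track of which tasks the kernel reports as hung across several suspend cycles, the "blocked for more than N seconds" lines can be pulled out of the dmesg text. The sketch below assumes the message format shown in the log above; `HUNG_TASK` and `hung_tasks` are illustrative names.

```python
import re

# Matches lines such as:
# [47782.876882] INFO: task pool:17503 blocked for more than 120 seconds.
HUNG_TASK = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+INFO: task (?P<comm>.+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds\."
)


def hung_tasks(dmesg_text):
    """Return one dict per hung-task report found in dmesg output."""
    reports = []
    for line in dmesg_text.splitlines():
        match = HUNG_TASK.search(line)
        if match:
            reports.append({
                "timestamp": float(match.group("ts")),
                "comm": match.group("comm"),
                "pid": int(match.group("pid")),
                "seconds": int(match.group("secs")),
            })
    return reports
```

Passing the saved output of dmesg to `hung_tasks` gives a compact list of task names and PIDs that can be compared before and after removing a package.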
Removing aur/laptop-mode-tools definitely helped. I still have extra/udisks2 because it's a dependency of some other packages I need. Next I will remove udisks2 and install just laptop-mode-tools to check whether the problem is with laptop-mode-tools alone or whether it's really the combination of the two that causes trouble. Any other suggestions on how to debug this?