
#1 2024-02-27 15:14:55

smis001
Member
Registered: 2024-02-01
Posts: 4

Many dirty writes, high CPU usage and nearly no writeback

Hi,

I hope this is the right subforum for my question here.

While there might be more than one way to trigger this behaviour, by far the easiest is to install a larger batch of updates (or reinstall a reasonable amount, ~1 GB, of packages). While the packages are being installed, everything appears to end up as dirty pages. Once the dirty ratio has been reached (about 1.8 GB according to "grep -e 'Dirty' /proc/meminfo"), installation comes to a screeching halt and takes ages from then on. The dirty pages drain at less than ~500 kB/s* and a whole CPU core is busy with a kworker thread until they are gone. I had this behaviour on my last SSD and also on its new NVMe replacement.
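For reference, this is roughly how I watch the counters (a quick monitoring sketch; Dirty and Writeback are the standard /proc/meminfo field names):

```shell
# Poll the kernel's dirty-page counters every 2 seconds; Ctrl-C to stop.
while sleep 2; do
    grep -E '^(Dirty|Writeback):' /proc/meminfo
    echo '---'
done
```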

My disk setup is a single Samsung NVMe SSD -> LUKS -> LVM -> ext4

I have tried fumbling around with allow-discards, fstrim, block size, dirty ratio and tune2fs (fast commit). All to little avail, but I wouldn't mind systematically going through all of that (and more) again.
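For the dirty-ratio part, the knobs I've been poking are the standard vm.* sysctls; a sketch of switching from percentage-based to absolute byte limits (the 256 MiB / 64 MiB values are only illustrative, not a recommendation):

```shell
# Show the current percentage-based thresholds.
sysctl vm.dirty_ratio vm.dirty_background_ratio
# Switch to absolute byte limits so the threshold doesn't scale with RAM
# (setting *_bytes automatically zeroes the corresponding *_ratio).
sudo sysctl vm.dirty_bytes=$((256 * 1024 * 1024))
sudo sysctl vm.dirty_background_bytes=$((64 * 1024 * 1024))
```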

I should note that I haven't found another use case that triggers the problem yet. Pacman has no problems downloading the files, and neither has Steam - both go a lot faster than 250 kB/s and there are no dirty writes to speak of. LUKS reencryption runs at about 500 MB/s.

I'm happy about any hints, help or whatever gets me closer to a solution.

* This is eyeballed by looking at /proc/meminfo for a few minutes.

Last edited by smis001 (2024-02-27 15:27:22)


#2 2024-02-27 20:26:41

xerxes_
Member
Registered: 2018-04-29
Posts: 1,057

Re: Many dirty writes, high CPU usage and nearly no writeback

What if you disable LVM or/and LUKS ? Does it still happen ?


#3 2024-02-28 14:41:10

smis001
Member
Registered: 2024-02-01
Posts: 4

Re: Many dirty writes, high CPU usage and nearly no writeback

Hmm, for me this would more or less mean reinstalling a production system. Setting up a second system that is identical enough might be hard (I can't copy-paste to an unencrypted system). But maybe it doesn't need to be identical – because if it doesn't have the problem, I can start looking for differences. That just gave me an idea: I have a working machine (with much more RAM, though) with a similar setup.

Ok, so I will try the following now:
1. Boot the drive from the "known working machine" in the machine that currently has the problem, just to rule out hardware issues and because it's quick to do.
2. Set up a system with LUKS but no LVM on a different drive.
3. Set up a system with LVM but no LUKS on a different drive.
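In case it helps anyone reproduce this without running pacman: a hypothetical stand-in workload is just a pile of buffered writes followed by a sync (paths and sizes are made up, not from my actual test):

```shell
# Write ~2 GB of buffered data without fsync, then time how long the kernel
# needs to flush it all; on an affected machine the sync should crawl.
mkdir -p /tmp/writetest && cd /tmp/writetest
for i in $(seq 1 200); do
    dd if=/dev/zero of="file$i" bs=1M count=10 status=none
done
time sync
rm -r /tmp/writetest
```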

I will report back tomorrow.


#4 2024-02-29 08:38:19

smis001
Member
Registered: 2024-02-01
Posts: 4

Re: Many dirty writes, high CPU usage and nearly no writeback

Ok, change in plans.

The drive from the "known good" machine works without the issue when I boot it in the "bad" machine. Now, if I install a fresh system on a different drive, I will likely not make the same "mistake" (or whatever it is) that causes my issue. So instead I will start diffing the hell out of these two machines. I'm still grateful for any hints, though.


#5 2024-02-29 10:45:49

smis001
Member
Registered: 2024-02-01
Posts: 4

Re: Many dirty writes, high CPU usage and nearly no writeback

Current status: the ext4 partition was not using the 64bit feature, unlike my reference machine, so I converted it to 64bit via "resize2fs -b". There were also some slightly different mount options, which I changed to match the working machine. Both to no avail. Out of curiosity I also ran "e4defrag -c -v", with no measurable effect.
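For anyone following along: the 64bit conversion has to be done on an unmounted filesystem and wants an fsck first (the device name below is just a placeholder for my root LV):

```shell
# Convert an unmounted ext4 filesystem to the 64bit feature.
e2fsck -f /dev/mapper/vg-root
resize2fs -b /dev/mapper/vg-root
# Confirm the feature list now contains "64bit".
tune2fs -l /dev/mapper/vg-root | grep 'Filesystem features'
```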

On the ext4 side the mount options are now identical, and I cannot see any differences apart from runtime and size differences in "tune2fs -l". OK, I did enable fast_commit as a desperate attempt to improve things, which didn't do anything.
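fast_commit itself can be toggled offline with tune2fs (device name again a placeholder; as I understand it, this needs a reasonably recent kernel, >= 5.10, and e2fsprogs >= 1.46):

```shell
# Enable the fast_commit journal feature on an unmounted ext4 filesystem,
# then verify it shows up in the feature list.
tune2fs -O fast_commit /dev/mapper/vg-root
tune2fs -l /dev/mapper/vg-root | grep -o fast_commit
```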

On the LVM side I inspected /etc/lvm and found no differences in the config files. I am unsure whether there is anything in /etc/lvm/archive, /etc/lvm/cache or /etc/lvm/backup that I should check, though; for now I ignored those three directories.
I also inspected "pvdisplay -v -m", "lvdisplay -v -m" and "vgdisplay -v". Obviously there are size differences. Also, on the bad machine the swap volume comes first and then the root volume; on the good machine it is the other way around.
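The way I compared the two machines was basically dumping the LVM views to files and diffing them (the hostnames in the file names are placeholders):

```shell
# Capture this machine's LVM topology in one file...
{ pvdisplay -v -m; vgdisplay -v; lvdisplay -v -m; } > "/tmp/lvm-$(hostname).txt"
# ...copy the other machine's dump over, then compare.
diff /tmp/lvm-goodhost.txt /tmp/lvm-badhost.txt
```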

I also checked "cryptsetup luksDump" -> no differences (no flags set). I took the liberty of enabling --perf-no_read_workqueue and --perf-no_write_workqueue, which may have had a minuscule impact. At least I can now see tiny amounts of data (0-500 kB, polled every 2 seconds) in Writeback.
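For the record, those workqueue flags can also be stored permanently in the LUKS2 header so they apply on every unlock (needs cryptsetup >= 2.3.4; "cryptroot" and the partition path are placeholders for my setup):

```shell
# Re-open the existing mapping with the flags and persist them in the header.
cryptsetup refresh cryptroot \
    --perf-no_read_workqueue --perf-no_write_workqueue --persistent
# The flags should now show up in the header dump.
cryptsetup luksDump /dev/nvme0n1p2 | grep '^Flags'
```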

I also verified that my SSD (Samsung 980 PRO) has the latest firmware installed (it did).
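For completeness, the firmware revision can be read with nvme-cli (device path is a placeholder):

```shell
# Print the controller's firmware revision field ("fr") from the identify data.
nvme id-ctrl /dev/nvme0 | grep '^fr '
```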

