Hi,
I hope this is the right subforum for my question here.
While there might be more than one way to trigger this behaviour, by far the easiest is to install a larger batch of updates (or reinstall a reasonable amount, around 1 GB, of packages). While the packages are being installed, everything appears to end up as dirty writes. Once the dirty ratio has been reached (at about 1.8 GB according to "cat /proc/meminfo | grep -e 'Dirty'") everything comes to a screeching halt and takes ages from then on. The Dirty value then drains at less than about 500 kB/s*, and a whole CPU core is busy with a kworker thread until the dirty pages are gone. I had this behaviour on my last SSD and also on the new NVMe replacement.
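For reference, I am watching the counters and the current thresholds roughly like this (nothing exotic):

watch -n 2 "grep -E 'Dirty|Writeback' /proc/meminfo"                                        # dirty/writeback pages, in kB
sysctl vm.dirty_ratio vm.dirty_background_ratio vm.dirty_bytes vm.dirty_background_bytes    # current writeback thresholds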
My disk setup is a single Samsung NVMe SSD -> LUKS -> LVM -> ext4
I have tried fumbling around with allow-discards, fstrim, block size, the dirty ratio and tune2fs (fast commit), all to little avail, but I wouldn't mind systematically going through all of that (and more) again.
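For the record, most of these attempts looked roughly like this (the mapping name "cryptroot" and device path /dev/mapper/vg-root are placeholders for my setup; the values were just experiments):

sysctl -w vm.dirty_ratio=10 vm.dirty_background_ratio=5        # temporary, reverts on reboot
fstrim -v /                                                    # manual trim of the root fs
cryptsetup --allow-discards --persistent refresh cryptroot     # pass discards through LUKS2
tune2fs -O fast_commit /dev/mapper/vg-root                     # needs a recent e2fsprogs and an unmounted fs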
I should note that I haven't found another use case that triggers the problem yet. Pacman has no problem downloading the files and neither has Steam - both go a lot faster than 250 kB/s and there are no dirty writes to speak of. A LUKS reencrypt runs at about 500 MB/s.
I'm happy about any hints, help or whatever gets me closer to a solution.
* This is eyeballed by looking at /proc/meminfo for a few minutes.
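If someone wants a slightly less hand-wavy number, a crude loop like this prints the change per interval:

prev=$(awk '/^Dirty:/ {print $2}' /proc/meminfo)
while sleep 5; do
    cur=$(awk '/^Dirty:/ {print $2}' /proc/meminfo)
    echo "Dirty: ${cur} kB, changed by $((cur - prev)) kB in 5 s"
    prev=$cur
done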
Last edited by smis001 (2024-02-27 15:27:22)
Offline
What if you disable LVM and/or LUKS? Does it still happen?
Offline
Hmm, for me this would more or less mean a reinstallation of a production system. Setting up a second system that is identical enough might be hard (I can't simply copy it over to an unencrypted system). But maybe it doesn't need to be identical – because if it doesn't have the problem, I can start looking for differences. Now that just gave me an idea: I have a working machine (with much more RAM though) with a similar setup.
Ok, so I will try the following now:
1. Take the drive from the "known working" machine and boot it in the machine that currently has the problem, just to rule out hardware issues and because it is quick to do.
2. Set up a system with LUKS but no LVM on a different drive.
3. Set up a system with LVM but no LUKS on a different drive (rough commands for 2 and 3 sketched below).
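A rough sketch of what I have in mind for 2 and 3 (assuming the spare drive shows up as /dev/sdX with one partition; names are placeholders, adjust to taste):

# 2) LUKS without LVM
cryptsetup luksFormat /dev/sdX1
cryptsetup open /dev/sdX1 testcrypt
mkfs.ext4 /dev/mapper/testcrypt

# 3) LVM without LUKS
pvcreate /dev/sdX1
vgcreate testvg /dev/sdX1
lvcreate -l 100%FREE -n root testvg
mkfs.ext4 /dev/testvg/root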
I will report back tomorrow.
Offline
OK, change of plans.
The drive from the "known good" machine works without the issue when I boot it in the "bad" machine. Now, if I install a fresh system on a different drive, I will likely not repeat the same "mistake" (or whatever it is) that causes my issue. So instead I will start diffing the hell out of these two machines. I'm still grateful for any hints, though.
Offline
Current status: the ext4 partition was not using the 64bit feature like my reference machine, so I converted it to 64bit via "resize2fs -b". There were also some slightly different mount options, which I changed to match the working machine. Both to no avail. I also ran "e4defrag -c -v" out of curiosity, with no measurable effect.
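For completeness, the 64bit conversion went roughly like this (the filesystem must be unmounted, so I did it from a live system; the device name is a placeholder):

e2fsck -f /dev/mapper/vg-root                      # resize2fs wants a freshly checked fs
resize2fs -b /dev/mapper/vg-root                   # convert to the 64bit feature
tune2fs -l /dev/mapper/vg-root | grep -o 64bit     # verify the feature is now set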
On the ext4 side the mount options are now identical, and apart from runtime and size differences I cannot see anything different in "tune2fs -l". OK, I did enable fast_commit as a desperate attempt to improve things, which didn't do anything.
On the LVM side I inspected /etc/lvm and found no differences in the config files. I am unsure whether there is anything in /etc/lvm/archive, /etc/lvm/cache or /etc/lvm/backup that I should check, though. For now I ignored those three directories.
I also inspected "pvdisplay -v -m", "lvdisplay -v -m" and "vgdisplay -v". Obviously there are size differences. Also, on the bad machine the swap volume comes first and then the root volume; on the good machine it is the other way around.
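For reference, these are the commands I compared, plus an lvs variant that makes the segment layout easy to see:

pvdisplay -v -m
vgdisplay -v
lvdisplay -v -m
lvs -o +seg_pe_ranges,devices    # which physical extents each LV sits on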
I also checked "cryptsetup luksDump": no differences (no flags set). I took the liberty of enabling --perf-no_read_workqueue and --perf-no_write_workqueue, which may have had a minuscule impact. At least I can now see tiny amounts of data (0-500 kB, polled every 2 seconds) in Writeback.
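For anyone finding this later, I made the flags persistent roughly like this (LUKS2 only; "cryptroot" and the partition path are placeholders for my setup):

cryptsetup --perf-no_read_workqueue --perf-no_write_workqueue --persistent refresh cryptroot
cryptsetup luksDump /dev/nvme0n1p2 | grep -i flags    # verify the persistent flags are recorded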
I also verified that my SSD (Samsung 980 PRO) has the latest firmware installed (which it did).
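Checking the firmware version is easy enough with smartctl (or nvme-cli, if installed):

smartctl -i /dev/nvme0                     # shows the "Firmware Version" line
nvme id-ctrl /dev/nvme0 | grep -w fr       # "fr" field is the firmware revision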
Offline