You are not logged in.
SOLVED: Upgrading to linux 6.4.8 fixed this issue for me. It is currently in core-testing, so should be in core soon.
I have Arch installed on a KVM VPS with 8 x86_64 vCPUs, 32GiB RAM, and 640GiB disk space (shameless plug: I only pay less than $8 USD per month, paid tri-annually and SSDNodes gives us both a credit for each referral I get, DM me if you're interested!). I run a few very low traffic websites on it in a Docker Compose stack (one container per service), including my Gitea system, three Ghost blogs, and supporting databases (MySQL/MariaDB) with the reverse proxy (nginx).
I use conky-cli to monitor this VPS over SSH, and rarely do I see CPU above 1-2 percent, RAM is at 5 percent, and swap is at 1-2 percent. So I don't feel it's under heavy load. I learned about PSI the other day, so I've been trying to learn more about it.
I set up a script that dumps the /proc/pressure/{cpu,io,memory} to a CSV log every 30 seconds, and set the timer to start five minutes after boot (OnBootSec=5m). I set this up and let it run for a couple of hours, then did a full Arch upgrade, including an upgrade of linux from 6.4.3 to 6.4.7 (it had been some time since my last upgrade). Here's the CSV, which I hope is easier to interpret in Gitea rather than pasting the contents in a code block. The I/O statistics are at the end/far right of this CSV, so you will need to scroll to the right to see them.
I rebooted just after midnight local time last night (July 31), and when the pressure-log started writing to the CSV again, it clearly shows io_some and io_full at just over 50 percent, and steadily grew to just under 95 percent less than an hour later. I've done some other digging, and I can't figure out why these percentages are so high.
The first thing I looked at was iotop -o, but it was mostly blank. Every few seconds a process or two would show up as it read or wrote to disk, but nothing seemed to have any persistence. I looked in atop, but that only showed what I saw in my CSV (iosome and iofull above 90 percent). Next, I saw a suggestion to use pidstat -dl. Here's the output I feel most relevant:
# pidstat -dl | tail -n +2 | awk '{print $8" "$9}' | column -t
iodelay Command
0 /sbin/init
0 khugepaged
0 jbd2/sda6-8
0 /usr/lib/systemd/systemd-journald
0 /usr/lib/systemd/systemd-udevd
0 btrfs-cleaner
0 btrfs-transaction
0 btrfs-cleaner
0 btrfs-transaction
0 /usr/bin/rngd
0 /usr/lib/systemd/systemd-resolved
0 /usr/lib/systemd/systemd-timesyncd
0 /usr/bin/dbus-daemon
0 /usr/lib/systemd/systemd-logind
0 /usr/bin/python
0 /usr/lib/systemd/systemd-networkd
0 sshd:
0 /usr/bin/containerd
0 /sbin/agetty
0 /usr/lib/polkit-1/polkitd
0 /usr/lib/postfix/bin/master
0 qmgr
0 /usr/bin/dockerd
0 /usr/bin/docker-proxy
0 /usr/bin/containerd-shim-runc-v2
0 /usr/bin/containerd-shim-runc-v2
0 mysqld
0 node
0 /usr/bin/containerd-shim-runc-v2
0 /usr/bin/containerd-shim-runc-v2
0 /bin/s6-svscan
0 mariadbd
0 /usr/bin/containerd-shim-runc-v2
0 /usr/bin/containerd-shim-runc-v2
0 /usr/bin/containerd-shim-runc-v2
0 nginx:
0 node
0 node
0 node
0 s6-supervise
0 s6-supervise
0 /usr/local/bin/gitea
0 sshd:
0 nginx:
0 nginx:
0 nginx:
0 nginx:
0 nginx:
0 nginx:
0 nginx:
0 nginx:
0 /usr/lib/systemd/systemd
0 sshd:
0 sshd:
0 /usr/lib/systemd/systemd
0 -zsh
0 /usr/bin/gpg-agent
0 ssh-agent
0 /usr/bin/tmux
0 -zsh
0 -zsh
0 -zsh
0 -zsh
0 -zsh
0 journalctl
0 weechat-curses
0 -zsh
0 sudo
0 -zsh
0 /usr/bin/gpg-agent
0 ssh-agent
0 tlsmgr
0 -zsh
0 -zsh
0 sshd:
0 sshd:
0 kworker/u16:2-events_power_efficient
0 pickup
0 kworker/u16:6-events_power_efficient
0 kworker/u16:1-events_unbound
0 kworker/u16:3-events_unbound
0 kworker/u16:0-btrfs-endio-write
0 kworker/u16:4-flush-btrfs-1This clearly shows no processes delayed because of I/O. Here's another full snapshot of pidstat -dl a few moments later.
Before the reboot, it looks like io_some and io_full were about what I'd expect, 1-2 percent (maybe up to 5 percent, but even that would be reasonable). Only after the reboot does this problem occur. I see this commit that went into 6.4.7, so I'm wondering if this had a material impact on this (which is why I'm seeing it after the reboot and not before).
I've been looking through this commit, and there's quite a bit in it. It wasn't until I was reading through it I thought to check dmesg, so I looked at jounralctl -k:
Aug 01 01:06:33 deltachunk kernel: virtio_balloon virtio3: Out of puff! Can't get 1 pages
Aug 01 01:06:37 deltachunk kernel: virtio_balloon virtio3: Out of puff! Can't get 1 pages
Aug 01 01:06:39 deltachunk kernel: virtio_balloon virtio3: Out of puff! Can't get 1 pages
Aug 01 05:54:58 deltachunk kernel: virtio_balloon virtio3: Out of puff! Can't get 1 pages
Aug 01 05:55:00 deltachunk kernel: virtio_balloon virtio3: Out of puff! Can't get 1 pages
Aug 01 05:55:00 deltachunk kernel: virtio_balloon virtio3: Out of puff! Can't get 1 pages
Aug 01 05:55:01 deltachunk kernel: virtio_balloon virtio3: Out of puff! Can't get 1 pages
Aug 01 05:55:03 deltachunk kernel: virtio_balloon virtio3: Out of puff! Can't get 1 pages
Aug 01 05:55:04 deltachunk kernel: virtio_balloon virtio3: Out of puff! Can't get 1 pages
Aug 01 08:51:49 deltachunk kernel: virtio_balloon virtio3: Out of puff! Can't get 1 pages
Aug 01 08:51:50 deltachunk kernel: virtio_balloon virtio3: Out of puff! Can't get 1 pages
Aug 01 08:51:50 deltachunk kernel: virtio_balloon virtio3: Out of puff! Can't get 1 pages
Aug 01 08:51:50 deltachunk kernel: virtio_balloon virtio3: Out of puff! Can't get 1 pages
Aug 01 08:51:51 deltachunk kernel: virtio_balloon virtio3: Out of puff! Can't get 1 pages
Aug 01 08:51:51 deltachunk kernel: virtio_balloon virtio3: Out of puff! Can't get 1 pages
Aug 01 08:51:51 deltachunk kernel: virtio_balloon virtio3: Out of puff! Can't get 1 pages
Aug 01 08:51:52 deltachunk kernel: virtio_balloon virtio3: Out of puff! Can't get 1 pages
Aug 01 12:24:21 deltachunk kernel: Process accounting resumed
Aug 01 12:45:30 deltachunk kernel: Process accounting resumed
Aug 01 12:49:11 deltachunk kernel: Process accounting resumed
Aug 01 12:50:39 deltachunk kernel: Process accounting resumed
Aug 01 15:18:28 deltachunk kernel: virtio_balloon virtio3: Out of puff! Can't get 1 pages
Aug 01 15:18:29 deltachunk kernel: virtio_balloon virtio3: Out of puff! Can't get 1 pages
Aug 01 15:18:29 deltachunk kernel: virtio_balloon virtio3: Out of puff! Can't get 1 pages
Aug 01 20:22:49 deltachunk kernel: virtio_balloon virtio3: Out of puff! Can't get 1 pages
Aug 01 20:22:49 deltachunk kernel: virtio_balloon virtio3: Out of puff! Can't get 1 pages
Aug 01 20:22:49 deltachunk kernel: virtio_balloon virtio3: Out of puff! Can't get 1 pages
Aug 01 20:22:50 deltachunk kernel: virtio_balloon virtio3: Out of puff! Can't get 1 pages
Aug 01 22:44:59 deltachunk kernel: virtio_balloon virtio3: Out of puff! Can't get 1 pagesI think this is a clue, but this is not constant (and started almost an hour after my pressure.csv began logging again after the reboot, at 00:10).
I'm hoping I'm just misinterpreting what this means, and now the percentages for I/O are showing idleness, not stalls. But I haven't been able to find anything that explicitly states this. I even looked through all the commits, searching for psi, but I only found the one linked above that went into 6.4.7, nothing else PSI-related since 6.4.3.
If it's not a misinterpretation, have I uncovered a bug? I don't know enough about the kernel to know how to file such a bug report. My worst thought is I have been hacked, and the attacker has hidden their tracks very well. I do have quite a bit of packages from the AUR on this system, it's possible I installed a rogue package that way. But I figure I'd see more evidence of that, such as much higher CPU and RAM usage, even if top, htop, and atop, and ps-related tools don't show any processes consuming that many resources.
Last edited by ectospasm (2023-08-04 22:57:26)
Offline
Likely the same userspace reporting bug in relation to mariadb and a change in the 6.4.4 kernel: https://bbs.archlinux.org/viewtopic.php?id=287343
Generally harmless and a misdiagnostic, a kernel side mitigation for the reporting issue is queued for 6.4.8.
Last edited by V1del (2023-08-02 11:35:00)
Offline
Likely the same userspace reporting bug in relation to mariadb and a change in the 6.4.4 kernel: https://bbs.archlinux.org/viewtopic.php?id=287343
Generally harmless and a misdiagnostic, a kernel side mitigation for the reporting issue is queued for 6.4.8.
Thanks! Sure enough, the iowat was just about 12.5 percent, which is 1/8 of 100 percent (I have 8 vCPUs). Stopping my mariadb service and running iostat -x 5 for a few cycles showed the iowait drop near zero. I also ran watch -n1 "cat /proc/pressure/io" and watched the percentages drain off.
Any clues on how to find a process that's increasing the iowait? Had I been able to find it I may have been able to provide more information on it. Or at least started looking into MariaDB related bugs.
EDIT: I was already using iotop to look into this, and mariadb does not show up as a culprit. Had it been real I/O wait, i would have seen it. Looking forward to 6.4.8! Do you have a link to the kernel commit that fixes this? I'd like to look at it, always eager to learn more about the Linux kernel.
Last edited by ectospasm (2023-08-02 21:51:16)
Offline
Read the linked thread, but at the end there's: https://github.com/torvalds/linux/commi … daa4347813
Offline