You are not logged in.

#1 2023-08-02 03:53:09

ectospasm
Member
Registered: 2015-08-28
Posts: 303

[SOLVED] Pressure Stall Information (PSI) - ~90+% io (KVM VPS)

SOLVED:  Upgrading to linux 6.4.8 fixed this issue for me.  It is currently in core-testing, so should be in core soon.

I have Arch installed on a KVM VPS with 8 x86_64 vCPUs, 32GiB RAM, and 640GiB disk space (shameless plug:  I only pay less than $8 USD per month,  paid tri-annually and SSDNodes gives us both a credit for each referral I get, DM me if you're interested!).  I run a few very low traffic websites on it in a Docker Compose stack (one container per service), including my Gitea system, three Ghost blogs, and supporting databases (MySQL/MariaDB) with the reverse proxy (nginx). 

I use conky-cli to monitor this VPS over SSH, and rarely do I see CPU above 1-2 percent, RAM is at 5 percent, and swap is at 1-2 percent.  So I don't feel it's under heavy load.  I learned about PSI the other day, so I've been trying to learn more about it.

I set up a script that dumps the /proc/pressure/{cpu,io,memory} to a CSV log every 30 seconds, and set the timer to start five minutes after boot (OnBootSec=5m).  I set this up and let it run for a couple of hours, then did a full Arch upgrade, including an upgrade of linux from 6.4.3 to 6.4.7 (it had been some time since my last upgrade).  Here's the CSV, which I hope is easier to interpret in Gitea rather than pasting the contents in a code block.  The I/O statistics are at the end/far right of this CSV, so you will need to scroll to the right to see them.

I rebooted just after midnight local time last night (July 31), and when the pressure-log started writing to the CSV again, it clearly shows io_some and io_full at just over 50 percent, and steadily grew to just under 95 percent less than an hour later.  I've done some other digging, and I can't figure out why these percentages are so high.

The first thing I looked at was iotop -o, but it was mostly blank.  Every few seconds a process or two would show up as it read or wrote to disk, but nothing seemed to have any persistence.  I looked in atop, but that only showed what I saw in my CSV (iosome and iofull above 90 percent).  Next, I saw a suggestion to use pidstat -dl.  Here's the output I feel most relevant:

# pidstat -dl | tail -n +2 | awk '{print $8" "$9}' | column -t
iodelay  Command
0        /sbin/init
0        khugepaged
0        jbd2/sda6-8
0        /usr/lib/systemd/systemd-journald
0        /usr/lib/systemd/systemd-udevd
0        btrfs-cleaner
0        btrfs-transaction
0        btrfs-cleaner
0        btrfs-transaction
0        /usr/bin/rngd
0        /usr/lib/systemd/systemd-resolved
0        /usr/lib/systemd/systemd-timesyncd
0        /usr/bin/dbus-daemon
0        /usr/lib/systemd/systemd-logind
0        /usr/bin/python
0        /usr/lib/systemd/systemd-networkd
0        sshd:
0        /usr/bin/containerd
0        /sbin/agetty
0        /usr/lib/polkit-1/polkitd
0        /usr/lib/postfix/bin/master
0        qmgr
0        /usr/bin/dockerd
0        /usr/bin/docker-proxy
0        /usr/bin/containerd-shim-runc-v2
0        /usr/bin/containerd-shim-runc-v2
0        mysqld
0        node
0        /usr/bin/containerd-shim-runc-v2
0        /usr/bin/containerd-shim-runc-v2
0        /bin/s6-svscan
0        mariadbd
0        /usr/bin/containerd-shim-runc-v2
0        /usr/bin/containerd-shim-runc-v2
0        /usr/bin/containerd-shim-runc-v2
0        nginx:
0        node
0        node
0        node
0        s6-supervise
0        s6-supervise
0        /usr/local/bin/gitea
0        sshd:
0        nginx:
0        nginx:
0        nginx:
0        nginx:
0        nginx:
0        nginx:
0        nginx:
0        nginx:
0        /usr/lib/systemd/systemd
0        sshd:
0        sshd:
0        /usr/lib/systemd/systemd
0        -zsh
0        /usr/bin/gpg-agent
0        ssh-agent
0        /usr/bin/tmux
0        -zsh
0        -zsh
0        -zsh
0        -zsh
0        -zsh
0        journalctl
0        weechat-curses
0        -zsh
0        sudo
0        -zsh
0        /usr/bin/gpg-agent
0        ssh-agent
0        tlsmgr
0        -zsh
0        -zsh
0        sshd:
0        sshd:
0        kworker/u16:2-events_power_efficient
0        pickup
0        kworker/u16:6-events_power_efficient
0        kworker/u16:1-events_unbound
0        kworker/u16:3-events_unbound
0        kworker/u16:0-btrfs-endio-write
0        kworker/u16:4-flush-btrfs-1

This clearly shows no processes delayed because of I/O.  Here's another full snapshot of pidstat -dl a few moments later.

Before the reboot, it looks like io_some and io_full were about what I'd expect, 1-2 percent (maybe up to 5 percent, but even that would be reasonable).  Only after the reboot does this problem occur.  I see this commit that went into 6.4.7, so I'm wondering if this had a material impact on this (which is why I'm seeing it after the reboot and not before).

I've been looking through this commit, and there's quite a bit in it.  It wasn't until I was reading through it I thought to check dmesg, so I looked at jounralctl -k

Aug 01 01:06:33 deltachunk kernel: virtio_balloon virtio3: Out of puff! Can't get 1 pages
Aug 01 01:06:37 deltachunk kernel: virtio_balloon virtio3: Out of puff! Can't get 1 pages
Aug 01 01:06:39 deltachunk kernel: virtio_balloon virtio3: Out of puff! Can't get 1 pages
Aug 01 05:54:58 deltachunk kernel: virtio_balloon virtio3: Out of puff! Can't get 1 pages
Aug 01 05:55:00 deltachunk kernel: virtio_balloon virtio3: Out of puff! Can't get 1 pages
Aug 01 05:55:00 deltachunk kernel: virtio_balloon virtio3: Out of puff! Can't get 1 pages
Aug 01 05:55:01 deltachunk kernel: virtio_balloon virtio3: Out of puff! Can't get 1 pages
Aug 01 05:55:03 deltachunk kernel: virtio_balloon virtio3: Out of puff! Can't get 1 pages
Aug 01 05:55:04 deltachunk kernel: virtio_balloon virtio3: Out of puff! Can't get 1 pages
Aug 01 08:51:49 deltachunk kernel: virtio_balloon virtio3: Out of puff! Can't get 1 pages
Aug 01 08:51:50 deltachunk kernel: virtio_balloon virtio3: Out of puff! Can't get 1 pages
Aug 01 08:51:50 deltachunk kernel: virtio_balloon virtio3: Out of puff! Can't get 1 pages
Aug 01 08:51:50 deltachunk kernel: virtio_balloon virtio3: Out of puff! Can't get 1 pages
Aug 01 08:51:51 deltachunk kernel: virtio_balloon virtio3: Out of puff! Can't get 1 pages
Aug 01 08:51:51 deltachunk kernel: virtio_balloon virtio3: Out of puff! Can't get 1 pages
Aug 01 08:51:51 deltachunk kernel: virtio_balloon virtio3: Out of puff! Can't get 1 pages
Aug 01 08:51:52 deltachunk kernel: virtio_balloon virtio3: Out of puff! Can't get 1 pages
Aug 01 12:24:21 deltachunk kernel: Process accounting resumed
Aug 01 12:45:30 deltachunk kernel: Process accounting resumed
Aug 01 12:49:11 deltachunk kernel: Process accounting resumed
Aug 01 12:50:39 deltachunk kernel: Process accounting resumed
Aug 01 15:18:28 deltachunk kernel: virtio_balloon virtio3: Out of puff! Can't get 1 pages
Aug 01 15:18:29 deltachunk kernel: virtio_balloon virtio3: Out of puff! Can't get 1 pages
Aug 01 15:18:29 deltachunk kernel: virtio_balloon virtio3: Out of puff! Can't get 1 pages
Aug 01 20:22:49 deltachunk kernel: virtio_balloon virtio3: Out of puff! Can't get 1 pages
Aug 01 20:22:49 deltachunk kernel: virtio_balloon virtio3: Out of puff! Can't get 1 pages
Aug 01 20:22:49 deltachunk kernel: virtio_balloon virtio3: Out of puff! Can't get 1 pages
Aug 01 20:22:50 deltachunk kernel: virtio_balloon virtio3: Out of puff! Can't get 1 pages
Aug 01 22:44:59 deltachunk kernel: virtio_balloon virtio3: Out of puff! Can't get 1 pages

I think this is a clue, but this is not constant (and started almost an hour after my pressure.csv began logging again after the reboot, at 00:10).

I'm hoping I'm just misinterpreting what this means, and now the percentages for I/O are showing idleness, not stalls.  But I haven't been able to find anything that explicitly states this.  I even looked through all the commits, searching for psi, but I only found the one linked above that went into 6.4.7, nothing else PSI-related since 6.4.3.

If it's not a misinterpretation, have I uncovered a bug?  I don't know enough about the kernel to know how to file such a bug report.  My worst thought is I have been hacked, and the attacker has hidden their tracks very well.  I do have quite a bit of packages from the AUR on this system, it's possible I installed a rogue package that way.  But I figure I'd see more evidence of that, such as much higher CPU and RAM usage, even if top, htop, and atop, and ps-related tools don't show any processes consuming that many resources.

Last edited by ectospasm (2023-08-04 22:57:26)

Offline

#2 2023-08-02 11:33:53

V1del
Forum Moderator
Registered: 2012-10-16
Posts: 25,194

Re: [SOLVED] Pressure Stall Information (PSI) - ~90+% io (KVM VPS)

Likely the same userspace reporting bug in relation to mariadb and a change in the 6.4.4 kernel: https://bbs.archlinux.org/viewtopic.php?id=287343

Generally harmless and a misdiagnostic, a kernel side mitigation for the reporting issue is queued for 6.4.8.

Last edited by V1del (2023-08-02 11:35:00)

Offline

#3 2023-08-02 21:40:59

ectospasm
Member
Registered: 2015-08-28
Posts: 303

Re: [SOLVED] Pressure Stall Information (PSI) - ~90+% io (KVM VPS)

V1del wrote:

Likely the same userspace reporting bug in relation to mariadb and a change in the 6.4.4 kernel: https://bbs.archlinux.org/viewtopic.php?id=287343

Generally harmless and a misdiagnostic, a kernel side mitigation for the reporting issue is queued for 6.4.8.

Thanks!  Sure enough, the iowat was just about 12.5 percent, which is 1/8 of 100 percent (I have 8 vCPUs).  Stopping my mariadb service and running iostat -x 5 for a few cycles showed the iowait drop near zero.  I also ran watch -n1 "cat /proc/pressure/io" and watched the percentages drain off.

Any clues on how to find a process that's increasing the iowait?  Had I been able to find it I may have been able to provide more information on it.  Or at least started looking into MariaDB related bugs.

EDIT:  I was already using iotop to look into this, and mariadb does not show up as a culprit.  Had it been real I/O wait, i would have seen it.  Looking forward to 6.4.8!  Do you have a link to the kernel commit that fixes this?  I'd like to look at it, always eager to learn more about the Linux kernel.

Last edited by ectospasm (2023-08-02 21:51:16)

Offline

#4 2023-08-03 02:03:36

V1del
Forum Moderator
Registered: 2012-10-16
Posts: 25,194

Re: [SOLVED] Pressure Stall Information (PSI) - ~90+% io (KVM VPS)

Read the linked thread, but at the end there's: https://github.com/torvalds/linux/commi … daa4347813

Offline

Board footer

Powered by FluxBB