According to free, my "available" memory gradually disappears over the course of a few days, eventually triggering the OOM killer. There is very little memory "used". My understanding is that "buffers" and "cache" should be made available when necessary, but this doesn't seem to occur. Even after quitting all applications, the "available" memory is quite low.
$ free -wm
             total       used       free     shared    buffers      cache  available
Mem:         31989       6148       1596      16693       1213      23031       8606
Swap:            0          0          0
Here, free + buffers + cache = 25840, much more than the available memory of 8606.
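The sum above can be scripted so it is easy to repeat. This is only a minimal sketch: it assumes the wide `free -wm` layout shown here (total, used, free, shared, buffers, cache, available), and the function name mem_shortfall is made up for illustration.

```shell
# mem_shortfall: read `free -wm` output on stdin and print
# free + buffers + cache - available (in MiB).
# Assumes the wide layout: Mem: total used free shared buffers cache available
mem_shortfall() {
    awk '$1 == "Mem:" { print $4 + $6 + $7 - $8 }'
}

# Usage: free -wm | mem_shortfall
```

Fed the table above, it prints 17234 (25840 - 8606).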
I attempted to fill the cache by writing/reading a large file, to see if somehow it was stuck in general.
$ free -wm
             total       used       free     shared    buffers      cache  available
Mem:         31989       6346        204      16938        546      24891       8249
Swap:            0          0          0
"cache" increased by almost 2G, but there was not much difference in "available" (as expected).
I then attempted to free up the caches.
$ echo 1 | sudo tee /proc/sys/vm/drop_caches
$ free -wm
             total       used       free     shared    buffers      cache  available
Mem:         31989       6111       7853      16867          2      18021       8250
Swap:            0          0          0
$ echo 3 | sudo tee /proc/sys/vm/drop_caches
$ free -wm
             total       used       free     shared    buffers      cache  available
Mem:         31989       6134       7864      16946         20      17970       8157
Swap:            0          0          0
As expected, this also had no effect on "available". I expected "cache" to be almost zero, but it was still ~18G. However, "free" was now pretty similar to "available", so I wonder if there is some large amount of memory in "cache" that is inaccessible, and therefore neither contributes to "available" nor can be dropped.
Last edited by Salkay (2021-08-03 22:48:58)
Offline
Maybe it's files inside a tmpfs filesystem? Check what's going on there with "df". This here filters out just the tmpfs entries:
df -h -t tmpfs
I just tried creating a file here in /tmp with "fallocate -l 10G testfile" and those 10G show up in the "cache" column in free's output.
Last edited by Ropid (2021-06-10 09:04:38)
Offline
Thanks @Ropid. Good idea. It looks like there is ~0.75G there, which goes some way to explaining the shortfall, but it looks like there's still another ~16.5 G missing.
$ df -h -t tmpfs
Filesystem      Size  Used Avail Use% Mounted on
run              16G  1.7M   16G   1% /run
tmpfs            16G   45M   16G   1% /dev/shm
tmpfs            16G  699M   15G   5% /tmp
tmpfs           3.2G  176K  3.2G   1% /run/user/1000
Offline
looks like there's still another ~16.5 G missing.
Looks about the same as "shared" memory:
$ free -wm
             total       used       free     shared    buffers      cache  available
Mem:         31989       6148       1596      16693       1213      23031       8606
Swap:            0          0          0
What does
# cat /proc/meminfo
have to say about this?
--
saint_abroad
Offline
looks like there's still another ~16.5 G missing.
$ df -h -t tmpfs
Filesystem      Size  Used Avail Use% Mounted on
run              16G  1.7M   16G   1% /run
tmpfs            16G   45M   16G   1% /dev/shm
tmpfs            16G  699M   15G   5% /tmp
tmpfs           3.2G  176K  3.2G   1% /run/user/1000
Also, there are other filesystems backed by tmpfs consuming shmem such as udev.
--
saint_abroad
Offline
shared Memory used (mostly) by tmpfs (Shmem in /proc/meminfo)
In addition, if there are files mapped by processes that have changed (or removed) on disk, they cannot be dropped.
I'm not sure whether file caches that are simply active can be dropped either.
Offline
Looks about the same as "shared" memory:
What does
# cat /proc/meminfo
have to say about this?
Also, there are other filesystems backed by tmpfs consuming shmem such as udev.
Thanks sabroad. The sizes have changed a bit overnight, so here are the new numbers. I'm not exactly sure what to look at here, but by my calculation, the shortfall is approximately free + buffers + cache - available, i.e. 440+3480+21457-6127 = 19250. Shmem indeed looks close at about 18426 MiB (18868708 kB). What does this mean, and how can I fix it?
$ free -wm
             total       used       free     shared    buffers      cache  available
Mem:         31989       6610        440      18406       3480      21457       6127
Swap:            0          0          0
$ cat /proc/meminfo
MemTotal: 32757108 kB
MemFree: 402584 kB
MemAvailable: 6168884 kB
Buffers: 3589912 kB
Cached: 20130860 kB
SwapCached: 0 kB
Active: 4171764 kB
Inactive: 7950616 kB
Active(anon): 14196 kB
Inactive(anon): 7257676 kB
Active(file): 4157568 kB
Inactive(file): 692940 kB
Unevictable: 17943260 kB
Mlocked: 2008 kB
SwapTotal: 0 kB
SwapFree: 0 kB
Dirty: 46852 kB
Writeback: 56 kB
AnonPages: 6344956 kB
Mapped: 528088 kB
Shmem: 18868708 kB
KReclaimable: 1779544 kB
Slab: 2009948 kB
SReclaimable: 1779544 kB
SUnreclaim: 230404 kB
KernelStack: 29232 kB
PageTables: 85416 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 16378552 kB
Committed_AS: 38599792 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 50136 kB
VmallocChunk: 0 kB
Percpu: 8544 kB
HardwareCorrupted: 0 kB
AnonHugePages: 0 kB
ShmemHugePages: 0 kB
ShmemPmdMapped: 0 kB
FileHugePages: 0 kB
FilePmdMapped: 0 kB
CmaTotal: 0 kB
CmaFree: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
Hugetlb: 0 kB
DirectMap4k: 1650280 kB
DirectMap2M: 31780864 kB
DirectMap1G: 0 kB
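Since /proc/meminfo reports kB (really KiB) while free -m reports MiB, converting the few relevant fields makes the two easier to compare side by side. A rough sketch, assuming the usual "Name: value kB" layout; the function name meminfo_mib is hypothetical:

```shell
# meminfo_mib: print selected /proc/meminfo fields converted from kB to MiB.
# $1 = file to read (defaults to /proc/meminfo), so a saved copy can be used.
# Values are truncated to whole MiB; lines come out in the file's own order.
meminfo_mib() {
    awk '/^(MemAvailable|Unevictable|Mlocked|Shmem):/ {
        printf "%s %d\n", $1, $2 / 1024
    }' "${1:-/proc/meminfo}"
}

# Usage: meminfo_mib            (live system)
#        meminfo_mib saved.txt  (a saved copy)
```

For the dump above this gives Shmem at 18426 MiB, which lines up with the 18406 MiB that free shows under "shared".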
As per the link, I also looked at /dev, but df didn't report its size, nor did du.
$ df -h
Filesystem      Size  Used Avail Use% Mounted on
dev              16G     0   16G   0% /dev
run              16G  1.7M   16G   1% /run
/dev/nvme0n1p2   49G   38G  8.2G  83% /
tmpfs            16G   70M   16G   1% /dev/shm
tmpfs            16G  699M   15G   5% /tmp
/dev/nvme0n1p3  185G  142G   34G  81% /home
/dev/nvme0n1p1  356M  202M  155M  57% /boot
/dev/sdb1       1.8T  1.8T   60G  97% /externalHDD
/dev/sda6       1.7T  1.2T  443G  73% /HDD
tmpfs           3.2G  204K  3.2G   1% /run/user/1000
encfs           185G  142G   34G  81% /home/salkay/.decrypt
$ sudo du -hs /dev
0 /dev
man free wrote:shared Memory used (mostly) by tmpfs (Shmem in /proc/meminfo)
In addition, if there are files mapped by processes that have changed (or removed) on disk, they cannot be dropped.
I'm not sure whether file caches that are simply active can be dropped either.
Thanks seth. Is there any way I can test this?
Offline
For that "shmem" entry from /proc/meminfo, I tried finding out how to research what's going on there. I found a tool "ipcs" and a location "/dev/shm".
You can do "ipcs -m --human" to get an overview of those shmem objects and their size. Then you can do "ipcs -p" to get the process IDs that are involved with those.
About those deleted files being kept open by programs, you can find those programs with "lsof +L1" and "lsof -dDEL". The two lsof commands find different stuff.
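To get a quick total out of "lsof +L1" rather than scanning the SIZE/OFF column by eye, something like the following might do. It is only a sketch: it assumes lsof's default column order (COMMAND PID USER FD TYPE DEVICE SIZE/OFF NLINK NODE NAME) with no blank columns, which lsof does not guarantee, and the function name deleted_open_mib is made up. A file held open through several descriptors is counted once per row, so treat the result as an upper bound.

```shell
# deleted_open_mib: read `lsof +L1` output on stdin and print the summed
# SIZE/OFF of regular files, in whole MiB.
# Rows whose 7th column is not a plain number (e.g. an offset) are skipped.
deleted_open_mib() {
    awk '$5 == "REG" && $7 ~ /^[0-9]+$/ { total += $7 }
         END { printf "%d\n", total / 1048576 }'
}

# Usage: lsof +L1 | deleted_open_mib
```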
Last edited by Ropid (2021-06-11 05:19:35)
Offline
Thanks Ropid. Unfortunately I was forced to restart because the oom killer was going crazy. Previously the shortfall in memory was over 19G, but even after a restart it's still reasonably large at 20824+3+4206-21825 = 3208 M. I also checked again, and /proc/meminfo reports Shmem at ~3G (3074312 kB), so it's still consistent at least.
You can do "ipcs -m --human" to get an overview of those shmem objects and their size. Then you can do "ipcs -p" to get the process IDs that are involved with those.
Unfortunately this didn't seem to reveal any processes with a large memory footprint.
$ ipcs -m --human
------ Shared Memory Segments --------
key shmid owner perms size nattch status
0x00000000 32771 salkay 600 128K 2 dest
0x00000000 32772 salkay 600 128K 2 dest
0x00000000 32773 salkay 600 1.2M 2 dest
0x00000000 32774 salkay 600 1.2M 2 dest
0x00000000 65547 salkay 600 48K 2 dest
0x00000000 65548 salkay 600 48K 2 dest
0x00000000 32787 salkay 600 512K 2 dest
0x00000000 32788 salkay 600 4M 2 dest
0x51210046 21 salkay 600 1K 1
0x00000000 65560 salkay 600 24K 2 dest
0x00000000 65561 salkay 600 24K 2 dest
0x00000000 65564 salkay 600 840K 2 dest
0x00000000 29 salkay 600 384K 2 dest
0x00000000 65566 salkay 600 840K 2 dest
0x00000000 32 salkay 600 384K 2 dest
0x00000000 33 salkay 600 512K 2 dest
0x00000000 36 salkay 600 512K 2 dest
0x00000000 32805 salkay 600 512K 2 dest
The summary didn't seem helpful either.
$ ipcs -u
------ Messages Status --------
allocated queues = 0
used headers = 0
used space = 0 bytes
------ Shared Memory Status --------
segments allocated 18
pages allocated 2839
pages resident 1402
pages swapped 0
Swap performance: 0 attempts 0 successes
------ Semaphore Status --------
used arrays = 1
allocated semaphores = 1
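The "pages allocated" figure can be turned into a size to compare against Shmem. A small sketch, assuming the page size reported by getconf PAGESIZE applies (4096 bytes is typical on x86_64, but that is an assumption here); the helper name pages_to_kib is hypothetical:

```shell
# pages_to_kib: convert a page count to KiB.
# $1 = number of pages
# $2 = page size in bytes (optional; defaults to `getconf PAGESIZE`)
pages_to_kib() {
    page_size=${2:-$(getconf PAGESIZE)}
    echo $(( $1 * page_size / 1024 ))
}

# Usage: pages_to_kib 2839 4096
```

With 4096-byte pages, the 2839 pages above come to 11356 KiB, i.e. roughly 11 MiB, which is nowhere near the ~3G reported as Shmem.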
About those deleted files being kept open by programs, you can find those programs with "lsof +L1"
Am I just looking for the rows with a large SIZE/OFF? The largest only had 67 M.
$ lsof +L1
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NLINK NODE NAME
...
pulseaudi 874 sal 6u REG 0,1 67108864 0 3100 /memfd:pulseaudio (deleted)
...
and "lsof -dDEL".
Here the column SIZE/OFF was empty for each row, so I wasn't sure what to look for. I tried adding the size flag -s but the column remained empty.
Offline
Active(anon): 14196 kB
Inactive(anon): 7257676 kB
There's about 7GB in anon pages (allocated by programs), mostly inactive
Active(file): 4157568 kB
Inactive(file): 692940 kB
~4.5 GB in file caches (mostly active)
Unevictable: 17943260 kB
…
Shmem: 18868708 kB
These here are the bad numbers
Mlocked: 2008 kB
Suggests it's NOT some userspace process mlock'ing them and then just forgetting to unlock them.
The more likely cause is a "leaking" kernel module (it's technically not a leak, just dumb)
=> Which kernel do you use and does this happen w/ the lts kernel as well?
Offline
Thanks @seth. Very interesting.
I use the vanilla linux kernel, but I will restart and try the lts.
Offline
Unevictable: 17943260 kB
Does a KDE session restart (logout + login) free the Unevictable memory?
--
saint_abroad
Offline
Thanks @sabroad. That looks interesting too! These problems are actually on my work computer, so I'll report back in a few days after the weekend (long weekend here in Australia). BTW I haven't experienced similar levels of OOM killing on my home computer, but in both there is a similar shortfall in free + buffers + cache - available, with both systems currently ~2GB. Is that an "expected" amount? What is reasonable?
EDIT: Regarding the "expected" shortfall, I had a look on my (non-leaking) home computer. "shared" was ~2GB and df -BM | grep tmpfs showed /dev/shm with 1GB. ls -l did not, possibly suggesting deleted but open files. lsof /dev/shm revealed these deleted files were open in Signal and Steam. Quitting both programs freed up this directory, and the shortfall was now "only" 1GB.
Last edited by Salkay (2021-06-12 01:21:11)
Offline
The numbers you want to look at here are mlock and unevictable - if they're similar (unevictable maybe slightly bigger) you're "good"
shmem/shared is typically your combined tmpfs usage (so /tmp and /run/user/* next to /dev/shm)
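That comparison can be reduced to one number. A minimal sketch, assuming the standard /proc/meminfo field names; the function name unevictable_gap_kb is made up for illustration:

```shell
# unevictable_gap_kb: print Unevictable - Mlocked in kB.
# $1 = meminfo-style file to read (defaults to /proc/meminfo).
# A gap near zero matches the "good" case described above.
unevictable_gap_kb() {
    awk '$1 == "Unevictable:" { u = $2 }
         $1 == "Mlocked:"     { m = $2 }
         END { printf "%d\n", u - m }' "${1:-/proc/meminfo}"
}

# Usage: unevictable_gap_kb
```

For the meminfo dump posted earlier in the thread (Unevictable 17943260 kB, Mlocked 2008 kB) the gap is 17941252 kB, so almost none of the unevictable memory is explained by mlock.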
Offline
Hi Salkay,
The behavior you've described sounds very similar (in its timespan) to what I encountered in a thread I started; however, in my case the problems just went away after a kernel upgrade. I had the leak on both the -zen and the regular kernel. You might want to take a look at kmemleak documentation and try recompiling the kernel with it enabled. I wish I had known the tips in this thread (about lsof and ipcs!).
Just a few questions: what GPU are you using? Do you have any potential heavily GPU-using (such as folding@home) or other heavy (server-like) software running in the background besides the GUI?
In my case, my working theory is that the leak resided in amdgpu, possibly exacerbated by running F@H. However that is just a working theory and the problems went away before I could allocate time to investigate this. But in any case my leak is probably unrelated, this is just my 2 cents in general tips!
EDIT: As another workaround in addition to using the -lts branch, you might like to take a look at earlyOOM or nohang. I prefer the latter, having tried them both. Nohang has saved my butt a few times while working with the leaky kernel (after >20GiB had been consumed by "something", before processes started to be killed). The in-kernel OOM killer is very "stupid", especially from a desktop-oriented user's point of view.
Last edited by Wild Penguin (2021-06-14 14:52:59)
Offline
Thanks again all.
does this happen w/ the lts kernel as well?
Yes, it also occurs with the lts kernel.
Does a KDE session restart (logout + login) free the Unevictable memory?
Yes! It does. I tracked Unevictable memory over a few days, and it went from 1,899,436 kB to 4,657,444 kB. I then quit all my programs and it went down, but it was still 3,178,788 kB (does that mean some was evicted??). I logged out then back into Plasma, and it was only 568,620 kB.
The numbers you want to look at here are mlock and unevictable - if they're similar (unevictable maybe slightly bigger) you're "good"
I forgot to check Mlocked when the system was really bad, but even early on, Unevictable and Mlocked were quite different.
$ grep -E '^(Unevictable|Mlocked|Shmem):' /proc/meminfo
Unevictable: 1637828 kB
Mlocked: 1824 kB
Shmem: 1793220 kB
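For tracking Unevictable over several days, one option is to append a timestamped sample to a log at intervals (from cron or a systemd timer, for instance). This is only a sketch of that idea, not what was actually used for the figures above; the function name log_unevictable and the log path in the usage line are hypothetical.

```shell
# log_unevictable: append one "UTC-timestamp Unevictable-kB" line to a log.
# $1 = log file to append to
# $2 = meminfo-style file to read (defaults to /proc/meminfo)
log_unevictable() {
    value=$(awk '$1 == "Unevictable:" { print $2 }' "${2:-/proc/meminfo}")
    printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$value" >> "$1"
}

# Usage: log_unevictable "$HOME/unevictable.log"
```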
You might want to take a look at kmemleak documentation and try recompiling the kernel with it enabled.
Thanks; I'll check it out.
Just a few questions: what GPU are you using? Do you have any potential heavily GPU-using (such as folding@home) or other heavy (server-like) software running in the background besides the GUI?
Just using Intel GPU. Nothing heavy going on. I actually do have an Nvidia card, but I don't use it because it only has one physical output.
As another workaround in addition to using the -lts branch, you might like to take a look at
Thank you. I already actually use earlyoom. When memory is low it does help, but it randomly crashes browser tabs, which gives me time to save my work and restart, but not much else.
Offline
Do you run on the xf86-video-intel or the modesetting driver and does the other exhibit this behavior?
Offline
Yes! It does. I tracked Unevictable memory over a few days, and it went from 1,899,436 kB to 4,657,444 kB. I then quit all my programs and it went down, but it was still 3,178,788 kB (does that mean some was evicted??). I logged out then back into Plasma, and it was only 568,620 kB.
This means that the leak is not in kernel space but in user space. In my case, there was no way to free the memory which had leaked (even after stopping the whole X.org GUI stack, F@H and tvheadend, up to 20 GiB of RAM was still in use), so my tips about kmemleak will probably not help you.
Also, looking at the numbers in your first post, I realised the "used" column does not actually go up (oops, sorry, I didn't look at them properly before). In my case it did; for some reason I just never posted the output of "free" in that thread (mostly because I thought it didn't give useful information in leak situations, but this thread proves otherwise).
But I will not derail this anymore - I believe the leak I've experienced was of a different kind.
Offline
Wild Penguin wrote:As another workaround in addition to using the -lts branch, you might like to take a look at
Thank you. I already actually use earlyoom. When memory is low it does help, but it randomly crashes browser tabs, which gives me time to save my work and restart, but not much else.
The main and major benefit of nohang is that it has configurable warnings. This is why I chose to use it over earlyoom.
Indeed, with earlyoom (or any OOM killer) the problem is that the user is not notified while absent-mindedly working on something. Chances are that by the time the user notices the OOM killer at work, it has already destroyed some important data by killing the application in question (or something it depends on, such as the whole X.org server). The configuration options of earlyoom make this less likely, but do not eliminate it. With alerts, the user can take manual action before any processes get killed!
Its configuration is much more complicated, but the documentation is quite clear and the default configuration works for many desktop users. IIRC I didn't change anything; I only looked at the documentation to determine what it is actually doing. You don't have to :-) .
Last edited by Wild Penguin (2021-06-16 08:11:03)
Offline
Do you run on the xf86-video-intel or the modesetting driver and does the other exhibit this behavior?
Sorry for the delay, I've not been using this system for a while. I wanted to test with more usage before reporting back. I did use modesetting, but I've tried xf86-video-intel now, and I see the same problem. Not quite as bad (with less usage?), but still:
$ grep -E '^(Unevictable|Mlocked|Shmem):' /proc/meminfo
Unevictable: 2545848 kB
Mlocked: 1948 kB
Shmem: 2810720 kB
And indeed, as per the linked bug report, restarting kwin resulted in Unevictable dropping to 1314316 kB.
I'm going to try some older kernels and see how they go. I can see LTS 4.19 and 5.4 in the AUR, so I'll try them.
The main and major benefit of nohang is that it has configurable warnings.
Thanks for all that advice. Regarding warnings, I actually already have a script that runs every 5 minutes to warn me of this!
#!/usr/bin/env sh
# Warn when "available" memory drops below a percentage of total memory.
threshold=20
# In free's default output, column 2 is "total" and column 7 is "available".
avail_percent=$(free | awk '/^Mem:/ {printf "%.0f", ($7 / $2 * 100)}')
if [ "$avail_percent" -lt "$threshold" ]; then
    notify-send -i emblem-warning 'Swap warning' "${avail_percent}% memory available"
    printf '%s\n%s\n' 'Swap warning' "${avail_percent}% memory available"
fi
Offline
I've tested some older kernels. 5.4.127 exhibits the same bug, but 4.19.195 is totally fine! I've been running it for a couple of days now, and it looks solid!
$ grep -E '^(Unevictable|Mlocked|Shmem):' /proc/meminfo
Unevictable: 2032 kB
Mlocked: 2032 kB
Shmem: 3558272 kB
And
$ free -wm
             total       used       free     shared    buffers      cache  available
Mem:         32047       7329        294       3482       6573      17848      20785
Swap:            0          0          0
i.e. free+buffers+cache-available = 3930
Offline