After doing a lot of compiling I frequently find myself in the following scenario: exactly half of my RAM is being used up by apparently nothing.
I know this question has been asked to death so many times that entire websites exist to dismiss it. But after going through all the usual explanations, none of them seem to apply here.
So here's an example of my situation: out of 15 GiB RAM (no swap), 7 GiB is "Used", according to free -m:
               total        used        free      shared  buff/cache   available
Mem:           15334        7000        7464          79         868        7978
Swap:              0           0           0
This 7 GiB is genuinely non-reclaimable. If I run a compile job that needs more than the remaining ~7 GiB of RAM, it gets OOM-killed (or the system locks up). So the memory is definitely in use.
The memory is not being used by any caches. If the above didn't convince you of this already, running echo 3 > /proc/sys/vm/drop_caches has no effect. In fact, the table above is after running that command.
It's not being used by any process, according to top and friends. Indeed, if I log out and in again, it's still there - the table basically doesn't change.
My /tmp directory is clean, and no other tmpfs is taking up a significant amount of space either, as can be checked with df -h.
It doesn't appear to be used by the kernel either, as all the caches in slabtop report usage orders of magnitude smaller than 7 GiB.
However, there is one way to reclaim this memory without rebooting. If I relogin, and then run echo 3 > /proc/sys/vm/drop_caches, then the 7 GiB of Used memory drops completely to zero. But nothing else works:
Drop caches --> no change.
Relogin --> no change.
Drop caches + Relogin --> no change.
Relogin + Drop caches --> DROPS THE MEMORY.
What kind of memory usage could possibly lead to this behaviour? I can't reconcile this with anything:
Because the memory couldn't be reclaimed under memory pressure, and a relogin was required before it could be dropped, this suggests that a user process was responsible for holding on to the memory.
But no process in System Monitor, top etc was using anything close to the required amount of memory!
After a relogin, the memory should have been released by the user process. So why did it remain as "Used", when surely it should have become cached, and therefore gone into "buff/cache" instead?
What tools can I use to investigate this problem when it comes up again, and what possible explanations are there?
Offline
What does df -h give? On my system, roughly half of my RAM is available to tmpfs, if I understand the output correctly (a bit less than 4G out of a bit under 8G total).
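If you only want the tmpfs numbers, df can be restricted by filesystem type:

df -h -t tmpfs -t devtmpfs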
Offline
cat /proc/meminfo
And see whether "echo 1 > /proc/sys/vm/drop_caches" (page cache) or "echo 2 > /proc/sys/vm/drop_caches" (slab) frees it.
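For a before/after comparison, something along these lines should do (as root; sync first so dirty pages don't mask the result):

sync
free -h                               # baseline
echo 1 > /proc/sys/vm/drop_caches     # page cache only
free -h
echo 2 > /proc/sys/vm/drop_caches     # reclaimable slab (dentries, inodes)
free -h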
Offline
I'll run those commands as soon as it happens again.
In the meantime, I wonder if the cause is that some process has created and opened a large file in /tmp, then deleted it while holding the file descriptor open? That sounds like it could explain the observations - but I need to test.
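Something like this should reproduce it if that's the mechanism (a sketch; the 2 GiB scratch file and the fd number are just placeholders):

dd if=/dev/zero of=/tmp/bigfile bs=1M count=2048   # fill 2 GiB of tmpfs
exec 9</tmp/bigfile                                # keep an fd open in this shell
rm /tmp/bigfile                                    # file is gone from /tmp, its pages are not
free -h                                            # where do the 2 GiB show up?
exec 9<&-                                          # close the fd
free -h                                            # ...and are they released now?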
Offline
FWIW, a large file placed in tmpfs will reduce the "free" space reported by free by the file size, but it only slightly increases the "used" column. I don't know why this is the case, but it seems to be - which, given your symptoms, argues against the open file descriptor hypothesis (which otherwise sounds quite plausible).
"UNIX is simple and coherent" - Dennis Ritchie; "GNU's Not Unix" - Richard Stallman
Offline
Unlinked but still-open files will show up in /proc/$PID/fd:
cd /proc
# list any processes holding open file descriptors that point into /tmp
for PID in [0-9]*; do if sudo ls -l "$PID"/fd 2>/dev/null | grep '/tmp'; then echo "$PID"; fi; done
Chromium does that for sure.
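lsof can do the same thing more directly - +L1 restricts the listing to open files whose link count is zero, i.e. deleted files kept alive only by an open descriptor:

sudo lsof -nP +L1
sudo lsof -nP +L1 | grep /tmp    # just the ones that lived in /tmp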
Offline
It happened again, though I didn't have time to wait for it to get really bad, so this time it was only a persistent 3.3 GiB of "Used" memory in free -h. That dropped to 1.7 GiB after a relogin + echo 2 > /proc/sys/vm/drop_caches, and no further after echo 1 > /proc/sys/vm/drop_caches. So it's the slab cache after all, which I thought I'd ruled out.
I grabbed the output of some commands just before the first drop_caches:
free -h
               total        used        free      shared  buff/cache   available
Mem:            14Gi       3.3Gi       9.8Gi        24Mi       1.8Gi        11Gi
Swap:             0B          0B          0B
df -h
Filesystem             Size  Used Avail Use% Mounted on
devtmpfs               4.0M     0  4.0M   0% /dev
tmpfs                  7.5G     0  7.5G   0% /dev/shm
tmpfs                  3.0G  9.6M  3.0G   1% /run
/dev/mapper/cryptmain  239G   87G  151G  37% /
tmpfs                  7.5G     0  7.5G   0% /tmp
/dev/mapper/cryptmain  239G   87G  151G  37% /swap
/dev/mapper/cryptmain  239G   87G  151G  37% /home
/dev/nvme0n1p1         256M  132M  125M  52% /boot/efi
tmpfs                  1.5G   84K  1.5G   1% /run/user/1000
cat /proc/meminfo
MemTotal: 15702696 kB
MemFree: 10308948 kB
MemAvailable: 11845908 kB
Buffers: 16 kB
Cached: 1738356 kB
SwapCached: 0 kB
Active: 2032404 kB
Inactive: 527916 kB
Active(anon): 750280 kB
Inactive(anon): 96884 kB
Active(file): 1282124 kB
Inactive(file): 431032 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 0 kB
SwapFree: 0 kB
Dirty: 27184 kB
Writeback: 0 kB
AnonPages: 806120 kB
Mapped: 511420 kB
Shmem: 25212 kB
KReclaimable: 147484 kB
Slab: 312460 kB
SReclaimable: 147484 kB
SUnreclaim: 164976 kB
KernelStack: 9056 kB
PageTables: 16580 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 7851348 kB
Committed_AS: 3276608 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 46148 kB
VmallocChunk: 0 kB
Percpu: 20672 kB
HardwareCorrupted: 0 kB
AnonHugePages: 198656 kB
ShmemHugePages: 0 kB
ShmemPmdMapped: 0 kB
FileHugePages: 0 kB
FilePmdMapped: 0 kB
CmaTotal: 0 kB
CmaFree: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
Hugetlb: 0 kB
DirectMap4k: 11625960 kB
DirectMap2M: 4464640 kB
DirectMap1G: 1048576 kB
Afterwards, free -h reports
               total        used        free      shared  buff/cache   available
Mem:            14Gi       1.7Gi        11Gi        58Mi       2.0Gi        12Gi
Swap:             0B          0B          0B
So what could cause 1.6 GiB of slab objects to hang around in the kernel, but be killed by a logout?
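Next time I'll try to catch which slab caches actually hold the memory before logging out - something like this should show it (a sketch):

sudo slabtop -o -s c | head -n 20     # one-shot listing, sorted by cache size
# or rough per-cache totals (objects x object size) straight from /proc/slabinfo:
sudo awk 'NR > 2 { printf "%10.1f MiB  %s\n", $3 * $4 / 1048576, $1 }' /proc/slabinfo | sort -rn | head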
Offline
/dev/mapper/cryptmain  239G   87G  151G  37% /
/dev/mapper/cryptmain  239G   87G  151G  37% /swap
/dev/mapper/cryptmain  239G   87G  151G  37% /home
You close any of those when you log out?
Offline
FWIW, a large file placed in tmpfs will reduce the "free" space reported by free by the file size, but it only slightly increases the "used" column. I don't know why this is the case, but it seems to be - which, given your symptoms, argues against the open file descriptor hypothesis (which otherwise sounds quite plausible).
Thanks. (This makes me curious about what uses memory on my system.)
Offline
/dev/mapper/cryptmain  239G   87G  151G  37% /
/dev/mapper/cryptmain  239G   87G  151G  37% /swap
/dev/mapper/cryptmain  239G   87G  151G  37% /home
You close any of those when you log out?
No, but I could try closing /home.
Offline
The idea was rather that closing the encrypted volume, not the logout itself, was the relevant step - but that's apparently not the case.
Alternatively (but that's gonna be a chore) you could kill session processes one by one, drop the caches after each one, and see which process makes the difference.
Offline
Closing the encrypted volume might actually be impossible, since it holds the system root.
Alternatively (but that's gonna be a chore) you could kill session processes one by one, drop the caches after each one, and see which process makes the difference.
Yeah, I started doing that, but quickly got bored after killing all the obvious suspects and seeing nothing happen. I can always fall back on it as a last resort, though I was hoping for a way to interrogate this information out of the kernel.
Last edited by PBS (2022-11-09 08:17:17)
Offline
lsmod | sort -k 2 -rn
?
Offline