How to diagnose kernel memory usage and causes?

natervance · 2023-09-25 13:29:02

I have possibly the same issue as this user back in 2016, where my RAM usage keeps growing until my desktop grinds to a halt, even after killing all applications including X. Several times this week I've gotten to my computer in the morning and seen that X has been killed by the oom-killer, but still there's no working command prompt, forcing me to perform a hard reboot. I'm currently on Linux 6.5.4, but I've had these symptoms for several months.

Here's /proc/meminfo after a near episode:

$ cat /proc/meminfo
MemTotal:       16325952 kB
MemFree:         3066180 kB
MemAvailable:    6168588 kB
Buffers:          249620 kB
Cached:          2929100 kB
SwapCached:            0 kB
Active:          2651708 kB
Inactive:        1897508 kB
Active(anon):     851468 kB
Inactive(anon):   545120 kB
Active(file):    1800240 kB
Inactive(file):  1352388 kB
Unevictable:          48 kB
Mlocked:              48 kB
SwapTotal:             0 kB
SwapFree:              0 kB
Zswap:                 0 kB
Zswapped:              0 kB
Dirty:               264 kB
Writeback:             0 kB
AnonPages:       1345336 kB
Mapped:           355888 kB
Shmem:             26092 kB
KReclaimable:     287208 kB
Slab:             367188 kB
SReclaimable:     287208 kB
SUnreclaim:        79980 kB
KernelStack:        9024 kB
PageTables:        14052 kB
SecPageTables:         0 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     8162976 kB
Committed_AS:    3999412 kB
VmallocTotal:   34359738367 kB
VmallocUsed:       50872 kB
VmallocChunk:          0 kB
Percpu:             2368 kB
HardwareCorrupted:     0 kB
AnonHugePages:    616448 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
FileHugePages:     96256 kB
FilePmdMapped:     90112 kB
CmaTotal:              0 kB
CmaFree:               0 kB
Unaccepted:            0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:               0 kB
DirectMap4k:     8841028 kB
DirectMap2M:     6832128 kB
DirectMap1G:     1048576 kB

And top:

$ top -o +%MEM -bn1 | head -n20
top - 08:56:08 up 16:49,  1 user,  load average: 0.08, 0.11, 0.21
Tasks: 202 total,   1 running, 201 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us,  2.2 sy,  0.0 ni, 97.8 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  15943.3 total,   2851.9 free,  10054.8 used,   3392.8 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.   5888.5 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 159161 nathan    20   0   11.0g 563776 255088 S   0.0   3.5   1:41.61 firefox
   6233 nathan    20   0 1968456 345184  42640 S   0.0   2.1   2:14.12 python3
   7534 nathan    20   0 1915260 301544  50764 S   0.0   1.8   2:23.51 python3
 159945 nathan    20   0 1374048 169432 125476 S   0.0   1.0   0:03.70 alacrit+
    719 nathan    20   0 1012200 165364 103112 S   0.0   1.0  73:26.79 Xorg
   3018 nathan    20   0 1192412 161160 111940 S   0.0   1.0   5:50.73 alacrit+
 159373 nathan    20   0 2514796 160728  89872 S   0.0   1.0   0:01.63 Isolate+
 162281 nathan    20   0 2496652 148536  91272 S   0.0   0.9   0:02.18 Isolate+
 159304 nathan    20   0   18.5g 128484  84576 S   0.0   0.8   0:04.32 WebExte+
 159539 nathan    20   0 2470680 115480  89880 S   0.0   0.7   0:20.84 Isolate+
 159265 nathan    20   0 2482816 114552  87164 S   0.0   0.7   0:01.49 Privile+
 160268 nathan    20   0 2453784 110272  86912 S   0.0   0.7   0:00.63 Isolate+
 159403 nathan    20   0 2450516 106224  85900 S   0.0   0.7   0:05.35 Isolate+

So in summary, huge RAM usage (currently 10 of 16G used), but not by user processes, which leads me to suspect that it must be the kernel itself gobbling it all up. Are there other possibilities of who the culprit might be? And if it is the kernel, how do I go about getting more fine-grained diagnostics, such as which module might be at fault so that I can blacklist it? Whenever I have this issue and the system hasn't locked up yet, I've been going down the lsmod list and rmmod'ing stuff until the system becomes unusable, but so far no reduction in memory usage.
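
One way to test the "is it the kernel?" suspicion against the numbers above is to add up the kernel-side counters that /proc/meminfo does account for. A minimal sketch follows; the function name kernel_accounted_kib and the choice of fields (Slab, KernelStack, PageTables, VmallocUsed, Percpu) are assumptions made for illustration, not an official definition of kernel memory.

```shell
# kernel_accounted_kib [FILE]
# Sum a hand-picked set of kernel-side counters (in kB) from a meminfo-style
# file. FILE defaults to /proc/meminfo. The field selection is an assumption.
kernel_accounted_kib() {
    awk '
        # SReclaimable and SUnreclaim are already included in Slab,
        # so only Slab itself is counted here.
        $1 == "Slab:" || $1 == "KernelStack:" || $1 == "PageTables:" ||
        $1 == "VmallocUsed:" || $1 == "Percpu:" { sum += $2 }
        END { printf "%d\n", sum }
    ' "${1:-/proc/meminfo}"
}
```

For the snapshot above this sum is 443504 kB (about 433 MiB), far less than the roughly 10 GiB that top reports as used, so whatever holds the rest does not show up in these counters.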

seth · 2023-09-25 14:19:27

I have possibly the same issue as this user back in 2016

Possibly not.

A 7 year younger seth wrote:

2.5GB slab, most of it unclaimable.
The kernel or some module *is* pretty hungry.

Slab:             367188 kB
SReclaimable:     287208 kB
SUnreclaim:        79980 kB

Do you have an AMD GPU?

If you let the system run and leak a bit (so much that there's an obvious gap between MemTotal and the sum of MemFree, Active and Inactive), then isolate the multi-user.target (which kills your X11 server and every client, so make sure to save any active data beforehand) and then drop all caches: does your RAM return?

https://bbs.archlinux.org/viewtopic.php … 3#p2096273

natervance · 2023-09-25 16:27:34

I do indeed have an AMD GPU:

$ lspci -k | grep "VGA" -A 3
01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X/590] (rev ef)
	Subsystem: Gigabyte Technology Co., Ltd Radeon RX 570 Gaming 4G
	Kernel driver in use: amdgpu
	Kernel modules: amdgpu

I'm still waiting for another leak to be sprung. When it does, just to be clear, this is what I shall do:

$ sudo systemctl isolate multi-user.target

This kills X and all. Then:

$ sudo sysctl vm.drop_caches=3
$ cat /proc/meminfo

and see what I get. I'll cross my fingers and wait until the opportunity arises.
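
To make the before/after comparison easier to read, the two /proc/meminfo dumps could be saved to files and compared field by field. This is only a sketch: the helper name meminfo_diff and the file names in the usage note are hypothetical, and the output format is my own choice.

```shell
# meminfo_diff BEFORE AFTER
# Print every field whose value differs between two saved meminfo dumps,
# followed by the change (AFTER minus BEFORE) in the unit of that field.
meminfo_diff() {
    awk '
        NR == FNR { before[$1] = $2; next }   # first file: remember each value
        $1 in before {
            delta = $2 - before[$1]           # second file: compute the change
            if (delta != 0) printf "%s %d\n", $1, delta
        }
    ' "$1" "$2"
}
```

Usage would be along the lines of saving cat /proc/meminfo to before.txt, running the isolate and drop_caches steps, saving it again to after.txt, and then calling meminfo_diff before.txt after.txt. Running sync before dropping caches is probably a good idea as well, since dirty pages are not freed by drop_caches.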

seth · 2023-09-25 19:00:35

Yup.
Unfortunately that will only tell you what's happening, not why or how to stop it.

seth · 2023-09-29 17:58:34

Can you please post your complete system journal for the boot:

sudo journalctl -b | curl -F 'file=@-' 0x0.st

natervance · 2023-09-29 19:45:10

My system journal: http://0x0.st/HV0K.txt

I had only accumulated about 2G of memory leakage, but I went through the test anyway. Here's before:

MemTotal:       16325948 kB
MemFree:         8318044 kB
MemAvailable:    9385904 kB
Buffers:           45032 kB
Cached:          2641080 kB
SwapCached:            0 kB
Active:          2852764 kB
Inactive:        1908536 kB
Active(anon):    1988840 kB
Inactive(anon):  1700408 kB
Active(file):     863924 kB
Inactive(file):   208128 kB
Unevictable:         576 kB
Mlocked:             576 kB
SwapTotal:             0 kB
SwapFree:              0 kB
Zswap:                 0 kB
Zswapped:              0 kB
Dirty:               136 kB
Writeback:             0 kB
AnonPages:       2065304 kB
Mapped:           394776 kB
Shmem:           1614060 kB
KReclaimable:     333236 kB
Slab:             459084 kB
SReclaimable:     333236 kB
SUnreclaim:       125848 kB
KernelStack:       12864 kB
PageTables:        22508 kB
SecPageTables:         0 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     8162972 kB
Committed_AS:    7115020 kB
VmallocTotal:   34359738367 kB
VmallocUsed:       54940 kB
VmallocChunk:          0 kB
Percpu:             2272 kB
HardwareCorrupted:     0 kB
AnonHugePages:    669696 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
FileHugePages:     32768 kB
FilePmdMapped:         0 kB
CmaTotal:              0 kB
CmaFree:               0 kB
Unaccepted:            0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:               0 kB
DirectMap4k:     5062468 kB
DirectMap2M:    11659264 kB
DirectMap1G:           0 kB

(MemTotal - MemFree - Active - Inactive - Slab = 2722.19 MB)
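
For reference, the figure in parentheses can be computed from a saved dump with a small awk helper. This is a sketch of the arithmetic only; the function name unaccounted_mib is made up for this example, and the result is in units of 1024 kB.

```shell
# unaccounted_mib [FILE]
# Compute (MemTotal - MemFree - Active - Inactive - Slab) / 1024 from a
# meminfo-style file. FILE defaults to /proc/meminfo.
unaccounted_mib() {
    awk '
        $1 == "MemTotal:" { total = $2 }
        # "Active:" and "Inactive:" match only the totals, not the
        # (anon)/(file) breakdown lines.
        $1 == "MemFree:" || $1 == "Active:" ||
        $1 == "Inactive:" || $1 == "Slab:" { rest += $2 }
        END { printf "%.2f\n", (total - rest) / 1024 }
    ' "${1:-/proc/meminfo}"
}
```

Fed the "before" dump above, this prints 2722.19, matching the value quoted.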

And here's after isolating rescue.target and dropping caches:

MemTotal:       16325948 kB
MemFree:        14292604 kB
MemAvailable:   14164004 kB
Buffers:            3044 kB
Cached:          1577408 kB
SwapCached:            0 kB
Active:           153656 kB
Inactive:        1675852 kB
Active(anon):     122092 kB
Inactive(anon):  1669428 kB
Active(file):      31564 kB
Inactive(file):     6424 kB
Unevictable:          32 kB
Mlocked:              32 kB
SwapTotal:             0 kB
SwapFree:              0 kB
Zswap:                 0 kB
Zswapped:              0 kB
Dirty:                 8 kB
Writeback:             0 kB
AnonPages:        249124 kB
Mapped:            31896 kB
Shmem:           1542440 kB
KReclaimable:      41804 kB
Slab:             106072 kB
SReclaimable:      41804 kB
SUnreclaim:        64268 kB
KernelStack:        3616 kB
PageTables:         2728 kB
SecPageTables:         0 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     8162972 kB
Committed_AS:    2466884 kB
VmallocTotal:   34359738367 kB
VmallocUsed:       45156 kB
VmallocChunk:          0 kB
Percpu:             2272 kB
HardwareCorrupted:     0 kB
AnonHugePages:    196608 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
FileHugePages:         0 kB
FilePmdMapped:         0 kB
CmaTotal:              0 kB
CmaFree:               0 kB
Unaccepted:            0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:               0 kB
DirectMap4k:     5062468 kB
DirectMap2M:    11659264 kB
DirectMap1G:           0 kB

(MemTotal - MemFree - Active - Inactive - Slab = 95.4727 MB)

Interestingly, I was mistaken about multi-user.target killing X, so I first went through the whole process of isolating multi-user.target (and was baffled that X was still up), dropped caches anyway, then checked and got the following:

MemTotal:       16325948 kB
MemFree:        10043656 kB
MemAvailable:   10259540 kB
Buffers:            6836 kB
Cached:          2043596 kB
SwapCached:            0 kB
Active:          1566068 kB
Inactive:        2564892 kB
Active(anon):    1165064 kB
Inactive(anon):  2529648 kB
Active(file):     401004 kB
Inactive(file):    35244 kB
Unevictable:         576 kB
Mlocked:             576 kB
SwapTotal:             0 kB
SwapFree:              0 kB
Zswap:                 0 kB
Zswapped:              0 kB
Dirty:                 4 kB
Writeback:             0 kB
AnonPages:       2081148 kB
Mapped:           431788 kB
Shmem:           1614184 kB
KReclaimable:      65196 kB
Slab:             182644 kB
SReclaimable:      65196 kB
SUnreclaim:       117448 kB
KernelStack:       12208 kB
PageTables:        22600 kB
SecPageTables:         0 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     8162972 kB
Committed_AS:    7083444 kB
VmallocTotal:   34359738367 kB
VmallocUsed:       54080 kB
VmallocChunk:          0 kB
Percpu:             2272 kB
HardwareCorrupted:     0 kB
AnonHugePages:    665600 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
FileHugePages:     16384 kB
FilePmdMapped:     16384 kB
CmaTotal:              0 kB
CmaFree:               0 kB
Unaccepted:            0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:               0 kB
DirectMap4k:     5062468 kB
DirectMap2M:    11659264 kB
DirectMap1G:           0 kB

(MemTotal - MemFree - Active - Inactive - Slab = 1922.55 MB)

Does this mean that something to do with X is to blame for leaking memory?

seth · 2023-09-29 20:08:13

Depending on how you start X11 it might not be bound to the graphical.target - unfortunately you don't have the crazy "amdgpu: AGP" value I was hoping for, so that seems to be a red herring.
Something's leaking textures, or more likely causing the amdgpu kernel module to leak them - another user seems to accelerate that w/ picom + jetbrains (I suspect the latter flickering some window)

This could be induced by the client directly or by the X11 server - are you using the xf86-video-amdgpu driver?

natervance · 2023-09-29 20:16:40

I don't have xf86-video-amdgpu, though perhaps I should? I do, however, have amdvlk, which I see on the wiki is discouraged. I'll try replacing it with vulkan-radeon and see how stable that one is.

seth · 2023-09-29 20:23:49

though perhaps I should

Typically not, but you may also try whether it can sidestep the issue.

Edit, links for later:
https://gitlab.freedesktop.org/drm/amd/-/issues/2863
https://gitlab.freedesktop.org/drm/amd/-/issues/1942 ( https://gitlab.freedesktop.org/drm/amd/ … te_1426514 talks about a systemic issue)
https://gitlab.freedesktop.org/drm/amd/-/issues/2360 (though should be fixed in the non-LTS kernel)

Last edited by seth (2023-09-30 13:29:45)
