NFS performance dies if a share is read by a process with parent PID 1

Artlav · 2023-03-05 11:27:16

So i have an NFS share mounted, which at first works fine but at some point the directory listing becomes very slow.
After some trial and error i found that the slowness comes from the kernel sending a lot of ACCESS messages to the server on every directory read, a behaviour triggered when a directory on the share is read by a process whose parent PID is 1.

Normally a directory read on an NFS share generates a READDIR and some noise, every time the directory is read.
In the failure mode in question it sends a lot of ACCESS messages on each directory read instead, which kills performance.
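
One way to put numbers on this without a packet capture is the per-operation counters that the Linux NFS client exposes in /proc/self/mountstats. The sketch below is only an illustration: the function names nfs_op_count and nfs_ops_during are made up here, and it assumes the usual mountstats layout in which each operation line reads `NAME: <calls> ...`.

```shell
# Sum the call counter for one NFS operation across all NFS mounts.
# $1 = operation name as it appears in mountstats (e.g. ACCESS, READDIR)
# $2 = stats file to read; defaults to the live kernel view
nfs_op_count() {
    awk -v op="$1:" '$1 == op { n += $2 } END { print n + 0 }' \
        "${2:-/proc/self/mountstats}"
}

# Print how many calls of operation $1 were issued while the remaining
# arguments ran as a command. Runs in a subshell so no variables leak.
nfs_ops_during() (
    op=$1
    shift
    before=$(nfs_op_count "$op")
    "$@" >/dev/null 2>&1
    after=$(nfs_op_count "$op")
    echo $((after - before))
)
```

Something like `nfs_ops_during ACCESS ls /mnt/nas/photos`, run once while things are fine and once after the failure has been triggered, should make the difference visible if the counters behave as assumed.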

It permanently switches to this failure mode when any directory on the share is read by something with a parent PID of 1,
such as anything launched by a hotkey or opened from mc, and probably some other cases.

https://i.imgur.com/60qQc1g.png

This is fixed either by unmounting and remounting the NFS share, or by dropping caches (echo 2 > /proc/sys/vm/drop_caches).
Then it all works fine again, until the next time i try to view a photo or something.

It is effectively impossible to google this kind of problem, since all that gets returned is performance tuning and common issues.

This does not appear to be a server related problem, since i tried it with several very different servers.
It is not specific to NFSv4, also happens on NFSv3.
None of the common mount parameters seem to change anything.
It is not app specific - every program works fine if launched from terminal emulator, every program triggers it to fail if started by an XFCE hotkey or my mc's extension open. Can be as simple as ls.
Both scenarios have the same `id` and `set` output.
NFS client debugging `rpcdebug -m nfs -s all` produces a long log with nothing i can make sense of.

Anyone have any idea what could be causing this, or how to google for this kind of problem?

Last edited by Artlav (2023-03-05 16:00:41)

Trilby · 2023-03-05 14:49:02

One diagnostic test I'd recommend is setting a hotkey that will launch a shell that will start one of these processes. For example, change your key binding for that "Image Viewer" from `/path/to/image-viewer` to `/bin/sh -c "image-viewer"`. In such a case, does the resulting image viewer fall into the "ok" or "not ok" category? In other words, is this really about the parent PID being 1, or is it perhaps related to something in a shellrc or other startup file (that is inherited by all the current "ok" examples)?

seth · 2023-03-05 14:55:23

Please replace the oversized image w/ a link and in general avoid posting images of text, post the text (eg. ps fax or pstree)

pstree -puST

Are those still in the same cgroup?

/proc/<pid>/cgroup

Compare the environments

tr '\0' '\n' < /proc/<pid>/environ

What if you launch a process (probably has to be graphical) from the terminal and then
a) disown it
b) "just" close stdin/out/err ?

Edit: also should have F5'd

Last edited by seth (2023-03-05 14:55:53)

Artlav · 2023-03-05 15:50:12

Trilby wrote:

One diagnostic test I'd recommend is setting a hotkey that will launch a shell that will start one of these processes.

Tried that, with ls.
Simple ls launched like that is enough to trigger the issue.

Artlav · 2023-03-05 15:59:12

seth wrote:

Are those still in the same cgroup?
Compare the environments

I don't see any significant differences.

UID          PID    PPID  C STIME TTY          TIME CMD
artlav     28402   28290  2 16:51 pts/5    00:00:00 imv
artlav     28448       1  2 16:51 pts/4    00:00:00 imv /mnt/nas/photos/020327/dscn0005.jpg

/proc/28402/cgroup:
0::/user.slice/user-1000.slice/session-1.scope

/proc/28448/cgroup:
0::/user.slice/user-1000.slice/session-1.scope

diff of environ:
6a7
> HISTCONTROL=ignoreboth
10d10
< MC_EXT_FILENAME=/mnt/nas/photos/020327/dscn0005.jpg
13,14d12
< MC_EXT_BASENAME=dscn0005.jpg
< MC_EXT_SELECTED=dscn0005.jpg
36d34
< MC_EXT_CURRENTDIR=/mnt/nas/photos/020327
40d37
< MC_EXT_ONLYTAGGED=
56a54
> PS1=\[\e[32m\][\[\e[m\]\[\e[31m\]\u\[\e[m\]\[\e[33m\]@\[\e[m\]\[\e[32m\]\h\[\e[m\]:\[\e[36m\]\w\[\e[m\]\[\e[32m\]]\[\e[m\]\[\e[32m\]\$\[\e[m\] 
62c60
< PATH=/home/artlav/.wasmtime/bin:/home/artlav/.wasmtime/bin:/usr/local/sbin:/usr/local/bin:/usr/bin:/ubin:/usr/lib/jvm/default/bin:/ubin:/ubin/fpc_bin:/usr/bin/site_perl:/usr/bin/vendor_perl:/usr/bin/core_perl
---
> PATH=/home/artlav/.wasmtime/bin:/home/artlav/.wasmtime/bin:/home/artlav/.wasmtime/bin:/usr/local/sbin:/usr/local/bin:/usr/bin:/ubin:/usr/lib/jvm/default/bin:/ubin:/ubin/fpc_bin:/usr/bin/site_perl:/usr/bin/vendor_perl:/usr/bin/core_perl
68c66,67
< _=/usr/bin/nohup
---
> OLDPWD=/mnt/nas/photos
> _=/usr/bin/imv
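
For anyone wanting to repeat the comparison, a diff like the one above can be produced roughly as follows. This is a sketch, not the exact commands used here: env_lines and env_diff are hypothetical helper names. Keep in mind that /proc/<pid>/environ holds the environment the process was started with.

```shell
# Print an environment one VAR=value per line, sorted.
# $1 = a NUL-separated environ file, e.g. /proc/28402/environ
env_lines() {
    tr '\0' '\n' < "$1" | sort
}

# Diff the environments stored in two environ files.
# Exit status follows diff: 0 = identical, 1 = different.
env_diff() (
    a=$(mktemp) && b=$(mktemp) || exit 2
    env_lines "$1" > "$a"
    env_lines "$2" > "$b"
    diff "$a" "$b"
    rc=$?
    rm -f "$a" "$b"
    exit "$rc"
)
```

Usage would be along the lines of `env_diff /proc/28448/environ /proc/28402/environ`, with the PIDs of the "not ok" and "ok" process.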

seth · 2023-03-05 16:01:22

< _=/usr/bin/nohup

Artlav · 2023-03-05 16:03:44

Hm, just tried one more thing, and apparently any interaction from a PPID 1 process is enough, not just listing a directory; e.g. cat'ing a file.

Artlav · 2023-03-05 16:09:12

seth wrote:

< _=/usr/bin/nohup

Launching something with nohup from a terminal (not as PPID 1) does not trigger the issue.

Tried again without nohup involved, same effect, no extra differences.

seth · 2023-03-05 16:14:56

seth wrote:

What if you launch a process (probably has to be graphical) from the terminal and then
a) disown it
b) "just" close stdin/out/err ?

Artlav · 2023-03-05 16:30:04

seth wrote:

What if you launch a process (probably has to be graphical) from the terminal and then
a) disown it
b) "just" close stdin/out/err ?

Interesting.

So, launching a process from terminal then disowning it does not seem to trigger it.
But once i close the terminal i launched it from, then it triggers without any extra actions.

After launch:
UID          PID    PPID  C STIME TTY          TIME CMD
artlav     32842   32756  4 17:27 pts/5    00:00:00 imv

disown 32842:
UID          PID    PPID  C STIME TTY          TIME CMD
artlav     32842   32756  2 17:27 pts/5    00:00:00 imv

Close terminal:
UID          PID    PPID  C STIME TTY          TIME CMD
artlav     32842       1  1 17:27 ?        00:00:00 imv

Not sure how to do b).
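
The reparenting seen in the ps output above can also be checked from a script. A minimal sketch, assuming a Linux /proc; ppid_of and is_orphaned are names invented for this example. On setups with a child subreaper an orphan may end up under a parent other than PID 1, which this does not cover.

```shell
# Print the parent PID of process $1, read from /proc/<pid>/status.
ppid_of() {
    awk '$1 == "PPid:" { print $2 }' "/proc/$1/status"
}

# Succeed if process $1 has been reparented to PID 1.
is_orphaned() {
    [ "$(ppid_of "$1")" = "1" ]
}
```

With the PID from above, `is_orphaned 32842 && echo "PPID is 1"` would print only after the launching terminal has been closed.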

seth · 2023-03-05 17:02:23

imv 0<&- 1<&- 2<&-

but disown would inherently do that.

Artlav · 2023-03-06 21:39:18

seth wrote:

imv 0<&- 1<&- 2<&-

That does not seem to trigger it.

seth · 2023-03-06 23:13:02

Do you get the same as root?
(Speculating that this is less about NFS and more about "man 3 readdir/access" and that it's also not about PID 1, but that the UID of the $PPID is different from the $PID's UID)

Artlav · 2023-03-07 21:00:16

seth wrote:

Do you get the same as root?
(Speculating that this is less about NFS and more about "man 3 readdir/access" and that it's also not about PID 1, but that the UID of the $PPID is different from the $PID's UID)

Same story running as root.

Also, found a way to make it happen on its own.
Add "acdirmax=1" to the mount parameters, and it would trigger by itself some tens of seconds after mount.

seth · 2023-03-08 07:35:55

That's likely equivalent to "noac", what if you reverse this and set the cache to a very generous "acdirmin=3600,acdirmax=3600"?
Though I don't think that's gonna help: you're causing the same symptoms by deactivating the cache, which can either mean that the loss of the parent process breaks the cache or that it just causes similar symptoms (like coughing can be caused by pulmonary aspiration or some virus…)
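
When experimenting with these options it can help to confirm what the kernel actually applied to the mount. A rough sketch, assuming the standard /proc/mounts layout (device, mount point, type, options); mount_opts and has_mount_opt are hypothetical helpers, and whether a given timeout is listed there at all depends on the client.

```shell
# Print the option string the kernel reports for mount point $1.
# $2 = mounts table to read; defaults to /proc/mounts
mount_opts() {
    awk -v mp="$1" '$2 == mp { print $4 }' "${2:-/proc/mounts}"
}

# Succeed if option $2 (e.g. acdirmax=3600) is listed for mount point $1.
# $3 = optional mounts table, passed through to mount_opts
has_mount_opt() {
    mount_opts "$1" "$3" | tr ',' '\n' | grep -qxF -- "$2"
}
```

For example, `has_mount_opt /mnt/nas acdirmax=3600 || echo "option not active"` after remounting.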

Artlav · 2023-03-08 13:15:03

seth wrote:

what if you reverse this and set the cache to a very generous "acdirmin=3600,acdirmax=3600"?

No effect, it still breaks.

It really is like the cache is getting switched off for some reason. Things like copying lots of small files to or from a NAS become 10 times slower as well.
These also get padded by a lot of ACCESS RPCs.

Artlav · 2023-03-08 14:35:23

The plot thickens.
I tried LTS kernel 6.1.15, and everything works just fine with it!

So this might be some sort of a kernel regression between 6.1.15 and 6.2.1

Artlav · 2023-03-08 15:00:28

Ok, kernels 6.2.0, 6.2.1 and 6.2.2 are all broken, while 6.1.15 is good.
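
To check whether a given machine runs a kernel at or past the first known-bad release, a version comparison along these lines might do. It is a sketch that assumes GNU sort (for -V); version_ge and kernel_possibly_affected are illustrative names, and no upper bound is checked because the fixed version is not known at this point.

```shell
# Succeed if version $1 is the same as or newer than version $2,
# using GNU sort's version ordering.
version_ge() {
    [ "$(printf '%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Succeed if the running kernel is 6.2 or later, i.e. at or past the
# first release observed to be broken above.
kernel_possibly_affected() {
    version_ge "$(uname -r)" 6.2
}
```

So `version_ge 6.1.15 6.2` fails while `version_ge 6.2.1 6.2` succeeds, matching the good and bad kernels listed above.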

Can't find this exact problem being discussed anywhere, but there is a heated discussion about some NFS regression that started at exactly the right time here:
https://www.spinics.net/lists/linux-nfs/msg95290.html

Not sure if it's related or not, seems to be describing something slightly different.

seth · 2023-03-08 15:16:02

https://github.com/torvalds/linux/commi … 908c9dead1

Since commit 1a34c8c ("NFS: Support larger readdir buffers") has
updated dtsize, and with recent improvements to the READDIRPLUS helper
heuristic, the heuristic may not trigger until many dentries are emitted
to userspace. This will cause many thousands of GETATTR calls for "ls
-l" when the directory's pagecache has already been populated. This
manifests as poor performance for long directory listings after an
initially fast "ls -l".

The problem is how this relates to losing the PPID.
https://github.com/torvalds/linux/commi … 0a6c4943af and https://github.com/torvalds/linux/commi … 7dd0c7f902 might be interesting as well.

Artlav · 2023-03-08 18:11:38

Well, bad news is that i tried vanilla kernel 6.3-rc1, and it's still broken.

seth · 2023-03-08 20:22:44

https://bugzilla.kernel.org/buglist.cgi?component=NFS
https://wiki.archlinux.org/title/Bisect … s_with_Git
https://wiki.archlinux.org/title/Compile_kernel_module

Artlav · 2023-03-09 13:30:14

seth wrote:

https://bugzilla.kernel.org/buglist.cgi?component=NFS
https://wiki.archlinux.org/title/Bisect … s_with_Git
https://wiki.archlinux.org/title/Compile_kernel_module

Bisected to commit https://github.com/torvalds/linux/commi … 7dd0c7f902
Submitted a bug: https://bugzilla.kernel.org/show_bug.cgi?id=217165

And now we wait...

loserMcloser · 2023-03-27 22:56:32

Oh man, I hope the fix for this also fixes my issues with NFS. I've had /home mounted over NFS on my various machines from my main home server for years without issues, and all of a sudden (since about two weeks ago), performance has been terrible. I don't have any problems with ls, and dd read/write speeds are fine, but I get delays in the 10-20 second range trying to open apps, or even doing tab-autocomplete in my terminal. Driving me nuts. Took a look with wireshark (not that I understand much of it) and saw lots of GETATTR and ACCESS calls for nfs....

Artlav · 2023-03-28 17:35:23

Yep, that sounds like this issue. Should be fixed soon.

Artlav · 2023-04-02 16:36:50

Looks like the linux package got the fix a few days ago.
And yes, it works fine now.
