So I have an NFS share mounted, which at first works fine, but at some point directory listings become very slow.
After some trial and error I found that the slowness is because the kernel is sending a lot of ACCESS messages to the server on directory read, a behaviour triggered when a directory on the share is read by a process that has a parent PID of 1.
Normally a directory read on an NFS share generates READDIR and some noise. Every time a directory is read, this is what happens.
In the failure mode in question it sends a lot of ACCESS messages on directory read instead, which kills performance.
It permanently switches to this failure mode when any directory on the share is read by something with a parent PID of 1,
such as anything launched by a hotkey, opened from mc, and probably some other cases.
https://i.imgur.com/60qQc1g.png
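For anyone who wants numbers instead of a packet capture, here is a rough sketch of counting the RPC calls from the client side. It assumes the per-op lines in /proc/self/mountstats have the form `ACCESS: <ops> <transmissions> ...`, with the second field being the number of calls issued; `nfs_op_count` is a made-up helper name, not an existing tool.

```shell
# Sum the call counter for one RPC operation across all NFS mounts.
# Reads mountstats-formatted text on stdin. Assumption: on a per-op line
# the second field is the number of operations issued so far.
nfs_op_count() {
    awk -v op="$1:" '$1 == op { sum += $2 } END { print sum + 0 }'
}

# Usage on a live client (snapshot before and after one directory read):
#   before=$(nfs_op_count ACCESS < /proc/self/mountstats)
#   ls /mnt/nas/photos > /dev/null
#   after=$(nfs_op_count ACCESS < /proc/self/mountstats)
#   echo "ACCESS calls for one listing: $((after - before))"
```

Comparing the ACCESS and READDIR deltas for the same listing in the good and the bad state should show the difference without needing a capture.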
This is fixed either by unmounting and remounting the NFS share, or by dropping caches (echo 2 > /proc/sys/vm/drop_caches).
Then it all works fine again, until the next time I try to view a photo or something.
It is effectively impossible to google this kind of problem, since all that gets returned is performance tuning and common issues.
This does not appear to be a server-related problem, since I tried it with several very different servers.
It is not specific to NFSv4, also happens on NFSv3.
None of the common mount parameters seem to change anything.
It is not app specific: every program works fine if launched from a terminal emulator, and every program triggers the failure if started by an XFCE hotkey or via mc's extension open. It can be as simple as ls.
Both scenarios have the same `id` and `set` output.
NFS client debugging with `rpcdebug -m nfs -s all` produces a long log with nothing I can make sense of.
Anyone have any idea what could be causing this, or how to google for this kind of problem?
Last edited by Artlav (2023-03-05 16:00:41)
Offline
One diagnostic test I'd recommend is setting a hotkey that will launch a shell that will start one of these processes. For example, change your key binding for that "Image Viewer" from `/path/to/image-viewer` to `/bin/sh -c "image-viewer"`. In such a case, does the resulting image viewer fall into the "ok" or "not ok" category? In other words, is this really about the parent PID being 1, or is it perhaps related to something in a shellrc or other startup file (that is inherited by all the current "ok" examples)?
Offline
Please replace the oversized image w/ a link and in general avoid posting images of text, post the text (eg. ps fax or pstree)
pstree -puST
Are those still in the same cgroup?
/proc/<pid>/cgroup
Compare the environments
tr '\0' '\n' < /proc/<pid>/environ
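A minimal sketch of how that comparison could be scripted, in case it helps. `env_lines` and `env_diff` are made-up helper names, and the PIDs in the usage comment are placeholders.

```shell
# Print a NUL-separated environment file (as found in /proc/<pid>/environ)
# as sorted lines, one variable per line.
env_lines() {
    tr '\0' '\n' < "$1" | sort
}

# Show only the variables that differ between two environment files.
# Lines starting with "<" come from the first file, ">" from the second.
env_diff() {
    a=$(mktemp) b=$(mktemp)
    env_lines "$1" > "$a"
    env_lines "$2" > "$b"
    diff "$a" "$b" | grep '^[<>]'
    rm -f "$a" "$b"
}

# Usage, with one "ok" and one "not ok" process:
#   env_diff /proc/<ok-pid>/environ /proc/<bad-pid>/environ
```

Sorting first keeps the diff from flagging variables that merely appear in a different order.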
What if you launch a process (probably has to be graphical) from the terminal and then
a) disown it
b) "just" close stdin/out/err ?
Edit: also should have F5'd
Last edited by seth (2023-03-05 14:55:53)
Online
One diagnostic test I'd recommend is setting a hotkey that will launch a shell that will start one of these processes.
Tried that, with ls.
Simple ls launched like that is enough to trigger the issue.
Offline
Are those still in the same cgroup?
Compare the environments
I don't see any significant differences.
UID     PID    PPID   C  STIME  TTY    TIME      CMD
artlav  28402  28290  2  16:51  pts/5  00:00:00  imv
artlav  28448  1      2  16:51  pts/4  00:00:00  imv /mnt/nas/photos/020327/dscn0005.jpg
/proc/28402/cgroup:
0::/user.slice/user-1000.slice/session-1.scope
/proc/28448/cgroup:
0::/user.slice/user-1000.slice/session-1.scope
diff of environ:
6a7
> HISTCONTROL=ignoreboth
10d10
< MC_EXT_FILENAME=/mnt/nas/photos/020327/dscn0005.jpg
13,14d12
< MC_EXT_BASENAME=dscn0005.jpg
< MC_EXT_SELECTED=dscn0005.jpg
36d34
< MC_EXT_CURRENTDIR=/mnt/nas/photos/020327
40d37
< MC_EXT_ONLYTAGGED=
56a54
> PS1=\[\e[32m\][\[\e[m\]\[\e[31m\]\u\[\e[m\]\[\e[33m\]@\[\e[m\]\[\e[32m\]\h\[\e[m\]:\[\e[36m\]\w\[\e[m\]\[\e[32m\]]\[\e[m\]\[\e[32m\]\$\[\e[m\]
62c60
< PATH=/home/artlav/.wasmtime/bin:/home/artlav/.wasmtime/bin:/usr/local/sbin:/usr/local/bin:/usr/bin:/ubin:/usr/lib/jvm/default/bin:/ubin:/ubin/fpc_bin:/usr/bin/site_perl:/usr/bin/vendor_perl:/usr/bin/core_perl
---
> PATH=/home/artlav/.wasmtime/bin:/home/artlav/.wasmtime/bin:/home/artlav/.wasmtime/bin:/usr/local/sbin:/usr/local/bin:/usr/bin:/ubin:/usr/lib/jvm/default/bin:/ubin:/ubin/fpc_bin:/usr/bin/site_perl:/usr/bin/vendor_perl:/usr/bin/core_perl
68c66,67
< _=/usr/bin/nohup
---
> OLDPWD=/mnt/nas/photos
> _=/usr/bin/imv
Offline
< _=/usr/bin/nohup
Online
Hm, just tried one more thing, and apparently any interaction from a PPID 1 process is enough, not just listing a directory. E.g. cat of a file.
Offline
< _=/usr/bin/nohup
Launching something with nohup from a terminal (not as PPID 1) does not trigger the issue.
Tried again without nohup involved, same effect, no extra differences.
Offline
What if you launch a process (probably has to be graphical) from the terminal and then
a) disown it
b) "just" close stdin/out/err ?
Online
What if you launch a process (probably has to be graphical) from the terminal and then
a) disown it
b) "just" close stdin/out/err ?
Interesting.
So, launching a process from a terminal and then disowning it does not seem to trigger it.
But once I close the terminal I launched it from, it triggers without any extra actions.
After launch:
UID     PID    PPID   C  STIME  TTY    TIME      CMD
artlav  32842  32756  4  17:27  pts/5  00:00:00  imv
disown 32842:
UID     PID    PPID   C  STIME  TTY    TIME      CMD
artlav  32842  32756  2  17:27  pts/5  00:00:00  imv
Close terminal:
UID     PID    PPID   C  STIME  TTY    TIME      CMD
artlav  32842  1      1  17:27  ?      00:00:00  imv
Not sure how to do b).
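The PPID column in the listings above can also be read straight from /proc, which makes it easy to check when exactly a process gets reparented. This is only a sketch: `ppid_from_status` and `is_orphaned` are made-up names, and it assumes the usual `PPid:` line in /proc/<pid>/status.

```shell
# Print the parent PID recorded in a /proc/<pid>/status-formatted file.
ppid_from_status() {
    awk '$1 == "PPid:" { print $2 }' "$1"
}

# Succeed when the given PID has been reparented to PID 1.
is_orphaned() {
    [ "$(ppid_from_status "/proc/$1/status")" = "1" ]
}

# Usage:
#   is_orphaned 32842 && echo "now a child of PID 1"
```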
Offline
imv 0<&- 1<&- 2<&-
but disown would inherently do that.
Online
imv 0<&- 1<&- 2<&-
That does not seem to trigger it.
Offline
Do you get the same as root?
(Speculating that this is less about NFS and more about "man 3 readdir/access" and that it's also not about PID 1, but that the UID of the $PPID is different from the $PID's UID)
Online
Do you get the same as root?
(Speculating that this is less about NFS and more about "man 3 readdir/access" and that it's also not about PID 1, but that the UID of the $PPID is different from the $PID's UID)
Same story running as root.
Also, found a way to make it happen on its own.
Add "acdirmax=1" to the mount parameters, and it triggers by itself some tens of seconds after mount.
Offline
That's likely equivalent to "noac", what if you reverse this and set the cache to a very generous "acdirmin=3600,acdirmax=3600"?
Though I don't think that's gonna help: you're causing the same symptoms by deactivating the cache, which can mean either that the loss of the parent process breaks the cache or that it just causes similar symptoms (like coughing can be caused by pulmonary aspiration or some virus…)
Online
what if you reverse this and set the cache to a very generous "acdirmin=3600,acdirmax=3600"?
No effect, it still breaks.
It really is like the cache is getting switched off for some reason. Things like copying lots of small files to or from the NAS become 10 times slower as well.
Those transfers also get padded by a lot of ACCESS RPCs.
Offline
The plot thickens.
I tried LTS kernel 6.1.15, and everything works just fine with it!
So this might be some sort of kernel regression between 6.1.15 and 6.2.1.
Offline
Ok, kernels 6.2.0, 6.2.1 and 6.2.2 are all broken, while 6.1.15 is good.
Can't find this exact problem being discussed anywhere, but there is a heated discussion about some NFS regression that started at exactly the right time here:
https://www.spinics.net/lists/linux-nfs/msg95290.html
Not sure if it's related or not, seems to be describing something slightly different.
Offline
https://github.com/torvalds/linux/commi … 908c9dead1
Since commit 1a34c8c ("NFS: Support larger readdir buffers") has
updated dtsize, and with recent improvements to the READDIRPLUS helper
heuristic, the heuristic may not trigger until many dentries are emitted
to userspace. This will cause many thousands of GETATTR calls for "ls
-l" when the directory's pagecache has already been populated. This
manifests as poor performance for long directory listings after an
initially fast "ls -l".
The problem is how this relates to losing the PPID.
https://github.com/torvalds/linux/commi … 0a6c4943af and https://github.com/torvalds/linux/commi … 7dd0c7f902 might be interesting as well.
Online
Well, the bad news is that I tried vanilla kernel 6.3-rc1, and it's still broken.
Offline
Bisected to commit https://github.com/torvalds/linux/commi … 7dd0c7f902
Submitted a bug: https://bugzilla.kernel.org/show_bug.cgi?id=217165
And now we wait...
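For anyone landing here with a similar regression, the mechanics of a bisect look roughly like the toy sketch below. It runs `git bisect run` on a throwaway repository where commit 7 stands in for the bad commit; it is not the actual kernel bisect. On the kernel tree the equivalent start would be something like `git bisect start`, `git bisect bad v6.2`, `git bisect good v6.1` (assuming those mainline tags as endpoints), with a manual build, boot and test at every step. `bisect_demo` is a made-up name.

```shell
# Toy demonstration of automated bisection on a throwaway repository.
# The word "fast" or "slow" in a file stands in for the real check
# (build the kernel, boot it, try to reproduce the slowdown).
bisect_demo() (
    repo=$(mktemp -d) || exit 1
    cd "$repo" || exit 1
    git init -q .
    git config user.email "demo@example.invalid"
    git config user.name "Bisect Demo"
    i=1
    while [ "$i" -le 10 ]; do
        # From commit 7 onwards the "regression" is present.
        if [ "$i" -ge 7 ]; then echo slow > state; else echo fast > state; fi
        echo "$i" > counter
        git add state counter
        git commit -q -m "commit $i"
        i=$((i + 1))
    done
    # HEAD is known bad, the root commit is known good.
    git bisect start HEAD "$(git rev-list --max-parents=0 HEAD)" > /dev/null
    # Exit status 0 marks a commit good, 1 marks it bad.
    git bisect run grep -q fast state > /dev/null
    # refs/bisect/bad now points at the first bad commit.
    git log -1 --format=%s refs/bisect/bad
    git bisect reset > /dev/null 2>&1
    rm -rf "$repo"
)

bisect_demo   # prints "commit 7"
```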
Offline
Oh man, I hope the fix for this also fixes my issues with NFS. I've had /home mounted over NFS on my various machines from my main home server for years without issues, and all of a sudden (since about two weeks ago), performance has been terrible. I don't have any problems with ls, and dd read/write speeds are fine, but I get delays in the 10-20 second range trying to open apps, or even doing tab-autocomplete in my terminal. Driving me nuts. Took a look with wireshark (not that I understand much of it) and saw lots of GETATTR and ACCESS calls for nfs....
Offline
Yep, that sounds like this issue. Should be fixed soon.
Offline
Looks like the linux package got the fix a few days ago.
And yes, it works fine now.
Offline