You are not logged in.
Since some time (I guess 15'th April) I get some freezes in programs like libreoffice, firefox.
A very uncertain guess is that it started when updating xorg-server and xorg-server-common from 21.1.12-1.
Looking in the journal I find something similar as below, only with different tasks, e.g. sqldb:p~lite, Permission, mplayer, nemo, firefox, ...
But I'm not sure if this blocking really has something to do with the freezing?
kernel: INFO: task sqldb:p~lite #2:2524 blocked for more than 122 seconds.
kernel: Tainted: P OE 6.8.7-arch1-1 #1
kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kernel: task:sqldb:p~lite #2 state:D stack:0 pid:2524 tgid:2290 ppid:1338 flags:0x00000002
kernel: Call Trace:
kernel: <TASK>
kernel: __schedule+0x3e6/0x1520
kernel: schedule+0x32/0xd0
kernel: schedule_preempt_disabled+0x15/0x30
kernel: rwsem_down_read_slowpath+0x2aa/0x540
kernel: down_read+0x48/0xb0
kernel: walk_component+0xcc/0x150
kernel: path_lookupat+0x6a/0x1a0
kernel: ? __rseq_handle_notify_resume+0xa9/0x4f0
kernel: filename_lookup+0xe8/0x1f0
kernel: vfs_statx+0x92/0x1c0
kernel: vfs_fstatat+0x85/0xb0
kernel: __do_sys_newfstatat+0x3f/0x80
kernel: do_syscall_64+0x83/0x170
kernel: ? ptep_set_access_flags+0x32/0x40
kernel: ? wp_page_reuse+0x8e/0xa0
kernel: ? do_wp_page+0xf8/0xb90
kernel: ? __pte_offset_map+0x1b/0x180
kernel: ? __handle_mm_fault+0xb2c/0xe50
kernel: ? __count_memcg_events+0x4d/0xc0
kernel: ? count_memcg_events.constprop.0+0x1a/0x30
kernel: ? handle_mm_fault+0xa2/0x360
kernel: ? do_user_addr_fault+0x304/0x670
kernel: ? exc_page_fault+0x7f/0x180
kernel: entry_SYSCALL_64_after_hwframe+0x78/0x80
kernel: RIP: 0033:0x7eab7311586e
kernel: RSP: 002b:00007eab46dbd2a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000106
kernel: RAX: ffffffffffffffda RBX: 00007eab46dbd370 RCX: 00007eab7311586e
kernel: RDX: 00007eab46dbd2b8 RSI: 00007eab35c0ce96 RDI: 00000000ffffff9c
kernel: RBP: 0000000000000000 R08: 00007eab3276e188 R09: 0000000000000000
kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 00007eab3590b980
kernel: R13: 0000000000000000 R14: 0000000000000000 R15: 00007eab63f6d450
kernel: </TASK>
kernel: INFO: task sqldb:c~lite #3:2526 blocked for more than 122 seconds.
kernel: Tainted: P OE 6.8.7-arch1-1 #1
kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kernel: task:sqldb:c~lite #3 state:D stack:0 pid:2526 tgid:2290 ppid:1338 flags:0x00000002
Offline
The hung task is a symptom, rather not the cause of anything.
Please post your complete system journal for the boot:
sudo journalctl -b | curl -F 'file=@-' 0x0.st
And because of your suspicion also your Xorg log, https://wiki.archlinux.org/title/Xorg#General
Offline
Hi seth,
please find my logging data here: https://0x0.st/s/QUJm88bN_2_k9o74LymwDw/XHyg.txt
Offline
From the only hung task in that journal it very much looks like it's the AFS, specifically the PAM errr… module
But regardless of that also the pleathora of "byte-range locks only enforced for processes on this machine" - can you test w/o that mount?
Offline
Yes this may well be. I will let the packagers know.
Testing is difficult, because I don't know how to reproduce it directly.
I found to more hungs in the journal.
Apr 18 15:33:33 systemd[1113]: Started Application launched by gnome-shell.
Apr 18 15:33:40 kernel: afs: byte-range locks only enforced for processes on this machine (pid 2103 (ThreadPoo>
Apr 18 15:34:10 kernel: INFO: task Permission:2313 blocked for more than 122 seconds.
Apr 18 15:34:10 kernel: Tainted: P OE 6.8.7-arch1-1 #1
Apr 18 15:34:10 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 18 15:34:10 kernel: task:Permission state:D stack:0 pid:2313 tgid:2242 ppid:1324 flags:0x0000>
Apr 18 15:34:10 kernel: Call Trace:
Apr 18 15:34:10 kernel: <TASK>
Apr 18 15:34:10 kernel: __schedule+0x3e6/0x1520
Apr 23 16:35:00 CROND[7697]: pam_afs_session(crond:session): pam_sm_close_session: exit (success)
Apr 23 16:35:00 CROND[7697]: pam_systemd(crond:session): Failed to release session: Transport endpoint is not >
Apr 23 16:35:00 systemd[1]: session-10.scope: Deactivated successfully.
Apr 23 16:35:38 kernel: afs: byte-range locks only enforced for processes on this machine (pid 2304 (Socket Th>
Apr 23 16:35:57 kernel: afs: byte-range locks only enforced for processes on this machine (pid 2290 (firefox),>
Apr 23 16:36:45 systemd[1121]: Started Application launched by gnome-shell.
Apr 23 16:36:45 systemd[1121]: Starting GNOME Terminal Server...
Apr 23 16:36:45 systemd[1121]: Started GNOME Terminal Server.
Apr 23 16:36:46 systemd[1121]: Started VTE child process 7741 launched by gnome-terminal-server process 7732.
Apr 23 16:36:47 kernel: afs: byte-range locks only enforced for processes on this machine (pid 2125 (ThreadPoo>
Apr 23 16:36:47 kernel: INFO: task sqldb:p~lite #2:2524 blocked for more than 122 seconds.
Apr 23 16:36:47 kernel: Tainted: P OE 6.8.7-arch1-1 #1
Apr 23 16:36:47 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 23 16:36:47 kernel: task:sqldb:p~lite #2 state:D stack:0 pid:2524 tgid:2290 ppid:1338 flags:0x0000>
Apr 23 16:36:47 kernel: Call Trace:
Apr 23 16:36:47 kernel: <TASK>
Apr 23 16:36:47 kernel: __schedule+0x3e6/0x1520
Offline
Yes this may well be. I will let the packagers know.
Hi, thanks for pinging me.
It's very well possible that this is an issue with OpenAFS. To see if it is, you may want to run
cmdebug localhost
while your system is hanging. I assume it will print out an entry that later matches with the blocked task.
Do you have your home directory in AFS?
Unfortunately, this sort of issue can be hard to debug as it could be caused by the client, the server, or the connection between them. Maybe there is packet loss and some operation needs to time out before OpenAFS is trying again. Did anything change with your internet or network connection (VPN, MTU, etc.)?
The byte-range lock warnings should be OK as long as you do not access those files simultaneously from different systems.
Offline
@Bevan, yes I have my home directory in AFS. But as @seth suggested I will play around with a local user and openafs-clent stopped.
What is puzzling me a bit, is that the messages with the blocked task and call trace have not shown up again since I downgraded xorg-server. But this is probably misleading. Actually there where only a few such messages in the log file since 15'th April when I upgraded xorg-server. So the statistic is quite low. I have to be more patient.
I still get some freezes, but without the call trace messages. And I'm not sure if the freezing is identical as before. Sorry for the confusion. I have to check this further and also with openafs-client stopped, as mentioned above.
Thanks for your help.
I will post here if I find something more conclusive.
Offline
1) With a local user, not using afs, I was not able (quickly) to produce a freeze.
2) With local user and Re-updating xorg-server and common to 21.1.13-1 also no freeze.
3) With the user having home in AFS (and the newer xorg-server) a freeze occured but still no blocked task and call trace in the journal.
The output of
cmdebug localhost
gives:
Lock afs_discon_lock status: (none_waiting, 2 read_locks(pid:0))
** Cache entry @ 0xc7c7aec0 for .... [<my domain>]
locks: (none_waiting, write_locked(pid:5483 at:147))
2048 bytes DV 30246 refcnt 2
callback 71ed3880 expires 1714483116
0 opens 0 writers
normal file
states (0x1), stat'd
** Cache entry @ 0x965d3740 for ... [<my domain>]
locks: (none_waiting, write_locked(pid:4987 at:147))
2048 bytes DV 74027 refcnt 2
callback 71ed3880 expires 1714483067
0 opens 0 writers
normal file
states (0x1), stat'd
I also dumped that to a file, which changes the output a bit to:
Lock afs_discon_lock status: (none_waiting, 1 read_locks(pid:0)) (list of all scripts in my ~/script directory (in AFS)) Cache entry @ 0x02378880 for ..... [<my domain>] locks: (none_waiting, write_locked(pid:6346 at:147)) 2048 bytes DV 30212 refcnt 2 callback bb2bf780 expires 1714473665 0 opens 0 writers normal file states (0x1), stat'
When the freezing happens, this time I don't see anything inm the journal except for the byte-range lock warnings.
Offline
and another one looks like this:
Lock afs_discon_lock status: (none_waiting, 2 read_locks(pid:0))
** Cache entry @ 0xc7c7aec0 for 182.537041393.50599.16763065 [<my domain>]
locks: (none_waiting, write_locked(pid:9477 at:147))
2048 bytes DV 30368 refcnt 2
callback 71ed3880 expires 1714483116
0 opens 0 writers
normal file
states (0x1), stat'd
** Cache entry @ 0x965d0cc0 for 182.537041393.284080.16762084 [<my domain>]
locks: (none_waiting, write_locked(pid:676 at:51))
0 bytes DV 0 refcnt 3
callback 71ed3880 expires 1714483066
1 opens 1 writers
normal file
states (0x21), stat'd
** Cache entry @ 0x965d1980 for 182.537041393.50053.16761613 [<my domain>]
locks: (none_waiting, write_locked(pid:5050 at:142))
6144 bytes DV 164874 refcnt 2
callback 71ed3880 expires 1714483065
0 opens 0 writers
normal file
states (0x1), stat'd
Offline
The way I triggered the freezing is to start libreoffice with a spread sheet and left press on one corner of the window and moving the mouse quickly for some time. Sometimes it "helped" to maximize and resize the window and than do the quick resizing with the mouse again. This lead to the window freezing and the popup of a screen "Libreoffice Calc is not responding".
Still this could be two different issues. Because "cmdebug" also sometimes return lock information if I did not stress any window.
Let me add: it looks like the to trigger the freezing, the file has to be in AFS as well. When I copied it to a local disk, I was not abel to trigger the freeze.
Sorry that last statement is wrong. It just took a bit longer, maybe by chance.
Last edited by drslmr (2024-04-30 14:20:41)
Offline
If noone has answered yet and your last post was a short time ago, please use the Edit button instead of creating a new post.
Disliking systemd intensely, but not satisfied with alternatives so focusing on taming systemd.
clean chroot building not flexible enough ?
Try clean chroot manager by graysky
Offline
This lead to the window freezing and the popup of a screen "Libreoffice Calc is not responding".
generally means that the client locked up - there was an input bug in libx11, but
a) it's supposed to be fixed and afair is for everyone who reported to be affected
b) only hit users of IMs (fcitx, ibus, …)
so I doubt it's related but what X11 packages did you suspect to have been relevant here?
=> You'd probably want to attach strace or gdb to the LO process and see where it's hanging out - and whether it's IO (can also perhaps be seen in the process state as "D") and whether it's IO that depends on the AFS mount…
Offline
Thanks seth.
I think the problem is due to the way I use AFS.
I have some directories i.e. ".config" in my home directory (in AFS). And I use the @sys feature to point it to a directory related to my machine, again in AFS.
.config -> '.@sys/.config'/
fs sysname
Current sysname is '<my machine>'
.<my machine>/.config
Second, libreoffice, and probably other programs too, write window parameters in a file in .config. That way openafs is involved when resizing windows.
In conclusion, I think this topic may be closed, at least in this place.
Offline
Threads can't be really "closed" (except by mods what locks them) - you'd add some tag to the subject but right now I'd wait, maybe use this as platform in case Bevan has more insights, until AFS is identified as the categorical culprit (and ideally the situation addressed upstream)
(Unfortunately my own expertise wrt AFS is "I'm aware that it exists")
Offline