Xorg signal 6/ABRT on shutdown, Flushing workques (legacy Nvidia)

drankinatty · 2026-03-15 18:57:48

I have had a 90 second delay in shutdown/reboot that has appeared in the last 60 days or so (6.18/6.19 kernel). Since this box is a server, it only reboots on kernel update when new kernels appear in the core repo. This box also has an old Nvidia 560 card in it and uses the nvidia-390xx-utils driver from AUR. This problem is seen with the 6.18 and 6.19 kernels, so it's not something brand new with the 6.19 kernel.

The problem is I don't know if this is Xorg, systemd, or the nvidia driver that is at the heart of the matter. It looks like the problem begins with Xorg terminating abnormally, then the systemd coredump fails (looks like due to the facilities already being down for it)0, then the kernel stack dump/call trace happens. This is the 90 second delay in shutdown I see until this does whatever it does, timeout, etc.. and then shutdown/reboot proceeds normally.

The journal entries related to the shutdown crash are:

Mar 14 23:00:56 valkyrie mkinitcpio[171509]: ==> Starting build: 'none'
Mar 14 23:00:56 valkyrie mkinitcpio[171509]:   -> Running build hook: [sd-shutdown]
Mar 14 23:00:56 valkyrie dovecot[807]: master: Warning: Killed with signal 15 (by pid=171503 uid=0 code=kill)
Mar 14 23:00:56 valkyrie systemd-coredump[171643]: Process 899 (Xorg) of user 0 terminated abnormally with signal 6/ABRT
, processing...
Mar 14 23:00:56 valkyrie systemd-coredump[171643]: Failed to send coredump datagram: Broken pipe
Mar 14 23:00:56 valkyrie mkinitcpio[171509]: ==> Build complete.
Mar 14 23:00:56 valkyrie systemd[1]: mkinitcpio-generate-shutdown-ramfs.service: Deactivated successfully.
Mar 14 23:00:56 valkyrie systemd[1]: Finished Generate shutdown-ramfs.
Mar 14 23:00:56 valkyrie kernel: WARNING: Flushing system-wide workqueues will be prohibited in near future.
Mar 14 23:00:56 valkyrie kernel: CPU: 5 UID: 0 PID: 1064 Comm: InputThread Tainted: P           OE       6.19.6-arch1-1
#1 PREEMPT(full)  a70f585a3574c37bff18875a6cf7bd8652b4cbca
Mar 14 23:00:56 valkyrie kernel: Tainted: [P]=PROPRIETARY_MODULE, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
Mar 14 23:00:56 valkyrie kernel: Hardware name: Gigabyte Technology Co., Ltd. To be filled by O.E.M./990FXA-UD3, BIOS F3
 05/28/2015
Mar 14 23:00:56 valkyrie kernel: Call Trace:
Mar 14 23:00:56 valkyrie kernel:  <TASK>
Mar 14 23:00:56 valkyrie kernel:  dump_stack_lvl+0x5d/0x80
Mar 14 23:00:56 valkyrie kernel:  os_flush_work_queue+0x41/0x60 [nvidia 0f3cb87a613037cd37443fc0f9e701834c4ef2b7]
Mar 14 23:00:56 valkyrie kernel:  rm_disable_adapter+0x52/0xd0 [nvidia 0f3cb87a613037cd37443fc0f9e701834c4ef2b7]
Mar 14 23:00:56 valkyrie kernel:  nv_shutdown_adapter+0x17/0xa0 [nvidia 0f3cb87a613037cd37443fc0f9e701834c4ef2b7]
Mar 14 23:00:56 valkyrie kernel:  nv_close_device+0x106/0x190 [nvidia 0f3cb87a613037cd37443fc0f9e701834c4ef2b7]
Mar 14 23:00:56 valkyrie kernel:  ? down+0x1e/0x70
Mar 14 23:00:56 valkyrie kernel:  nvidia_close+0x91/0x3b0 [nvidia 0f3cb87a613037cd37443fc0f9e701834c4ef2b7]
Mar 14 23:00:56 valkyrie kernel:  nvidia_frontend_close+0x2e/0x50 [nvidia 0f3cb87a613037cd37443fc0f9e701834c4ef2b7]
Mar 14 23:00:56 valkyrie kernel:  __fput+0xe6/0x2a0
Mar 14 23:00:56 valkyrie kernel:  task_work_run+0x5d/0x90
Mar 14 23:00:56 valkyrie kernel:  do_exit+0x2a9/0xaa0
Mar 14 23:00:56 valkyrie kernel:  do_group_exit+0x2d/0xc0
Mar 14 23:00:56 valkyrie kernel:  get_signal+0x81c/0x840
Mar 14 23:00:56 valkyrie kernel:  arch_do_signal_or_restart+0x3e/0x260
Mar 14 23:00:56 valkyrie kernel:  exit_to_user_mode_loop+0x7c/0x4b0
Mar 14 23:00:56 valkyrie kernel:  do_syscall_64+0x206/0x610
Mar 14 23:00:56 valkyrie kernel:  ? do_epoll_ctl+0x2b5/0xef0
Mar 14 23:00:56 valkyrie kernel:  ? switch_fpu_return+0x4e/0xd0
Mar 14 23:00:56 valkyrie kernel:  ? __x64_sys_epoll_ctl+0x6f/0xa0
Mar 14 23:00:56 valkyrie kernel:  ? do_syscall_64+0x3c4/0x610
Mar 14 23:00:56 valkyrie kernel:  ? ep_poll_callback+0x261/0x2a0
Mar 14 23:00:56 valkyrie kernel:  ? __wake_up_common+0x72/0xa0
Mar 14 23:00:56 valkyrie kernel:  ? futex_hash+0x86/0xa0
Mar 14 23:00:56 valkyrie kernel:  ? futex_wake+0xac/0x1c0
Mar 14 23:00:56 valkyrie kernel:  ? __schedule+0x390/0x1720
Mar 14 23:00:56 valkyrie kernel:  ? do_futex+0x11f/0x190
Mar 14 23:00:56 valkyrie kernel:  ? __x64_sys_futex+0x12d/0x220
Mar 14 23:00:56 valkyrie kernel:  ? do_syscall_64+0x3c4/0x610
Mar 14 23:00:56 valkyrie kernel:  ? ksys_write+0xcd/0xf0
Mar 14 23:00:56 valkyrie kernel:  ? do_syscall_64+0x3c4/0x610
Mar 14 23:00:56 valkyrie kernel:  ? __pfx_ep_autoremove_wake_function+0x10/0x10
Mar 14 23:00:56 valkyrie kernel:  ? __x64_sys_epoll_wait+0x70/0x120
Mar 14 23:00:56 valkyrie kernel:  ? exit_to_user_mode_loop+0x302/0x4b0
Mar 14 23:00:56 valkyrie kernel:  ? switch_fpu_return+0x4e/0xd0
Mar 14 23:00:56 valkyrie kernel:  ? do_syscall_64+0x23b/0x610
Mar 14 23:00:56 valkyrie kernel:  ? irqentry_exit+0x378/0x5e0
Mar 14 23:00:56 valkyrie kernel:  ? __irq_exit_rcu+0x4c/0xf0
Mar 14 23:00:56 valkyrie kernel:  entry_SYSCALL_64_after_hwframe+0x76/0x7e
Mar 14 23:00:56 valkyrie kernel: RIP: 0033:0x7f3dbeadef32
Mar 14 23:00:56 valkyrie kernel: Code: Unable to access opcode bytes at 0x7f3dbeadef08.
Mar 14 23:00:56 valkyrie kernel: RSP: 002b:00007f3dacffde68 EFLAGS: 00000246 ORIG_RAX: 00000000000000e8
Mar 14 23:00:56 valkyrie kernel: RAX: fffffffffffffffc RBX: 00007f3dacffdee0 RCX: 00007f3dbeadef32
Mar 14 23:00:56 valkyrie kernel: RDX: 0000000000000100 RSI: 00007f3dacffdee0 RDI: 000000000000001f
Mar 14 23:00:56 valkyrie kernel: RBP: 00007f3dacffde90 R08: 0000000000000000 R09: 0000000000000000
Mar 14 23:00:56 valkyrie kernel: R10: ffffffffffffffff R11: 0000000000000246 R12: 0000562b0f279738
Mar 14 23:00:56 valkyrie kernel: R13: 0000562b0f24e3c0 R14: 00007f3dacfffce4 R15: 00007ffd76cb9787
Mar 14 23:00:56 valkyrie kernel:  </TASK>

What concerns me is the "WARNING: Flushing system-wide workqueues will be prohibited in near future." and not knowing whether this is a related cause, or just a warning that things will change in the future? However it does look related to the stack dump beginning with "os_flush_work_queue+0x41/0x60 [nvidia 0f3cb87a613037cd37443fc0f9e701834c4ef2b7]" and the following lines. I'm not great at reading kernel call traces, but this sure looks like this crash is triggered by the attempt to os_flush_work_queue which starts the process to disable the Nvidia card and shut it down. I have no clue what the register contents tell me about the issue, otherr than what the registers contained when the crash occurred. I don't understand "Unable to access opcode bytes at 0x7f3dbeadef08" other than whatever opcode it was trying to execute is no longer functional (is that the flush is already removed?)

The crux here is I help maintain the nvidia-390xx-utils driver, so if this is a driver issue we need to patch, then that's likely something I can get with the others that help out and try and work through. But I don't know if this is Xorg, or driver, or systemd, or kernel or what? Any help either resolving this issue, or in identifying which of the involved packages need to be fixed would be greatly appreciated. If you need more info, just let me know.

Last edited by drankinatty (2026-03-15 19:34:57)

loqs · 2026-03-15 19:31:19

drankinatty wrote:

What concerns me is the "WARNING: Flushing system-wide workqueues will be prohibited in near future." and not knowing whether this is a related cause, or just a warning that things will change in the future? However it does look related to the stack dump beginning with "os_flush_work_queue+0x41/0x60 [nvidia 0f3cb87a613037cd37443fc0f9e701834c4ef2b7]" and the following lines.

The warning is the cause of the backtrace. At this point it is just a warning that the Nvidia driver called flush_scheduled_work.

drankinatty · 2026-03-15 19:40:26

Thank you. So if that's the case, then the driver needs to be patched to prevent a system-wide flush call. Oh joy, that will take a bit of digging and hopefully it's not in the binary blob. I don't experience the same hang with the openSUSE patched 390 driver (what they call their G04 driver) and the 6.19 kernel, so I'll check with the maintainer there to see if he included a patch we missed.

If anyone is familiar with the code for the system-wide flush as part of the driver shut-down, chime in and let us know.

loqs · 2026-03-15 19:51:11

You can make the call a noop. Switching to a private workqueue would be problematic as I believe 390xx has a binary blob that might be using the global workqueue.

drankinatty · 2026-03-15 20:04:13

Thank you. I'll look though the sources and try and find where the call is being made. (hopefully a case-insensitive "flush" will help). If a noop will do, that is something we can do without messing it up

loqs · 2026-03-15 20:33:59

kernel/common/inc/nv-linux.h uses flush_scheduled_work:

....
#if defined(NV_KMEM_CACHE_HAS_KOBJ_REMOVE_WORK) && !defined(NV_SYSFS_SLAB_UNLINK_PRESENT)
static inline void nv_kmem_ctor_dummy(void *arg)
{
    (void)arg;
}
#define NV_KMEM_CACHE_DESTROY_FLUSH flush_scheduled_work
#else
#define nv_kmem_ctor_dummy NULL
#define NV_KMEM_CACHE_DESTROY_FLUSH()
#endif
.....
#define NV_WORKQUEUE_FLUSH()                           \
    flush_scheduled_work();

grep also hit kernel/nvidia/nv-kernel.o_binary

Do note currently using flush_scheduled_work should only generate a warning with backtrace. It should not be triggering a 90 second delay which may be a systemd service timeout which may or may not be related.

Last edited by loqs (2026-03-15 20:35:55)

drankinatty · 2026-03-15 20:55:46

You are right on the 90 second delay. That is due to my NUT driver a few lines later:

Mar 14 23:00:59 valkyrie systemd[1]: mariadb.service: Consumed 54.685s CPU time over 6d 9h 11min 4.313s wall clock time, 238.8M memory peak.
Mar 14 23:02:26 valkyrie systemd[1]: nut-driver.service: State 'stop-sigterm' timed out. Killing.
Mar 14 23:02:26 valkyrie systemd[1]: nut-driver.service: Killing process 831 (usbhid-ups) with signal SIGKILL.
Mar 14 23:02:26 valkyrie systemd[1]: nut-driver.service: Main process exited, code=killed, status=9/KILL

Sorry for the 90 second noise. On the workque flush, It looks like the invoking culprits are kernel/nvidia/nv.c:1875:

/*
 * Tears down the device on the last file close. Assumes nvl->ldata_lock is
 * held.
 */
static void nv_stop_device(nv_state_t *nv, nvidia_stack_t *sp)
{
    ...
    if (NV_IS_GVI_DEVICE(nv))
    {
        rm_shutdown_gvi_device(sp, nv);
        NV_WORKQUEUE_FLUSH();
        free_irq(nv->interrupt_line, (void *)nvl);
    }

and kernel/nvidia/nv.c:4112:

void
nvidia_remove(struct pci_dev *dev)
{
    ...
    if (NV_IS_GVI_DEVICE(nv))
    {
        NV_WORKQUEUE_FLUSH();
        rm_gvi_detach_device(sp, nv);
        rm_gvi_free_private_state(sp, nv);
    }

On the stop and remove, why not simply condition the call to NV_WORKQUEUE_FLUSH(); on kernels less than 6.18? I'll have to look into that further, but if the NV_WORKQUEUE_FLUSH(); (flush_scheduled_work();) is causing the stack dump on shutdown for 6.18 or greater, would just omitting the call for 6.18 or greater be viable?

loqs · 2026-03-15 21:15:00

I would have replaced it in nv-linux.h:

#if defined(NV_KMEM_CACHE_HAS_KOBJ_REMOVE_WORK) && !defined(NV_SYSFS_SLAB_UNLINK_PRESENT)
    static inline void nv_kmem_ctor_dummy(void *arg)
    {
        (void)arg;
    }

    #if LINUX_VERSION_CODE >= KERNEL_VERSION(6, 18, 0)
        #define NV_KMEM_CACHE_DESTROY_FLUSH ((void)0)
    #else
        #define NV_KMEM_CACHE_DESTROY_FLUSH flush_scheduled_work
    #endif
#else
    #define nv_kmem_ctor_dummy NULL
    #define NV_KMEM_CACHE_DESTROY_FLUSH()
#endif
....
#if LINUX_VERSION_CODE >= KERNEL_VERSION(6, 18, 0)
    #define NV_WORKQUEUE_FLUSH() do { } while (0)
#else
    #define NV_WORKQUEUE_FLUSH() flush_scheduled_work();
#endif

Both approaches should produce the same result.

drankinatty · 2026-03-15 21:32:36

Your approach handling it in the header and leaving the source-file alone is cleaner. I'll give it a go and report back. Thank you for your help! That wisdom of experience to fill in the gaps in my understanding is worth its weight in gold.

drankinatty · 2026-03-17 01:11:42

This is the patch I came up with for testing. If you have time, give it a test, I'll report back with results after I apply it and can find a time in the middle of the night to reboot the server (people seem to get upset if they find the server missing -- even for a few moments....)

diff -ruNb a/kernel/common/inc/nv-linux.h b/kernel/common/inc/nv-linux.h
--- a/kernel/common/inc/nv-linux.h      2026-03-08 13:24:42.740759537 -0500
+++ b/kernel/common/inc/nv-linux.h      2026-03-16 20:05:14.506496123 -0500
@@ -1129,7 +1129,21 @@
 {
     (void)arg;
 }
-#define NV_KMEM_CACHE_DESTROY_FLUSH flush_scheduled_work
+
+/*
+ * On Linux, beginning with kernel 6.18 flushing system-wide workqueues was
+ * removed resulting in a stack dump on driver shutdown and removal. The
+ * following additions eliminate the system-wide workqueue flush on
+ * shutdown by making call to NV_KMEM_CACHE_DESTROY_FLUSH a noop for kenrels
+ * 6.18 and later. (below and at line 1541 for NV_WORKQUEUE_FLUSH)
+ *
+ * See discussion: https://bbs.archlinux.org/viewtopic.php?id=312658
+ */
+#if LINUX_VERSION_CODE >= KERNEL_VERSION(6, 18, 0)
+    #define NV_KMEM_CACHE_DESTROY_FLUSH ((void)0)
+#else
+    #define NV_KMEM_CACHE_DESTROY_FLUSH flush_scheduled_work
+#endif
 #else
 #define nv_kmem_ctor_dummy NULL
 #define NV_KMEM_CACHE_DESTROY_FLUSH()
@@ -1524,8 +1538,14 @@
 } nv_work_t;

 #define NV_WORKQUEUE_SCHEDULE(work) schedule_work(work)
+/* system-wide workqueue flust removal, cont. from 1151 */
+#if LINUX_VERSION_CODE >= KERNEL_VERSION(6, 18, 0)
+#define NV_WORKQUEUE_FLUSH() do { } while (0)
+#else
 #define NV_WORKQUEUE_FLUSH()                           \
     flush_scheduled_work();
+#endif
+
 #if (NV_INIT_WORK_ARGUMENT_COUNT == 2)
 #define NV_WORKQUEUE_INIT(tq,handler,data)             \
     {                                                  \

Last edited by drankinatty (2026-03-20 22:35:02)

drankinatty · 2026-03-20 21:25:09

Patch successfully removes stack dump and call trace on shutdown. I named the patch kernel-6.18-nv_workqueue_flush.patch and it is applied in the PKGBUILD as folows:

source=('nvidia-drm-outputclass.conf'
        ...
        kernel-6.18-nv_workqueue_flush.patch
        kernel-4.16+-memory-encryption.patch)
...
    # developed to eliminate call to system-wide workqueue flush on shutdown causing
    # stackdump and call-trace.
    #   https://bbs.archlinux.org/viewtopic.php?id=312658
    patch -Np1 -i ../kernel-6.18-nv_workqueue_flush.patch

(of course bump pkgver by +1)

Thanks for your help @logs!

Arch Linux

#1 2026-03-15 18:57:48

Xorg signal 6/ABRT on shutdown, Flushing workques (legacy Nvidia)

#2 2026-03-15 19:31:19

Re: Xorg signal 6/ABRT on shutdown, Flushing workques (legacy Nvidia)

#3 2026-03-15 19:40:26

Re: Xorg signal 6/ABRT on shutdown, Flushing workques (legacy Nvidia)

#4 2026-03-15 19:51:11

Re: Xorg signal 6/ABRT on shutdown, Flushing workques (legacy Nvidia)

#5 2026-03-15 20:04:13

Re: Xorg signal 6/ABRT on shutdown, Flushing workques (legacy Nvidia)

#6 2026-03-15 20:33:59

Re: Xorg signal 6/ABRT on shutdown, Flushing workques (legacy Nvidia)

#7 2026-03-15 20:55:46

Re: Xorg signal 6/ABRT on shutdown, Flushing workques (legacy Nvidia)

#8 2026-03-15 21:15:00

Re: Xorg signal 6/ABRT on shutdown, Flushing workques (legacy Nvidia)

#9 2026-03-15 21:32:36

Re: Xorg signal 6/ABRT on shutdown, Flushing workques (legacy Nvidia)

#10 2026-03-17 01:11:42

Re: Xorg signal 6/ABRT on shutdown, Flushing workques (legacy Nvidia)

#11 2026-03-20 21:25:09

Re: Xorg signal 6/ABRT on shutdown, Flushing workques (legacy Nvidia)

Board footer