[SOLVED] Repeated kernel problems/freezes since 6.7.6 (Nvidia 550)

mesaprotector · 2024-04-06 14:52:06

seth wrote:

Update, for anyone affected by this:
If you feel adventurous to try the latest nvidia driver again, the problem is increasingly likely the nvidia_uvm module.
This is mostly relevant for cuda and to use the GPU in containers, but also for https://wiki.archlinux.org/title/NVIDIA … with_NVENC and eg. utilized by gimp for HW acceleration.
If you critically rely on that, this isn't a viable solution but it would still be awesome if you can confirm the condition.
Afaict the module also comes w/ a ton of parameters, so if you have it around, posting
modinfo nvidia_uvm
might shed further light.
https://wiki.archlinux.org/title/Kernel … and_line_2
module_blacklist=nvidia_uvm
And if anyone here happens to post at https://forums.developer.nvidia.com/t/s … /284772/27 you might want to forward this.
Particularly if you could confirm it.

I'm still back on 545, but I guess if someone is on 550 they can use this for comparison. (Output of modinfo nvidia_uvm).

filename:       /lib/modules/6.6.25-1-lts/updates/dkms/nvidia-uvm.ko.zst
version:        545.29.06
supported:      external
license:        Dual MIT/GPL
srcversion:     F211ABD912933D70C263150
depends:        nvidia
retpoline:      Y
name:           nvidia_uvm
vermagic:       6.6.25-1-lts SMP preempt mod_unload 
sig_id:         PKCS#7
signer:         DKMS module signing key
sig_key:        41:B7:D0:17:6D:35:50:04:09:82:DE:8C:92:9F:F4:42:61:0E:86:5B
sig_hashalgo:   sha512
signature:      07:67:94:8D:9A:3F:7E:4C:7E:AF:7C:38:B1:4E:84:CC:8A:8F:04:AF:
		05:FD:F3:C4:EE:67:57:25:7D:63:E1:2B:E0:CA:B8:52:BC:6D:C3:E1:
		FB:F1:4E:78:0B:0A:BB:45:BC:14:61:75:7F:0E:5C:8F:86:4D:2D:D9:
		AB:52:99:84:D8:3B:D1:AC:BC:70:EC:7C:5B:6B:D3:8E:96:2C:1D:8C:
		B4:30:98:54:A8:7B:46:AB:A3:28:FB:89:98:9E:17:E8:95:7F:27:CF:
		A9:41:C9:74:3F:F6:B9:57:FD:7D:3F:2D:57:A7:96:71:DA:F3:32:1D:
		13:59:F1:D3:72:1D:27:77:69:88:7D:B3:9D:8B:22:FA:49:3B:A7:4D:
		36:24:CE:54:DC:69:90:AB:11:98:F0:00:E2:2D:8F:BC:AF:CF:62:AA:
		CD:9D:59:AA:E4:00:89:4D:65:0B:9C:28:44:8F:C3:41:BB:EB:09:B1:
		80:49:29:B7:E7:E7:6B:A1:58:31:98:A8:32:F4:1E:71:84:70:3B:D3:
		0E:52:62:26:F3:C4:B7:12:0C:1B:22:43:F9:28:55:B2:AC:52:51:43:
		40:97:74:67:A2:05:5A:52:3D:D6:34:70:0F:8D:04:9A:18:4F:AC:9B:
		37:1E:76:96:6E:EE:03:CE:B3:B4:B3:09:65:35:77:F2
parm:           uvm_ats_mode:Set to 0 to disable ATS (Address Translation Services). Any other value is ignored. Has no effect unless the platform supports ATS. (int)
parm:           uvm_perf_prefetch_enable:uint
parm:           uvm_perf_prefetch_threshold:uint
parm:           uvm_perf_prefetch_min_faults:uint
parm:           uvm_perf_thrashing_enable:uint
parm:           uvm_perf_thrashing_threshold:uint
parm:           uvm_perf_thrashing_pin_threshold:uint
parm:           uvm_perf_thrashing_lapse_usec:uint
parm:           uvm_perf_thrashing_nap:uint
parm:           uvm_perf_thrashing_epoch:uint
parm:           uvm_perf_thrashing_pin:uint
parm:           uvm_perf_thrashing_max_resets:uint
parm:           uvm_perf_map_remote_on_native_atomics_fault:uint
parm:           uvm_disable_hmm:Force-disable HMM functionality in the UVM driver. Default: false (HMM is enabled if possible). However, even with uvm_disable_hmm=false, HMM will not be enabled if is not supported in this driver build configuration, or if ATS settings conflict with HMM. (bool)
parm:           uvm_perf_migrate_cpu_preunmap_enable:int
parm:           uvm_perf_migrate_cpu_preunmap_block_order:uint
parm:           uvm_global_oversubscription:Enable (1) or disable (0) global oversubscription support. (int)
parm:           uvm_perf_pma_batch_nonpinned_order:uint
parm:           uvm_cpu_chunk_allocation_sizes:OR'ed value of all CPU chunk allocation sizes. (uint)
parm:           uvm_leak_checker:Enable uvm memory leak checking. 0 = disabled, 1 = count total bytes allocated and freed, 2 = per-allocation origin tracking. (int)
parm:           uvm_force_prefetch_fault_support:uint
parm:           uvm_debug_enable_push_desc:Enable push description tracking (uint)
parm:           uvm_debug_enable_push_acquire_info:Enable push acquire information tracking (uint)
parm:           uvm_page_table_location:Set the location for UVM-allocated page tables. Choices are: vid, sys. (charp)
parm:           uvm_perf_access_counter_mimc_migration_enable:Whether MIMC access counters will trigger migrations.Valid values: <= -1 (default policy), 0 (off), >= 1 (on) (int)
parm:           uvm_perf_access_counter_momc_migration_enable:Whether MOMC access counters will trigger migrations.Valid values: <= -1 (default policy), 0 (off), >= 1 (on) (int)
parm:           uvm_perf_access_counter_batch_count:uint
parm:           uvm_perf_access_counter_granularity:Size of the physical memory region tracked by each counter. Valid values asof Volta: 64k, 2m, 16m, 16g (charp)
parm:           uvm_perf_access_counter_threshold:Number of remote accesses on a region required to trigger a notification.Valid values: [1, 65535] (uint)
parm:           uvm_perf_reenable_prefetch_faults_lapse_msec:uint
parm:           uvm_perf_fault_batch_count:uint
parm:           uvm_perf_fault_replay_policy:uint
parm:           uvm_perf_fault_replay_update_put_ratio:uint
parm:           uvm_perf_fault_max_batches_per_service:uint
parm:           uvm_perf_fault_max_throttle_per_service:uint
parm:           uvm_perf_fault_coalesce:uint
parm:           uvm_fault_force_sysmem:Force (1) using sysmem storage for pages that faulted. Default: 0. (int)
parm:           uvm_perf_map_remote_on_eviction:int
parm:           uvm_exp_gpu_cache_peermem:Force caching for mappings to peer memory. This is an experimental parameter that may cause correctness issues if used. (uint)
parm:           uvm_exp_gpu_cache_sysmem:Force caching for mappings to system memory. This is an experimental parameter that may cause correctness issues if used. (uint)
parm:           uvm_downgrade_force_membar_sys:Force all TLB invalidation downgrades to use MEMBAR_SYS (uint)
parm:           uvm_channel_num_gpfifo_entries:uint
parm:           uvm_channel_gpfifo_loc:charp
parm:           uvm_channel_gpput_loc:charp
parm:           uvm_channel_pushbuffer_loc:charp
parm:           uvm_enable_va_space_mm:Set to 0 to disable UVM from using mmu_notifiers to create an association between a UVM VA space and a process. This will also disable pageable memory access via either ATS or HMM. (int)
parm:           uvm_enable_debug_procfs:Enable debug procfs entries in /proc/driver/nvidia-uvm (int)

Last edited by mesaprotector (2024-04-06 14:53:03)

ZulluBalti · 2024-04-06 15:01:10

Here's the output of

modinfo nvidia-uvm

filename:       /lib/modules/6.8.2-arch2-1/extramodules/nvidia-uvm.ko.xz
version:        550.67
supported:      external
license:        Dual MIT/GPL
srcversion:     E8BAEAF83C32EBD2D30C349
depends:        nvidia
retpoline:      Y
name:           nvidia_uvm
vermagic:       6.8.2-arch2-1 SMP preempt mod_unload 
parm:           uvm_ats_mode:Set to 0 to disable ATS (Address Translation Services). Any other value is ignored. Has no effect unless the platform supports ATS. (int)
parm:           uvm_perf_prefetch_enable:uint
parm:           uvm_perf_prefetch_threshold:uint
parm:           uvm_perf_prefetch_min_faults:uint
parm:           uvm_perf_thrashing_enable:uint
parm:           uvm_perf_thrashing_threshold:uint
parm:           uvm_perf_thrashing_pin_threshold:uint
parm:           uvm_perf_thrashing_lapse_usec:uint
parm:           uvm_perf_thrashing_nap:uint
parm:           uvm_perf_thrashing_epoch:uint
parm:           uvm_perf_thrashing_pin:uint
parm:           uvm_perf_thrashing_max_resets:uint
parm:           uvm_perf_map_remote_on_native_atomics_fault:uint
parm:           uvm_disable_hmm:Force-disable HMM functionality in the UVM driver. Default: false (HMM is enabled if possible). However, even with uvm_disable_hmm=false, HMM will not be enabled if is not supported in this driver build configuration, or if ATS settings conflict with HMM. (bool)
parm:           uvm_perf_migrate_cpu_preunmap_enable:int
parm:           uvm_perf_migrate_cpu_preunmap_block_order:uint
parm:           uvm_global_oversubscription:Enable (1) or disable (0) global oversubscription support. (int)
parm:           uvm_perf_pma_batch_nonpinned_order:uint
parm:           uvm_cpu_chunk_allocation_sizes:OR'ed value of all CPU chunk allocation sizes. (uint)
parm:           uvm_leak_checker:Enable uvm memory leak checking. 0 = disabled, 1 = count total bytes allocated and freed, 2 = per-allocation origin tracking. (int)
parm:           uvm_force_prefetch_fault_support:uint
parm:           uvm_debug_enable_push_desc:Enable push description tracking (uint)
parm:           uvm_debug_enable_push_acquire_info:Enable push acquire information tracking (uint)
parm:           uvm_page_table_location:Set the location for UVM-allocated page tables. Choices are: vid, sys. (charp)
parm:           uvm_perf_access_counter_mimc_migration_enable:Whether MIMC access counters will trigger migrations.Valid values: <= -1 (default policy), 0 (off), >= 1 (on) (int)
parm:           uvm_perf_access_counter_momc_migration_enable:Whether MOMC access counters will trigger migrations.Valid values: <= -1 (default policy), 0 (off), >= 1 (on) (int)
parm:           uvm_perf_access_counter_batch_count:uint
parm:           uvm_perf_access_counter_threshold:Number of remote accesses on a region required to trigger a notification.Valid values: [1, 65535] (uint)
parm:           uvm_perf_reenable_prefetch_faults_lapse_msec:uint
parm:           uvm_perf_fault_batch_count:uint
parm:           uvm_perf_fault_replay_policy:uint
parm:           uvm_perf_fault_replay_update_put_ratio:uint
parm:           uvm_perf_fault_max_batches_per_service:uint
parm:           uvm_perf_fault_max_throttle_per_service:uint
parm:           uvm_perf_fault_coalesce:uint
parm:           uvm_fault_force_sysmem:Force (1) using sysmem storage for pages that faulted. Default: 0. (int)
parm:           uvm_perf_map_remote_on_eviction:int
parm:           uvm_block_cpu_to_cpu_copy_with_ce:Use GPU CEs for CPU-to-CPU migrations. (int)
parm:           uvm_exp_gpu_cache_peermem:Force caching for mappings to peer memory. This is an experimental parameter that may cause correctness issues if used. (uint)
parm:           uvm_exp_gpu_cache_sysmem:Force caching for mappings to system memory. This is an experimental parameter that may cause correctness issues if used. (uint)
parm:           uvm_downgrade_force_membar_sys:Force all TLB invalidation downgrades to use MEMBAR_SYS (uint)
parm:           uvm_channel_num_gpfifo_entries:uint
parm:           uvm_channel_gpfifo_loc:charp
parm:           uvm_channel_gpput_loc:charp
parm:           uvm_channel_pushbuffer_loc:charp
parm:           uvm_enable_va_space_mm:Set to 0 to disable UVM from using mmu_notifiers to create an association between a UVM VA space and a process. This will also disable pageable memory access via either ATS or HMM. (int)
parm:           uvm_enable_debug_procfs:Enable debug procfs entries in /proc/driver/nvidia-uvm (int)
parm:           uvm_peer_copy:Choose the addressing mode for peer copying, options: phys [default] or virt. Valid for Ampere+ GPUs. (charp)
parm:           uvm_debug_prints:Enable uvm debug prints. (int)
parm:           uvm_enable_builtin_tests:Enable the UVM built-in tests. (This is a security risk) (int)
parm:           uvm_release_asserts:Enable uvm asserts included in release builds. (int)
parm:           uvm_release_asserts_dump_stack:dump_stack() on failed UVM release asserts. (int)
parm:           uvm_release_asserts_set_global_error:Set UVM global fatal error on failed release asserts. (int)
parm:           uvm_conf_computing_channel_iv_rotation_limit:ulong
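
For anyone who wants to compare the two listings without eyeballing them, here is a minimal sketch. It assumes both modinfo outputs were saved to text files; the names uvm-545.txt and uvm-550.txt and both helper functions are made up for illustration.

```shell
# Print the sorted parameter names from a saved `modinfo nvidia_uvm` output.
# Lines look like: "parm:           uvm_ats_mode:Set to 0 ... (int)"
parm_names() {
    awk '$1 == "parm:" { split($2, a, ":"); print a[1] }' "$1" | LC_ALL=C sort -u
}

# Compare two saved outputs: "- name" exists only in the first file,
# "+ name" exists only in the second file.
parm_diff() {
    old_list=$(mktemp) new_list=$(mktemp)
    parm_names "$1" > "$old_list"
    parm_names "$2" > "$new_list"
    LC_ALL=C comm -23 "$old_list" "$new_list" | sed 's/^/- /'
    LC_ALL=C comm -13 "$old_list" "$new_list" | sed 's/^/+ /'
    rm -f "$old_list" "$new_list"
}
```

Usage would be something like `modinfo nvidia_uvm > uvm-550.txt` on the 550 system and then `parm_diff uvm-545.txt uvm-550.txt`. Going by the two listings posted above, the 550.67 output lacks uvm_perf_access_counter_granularity and gains, among others, uvm_block_cpu_to_cpu_copy_with_ce and uvm_peer_copy.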

seth · 2024-04-06 20:50:07

There's unfortunately nothing that blatantly says "uvm_please_leave_cgroups_alone"
Maybe somebody on the nvidia forum knows more about that.

RaZorr · 2024-04-18 02:43:03

Unfortunately facing the same issue.

My system: Legion 5 15ACH6H, AMD ryzen 7 5800H with Radeon iGPU and nvidia RTX 3060

For me the freezes happen during:

1. Updating the system/installing a package, when it reaches

Reloading system manager configuration

This happened during a kernel update two days ago and left the system in an unbootable state. I had to update using the Arch ISO on a USB stick.

2. Shutting down the system. The system just freezes without ever shutting down

Journalctl logs from the boot that crashed: http://0x0.st/XoKh.txt

EDIT: I had set up the udev rules and the module parameter as per the "fully power down the GPU when not in use" instructions. I don't know whether that is relevant.

Last edited by RaZorr (2024-04-18 02:49:13)

seth · 2024-04-18 08:12:01

There're three "Bad page map"'s, possibly because of zswap.
But the journal concludes with

Apr 18 02:46:34 archsys systemd[1]: Starting Hostname Service...
-- Boot cef218bc495e4d460000000000000000 --

because you likely pushed the power button, see https://wiki.archlinux.org/title/Keyboard_shortcuts#Kernel_(SysRq)

Can you impact this by
a) blacklisting nvidia_uvm
b)

sudo touch /etc/systemd/do-not-udevadm-trigger-on-update

- https://bugs.archlinux.org/task/77789
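
To check whether the blacklist for a) actually took effect after the reboot, something along these lines might help. It is only a sketch: both function names are made up, and the optional second arguments exist so the checks can be run against sample text instead of the live system.

```shell
# Succeeds if module $1 appears in a /proc/modules-style listing
# ($2, default /proc/modules). The kernel lists modules with underscores.
module_loaded() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }' "${2:-/proc/modules}"
}

# Succeeds if the kernel command line ($2, default: contents of /proc/cmdline)
# has a module_blacklist=... entry that names module $1.
blacklisted_on_cmdline() {
    cmdline=${2:-$(cat /proc/cmdline)}
    for word in $cmdline; do
        case $word in
            module_blacklist=*)
                # The value may be a comma-separated list of modules.
                case ",${word#module_blacklist=}," in
                    *",$1,"*) return 0 ;;
                esac ;;
        esac
    done
    return 1
}
```

For example `blacklisted_on_cmdline nvidia_uvm && ! module_loaded nvidia_uvm && echo "uvm is out of the picture"`. If the module still shows up, something loaded it explicitly despite the blacklist.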

RaZorr · 2024-04-18 09:41:22

I will test this soon. Have a deadline today and was frustrated when the laptop froze during training a neural network model. I was installing btrbk to have snapshots during Pacman updates when this happened.

Edit: Yes I forced shut down using the power button. I have Sysrq enabled in

cat /etc/sysctl.d/99-magicsysrq.conf
kernel.sysrq=1

But forgot it was there.
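
In case it helps others check what their own kernel.sysrq value permits: per the kernel's sysrq documentation, 0 disables everything, 1 enables everything, and larger values are a bitmask (16 = sync, 32 = remount read-only, 64 = signalling processes, 128 = reboot/poweroff). A rough sketch, with a made-up function name:

```shell
# Succeeds if kernel.sysrq value $1 permits the function group with bit $2
# (e.g. 16 = sync, 32 = remount read-only, 128 = reboot/poweroff).
sysrq_allows() {
    # A value of exactly 1 means "all functions enabled".
    if [ "$1" -eq 1 ]; then
        return 0
    fi
    # Otherwise the value is a bitmask; 0 has no bits set and allows nothing.
    [ $(( $1 & $2 )) -ne 0 ]
}
```

On a live system that would be `sysrq_allows "$(cat /proc/sys/kernel/sysrq)" 128 && echo "SysRq reboot is permitted"`.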

Last edited by RaZorr (2024-04-18 09:58:32)

RaZorr · 2024-04-18 16:09:33

I don't know if this is relevant, but I noticed a difference between my inxi -G output and that of a Reddit user who doesn't have this issue. For me (first output below) the modesetting driver is unloaded, whereas they (second output below) have it loaded.
My output:

inxi -G
Graphics:
  Device-1: NVIDIA GA106M [GeForce RTX 3060 Mobile / Max-Q] driver: nvidia
    v: 550.67
  Device-2: AMD Cezanne [Radeon Vega Series / Radeon Mobile Series]
    driver: amdgpu v: kernel
  Device-3: Syntek Integrated Camera driver: uvcvideo type: USB
  Display: x11 server: X.Org v: 21.1.13 driver: X: loaded: amdgpu,nvidia
    unloaded: modesetting dri: radeonsi gpu: amdgpu resolution: 1920x1080~165Hz
  API: EGL v: 1.5 drivers: kms_swrast,nvidia,radeonsi,swrast
    platforms: gbm,x11,surfaceless,device
  API: OpenGL v: 4.6.0 compat-v: 4.5 vendor: amd mesa v: 24.0.5-arch1.1
    renderer: AMD Radeon Graphics (radeonsi renoir LLVM 17.0.6 DRM 3.57
    6.8.5-arch1-1)

Graphics:
  Device-1: NVIDIA TU106M [GeForce RTX 2060 Max-Q] driver: nvidia v: 550.67
  Device-2: AMD Renoir [Radeon RX Vega 6 ] driver: amdgpu v: kernel
  Display: wayland server: X.org v: 1.21.1.13 with: Xwayland v: 23.2.6
    compositor: Hyprland v: 0.39.1-1-ge8e02e81 driver: X:
    loaded: modesetting,nvidia gpu: amdgpu resolution: 1920x1080~120Hz
  API: EGL v: 1.5 drivers: nvidia,radeonsi,swrast
    platforms: wayland,x11,surfaceless,device
  API: OpenGL v: 4.6.0 compat-v: 4.5 vendor: amd mesa v: 24.0.5-arch1.1
    renderer: AMD Radeon Graphics (radeonsi renoir LLVM 17.0.6 DRM 3.57
    6.8.7-arch1-1-g14)
  API: Vulkan v: 1.3.279 drivers: nvidia surfaces: xcb,xlib,wayland

I do have nvidia_drm.modeset=1 as mentioned in NVIDIA#DRM_kernel_mode_setting

cat /proc/cmdline
root=UUID=xxx-xxx-xxx-xxx ro rootflags=subvol=@ initrd=initramfs-linux.img nvidia-drm.modeset=1 root=/dev/mapper/Cbtrfs cryptdevice=UUID=xxx-xxx-xxx-xxx:Cbtrfs:allow-discards

and

sudo cat /sys/module/nvidia_drm/parameters/modeset
Y

mesaprotector · 2024-04-18 16:57:01

RaZorr wrote:

I will test this soon. Have a deadline today and was frustrated when the laptop froze during training a neural network model. I was installing btrbk to have snapshots during Pacman updates when this happened.
Edit: Yes I forced shut down using the power button. I have Sysrq enabled in
cat /etc/sysctl.d/99-magicsysrq.conf
kernel.sysrq=1
But forgot it was there.

SysRq will only work on a kernel oops (of course it also works if you have non-kernel-level problems like Xorg freezing). If your kernel panics then SysRq will do nothing. This bug seems capable of causing either an oops or a panic.

Right now I just pray every time I see an LTS update that the 545 driver will continue to build. I have heard that 550.40.07 (available in AUR) is the last version without a bug, and it should still work with the 6.8 kernels. I have not tried it myself.

ngoonee · 2024-04-19 04:36:44

This annoyed me enough that I downgraded linux and nvidia-related to 6.7.6/545.29.06. Staying there until it's resolved.

demensdeum · 2024-04-22 21:04:53

Had to downgrade to 535.171.04, because 550.76 is unusable, even with nvidia_uvm blacklisted

seth · 2024-04-22 21:38:39

https://wiki.archlinux.org/title/Arch_Linux_Archive
You'll need matching nvidia-dkms and nvidia-utils and the lts kernel (the older nvidia drivers won't build unpatched against 6.8 because of GPL restrictions) as well as linux-lts-headers (for dkms)

demensdeum · 2024-04-23 05:41:25

Right now I am using my own kernel (6.8.7), as it's allowed in Arch wiki

seth · 2024-04-23 06:13:23

You don't need the permission of the wiki, but the 535xx and 545xx drivers won't build against 6.8
https://bugs.launchpad.net/ubuntu/+sour … comments/3 and the following comments link debian patches to build the older drivers against the 6.8 kernel.
The general suggestion would be to use the LTS kernel w/ the regular driver packages.

demensdeum · 2024-04-23 06:29:58

Interesting situation, right now I have 6.8.7 kernel and 535.171.04 Nvidia driver, I installed drivers directly from NVIDIA-Linux-x86_64-535.171.04.run file

seth · 2024-04-23 06:40:31

The last version in the ALA is 535.113 - to use a newer subrelease it would have been advisable to use https://wiki.archlinux.org/title/Arch_b … ILD_source and update the PKGBUILD for the new release so you won't have to deal w/ getting the installation out of your system.
You'll also most likely be missing the config files from https://gitlab.archlinux.org/archlinux/ … idia-utils and a blacklist for nouveau this way.

jongeduard · 2024-04-26 19:00:52

OK, now I am also going to post here.
I believe I may be affected by this problem as well, on a Legion 5 Pro laptop (NVIDIA GeForce RTX 3060 Laptop GPU), but the type of panic it involves is still strange.

For a couple of months I have been affected by random panics on shutdown. Often ending with the "Attempt to kill init" message. One person has also placed a screenshot with that on the mentioned Nvidia forum page by seth.
What is trying to kill the init process (systemd)? Does Nvidia run anything which forcibly shuts things down, or does it somehow trigger things to shut down in an incorrect order? I do not yet see the logical connection between things.

For a while now I have also blacklisted the nvidia_uvm module (I use the /etc/modprobe.d/ .conf file approach, and not the kernel command line, but to prove it to be actually disabled I just check it with `lsmod | grep -i nvidia`), but sadly I am still getting the panics every now and then. The frequency of the panics seems somewhat lower maybe.

Also note that I am experiencing the same issue with both kernels linux and linux-lts.

The main scary issue with all this is that I have experienced multiple instances of file system data loss now.
So bad that several folders in the root were gone and I couldn't log in to LightDM and Xfce anymore.
Which is also strange, since I am using Btrfs, which should be a lot more reliable and not suffer from such issues. I don't get it.

But luckily I also make snapshots on this system and I even back these up (which is all still really great with Btrfs), and I wrote scripts for all of it.
So I can quite easily keep recovering things after they get lost, which is good, but it's still beyond frustrating.

If anyone knows more, it would be really amazing of course.

Update 04-05-2024:

For anyone who might be reading this still and still feels somewhat unsure:
I have been using the version 535 from AUR for a while now and the panics have gone away completely. Everything works normal again.

And the very slow poweroffs are gone as well. There wasn't always a panic; most of the time the system did shut down, but with a delay immediately before the actual power off. Now the system shuts down fast again.

So I can fully confirm it to be the solution for now.
(I am in no hurry to switch back to the new 550 version, so I am fine, but I still hope NVIDIA is going to do its homework and test things properly before rolling out newer versions.)

Last edited by jongeduard (2024-05-04 19:28:46)

seth · 2024-04-26 20:09:36

systemd crashes when accessing the cgroupfs, which is a userspace-exposed kernel structure.
The theory is that nvidia somehow, for some reason, compromises that.

The nvidia driver versions before 550xx aren't affected and w/ the LTS kernel you can easily downgrade to 545xx or 535xx from the ALA (using the nvidia-dkms and nvidia-utils packages of the same version; you'll also need linux-lts-headers, not from the ALA but from the regular repo, matching the actual LTS kernel you're running).
The older nvidia drivers in the ALA won't compile against 6.8, but there's also https://aur.archlinux.org/packages/nvidia-535xx-dkms
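
For the ALA route, the package URLs follow the pattern https://archive.archlinux.org/packages/<first letter>/<name>/<name>-<version>-<arch>.pkg.tar.zst. A small sketch that builds them (ala_url is a made-up helper, and the "-1" package release used in the example is an assumption; check the directory listing in the archive for the releases that actually exist):

```shell
# Build the Arch Linux Archive URL for a package.
# $1 = package name, $2 = full version including pkgrel, $3 = arch (default x86_64)
ala_url() {
    pkg=$1 ver=$2 arch=${3:-x86_64}
    # The archive sorts packages into directories by their first letter.
    first=$(printf '%s' "$pkg" | cut -c1)
    printf 'https://archive.archlinux.org/packages/%s/%s/%s-%s-%s.pkg.tar.zst\n' \
        "$first" "$pkg" "$pkg" "$ver" "$arch"
}
```

pacman -U accepts URLs, so the downgrade would look roughly like `sudo pacman -U "$(ala_url nvidia-dkms 545.29.06-1)" "$(ala_url nvidia-utils 545.29.06-1)"`, with both packages at the same version.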

nvann · 2024-04-28 02:48:14

seth wrote:

systemd crashes when accessing the cgroupfs, which is userspace exposed kernel structure

Sorry if it is irrelevant but it isn't an inherent systemd crash, the same issue appears on Artix with runit

seth · 2024-04-28 06:16:55

Do you have a stacktrace and context for that?

The crash is inherent to the kernel (cgroupfs), which will generate an Oops or just halt, triggered by whatever accesses it.
cgroupfs is extremely likely corrupted by the nvidia module(s), pending questions:

1. why?
I understand why nvidia_uvm is interested in messing with that, but not the drm modules.

2. when?
systemd bangs the cgroupfs "enough" when the reexecution hook runs (which would be a systemd peculiarity), and from various reports also just when reloading the daemon; this somehow exposes the corruption while day-to-day business seems "fine" (at least I'm not aware of reports outside these contexts).

mesaprotector · 2024-05-08 20:40:52

A heads up for anyone who's doing what I've done and using the linux-lts kernel; as of today (6.6.30-2) it will no longer build the 545 dkms version. Not sure what I can do now other than attempt to get the 535xx versions from the AUR to work. Or just boot without Nvidia lol.

EDIT: The 535xx-dkms driver from the AUR still builds with both the regular and lts kernel. AUR it is.

Last edited by mesaprotector (2024-05-08 20:58:09)

seth · 2024-05-08 20:43:13

Likely same problem as https://bbs.archlinux.org/viewtopic.php?id=295600 (not yet resolved)
You can add the kernel to the ignore list in pacman.conf for a while - it's the most self-contained package and partially downgrading it is relatively safe … for a while.
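
The ignore list is the IgnorePkg line in the [options] section of /etc/pacman.conf, e.g. "IgnorePkg = linux-lts linux-lts-headers" if the LTS kernel is the one being held back (which packages to list is an assumption about your setup). A quick sketch to see what is currently being held; the function name is made up:

```shell
# List the packages named on active (uncommented) IgnorePkg lines.
# $1 = config file to read (default /etc/pacman.conf)
ignored_packages() {
    awk -F'=' '/^[[:space:]]*IgnorePkg[[:space:]]*=/ {
        n = split($2, p, " ")          # split the value on whitespace
        for (i = 1; i <= n; i++) print p[i]
    }' "${1:-/etc/pacman.conf}"
}
```

Remember to remove the entry again once the build problem is fixed, since a held-back kernel only stays safe for a while.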

rodstu · 2024-05-09 15:23:10

After this issue happened, I've created a function to facilitate downgrading packages, like a restore point.
I don't know if this is the best place to share it, in any case, here it is:

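archlinux-create-downgrade-point () {
        # Packages to snapshot; adjust for your setup (e.g. linux-lts, nvidia-dkms).
        # Requires fd and gdu in addition to pacman.
        local arr=(linux linux-headers nvidia nvidia-utils nvidia-settings)
        local kernel_ver kernel_dt dest x version fullpath
        kernel_ver=$(pacman -Qi linux | awk '/^Version/ {print $NF}')
        # Kernel build date as YYYY-MM-DD; the sed drops the first "-<digits>" token before date parses the string.
        kernel_dt=$(date --date="TZ=\"America/Sao_Paulo\" $(pacman -Qi linux | awk -F': ' '/Build Date/ { print $2 }' | sed 's/-[0-9]\+//')" +"%Y-%m-%d")
        echo "kernel version: ${kernel_ver}"
        dest="${HOME}/Downloads/archlinux/downgrade/${kernel_dt}_kernel-${kernel_ver}"
        echo "making directory: ${dest}"
        mkdir -p "$dest"
        echo ""
        for x in "${arr[@]}"
        do
                version=$(pacman -Qi "$x" | awk '/^Version/ {print $NF}')
                # Find the cached package file for exactly this version.
                fullpath="$(fd "${x}-${version}.*zst$" /var/cache/pacman/pkg/)"
                echo "    $x: $version"
                echo "    ${fullpath}"
                # Skip packages that are no longer in the cache instead of calling cp with an empty path.
                [ -n "$fullpath" ] && cp "$fullpath" "$dest"
                echo ""
        done
        gdu -np "$dest" | sort -k3
}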

Last edited by rodstu (2024-05-09 15:24:24)

seth · 2024-05-09 15:25:03

There's meanwhile a patch for GCC 14

k3yss · 2024-05-12 11:34:57

~tfa wrote:

I too reverted back to 545 and the issue is gone. I did not test my daemon idea, cause after the last update of 550, Unity in a Vulkan context kept crashing upon load. Same for the vulkan-beta drivers.
Guess this thread is solved then, at least the riddle.

Can you tell me how you downgraded correctly to the NVIDIA 545 version? I used the command `sudo downgrade nvidia nvidia-utils nvidia-lts` and selected version 545.29.02 for all three. However, now when I run `nvidia-smi`, I get the error message:

`NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.`

Edit: Do I also need to downgrade the `kernel` and `kernel header` to a specific version for this to work, if yes how do I know the specific version?

Last edited by k3yss (2024-05-12 14:00:33)

mesaprotector · 2024-05-12 16:53:55

545 doesn't work anymore with the current kernel. You need to either also downgrade the kernel (to any 6.7 version for "linux" or 6.6.30-1 and earlier for "linux-lts") or probably more recommended, get the 535 driver from the AUR: https://aur.archlinux.org/packages/nvidia-535xx-dkms . There are even earlier drivers available in AUR which you can use with a patch, if 535xx doesn't work well for you.

Note that if you build a driver from the AUR you need to manually blacklist nouveau as it will no longer be done automatically.
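
A sketch of the manual blacklist, assuming the usual modprobe.d mechanism; the helper and the file name blacklist-nouveau.conf are made up, any *.conf name in /etc/modprobe.d should do:

```shell
# Write a modprobe.d blacklist entry for module $1.
# $2 = target directory (default /etc/modprobe.d, which needs root to write)
write_blacklist() {
    dir=${2:-/etc/modprobe.d}
    printf 'blacklist %s\n' "$1" > "$dir/blacklist-$1.conf"
}
```

Run from a root shell as `write_blacklist nouveau`. If nouveau ends up in your initramfs, regenerating it afterwards (e.g. `mkinitcpio -P`) is probably needed as well.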

Last edited by mesaprotector (2024-05-12 16:54:15)
