#26 2024-04-06 14:52:06

mesaprotector
Member
Registered: 2024-03-03
Posts: 34

Re: [SOLVED] Repeated kernel problems/freezes since 6.7.6

seth wrote:

Update, for anyone affected by this:
If you feel adventurous enough to try the latest nvidia driver again: the problem is increasingly likely to be the nvidia_uvm module.
The module is mostly relevant for CUDA and for using the GPU in containers, but also for https://wiki.archlinux.org/title/NVIDIA … with_NVENC, and it is used by e.g. GIMP for hardware acceleration.

If you critically rely on that, this isn't a viable solution, but it would still be awesome if you could confirm the condition.
AFAICT the module also comes with a ton of parameters, so if you have it around, posting

modinfo nvidia_uvm

might shed further light.

https://wiki.archlinux.org/title/Kernel … and_line_2

module_blacklist=nvidia_uvm

And if anyone here happens to post at https://forums.developer.nvidia.com/t/s … /284772/27 you might want to forward this.
Particularly if you can confirm it.

I'm still back on 545, but I guess if someone is on 550 they can use this for comparison.  (Output of modinfo nvidia_uvm).

filename:       /lib/modules/6.6.25-1-lts/updates/dkms/nvidia-uvm.ko.zst
version:        545.29.06
supported:      external
license:        Dual MIT/GPL
srcversion:     F211ABD912933D70C263150
depends:        nvidia
retpoline:      Y
name:           nvidia_uvm
vermagic:       6.6.25-1-lts SMP preempt mod_unload 
sig_id:         PKCS#7
signer:         DKMS module signing key
sig_key:        41:B7:D0:17:6D:35:50:04:09:82:DE:8C:92:9F:F4:42:61:0E:86:5B
sig_hashalgo:   sha512
signature:      07:67:94:8D:9A:3F:7E:4C:7E:AF:7C:38:B1:4E:84:CC:8A:8F:04:AF:
		05:FD:F3:C4:EE:67:57:25:7D:63:E1:2B:E0:CA:B8:52:BC:6D:C3:E1:
		FB:F1:4E:78:0B:0A:BB:45:BC:14:61:75:7F:0E:5C:8F:86:4D:2D:D9:
		AB:52:99:84:D8:3B:D1:AC:BC:70:EC:7C:5B:6B:D3:8E:96:2C:1D:8C:
		B4:30:98:54:A8:7B:46:AB:A3:28:FB:89:98:9E:17:E8:95:7F:27:CF:
		A9:41:C9:74:3F:F6:B9:57:FD:7D:3F:2D:57:A7:96:71:DA:F3:32:1D:
		13:59:F1:D3:72:1D:27:77:69:88:7D:B3:9D:8B:22:FA:49:3B:A7:4D:
		36:24:CE:54:DC:69:90:AB:11:98:F0:00:E2:2D:8F:BC:AF:CF:62:AA:
		CD:9D:59:AA:E4:00:89:4D:65:0B:9C:28:44:8F:C3:41:BB:EB:09:B1:
		80:49:29:B7:E7:E7:6B:A1:58:31:98:A8:32:F4:1E:71:84:70:3B:D3:
		0E:52:62:26:F3:C4:B7:12:0C:1B:22:43:F9:28:55:B2:AC:52:51:43:
		40:97:74:67:A2:05:5A:52:3D:D6:34:70:0F:8D:04:9A:18:4F:AC:9B:
		37:1E:76:96:6E:EE:03:CE:B3:B4:B3:09:65:35:77:F2
parm:           uvm_ats_mode:Set to 0 to disable ATS (Address Translation Services). Any other value is ignored. Has no effect unless the platform supports ATS. (int)
parm:           uvm_perf_prefetch_enable:uint
parm:           uvm_perf_prefetch_threshold:uint
parm:           uvm_perf_prefetch_min_faults:uint
parm:           uvm_perf_thrashing_enable:uint
parm:           uvm_perf_thrashing_threshold:uint
parm:           uvm_perf_thrashing_pin_threshold:uint
parm:           uvm_perf_thrashing_lapse_usec:uint
parm:           uvm_perf_thrashing_nap:uint
parm:           uvm_perf_thrashing_epoch:uint
parm:           uvm_perf_thrashing_pin:uint
parm:           uvm_perf_thrashing_max_resets:uint
parm:           uvm_perf_map_remote_on_native_atomics_fault:uint
parm:           uvm_disable_hmm:Force-disable HMM functionality in the UVM driver. Default: false (HMM is enabled if possible). However, even with uvm_disable_hmm=false, HMM will not be enabled if is not supported in this driver build configuration, or if ATS settings conflict with HMM. (bool)
parm:           uvm_perf_migrate_cpu_preunmap_enable:int
parm:           uvm_perf_migrate_cpu_preunmap_block_order:uint
parm:           uvm_global_oversubscription:Enable (1) or disable (0) global oversubscription support. (int)
parm:           uvm_perf_pma_batch_nonpinned_order:uint
parm:           uvm_cpu_chunk_allocation_sizes:OR'ed value of all CPU chunk allocation sizes. (uint)
parm:           uvm_leak_checker:Enable uvm memory leak checking. 0 = disabled, 1 = count total bytes allocated and freed, 2 = per-allocation origin tracking. (int)
parm:           uvm_force_prefetch_fault_support:uint
parm:           uvm_debug_enable_push_desc:Enable push description tracking (uint)
parm:           uvm_debug_enable_push_acquire_info:Enable push acquire information tracking (uint)
parm:           uvm_page_table_location:Set the location for UVM-allocated page tables. Choices are: vid, sys. (charp)
parm:           uvm_perf_access_counter_mimc_migration_enable:Whether MIMC access counters will trigger migrations.Valid values: <= -1 (default policy), 0 (off), >= 1 (on) (int)
parm:           uvm_perf_access_counter_momc_migration_enable:Whether MOMC access counters will trigger migrations.Valid values: <= -1 (default policy), 0 (off), >= 1 (on) (int)
parm:           uvm_perf_access_counter_batch_count:uint
parm:           uvm_perf_access_counter_granularity:Size of the physical memory region tracked by each counter. Valid values asof Volta: 64k, 2m, 16m, 16g (charp)
parm:           uvm_perf_access_counter_threshold:Number of remote accesses on a region required to trigger a notification.Valid values: [1, 65535] (uint)
parm:           uvm_perf_reenable_prefetch_faults_lapse_msec:uint
parm:           uvm_perf_fault_batch_count:uint
parm:           uvm_perf_fault_replay_policy:uint
parm:           uvm_perf_fault_replay_update_put_ratio:uint
parm:           uvm_perf_fault_max_batches_per_service:uint
parm:           uvm_perf_fault_max_throttle_per_service:uint
parm:           uvm_perf_fault_coalesce:uint
parm:           uvm_fault_force_sysmem:Force (1) using sysmem storage for pages that faulted. Default: 0. (int)
parm:           uvm_perf_map_remote_on_eviction:int
parm:           uvm_exp_gpu_cache_peermem:Force caching for mappings to peer memory. This is an experimental parameter that may cause correctness issues if used. (uint)
parm:           uvm_exp_gpu_cache_sysmem:Force caching for mappings to system memory. This is an experimental parameter that may cause correctness issues if used. (uint)
parm:           uvm_downgrade_force_membar_sys:Force all TLB invalidation downgrades to use MEMBAR_SYS (uint)
parm:           uvm_channel_num_gpfifo_entries:uint
parm:           uvm_channel_gpfifo_loc:charp
parm:           uvm_channel_gpput_loc:charp
parm:           uvm_channel_pushbuffer_loc:charp
parm:           uvm_enable_va_space_mm:Set to 0 to disable UVM from using mmu_notifiers to create an association between a UVM VA space and a process. This will also disable pageable memory access via either ATS or HMM. (int)
parm:           uvm_enable_debug_procfs:Enable debug procfs entries in /proc/driver/nvidia-uvm (int)

Last edited by mesaprotector (2024-04-06 14:53:03)

Offline

#27 2024-04-06 15:01:10

ZulluBalti
Member
Registered: 2021-04-07
Posts: 62

Re: [SOLVED] Repeated kernel problems/freezes since 6.7.6

mesaprotector wrote:

I'm still back on 545, but I guess if someone is on 550 they can use this for comparison.  (Output of modinfo nvidia_uvm).

Here's the output of

modinfo nvidia-uvm
filename:       /lib/modules/6.8.2-arch2-1/extramodules/nvidia-uvm.ko.xz
version:        550.67
supported:      external
license:        Dual MIT/GPL
srcversion:     E8BAEAF83C32EBD2D30C349
depends:        nvidia
retpoline:      Y
name:           nvidia_uvm
vermagic:       6.8.2-arch2-1 SMP preempt mod_unload 
parm:           uvm_ats_mode:Set to 0 to disable ATS (Address Translation Services). Any other value is ignored. Has no effect unless the platform supports ATS. (int)
parm:           uvm_perf_prefetch_enable:uint
parm:           uvm_perf_prefetch_threshold:uint
parm:           uvm_perf_prefetch_min_faults:uint
parm:           uvm_perf_thrashing_enable:uint
parm:           uvm_perf_thrashing_threshold:uint
parm:           uvm_perf_thrashing_pin_threshold:uint
parm:           uvm_perf_thrashing_lapse_usec:uint
parm:           uvm_perf_thrashing_nap:uint
parm:           uvm_perf_thrashing_epoch:uint
parm:           uvm_perf_thrashing_pin:uint
parm:           uvm_perf_thrashing_max_resets:uint
parm:           uvm_perf_map_remote_on_native_atomics_fault:uint
parm:           uvm_disable_hmm:Force-disable HMM functionality in the UVM driver. Default: false (HMM is enabled if possible). However, even with uvm_disable_hmm=false, HMM will not be enabled if is not supported in this driver build configuration, or if ATS settings conflict with HMM. (bool)
parm:           uvm_perf_migrate_cpu_preunmap_enable:int
parm:           uvm_perf_migrate_cpu_preunmap_block_order:uint
parm:           uvm_global_oversubscription:Enable (1) or disable (0) global oversubscription support. (int)
parm:           uvm_perf_pma_batch_nonpinned_order:uint
parm:           uvm_cpu_chunk_allocation_sizes:OR'ed value of all CPU chunk allocation sizes. (uint)
parm:           uvm_leak_checker:Enable uvm memory leak checking. 0 = disabled, 1 = count total bytes allocated and freed, 2 = per-allocation origin tracking. (int)
parm:           uvm_force_prefetch_fault_support:uint
parm:           uvm_debug_enable_push_desc:Enable push description tracking (uint)
parm:           uvm_debug_enable_push_acquire_info:Enable push acquire information tracking (uint)
parm:           uvm_page_table_location:Set the location for UVM-allocated page tables. Choices are: vid, sys. (charp)
parm:           uvm_perf_access_counter_mimc_migration_enable:Whether MIMC access counters will trigger migrations.Valid values: <= -1 (default policy), 0 (off), >= 1 (on) (int)
parm:           uvm_perf_access_counter_momc_migration_enable:Whether MOMC access counters will trigger migrations.Valid values: <= -1 (default policy), 0 (off), >= 1 (on) (int)
parm:           uvm_perf_access_counter_batch_count:uint
parm:           uvm_perf_access_counter_threshold:Number of remote accesses on a region required to trigger a notification.Valid values: [1, 65535] (uint)
parm:           uvm_perf_reenable_prefetch_faults_lapse_msec:uint
parm:           uvm_perf_fault_batch_count:uint
parm:           uvm_perf_fault_replay_policy:uint
parm:           uvm_perf_fault_replay_update_put_ratio:uint
parm:           uvm_perf_fault_max_batches_per_service:uint
parm:           uvm_perf_fault_max_throttle_per_service:uint
parm:           uvm_perf_fault_coalesce:uint
parm:           uvm_fault_force_sysmem:Force (1) using sysmem storage for pages that faulted. Default: 0. (int)
parm:           uvm_perf_map_remote_on_eviction:int
parm:           uvm_block_cpu_to_cpu_copy_with_ce:Use GPU CEs for CPU-to-CPU migrations. (int)
parm:           uvm_exp_gpu_cache_peermem:Force caching for mappings to peer memory. This is an experimental parameter that may cause correctness issues if used. (uint)
parm:           uvm_exp_gpu_cache_sysmem:Force caching for mappings to system memory. This is an experimental parameter that may cause correctness issues if used. (uint)
parm:           uvm_downgrade_force_membar_sys:Force all TLB invalidation downgrades to use MEMBAR_SYS (uint)
parm:           uvm_channel_num_gpfifo_entries:uint
parm:           uvm_channel_gpfifo_loc:charp
parm:           uvm_channel_gpput_loc:charp
parm:           uvm_channel_pushbuffer_loc:charp
parm:           uvm_enable_va_space_mm:Set to 0 to disable UVM from using mmu_notifiers to create an association between a UVM VA space and a process. This will also disable pageable memory access via either ATS or HMM. (int)
parm:           uvm_enable_debug_procfs:Enable debug procfs entries in /proc/driver/nvidia-uvm (int)
parm:           uvm_peer_copy:Choose the addressing mode for peer copying, options: phys [default] or virt. Valid for Ampere+ GPUs. (charp)
parm:           uvm_debug_prints:Enable uvm debug prints. (int)
parm:           uvm_enable_builtin_tests:Enable the UVM built-in tests. (This is a security risk) (int)
parm:           uvm_release_asserts:Enable uvm asserts included in release builds. (int)
parm:           uvm_release_asserts_dump_stack:dump_stack() on failed UVM release asserts. (int)
parm:           uvm_release_asserts_set_global_error:Set UVM global fatal error on failed release asserts. (int)
parm:           uvm_conf_computing_channel_iv_rotation_limit:ulong
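The two dumps are long enough that differences are easy to miss by eye. A minimal shell sketch for comparing them, assuming each `modinfo nvidia_uvm` output was saved to a file first (the file names and the function names `parm_names` and `parm_diff` are made up for illustration):

```shell
# Print the sorted, unique parameter names from a saved `modinfo` dump.
# $1: path to a file containing modinfo output
parm_names() {
    awk '$1 == "parm:" { split($2, a, ":"); print a[1] }' "$1" | LC_ALL=C sort -u
}

# List parameters that appear in only one of two dumps.
# Unindented lines exist only in $1, tab-indented lines only in $2.
parm_diff() {
    old=$(mktemp) new=$(mktemp)
    parm_names "$1" > "$old"
    parm_names "$2" > "$new"
    LC_ALL=C comm -3 "$old" "$new"
    rm -f "$old" "$new"
}

# Usage (hypothetical file names):
#   modinfo nvidia_uvm > uvm-545.txt   # on the 545 system
#   modinfo nvidia_uvm > uvm-550.txt   # on the 550 system
#   parm_diff uvm-545.txt uvm-550.txt
```

For the two dumps posted in this thread, this should for example list uvm_perf_access_counter_granularity only on the 545 side and uvm_peer_copy only on the 550 side.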

Offline

#28 2024-04-06 20:50:07

seth
Member
Registered: 2012-09-03
Posts: 51,811

Re: [SOLVED] Repeated kernel problems/freezes since 6.7.6

Unfortunately there's nothing that blatantly says "uvm_please_leave_cgroups_alone" :(
Maybe somebody on the nvidia forum knows more about that.

Online

#29 2024-04-18 02:43:03

RaZorr
Member
Registered: 2021-12-02
Posts: 42

Re: [SOLVED] Repeated kernel problems/freezes since 6.7.6

Unfortunately facing the same issue.

My system: Legion 5 15ACH6H, AMD Ryzen 7 5800H with Radeon iGPU and NVIDIA RTX 3060

For me the freezes happen during:

1. Updating the system or installing a package, when it reaches "Reloading system manager configuration". This happened during a kernel update two days ago and left the system in an unbootable state. I had to update using the Arch ISO on a USB stick.

2. Shutting down the system. The system just freezes without ever shutting down.

Journalctl logs from the boot that crashed: http://0x0.st/XoKh.txt

EDIT: I had set up the udev rules and the module parameter as per the "fully power down the GPU when not in use" instructions. I don't know whether that is relevant.

Last edited by RaZorr (2024-04-18 02:49:13)

Offline

#30 2024-04-18 08:12:01

seth
Member
Registered: 2012-09-03
Posts: 51,811

Re: [SOLVED] Repeated kernel problems/freezes since 6.7.6

There are three "Bad page map" entries, possibly because of zswap.
But the journal concludes with

Apr 18 02:46:34 archsys systemd[1]: Starting Hostname Service...
-- Boot cef218bc495e4d460000000000000000 --

because you likely pushed the power button, see https://wiki.archlinux.org/title/Keyboard_shortcuts#Kernel_(SysRq)


Can you influence this by
a) blacklisting nvidia_uvm
b) running

sudo touch /etc/systemd/do-not-udevadm-trigger-on-update

(see https://bugs.archlinux.org/task/77789)?

Online

#31 2024-04-18 09:41:22

RaZorr
Member
Registered: 2021-12-02
Posts: 42

Re: [SOLVED] Repeated kernel problems/freezes since 6.7.6

I will test this soon. I have a deadline today and was frustrated when the laptop froze while training a neural network model. I was installing btrbk to have snapshots during pacman updates when this happened.

Edit: Yes, I forced a shutdown using the power button. I have SysRq enabled in

cat /etc/sysctl.d/99-magicsysrq.conf
kernel.sysrq=1

But forgot it was there.
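Whether a given SysRq function is permitted depends on the value of kernel.sysrq: 0 disables everything, 1 enables everything, and larger values are a bitmask (per the kernel's sysrq documentation, 16 is sync, 32 is remount read-only, 128 is reboot/poweroff). A small sketch for checking a value; the function name `sysrq_allows` is made up here:

```shell
# Return success if a kernel.sysrq value permits the function with the given bit.
# $1: kernel.sysrq value, $2: function bit (e.g. 16 sync, 32 remount ro, 128 reboot)
sysrq_allows() {
    [ "$1" -eq 1 ] && return 0      # 1 enables all functions
    [ $(( $1 & $2 )) -ne 0 ]        # otherwise test the bitmask
}

# Usage: check the running kernel's setting for the reboot function
#   sysrq_allows "$(cat /proc/sys/kernel/sysrq)" 128 && echo "reboot via SysRq is allowed"
```

With kernel.sysrq=1 as in the file above, every function bit passes the check.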

Last edited by RaZorr (2024-04-18 09:58:32)

Offline

#32 2024-04-18 16:09:33

RaZorr
Member
Registered: 2021-12-02
Posts: 42

Re: [SOLVED] Repeated kernel problems/freezes since 6.7.6

I don't know if this is relevant, but I saw a difference between my output of inxi -G and that of a user on reddit who doesn't have this issue. For me (first output below) the modesetting driver is unloaded, whereas they (second output below) have it loaded.
My output:

inxi -G
Graphics:
  Device-1: NVIDIA GA106M [GeForce RTX 3060 Mobile / Max-Q] driver: nvidia
    v: 550.67
  Device-2: AMD Cezanne [Radeon Vega Series / Radeon Mobile Series]
    driver: amdgpu v: kernel
  Device-3: Syntek Integrated Camera driver: uvcvideo type: USB
  Display: x11 server: X.Org v: 21.1.13 driver: X: loaded: amdgpu,nvidia
    unloaded: modesetting dri: radeonsi gpu: amdgpu resolution: 1920x1080~165Hz
  API: EGL v: 1.5 drivers: kms_swrast,nvidia,radeonsi,swrast
    platforms: gbm,x11,surfaceless,device
  API: OpenGL v: 4.6.0 compat-v: 4.5 vendor: amd mesa v: 24.0.5-arch1.1
    renderer: AMD Radeon Graphics (radeonsi renoir LLVM 17.0.6 DRM 3.57
    6.8.5-arch1-1)

Their output:

Graphics:
  Device-1: NVIDIA TU106M [GeForce RTX 2060 Max-Q] driver: nvidia v: 550.67
  Device-2: AMD Renoir [Radeon RX Vega 6 ] driver: amdgpu v: kernel
  Display: wayland server: X.org v: 1.21.1.13 with: Xwayland v: 23.2.6
    compositor: Hyprland v: 0.39.1-1-ge8e02e81 driver: X:
    loaded: modesetting,nvidia gpu: amdgpu resolution: 1920x1080~120Hz
  API: EGL v: 1.5 drivers: nvidia,radeonsi,swrast
    platforms: wayland,x11,surfaceless,device
  API: OpenGL v: 4.6.0 compat-v: 4.5 vendor: amd mesa v: 24.0.5-arch1.1
    renderer: AMD Radeon Graphics (radeonsi renoir LLVM 17.0.6 DRM 3.57
    6.8.7-arch1-1-g14)
  API: Vulkan v: 1.3.279 drivers: nvidia surfaces: xcb,xlib,wayland

I do have nvidia_drm.modeset=1 as mentioned in NVIDIA#DRM_kernel_mode_setting

cat /proc/cmdline
root=UUID=xxx-xxx-xxx-xxx ro rootflags=subvol=@ initrd=initramfs-linux.img nvidia-drm.modeset=1 root=/dev/mapper/Cbtrfs cryptdevice=UUID=xxx-xxx-xxx-xxx:Cbtrfs:allow-discards

and

sudo cat /sys/module/nvidia_drm/parameters/modeset
Y
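The command line above spells the option nvidia-drm.modeset=1 while the wiki writes nvidia_drm.modeset=1; the kernel treats dashes and underscores in parameter names as equivalent, which is why the sysfs value still reads Y. A sketch of a check that applies the same normalization; `cmdline_has` is a made-up helper name:

```shell
# Return success if a kernel command line contains the given name=value option,
# treating '-' and '_' in the option name as equivalent.
# $1: option to look for (e.g. nvidia_drm.modeset=1), $2: command line string
cmdline_has() {
    want_name=$(printf '%s' "${1%%=*}" | tr '-' '_')
    want_val=${1#*=}
    for tok in $2; do               # unquoted on purpose: split on whitespace
        name=$(printf '%s' "${tok%%=*}" | tr '-' '_')
        [ "$name" = "$want_name" ] && [ "${tok#*=}" = "$want_val" ] && return 0
    done
    return 1
}

# Usage:
#   cmdline_has nvidia_drm.modeset=1 "$(cat /proc/cmdline)" && echo "modeset requested"
```

Only the option name is normalized; the value after the first "=" is compared as written.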

Offline

#33 2024-04-18 16:57:01

mesaprotector
Member
Registered: 2024-03-03
Posts: 34

Re: [SOLVED] Repeated kernel problems/freezes since 6.7.6

RaZorr wrote:

I will test this soon. I have a deadline today and was frustrated when the laptop froze while training a neural network model. I was installing btrbk to have snapshots during pacman updates when this happened.

Edit: Yes, I forced a shutdown using the power button. I have SysRq enabled in

cat /etc/sysctl.d/99-magicsysrq.conf
kernel.sysrq=1

But forgot it was there.

SysRq will only work on a kernel oops (of course it also works if you have non-kernel-level problems like Xorg freezing). If your kernel panics then SysRq will do nothing. This bug seems capable of causing either an oops or a panic.

Right now I just pray every time I see an LTS update that the 545 driver will continue to build. I have heard that 550.40.07 (available in the AUR) is the last version without the bug, and it should still work with the 6.8 kernels. I have not tried it myself.

Offline

#34 2024-04-19 04:36:44

ngoonee
Forum Fellow
From: Between Thailand and Singapore
Registered: 2009-03-17
Posts: 7,356

Re: [SOLVED] Repeated kernel problems/freezes since 6.7.6

This annoyed me enough that I downgraded linux and the nvidia-related packages to 6.7.6/545.29.06. Staying there until it's resolved.


Allan-Volunteer on the (topic being discussed) mailing lists. You never get the people who matters attention on the forums.
jasonwryan-Installing Arch is a measure of your literacy. Maintaining Arch is a measure of your diligence. Contributing to Arch is a measure of your competence.
Griemak-Bleeding edge, not bleeding flat. Edge denotes falls will occur from time to time. Bring your own parachute.

Offline

#35 2024-04-22 21:04:53

demensdeum
Member
From: Hyperborea
Registered: 2022-07-02
Posts: 40
Website

Re: [SOLVED] Repeated kernel problems/freezes since 6.7.6

I had to downgrade to 535.171.04, because 550.76 is unusable, even with nvidia_uvm blacklisted.

Offline

#36 2024-04-22 21:38:39

seth
Member
Registered: 2012-09-03
Posts: 51,811

Re: [SOLVED] Repeated kernel problems/freezes since 6.7.6

https://wiki.archlinux.org/title/Arch_Linux_Archive
You'll need matching nvidia-dkms and nvidia-utils packages and the LTS kernel (the older nvidia drivers won't build unpatched against 6.8 because of GPL restrictions), as well as linux-lts-headers (for dkms).

Online

#37 2024-04-23 05:41:25

demensdeum
Member
From: Hyperborea
Registered: 2022-07-02
Posts: 40
Website

Re: [SOLVED] Repeated kernel problems/freezes since 6.7.6

Right now I am using my own kernel (6.8.7), as is allowed by the Arch wiki.

Offline

#38 2024-04-23 06:13:23

seth
Member
Registered: 2012-09-03
Posts: 51,811

Re: [SOLVED] Repeated kernel problems/freezes since 6.7.6

You don't need the permission of the wiki, but the 535xx and 545xx drivers won't build against 6.8.
https://bugs.launchpad.net/ubuntu/+sour … comments/3 and the following comments link Debian patches to build the older drivers against the 6.8 kernel.
The general suggestion would be to use the LTS kernel w/ the regular driver packages.

Online

#39 2024-04-23 06:29:58

demensdeum
Member
From: Hyperborea
Registered: 2022-07-02
Posts: 40
Website

Re: [SOLVED] Repeated kernel problems/freezes since 6.7.6

Interesting situation: right now I have the 6.8.7 kernel and the 535.171.04 Nvidia driver. I installed the driver directly from the NVIDIA-Linux-x86_64-535.171.04.run file.

Offline

#40 2024-04-23 06:40:31

seth
Member
Registered: 2012-09-03
Posts: 51,811

Re: [SOLVED] Repeated kernel problems/freezes since 6.7.6

The last version in the ALA is 535.113 - to use a newer subrelease it would have been advisable to use https://wiki.archlinux.org/title/Arch_b … ILD_source and update the PKGBUILD for the new release so you won't have to deal w/ getting the installation out of your system.
You'll also most likely be missing the config files from https://gitlab.archlinux.org/archlinux/ … idia-utils and a blacklist for nouveau this way.

Online

#41 2024-04-26 19:00:52

jongeduard
Member
Registered: 2022-01-06
Posts: 8

Re: [SOLVED] Repeated kernel problems/freezes since 6.7.6

OK, now I am also going to post here.
I believe that I may also be affected by this problem on a Legion 5 Pro laptop (NVIDIA GeForce RTX 3060 Laptop GPU), but the type of panic it involves is still strange.

For a couple of months I have been affected by random panics on shutdown, often ending with the "Attempt to kill init" message. One person has also posted a screenshot of that on the Nvidia forum page mentioned by seth.
What is trying to kill the init process (systemd)? Does Nvidia run anything which forcibly shuts things down, or does it somehow trigger things to shut down in an incorrect order? I do not yet see the logical connection between things.

For a while now I have also blacklisted the nvidia_uvm module (I use the /etc/modprobe.d/ .conf file approach rather than the kernel command line, and to prove that it is actually disabled I check with `lsmod | grep -i nvidia`), but sadly I am still getting the panics every now and then. The frequency of the panics seems maybe somewhat lower.
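One caveat with the modprobe.d approach: a plain `blacklist` line only stops alias-based automatic loading, so something that requests the module by name can still load it, whereas module_blacklist= on the kernel command line blocks it outright. Checking the list of loaded modules, as described above, is therefore the right test. A sketch of that check as a reusable function (the name `module_loaded` is made up; it reads /proc/modules, the list that lsmod is based on):

```shell
# Return success if the named module appears in /proc/modules-style input.
# $1: module name (with underscores, as the kernel lists it)
# $2: file to read; defaults to /proc/modules
module_loaded() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }' "${2:-/proc/modules}"
}

# Usage:
#   module_loaded nvidia_uvm && echo "nvidia_uvm is still loaded"
```

Unlike `grep -i nvidia`, this matches the module name exactly, so nvidia, nvidia_drm and nvidia_modeset do not count as hits for nvidia_uvm.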

Also note that I am experiencing the same issue with both kernels linux and linux-lts.

The main scary issue with all this is that I have now experienced multiple instances of file system data loss.
It was so bad that several folders in the root were gone and I couldn't log in to LightDM and Xfce anymore.
This is also strange, since I am using Btrfs, which should be a lot more reliable and not suffer from such issues. I don't get it.

But luckily I also make snapshots on this system and I even back these up (which is all still really great with Btrfs), and I wrote scripts for all of it.
So I can quite easily keep recovering things after they get lost, which is good, but it's still beyond frustrating.

If anyone knows more, it would be really amazing of course.


Update 04-05-2024:

For anyone who might be reading this still and still feels somewhat unsure:
I have been using version 535 from the AUR for a while now and the panics have gone away completely. Everything works normally again.

And the very slow poweroffs are gone as well. There wasn't always a panic; most of the time the system did shut down, but with a delay immediately before the actual power off. Now the system shuts down fast again.

So I can fully confirm it to be the solution for now.
(I am in no hurry to switch back to the new 550 version, so I am fine, but I still hope NVIDIA will do their homework and test things properly before rolling out newer versions.)

Last edited by jongeduard (2024-05-04 19:28:46)

Offline

#42 2024-04-26 20:09:36

seth
Member
Registered: 2012-09-03
Posts: 51,811

Re: [SOLVED] Repeated kernel problems/freezes since 6.7.6

systemd crashes when accessing the cgroupfs, which is a kernel structure exposed to userspace.
The theory is that nvidia somehow, for some reason, compromises that.

The nvidia driver versions before 550xx aren't affected, and w/ the LTS kernel you can easily downgrade to 545xx or 535xx from the ALA (using the nvidia-dkms and nvidia-utils packages of the same version; you'll also need linux-lts-headers, not from the ALA but from the regular repo, matching the actual LTS kernel you're running).
The older nvidia drivers in the ALA won't compile against 6.8, but there's also https://aur.archlinux.org/packages/nvidia-535xx-dkms
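To see at a glance which side of that line an installed driver falls on, the major version can be compared against 550. This is only a sketch based on the reports in this thread (it assumes everything from 550 upward may be affected, which a later fixed release would invalidate), and `nvidia_maybe_affected` is a made-up name:

```shell
# Return success if a driver version string belongs to the 550 series or later.
# $1: version string such as "550.67" or "545.29.06"
nvidia_maybe_affected() {
    major=${1%%.*}                  # keep everything before the first dot
    [ "$major" -ge 550 ]
}

# Usage (modinfo -F prints a single field of the module info):
#   nvidia_maybe_affected "$(modinfo -F version nvidia)" && echo "550+ driver installed"
```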

Online

#43 2024-04-28 02:48:14

nvann
Member
Registered: 2024-04-04
Posts: 6

Re: [SOLVED] Repeated kernel problems/freezes since 6.7.6

seth wrote:

systemd crashes when accessing the cgroupfs, which is a kernel structure exposed to userspace

Sorry if this is irrelevant, but it isn't an inherent systemd crash: the same issue appears on Artix with runit.

Offline

#44 2024-04-28 06:16:55

seth
Member
Registered: 2012-09-03
Posts: 51,811

Re: [SOLVED] Repeated kernel problems/freezes since 6.7.6

Do you have a stacktrace and context for that?

The crash is inherent to the kernel (cgroupfs), which will generate an oops or just halt, triggered by whatever accesses it.
cgroupfs is extremely likely corrupted by the nvidia module(s). Pending questions:

1. why?
I understand why nvidia_uvm is interested in messing with that, but not the drm modules.

2. when?
systemd bangs the cgroupfs "enough" when the reexecution hook runs (which would be a systemd peculiarity); according to various reports also just when reloading the daemon. This somehow exposes the corruption, while day-to-day business seems "fine" (at least I'm not aware of reports outside these contexts).

Online
