The system hung at "upgrading systemd" and I couldn't kill the terminal or change VT or do anything with the keyboard, although I think the cursor could still move.
This is the second time it has happened this month. This time I just turned off the laptop and went to bed, so perhaps there is some useful info I can extract before I start trying to fix it on my own, in case the problem isn't already well known?
Nvidia + Wayland + custom kernel (Asus ROG G14).
Recently the laptop has crashed several times during boot with a black screen and lit keyboard LEDs which are normally off. It might have mentioned something about Nvidia but I'm not sure. Each time, fsck would run after rebooting and Wayland would start normally again.
The last time pacman crashed like this, I spent a weekend backing up my stuff over rsync because none of the pacman rescue techniques I could find would let me install anything. I know it sounds handwavy, but I tried a lot of things: the database was corrupted, the signing keys were broken, openssh didn't work, shared libraries were missing, there was no space left on the partition, and nothing would run. Also I have a newborn and my brain is mush.
Update: the boot entry disappeared because the linux-g14 kernel had been deleted due to outdated keys. That was not really related to the crash and the broken pacman state.
Last edited by losipai (2024-05-01 13:27:37)
Offline
From installer:
arch-chroot /mnt
chroot: failed to run command /bin/bash: Input/output error
pacstrap -K /mnt bash
==> Creating install root at /mnt
==> Installing packages to /mnt
:: Synchronizing package databases...
error: failed to synchronize all databases (unable to lock database)
==> ERROR: Failed to install packages to new root
rm /var/lib/pacman/db.lck
pacstrap /mnt bash
# OK, installed it
arch-chroot /mnt
chroot: failed to run command /bin/bash: Input/output error
mount
/dev/nvme0n1p2 on /mnt type ext4 (rw,relatime)
/dev/nvme0n1p1 on /mnt/boot type vfat (rw, # rest omitted)
# after unmounting
fsck -f -y /dev/nvme0n1p2
fsck -f -y /dev/nvme0n1p1
nvme smart-log /dev/nvme0n1p2
# only including error info below:
critical_warning : 0
media_errors : 0
num_err_log_entries : 1
# chroot still gives Input/output error
pacman -r /mnt -Qnq | pacman -r /mnt -Syu - --cachedir /mnt/var/cache/pacman/pkg --dbpath /mnt/var/lib/pacman --gpgdir /mnt/etc/pacman.d/gnupg
# dependency resolution fails
pacman -r /mnt -Rs <offending packages> - --cachedir /mnt/var/cache/pacman/pkg --dbpath /mnt/var/lib/pacman --gpgdir /mnt/etc/pacman.d/gnupg
(1/2) removing xxx
(2/2) removing yyy
call to execv failed (Exec format error)
error: command failed to execute correctly
# trying again...
pacman -r /mnt -Qnq | pacman -r /mnt -Syu - --cachedir /mnt/var/cache/pacman/pkg --dbpath /mnt/var/lib/pacman --gpgdir /mnt/etc/pacman.d/gnupg
<package>: xxx exists in filesystem # for every package
Errors occurred, no packages were upgraded.
pacman -r /mnt -Qnq | pacman -r /mnt -Syu - --cachedir /mnt/var/cache/pacman/pkg --dbpath /mnt/var/lib/pacman --gpgdir /mnt/etc/pacman.d/gnupg --overwrite="*"
# 789 packages reinstalled, but with several errors that scroll by too quickly for me to read
mount --bind /proc /mnt/proc
mount -o bind /dev /mnt/dev
pacman -r /mnt -Qnq | pacman -r /mnt -Syu - --cachedir /mnt/var/cache/pacman/pkg --dbpath /mnt/var/lib/pacman --gpgdir /mnt/etc/pacman.d/gnupg --overwrite="*" | tee -a pacman_log.txt
# arch-chroot /mnt now works!
# downgrade nvidia stuff to 535xx
# fix Asus G14 repo keys and reinstall kernel which had been deleted
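The sequence that finally worked above can be condensed into a short sketch. It is only a summary of the steps in this post, not a general recovery recipe: the helper names pacman_root_opts and print_recovery_plan are made up for illustration, /mnt is assumed to hold the broken root, and the script only prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch only: prints the recovery commands from this post, runs nothing.
# pacman_root_opts and print_recovery_plan are hypothetical helper names.

# Emit the pacman options that point every path at the broken root.
pacman_root_opts() (
    root="${1:-/mnt}"
    printf '%s\n' "-r $root --cachedir $root/var/cache/pacman/pkg --dbpath $root/var/lib/pacman --gpgdir $root/etc/pacman.d/gnupg"
)

# Print the steps in the order used in the run that finally worked.
print_recovery_plan() (
    root="${1:-/mnt}"
    opts="$(pacman_root_opts "$root")"
    # Bind mounts came first in the successful attempt.
    echo "mount --bind /proc $root/proc"
    echo "mount --bind /dev $root/dev"
    # Reinstall every native package, overwriting files already on disk.
    echo "pacman -r $root -Qnq | pacman $opts -Syu - --overwrite='*' | tee -a pacman_log.txt"
    echo "arch-chroot $root"
)
```

Running print_recovery_plan /mnt from the installer shows the four commands so they can be reviewed before typing them by hand.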
Last edited by losipai (2024-05-01 13:13:42)
Offline
https://bbs.archlinux.org/viewtopic.php?id=293400
https://aur.archlinux.org/packages/nvidia-535xx-dkms
https://wiki.archlinux.org/title/Pacman … an_upgrade
https://bbs.archlinux.org/viewtopic.php … 6#p2168066
error: failed to synchronize all databases (unable to lock database)
https://bbs.archlinux.org/viewtopic.php … 2#p2168282
To just re-install all packages
pacman --root /mnt --cachedir /mnt/var/cache/pacman/pkg -S --dbonly $(pacman --root /mnt -Qnq)
pacman --root /mnt --cachedir /mnt/var/cache/pacman/pkg -S $(pacman --root /mnt -Qnq)
If there are persistent I/O errors, check dmesg to make sure the device isn't just genuinely broken.
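A rough way to do that dmesg check is sketched below. The pattern list is an assumption about which kernel messages are relevant (block layer, ext4, NVMe), not an exhaustive filter, and io_error_lines is a made-up helper name.

```shell
#!/bin/sh
# Sketch: filter kernel log text (stdin) for lines that usually point at storage trouble.
# io_error_lines is a hypothetical helper; the patterns are assumptions, not a complete list.
io_error_lines() {
    grep -Ei 'I/O error|EXT4-fs error|nvme[0-9a-z]*: .*(timeout|reset)|critical medium error'
}

# Usage (as root):
#   dmesg | io_error_lines
```

No output suggests nothing matched these particular patterns. It does not prove the drive is healthy.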
Also I have a newborn and my brain is mush.
I'd tell you to put some beer into the milk, but apparently that's "problematic" and I "should be cancelled" and "get help"
Offline
Thank you seth, I'm able to boot again after the above steps!
I downgraded to nvidia-535xx as mentioned in the links. Hyprland won't start with nVidia anymore so it's running on the iGPU - who knows what I broke in the process.
I'd tell you to put some beer into the milk, but apparently that's "problematic" and I "should be cancelled" and "get help"
Sounds about as safe as using Nvidia drivers!
Last edited by losipai (2024-05-01 13:23:46)
Offline
Did the dkms build fail?
dkms status
lspci -k
cat /proc/cmdline
pacman -Qs kernel
nvidia-smi # does your GPU respond?
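To collect all of the above in one go, something like the following sketch could be used. run_labeled and gpu_report are hypothetical helper names. The commands themselves are exactly the ones listed, and a tool that is not installed is reported instead of aborting the run.

```shell
#!/bin/sh
# Sketch: run each diagnostic if the tool exists, and label its output.
# run_labeled and gpu_report are hypothetical helper names.
run_labeled() {
    echo "### $*"
    if command -v "$1" >/dev/null 2>&1; then
        # Keep going even if the command itself fails, but record the status.
        "$@" 2>&1 || echo "(exit status $?)"
    else
        echo "(not installed: $1)"
    fi
    echo
}

gpu_report() {
    run_labeled dkms status
    run_labeled lspci -k
    run_labeled cat /proc/cmdline
    run_labeled pacman -Qs kernel
    run_labeled nvidia-smi
}

# Usage:
#   gpu_report > gpu_report.txt
```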
Offline
dkms status
nvidia/535.171.04, 6.8.7-arch1-1.1-g14, x86_64: installed
rtl8814au/5.8.5.1.r180.gb5a6f96, 6.8.7-arch1-1.1-g14, x86_64: installed
lspci -k
00:00.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne Root Complex
Subsystem: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne Root Complex
00:00.2 IOMMU: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne IOMMU
Subsystem: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne IOMMU
00:01.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Renoir PCIe Dummy Host Bridge
00:01.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Renoir PCIe GPP Bridge
Subsystem: ASUSTeK Computer Inc. Device 1662
Kernel driver in use: pcieport
00:02.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Renoir PCIe Dummy Host Bridge
00:02.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne PCIe GPP Bridge
Subsystem: ASUSTeK Computer Inc. Device 1662
Kernel driver in use: pcieport
00:02.4 PCI bridge: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne PCIe GPP Bridge
Subsystem: ASUSTeK Computer Inc. Device 1662
Kernel driver in use: pcieport
00:08.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Renoir PCIe Dummy Host Bridge
00:08.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Renoir Internal PCIe GPP Bridge to Bus
Subsystem: Advanced Micro Devices, Inc. [AMD] Renoir Internal PCIe GPP Bridge to Bus
Kernel driver in use: pcieport
00:14.0 SMBus: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller (rev 51)
Subsystem: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller
Kernel driver in use: piix4_smbus
Kernel modules: i2c_piix4, sp5100_tco
00:14.3 ISA bridge: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge (rev 51)
Subsystem: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge
00:18.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 0
00:18.1 Host bridge: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 1
00:18.2 Host bridge: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 2
00:18.3 Host bridge: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 3
Kernel driver in use: k10temp
Kernel modules: k10temp
00:18.4 Host bridge: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 4
00:18.5 Host bridge: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 5
00:18.6 Host bridge: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 6
00:18.7 Host bridge: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 7
01:00.0 3D controller: NVIDIA Corporation GA107M [GeForce RTX 3050 Ti Mobile] (rev a1)
Subsystem: ASUSTeK Computer Inc. Device 148c
Kernel driver in use: nvidia
Kernel modules: nvidia_drm, nvidia
06:00.0 Network controller: Intel Corporation Wi-Fi 6 AX200 (rev 1a)
Subsystem: Intel Corporation Device 008c
Kernel driver in use: iwlwifi
Kernel modules: iwlwifi
07:00.0 Non-Volatile memory controller: Sandisk Corp IX SN530 NVMe SSD (DRAM-less) (rev 01)
Subsystem: Sandisk Corp IX SN530 NVMe SSD (DRAM-less)
Kernel driver in use: nvme
Kernel modules: nvme
08:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Cezanne [Radeon Vega Series / Radeon Vega Mobile Series] (rev c4)
Subsystem: ASUSTeK Computer Inc. Device 148c
Kernel driver in use: amdgpu
Kernel modules: amdgpu
08:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Renoir Radeon High Definition Audio Controller
Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Renoir Radeon High Definition Audio Controller
Kernel driver in use: snd_hda_intel
Kernel modules: snd_hda_intel
08:00.2 Encryption controller: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 10h-1fh) Platform Security Processor
Subsystem: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 10h-1fh) Platform Security Processor
Kernel driver in use: ccp
Kernel modules: ccp
08:00.3 USB controller: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne USB 3.1
Subsystem: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne USB 3.1
Kernel driver in use: xhci_hcd
Kernel modules: xhci_pci
08:00.4 USB controller: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne USB 3.1
Subsystem: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne USB 3.1
Kernel driver in use: xhci_hcd
Kernel modules: xhci_pci
08:00.5 Multimedia controller: Advanced Micro Devices, Inc. [AMD] ACP/ACP3X/ACP6x Audio Coprocessor (rev 01)
Subsystem: ASUSTeK Computer Inc. Device 1662
Kernel modules: snd_pci_acp3x, snd_rn_pci_acp3x, snd_pci_acp5x, snd_pci_acp6x, snd_acp_pci, snd_rpl_pci_acp6x, snd_pci_ps, snd_sof_amd_renoir, snd_sof_amd_rembrandt, snd_sof_amd_vangogh, snd_sof_amd_acp63
08:00.6 Audio device: Advanced Micro Devices, Inc. [AMD] Family 17h/19h HD Audio Controller
Subsystem: ASUSTeK Computer Inc. Device 1662
Kernel driver in use: snd_hda_intel
Kernel modules: snd_hda_intel
08:00.7 Signal processing controller: Advanced Micro Devices, Inc. [AMD] Sensor Fusion Hub
Subsystem: Advanced Micro Devices, Inc. [AMD] Sensor Fusion Hub
Kernel driver in use: pcie_mp2_amd
Kernel modules: amd_sfh
cat /proc/cmdline
ibt=off pm_debug_messages amd_pmc.dyndbg="+p" acpi.dyndbg="file drivers/acpi/x86/s2idle.c +p" initrd=\initramfs-linux-g14.img root=UUID=1baf899e-df18-41a1-b98d-c1cff60bea28 rw
pacman -Qs kernel
local/dkms 3.0.12-1
Dynamic Kernel Modules System
local/iptables 1:1.8.10-1
Linux kernel packet control tool (using legacy interface)
local/kmod 32-1
Linux kernel module management tools and library
local/lib32-libdrm 2.4.120-1
Userspace interface to kernel DRM services (32-bit)
local/libdrm 2.4.120-1
Userspace interface to kernel DRM services
local/libnetfilter_conntrack 1.0.9-2
Library providing an API to the in-kernel connection tracking state table
local/libnfnetlink 1.0.2-2
Low-level library for netfilter related kernel/userspace communication
local/libsysprof-capture 46.0-1
Kernel based performance profiler - capture library
local/linux-api-headers 6.7-1
Kernel headers sanitized for use in userspace
local/linux-g14 6.8.7.arch1-1.1
The Linux kernel and modules
local/linux-g14-headers 6.8.7.arch1-1.1
Headers and scripts for building modules for the Linux kernel
local/mtdev 1.1.6-2
A stand-alone library which transforms all variants of kernel MT events to the slotted type B protocol
nvidia-smi
# It says "Off" now, used to say 8W or whatever the power draw was.
Last edited by losipai (2024-05-01 13:51:17)
Offline
01:00.0 3D controller: NVIDIA Corporation GA107M [GeForce RTX 3050 Ti Mobile] (rev a1)
Subsystem: ASUSTeK Computer Inc. Device 148c
Kernel driver in use: nvidia
Kernel modules: nvidia_drm, nvidia
This is a pure GPU device, you can use it for DRI_PRIME, but not actually run a display server on it (exclusively)
You can remove ibt=off (nvidia fixed that), but maybe enable https://wiki.archlinux.org/title/NVIDIA … de_setting - use the "nvidia_drm.modeset=1" kernel parameter (modprobe.conf won't do!), to prevent any confusion through device re-ordering when the simpledrm device gets removed by the intel driver (I guess nvidia would then be card1 and intel card2)
Otherwise this looks uncritical, the GPU probably has entered RTD3, https://wiki.archlinux.org/title/PRIME# … Management
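A small sketch for checking whether a parameter such as nvidia_drm.modeset=1 or ibt=off was actually passed on the boot command line. That is a separate question from what sysfs reports for the loaded module. has_kernel_param is a made-up helper name.

```shell
#!/bin/sh
# Sketch: test whether a kernel command line contains an exact parameter.
# has_kernel_param is a hypothetical helper name.
has_kernel_param() {
    # $1 = parameter to look for, $2 = command line text
    case " $2 " in
        *" $1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage on the running system:
#   has_kernel_param nvidia_drm.modeset=1 "$(cat /proc/cmdline)" && echo "set at boot"
#   cat /sys/module/nvidia_drm/parameters/modeset   # value the loaded module reports
```

The match is on whole space-separated words, so nvidia_drm.modeset=1 does not match nvidia_drm.modeset=10.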
Offline
Thanks!
I'm struggling to understand the GPU stuff.
Not sure where ibt=off is set from, can't find this string in /etc or /boot except that it's baked into vmlinuz-linux-g14 for some reason. Maybe an upstream thing?
use the "nvidia_drm.modeset=1" kernel parameter (modprobe.conf won't do!)
It already says Y if I sudo cat /sys/module/nvidia_drm/parameters/modeset, is that enough or do I still need to load the kernel with it?
This is a pure GPU device, you can use it for DRI_PRIME
Yes, it seems to activate when I run stuff with prime-run. Still trying to get CUDA to work, it's saying CL_PLATFORM_NOT_FOUND_KHR.
Last edited by losipai (2024-05-05 17:54:07)
Offline
ibt=off is a https://wiki.archlinux.org/title/Kernel_parameters which depends on your bootloader
modprobe.conf won't do!
- you'll be able to add the parameter once you've figured out the above
https://wiki.archlinux.org/title/GPGPU#CUDA - do you have opencl-nvidia and cuda installed?
Offline
ibt=off is a https://wiki.archlinux.org/title/Kernel_parameters which depends on your bootloader
modprobe.conf won't do! - you'll be able to add the parameter once you've figured out the above
Yes, I use systemd-boot but nothing in my configuration sets that. It's not on the initrd or options lines, or anywhere in any other file. That's why I thought that maybe linux-g14 sets it by default. ¯\_(ツ)_/¯
https://wiki.archlinux.org/title/GPGPU#CUDA - do you have opencl-nvidia and cuda installed?
Yes!
Last edited by losipai (2024-05-12 12:36:29)
Offline
What if, in addition to "prime-run", you also "export VK_DRIVER_FILES=/usr/share/vulkan/icd.d/nvidia_icd.json"?
Offline
No difference.
VK_DRIVER_FILES=/usr/share/vulkan/icd.d/nvidia_icd.json prime-run katrain
...
OpenCL error at /home/user/data/kata/kata0/cpp/neuralnet/openclhelpers.cpp, func err, line 308, error CL_PLATFORM_NOT_FOUND_KHR
I'm trying to get TensorRT to work. But I'll install cuda-tools and see if I can run something simpler first.
Offline
Do the cuda device nodes exist, "ls -l /dev/nvidia*"?
Offline
$ ls -l /dev/nvidia*
crw-rw-rw- 1 root root 195, 254 May 12 10:57 /dev/nvidia-modeset
crw-rw-rw- 1 root root 507, 0 May 12 10:57 /dev/nvidia-uvm
crw-rw-rw- 1 root root 507, 1 May 12 10:57 /dev/nvidia-uvm-tools
crw-rw-rw- 1 root root 195, 0 May 12 10:57 /dev/nvidia0
crw-rw-rw- 1 root root 195, 255 May 12 10:57 /dev/nvidiactl
/dev/nvidia-caps:
total 0
cr-------- 1 root root 511, 1 May 12 11:00 nvidia-cap1
cr--r--r-- 1 root root 511, 2 May 12 11:00 nvidia-cap2
Last edited by losipai (2024-05-12 18:03:47)
Offline
crw-rw-rw- 1 root root 195, 0 May 12 10:57 /dev/nvidia0
"yes"
Try to
export OPENCL_VENDOR_PATH=/etc/OpenCL/vendors/nvidia.icd
- the error in #12 is from opencl, not cuda
Offline
Same error!
I might need a more straightforward way to verify any of OpenCL, CUDA or TensorRT than what I've been trying to run...
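One simpler check, sketched here under some assumptions: the clinfo package is installed, its list mode (clinfo -l) prints one "Platform #N:" line per platform, and nvidia-smi -L is available to list GPUs. count_opencl_platforms is a made-up helper name.

```shell
#!/bin/sh
# Sketch: count OpenCL platforms in `clinfo -l` style output read from stdin.
# count_opencl_platforms is a hypothetical helper; the "Platform #N:" prefix
# is an assumption about clinfo's list format.
count_opencl_platforms() {
    # grep -c prints 0 but exits non-zero when nothing matches; ignore that status.
    grep -c '^Platform #' || true
}

# Usage:
#   clinfo -l | count_opencl_platforms   # 0 would be consistent with CL_PLATFORM_NOT_FOUND_KHR
#   ls /etc/OpenCL/vendors/               # ICD files present on the system
#   nvidia-smi -L                         # GPUs the NVIDIA driver can see
```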
Offline
Do you know how to tell when it's safe to run recent nVidia drivers again? Or where to track this issue?
It might be easier to fix this CUDA issue if I'm not mixing 535xx version packages from AUR into the system.
As for the software I'm trying to run, it's compiled against older versions of CUDA, cuDNN and TensorRT than the ones in the Arch repos.
I tried to symlink libcudnn.so.9 to libcudnn.so.8, but the app doesn't buy it.
I also tried to install older CUDA and cuDNN versions from AUR, but they depend on legacy versions of gcc which tend to fail after compiling for hours.
Installing "pocl" at least makes the OpenCL version run, although it uses the CPU as backend. That's why I'm wondering if using the latest nvidia/cuda would help.
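To see which library versions the binary actually asks for, a sketch along these lines may help. It assumes ldd is available and prints unresolved entries as "name => not found". missing_libs is a made-up helper name.

```shell
#!/bin/sh
# Sketch: list shared libraries that ldd-style output (stdin) marks as unresolved.
# missing_libs is a hypothetical helper name.
missing_libs() {
    awk '/=> not found/ { print $1 }'
}

# Usage:
#   ldd /path/to/binary | missing_libs
```

The names it prints are the sonames the application was linked against, which shows which CUDA, cuDNN or TensorRT major versions it expects.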
Offline
I doubt the CUDA issue is directly related to the driver version. Officially, per https://docs.nvidia.com/cuda/cuda-toolk … notes/#id3, CUDA 12 is supported on driver 525 and above.
As for the crashing issue, it's reported not to happen on nvidia-open, and I have yet to see anything conclusive showing that it isn't already fixed in nvidia-beta 555.
On a cursory check for where to track it: https://forums.developer.nvidia.com/t/s … 284772/179 mentions the issue we've seen, and it might be fixed by the versions that have just been released, so ideally you'll get the fix one way or another with the next regular driver update.
Offline
Thanks a lot - now I know where to look for updates.
I switched to nvidia-open-dkms. After installing the latest version of all the nvidia stuff, compiling and running CUDA/TensorRT finally works again.
Really appreciate all the guidance.
Offline