It doesn't boot with nomodeset. I will post the journalctl output and an outline of the configuration tomorrow.
Last edited by vorlket (2025-07-18 05:48:34)
Offline
[vorlket@server ~]$ sudo journalctl -b | curl -F 'file=@-' 0x0.st
http://0x0.st/8dWh.txt
In /etc/default/grub:
GRUB_CMDLINE_LINUX=""
In /etc/mkinitcpio.conf:
MODULES=(mgag200)
Offline
In /etc/default/grub:
GRUB_CMDLINE_LINUX="nomodeset"
In /etc/mkinitcpio.conf:
MODULES=(mgag200 nvidia nvdia_uvm)
[vorlket@server ~]$ sudo journalctl -b | curl -F 'file=@-' 0x0.st
http://0x0.st/8d4l.txt
Offline
Google suggests:
The latest Arch Linux kernel can indeed conflict with the nvidia-470xx-dkms driver, particularly when the kernel is updated to a version that is not officially supported by the 470xx driver series. This often requires users to either downgrade the kernel or find a workaround, such as using a patched PKGBUILD from the AUR or waiting for a compatible driver update.
Here's a more detailed breakdown:
The Problem:
End-of-life support: The nvidia-470xx-dkms driver is an older, legacy driver series.
Kernel incompatibility: Newer Arch Linux kernels may not be compatible with this older driver, leading to issues like failed driver installation, no graphical desktop, or Xorg crashes.
DKMS issues: Dynamic Kernel Module Support (DKMS) relies on the kernel headers to build the driver module. If the kernel headers are incompatible, DKMS will fail.
Possible Solutions:
1. Downgrade the kernel:
The easiest solution is to downgrade the kernel to a version that is known to be compatible with the nvidia-470xx-dkms driver. For example, if 6.12 is the last LTS kernel to support the 470xx driver, downgrading to that version might resolve the issue.
2. Use a patched PKGBUILD:
The Arch User Repository (AUR) might contain a patched PKGBUILD for nvidia-470xx-dkms that is compatible with the newer kernel. This requires manually building and installing the driver from the AUR.
3. Wait for an official update:
NVIDIA might release a new version of the nvidia-470xx-dkms driver that is compatible with the latest kernel. This is the ideal solution, but it might take time.
4. Use a different driver:
If your hardware is still supported by newer NVIDIA drivers, consider upgrading to a more recent driver series (e.g., 5xx or 5xx-dkms).
5. Use the LTS kernel:
If you need to stick with the 470xx driver, consider using the LTS (Long Term Support) kernel, which is likely to have a longer support window for older drivers.
6. Check the Arch Linux forums and AUR:
Users often discuss workarounds and solutions for this issue in the Arch Linux forums and on the nvidia-470xx-dkms AUR page. Searching for your specific kernel version and the driver issue might lead you to a solution.
Example using the AUR (with a patched PKGBUILD):
Find a patched PKGBUILD: Look for a PKGBUILD in the AUR comments or on relevant forums (e.g., from Gist or Gist).
Download and extract the PKGBUILD: Download the PKGBUILD and any necessary patches.
Modify the PKGBUILD: If necessary, apply the patch to the PKGBUILD using patch (as described in the Gist).
Build and install: Use makepkg -si to build and install the driver.
Important Considerations:
Backups:
Before making any changes to your system, it's always a good idea to create backups.
Kernel headers:
Ensure you have the correct kernel headers installed for your current kernel.
Driver compatibility:
Check the NVIDIA website or the AUR page for the nvidia-470xx-dkms package to see if your specific hardware is compatible with this driver.
Offline
That's not google but their artificial idiocy "gemini". The 470xx driver on your system does load.
Remove the nvidia modules from the initramfs again and
Jul 19 13:47:32 server kernel: [drm] [nvidia-drm] [GPU ID 0x00000600] Loading driver
Jul 19 13:47:32 server kernel: [drm] Initialized nvidia-drm 0.0.0 for 0000:06:00.0 on minor 1
Jul 19 13:47:32 server kernel: Failed to initialize the nv-hotplug-helper DRM client (ensure DRM kernel mode setting is enabled via nvidia-drm.modeset=1).
Jul 19 13:47:32 server kernel: [drm] [nvidia-drm] [GPU ID 0x00000600] Unloading driver
Jul 19 13:47:32 server kernel: [drm] [nvidia-drm] [GPU ID 0x00000700] Loading driver
Jul 19 13:47:32 server kernel: [drm] Initialized nvidia-drm 0.0.0 for 0000:07:00.0 on minor 1
Add "nvidia_drm.modeset=1" to the kernel parameters: https://wiki.archlinux.org/title/Kernel_parameters
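For reference, a minimal sketch of how the parameter could be added to GRUB_CMDLINE_LINUX without typing it by hand. The helper name add_grub_param is made up for illustration, and it assumes the value sits on a single double-quoted line, as in the /etc/default/grub lines quoted earlier in the thread, and contains no '|', '&' or backslashes. If the file has no GRUB_CMDLINE_LINUX line, it does nothing.

```shell
# Hypothetical helper: add one parameter to GRUB_CMDLINE_LINUX in a GRUB
# defaults file, unless it is already there. Assumes a single-line,
# double-quoted value without '|', '&' or backslashes.
add_grub_param() {
    file=$1
    param=$2
    # Current value between the quotes (empty for GRUB_CMDLINE_LINUX="").
    current=$(sed -n 's/^GRUB_CMDLINE_LINUX="\(.*\)"$/\1/p' "$file")
    case " $current " in
        *" $param "*) return 0 ;;  # already present, leave the file alone
    esac
    # Put a space before the new parameter only if something is already set.
    new="${current:+$current }$param"
    # Write to a scratch file first, then copy back so the original keeps
    # its owner and permissions.
    sed "s|^GRUB_CMDLINE_LINUX=\".*\"\$|GRUB_CMDLINE_LINUX=\"$new\"|" "$file" > "$file.new" &&
        cat "$file.new" > "$file" &&
        rm -f "$file.new"
}
```

Run as root it would be called as add_grub_param /etc/default/grub nvidia_drm.modeset=1. The change only takes effect after regenerating the GRUB config (grub-mkconfig -o /boot/grub/grub.cfg, assuming the usual Arch output path), and a changed MODULES line in /etc/mkinitcpio.conf needs mkinitcpio -P to rebuild the initramfs. After the next boot, cat /proc/cmdline shows whether the parameter actually reached the kernel.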
Semi-OT rant:
And stop talking to these bullshit generators.
Every time I google a topic, mostly to find a specific URL to reference or a code snippet (since afaic neither gitlab nor github have decent search functions), google presents me this shit and I *know* that it has no fucking idea what it's talking about but makes up a question (I didn't ask for any kind of explanation, did I?) and then touts some completely wild theories that usually veer between "narrowed perception" and "nonsensical" (this might work better if it wouldn't make up the question in the first place, no idea)
And from third-hand reports about what chatgpt told people, the competition isn't doing any better.
"AI" is in reality "LLM" (google what that could mean), i.e. pimped-up word prediction that is insanely good at producing coherent English texts (ChatGPT should run for office in 2028…) but doesn't even know what topic it's currently babbling about. LLMs probabilistically string together words based on the seed you provide.
Online
Never got into LLM. I aim to use cuda for deep reinforcement learning, predicting short term stock price return.
Anyway, I'm travelling home at the moment. I will remove the nvidia modules and add the DRM kernel parameter. From what I recall, this results in a restart at boot, but I will have to check and confirm.
Do you think trying the LTS kernel, version 6.12, may help? I've never built a kernel without following the instructions on the Arch Linux web pages, but I suppose it's time to learn.
I think two 1100 W PSUs are enough for my setup, and I would be surprised if there's a hardware issue.
Offline
Adding "nvidia_drm.modeset=1" to GRUB_CMDLINE_LINUX="" in /etc/default/grub results in a reboot at boot. The debug output at boot: https://imgur.com/a/HrGJMZW.
Offline
Never got into LLM. I aim to use cuda for deep reinforcement learning, predicting short term stock price return.
The rant wasn't about you training an LLM, but the quality of AI responses.
results in the reboot at boot
Can you easily take out one/some of the GPUs?
Are there actually
Jul 19 12:12:41 server kernel: pci 0000:06:00.0: [10de:102d] type 00 class 0x030200 PCIe Endpoint
Jul 19 12:12:41 server kernel: pci 0000:07:00.0: [10de:102d] type 00 class 0x030200 PCIe Endpoint
Jul 19 12:12:41 server kernel: pci 0000:84:00.0: [10de:102d] type 00 class 0x030200 PCIe Endpoint
Jul 19 12:12:41 server kernel: pci 0000:85:00.0: [10de:102d] type 00 class 0x030200 PCIe Endpoint
four of them??
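A quick way to count them instead of eyeballing the journal, as a sketch: the function name count_pci_id is made up, and it simply counts log lines that contain the bracketed vendor:device ID (10de is NVIDIA's PCI vendor ID, 102d is the device ID from the lines above).

```shell
# Count kernel-log lines that mention a given PCI vendor:device ID.
# Reads log text on stdin; count_pci_id is a hypothetical helper name.
count_pci_id() {
    # -F: match the bracketed ID literally; -c: print the number of matching lines.
    grep -c -F "[$1]"
}
```

On the live system that would be journalctl -k -b | count_pci_id 10de:102d, and lspci -nn -d 10de: lists the NVIDIA devices directly. Note that grep exits non-zero when the count is 0, which matters if this runs under set -e.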
Spontaneous reboots are caused by
- insufficient power
- overheating
- a bad/overclocked CPU
- bad/overclocked RAM
That's not a software issue and the kernel would rather halt than reboot on critical errors.
Online
The K80 is a dual-GPU card, so two cards have 4 GPUs. It could be a faulty or damaged GPU. I'll test them one by one.
Offline
It turned out that one of the GPUs is damaged: nvidia-smi works with one GPU, but with the other the system reboots. Thanks for all your help.
Last edited by vorlket (2025-07-20 04:59:11)
Offline