
#1 2022-02-08 21:45:48

defborn
Member
Registered: 2022-02-08
Posts: 6

[SOLVED] Full system freezes happening at random

UPDATE:

After further reading on the forum I saw somebody suggest running a memory test. I have two 16 GB modules, and running memtest with both installed gave 42 errors (four passes, which took 4 hours to complete). This morning I ran it again on each module individually, with one pass each: one module passed and the other failed with 12 errors. So maybe all of this boils down to faulty memory! I will return the bad module, continue working with the remaining one and see how that goes. 16 GB is enough most of the time; in hindsight, 32 GB was a bit much ^^.
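Roughly speaking, a memory tester writes known bit patterns to memory, reads them back and counts every mismatch as an error, and testing one module at a time is what lets you tell which stick is bad. The sketch below is a minimal fixed-pattern write/read-back check to illustrate that idea; the pattern set and the function names `run_pass`/`run_test` are my own choices. It is not a substitute for memtest86+: a user-space program only sees the pages the OS hands it, cannot target a specific physical module, and runs on the same possibly faulty memory it is checking.

```python
"""Illustration of a fixed-pattern memory check: write a pattern, read it back, count mismatches.

NOT a real RAM test (see the lead-in); it only shows what "passes" and "errors" mean.
"""
from __future__ import annotations

# All zeros, all ones, and the two alternating-bit patterns (0101..., 1010...).
PATTERNS = (0x00, 0xFF, 0x55, 0xAA)


def run_pass(buf: bytearray) -> int:
    """One pass: for each pattern, fill the whole buffer, then count bytes that read back differently."""
    errors = 0
    for pattern in PATTERNS:
        buf[:] = bytes([pattern]) * len(buf)      # write phase
        errors += len(buf) - buf.count(pattern)   # verify phase
    return errors


def run_test(size: int, passes: int) -> list[int]:
    """Run several passes over one buffer of `size` bytes; return the error count of each pass."""
    buf = bytearray(size)
    return [run_pass(buf) for _ in range(passes)]
```

On working hardware, `run_test(1 << 20, 4)` should report zero errors for every pass.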

--

I have looked through other posts on the forum, but after trying various solutions I am still experiencing these full system freezes from time to time, and it is utterly annoying. Today I had three freezes: two in quick succession and a third after hours of work. I also had one 'black screen' after resuming from suspend, but that was definitely an Xorg problem: I could ssh into my machine and kill Xorg, after which my monitor displayed the virtual console again and I could just run `startx` and carry on.

I am fairly new to Arch, so I will need some further instructions on what information/logs are needed to pinpoint the problem. I will update the post accordingly.
I am already very, very grateful for any help from anyone!

Sadly I am still learning a lot, which does not make fixing this any easier. My knowledge of ACPI, kernel parameters, initramfs, ... is rather limited.

So here is a brief overview of what is happening, followed by what system I am running on and some logs.

System overview

Hardware specs:

CPU:          AMD Ryzen 9 5950X 16-Core @ 32x 3.4GHz
GPU:          NVIDIA GeForce GTX 1650 (Gigabyte GeForce GTX 1650 D6 OC 4G V2)
Motherboard:  MSI B450 GAMING PRO CARBON MAX WIFI (Wi-Fi enabled)
RAM:          G.Skill DDR4 Ripjaws-V 2x16GB 4000MHz (running at 4000MHz via XMP 1 setting in BIOS)
Disk:         NVMe SSD with boot, root and home partitions + separate HDD for media
Monitor:      Benq 32" PD3200U monitor (connected via DP-0)

Software:

Bootloader:   Grub
Kernel:       x86_64 Linux 5.15.21-1-lts (and the non LTS kernel available too)
WM:           XMonad, I use xlock too.

The GPU is running the proprietary NVIDIA driver.

Behaviour

For the first three months I used the regular linux kernel with the matching NVIDIA modules. For the last two weeks I have been using the LTS version. I have experienced two sorts of problems:
- not able to resume properly after suspend
- random full system freezes, with weird power button behaviour on my desktop case as a result

For me, the latter is far more important to fix than the former. The blank screen on resume happens only about 1 in 3 times, and today I discovered that by ssh-ing in from a second, older computer I could just kill Xorg and restart it. Not great, but I'll live with it for now. This problem might go away once the bigger issue is resolved, or it might be something different. So while we can try to fix both, the freezes are much more important, since this machine is my daily driver for work.

The freezes occur randomly and so far I have not been able to link them to a specific application. I do think they are graphics related, since they always seem to happen after doing something in:
- Chrome
- Kitty (terminal emulator)
- Firefox
- Gimp (rarely use it)
- Darktable (rarely use it, but it does trigger it)

Kitty just happened to trigger it today, and at first I thought 'well, now I know it is not really graphics related', but then I realised that Kitty is a GPU-based terminal emulator.

Following a system freeze I cannot:
- open other virtual consoles using `alt-Fx`
- use the REISUB sequence, or even just RB, via SysRq. The keys themselves work, since I can use them to reboot during a normal session [1]
- power off the computer with either a short or a long press of the power button [2]
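The kernel only honours SysRq keys whose function group is enabled in the `kernel.sysrq` bitmask (readable at `/proc/sys/kernel/sysrq`, where 0 disables SysRq and 1 enables everything); the `sysrq_always_enabled=1` boot parameter in the GRUB line of this post makes the kernel ignore that setting altogether. As a hedged sketch, the helper below decodes a bitmask value and reports which REISUB keys it would allow, using the bit values from the kernel's sysrq documentation; `SYSRQ_BITS`, `REISUB_BITS` and both functions are hypothetical names. It only describes what the kernel permits: after a hard lock-up like the one described here, the kernel may never process the keystrokes at all.

```python
"""Sketch: decode the kernel.sysrq bitmask and check which REISUB keys it permits."""
from __future__ import annotations

# Function groups controlled by each bit of /proc/sys/kernel/sysrq (per the kernel's sysrq docs).
SYSRQ_BITS = {
    0x2: "console logging level",
    0x4: "keyboard control (SAK, unraw)",
    0x8: "debugging dumps of processes",
    0x10: "sync",
    0x20: "remount read-only",
    0x40: "signalling of processes (term, kill, oom-kill)",
    0x80: "reboot/poweroff",
    0x100: "nicing of all RT tasks",
}

# Bit each REISUB key depends on: unRaw, tErm, kIll, Sync, Unmount (remount ro), reBoot.
REISUB_BITS = {"R": 0x4, "E": 0x40, "I": 0x40, "S": 0x10, "U": 0x20, "B": 0x80}


def enabled_groups(value: int) -> list[str]:
    """Names of the function groups allowed by a sysrq value (1 is the special 'all' value)."""
    if value == 1:
        return list(SYSRQ_BITS.values())
    return [name for bit, name in SYSRQ_BITS.items() if value & bit]


def reisub_allowed(value: int) -> dict[str, bool]:
    """Map each REISUB key to whether the keyboard may trigger it under this sysrq value."""
    return {key: value == 1 or bool(value & bit) for key, bit in REISUB_BITS.items()}
```

For example, `reisub_allowed(int(open("/proc/sys/kernel/sysrq").read()))` checks the running system; a value of 244 (4 + 16 + 32 + 64 + 128) enables exactly the groups REISUB needs.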

Logging and config

Xorg config: only "/etc/X11/xorg.conf.d/20-nvidia.conf" exists; there is no `/etc/X11/xorg.conf`:

# from: https://bbs.archlinux.org/viewtopic.php?pid=2019098#p2019098
Section "Device"
    Identifier  "Default nvidia Device"
    Driver	"nvidia"
    Option	"TripleBuffer" "true"
EndSection

Journal from last boot: https://gist.githubusercontent.com/jero … tfile1.txt
Xorg log: https://gist.githubusercontent.com/jero … tfile1.txt

mkinitcpio.conf:

MODULES=(nvidia nvidia_modeset nvidia_uvm nvidia_drm)
BINARIES=()
FILES=()
HOOKS=(base udev autodetect modconf block filesystems keyboard fsck)

Grub config (kernel params only):

GRUB_CMDLINE_LINUX_DEFAULT="loglevel=7 nvidia-drm.modeset=1 nouveau.modeset=0 sysrq_always_enabled=1"
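Editing `/etc/default/grub` only takes effect once `grub.cfg` has been regenerated (on Arch typically with `grub-mkconfig -o /boot/grub/grub.cfg`) and the machine rebooted. On Linux, `/proc/cmdline` shows the command line the running kernel actually booted with, so a quick sanity check is to compare it against the parameters above. The sketch below does that; `EXPECTED`, `parse_cmdline` and `missing_params` are hypothetical names, and the simple whitespace split assumes no quoted values containing spaces.

```python
"""Hypothetical check that the running kernel was booted with the intended parameters."""
from __future__ import annotations

# Parameters from GRUB_CMDLINE_LINUX_DEFAULT in this post.
EXPECTED = {
    "loglevel": "7",
    "nvidia-drm.modeset": "1",
    "nouveau.modeset": "0",
    "sysrq_always_enabled": "1",
}


def parse_cmdline(text: str) -> dict[str, str]:
    """Split a kernel command line into key -> value; bare flags such as 'rw' map to ''.

    Splits only on the first '=', so 'root=UUID=...' keeps 'UUID=...' as the value.
    Does not handle quoted values that contain spaces.
    """
    params: dict[str, str] = {}
    for token in text.split():
        key, _, value = token.partition("=")
        params[key] = value
    return params


def missing_params(text: str, expected: dict[str, str] = EXPECTED) -> dict[str, str | None]:
    """Expected parameters whose value differs, mapped to what was found (None if absent)."""
    found = parse_cmdline(text)
    return {key: found.get(key) for key, want in expected.items() if found.get(key) != want}
```

Calling `missing_params(open("/proc/cmdline").read())` should return an empty dict if all four parameters were applied.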

1: I found a similar description in https://bbs.archlinux.org/viewtopic.php?id=270704:

Enabling and using the SysRq key. All SysRq commands are enabled on my system, as confirmed by the fact that I can REISUB out of the system, as long as no crash happened. Once a crash happens, all keyboard functions (including SysRq commands) become entirely unresponsive. In addition, the lock LEDs remain in the state they were before the crash, and the lock keys don't respond. It is also not possible to switch to another TTY in this state.

2: Concerning that last issue: my case has a power button with an LED. After a crash the LED stays lit and (long) pressing the power button has no effect. When switching the power supply off completely, I have to make sure all power has drained (waiting 10+ seconds). If I do not, then after switching back on the power button is still active and the system restarts, but nothing comes up (no boot/BIOS screen, nothing, just the fans spinning). This is probably an issue with the case; still, I find it weird that the system does not boot.

Thanks again!

Last edited by defborn (2022-02-14 13:48:34)


#2 2022-02-11 06:18:29

FallenSnow
Member
Registered: 2014-04-07
Posts: 39

Re: [SOLVED] Full system freezes happening at random

Did you just say you can't long-press to power off the machine??
I'm not entirely sure how that is handled, but I assume the motherboard is the authority on long-press power-off. How long are you holding the power button?

Also, another thing to try: uninstall nvidia, reboot and use only the console, not a WM. Try running some system-stressing app for a few hours. If it still locks up, that more or less rules out graphics as the cause.
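The suggestion above boils down to loading the CPUs (and memory) for hours with the GPU driver out of the picture. Dedicated stress tools are the better choice, but as a hedged illustration, here is a minimal Python load generator that keeps one process busy per CPU, repeatedly filling a buffer with random bytes and reading it back. The function names, the 1 MiB buffer size and the Linux-only `fork` start method are assumptions of this sketch.

```python
"""Hypothetical minimal CPU + memory load generator (assumes Linux for the 'fork' start method)."""
from __future__ import annotations

import multiprocessing as mp
import os
import time


def _burn(seconds: float, chunk: int, results) -> None:
    """Busy-loop for `seconds`: allocate a fresh random buffer, read every byte, repeat."""
    deadline = time.monotonic() + seconds
    iterations = 0
    while time.monotonic() < deadline:
        block = bytearray(os.urandom(chunk))  # touch newly allocated memory each round
        sum(block)                            # force a read of every byte
        iterations += 1
    results.put(iterations)


def stress(seconds: float, workers: int = 0, chunk: int = 1 << 20) -> list[int]:
    """Run `workers` busy processes (default: one per CPU); return each one's iteration count."""
    ctx = mp.get_context("fork")  # fork: the target is inherited, not pickled
    workers = workers or os.cpu_count() or 1
    results = ctx.Queue()
    procs = [ctx.Process(target=_burn, args=(seconds, chunk, results)) for _ in range(workers)]
    for p in procs:
        p.start()
    counts = [results.get() for _ in procs]  # drain the queue before joining to avoid blocking
    for p in procs:
        p.join()
    return counts
```

From a console with the nvidia driver removed, `stress(3 * 60 * 60)` would run it for three hours. Being pure Python, it is a far lighter load than a purpose-built stress tool, so a clean run is weaker evidence than a lock-up.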


#3 2022-02-14 13:48:04

defborn
Member
Registered: 2022-02-08
Posts: 6

Re: [SOLVED] Full system freezes happening at random

@FallenSnow: up to 20 seconds... Strange, is it not? And since this only happens after a freeze, and at that moment there is no output on the screen, not even an actual boot, there are no logs I can consult.

OT: since removing the bad memory module I have not encountered a single freeze. So I will close this.

