Hi everyone,
My PC started rebooting randomly several days ago. Until today I couldn't do anything about it.
Today I found a way to trigger the reboot. When I run a server written in Rust and built in debug mode, after roughly 3 minutes the screen freezes and a few seconds later the PC reboots. If a YouTube video is playing when the freeze happens, the video stops and the audio gets stuck in a loop; then, after a few seconds, something triggers the reboot.
I did some experiments:
Booted with `processor.max_cstate=1 intel_idle.max_cstate=0` and ran the server: still reboots after ~3 min.
Booted with `processor.max_cstate=1 intel_idle.max_cstate=0 split_lock_detect=off` and ran the server: still reboots after ~3 min.
Removed the above kernel parameters, updated the system with `pacman -Syu`, rebooted, and ran the server: still reboots after ~3 min.
Booted without launching Hyprland and ran the server: still reboots after ~3 min.
If I compile the server in release mode and run it, the PC does not reboot and the server runs without issues. The freeze + reboot does not happen only when I run the server; it also happens randomly on other occasions.
I ran `stress` for a long time with many different combinations of `--cpu`, `--io`, `--vm` and `--vm-bytes`, really stressing the hardware, and I didn't get any reboot.
I ran memtest86+ for several hours and no issues were found.
I can compile big projects in parallel and the PC does not reboot.
minikube is usually running on my PC. I'm not sure about this, but I have the feeling that the reboots happen more easily when minikube is running.
The hardware does not seem to overheat when the reboot happens.
I ran `dmesg -w` and watched it while running the server: no message appears when the PC reboots.
I ran `journalctl -f` and watched it while running the server: no message appears when the PC reboots.
My hardware is:
Z790 AORUS ELITE AX
48GiB DIMM Synchronous 4800 MHz
48GiB DIMM Synchronous 4800 MHz
Intel(R) Core(TM) i9-14900K
2TB NVMe disk (with my encrypted os)
Firmware:
[ 4.151414] i915 0000:00:02.0: [drm] Finished loading DMC firmware i915/adls_dmc_ver2_01.bin (v2.1)
[ 4.166282] i915 0000:00:02.0: [drm] GT0: GuC firmware i915/tgl_guc_70.bin version 70.20.0
[ 4.166287] i915 0000:00:02.0: [drm] GT0: HuC firmware i915/tgl_huc.bin version 7.9.3
uname -r:
6.8.9-arch1-2
Any clue as to what could trigger the reboot?
When the audio is looping, before the reboot, are the lights on your keyboard flashing? (num lock, caps lock)
Nothing is too wonderful to be true, if it be consistent with the laws of nature -- Michael Faraday
The shortest way to ruin a country is to give power to demagogues.— Dionysius of Halicarnassus
---
How to Ask Questions the Smart Way
I just tried it, and I can confirm that my keyboard's lights do not flash. They stay off.
Okay. I was checking whether the kernel was panicking. It does not sound like it.
Any clues in the journal?
I don't know what I should look for. I have a bunch of errors from dockerd and containerd, and some issues with Bluetooth and Wi-Fi (which I don't use). Here is a pastebin from my last session (the one where I made the PC reboot by running the proxy, to see whether the lights were flashing): https://pastebin.com/sTA5d9G6
Last edited by antonio326 (2024-05-16 19:12:02)
Overheated, underpowered or bad RAM or CPU.
Does building in debug mode use considerably more RAM than building in release mode?
Monitor the temperature, check memtest86+ (and in case those are two 48GB DIMMs, test the behavior with only one of them at a time) and see https://www.radgametools.com/oodleintel.htm (in case memtest86+ doesn't show anything)
Yep, running the server in debug mode is more resource intensive than running the one compiled in release mode. But it is not that heavy: when I use `stress` or compile a big project I use a lot more CPU and RAM. I will try what you suggested about the RAM. As for temperature, everything seems fine; I reach higher temperatures when I use `stress`. Is there a way to log temperatures so that I can look at them after the reboot happens?
https://wiki.archlinux.org/title/List_o … m_monitors
The problem is that when the system spontaneously reboots, the tail of any log will not be written to disk.
So it's not actually a build process but just running that "server" (sorry, I just skimmed the OP because there's not much that causes a spontaneous reboot) - what is the "server" and how does its debug build differ from the regular one?
Just different compiler optimizations and debug symbols, or does it do different things (e.g. a lot of disk IO etc.)?
Does the "server" utilize its own kernel modules?
The journal doesn't suggest you're relying on the IOMMU (for eg. vfio)?
Try "iommu=soft", https://wiki.archlinux.org/title/Solid_ … ST_support (but that's a blind guess)
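On the question of logging temperatures so that they can be read after a reboot: here is a minimal sketch, not a tested tool. It assumes the sensors are exposed through the kernel's hwmon sysfs files (`/sys/class/hwmon/hwmon*/temp*_input`, values in millidegrees Celsius). The function names `read_temps`, `log_once` and `monitor` and the log path are made up for illustration. Each sample is flushed and fsync'ed right away, which should reduce how much of the log tail is lost when the machine resets, though it cannot guarantee that the very last sample survives.

```python
import glob
import os
import time


def read_temps(hwmon_root="/sys/class/hwmon"):
    """Return {"hwmonN:chip/tempM": degrees Celsius} for each temp*_input file."""
    temps = {}
    pattern = os.path.join(hwmon_root, "hwmon*", "temp*_input")
    for path in sorted(glob.glob(pattern)):
        chip_dir = os.path.dirname(path)
        try:
            with open(os.path.join(chip_dir, "name")) as f:
                chip = f.read().strip()
        except OSError:
            chip = "unknown"
        sensor = os.path.basename(path)[: -len("_input")]
        try:
            with open(path) as f:
                millideg = int(f.read().strip())
        except (OSError, ValueError):
            continue  # some sensors cannot be read; skip them
        key = f"{os.path.basename(chip_dir)}:{chip}/{sensor}"
        temps[key] = millideg / 1000.0
    return temps


def log_once(log_path, hwmon_root="/sys/class/hwmon"):
    """Append one timestamped sample and force it out to disk."""
    temps = read_temps(hwmon_root)
    fields = " ".join(f"{k}={v:.1f}" for k, v in sorted(temps.items()))
    line = f"{time.strftime('%Y-%m-%d %H:%M:%S')} {fields}"
    with open(log_path, "a") as f:
        f.write(line + "\n")
        f.flush()             # push Python's buffer to the kernel
        os.fsync(f.fileno())  # ask the kernel to write it to the device
    return line


def monitor(log_path, interval=1.0, samples=None):
    """Log every `interval` seconds; runs forever when samples is None."""
    taken = 0
    while samples is None or taken < samples:
        log_once(log_path)
        taken += 1
        time.sleep(interval)
```

To use it, one would call something like `monitor("/var/tmp/temps.log")` in a terminal before starting the server, then read the end of that file after the reboot. One sample per second is an arbitrary choice; a shorter interval gives a closer view of the last moments at the cost of more disk writes.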
OK, so I wanted to do some more experiments, but that turned out to be impossible, so for now I have to stop looking for the cause of the reboots. I will certainly update this thread if I manage to find out something more.
I wanted to try what seth suggested: monitoring the temperature, testing one RAM module at a time, and iommu.
First I re-ran the proxy to see if I could get a reboot after a few minutes, and it worked (in the sense that the PC rebooted).
Then I decided to start with temperature. I had already checked it, but it seemed the most likely cause.
I ran the proxy while watching the temperatures and, to my big surprise, some cores and the overall CPU temperature got close to, and even touched, critical (100 °C). After a little while the PC rebooted as usual.
I was surprised because the CPU did not overheat when I previously ran the proxy while watching the temperature.
Then I ran the proxy again and the CPU was not overheating any more, and the PC stopped rebooting. I let it run for 20 minutes but nothing happened.
I ran `stress` for 10 minutes, reaching critical temperature on the CPU, but the PC did not reboot.
Then I rebooted the PC myself to see if I could get the server to crash my PC, but nothing: it does not crash anymore.
My PC today is the same as yesterday, same hardware and same software. I run exactly the same things as yesterday, including playing the video on YouTube (it is handy because I can listen to the audio and know when the PC is crashing).
Today is colder than yesterday, but I do not think that this can be the cause: the first 2 times that I ran the server today, my PC rebooted.
About seth's question on what the server does and the difference between the builds:
1. The server fetches some data from a bitcoin node on my PC, writes some logs to a postgres server on minikube on my PC, and listens on 2 ports (other than that it does nothing, because I do not connect any client to it).
2. The debug build is not optimized, so it is faster to build; it is what you usually use while developing the software.
Thanks everyone for the help. I hope to understand a little bit more about this issue and update this thread.
Last edited by antonio326 (2024-05-17 10:21:25)
I ended up under-clocking my CPU and everything seems to work fine.