Hi,
I have a system with an AMD Ryzen 9 3950X and an Nvidia GeForce 2080 Ti. For the past few weeks it has been freezing sporadically: journalctl's log is empty around the time of each freeze, and the system requires a hard reset. So far, I have not figured out a) how to reproduce the freeze or b) what causes it (mostly because journalctl doesn't help).
I'm currently stuck at a remote place away from this system, but need its compute for some simulations. I can ask a trusted person to reboot it and unlock the encrypted HDD, but that's about all the "physical interaction" I have with the system. Once booted, the system connects to a VPN, and I can remotely access it via ssh through the VPN.
Are there any recommendations for how I can debug this situation? I have never used netconsole, so I don't know how to configure it to log to a remote computer. In particular, I wouldn't know where to start with netconsole given that a remote computer only becomes reachable once the system's VPN connection is established. Maybe there are other tools?
Any help or pointers to resources are highly appreciated.
Edit: As an additional note, this is a dual boot system with Windows and never showed any kind of freeze when using that OS.
Last edited by rochus (2022-07-20 20:39:21)
Edit: As an additional note, this is a dual boot system with Windows and never showed any kind of freeze when using that OS
Make sure you have read the 3rd link below. Windows might be crashing your Linux. (Don't just believe the condition: check it. Windows casually re-enables that w/ updates.)
I'm currently stuck at a remote place away from this system
How "currently"?
This week "currently", this month "currently" or 8 to 12 years "currently"?
(If it's short term and your trusted person is not overly experienced, it might not be worth the effort to train them in anything beyond pushing the power button, which is exactly what needs to stop if you want to get some data out of the system.)
The problem w/ https://docs.kernel.org/networking/netconsole.html is that if the system "freezes" because the network breaks down, you're not necessarily getting errr… network messages for that.
And certainly not, if the network breaks down because of some userspace issues (ie. the VPN crashes or whatever), because those aren't covered by it at all.
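If you want to try netconsole anyway, the receiving end can be very simple. Below is a minimal sketch, not a tested setup: it builds the `netconsole=` parameter string in the format given in the kernel documentation linked above and listens for the UDP messages on another machine. The function names, the interface name and the addresses are made up for illustration; 6665/6666 are the documented default source/target ports. As far as I know netconsole sends straight from the kernel via the physical NIC, so the listener most likely has to be reachable on the local network and not only through the VPN.

```python
import socket
import sys
from datetime import datetime, timezone


def build_netconsole_param(target_ip, target_mac, dev, src_ip="",
                           src_port=6665, target_port=6666):
    """Build a netconsole= parameter string.

    Documented format:
    [src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-macaddress]
    """
    return (f"netconsole={src_port}@{src_ip}/{dev},"
            f"{target_port}@{target_ip}/{target_mac}")


def format_record(data, sender_ip, when):
    """Turn one received datagram into a timestamped log line."""
    text = data.decode("utf-8", errors="replace").rstrip("\n")
    return f"{when.isoformat()} {sender_ip} {text}"


def listen(host="0.0.0.0", port=6666, out=sys.stdout):
    """Receive kernel messages over UDP and write them out until interrupted."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind((host, port))
        while True:
            data, (sender_ip, _sender_port) = sock.recvfrom(65535)
            now = datetime.now(timezone.utc)
            out.write(format_record(data, sender_ip, now) + "\n")
            out.flush()  # keep the last lines even if the listener is killed
```

On the logging machine you would call `listen()` (redirecting stdout to a file if you like) and leave it running; on the freezing machine the string from `build_netconsole_param(...)` goes on the kernel command line or is passed to the netconsole module.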
It would be good to get at least some idea of the nature of the freeze.
Ideally, enable the SysRq key (see https://wiki.archlinux.org/title/Keyboard_shortcuts) and train your trusted person to
1. check whether the keyboard LEDs are blinking (indicating a kernel panic)
2. use the sysrq sequence to reboot the system if possible (as this will -perhaps- allow the journal to be synced to disk)
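Before relying on the sysrq sequence, it is worth checking over ssh that it is actually permitted. A small sketch that decodes the `kernel.sysrq` bitmask from `/proc/sys/kernel/sysrq`, using the bit values listed in the kernel's sysrq documentation; the helper names are my own, and I'm assuming the usual R-E-I-S-U-B sequence is what gets trained.

```python
from pathlib import Path

# Bit values as listed in the kernel's sysrq documentation.
SYSRQ_BITS = {
    2: "control of console logging level",
    4: "control of keyboard (SAK, unraw)",
    8: "debugging dumps of processes etc.",
    16: "sync command",
    32: "remount read-only",
    64: "signalling of processes (term, kill, oom-kill)",
    128: "reboot/poweroff",
    256: "nicing of all RT tasks",
}

# R (unraw) = 4, E/I (term/kill) = 64, S (sync) = 16,
# U (remount read-only) = 32, B (reboot) = 128
REISUB_MASK = 4 | 16 | 32 | 64 | 128


def decode_sysrq(value):
    """Return the names of the sysrq function groups enabled by `value`."""
    if value == 0:
        return []                          # sysrq disabled completely
    if value == 1:
        return list(SYSRQ_BITS.values())   # 1 means "everything enabled"
    return [name for bit, name in SYSRQ_BITS.items() if value & bit]


def reisub_allowed(value):
    """True if every step of the R-E-I-S-U-B sequence is permitted."""
    return value == 1 or (value & REISUB_MASK) == REISUB_MASK


def read_current(path="/proc/sys/kernel/sysrq"):
    """Read the running kernel's current sysrq setting."""
    return int(Path(path).read_text().strip())
```

For example, `reisub_allowed(read_current())` tells you whether the full sequence would work right now; a value of 16 only permits the sync step, while 1 or 244 permits all of it.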
And finally, there's https://wiki.archlinux.org/title/Kdump
Thank you for your great reply!
I suspected as much with netconsole, i.e. that it won't necessarily help me and that I need to find another way to sync the log to disk. The "currently" means about 3 weeks during which I won't be able to touch the system. However, I'm confident that I can train my trusted person to follow the sysrq sequence.
The "pain and suffering" from this issue was not big enough previously, because it happens so rarely and I was usually near the system the last one or two times it happened. With your explanation and references I think I might be finally able to hunt it down.
Thanks again! I hope to remember to report back in this thread once I've managed to solve it.