How to debug random complete system freeze?

AquaSZS · 2019-10-24 07:29:46

Posting here because my issue seems to be a hardware problem.

Ever since a kernel update 2 days ago (possibly not the cause), I have encountered complete system freezes (display frozen, the only way to interact is the power button) at random. The freeze can happen even if the system is idle with no demanding background processes.

On the same day of kernel update, I was also working with a somewhat large dataset (more than half TB) on the HDD. I don't think this is relevant, but would like to mention just in case.

I tried switching to the LTS kernel, but the freeze occurred again in a few hours.

The most frustrating part of all this is that there is absolutely no log in journalctl before the freeze happens. The last time I had similar freezes (they were not complete freezes, however), the logs showed problems in SATA data transfer. Those logs helped me determine that my power supply was defective and occasionally not providing enough power to the HDD. I have had no system issues since replacing the PSU, until now.

How should I go about figuring out the problem this time?

Last edited by AquaSZS (2019-11-04 09:39:51)

d_fajardo · 2019-10-24 10:26:07

Run memtest to check for RAM faults (including CPU cache). You can also check the HDD with smartctl. I've also read somewhere about Blend for CPU testing, but I don't have experience with it.
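
To make the smartctl suggestion concrete, here is a minimal sketch that asks a drive for its overall SMART verdict and picks the result out of the text output. The device path `/dev/sda` and the helper names are assumptions for illustration, and `smartctl -H` usually needs root. A "PASSED" verdict does not rule out a failing disk; it is only a first check.

```python
import re
import subprocess


def parse_smart_health(output):
    """Extract the overall health verdict from `smartctl -H` text output.

    Returns the verdict string (e.g. 'PASSED') or None if no verdict
    line is found.
    """
    match = re.search(
        r"(?:self-assessment test result|SMART Health Status):\s*(\S+)",
        output,
    )
    return match.group(1) if match else None


def smart_health(device="/dev/sda"):
    """Run `smartctl -H` on a device (the default path is an assumption)."""
    result = subprocess.run(
        ["smartctl", "-H", device],
        capture_output=True,
        text=True,
        check=False,  # smartctl uses non-zero exit codes to flag problems
    )
    return parse_smart_health(result.stdout)


if __name__ == "__main__":
    print(smart_health())
```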

AquaSZS · 2019-10-24 14:32:59

Well damn, even memtest86+ freezes at some point (tried twice, froze at different tests, didn't get through a single pass). At this point, should I consider just abandoning ship? Not knowing the source of the problem still irks me though.

Last edited by AquaSZS (2019-10-24 14:34:19)

xse · 2019-10-24 15:05:29

Hey,

First things first: open your computer and remove as much hardware from it as possible. Leave one RAM stick, no HDD, no PCI cards such as a GPU; only the bare CPU + 1 RAM stick + a USB live system. Test it that way, then gradually add the hardware back to check whether the issue comes from a specific part.

I have had a similar issue for quite some time on one of my computers. On this specific PC every non-Windows OS I tried had this freeze issue, totally random and with no logs whatsoever.
The closest I've come to understanding what was happening was with OpenBSD, which sometimes gave me access to https://man.openbsd.org/ddb.4, which allowed me to dump core and so on.

I've long since stopped trying to understand where it came from. After all, this is a 10+ year old computer, which I still use daily but which now runs Windows.
If you really want hints, I can only recommend trying a simple OpenBSD setup, without X and so on, and seeing whether you can get some meaningful ddb output that way. There might be a way to get Linux to do the same thing, but I never managed it, since once it froze on Linux it was totally frozen.

Last edited by xse (2019-10-24 15:07:05)

AquaSZS · 2019-10-25 07:48:22

Somehow, my system has not frozen today after 6 hours of uptime, despite being more active for most of that time.

I noticed that, so far, all freezes happened when the system was not doing much in particular. I don't know enough about how kernels and OS work to make any intelligent diagnoses, but perhaps there is something wrong with how the kernel handles low processing load. Although this is a desktop PC, maybe some kind of power management went wrong?

schlitznase · 2019-10-25 08:10:37

Random freezes sound like an overheating CPU. Did you check your thermal sensors? Maybe the cooling has an issue.

AquaSZS · 2019-10-25 08:13:57

I am processing a lot of data right now and the cores average around 80C, which is high but still safe. There are no freezes.
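
For reference, a rough sketch of how temperatures like these can be read without extra tools, assuming the kernel exposes them under `/sys/class/thermal` (values there are in millidegrees Celsius). The function names are made up for illustration, and on some boards the per-core readings only show up through hwmon instead, so an empty result does not mean there are no sensors.

```python
from pathlib import Path


def millidegrees_to_celsius(raw):
    """Convert a sysfs temperature string (millidegrees C) to degrees C."""
    return int(raw.strip()) / 1000.0


def read_thermal_zones(base="/sys/class/thermal"):
    """Return {"thermal_zoneN:type": temp_in_C} for every readable zone."""
    temps = {}
    for zone in sorted(Path(base).glob("thermal_zone*")):
        try:
            kind = (zone / "type").read_text().strip()
            raw = (zone / "temp").read_text()
            temps[f"{zone.name}:{kind}"] = millidegrees_to_celsius(raw)
        except (OSError, ValueError):
            # Some zones cannot be read or report garbage; skip them.
            continue
    return temps


if __name__ == "__main__":
    for name, temp in read_thermal_zones().items():
        print(f"{name}: {temp:.1f} C")
```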

Ironically, freezes seem more likely when idle.

schlitznase · 2019-10-25 10:03:57

Did you try xse's suggestion? What happened there?

seth · 2019-10-25 12:56:41

AquaSZS wrote: "Ironically, freezes seem more likely when idle."

https://wiki.archlinux.org/index.php/In … ete_freeze

AquaSZS · 2019-10-25 13:00:21

Can this still apply if I am not on Baytrail?

seth · 2019-10-25 13:47:09

Unknown. But the pattern fits.
So I'd just try.

AquaSZS · 2019-10-30 05:02:51

Looks like Intel's C states were the problem. I have not had a freeze since setting max_cstate to 1 in kernel options. The only downside is that the idle CPU temperature is now ~50C rather than ~40C.
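
For anyone landing here later, a minimal sketch of how to confirm that the option actually reached the running kernel. It assumes the parameter was passed as `intel_idle.max_cstate=1`; the post only says "max_cstate", so the exact spelling is an assumption (`processor.max_cstate` is the other common form). The helper names are made up for illustration, and the whitespace split ignores quoted values that contain spaces.

```python
from pathlib import Path


def parse_cmdline(text):
    """Split a kernel command line into a {name: value} dict.

    Flags without '=' (e.g. 'quiet') map to None.
    """
    params = {}
    for token in text.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def max_cstate_settings(text):
    """Return only the *max_cstate parameters found on the command line."""
    return {
        name: value
        for name, value in parse_cmdline(text).items()
        if name.endswith("max_cstate")
    }


if __name__ == "__main__":
    # /proc/cmdline holds the parameters the running kernel was booted with.
    print(max_cstate_settings(Path("/proc/cmdline").read_text()))
```

An empty dict means no such parameter was passed at boot, for example because the bootloader config was edited but not regenerated.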

I am not sure why the problem only started after a kernel update, however. I have been using this set of hardware for much longer.

seth · 2019-10-30 06:56:46

µcode issue?
Did intel-ucode update when this happened? You could also just downgrade it.

AquaSZS · 2019-10-30 08:34:16

As it was last updated in September, a month before the freezes, intel-ucode doesn't seem to be the problem.

seth · 2019-10-30 13:40:17

And it also affects memtest… Do you have a parallel windows installation?

jrl · 2019-11-01 07:40:23

This might not be related to Intel. My system runs on AMD Ryzen with integrated graphics and I have been facing the same issue since yesterday: random system freezes that can only be resolved with a hard reboot, and nothing relevant in the journalctl logs.

seth · 2019-11-01 08:08:15

There are actually countless threads on "ryzen crashes" on this forum. A similar symptom doesn't mean it's the same problem.

AquaSZS · 2019-11-04 06:36:14

The freezes have restarted.

They seem more likely to happen after leaving the system logged off but idle over the weekend (done so that I can ssh into it from other computers). Something else to note is that there was a power outage at the computer's location yesterday. However, I believe a freeze happened before the outage (I could not ssh in).

At this point, I do not want to spend more time on this problem. I will be replacing the computer. All the parts except the PSU are fairly old anyway. Thanks for your input.

Last edited by AquaSZS (2019-11-04 06:41:04)

AquaSZS · 2019-11-04 09:41:25

For the record, here are the kernel warnings logged during boot. There are no warnings or errors (or any other log entries) before the freezes happen.

Currently using the latest kernel (5.3.8-arch1-1).

kernel: ACPI BIOS Warning (bug): 32/64X length mismatch in FADT/Pm1aEventBlock: 32/16 (20190703/tbfadt-564)
kernel: ACPI BIOS Warning (bug): 32/64X length mismatch in FADT/PmTimerBlock: 32/24 (20190703/tbfadt-564)
kernel: ACPI BIOS Warning (bug): Invalid length for FADT/Pm1aEventBlock: 16, using default 32 (20190703/tbfadt-669)
kernel: ACPI BIOS Warning (bug): Invalid length for FADT/PmTimerBlock: 24, using default 32 (20190703/tbfadt-669)
kernel: core: CPUID marked event: 'bus cycles' unavailable
kernel: MDS CPU bug present and SMT on, data leak possible. See https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/mds.html for more details.
kernel:  #2 #3 #4 #5 #6 #7
kernel: ata2.00: supports DRM functions and may not be fully accessible
kernel: ata2.00: supports DRM functions and may not be fully accessible
systemd-udevd[342]: /usr/lib/udev/rules.d/11-dm-lvm.rules:40 Invalid value for OPTIONS key, ignoring: 'event_timeout=180'
systemd-udevd[342]: /usr/lib/udev/rules.d/11-dm-lvm.rules:40 The line takes no effect, ignoring.
kernel: ACPI Warning: SystemIO range 0x0000000000000500-0x000000000000052F conflicts with OpRegion 0x0000000000000518-0x000000000000051B (\GPBL) (20190703/utaddress-204)
kernel: ACPI Warning: SystemIO range 0x0000000000000500-0x000000000000052F conflicts with OpRegion 0x000000000000050C-0x000000000000050F (\GPLV) (20190703/utaddress-204)
kernel: ACPI Warning: SystemIO range 0x0000000000000500-0x000000000000052F conflicts with OpRegion 0x0000000000000504-0x0000000000000507 (\GPIS) (20190703/utaddress-204)
kernel: ACPI Warning: SystemIO range 0x0000000000000500-0x000000000000052F conflicts with OpRegion 0x0000000000000500-0x0000000000000503 (\GPUS) (20190703/utaddress-204)
kernel: nvidia: loading out-of-tree module taints kernel.
kernel: nvidia: module license 'NVIDIA' taints kernel.
kernel: Disabling lock debugging due to kernel taint
kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module  435.21  Sun Aug 25 08:17:57 CDT 2019
kernel: platform regulatory.0: Direct firmware load for regulatory.db failed with error -2
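
A list like the one above can be pulled with journalctl's priority filter. Below is a small sketch that builds such a command for the current or a previous boot; the function name and defaults are assumptions, not the exact command used for this post. Looking at a previous boot (offset -1) only works if the journal is stored persistently.

```python
import subprocess


def journal_warnings_cmd(boot_offset=0, kernel_only=False):
    """Build a journalctl command listing warning-or-worse entries.

    boot_offset 0 is the current boot, -1 the boot before it, and so on.
    """
    cmd = [
        "journalctl",
        "--no-pager",
        f"--boot={boot_offset}",
        "--priority=warning",
    ]
    if kernel_only:
        cmd.append("--dmesg")  # restrict output to kernel messages
    return cmd


if __name__ == "__main__":
    # Show what the boot that ended in a freeze logged, if anything.
    subprocess.run(journal_warnings_cmd(boot_offset=-1), check=False)
```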

Last edited by AquaSZS (2019-11-04 09:42:20)
