#1 2020-03-24 10:15:54

Lonje
Member
Registered: 2013-11-15
Posts: 10

System freezes unrecoverably

Hi. First of all, sorry for the vague subject. But that's about as specific as I could make it without making it too long. Basically, my PC just freezes at more-or-less random intervals (usually within 30 minutes of booting, but not always). A freeze looks like this: the screen still displays everything just fine, but frozen. Like when you pause a video and it just shows a still image. No inputs have any effect. I can't switch to a TTY or use a SysRq, and I can't SSH into the system, either. The only thing left to do is reset the system entirely.

I initially thought that this has something to do with KDE/Plasma, but I've since had the freeze occur also in i3. When booting into a terminal and having no GUI running, I haven't had it occur, though I also haven't spent much time in that kind of UI.
The reason I thought it was a KDE issue is that I had an i3 session running overnight, and the freeze did not occur. However, when I was working this morning in an i3 session, it occurred again. It seems that the more programs are running (browsers, terminals, etc.), the more likely it is this occurs.
In general, though, I have not been able to identify any kind of consistent pattern. The only consistency is that it always occurs eventually.

At this point, I'm fairly sure that this is a hardware issue, and I'm posting this topic in the hope that someone can help me identify which component is at fault. Let's give this post some structure:

Meta

The machine runs an up-to-date Arch Linux:

5.5.11-arch1-1 #1 SMP PREEMPT Sun, 22 Mar 2020 16:33:15 +0000 x86_64

I also have a laptop that runs an up-to-date Arch Linux and does not have any problems at all.

The machine is relatively new; I built it less than two weeks ago.

Hardware of the machine in question:

  • CPU: AMD Ryzen 5 3600X

  • GPU: AMD Radeon RX 5700 XT

  • RAM: 32 GB DDR4

  • Motherboard: B450 Aorus Elite

I have AMD microcode updates installed. I also have the mesa packages and the xf86-video-amdgpu driver installed.
My root filesystem is btrfs, but my boot partition is a "normal" FAT32 EFI system partition. I'm booting in EFI mode, using the rEFInd boot manager.

What I've tried
  • One of the first things I did was run a memtest. I stopped this after one pass because it took a long time, and I needed to use the PC again. It did not find any errors. I'm currently running another memtest, and am planning to let that one complete without stopping it myself.

  • Use a different kernel. I've tried with the linux-lts kernel, and the freeze still occurred. The kernel has also been updated through normal pacman -Syu at some point, and the issue was not affected.

  • Updated my BIOS -- it's at the newest firmware that the vendor provides.

Weird things I've noticed
  • The frequency at which the issue occurs seems to have increased. Like I said, I've only had the PC for about two weeks, and the crashes went from occasional to usually occurring within maybe 30 minutes to an hour of booting.

  • The system is generally a bit unstable. Next to the complete freezes, I've also had programs crash fairly frequently, e.g. KWin, Firefox, Chromium. Software that does not crash on my laptop, even though it's the same OS and the same versions. I'm not experiencing any kind of stuttering or other types of performance issues, just crashes and the mysterious freeze.

  • Weird IRQ-related crashes in the kernel log. I can upload more system logs later (memtest is still running -- bad timing on my part, sorry), but I've been getting call traces and whatnot in dmesg output, sometimes just before a freeze, and sometimes the freeze occurs quite a bit later. Sometimes there are no errors in the kernel log, and the freeze still occurs anyway.

  • The freeze did not occur when I let the system run overnight in an i3 session. I even had glxgears from the mesa-demos package running to generate some GPU load, and it was fine this morning. Then, after experimenting some more, I was running an i3 session again after a fresh boot, and the freeze occurred within less than an hour of uptime. I was running some terminals and a browser (which, interestingly, crashed some time before the freeze occurred). This supports my thesis that the more software is running, the more likely the crash, I guess.

Logs

Here's a system log I uploaded to pastebin that ends pretty much when the freeze occurred. To my eyes, there's nothing that might indicate what the problem is, but maybe someone here sees something: https://pastebin.com/WHMT91Cd
It does not contain a crash in kernel code, which I sometimes see. Again, when the memtest is done, I'll try to find a log that contains that and upload it as well.
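
Since the freeze forces a hard reset, the interesting kernel messages end up in the journal of the previous boot rather than the current one. A minimal sketch of pulling them out, assuming a persistent journal (/var/log/journal exists) and that the last lines before the freeze actually reached the disk, which is not guaranteed. The function names are made up for illustration:

```python
import subprocess


def previous_boot_kernel_log_cmd(boots_back=1, priority="warning"):
    """Build a journalctl command for kernel messages of an earlier boot.

    boots_back=1 means the boot before the current one.
    """
    return [
        "journalctl",
        "--no-pager",
        "--dmesg",                 # kernel messages only
        f"--boot=-{boots_back}",   # relative boot offset
        f"--priority={priority}",  # this priority and more severe
    ]


def fetch_previous_boot_kernel_log(boots_back=1, priority="warning"):
    """Run journalctl and return its output as text (empty if nothing was logged)."""
    result = subprocess.run(
        previous_boot_kernel_log_cmd(boots_back, priority),
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout
```

Raising boots_back walks further back through earlier boots, which helps when several freezes happened in a row.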

Please help

I hope I've kept this topic somewhat readable. I'll be very, very thankful for any ideas and any input at all. I feel like I'm losing my mind here. So thanks very much in advance. And sorry about the missing logs, I'll try to upload some when memtest is done and if the machine will let me without freezing first.

#2 2020-03-24 11:51:29

xerxes_
Member
Registered: 2018-04-29
Posts: 253

Re: System freezes unrecoverably

When you say that one pass of memtest was "long", how long did it take? I think you should do at least 3 passes to be sure. My RAM memtest took 4 hours for 3 passes. It was long, but I did it once and now I'm sure that my memory is OK (no errors).

In the journalctl log I don't see anything suspicious except the plasmashell crash, but that shouldn't freeze/hang the whole system (you should be able to switch to a text console with Ctrl+Alt+Fn). Is this the full log from that boot (up to the end)?

My guess would be that it might be something related to your GPU. The Radeon RX 5700 is quite new and there may be some driver issues. Post which GPU driver you use (lspci -nnk) and your Xorg log (if you use an X server).
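
The relevant part of the lspci -nnk output is the "Kernel driver in use" line under the display device. A small sketch that picks it out of the full output, assuming the usual layout of one unindented header line per device followed by indented detail lines. The function name is made up for illustration:

```python
def gpu_drivers(lspci_nnk_output):
    """Return [(device_header, kernel_driver_or_None)] for display devices."""
    display_classes = (
        "VGA compatible controller",
        "3D controller",
        "Display controller",
    )
    devices = []
    current = None  # the display device whose detail lines are being read
    for line in lspci_nnk_output.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace():
            # Unindented line: header of a new device.
            if any(cls in line for cls in display_classes):
                current = [line.strip(), None]
                devices.append(current)
            else:
                current = None
        elif current is not None and "Kernel driver in use:" in line:
            current[1] = line.split(":", 1)[1].strip()
    return [tuple(device) for device in devices]
```

The text to feed in can be captured with subprocess.run(["lspci", "-nnk"], capture_output=True, text=True).stdout. A driver of None means no kernel driver is bound to the device.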

And look here: https://bbs.archlinux.org/viewtopic.php?id=253848. Also search the forum for the string '5700'.

Last edited by xerxes_ (2020-03-24 11:56:57)

#3 2020-03-24 12:03:27

Lonje
Member
Registered: 2013-11-15
Posts: 10

Re: System freezes unrecoverably

xerxes_ wrote:

When you say that one pass of memtest was "long", how long did it take? I think you should do at least 3 passes to be sure. My RAM memtest took 4 hours for 3 passes. It was long, but I did it once and now I'm sure that my memory is OK (no errors).

In the journalctl log I don't see anything suspicious except the plasmashell crash, but that shouldn't freeze/hang the whole system (you should be able to switch to a text console with Ctrl+Alt+Fn). Is this the full log from that boot (up to the end)?

My guess would be that it might be something related to your GPU. The Radeon RX 5700 is quite new and there may be some driver issues. Post which GPU driver you use (lspci -nnk) and your Xorg log (if you use an X server).

And look here: https://bbs.archlinux.org/viewtopic.php?id=253848. Also search the forum for the string '5700'.

Thanks for the reply! I stopped the initial memtest after maybe 1h 30m because I needed to use the machine again. Currently running a full memtest that I won't interrupt until it completes. Looks like that will take about 4 hours as well.

Yeah, the plasmashell crash was recoverable. That's one of those crashes that just kind of happen, not "The Freeze" that the system cannot recover from. I will post the information you requested as soon as the memtest is done.

I had found the topic you linked, too, but thought that that's a different issue, since I was able to run some games that put full load on the GPU without the system freezing. Though I suppose since I don't have any better ideas I will try the proposed solution. I can also try ripping out my GPU and using an older nvidia GPU. Will do all of that later today, when the memtest is done and I'm done with work. Thanks again!

#4 2020-03-24 14:55:40

Lonje
Member
Registered: 2013-11-15
Posts: 10

Re: System freezes unrecoverably

Okay, so memtest is still running, and hasn't found any errors so far. However, this happened at some point during its run: https://imgur.com/a/mOUeb55
Basically, the UI of memtest got corrupted. It has since updated itself so that I can still make out the error count, which is still at zero. Still, the UI remains corrupted, similar to what can be seen in the image.

#5 2020-03-24 22:09:10

xerxes_
Member
Registered: 2018-04-29
Posts: 253

Re: System freezes unrecoverably

I forgot to ask earlier: which version of memtest is that, and where is it from? Is it from the Arch repository, built into the UEFI, or downloaded from some site, written to a CD/USB stick, and booted from there?

I run my memtest from the Arch repo (memtest86+), which was installed into the GRUB menu. When memtest starts, press F2 to turn on SMT (up to 32 cores).
I don't trust memtests built into the UEFI.

Last edited by xerxes_ (2020-03-24 22:10:24)

#6 2020-03-25 06:57:39

Lonje
Member
Registered: 2013-11-15
Posts: 10

Re: System freezes unrecoverably

I was using this memtest: https://aur.archlinux.org/packages/memtest86-efi/ -- version 8.3-2

Some logs

Full system log from a session that froze: https://pastebin.com/0KeEBb1Y/?e=1
Just the kernel log with a crash near the end from a session that froze: https://pastebin.com/DsWm9zXm

Other things I've tried
  • I've replaced my PSU -- still freezes

  • I've replaced the GPU with an nvidia one -- still freezes

  • I've disconnected an older SSD that I once had trouble with on Windows -- still freezes

The only things I haven't replaced are RAM, CPU, mainboard, and two M.2 SSDs. The memtest shows that the RAM should be fine. I guess next would be getting my hands on a different mainboard and seeing if that fixes it?

#7 2020-03-25 08:32:59

Lonje
Member
Registered: 2013-11-15
Posts: 10

Re: System freezes unrecoverably

Okay, uhm, I just realized that I'm a massive idiot. The CPU fan was not connected. I remember doing some cable management a few days after first assembling the machine, and I re-routed the cable to reduce some clutter. Obviously, I forgot to plug it back in. Did that now, so far no new freezes.
I am surprised, though, that the firmware didn't stop the machine turning on when it didn't detect a CPU fan. And I'm also a bit puzzled that the machine simply froze instead of shutting down. I assume that this always happened when the CPU got too hot. Then again, I never did any super demanding tasks, and obviously I had a heat sink sitting on the CPU, in addition to three case fans that were definitely running.
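
One way to check whether heat is involved is to read the temperature sensors the kernel exposes under /sys/class/hwmon (the k10temp and it87 modules both appear in the module list of the trace below). A rough sketch, assuming the standard hwmon sysfs layout with temp*_input files holding millidegrees Celsius. The function name is made up for illustration:

```python
from pathlib import Path


def read_hwmon_temps(root="/sys/class/hwmon"):
    """Return {(chip_name, sensor_label): degrees_celsius} for all temp*_input files."""
    temps = {}
    for chip in sorted(Path(root).glob("hwmon*")):
        name_file = chip / "name"
        chip_name = name_file.read_text().strip() if name_file.exists() else chip.name
        for sensor in sorted(chip.glob("temp*_input")):
            # temp1_input may have a matching temp1_label with a readable name.
            label_file = sensor.with_name(sensor.name.replace("_input", "_label"))
            label = label_file.read_text().strip() if label_file.exists() else sensor.name
            try:
                millidegrees = int(sensor.read_text().strip())
            except (OSError, ValueError):
                continue  # some sensors fail to read or report garbage
            temps[(chip_name, label)] = millidegrees / 1000.0
    return temps


if __name__ == "__main__":
    for (chip, label), celsius in read_hwmon_temps().items():
        print(f"{chip:12s} {label:14s} {celsius:6.1f} C")
```

Running this in a loop while the stress command is active would show whether the temperature climbs towards the CPU's limit before a freeze.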

Furthermore, I'm still getting ominous errors in dmesg, e.g.

[ 1850.192409] ------------[ cut here ]------------
[ 1850.192411] corrupted preempt_count: htop/1169/0x1
[ 1850.192419] WARNING: CPU: 0 PID: 1169 at kernel/sched/core.c:3203 finish_task_switch+0x25d/0x270
[ 1850.192419] Modules linked in: rfkill it87 hwmon_vid nvidia_drm(POE) nvidia_modeset(POE) edac_mce_amd snd_hda_codec_hdmi nls_iso8859_1 nls_cp437 nvidia(POE) kvm vfat fat snd_hda_codec_realtek irqbypass snd_hda_codec_generic ledtrig_audio drm_kms_helper snd_hda_intel snd_intel_dspcfg snd_hda_codec wmi_bmof crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_core ipmi_devintf snd_hwdep r8169 ipmi_msghandler snd_pcm syscopyarea realtek sysfillrect aesni_intel snd_timer snd ccp crypto_simd sysimgblt cryptd sp5100_tco glue_helper pcspkr i2c_piix4 k10temp joydev mousedev libphy input_leds soundcore rng_core fb_sys_fops wmi evdev gpio_amdpt mac_hid pinctrl_amd acpi_cpufreq drm crypto_user agpgart ip_tables x_tables hid_generic usbhid hid btrfs blake2b_generic libcrc32c crc32c_generic xor raid6_pq ahci libahci crc32c_intel libata xhci_pci xhci_hcd scsi_mod
[ 1850.192449] CPU: 0 PID: 1169 Comm: htop Tainted: P           OE     5.5.11-arch1-1 #1
[ 1850.192450] Hardware name: Gigabyte Technology Co., Ltd. B450 AORUS ELITE/B450 AORUS ELITE, BIOS F51 12/18/2019
[ 1850.192452] RIP: 0010:finish_task_switch+0x25d/0x270
[ 1850.192453] Code: ff 65 48 8b 04 25 c0 8b 01 00 8b 90 08 05 00 00 48 8d b0 c0 06 00 00 48 c7 c7 a8 76 50 b2 c6 05 9d 8f 45 01 01 e8 45 09 fd ff <0f> 0b e9 68 ff ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 90 0f 1f 44
[ 1850.192453] RSP: 0018:ffff8ef188ea7988 EFLAGS: 00010082
[ 1850.192454] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[ 1850.192454] RDX: 0000000000000002 RSI: 0000000000000082 RDI: 00000000ffffffff
[ 1850.192455] RBP: ffff8ef188ea79c0 R08: 000000000000048b R09: 0000000000000004
[ 1850.192455] R10: 0000000000000000 R11: 0000000000000001 R12: ffff8ad29e82ce40
[ 1850.192456] R13: ffffffffb2814780 R14: ffff8ad242159100 R15: ffff8ad29e82ce40
[ 1850.192457] FS:  00007f98dff1a740(0000) GS:ffff8ad29e800000(0000) knlGS:0000000000000000
[ 1850.192457] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1850.192458] CR2: 000023781d49c000 CR3: 000000079206e000 CR4: 0000000000340ef0
[ 1850.192458] Call Trace:
[ 1850.192462]  ? __switch_to+0x10d/0x440
[ 1850.192465]  ? __switch_to_asm+0x34/0x70
[ 1850.192469]  __schedule+0x2f0/0x7a0
[ 1850.192471]  schedule+0x46/0xf0
[ 1850.192472]  schedule_hrtimeout_range_clock+0xa5/0x120
[ 1850.192474]  ? hrtimer_init_sleeper+0xa0/0xa0
[ 1850.192477]  poll_schedule_timeout.constprop.0+0x42/0x70
[ 1850.192478]  do_sys_poll+0x411/0x540
[ 1850.192481]  ? poll_select_finish+0x280/0x280
[ 1850.192482]  ? poll_select_finish+0x280/0x280
[ 1850.192486]  ? ktime_get_ts64+0x46/0xe0
[ 1850.192487]  __x64_sys_poll+0xaf/0x140
[ 1850.192489]  do_syscall_64+0x4e/0x150
[ 1850.192491]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 1850.192492] RIP: 0033:0x7f98e0035a87
[ 1850.192493] Code: 00 00 00 5b 49 8b 45 10 5d 41 5c 41 5d 41 5e c3 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 07 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 89 54 24 1c 48 89 74 24 10
[ 1850.192494] RSP: 002b:00007ffe23ce6fd8 EFLAGS: 00000246 ORIG_RAX: 0000000000000007
[ 1850.192495] RAX: ffffffffffffffda RBX: 00007ffe23ce7020 RCX: 00007f98e0035a87
[ 1850.192495] RDX: 00000000000005dc RSI: 0000000000000001 RDI: 00007ffe23ce7020
[ 1850.192495] RBP: 00007ffe23ce7010 R08: 00007ffe23d83080 R09: 00007ffe23ce6fb8
[ 1850.192496] R10: 00007ffe23ce6fb0 R11: 0000000000000246 R12: 00000000000005dc
[ 1850.192496] R13: 00005610b0b512a0 R14: 00000000000c6d1f R15: 00005610b0b512a0
[ 1850.192499] ---[ end trace c68474a0c31b2ffa ]---
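
Traces like the one above are easier to compare across boots when each is reduced to a single line. A small sketch that splits a kernel log into its "cut here" to "end trace" blocks and pulls the CPU, PID, source location, and function out of the WARNING line. The regular expressions are written against the format shown above and may need adjusting for other kernel versions, and the function names are made up for illustration:

```python
import re

CUT_HERE = re.compile(r"-+\[ cut here \]-+")
END_TRACE = re.compile(r"---\[ end trace [0-9a-f]+ \]---")
WARNING = re.compile(r"WARNING: CPU: (\d+) PID: (\d+) at (\S+) (\S+)")


def extract_reports(log_text):
    """Return a list of reports, each a list of the lines from 'cut here' to 'end trace'."""
    reports = []
    current = None  # lines of the report being collected, if any
    for line in log_text.splitlines():
        if CUT_HERE.search(line):
            current = [line]
        elif current is not None:
            current.append(line)
            if END_TRACE.search(line):
                reports.append(current)
                current = None
    return reports


def summarize(report):
    """Return (cpu, pid, source_location, function) from the WARNING line, or None."""
    for line in report:
        match = WARNING.search(line)
        if match:
            cpu, pid, location, function = match.groups()
            return int(cpu), int(pid), location, function
    return None
```

Feeding it the dmesg output of several sessions shows whether the warnings keep pointing at the same function or land in unrelated places.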

Is it possible I did permanent damage to the CPU? Running the stress command works fine with 12 CPU workers, and the system seems stable so far, but I'm worried about the continued mysterious kernel-space crashes.

Edit: just got another freeze... does anyone have an opinion on which component is more likely at fault, CPU or motherboard?

Edit 2:

Possible solution found

So I googled the issue and also included "Ryzen" in the search query, and found this: https://www.reddit.com/r/Amd/comments/c … ll_of_zen/

It essentially states that this is a known issue and just kind of happens with Ryzen CPUs. The solution is also described in that post (emphasis mine):

The solution is to set 'Power Supply Idle Control' (can be found in AMD CBS -> CPU/Zen Options in BIOS) to 'Typical current idle', from 'Auto' or 'Low current idle'.

Note that for me, this option was hidden away in a different path in my BIOS settings, but the name of the option ("Power Supply Idle Control") was the same.

So far, my machine has been running for about 3h 30m without the freeze occurring, which is a new record. I will give it some time and then update this thread again if I'm either fairly sure that this was indeed the issue and that it is now fixed, or if the freeze occurs again...
One thing to note is that I have another Ryzen PC with a 2400G (I think -- something in the 2000s with a G at the end), and at work two other people and I have workstations with Ryzen 2700X CPUs, and I've never encountered this issue before. I find that curious, since allegedly, all Ryzens are affected, and the X's are supposedly those that are most severely affected. Anyway, we'll see.

Edit 3: it froze again.

Last edited by Lonje (2020-03-25 19:37:38)

#8 2020-03-25 19:50:42

xerxes_
Member
Registered: 2018-04-29
Posts: 253

Re: System freezes unrecoverably

But did it survive longer? If yes, this might have something to do with power-saving states; try disabling the deep (or all) power-saving states (the C-states, and maybe the P-states) to see if that helps. Did you run some program to keep the processor from going to sleep?

#9 2020-03-26 08:06:32

Lonje
Member
Registered: 2013-11-15
Posts: 10

Re: System freezes unrecoverably

xerxes_ wrote:

But did it survive longer? If yes, this might have something to do with power-saving states; try disabling the deep (or all) power-saving states (the C-states, and maybe the P-states) to see if that helps. Did you run some program to keep the processor from going to sleep?

It did survive longer yesterday. Today I had two more freezes, each within maybe 10 minutes of booting, so I'm not sure that's it. I was running KDE Plasma and Firefox and some other programs most of the time, plus I had processor.max_cstate=5 in my kernel command line, so I dunno... I'm trying max_cstate=1 now.
I'm seriously considering trying to return my CPU, cooler, and motherboard, and going with Intel instead. What the hell is going on?
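
To confirm that a parameter such as processor.max_cstate really reached the running kernel (and was not lost in the boot manager configuration), /proc/cmdline can be checked. A minimal sketch, assuming no quoted values that contain spaces. The helper names are made up for illustration:

```python
def parse_cmdline(cmdline):
    """Parse a kernel command line into a dict; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def running_kernel_param(name, path="/proc/cmdline"):
    """Return the value of one parameter on the running kernel's command line.

    Returns None if the parameter is absent or is a bare flag.
    """
    with open(path) as f:
        return parse_cmdline(f.read()).get(name)
```

For example, running_kernel_param("processor.max_cstate") would return the string "1" if the setting was applied, and None if it never made it onto the command line.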

Oh, I also did what Ropid said here:

Ropid wrote:

Because you use 32GB RAM, there's also another thing you should know about: there's "resistances" for the signal on the wiring that can be tweaked on Ryzen. The defaults for those resistances often don't work for high memory speeds for systems that use all four memory slots, or that use "dual-rank" memory sticks (that's most 16GB sized sticks). Here's what to try with dual-rank sticks or with four sticks:

ProcODT = 48 Ohm or 53 Ohm or 60 Ohm
RttNom = off
RttWr = RZQ/3 (80 Ohm)
RttPark = RZQ/1 (240 Ohm)

The important ones here are the RttPark and RttWr settings.

But to no avail.
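
For reference, the RZQ/n notation in those settings is a fraction of the 240 Ohm reference resistor used by DDR4, which is where the Ohm values in parentheses come from. As a quick check of the arithmetic (the function name is made up for illustration):

```python
RZQ_OHMS = 240  # DDR4 reference resistor value


def rzq_to_ohms(divisor):
    """Convert an RZQ/divisor setting to its resistance in Ohm."""
    return RZQ_OHMS / divisor


# The two settings quoted above.
rtt_wr = rzq_to_ohms(3)    # RZQ/3
rtt_park = rzq_to_ohms(1)  # RZQ/1
```

This only helps with BIOS menus that list the options in Ohm rather than as RZQ fractions, or the other way round.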

I think maybe I'll try contacting my motherboard's manufacturer and see if they can help -- but I doubt they'll try very hard as soon as they hear the word "Linux".

#10 2020-03-26 12:09:32

xerxes_
Member
Registered: 2018-04-29
Posts: 253

Re: System freezes unrecoverably

I was thinking of disabling the power-saving states in the UEFI rather than via a kernel parameter, but anything that doesn't damage the hardware is worth a try...

About this new RAM "magic timing" I know nothing, because I have older and apparently less problematic hardware.

Check all the connections on the motherboard once more, check that all fans are working, and look through your motherboard manual for any causes of instability it might mention. Also look at the settings in the UEFI.

Edit:
How many RAM banks do you use? Maybe try moving the RAM to other banks, or to only the even or only the odd banks?

Last edited by xerxes_ (2020-03-26 12:40:28)
