#1 2020-04-03 00:52:13

RainmakerRaw
Member
Registered: 2015-03-30
Posts: 11

Threadripper kernel crash on linux 5.5.13 but not LTS

My desktop system is:

AMD Threadripper 3960X (24c48t)
Asus RoG Strix TRX40-E Gaming motherboard with the latest UEFI/BIOS
32GB (4x 8GB) TeamGroup 8-Pack Edition 3600MHz RAM (16-16-16-36 at 1.35V per DOCP/XMP)
Sapphire Pulse Vega 56 graphics
1TB Samsung Evo Plus NVMe drive
2TB Toshiba X300 high performance 7,200rpm HDD

Arch is on my mechanical drive, installed as per the Wiki on GPT/EFI with three partitions (500MB FAT32 EFI, 2GB swap, and a 1.8TB ext4 root).

I have been running foldingathome [aur] and boinc [community]. FaH has only a GPU slot, using amd-opencl [aur]. BOINC is capped at 90% of CPU cores with 100% CPU time available. The same setup works flawlessly on Windows 10 Pro for Workstations, but obviously I want to run Arch.

I have been having near-constant crashes on Arch in which the GUI freezes completely. The mouse can't move, I can't switch TTY, and there's no response whatsoever to the keyboard (e.g. CTRL+ALT+DEL, CTRL+ALT+BACKSPACE). It's just dead. If the display has timed out when it happens, it stays black. If it was at the desktop, the desktop is still visible without corruption, but frozen. However, I can still log in remotely via SSH using my gpg key, and everything is apparently still running behind the scenes. The freezes seem to be brought on, or at least happen sooner, when I'm using all of the CPU (e.g. running BOINC).

I hadn't been able to see anything useful in dmesg or journalctl until today:

Apr 02 18:13:45 arch-pc kernel: general protection fault: 0000 [#1] PREEMPT SMP NOPTI
Apr 02 18:13:45 arch-pc kernel: CPU: 39 PID: 1434 Comm: Xorg Tainted: G           OE     5.5.13-arch2-1 #1
Apr 02 18:13:45 arch-pc kernel: Hardware name: System manufacturer System Product Name/ROG STRIX TRX40-E GAMING, BIOS 0902 03/17/2020
Apr 02 18:13:45 arch-pc kernel: RIP: 0010:__ksize+0x4c/0xe0
Apr 02 18:13:45 arch-pc kernel: Code: 00 80 48 2b 05 ad 62 f5 00 48 01 f8 48 c1 e8 0c 48 c1 e0 06 48 03 05 8b 62 f5 00 48 8b 50 08 48 8d 4a ff 83 e2 01 48 0f 45 c1 <48> 8b 48 08 48 8d 51 ff 83 e1 01 48 0f 44 d0 48 8b 12 80 e6 02 74
Apr 02 18:13:45 arch-pc kernel: RSP: 0018:ffffb1fbc207fb60 EFLAGS: 00010202
Apr 02 18:13:45 arch-pc kernel: RAX: bfffde1a5f6f3200 RBX: ffff9ff85bccfa00 RCX: bfffde1a5f6f3200
Apr 02 18:13:45 arch-pc kernel: RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff9ff8dbccfa00
Apr 02 18:13:45 arch-pc kernel: RBP: ffff9ff8a78324c0 R08: ffff9ff8a574e4c0 R09: ffff9ff8a9c06d80
Apr 02 18:13:45 arch-pc kernel: R10: 0000000000000001 R11: ffffb1fbc207fd98 R12: ffff9ff87b357700
Apr 02 18:13:45 arch-pc kernel: R13: 0000000000000000 R14: 0000000000400cc0 R15: 00000000ffffffff
Apr 02 18:13:45 arch-pc kernel: FS:  00007f77dc1fedc0(0000) GS:ffff9ff8ad9c0000(0000) knlGS:0000000000000000
Apr 02 18:13:45 arch-pc kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 02 18:13:45 arch-pc kernel: CR2: 00007fd7481cc000 CR3: 00000007efb7c000 CR4: 0000000000340ea0
Apr 02 18:13:45 arch-pc kernel: Call Trace:
Apr 02 18:13:45 arch-pc kernel:  __alloc_skb+0x9b/0x1d0
Apr 02 18:13:45 arch-pc kernel:  alloc_skb_with_frags+0x4d/0x1c0
Apr 02 18:13:45 arch-pc kernel:  sock_alloc_send_pskb+0x1f3/0x220
Apr 02 18:13:45 arch-pc kernel:  unix_stream_sendmsg+0x270/0x3f0
Apr 02 18:13:45 arch-pc kernel:  sock_sendmsg+0x5e/0x60
Apr 02 18:13:45 arch-pc kernel:  sock_write_iter+0x97/0x100
Apr 02 18:13:45 arch-pc kernel:  do_iter_readv_writev+0x166/0x1e0
Apr 02 18:13:45 arch-pc kernel:  do_iter_write+0x7d/0x190
Apr 02 18:13:45 arch-pc kernel:  vfs_writev+0xe0/0x130
Apr 02 18:13:45 arch-pc kernel:  do_writev+0xe6/0x120
Apr 02 18:13:45 arch-pc kernel:  do_syscall_64+0x4e/0x150
Apr 02 18:13:45 arch-pc kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Apr 02 18:13:45 arch-pc kernel: RIP: 0033:0x7f77dd0473fd
Apr 02 18:13:45 arch-pc kernel: Code: 28 89 54 24 1c 48 89 74 24 10 89 7c 24 08 e8 ca f5 f8 ff 8b 54 24 1c 48 8b 74 24 10 41 89 c0 8b 7c 24 08 b8 14 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 2f 44 89 c7 48 89 44 24 08 e8 fe f5 f8 ff 48
Apr 02 18:13:45 arch-pc kernel: RSP: 002b:00007ffed365d230 EFLAGS: 00000293 ORIG_RAX: 0000000000000014
Apr 02 18:13:45 arch-pc kernel: RAX: ffffffffffffffda RBX: 0000557ee1fe8430 RCX: 00007f77dd0473fd
Apr 02 18:13:45 arch-pc kernel: RDX: 0000000000000001 RSI: 00007ffed365d520 RDI: 0000000000000037
Apr 02 18:13:45 arch-pc kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001
Apr 02 18:13:45 arch-pc kernel: R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000000001
Apr 02 18:13:45 arch-pc kernel: R13: 00007ffed365d520 R14: 0000000000000000 R15: 00007f77dc1fed40
Apr 02 18:13:45 arch-pc kernel: Modules linked in: rfcomm iptable_mangle iptable_raw xt_connmark xt_mark ip6table_mangle xt_comment ip6table_raw wireguard(OE) ip6_udp_tunnel udp_tunnel cmac algif_hash algif_skcipher af_alg bnep nct6775 hwmon_vid nls_iso8859_1 nls_cp437 vfat fat snd_usb_audio snd_usbmidi_lib btusb iwlmvm snd_rawmidi btrtl amdgpu btbcm snd_seq_device mc btintel mac80211 joydev edac_mce_amd input_leds bluetooth kvm_amd kvm libarc4 snd_hda_codec_hdmi ecdh_generic ecc snd_hda_intel irqbypass mousedev gpu_sched snd_intel_dspcfg ttm snd_hda_codec iwlwifi crct10dif_pclmul drm_kms_helper snd_hda_core snd_hwdep snd_pcm snd_timer snd crc32_pclmul r8169 igb cfg80211 ghash_clmulni_intel soundcore syscopyarea realtek aesni_intel sysfillrect sysimgblt i2c_algo_bit fb_sys_fops crypto_simd sp5100_tco libphy ccp cryptd dca glue_helper k10temp i2c_piix4 pcspkr tpm_crb tpm_tis tpm_tis_core eeepc_wmi evdev tpm asus_wmi pinctrl_amd battery rng_core mac_hid sparse_keymap rfkill acpi_cpufreq wmi_bmof mxm_wmi
Apr 02 18:13:45 arch-pc kernel:  nf_log_ipv4 nf_log_common ipt_REJECT nf_reject_ipv4 xt_LOG xt_recent xt_limit xt_addrtype xt_tcpudp xt_conntrack ip6table_filter ip6_tables nf_conntrack_netbios_ns nf_conntrack_broadcast nf_nat_ftp nf_nat nf_conntrack_ftp nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c iptable_filter vboxnetflt(OE) vboxnetadp(OE) vboxdrv(OE) crypto_user wmi drm agpgart ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 uas usb_storage hid_generic usbhid hid crc32c_intel xhci_pci xhci_hcd
Apr 02 18:13:45 arch-pc kernel: ---[ end trace f4a27dad9f9402ab ]---
Apr 02 18:13:45 arch-pc kernel: RIP: 0010:__ksize+0x4c/0xe0
Apr 02 18:13:45 arch-pc kernel: Code: 00 80 48 2b 05 ad 62 f5 00 48 01 f8 48 c1 e8 0c 48 c1 e0 06 48 03 05 8b 62 f5 00 48 8b 50 08 48 8d 4a ff 83 e2 01 48 0f 45 c1 <48> 8b 48 08 48 8d 51 ff 83 e1 01 48 0f 44 d0 48 8b 12 80 e6 02 74
Apr 02 18:13:45 arch-pc kernel: RSP: 0018:ffffb1fbc207fb60 EFLAGS: 00010202
Apr 02 18:13:45 arch-pc kernel: RAX: bfffde1a5f6f3200 RBX: ffff9ff85bccfa00 RCX: bfffde1a5f6f3200
Apr 02 18:13:45 arch-pc kernel: RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff9ff8dbccfa00
Apr 02 18:13:45 arch-pc kernel: RBP: ffff9ff8a78324c0 R08: ffff9ff8a574e4c0 R09: ffff9ff8a9c06d80
Apr 02 18:13:45 arch-pc kernel: R10: 0000000000000001 R11: ffffb1fbc207fd98 R12: ffff9ff87b357700
Apr 02 18:13:45 arch-pc kernel: R13: 0000000000000000 R14: 0000000000400cc0 R15: 00000000ffffffff
Apr 02 18:13:45 arch-pc kernel: FS:  00007f77dc1fedc0(0000) GS:ffff9ff8ad9c0000(0000) knlGS:0000000000000000
Apr 02 18:13:45 arch-pc kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 02 18:13:45 arch-pc kernel: CR2: 00007fd7481cc000 CR3: 00000007efb7c000 CR4: 0000000000340ea0

As you can see, this shows a general protection fault, which I believe means the fault is in the kernel itself? Having finally spotted this in the journal, as a last resort to stop the freezes I installed linux-lts and linux-lts-headers, updated GRUB and rebooted. Since then I've been running BOINC at full tilt while FaH hammers the GPU, and everything works perfectly. No more crashes (touch wood!) in over six hours. On the main kernel I couldn't make it 20 minutes without a crash.

I had originally spent the last 24h suspecting the AMD OpenCL driver, as apparently the amdgpu-pro install pulls in some related amdgpu-pro-libGL stuff which is causing similar graphical crashes for people. I switched from opencl-amdgpu-propal [aur] (and the other modules it pulls in) to plain opencl-amd [aur], but it was still crashing. That's when I found the (kernel?) general protection fault and realised that switching to the LTS kernel had fixed things.

Does the log output give anyone more of a clue than it gives me? Is there anything known about this issue? Is there anything I can do about it except stay on LTS? Obviously with such a new chip I want the newest kernel possible for optimisations. Thanks in advance.
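
In case it helps anyone reading the trace, here is a rough Python sketch of how I'd pull the faulting symbol out of the journal text and flag register values that are not canonical x86-64 addresses. On x86-64, dereferencing a non-canonical address raises a general protection fault rather than a page fault, so such a value is worth a look. The helper names (is_canonical, faulting_symbol, non_canonical_registers) are my own, and the 48-bit virtual address width is an assumption (4-level paging). The sample lines are copied from the log above. This only points at suspicious values; it does not identify the root cause.

```python
import re

# A few lines copied from the journal output above (timestamps stripped).
OOPS = """\
RIP: 0010:__ksize+0x4c/0xe0
RAX: bfffde1a5f6f3200 RBX: ffff9ff85bccfa00 RCX: bfffde1a5f6f3200
RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff9ff8dbccfa00
RBP: ffff9ff8a78324c0 R08: ffff9ff8a574e4c0 R09: ffff9ff8a9c06d80
"""


def is_canonical(addr, va_bits=48):
    """Return True if bits 63..(va_bits - 1) of addr are all equal.

    va_bits=48 is an assumption (4-level paging).
    """
    top = addr >> (va_bits - 1)
    all_ones = (1 << (64 - va_bits + 1)) - 1
    return top == 0 or top == all_ones


def faulting_symbol(text):
    """Return the kernel symbol named on the first kernel-mode RIP line."""
    match = re.search(r"RIP: 0010:(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+", text)
    return match.group(1) if match else None


def non_canonical_registers(text):
    """Map register name to hex value for every non-canonical 64-bit value."""
    regs = re.findall(r"\b(R[A-Z0-9]{2}): ([0-9a-f]{16})\b", text)
    return {name: value for name, value in regs
            if not is_canonical(int(value, 16))}


if __name__ == "__main__":
    print("faulting symbol:", faulting_symbol(OOPS))
    for name, value in non_canonical_registers(OOPS).items():
        print(name, "holds a non-canonical value:", value)
```

Run against the lines above, this flags RAX and RCX, which both hold bfffde1a5f6f3200, while the other pointers in the dump start with ffff and pass the check.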
