Hi,
A few days ago I started playing a game again - Mass Effect 1 - that worked fine on my PC half a year ago. Now it triggers a complete freeze or lockup of the system. Usually after about 10 - 15 minutes of play the system freezes in place and nothing works. MagicSysRq doesn't work and the Num-Lock LED on my keyboard stays on. On reboot there are usually some ext4 errors that - so far - have always been fixable.
My last resort was setting
sysctl kernel.hardlockup_panic=1
but that didn't do anything either. I waited a solid minute for a kernel panic to occur, but nothing happened.
journalctl -b -1
does not show any entries from the time of the freeze. Not sure where to go from here. Shouldn't there always be at least some log entry?
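A read-only sketch for checking which lockup-related knobs the running kernel exposes. As far as I know, kernel.hardlockup_panic only has an effect while the hard lockup detector (kernel.nmi_watchdog) is active, and kernel.panic is the number of seconds the kernel waits before rebooting after a panic (0 means it waits forever). The function name show_knob and the PROC_SYS override are made up for this example; PROC_SYS only exists so the function can be pointed at a test directory.

```shell
# Print one sysctl value, or "unavailable" if the kernel does not expose it.
# PROC_SYS is a hypothetical override for testing; it defaults to /proc/sys.
show_knob() {
    path="${PROC_SYS:-/proc/sys}/$(printf '%s' "$1" | tr . /)"
    if [ -r "$path" ]; then
        printf '%s = %s\n' "$1" "$(cat "$path")"
    else
        printf '%s = unavailable\n' "$1"
    fi
}

# Knobs that decide whether a lockup turns into a panic (and a reboot).
for knob in kernel.nmi_watchdog kernel.hardlockup_panic \
            kernel.softlockup_panic kernel.panic; do
    show_knob "$knob"
done
```

If a knob shows up as unavailable, setting it with sysctl will not help, which would be one possible explanation for a panic never being triggered.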
* edit
The question in the title was solved by seth's recommendation to use kdump. However it is worth reading through the answers as there are some other tips that might work in other situations.
Last edited by Sidekick (2024-07-26 20:14:53)
This is interesting, as I have also been experiencing system freezes of late. I thought it was only due to me experimenting with VMs, but I have also had it happen when just browsing the web. Nothing shows up in my journal entries either.
Maybe it would be useful to compare what our hardware is?
I was hoping that someone else was also having this same issue, so that I know it's not just my system.
I had a feeling that it had something to do with my BIOS version, and I have since updated to the latest stable release for it. I've personally had no crash yet, but it's honestly hard to say, as the crash seems to be random as to when it actually happens.
Does this crash happen for you absolutely every time you play Mass Effect?
Having ext4 errors is to be expected when having to hard reset your system. I've been having the same issues.
I'm wondering if it's related to the following post: here. If so, it seems the issue may be resolved in the next kernel update.
Some help from someone more experienced would probably be helpful for both of us.
Last edited by Caelence (2024-07-25 07:00:25)
https://wiki.archlinux.org/title/Kdump
MagicSysRq doesn't work
Just to be sure, you explicitly enabled it beforehand?
Is this a ryzen CPU?
https://wiki.archlinux.org/title/Ryzen#System_halts
Generally you may want to detail the HW/config - there're also issues around the recent 55yxx nvidia drivers.
This is interesting, as I have also been experiencing system freezes of late. I thought it was only due to me experimenting with VMs, but I have also had it happen when just browsing the web. Nothing shows up in my journal entries either.
Maybe it would be useful to compare what our hardware is?
sure:
Machine:
MSI MAG X670E TOMAHAWK WIFI
CPU:
AMD Ryzen 5 7600X
GPU:
AMD Radeon 6800
Driver: amdgpu
I was hoping that someone else was also having this same issue, so that I know it's not just my system.
I had a feeling that it had something to do with my BIOS version, and I have since updated to the latest stable release for it. I've personally had no crash yet, but it's honestly hard to say, as the crash seems to be random as to when it actually happens.
Does this crash happen for you absolutely every time you play Mass Effect?
Having ext4 errors is to be expected when having to hard reset your system. I've been having the same issues.
I'm wondering if it's related to the following post: here. If so, it seems the issue may be resolved in the next kernel update.
Some help from someone more experienced would probably be helpful for both of us.
I'm not on the latest stable BIOS version for this mainboard, as they continue to struggle with, let's say, 'weird' issues and this one seems good to me. It is different from the one I was using half a year ago. As for the ext4 bug, I think that's the same as this one? In which case: yes, one of my ext4 partitions was affected by this. Once I had found out what the issue was I downgraded to 6.9.8-arch1-1 and the ext4 issue disappeared, but the crashing/locking up was not fixed by this.
As for whether it happens every time I play the game: almost, yes. In the game there are a bunch of 'mini games', and any time you interact with one a small window opens in the game and the mini game starts. Usually it freezes when I open one of those mini games. Alternatively it can happen when a loading screen appears, though that only happened twice.
https://wiki.archlinux.org/title/Kdump
MagicSysRq doesn't work
Just to be sure, you explicitly enabled it beforehand?
Is this a ryzen CPU?
https://wiki.archlinux.org/title/Ryzen#System_halts
Generally you may want to detail the HW/config - there're also issues around the recent 55yxx nvidia drivers.
MagicSysRq is always enabled on my system, but after the first three freezes I also added it as a parameter to the kernel options:
cat /etc/default/grub
# GRUB boot loader configuration
GRUB_DEFAULT=0
GRUB_TIMEOUT=5
GRUB_DISTRIBUTOR="Arch"
GRUB_CMDLINE_LINUX_DEFAULT="loglevel=3 quiet sysrq_always_enabled=1"
and yes I checked that it actually works by doing REISUB.
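For reference, the value in /proc/sys/kernel/sysrq is either 0 (off), 1 (everything allowed) or a bitmask of allowed functions. A minimal sketch for decoding it, assuming the bit values listed in the kernel's sysrq documentation; the function name sysrq_allows is just for illustration:

```shell
# Succeeds if the given kernel.sysrq value permits the function behind "bit".
sysrq_allows() {
    value=$1
    bit=$2
    # 1 is special: it enables every SysRq function.
    [ "$value" -eq 1 ] && return 0
    [ $(( value & bit )) -ne 0 ]
}

# Fall back to 0 if the file is missing (e.g. inside a minimal container).
current=$(cat /proc/sys/kernel/sysrq 2>/dev/null || echo 0)

# Bits needed for the individual REISUB keys.
for entry in "4:R (unraw keyboard)" "64:E and I (signal processes)" \
             "16:S (sync)" "32:U (remount read-only)" "128:B (reboot)"; do
    bit=${entry%%:*}
    label=${entry#*:}
    if sysrq_allows "$current" "$bit"; then
        echo "$label: allowed"
    else
        echo "$label: blocked"
    fi
done
```

If every line says "allowed" and REISUB still does nothing during the freeze, the keyboard interrupt is probably not being serviced at all.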
Yes, it is a Ryzen CPU, but part of the 7xxx series, not the 5xxx series, and as far as I understand the wiki article, that subsection talks about the 5xxx series CPUs. In any case this:
Reset button or forced shutdowns don't work, peripherals powers off, video output might stop, and unplugging is the only way out of this state.
shows that I'm not affected, as I can do a forced shutdown by long-pressing the power button - the only input that works - and my peripherals aren't powered off, as indicated by the glowing Num-Lock key that no longer reacts after the freeze/crash.
Last edited by Sidekick (2024-07-25 09:24:53)
Interesting. I've been running a MSI MPG X670E Carbon WiFi, and I've heard from others that it's also had a number of 'weird' issues. I can't really say that I've experienced those myself, though.
I happen to be running a 7900X, along with a Radeon 7900XTX.
I'm unsure if that is the same ext4 bug, but it's interesting that changing your kernel version never solved the issue.
I don't normally have my system lock up, but it does happen every once in a while. Normally when it does, I can actually move my mouse and see the cursor move on my screen. But no other input works, besides the power button.
I'd love to try and help further, but sadly, that's all I can really say. Hopefully someone with more knowledge can help you further.
Last edited by Caelence (2024-07-25 09:31:52)
You could also see if a makeshift remote log yields any results: https://wiki.archlinux.org/title/Genera … ng_freezes
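A rough sketch of such a makeshift remote log, to be run on a second machine. It assumes SSH access to the freezing box; the user name, host name and log file name in the usage comment are placeholders, and capture_log is a made-up helper. Each line is appended to the file as it arrives and stamped with the receiver's clock, so the file shows both what was received and when the stream went quiet.

```shell
# Append every incoming line to the given file, prefixed with the local time.
# Writing line by line means nothing is lost when the connection dies.
capture_log() {
    while IFS= read -r line; do
        printf '%s %s\n' "$(date +%H:%M:%S)" "$line" >> "$1"
    done
}

# Usage (placeholder user, host and file names):
#   ssh user@gaming-pc journalctl -f -o short-precise | capture_log freeze.log
```

Note that this can only capture what journald manages to emit; if the kernel stops dead, nothing more will arrive over the network either.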
@Sidekick, try this procedure and tell us if it works for you:
If the SysRq keys to sync, unmount the disks and reboot the system don't work and you can't do anything, wait over 90 seconds (the default systemd service timeout) and press the power button (but don't hold it), then wait another 30 seconds. If the system doesn't power off by itself, press and hold the power button until it turns off.
The reason for doing this is to check whether the system hung completely or not; if not, it may save the full journalctl log (which you can post here, so we can see what happened) and shut down gracefully.
I have had some situations like this where I thought the system had hung for good, but with this procedure the system powered off by itself and saved the journalctl log.
Interesting. I've been running a MSI MPG X670E Carbon WiFi, and I've heard from others that it's also had a number of 'weird' issues. I can't really say that I've experienced those myself, though.
I happen to be running a 7900X, along with a Radeon 7900XTX.
I'm unsure if that is the same ext4 bug, but it's interesting that changing your kernel version never solved the issue.
I don't normally have my system lock up, but it does happen every once in a while. Normally when it does, I can actually move my mouse and see the cursor move on my screen. But no other input works, besides the power button.
I'd love to try and help further, but sadly, that's all I can really say. Hopefully someone with more knowledge can help you further.
That is suspiciously close in hardware, but in my case it's completely locked up, as in I can't even move the mouse cursor. As for the kernel downgrade, just to be clear: the downgrade did solve the ext4 issue, however it did not solve the freezing issue. This was to be expected, as the freezing already happened before I updated to 6.10!
@gromit the SSH thing might work. After a full night of sleep I figured maybe I can launch the game in windowed mode, put a terminal with journalctl -f in the foreground and play until the game freezes. Obviously the advantage of using a remote SSH client would be that I can actually copy the log and have it in text form, so I'll try that instead.
@xerxes_ that is interesting to know. I'll try that as well!
*edit
As a side note, I just found out that my mkinitcpio.conf was an old one that had neither kms nor microcode included as hooks. I just fixed it. It may not change anything with regard to this freeze, but that was a bad oversight on my part.
Last edited by Sidekick (2024-07-25 12:02:04)
The SSH tip did not work. The last entry I got from the
journalctl -f
command on my laptop was the same entry that is now the last entry when I enter
journalctl -b -1
and that entry appeared 15 minutes before the freeze happened. The last couple entries are
Jul 25 21:22:38 sddm-helper-start-x11user[1434]: "(EE) event25 - Logitech Gaming Mouse G502: client bug: event processing lagging behind by 31ms, your system is too slow\n"
Jul 25 21:27:10 sddm-helper-start-x11user[1434]: "(EE) event25 - Logitech Gaming Mouse G502: client bug: event processing lagging behind by 33ms, your system is too slow\n"
Jul 25 21:39:06 sddm-helper-start-x11user[1434]: "(EE) event25 - Logitech Gaming Mouse G502: client bug: event processing lagging behind by 30ms, your system is too slow\n"
Jul 25 21:39:06 sddm-helper-start-x11user[1434]: "(EE) event25 - Logitech Gaming Mouse G502: WARNING: log rate limit exceeded (5 msgs per 60min). Discarding future messages.\n"
Also once the system froze I was not able to get another SSH connection and the one I had timed out.
These messages appeared every time I was in a loading screen. But again, the last entry is 15 minutes before the game crashed.
The tip @xerxes_ was talking about did not work either. Waited 90 seconds, pushed the power button, waited 30 seconds and nothing. I tried REISUB again as well with no effect.
There was also no further indication that the PC was busy. The HDD LED was dark and the fans of the CPU & GPU were, after a minute or so, spinning with the same noise they spin when the system is idle.
Last edited by Sidekick (2024-07-25 21:08:30)
If the CPU stalls or the kernel halts to the point where sysrq doesn't work, you're not getting any network activity, nor will systemd be able to intercept the short power button press.
If it's only the kernel, you'll have to set up a crash kernel (see the kdump link; the arch kernels have the relevant options set, so you don't need to build a kernel).
Regardless of what's in the wiki (though the limitation actually applies to the random reboots only), ryzen CPUs are generally subject to this.
Try to limit the c-states (this will prevent the CPU from stepping down entirely - the issue is frequently triggered by core-cycling)
processor.max_cstate=1
And, if the firmware allows it, add an offset to the curve optimizer.
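A minimal sketch of adding that parameter, assuming GRUB is the boot loader (as in the /etc/default/grub posted above). The helper name add_kernel_param is made up, and the demonstration runs against a scratch copy rather than the live file.

```shell
# Append a parameter to GRUB_CMDLINE_LINUX_DEFAULT unless it is already there.
add_kernel_param() {
    file=$1
    param=$2
    # Skip if the parameter is already present as a whole word.
    if grep '^GRUB_CMDLINE_LINUX_DEFAULT=' "$file" | grep -qwF -- "$param"; then
        return 0
    fi
    # Insert the parameter just before the closing quote of the value.
    sed "s|^\(GRUB_CMDLINE_LINUX_DEFAULT=\".*\)\"|\1 $param\"|" "$file" > "$file.new" &&
        mv "$file.new" "$file"
}

# Demonstration on a scratch copy, not on the live /etc/default/grub.
scratch=$(mktemp)
printf '%s\n' 'GRUB_CMDLINE_LINUX_DEFAULT="loglevel=3 quiet sysrq_always_enabled=1"' > "$scratch"
add_kernel_param "$scratch" processor.max_cstate=1
cat "$scratch"

# After editing the real file as root, regenerate the config and reboot:
#   grub-mkconfig -o /boot/grub/grub.cfg
# Then confirm the parameter is active with:
#   cat /proc/cmdline
```

Checking /proc/cmdline after the reboot is worth the extra step, since a parameter that never reached the kernel makes the whole test meaningless.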
Last but not least: temperatures while playing are ok-ish? The system doesn't overheat? Are you overclocking?
On the off-chance it's the GPU, see https://wiki.archlinux.org/title/AMDGPU
But those "complete freezes" usually only affect the graphics stack - nevertheless, do you have a spare GPU?
ryzen CPUs are generally subject to this
oh, that is good to know.
As for the rest:
* Would setting the CPU governor to performance do the same thing as setting the c-states?
* Temperatures are good, especially with Mass Effect 1. At least there are no temperature warnings in the journalctl log and the temperatures reported by smartd hover around 67°C, indicating that heat is properly vented.
* I agree with the GPU part, I do have a spare GPU so could give that a try.
* I'll first enable the crash kernel part and see if that results in something useful, thank you!
Last edited by Sidekick (2024-07-26 08:03:17)
Would setting the CPU governor to performance do the same thing as setting the c-states?
Technically no. It might achieve the same outcome, but I suggest drawing some hard lines to isolate the issue. If limiting the cstates doesn't help, running the system in performance mode likely won't either, but there's no guarantee that performance mode doesn't at some point perform an unfortunate core state cycle.
nb. that iff it's the CPU, kdump most likely won't work either.
Thank you for the explanation on the C-states, I don't think that is the problem here though.
kdump worked. I followed the instructions for the kdumpst AUR package. Something that is missing there is the information on how to enable it: the systemd service is not called kdumpst but kdumpst-init, and after enabling and starting it one needs to reboot the machine and check journalctl to make sure it is loaded before proceeding. After confirming in journalctl that it was properly loaded, I started the game. The second loading screen crashed my PC; after a couple of seconds my Num-Lock LED turned off on its own and after a few more seconds the system rebooted automatically. I checked the directories and yep, got a log. More importantly, I got an actual stacktrace:
[ 450.004842] [ C10] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 450.004846] [ C10] #PF: supervisor read access in kernel mode
[ 450.004848] [ C10] #PF: error_code(0x0000) - not-present page
[ 450.004849] [ C10] PGD 0 P4D 0
[ 450.004852] [ C10] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 450.004854] [ C10] CPU: 10 PID: 0 Comm: swapper/10 Kdump: loaded Not tainted 6.9.8-arch1-1 #1 94e07224523d037399a143bde732b0ca153835c7
[ 450.004856] [ C10] Hardware name: Micro-Star International Co., Ltd. MS-7E12/MAG X670E TOMAHAWK WIFI (MS-7E12), BIOS 1.A0 04/23/2024
[ 450.004858] [ C10] RIP: 0010:dcn10_set_drr+0xa6/0x100 [amdgpu]
[ 450.005060] [ C10] Code: 8b 07 48 85 c0 74 e0 48 8b 80 28 01 00 00 48 85 c0 74 08 4c 89 ee ff d0 0f 1f 00 45 84 e4 74 c7 48 8b 03 48 8b b8 f8 00 00 00 <48> 8b 07 48 8b 80 40 01 00 00 48 85 c0 74 ae 48 83 c3 08 ba 02 00
[ 450.005062] [ C10] RSP: 0018:ffffa0ccc0468da8 EFLAGS: 00010002
[ 450.005064] [ C10] RAX: ffff8c82ccc002d8 RBX: ffffa0ccc0468df8 RCX: 0000000000000000
[ 450.005065] [ C10] RDX: 0000000080010015 RSI: 0000000000004ff9 RDI: 0000000000000000
[ 450.005066] [ C10] RBP: ffffa0ccc0468de8 R08: 0000000000000030 R09: 0000000060c0f800
[ 450.005067] [ C10] R10: 0000000000000008 R11: 0000000000000008 R12: 0000000000000001
[ 450.005068] [ C10] R13: ffffa0ccc0468da8 R14: ffffa0ccc0468e00 R15: ffff8c800d130f00
[ 450.005069] [ C10] FS: 0000000000000000(0000) GS:ffff8c871e700000(0000) knlGS:0000000000000000
[ 450.005071] [ C10] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 450.005072] [ C10] CR2: 0000000000000000 CR3: 0000000718020000 CR4: 0000000000f50ef0
[ 450.005073] [ C10] PKRU: 55555554
[ 450.005074] [ C10] Call Trace:
[ 450.005077] [ C10] <IRQ>
[ 450.005079] [ C10] ? __die_body.cold+0x19/0x27
[ 450.005083] [ C10] ? page_fault_oops+0x15a/0x2b0
[ 450.005085] [ C10] ? srso_alias_return_thunk+0x5/0xfbef5
[ 450.005088] [ C10] ? generic_reg_set_ex+0x11a/0x180 [amdgpu 1241b963d458061d1cf0c66988e5ad18d9dcaff6]
[ 450.005238] [ C10] ? exc_page_fault+0x81/0x190
[ 450.005241] [ C10] ? asm_exc_page_fault+0x26/0x30
[ 450.005244] [ C10] ? dcn10_set_drr+0xa6/0x100 [amdgpu 1241b963d458061d1cf0c66988e5ad18d9dcaff6]
[ 450.005404] [ C10] ? dcn10_set_drr+0x94/0x100 [amdgpu 1241b963d458061d1cf0c66988e5ad18d9dcaff6]
[ 450.005555] [ C10] dc_stream_adjust_vmin_vmax+0xd0/0x110 [amdgpu 1241b963d458061d1cf0c66988e5ad18d9dcaff6]
[ 450.005687] [ C10] dm_crtc_high_irq+0x22d/0x2b0 [amdgpu 1241b963d458061d1cf0c66988e5ad18d9dcaff6]
[ 450.005834] [ C10] amdgpu_dm_irq_handler+0x82/0x1f0 [amdgpu 1241b963d458061d1cf0c66988e5ad18d9dcaff6]
[ 450.005978] [ C10] amdgpu_irq_dispatch+0xce/0x210 [amdgpu 1241b963d458061d1cf0c66988e5ad18d9dcaff6]
[ 450.006109] [ C10] amdgpu_ih_process+0x83/0x100 [amdgpu 1241b963d458061d1cf0c66988e5ad18d9dcaff6]
[ 450.006228] [ C10] amdgpu_irq_handler+0x23/0x60 [amdgpu 1241b963d458061d1cf0c66988e5ad18d9dcaff6]
[ 450.006353] [ C10] __handle_irq_event_percpu+0x4a/0x190
[ 450.006356] [ C10] handle_irq_event+0x3b/0x80
[ 450.006359] [ C10] handle_edge_irq+0x9a/0x260
[ 450.006362] [ C10] __common_interrupt+0x3f/0xb0
[ 450.006366] [ C10] common_interrupt+0x80/0xa0
[ 450.006369] [ C10] </IRQ>
[ 450.006370] [ C10] <TASK>
[ 450.006371] [ C10] asm_common_interrupt+0x26/0x40
[ 450.006373] [ C10] RIP: 0010:cpuidle_enter_state+0xc6/0x420
[ 450.006375] [ C10] Code: 00 00 e8 2d 46 31 ff e8 18 f1 ff ff 49 89 c5 0f 1f 44 00 00 31 ff e8 49 4d 30 ff 45 84 ff 0f 85 aa 01 00 00 fb 0f 1f 44 00 00 <45> 85 f6 0f 88 84 01 00 00 49 63 d6 48 8d 04 52 48 8d 04 82 49 8d
[ 450.006377] [ C10] RSP: 0018:ffffa0ccc024fe80 EFLAGS: 00000246
[ 450.006378] [ C10] RAX: ffff8c871e700000 RBX: 0000000000000002 RCX: 0000000000000000
[ 450.006380] [ C10] RDX: 00000068c65cd5a2 RSI: fffffffbea02b9b2 RDI: 0000000000000000
[ 450.006381] [ C10] RBP: ffff8c8006e8b000 R08: 0000000000000002 R09: 00000000000004ed
[ 450.006382] [ C10] R10: 0000000000000018 R11: ffff8c871e7347e4 R12: ffffffffb1d56280
[ 450.006383] [ C10] R13: 00000068c65cd5a2 R14: 0000000000000002 R15: 0000000000000000
[ 450.006387] [ C10] cpuidle_enter+0x2d/0x40
[ 450.006390] [ C10] do_idle+0x1b0/0x210
[ 450.006394] [ C10] cpu_startup_entry+0x29/0x30
[ 450.006397] [ C10] start_secondary+0x11c/0x140
[ 450.006399] [ C10] common_startup_64+0x13e/0x141
[ 450.006403] [ C10] </TASK>
[ 450.006404] [ C10] Modules linked in: rfcomm snd_seq_dummy snd_hrtimer snd_seq nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_raw iptable_security nf_tables libcrc32c ip6table_filter ip6_tables iptable_filter cmac algif_hash algif_skcipher af_alg bnep nct6683 btusb btrtl btintel btbcm btmtk bluetooth ecdh_generic mousedev vfat fat joydev amd_atl intel_rapl_msr intel_rapl_common kvm_amd hid_generic mt7921e kvm mt7921_common mt792x_lib crct10dif_pclmul crc32_pclmul snd_usb_audio mt76_connac_lib polyval_clmulni polyval_generic mt76 snd_usbmidi_lib gf128mul ghash_clmulni_intel snd_ump snd_hda_codec_realtek snd_rawmidi uas sha512_ssse3 snd_seq_device snd_hda_codec_generic mc sha256_ssse3 snd_hda_codec_ca0132 mac80211 snd_hda_codec_hdmi snd_hda_scodec_component
[ 450.006451] [ C10] usb_storage usbhid sha1_ssse3 aesni_intel snd_hda_intel snd_intel_dspcfg crypto_simd snd_intel_sdw_acpi cryptd libarc4 snd_hda_codec sp5100_tco rapl snd_hda_core r8169 wmi_bmof pcspkr snd_hwdep ccp cfg80211 i2c_piix4 k10temp snd_pcm realtek mdio_devres snd_timer rfkill snd soundcore libphy gpio_amdpt gpio_generic mac_hid usbip_host usbip_core uinput i2c_dev sg crypto_user loop dm_mod nfnetlink ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 nvme crc32c_intel sr_mod xhci_pci nvme_core cdrom xhci_pci_renesas nvme_auth amdgpu video wmi amdxcp i2c_algo_bit drm_ttm_helper ttm drm_exec gpu_sched drm_suballoc_helper drm_buddy drm_display_helper cec
[ 450.006492] [ C10] CR2: 0000000000000000
Looks like a GPU driver problem, or at least the GPU driver is involved?
*edit
I tried analyzing the dump with the crash tool as explained in the article but the best output I get is:
crash: /usr/lib/modules/6.9.8-arch1-1/build/vmlinux: no debugging data available
Not sure where to go from here. A web search turned up a few other Arch users who had the same issue, but none of them got an answer. The best I found was someone who was using their Arch machine to debug a Debian crash and thus was able to download the vmlinux file from Debian's repositories. Maybe I don't need this at all given that I already have a stacktrace?
Last edited by Sidekick (2024-07-26 19:46:13)
Damn, you work fast. What search terms did you use to find those? Now that I know what I'm looking for I would just search for
dcn10_set_drr
Did you just take the last entry in the stacktrace that wasn't part of the exception handling - as in the last entry before the functions with 'page fault' in their name?
"dc_stream_adjust_vmin_vmax" sounds the least generic, "dcn10_set_drr" probably gets called all the time.
I also made clear that I was looking for a "kernel NULL pointer dereference"
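A rough sketch of that approach for pulling candidate search terms out of a saved oops. The function names and the file name in the usage comment are made up, and the patterns assume the dmesg layout shown in the trace above. As far as I know, frames prefixed with "?" are ones the unwinder is not sure about, so the sketch skips them.

```shell
# Print the function named in the first RIP line, i.e. where the fault hit.
oops_rip() {
    sed -n 's/.*RIP: [0-9a-f]*:\([A-Za-z_][A-Za-z0-9_.]*\)+.*/\1/p' "$1" | head -n 1
}

# Print the reliable call-trace frames (no leading "?"), innermost first,
# with offsets and module build IDs stripped.
oops_frames() {
    grep -E '\] +[A-Za-z_][A-Za-z0-9_.]*\+0x[0-9a-f]+/0x[0-9a-f]+' "$1" |
        sed -E 's/.*\] +([A-Za-z_][A-Za-z0-9_.]*)\+0x.*/\1/'
}

# Usage (placeholder file name for the saved kernel log):
#   oops_rip crash-dmesg.txt
#   oops_frames crash-dmesg.txt | head -n 3
```

For the trace above, the first reliable frame is dc_stream_adjust_vmin_vmax, which combined with "kernel NULL pointer dereference" makes for a fairly specific search.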
Awesome, thank you!
In case you wanna play ME, did you pick up the comments in the upstream bug suggesting that the LTS kernel is unaffected?
Yes, saw that but also that only one person reported that. Might give it a try, but right now I'm playtesting a bunch of games that I haven't played in a while to see if any of them can trigger this bug.
Last edited by Sidekick (2024-07-26 23:33:29)