Hi guys. Recently I've been facing recurring system crashes on my machine, and I need some help diagnosing the cause. After the most recent crash, I pulled the logs using journalctl and noticed several issues, but I'm not sure how to proceed. The system was under light to medium load during the crash: I had a good few browser tabs open in Firefox, two tabs open in Thunar and a video open in mpv.
Here's a summary of what I've found in the logs:
1. Memory issues: I’m seeing multiple "Out of memory" (OOM killer) messages where processes were killed to free memory. This leads me to believe the crash might be related to my system running out of RAM.
2. Disk/Device timeouts: I also see timeouts related to systemd services, specifically related to `udevd` and devices not responding or timing out. The log includes entries like:
- timed out waiting for device dev-disk-by\x2duuid-XXX.device
3. General protection faults: There are kernel messages indicating protection faults, which might indicate deeper hardware issues (e.g., RAM or CPU faults).
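For anyone triaging a similar journal, here is a rough sketch of how the three symptom groups above could be tallied. The function name `summarize_symptoms` is made up for illustration, and the match patterns are assumptions based on the wording quoted in this post, not an exhaustive list of kernel or systemd messages.

```shell
# Count journal lines matching each of the three symptom groups.
# Reads journal text on stdin. It counts matching lines, not distinct events.
summarize_symptoms() {
    awk '
        BEGIN { oom = 0; dev = 0; gpf = 0 }
        /Out of memory/                   { oom++ }  # OOM killer messages
        /[Tt]imed out waiting for device/ { dev++ }  # device unit timeouts
        /general protection fault/        { gpf++ }  # kernel GPF messages
        END {
            printf "oom=%d device_timeout=%d gpf=%d\n", oom, dev, gpf
        }
    '
}
```

It could be fed from the previous boot with `journalctl -b -1 | summarize_symptoms`. The counts only hint at which area deserves a closer look first.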
Here are links to my logs:
journalctl -b -1
journalctl -k -b -1
journalctl -p 3 -xb
My System info:
Fully up-to-date Arch Linux running LTS kernel.
CPU: Intel(R) Core(TM) i5-3470 CPU @ 3.20GHz
GPU: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X/590] (rev ef)
RAM: 16 GB
Questions:
- Based on the logs, does this look more like a memory issue, disk issue, or something else entirely?
- Any recommendations on steps to further diagnose and resolve this?
Thanks in advance for any help!!
Notes:
I've noticed that the majority of the crashes have happened while I had mpv open.
Some of the crashes have happened when I only had Firefox open.
Last edited by furycd001 (2024-10-11 11:07:51)
- - - -
Offline
This looks ominous
Oct 09 15:03:45 furycd001 kernel: [drm] VRAM is lost due to GPU reset!
This, plus your general protection faults causes the electrical engineer in me to twitch and wonder about the health or adequacy of the power supply, or to wonder about the temperature of things.
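The power supply cannot be checked from software, but the temperature side can be sampled. Below is a minimal sketch that reads the kernel's hwmon sensors from sysfs. The function name `print_temps` is hypothetical, and which sensors show up (for example `coretemp` or `amdgpu`) depends on the loaded drivers, so treat the output as a rough indication only.

```shell
# Print every readable hwmon temperature sensor in degrees Celsius.
# $1 is the hwmon base directory (defaults to the usual sysfs location).
print_temps() {
    base=${1:-/sys/class/hwmon}
    for f in "$base"/hwmon*/temp*_input; do
        [ -r "$f" ] || continue
        dir=${f%/*}
        # Each hwmon directory normally carries a "name" file for the chip.
        chip=$(cat "$dir/name" 2>/dev/null || echo unknown)
        # Skip sensors that fail to read instead of aborting.
        milli=$(cat "$f" 2>/dev/null) || continue
        # sysfs reports millidegrees Celsius.
        echo "$chip ${f##*/} $((milli / 1000))C"
    done
}
```

Running `print_temps` a few times during video playback would show whether the GPU or CPU is getting unusually hot before a reset. The `sensors` tool from lm_sensors reads the same hwmon data.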
Nothing is too wonderful to be true, if it be consistent with the laws of nature -- Michael Faraday
Sometimes it is the people no one can imagine anything of who do the things no one can imagine. -- Alan Turing
---
How to Ask Questions the Smart Way
Offline
intenso T1125A0
if your ssd is as old as the rest of the hardware chances are it's dying
although the kernel panic mentions a crash in the amdgpu driver - so could also be an issue with that
overall classic: overheating or faulty power supply
// update F5
going with ewaller here - could be a bad psu
Last edited by cryptearth (2024-10-09 15:07:01)
Offline
This looks ominous
Oct 09 15:03:45 furycd001 kernel: [drm] VRAM is lost due to GPU reset!
This, plus your general protection faults causes the electrical engineer in me to twitch and wonder about the health or adequacy of the power supply, or to wonder about the temperature of things.
The power supply is brand new and was only installed into the system about two months ago. Could the power supply be faulty, or maybe I could have connected it wrong?
- - - -
Offline
intenso T1125A0
if your ssd is as old as the rest of the hardware chances are it's dying
although the kernel panic mentions a crash in the amdgpu driver - so could also be an issue with that
overall classic: overheating or faulty power supply
// update F5
going with ewaller here - could be a bad psu
Should I try to get the power supply replaced, given that I've only had it for around 2 months and bought it brand new from ebuyer?
I'm a complete novice whenever it comes to fixing computers & can get flustered very easily. I will take a picture of the inside of my case so that someone here can have a look at my cables....
Last edited by furycd001 (2024-10-09 15:25:21)
- - - -
Offline
Could the power supply be faulty, or maybe I could have connected it wrong?
Both are a possibility. I use an integrated Intel GPU, so I don't know about powering GPUs, but I know they generally require a power connection in addition to what they get from the bus. Maybe someone else here can help with that.
A third possibility is that it is underrated for what you are asking it to do.
Offline
furycd001 wrote: Could the power supply be faulty, or maybe I could have connected it wrong?
Both are a possibility. I use an integrated Intel GPU, so I don't know about powering GPUs, but I know they generally require a power connection in addition to what they get from the bus. Maybe someone else here can help with that.
A third possibility is that it is underrated for what you are asking it to do.
I don't know anything about powering GPUs either. This is literally my first ever GPU that requires extra power to run. I only upgraded the power supply because of the GPU. Hmmm yea maybe it is underrated, but I was told that it wasn't. Who knows....
- - - -
Offline
You might try running without the discrete GPU for a day to eliminate other things as the cause. But that will have an impact on RAM usage, so it is not a perfect test. Nor is it a perfect test if you cannot do what you need to do without it.
What is the rated power of the supply? Have you a make and model number?
Offline
This is the power supply I bought....
https://www.ebuyer.com/1904164-corsair- … 9020277-uk
The only reason I bought the GPU is because I play some TF2 & do a bit of upscaling with upscayl....
Last edited by furycd001 (2024-10-09 15:51:06)
- - - -
Offline
Nothing wrong with the specs on that bad boy.
Offline
Nothing wrong with the specs on that bad boy.
Well, that's something at least. I'm going to take a picture here, and hope that someone may be able to make something of it....
Here's a link to the image....
https://drive.google.com/file/d/1MTG8Ps … p=drivesdk
Last edited by furycd001 (2024-10-09 16:28:44)
- - - -
Offline
well - even if it's "just bought" doesn't mean it's new
but as you have fresh warranty I would give it a try anyway to rule it out
Offline
well - even if it's "just bought" doesn't mean it's new
but as you have fresh warranty I would give it a try anyway to rule it out
Hmm yea I may try and use the warranty to see. I'll probably also try using my computer without the graphics card....
- - - -
Offline
Oct 09 15:03:34 furycd001 kernel: amdgpu 0000:01:00.0: amdgpu: GPU fault detected: 147 0x00024802 for process mpv pid 504060 thread mpv:cs0 pid 504077
Oct 09 15:03:34 furycd001 kernel: amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000C00
Oct 09 15:03:34 furycd001 kernel: amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x07048002
Oct 09 15:03:34 furycd001 kernel: amdgpu 0000:01:00.0: amdgpu: VM fault (0x02, vmid 3, pasid 32774) at page 3072, write from 'TC4' (0x54433400) (72)
Oct 09 15:03:34 furycd001 kernel: amdgpu 0000:01:00.0: amdgpu: GPU fault detected: 147 0x00004802 for process mpv pid 504060 thread mpv:cs0 pid 504077
Oct 09 15:03:34 furycd001 kernel: amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000C00
Oct 09 15:03:34 furycd001 kernel: amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x06048002
Oct 09 15:03:34 furycd001 kernel: amdgpu 0000:01:00.0: amdgpu: VM fault (0x02, vmid 3, pasid 32774) at page 3072, read from 'TC4' (0x54433400) (72)
Oct 09 15:03:34 furycd001 kernel: amdgpu 0000:01:00.0: amdgpu: GPU fault detected: 147 0x00004402 for process mpv pid 504060 thread mpv:cs0 pid 504077
Oct 09 15:03:34 furycd001 kernel: amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x03800000
Oct 09 15:03:34 furycd001 kernel: amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x06048001
Oct 09 15:03:34 furycd001 kernel: amdgpu 0000:01:00.0: amdgpu: VM fault (0x01, vmid 3, pasid 32774) at page 58720256, read from 'TC4' (0x54433400) (72)
Oct 09 15:03:45 furycd001 kernel: amdgpu 0000:01:00.0: amdgpu: GPU fault detected: 146 0x0000480c for process mpv pid 504060 thread mpv:cs0 pid 504077
Oct 09 15:03:45 furycd001 kernel: amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000
Oct 09 15:03:45 furycd001 kernel: amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0604800C
Oct 09 15:03:45 furycd001 kernel: amdgpu 0000:01:00.0: amdgpu: VM fault (0x0c, vmid 3, pasid 32774) at page 0, read from 'TC4' (0x54433400) (72)
Oct 09 15:03:45 furycd001 kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=8945266, emitted seq=8945269
Oct 09 15:03:45 furycd001 kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process mpv pid 504060 thread mpv:cs0 pid 504077
Oct 09 15:03:45 furycd001 kernel: amdgpu 0000:01:00.0: amdgpu: GPU reset begin!
Oct 09 15:03:45 furycd001 kernel: amdgpu: cp is busy, skip halt cp
Oct 09 15:03:45 furycd001 kernel: amdgpu: rlc is busy, skip halt rlc
Oct 09 15:03:45 furycd001 kernel: amdgpu 0000:01:00.0: amdgpu: BACO reset
Oct 09 15:03:45 furycd001 kernel: amdgpu 0000:01:00.0: amdgpu: GPU reset succeeded, trying to resume
Oct 09 15:03:45 furycd001 kernel: [drm] PCIE GART of 256M enabled (table at 0x000000F400900000).
Oct 09 15:03:45 furycd001 kernel: [drm] VRAM is lost due to GPU reset!
Oct 09 15:03:46 furycd001 kernel: amdgpu 0000:01:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring comp_1.2.0 test failed (-110)
The GPU resets after 5½h uptime, triggered by some mpv video playback.
The reset precondition looks like https://gitlab.freedesktop.org/mesa/mesa/-/issues/3584
It also shows up in https://forum.manjaro.org/t/system-freq … 97?page=10 and you have a matching GPU.
Please post the output of
pacman -Qs opencl
along with your Xorg log, https://wiki.archlinux.org/title/Xorg#General and the output of "mpv somevideo.mp4" (it should list the output and decoder module)
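To see at a glance which programs were involved in the faults, something like the following sketch could be run over the kernel log. The function name `gpu_fault_procs` is made up here, and the pattern assumes the "GPU fault detected ... for process NAME pid N" wording seen in the excerpt above. Process names containing spaces would not be handled.

```shell
# List which processes triggered amdgpu faults, with a count per process.
# Reads kernel log text on stdin and prints lines like "mpv=4".
gpu_fault_procs() {
    sed -n 's/.*GPU fault detected: .* for process \([^ ]*\) pid .*/\1/p' |
        sort | uniq -c | awk '{ print $2 "=" $1 }'
}
```

For example, `journalctl -k -b -1 | gpu_fault_procs` would summarise the previous boot. If only one program ever appears, that points at a specific workload (here, hardware-accelerated video playback) more than at random hardware failure, though it does not rule the latter out.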
Online
pacman -Qs opencl
local/ocl-icd 2.3.2-2
OpenCL ICD Bindings
Last edited by furycd001 (2024-10-09 19:56:34)
- - - -
Offline
You have no OpenCL implementation installed, so that's not overly suspicious.
Try to remove xf86-video-amdgpu and run on the modesetting driver.
Online
You have no OpenCL implementation installed, so that's not overly suspicious.
Try to remove xf86-video-amdgpu and run on the modesetting driver.
Okay I will try that first thing in the morning. I will also post the output of mpv running a video....
- - - -
Offline
I’ve managed to get to the desktop using the modesetting driver. Do you think this will fully resolve the issues I’ve been having, or is there more I might need to address?
sudo cat /var/log/Xorg.0.log | grep "modesetting"
[sudo] password for furycd001:
[ 13.410] (II) LoadModule: "modesetting"
[ 13.410] (II) Loading /usr/lib/xorg/modules/drivers/modesetting_drv.so
[ 13.586] (II) Module modesetting: vendor="X.Org Foundation"
[ 13.586] (II) modesetting: Driver for Modesetting Kernel Drivers: kms
mpv output
● Video --vid=1 (h264 720x1280 30 fps) [default]
● Audio --aid=1 (aac 2ch 44100 Hz 81 kbps) [default]
Using hardware decoding (vaapi).
[E] pw.loop [loop.c:69 pw_loop_new()] 0x5a270178e410: can't make support.system handle: No such file or directory
AO: [pulse] 44100Hz stereo 2ch float
VO: [gpu] 720x1280 vaapi[nv12]
AV: 00:00:22 / 00:00:23 (99%) A-V: 0.000
Exiting... (End of file)
- - - -
Offline
sudo cat /var/log/Xorg.0.log | grep "modesetting"
https://porkmail.org/era/unix/award#cat
The file is world-readable, so neither sudo nor cat is needed, and the grep output doesn't show that you're actually using the modesetting driver.
Post the log.
Please use [code][/code] tags, not some markup. It'll result in monospace fonts and a scrollbox for long content.
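A "LoadModule" line only shows that a driver was loaded, not that it ended up driving the screen. As a rough sketch, one could look for the per-screen log prefixes instead. The function name `active_xorg_driver` is hypothetical, and the prefixes `modeset(N):` and `AMDGPU(N):` are an assumption about how these two drivers tag their log lines. If both appear, this sketch simply reports amdgpu.

```shell
# Guess the active Xorg driver from a log read on stdin.
# Looks for per-screen prefixes such as "modeset(0):" or "AMDGPU(0):".
active_xorg_driver() {
    awk '
        /AMDGPU\([0-9]+\):/  { amdgpu = 1 }
        /modeset\([0-9]+\):/ { modeset = 1 }
        END {
            if (amdgpu)       print "amdgpu"
            else if (modeset) print "modesetting"
            else              print "unknown"
        }
    '
}
```

It reads the log directly, without sudo or cat: `active_xorg_driver < /var/log/Xorg.0.log`. Posting the full log is still the more reliable way to get it checked.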
Online
Offline
There's an explicit device config somewhere, but you're running on the modesetting driver.
Whether that is beneficial, time will tell - the last incident took 5½h uptime, afaiu it's not a problem that you'd expect somewhat instantly?
Reg. the other issues mentioned, they don't seem to be reflected in the shared journal, but generally OOM means you lack memory - either because of a rogue process or because you're systematically underprovisioned.
Next to zswap you've a 16GB swap partition, though - so more like a rogue process?
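If OOM messages do turn up in a journal, a sketch like the one below could show which processes the kernel killed most often, which helps tell a single rogue process from general memory pressure. The function name `oom_victims` is made up, and the pattern assumes the kernel's "Out of memory: Killed process PID (name)" wording. Note that the killed process is the one the kernel picked, not necessarily the one leaking.

```shell
# Count OOM-killer victims by process name, most frequent first.
# Reads kernel log text on stdin and prints lines like "firefox=2".
oom_victims() {
    sed -n 's/.*Out of memory: Kill[a-z]* process [0-9]* (\([^)]*\)).*/\1/p' |
        sort | uniq -c | sort -rn |
        # Keep names with spaces intact: strip the count, then re-append it.
        awk '{ n = $1; sub(/^ *[0-9]+ /, ""); print $0 "=" n }'
}
```

Usage would be along the lines of `journalctl -k -b -1 | oom_victims`. Empty output means the chosen boot has no matching OOM kills.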
Online
There's an explicit device config somewhere, but you're running on the modesetting driver.
Whether that is beneficial, time will tell - the last incident took 5½h uptime, afaiu it's not a problem that you'd expect somewhat instantly?
Reg. the other issues mentioned, they don't seem to be reflected in the shared journal, but generally OOM means you lack memory - either because of a rogue process or because you're systematically underprovisioned.
Next to zswap you've a 16GB swap partition, though - so more like a rogue process?
I'd say it's most likely a rogue process, but I'm a novice at troubleshooting, so I can't confirm that with 100% certainty. I will report back later today when my computer has passed the 5½h mark. The problem seemed to occur mainly when using mpv, but launching Steam to play TF2 and doing various other things always seemed fine....
Update....
My computer has been running for 8h22m now and there have been no crashes....
Last edited by furycd001 (2024-10-10 14:08:44)
- - - -
Offline
Okay, so the computer never crashed yesterday, and that was with me running a bunch of stuff to test the system later in the day. The one thing I've noticed is that screen tearing is present with the modesetting driver. I created the file below and rebooted the system, but the tearing is still present. Any ideas?
/etc/X11/xorg.conf.d/20-modesetting.conf
Section "Device"
    Identifier "AMD GPU"
    Driver "modesetting"
    Option "TearFree" "true"
    Option "DRI" "3"
EndSection
- - - -
Offline
Despite https://www.phoronix.com/news/xf86-vide … ing-Fix-TF the TearFree option in the modesetting driver isn't implemented in any X11 server released so far. Consider running a compositor if the tearing is too annoying (you get shadows and transparency as a bonus).
Online
Despite https://www.phoronix.com/news/xf86-vide … ing-Fix-TF the TearFree option in the modesetting driver isn't implemented in any X11 server released so far. Consider running a compositor if the tearing is too annoying (you get shadows and transparency as a bonus).
Balls.. Thank you for making me aware of that. For now I'll probably just live with it, because it's not awful, and I don't really want to install a whole bunch of extra stuff. I'm currently on XFCE, btw.
Do you know when it would be safe for me to return to the xf86-video-amdgpu driver?
Last edited by furycd001 (2024-10-11 09:31:27)
- - - -
Offline