
#26 2024-11-17 07:55:01

seth
Member
Registered: 2012-09-03
Posts: 58,974

Re: amdgpu reset, not entirely sure why

Does this continue on a completely different SW stack like https://grml.org/ ?


#27 2024-11-18 09:19:06

gtf21
Member
Registered: 2020-06-28
Posts: 136

Re: amdgpu reset, not entirely sure why

seth wrote:

Does this continue on a completely different SW stack like https://grml.org/ ?

I downloaded grml, booted it, configured the network interface (i.e. spent a decent amount of time in the console), then started X and opened Firefox; the system froze and crashed about 3 seconds later.


#28 2024-11-18 09:21:23

gtf21
Member
Registered: 2020-06-28
Posts: 136

Re: amdgpu reset, not entirely sure why

I wonder if it is possible that something my Arch system did set something in the eeprom (or equivalent) on the chip / motherboard? Otherwise my only remaining theory is that this is a hardware failure which seems strange but not impossible.


#29 2024-11-18 15:08:06

gtf21
Member
Registered: 2020-06-28
Posts: 136

Re: amdgpu reset, not entirely sure why

Oh, and I also tried the 2022.11 grml (6.0.0 kernel) and got the same thing.

What’s odd is that the desktop seems to perform fine, but when I either opened Firefox or dragged a terminal around the screen it crashed. Perhaps this is when OpenGL gets used or something?


#30 2024-11-18 16:34:04

seth
Member
Registered: 2012-09-03
Posts: 58,974

Re: amdgpu reset, not entirely sure why

Perhaps this is when OpenGL gets used or something

Likely.

my Arch system did set something in the eeprom (or equivalent) on the chip / motherboard

Unlikely, the CPU/GPU firmware is transient and would at least be lost when you cut the system from power for a while.
Some UEFI update (via fwupd or not) might have caused such, but I'd look at power supply or maybe RAM (memtest86+)

Can you also trigger this when banging the GPU w/ some https://wiki.archlinux.org/title/Benchmarking#Graphics or maybe https://github.com/ProjectPhysX/OpenCL-Benchmark ?


#31 2024-11-18 20:35:33

gtf21
Member
Registered: 2020-06-28
Posts: 136
Website

Re: amdgpu reset, not entirely sure why

but I'd look at power supply or maybe RAM (memtest86+)

Could you help me understand the thesis here? I'm discovering a lot of this as I go along...

In the meantime, I'm going to try memtest and basemark and OpenCL-benchmark while monitoring dmesg for anything. Will report back.
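
For anyone following along, watching the kernel log during a benchmark could look roughly like the sketch below. It assumes a systemd system where `journalctl -k -f` is available; the keyword list is only a guess at what is worth flagging, since the exact amdgpu message wording varies between kernel versions.

```python
import subprocess

# Substrings worth flagging. "amdgpu" and "drm" are the driver and subsystem
# names; the keyword list itself is an assumption, not an official filter.
KEYWORDS = ("amdgpu", "drm")


def is_gpu_line(line: str) -> bool:
    """Return True if a kernel log line mentions the GPU driver or DRM."""
    lowered = line.lower()
    return any(word in lowered for word in KEYWORDS)


def follow_kernel_log():
    """Yield GPU-related kernel messages as they arrive (blocks until killed)."""
    # journalctl -k: kernel messages only, -f: follow, -o short-iso: ISO timestamps
    proc = subprocess.Popen(
        ["journalctl", "-k", "-f", "-o", "short-iso"],
        stdout=subprocess.PIPE,
        text=True,
    )
    for line in proc.stdout:
        if is_gpu_line(line):
            yield line.rstrip()


# Usage (run in a second console or over SSH while the benchmark runs):
#     for msg in follow_kernel_log():
#         print(msg, flush=True)
```

Running this over SSH from another machine is probably the safer option, because a hard freeze can take the local console down before the last messages are shown.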


#32 2024-11-18 20:39:33

seth
Member
Registered: 2012-09-03
Posts: 58,974

Re: amdgpu reset, not entirely sure why

If the GPU starves from insufficient voltage/current anything can happen and since it's an APU it uses the system RAM as VRAM.
Ryzens are notoriously underpowered, https://wiki.archlinux.org/title/Ryzen#Troubleshooting
But typically the CPU becomes the victim.


#33 2024-11-18 23:23:34

gtf21
Member
Registered: 2020-06-28
Posts: 136
Website

Re: amdgpu reset, not entirely sure why

1. Memtest passes just fine, so it's not a RAM problem.
2. "If the GPU starves from insufficient voltage/current anything can happen" This is interesting to me. I'm using this PSU from Streacom https://streacom.com/products/zf240-fanless-psu/ -- is it possible it's under-supplying the CPU/GPU?
3. I ran the OpenCL benchmark, which seemed to work fine with no kernel errors appearing. I couldn't run the others as they all require graphical sessions. I'll try to do this in a graphical session tomorrow, if I can get it to live long enough.


#34 2024-11-19 06:50:17

seth
Member
Registered: 2012-09-03
Posts: 58,974

Re: amdgpu reset, not entirely sure why

1. Useful memtest86+ runs are measured in days, not minutes.
Run it overnight (16+h) at least
2. It's not so much what the PSU can put out but how much of that makes it to the APU, see the ryzen links
3. Do you have another monitor? What happens if you https://wiki.archlinux.org/title/Intel_ … on_(VSYNC) ?


#35 2024-11-19 08:24:52

gtf21
Member
Registered: 2020-06-28
Posts: 136
Website

Re: amdgpu reset, not entirely sure why

On (1), I didn’t realise. Will do. AMD responded to me saying:

I would like to inform you that AMD Ryzen™ 9 7900 processor supports memory up to 2x1R DDR5-5200, 2x2R DDR5-5200, 4x1R DDR5-3600, 4x2R DDR5-3600.

My RAM is G.Skill Trident Z5 Neo 64 GB (2 x 32 GB) DDR5-6000 CL30, so is it likely/possible that this is the issue?

The thing that perplexes me is the increasing frequency of the issues, however.
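
To put numbers on the RAM question, here is a small sketch comparing the kit's rated speed with the limits in AMD's reply. The table is copied from that reply; the function name and the rank count of the 2 x 32 GB kit are assumptions made for illustration (with two DIMMs the listed limit is 5200 either way).

```python
# Maximum officially supported DDR5 speed (MT/s) per configuration, taken from
# the AMD reply quoted above. Key: (number of DIMMs, ranks per DIMM).
SUPPORTED_MTS = {(2, 1): 5200, (2, 2): 5200, (4, 1): 3600, (4, 2): 3600}


def over_spec(dimms: int, ranks: int, rated_mts: int) -> int:
    """Return by how many MT/s the rated speed exceeds the listed maximum (0 if within)."""
    return max(0, rated_mts - SUPPORTED_MTS[(dimms, ranks)])


# Rank count is an assumption; both two-DIMM entries list 5200 anyway.
print(over_spec(2, 2, 6000))  # 800
```

The rated DDR5-6000 is normally only reached with an EXPO/XMP profile enabled, so it is worth checking in the BIOS what speed the RAM actually runs at before concluding anything from this.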


#36 2024-11-19 09:20:30

gtf21
Member
Registered: 2020-06-28
Posts: 136

Re: amdgpu reset, not entirely sure why

Another update: I went into the BIOS as suggested in the Ryzen link and boosted the curve optimiser by +4 points for *both* the CPU and GFX. I've now run Basemark, glmark2, gputest and all of them have worked fine.

I now have firefox up and things seem to be working, although I'll see for how long the situation remains stable, as until this week I would often have a couple of days without a crash.

I'll also run a longer memtest.


#37 2024-11-20 22:13:19

gtf21
Member
Registered: 2020-06-28
Posts: 136

Re: amdgpu reset, not entirely sure why

I spoke too soon, had another crash just now. I ran a 10h memtest successfully. Will run another one tonight.


#38 2024-11-20 22:51:06

seth
Member
Registered: 2012-09-03
Posts: 58,974

Re: amdgpu reset, not entirely sure why

Does reducing the optimizer offset increase the frequency again?
If there's a correlation, try +5


#39 2024-11-21 16:55:27

gtf21
Member
Registered: 2020-06-28
Posts: 136

Re: amdgpu reset, not entirely sure why

Good shout. I set it back to zero and it froze as soon as I opened Basemark. Set it to +4 for the GFX optimiser only this time and ran Basemark 3 times without a problem. If it crashes again I’ll try +5.

Memtest ran for >16h with no failures.

Looks like the voltage hypothesis is reasonable. I have no feel whatsoever for what “+5” means, however (e.g. what are the units? Is that an X-axis shift? What *is* the X-axis? Etc.). Do you?


#40 2024-11-21 20:52:42

seth
Member
Registered: 2012-09-03
Posts: 58,974

Re: amdgpu reset, not entirely sure why

It's part of AMD's Precision Boost Overdrive: the curve optimizer lets you run the cores at lower voltages, thus cooler, so they can run at higher frequencies.
The problem is that if the voltage is too low, the APU no longer acts deterministically. Higher positive offsets run the core at higher voltage, thus more stable, but hotter, thus the maximum frequency goes down to stay within the TDP.

I do not know how the offset translates to volts, but since the different CPUs have different TDPs it probably depends on the specific chip (or not, feel free to google and tell us ;-)
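
The voltage/frequency trade-off described above can be illustrated with the textbook approximation for dynamic power, P ≈ C·V²·f. This is only a toy model under that assumption: it ignores leakage and the minimum voltage a given frequency needs, and it is not AMD's actual boost algorithm. The function name is made up for the example.

```python
def relative_max_frequency(voltage_scale: float) -> float:
    """Frequency scale that keeps P = C * V^2 * f constant when V is scaled.

    voltage_scale is the new voltage divided by the old one, e.g. 1.02 for +2%.
    """
    return 1.0 / voltage_scale ** 2


# Show how the frequency budget moves for small voltage changes
# at a fixed power (TDP) limit.
for pct in (-2, 0, 2):
    scale = 1 + pct / 100
    print(f"voltage {pct:+d}% -> frequency budget x{relative_max_frequency(scale):.3f}")
```

In this model a small voltage increase costs roughly twice as much (in percent) of frequency headroom, which matches the direction of the effect described above, if not necessarily its size.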


#41 2024-11-24 15:36:07

orangecza
Member
Registered: 2024-11-24
Posts: 3

Re: amdgpu reset, not entirely sure why

What's your CPU core voltage?
Please try to keep the core voltage around 1.3 V and the SoC voltage around 1.1 V.
I believe the AMD hardware is to blame.
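
On Linux, the amdgpu driver exposes GPU-side voltages through its hwmon node (`in0_input`, labelled vddgfx, and on APUs `in1_input`, labelled vddnb, both in millivolts). A minimal sketch for reading them follows; the function name is made up, and note that this does not give the CPU core voltage, which usually has to come from the board's sensor chip or the BIOS.

```python
from pathlib import Path


def amdgpu_voltages(hwmon_root: str = "/sys/class/hwmon") -> dict:
    """Return {label: volts} for every voltage input of the amdgpu hwmon node."""
    readings = {}
    for node in sorted(Path(hwmon_root).glob("hwmon*")):
        name_file = node / "name"
        # Skip sensors that do not belong to the amdgpu driver.
        if not name_file.is_file() or name_file.read_text().strip() != "amdgpu":
            continue
        for value_file in sorted(node.glob("in*_input")):
            label_file = node / value_file.name.replace("_input", "_label")
            if label_file.is_file():
                label = label_file.read_text().strip()
            else:
                label = value_file.name
            # hwmon reports voltages in millivolts.
            readings[label] = int(value_file.read_text().strip()) / 1000.0
    return readings


# Usage:
#     print(amdgpu_voltages())
```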


#42 2024-11-26 21:44:38

gtf21
Member
Registered: 2020-06-28
Posts: 136

Re: amdgpu reset, not entirely sure why

FTR: I set the GFX curve optimiser to +5 and system has been up for >3 days with no issues. Am cautiously hopeful. If this doesn't work, I'll try looking at the other voltage controls in the BIOS.
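
For the record, the trial-and-error so far can be written down as a small log. The sketch below is only an illustration: the `Trial` type and the helper are made up, and the three entries are the outcomes reported in this thread (offset 0 froze on opening Basemark, +4 crashed after a while, +5 has been stable for >3 days so far).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Trial:
    gfx_offset: int   # GFX curve optimiser setting in the BIOS
    crashed: bool     # whether a freeze/reset was seen at this setting
    note: str = ""


def lowest_stable_offset(trials: List[Trial]) -> Optional[int]:
    """Return the smallest offset with no recorded crash, or None if all crashed."""
    crashed_offsets = {t.gfx_offset for t in trials if t.crashed}
    stable = [t.gfx_offset for t in trials if t.gfx_offset not in crashed_offsets]
    return min(stable) if stable else None


trials = [
    Trial(0, True, "froze as soon as Basemark was opened"),
    Trial(4, True, "set for both CPU and GFX, crashed after a while"),
    Trial(5, False, "up for >3 days so far"),
]
print(lowest_stable_offset(trials))  # 5
```

A setting with no crash recorded is only "stable so far", so the log is worth extending whenever something changes.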


#43 2024-11-27 07:14:42

seth
Member
Registered: 2012-09-03
Posts: 58,974

Re: amdgpu reset, not entirely sure why

I set the GFX curve optimiser to +5 and system has been up for >3 days with no issues.

If this pans out to be true, please bump this thread at that point; this should probably be in the amdgpu wiki.

