
#1 2021-06-26 19:14:05

graysky
Wiki Maintainer
From: :wq
Registered: 2008-12-01
Posts: 10,731
Website

AMD Curve Optimizer/which core threw the Machine Check Error

Has anyone attempted to tune an AMD 5000 processor with Curve Optimizer while booting into Arch?  As I understand it, you set negative voltage offsets in the BIOS and then stress test until you hit a failure.  Since you do this core by core, it is important to verify which core caused the fault.  My assumption is that a hardware failure event will be recorded in the journal, but I do not know whether the individual core ID is logged, and I could not verify it by googling.  I did find the MCE wiki page advising to install rasdaemon, but I have the same question about whether it records the core that fails.

Has anyone with a 5900X or 5950X tried this and have experience or links to share?

Reference under Windows: https://youtu.be/dU5qLJqTSAc?t=1105

In that video, the Windows Event Viewer logs them as "WHEA error" entries, which capture the core ID:

A fatal hardware error has occurred.

Reported by component Processor Core
Error Source: Machine Check Exception
Error Type: Cache Hierarchy Error
Processor APIC ID: 6

EDIT: I'm no expert, but grepping for 'pr_err' in core.c didn't look promising.
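
A minimal sketch of checking the journal for such events, assuming the kernel tags them with `[Hardware Error]` and that `journalctl` is available. The function names are made up for illustration. `read_kernel_journal` only wraps `journalctl --dmesg` for one boot, and `hardware_error_lines` filters whatever lines it is given.

```python
import subprocess

# Tag the kernel prefixes to hardware error messages such as machine checks.
HW_ERR_TAG = "[Hardware Error]"


def hardware_error_lines(lines):
    """Keep only the lines that carry the hardware-error tag."""
    return [line.rstrip("\n") for line in lines if HW_ERR_TAG in line]


def read_kernel_journal(boot=0):
    """Return kernel messages for one boot (0 = current, -1 = previous).

    Depending on journal permissions this may need root or membership
    in the systemd-journal group.
    """
    cmd = ["journalctl", "--dmesg", f"--boot={boot}", "--no-pager"]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout.splitlines()
```

Typical use would be `hardware_error_lines(read_kernel_journal())` on the first boot after a crash, and again with `boot=-1` if nothing turns up.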

Last edited by graysky (2021-06-26 20:12:03)

Offline

#2 2021-06-27 12:42:26

enrique
Member
Registered: 2005-10-25
Posts: 95
Website

Re: AMD Curve Optimizer/which core threw the Machine Check Error

I have a 5800X, and I get this in `dmesg`, the first boot after a crash:

mar 18 20:27:20 hostname kernel: mce: [Hardware Error]: Machine check events logged
mar 18 20:27:20 hostname kernel: mce: [Hardware Error]: CPU 12: Machine Check: 0 Bank 1: bc800800060c0859
mar 18 20:27:20 hostname kernel: mce: [Hardware Error]: TSC 0 ADDR 43cba92c0 MISC d012000000000000 IPID 100b000000000 
mar 18 20:27:20 hostname kernel: mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1616095637 SOCKET 0 APIC 9 microcode a201009

CPU 12 is CPU core 7, as there are 2 threads per core and the CPU numbers start counting from 0.
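
For anyone who wants to pull these fields out of a log automatically, here is a rough sketch that parses lines in the format shown above. The regular expressions are written against this one sample, so treat them as an assumption about the format rather than a complete MCE parser. `parse_mce` is a hypothetical helper name.

```python
import re

# "CPU 12: Machine Check: 0 Bank 1: bc800800060c0859"
CPU_RE = re.compile(
    r"CPU (\d+): Machine Check(?: Exception)?: ([0-9a-f]+) Bank (\d+): ([0-9a-f]+)"
)
# "SOCKET 0 APIC 9 microcode a201009" (the APIC ID is parsed as hex here)
APIC_RE = re.compile(r"SOCKET (\d+) APIC ([0-9a-f]+)")


def parse_mce(lines):
    """Collect one dict per 'CPU N: Machine Check' line, adding socket/APIC."""
    events = []
    for line in lines:
        m = CPU_RE.search(line)
        if m:
            events.append({
                "cpu": int(m.group(1)),            # logical CPU number
                "mcg_status": int(m.group(2), 16),
                "bank": int(m.group(3)),
                "status": int(m.group(4), 16),
            })
            continue
        m = APIC_RE.search(line)
        if m and events:
            # The PROCESSOR/SOCKET/APIC line follows the event it describes.
            events[-1]["socket"] = int(m.group(1))
            events[-1]["apic"] = int(m.group(2), 16)
    return events
```

Feeding it the four lines above yields a single event with `cpu` 12, `bank` 1, `socket` 0 and `apic` 9.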


Kind regards, enrique

Offline

#3 2021-06-27 14:08:42

graysky
Wiki Maintainer
From: :wq
Registered: 2008-12-01
Posts: 10,731
Website

Re: AMD Curve Optimizer/which core threw the Machine Check Error

Thanks for the confirmation.  Re: numbering.  The 5800X has 8 physical cores, each with a second (SMT) thread.  If CPU 12 = Core 7, the numbering scheme must be:

Core 1 = CPU 0 + CPU 1
Core 2 = CPU 2 + CPU 3
Core 3 = CPU 4 + CPU 5
Core 4 = CPU 6 + CPU 7
Core 5 = CPU 8 + CPU 9
Core 6 = CPU 10 + CPU 11
Core 7 = CPU 12 + CPU 13
Core 8 = CPU 14 + CPU 15
Core 9 = CPU 16 + CPU 17
Core 10 = CPU 18 + CPU 19
Core 11 = CPU 20 + CPU 21
Core 12 = CPU 22 + CPU 23

Since my BIOS starts Core numbering from 0, anyone finding this via a search should check how their own BIOS numbers the cores and, if it also starts from 0, subtract 1 from the Core numbers in the table above.

So a failure of CPU 12 or 13 = Core 7.  Is that consistent with what you know?  Need to hit google up.
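
The table above boils down to integer division, e.g. 12 // 2 + 1 = 7. The sketch below encodes that, plus a way to cross-check it against what the running kernel reports in sysfs. The adjacent-pair layout is the assumption made in this thread. Linux does not always number SMT siblings adjacently, so the sysfs `core_id` and `thread_siblings_list` files are the safer source. Both function names are hypothetical, and the kernel's `core_id` is not guaranteed to match the BIOS core label.

```python
from pathlib import Path


def core_if_adjacent(cpu, threads_per_core=2, first_core=1):
    """Core number, assuming sibling threads are numbered back to back.

    first_core=1 matches the table above; use first_core=0 for a BIOS
    that counts cores from 0.
    """
    return cpu // threads_per_core + first_core


def read_topology(cpu, sysfs_root="/sys/devices/system/cpu"):
    """Read the kernel's own view of one logical CPU from sysfs."""
    topo = Path(sysfs_root) / f"cpu{cpu}" / "topology"
    return {
        "core_id": int((topo / "core_id").read_text()),
        "siblings": (topo / "thread_siblings_list").read_text().strip(),
    }
```

If `read_topology(12)` lists siblings other than CPU 12 and 13, the adjacent-pair table does not apply to that machine.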

Last edited by graysky (2021-07-02 14:37:34)

Offline

#4 2021-06-27 14:36:28

enrique
Member
Registered: 2005-10-25
Posts: 95
Website

Re: AMD Curve Optimizer/which core threw the Machine Check Error

Yeah, I'm pretty sure.

I've never gotten my 5800X stable with anything PBO/Curve Optimizer enabled. I've tried all the different BIOS versions and a ton of settings, so I hope you have more luck :)


Kind regards, enrique

Offline

#5 2021-06-27 16:56:10

graysky
Wiki Maintainer
From: :wq
Registered: 2008-12-01
Posts: 10,731
Website

Re: AMD Curve Optimizer/which core threw the Machine Check Error

The correct process for this isn't totally obvious to me, and I haven't found a well-written guide or documentation.  Here are a few links that might help if you want to give it another try.

Videos:
First one shows ALL CORE method: https://www.youtube.com/watch?v=dfkrp25dpQ0
Second one shows PER CORE method: https://www.youtube.com/watch?v=dU5qLJqTSAc

Links with some information/unsure about accuracy:
https://www.reddit.com/r/Amd/comments/k … _of_curve/
https://www.reddit.com/r/Amd/comments/k … in_msi_mb/
https://albertherd.com/2020/12/13/my-ex … n-a-5900x/
https://joseph.to/ultimate-guide-to-und … yzen-5000/

Offline

#6 2021-06-27 17:37:22

mcloaked
Member
From: Yorkshire, UK
Registered: 2012-02-02
Posts: 1,351

Re: AMD Curve Optimizer/which core threw the Machine Check Error

I would be very interested to see if there is a good way in Arch to optimise the DRAM and core voltage settings for my AMD desktop. Of course some users will be doing so to maximise performance, but others will be optimising for stability.

I have a 4750GE CPU, and the system occasionally reboots spontaneously, too quickly to even allow an MCE event to be recorded. The support centre for the motherboard in which it is installed has led me to believe that tweaking the BIOS settings may improve stability, though work is going on behind the scenes at the motherboard manufacturer, with their engineers looking at recoding the BIOS to give a more stable default set of settings.  That may take a while.

The overclocker forums seem to be full of threads about various AMD machines and overclocked DRAM settings predicted using "The Ryzen DRAM Calculator", but that only runs in MS Windows, and there seems to be no equivalent Linux tool that produces the same output. If anyone has found a way to get suggested sets of improved BIOS settings for the board and DRAM, it would be very much appreciated.

Of course simply adjusting settings at random can lead to a system that won't boot at all, and in some cases won't even POST or get to the BIOS screen, so changes have to be made with great caution, though there are motherboard jumpers that allow the BIOS to be re-flashed if the worst happens.  However, being able to make small changes to DRAM timings or board voltages that we can be confident are 'safe' would be very useful.

Edit: In addition to Graysky's links this may also be useful: https://fosspost.org/overclock-amd-ryzen-linux/ and to use the suggestions there are two arch AUR packages at https://aur.archlinux.org/packages/ryze … oller-bin/ and https://aur.archlinux.org/packages/ryzenadj-git/

Last edited by mcloaked (2021-06-27 17:51:10)


Mike C

Offline

#7 2021-06-27 18:28:34

graysky
Wiki Maintainer
From: :wq
Registered: 2008-12-01
Posts: 10,731
Website

Re: AMD Curve Optimizer/which core threw the Machine Check Error

Thanks for the links, mcloaked.  I don't yet have all the parts of my Ryzen system.  Did you install RyzenAdj on your system?  I am wondering if it can display your chip's PBO limits, which I understand a Windows-only piece of software called Ryzen Master can do.

To test it, disable PBO limits in your BIOS (i.e. use the defaults for the chip).  Then run

ryzenadj -i

Offline

#8 2021-06-27 19:49:33

mcloaked
Member
From: Yorkshire, UK
Registered: 2012-02-02
Posts: 1,351

Re: AMD Curve Optimizer/which core threw the Machine Check Error

I haven't run tests on my system yet. I have to wait until no users of the system (and it is a server!) could be impacted, and until I get a block of my own time when it isn't needed. AMD system stability is a topic that could usefully do with more input, as well as trial-and-error testing for different combinations of motherboard and CPU!  If I had a spare duplicate system to run tests on, it would be easier! It would seem from the overclocker forums in particular that a lot of users of many different AMD systems are looking for improved stability as well as improved performance.

The only change from defaults in my BIOS is reducing the DRAM frequency from the nominal default of 3200 MT/s to 2666 MT/s, which seems to have reduced the number of spontaneous reboots per week, though it is also possible that one or more of the many AMD commits in recent kernels have contributed to stability. But this will be a slow and very cautious approach to improving the hardware parameters.  Running RyzenAdj as you suggest should be benign, though, without further changes to my BIOS. (By the way, my system only has Arch on it. There is no other OS and no Windows, so the only approach I can take is to run whatever Arch packages are available.)

Last edited by mcloaked (2021-06-27 19:56:35)


Mike C

Offline

#9 2021-06-29 16:18:09

graysky
Wiki Maintainer
From: :wq
Registered: 2008-12-01
Posts: 10,731
Website

Re: AMD Curve Optimizer/which core threw the Machine Check Error

That's OK.  If you can get some time to run that command, I am curious to see what it shows.  I believe the -i switch is purely informational and should not modify anything.

Offline

#10 2021-07-04 15:29:44

mcloaked
Member
From: Yorkshire, UK
Registered: 2012-02-02
Posts: 1,351

Re: AMD Curve Optimizer/which core threw the Machine Check Error

I got around to trying this!

[mike@incus ryzenadj-git]$ sudo ryzenadj -i
CPU Family: Renoir
SMU BIOS Interface Version: 19
Version: v0.8.2 
PM Table Version: 370005
|        Name         |   Value   |      Paramter      |
|---------------------|-----------|--------------------|
| STAPM LIMIT         |    65.000 | stapm-limit        |
| STAPM VALUE         |     6.141 |                    |
| PPT LIMIT FAST      |    88.000 | fast-limit         |
| PPT VALUE FAST      |     8.515 |                    |
| PPT LIMIT SLOW      |    88.000 | slow-limit         |
| PPT VALUE SLOW      |     5.768 |                    |
| StapmTimeConst      |   100.000 | stapm-time         |
| SlowPPTTimeConst    |     5.000 | slow-time          |
| PPT LIMIT APU       |    88.000 | apu-slow-limit     |
| PPT VALUE APU       |     5.768 |                    |
| TDC LIMIT VDD       |    65.000 | vrm-current        |
| TDC VALUE VDD       |     0.456 |                    |
| TDC LIMIT SOC       |    50.000 | vrmsoc-current     |
| TDC VALUE SOC       |     3.632 |                    |
| EDC LIMIT VDD       |    95.000 | vrmmax-current     |
| EDC VALUE VDD       |    28.861 |                    |
| EDC LIMIT SOC       |    75.000 | vrmsocmax-current  |
| EDC VALUE SOC       |     6.503 |                    |
| THM LIMIT CORE      |    95.000 | tctl-temp          |
| THM VALUE CORE      |    31.768 |                    |
| STT LIMIT APU       |     0.000 | apu-skin-temp      |
| STT VALUE APU       |     0.000 |                    |
| STT LIMIT dGPU      |     0.000 | dgpu-skin-temp     |
| STT VALUE dGPU      |     0.000 |                    |
| CCLK Boost SETPOINT |    50.000 | power-saving /     |
| CCLK BUSY VALUE     |    24.139 | max-performance    |
[mike@incus ryzenadj-git]$ 
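
In case it helps anyone compare limits between BIOS settings, here is a small sketch that turns the table above into a dictionary. It assumes the three-column layout printed by the v0.8.2 build shown here; other versions may format things differently. `parse_ryzenadj_table` is a made-up helper, not part of RyzenAdj.

```python
def parse_ryzenadj_table(text):
    """Map each row name (e.g. 'PPT LIMIT FAST') to its numeric value."""
    values = {}
    for line in text.splitlines():
        # Drop the outer pipes, then split the three columns.
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) != 3:
            continue  # banner lines such as "CPU Family: Renoir"
        name, value, _parameter = cells
        try:
            values[name] = float(value)
        except ValueError:
            continue  # header row and the dashed separator
    return values
```

With the output above saved to a string, `parse_ryzenadj_table(output)["EDC LIMIT VDD"]` would give the EDC limit as a float, which makes it easy to diff two runs.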

Mike C

Offline
