
#1 2023-11-17 23:33:38

Nathan67
Member
Registered: 2023-11-17
Posts: 14

Random reboots with green screen and mce events : bea0000000000108

Hello,

My PC reboots randomly with MCE errors.
I have already applied the recommendations on this page (in particular the AMD Curve Optimizer recommendations), but the problem is still there:
https://wiki.archlinux.org/title/Ryzen#Random_reboots

With the following kernel parameters, the problem still persists:
idle=nomwait processor.max_cstate=5

I have an AMD Ryzen 5 5600X processor and an AMD Radeon RX 7800 XT graphics card.

Have a nice day, Nathan.

Last edited by Nathan67 (2024-03-03 22:22:41)


#2 2023-11-17 23:58:21

Nathan67
Member
Registered: 2023-11-17
Posts: 14

Re: Random reboots with green screen and mce events : bea0000000000108

I forgot to mention it, but I use the 6.6.1-arch1-1 kernel.


#3 2023-11-18 08:27:50

seth
Member
From: Don't DM me only for attention
Registered: 2012-09-03
Posts: 69,404

Re: Random reboots with green screen and mce events : bea0000000000108

The wiki wrote:

You will recognise those by dmesg logs that look like:

Do they fit? If in doubt, post them.

The wiki wrote:

To solve this problem you need to supply higher voltage to your CPU so that it is stable when running at peak frequencies. The easiest way to achieve this is to use the AMD curve optimiser which is accessible via your motherboard's bios.

The cstate limitation mainly addresses soft-locks.


#4 2023-11-18 09:40:07

Nathan67
Member
Registered: 2023-11-17
Posts: 14

Re: Random reboots with green screen and mce events : bea0000000000108

seth wrote:
The wiki wrote:

You will recognise those by dmesg logs that look like:

Do they fit? Post them in doubt.

The wiki wrote:

To solve this problem you need to supply higher voltage to your CPU so that it is stable when running at peak frequencies. The easiest way to achieve this is to use the AMD curve optimiser which is accessible via your motherboard's bios.

The cstate limitation mainly addresses soft-locks.

Hello,

Thank you for your reply.
I typically get this error:

nov. 14 23:56:05 pc kernel: mce: [Hardware Error]: CPU 7: Machine Check: 0 Bank 5: bea0000000000108
nov. 14 23:56:05 pc kernel: mce: [Hardware Error]: TSC 0 ADDR 35913f0e6 MISC d012000100000000 SYND 4d000000 IPID 500b000000000
nov. 14 23:56:05 pc kernel: mce: [Hardware Error]: PROCESSOR 2:a20f12 TIME 1700002559 SOCKET 0 APIC 3 microcode a20120e


#5 2023-11-18 13:18:33

seth
Member
From: Don't DM me only for attention
Registered: 2012-09-03
Posts: 69,404

Re: Random reboots with green screen and mce events : bea0000000000108

Yup.

The wiki wrote:

To solve this problem you need to supply higher voltage to your CPU so that it is stable when running at peak frequencies.


#6 2023-11-18 13:21:05

Nathan67
Member
Registered: 2023-11-17
Posts: 14

Re: Random reboots with green screen and mce events : bea0000000000108

seth wrote:

Yup.

The wiki wrote:

To solve this problem you need to supply higher voltage to your CPU so that it is stable when running at peak frequencies.

I did this by applying the AMD Curve Optimizer recommendations, but the problem is still there.
On the other hand, kernel 6.7 (https://aur.archlinux.org/packages/linux-mainline) seems to be more stable: so far I haven't had any more crashes.


#7 2023-11-20 15:02:08

jojo06
Member
Registered: 2023-11-04
Posts: 299

Re: Random reboots with green screen and mce events : bea0000000000108

[NOT A SOLUTION AT ALL] This made me think of something. Since Linux is open source and everything can be done manually, without limits or barriers, I wonder whether it is possible to overclock processors that do not normally accept it this way. I mean locked ones.
Considering that at low frequencies the system performance drops (or the system resets to protect itself), you are actually improving performance by over-volting. Did you observe an increase in performance?


#8 2023-11-20 18:07:09

agapito
Member
From: Who cares.
Registered: 2008-11-13
Posts: 699

Re: Random reboots with green screen and mce events : bea0000000000108

Nathan67 wrote:

I typically get this error:

nov. 14 23:56:05 pc kernel: mce: [Hardware Error]: CPU 7: Machine Check: 0 Bank 5: bea0000000000108
nov. 14 23:56:05 pc kernel: mce: [Hardware Error]: TSC 0 ADDR 35913f0e6 MISC d012000100000000 SYND 4d000000 IPID 500b000000000
nov. 14 23:56:05 pc kernel: mce: [Hardware Error]: PROCESSOR 2:a20f12 TIME 1700002559 SOCKET 0 APIC 3 microcode a20120e

That message means your Core 1 needs more voltage.
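For anyone who wants to check that reading, here is a rough Python sketch that decodes the status word from the log. It only looks at the architectural MCA flag bits (63 down to 57) and the low 16-bit error code. The APIC-to-core mapping assumes two SMT threads per core with consecutive APIC IDs, which is an assumption about this 5600X and not something the log states. The function names are made up for illustration.

```python
# Architectural MCA status flags (same bit positions on AMD64 and Intel 64).
STATUS_FLAGS = {
    63: "VAL",    # the bank holds a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting was enabled
    59: "MISCV",  # MISC register contents are valid
    58: "ADDRV",  # ADDR register contents are valid
    57: "PCC",    # processor context corrupt
}


def decode_status(status_hex):
    """Split a 64-bit MCA status word into flag names and the 16-bit error code."""
    status = int(status_hex, 16)
    flags = [
        name
        for bit, name in sorted(STATUS_FLAGS.items(), reverse=True)
        if (status >> bit) & 1
    ]
    return {"flags": flags, "error_code": status & 0xFFFF}


def apic_to_core(apic_id, threads_per_core=2):
    """ASSUMPTION: SMT siblings have consecutive APIC IDs, two per core."""
    return apic_id // threads_per_core


if __name__ == "__main__":
    # (status word, APIC id) pairs copied from the logs posted in this thread.
    for status, apic in (("bea0000000000108", 3), ("bea0000001000108", 2)):
        info = decode_status(status)
        print(status, info["flags"], hex(info["error_code"]),
              "core", apic_to_core(apic))
```

Both status words have UC and PCC set, i.e. an uncorrected error with corrupt processor context, which fits a machine that resets instead of recovering. Under the SMT assumption above, APIC 3 and APIC 2 both land on core 1.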

Check my comments here: https://bugzilla.kernel.org/show_bug.cgi?id=206903#c314  and https://bugzilla.kernel.org/show_bug.cgi?id=206903#c316

You need to use Curve Optimizer to ensure all your cores have the correct voltage curve. My advice is to start at -30 per core and reduce the undervolt on a core every time Core Cycler fails on it. I spent a month trying to stabilize my Zen 3 CPU, but once every core had passed hours of error-free testing, I never saw an error like that again.

Last edited by agapito (2024-05-31 05:57:34)


Excuse my poor English.


#9 2023-11-20 18:11:05

Nathan67
Member
Registered: 2023-11-17
Posts: 14

Re: Random reboots with green screen and mce events : bea0000000000108

jojo06 wrote:

[NOT SOLUTION AT ALL] I thought of something like this. Since Linux is open source and everything can be done manually and there are no limits and barriers, I wonder if it is possible to OC processors that do not accept OC in this way. I mean locked ones.
Considering that at low frequencies the system performance drops (or resets due to self-protection), you are actually improving performance by over-volting. Did you observe an increase in performance?

I don't think a positive offset of 4 points will improve performance considerably. I don't want to go any higher, because the CPU would consume more power and I'd also have to dissipate more heat.
I'm not looking to maximise performance, just to avoid crashes.
Even so, I think you're right about performance.


#10 2023-11-20 18:13:04

Nathan67
Member
Registered: 2023-11-17
Posts: 14

Re: Random reboots with green screen and mce events : bea0000000000108

agapito wrote:
Nathan67 wrote:

I typically get this error:

nov. 14 23:56:05 pc kernel: mce: [Hardware Error]: CPU 7: Machine Check: 0 Bank 5: bea0000000000108
nov. 14 23:56:05 pc kernel: mce: [Hardware Error]: TSC 0 ADDR 35913f0e6 MISC d012000100000000 SYND 4d000000 IPID 500b000000000
nov. 14 23:56:05 pc kernel: mce: [Hardware Error]: PROCESSOR 2:a20f12 TIME 1700002559 SOCKET 0 APIC 3 microcode a20120e

That message means your Core 3 needs more voltage.

Check my comments here: https://bugzilla.kernel.org/show_bug.cgi?id=206903#c314  and https://bugzilla.kernel.org/show_bug.cgi?id=206903#c316

You need to use Curve Optimizer to ensure all your cores have the correct voltage curve. My advice is to use -30 per core and decreasing values every time Core Cycler fails. I spent a month trying to stabilize my Zen 3 CPU, but after every core spent hours of error-free testing, I have never seen an error like that again.

Thanks for the information, but if I set it to -30, the cores will be underpowered, even though we're trying to give them more power.
Perhaps I've misunderstood?


#11 2023-11-20 18:24:43

agapito
Member
From: Who cares.
Registered: 2008-11-13
Posts: 699

Re: Random reboots with green screen and mce events : bea0000000000108

Nathan67 wrote:
agapito wrote:
Nathan67 wrote:

I typically get this error:

nov. 14 23:56:05 pc kernel: mce: [Hardware Error]: CPU 7: Machine Check: 0 Bank 5: bea0000000000108
nov. 14 23:56:05 pc kernel: mce: [Hardware Error]: TSC 0 ADDR 35913f0e6 MISC d012000100000000 SYND 4d000000 IPID 500b000000000
nov. 14 23:56:05 pc kernel: mce: [Hardware Error]: PROCESSOR 2:a20f12 TIME 1700002559 SOCKET 0 APIC 3 microcode a20120e

That message means your Core 3 needs more voltage.

Check my comments here: https://bugzilla.kernel.org/show_bug.cgi?id=206903#c314  and https://bugzilla.kernel.org/show_bug.cgi?id=206903#c316

You need to use Curve Optimizer to ensure all your cores have the correct voltage curve. My advice is to use -30 per core and decreasing values every time Core Cycler fails. I spent a month trying to stabilize my Zen 3 CPU, but after every core spent hours of error-free testing, I have never seen an error like that again.

Thanks for the information, but if I set it to -30, the cores will be underpowered, even though we're trying to give them more power.
Perhaps I've misunderstood?

Yes, that's right, and some quick crashes are to be expected.

What we are looking for is the exact voltage each core needs. Not all cores need the same voltage. That way, in addition to stabilizing the CPU, you'll be getting the most out of it. Usually the two best cores of the processor are the ones that need more voltage. Sometimes you have to raise them to a positive value before they are stable.

The downside is that you need many hours of testing: if you want to be 100% sure, each core has to pass 8 hours of testing. The good part is that your CPU only has 6 cores; mine has 16...
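The procedure described above can be written down as a small search loop. This is only a sketch of the idea in Python, not something that touches the BIOS: is_stable is a hypothetical callback standing in for "set this Curve Optimizer offset for the core, run Core Cycler for as long as you trust, and report whether it passed". The step size and the upper limit are arbitrary choices for illustration.

```python
def tune_core(core, is_stable, start=-30, step=5, max_offset=10):
    """Walk one core's offset up from `start` until the stability test passes.

    Returns the first passing offset, or None if nothing up to `max_offset`
    was stable. `is_stable(core, offset)` is a hypothetical callback.
    """
    offset = start
    while offset <= max_offset:
        if is_stable(core, offset):
            return offset
        offset += step  # back off the undervolt and try again
    return None


def tune_all(cores, is_stable, **kwargs):
    """Tune every core independently, since each one may need its own value."""
    return {core: tune_core(core, is_stable, **kwargs) for core in cores}


if __name__ == "__main__":
    # Made-up numbers for a 6-core CPU: each core "passes" once its offset
    # is at or above its threshold. Real values come from hours of testing.
    thresholds = {0: -20, 1: 4, 2: -30, 3: -12, 4: -25, 5: -5}

    def fake_test(core, offset):
        return offset >= thresholds[core]

    print(tune_all(range(6), fake_test))
```

In practice each call to is_stable costs hours and possibly a crash, which is why the whole process can take weeks on a CPU with many cores.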


Excuse my poor English.


#12 2023-11-20 23:34:08

Nathan67
Member
Registered: 2023-11-17
Posts: 14

Re: Random reboots with green screen and mce events : bea0000000000108

Random reboots with MCE events are still present on the 6.7.0-rc1-1-mainline kernel:

mce: [Hardware Error]: CPU 1: Machine Check: 0 Bank 5: bea0000001000108
mce: [Hardware Error]: TSC 0 ADDR 7f6d8b96ec69 MISC d012000100000000 SYND 4d000000 IPID 500b000000000 
mce: [Hardware Error]: PROCESSOR 2:a20f12 TIME 1700522659 SOCKET 0 APIC 2 microcode a20120e

It seems to happen less frequently, though.

On kernel 6.7, I also get these errors, but they don't seem to have any impact on system stability:

nov. 21 00:36:21 pc kernel: amdgpu 0000:09:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:169 vmid:0 pasid:0, for process  pid 0 thread  pid 0)
nov. 21 00:36:21 pc kernel: amdgpu 0000:09:00.0: amdgpu:   in page starting at address 0x0000808418286000 from client 10
nov. 21 00:36:21 pc kernel: amdgpu 0000:09:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00041B53
nov. 21 00:36:21 pc kernel: amdgpu 0000:09:00.0: amdgpu:          Faulty UTCL2 client ID: SDMA0 (0xd)
nov. 21 00:36:21 pc kernel: amdgpu 0000:09:00.0: amdgpu:          MORE_FAULTS: 0x1
nov. 21 00:36:21 pc kernel: amdgpu 0000:09:00.0: amdgpu:          WALKER_ERROR: 0x1
nov. 21 00:36:21 pc kernel: amdgpu 0000:09:00.0: amdgpu:          PERMISSION_FAULTS: 0x5
nov. 21 00:36:21 pc kernel: amdgpu 0000:09:00.0: amdgpu:          MAPPING_ERROR: 0x1
nov. 21 00:36:21 pc kernel: amdgpu 0000:09:00.0: amdgpu:          RW: 0x1
nov. 21 00:36:21 pc kernel: amdgpu 0000:09:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:169 vmid:0 pasid:0, for process  pid 0 thread  pid 0)
nov. 21 00:36:21 pc kernel: amdgpu 0000:09:00.0: amdgpu:   in page starting at address 0x0000808418286000 from client 10
nov. 21 00:36:21 pc kernel: amdgpu 0000:09:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00041B52
nov. 21 00:36:21 pc kernel: amdgpu 0000:09:00.0: amdgpu:          Faulty UTCL2 client ID: SDMA0 (0xd)
nov. 21 00:36:21 pc kernel: amdgpu 0000:09:00.0: amdgpu:          MORE_FAULTS: 0x0
nov. 21 00:36:21 pc kernel: amdgpu 0000:09:00.0: amdgpu:          WALKER_ERROR: 0x1
nov. 21 00:36:21 pc kernel: amdgpu 0000:09:00.0: amdgpu:          PERMISSION_FAULTS: 0x5
nov. 21 00:36:21 pc kernel: amdgpu 0000:09:00.0: amdgpu:          MAPPING_ERROR: 0x1
nov. 21 00:36:21 pc kernel: amdgpu 0000:09:00.0: amdgpu:          RW: 0x1
nov. 21 00:36:21 pc kernel: amdgpu 0000:09:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:169 vmid:0 pasid:0, for process  pid 0 thread  pid 0)
nov. 21 00:36:21 pc kernel: amdgpu 0000:09:00.0: amdgpu:   in page starting at address 0x0000808418286000 from client 10
nov. 21 00:36:21 pc kernel: amdgpu 0000:09:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
nov. 21 00:36:21 pc kernel: amdgpu 0000:09:00.0: amdgpu:          Faulty UTCL2 client ID: CB/DB (0x0)
nov. 21 00:36:21 pc kernel: amdgpu 0000:09:00.0: amdgpu:          MORE_FAULTS: 0x0
nov. 21 00:36:21 pc kernel: amdgpu 0000:09:00.0: amdgpu:          WALKER_ERROR: 0x0
nov. 21 00:36:21 pc kernel: amdgpu 0000:09:00.0: amdgpu:          PERMISSION_FAULTS: 0x0
nov. 21 00:36:21 pc kernel: amdgpu 0000:09:00.0: amdgpu:          MAPPING_ERROR: 0x0
nov. 21 00:36:21 pc kernel: amdgpu 0000:09:00.0: amdgpu:          RW: 0x0

Last edited by Nathan67 (2023-11-20 23:36:54)


#13 2023-11-21 09:23:31

agapito
Member
From: Who cares.
Registered: 2008-11-13
Posts: 699

Re: Random reboots with green screen and mce events : bea0000000000108

Nathan67 wrote:

Random reboots with mce events are still present on the 6.7.0-rc1-1-mainline kernel

NO. The random reboots occur because you have not calibrated the CPU voltage core by core, as I told you to do with the Core Cycler tool.

Kernel is completely fine.


Excuse my poor English.


#14 2023-11-21 18:51:16

Nathan67
Member
Registered: 2023-11-17
Posts: 14

Re: Random reboots with green screen and mce events : bea0000000000108

agapito wrote:
Nathan67 wrote:

Random reboots with mce events are still present on the 6.7.0-rc1-1-mainline kernel

NO. Random reboots occur in your CPU because you have not calibrated the voltage core per core, as I told you to do with the Core Cycler tool.

Kernel is completely fine.

The kernel is fine with the tweaks, but maybe it could be improved to manage power better on Ryzen.
I'll have to implement your solution, but it takes time.


#15 2023-11-26 16:46:56

Nathan67
Member
Registered: 2023-11-17
Posts: 14

Re: Random reboots with green screen and mce events : bea0000000000108

agapito wrote:
Nathan67 wrote:

Random reboots with mce events are still present on the 6.7.0-rc1-1-mainline kernel

NO. Random reboots occur in your CPU because you have not calibrated the voltage core per core, as I told you to do with the Core Cycler tool.

Kernel is completely fine.

I used to have an Nvidia GeForce GTX 1080 graphics card and I never had a crash with this same processor.
Until I got my RX 7800XT, I had an R9 270X and no crashes.
How can I be sure that it's the CPU, if the problem has been there since I changed cards?


#16 2023-11-26 16:51:06

Nathan67
Member
Registered: 2023-11-17
Posts: 14

Re: Random reboots with green screen and mce events : bea0000000000108

Maybe the MCE error is linked to something else.
During the crash, my main screen is black and the second one is green.


#17 2023-11-26 16:51:22

seth
Member
From: Don't DM me only for attention
Registered: 2012-09-03
Posts: 69,404

Re: Random reboots with green screen and mce events : bea0000000000108

The GPU won't cause a reboot; the kernel would rather halt, or you'd just get a lot of oopses from the GPU module.
The new GPU likely draws more power from the board?
Did you attempt to alter the voltage?


#18 2023-12-05 20:57:13

Nathan67
Member
Registered: 2023-11-17
Posts: 14

Re: Random reboots with green screen and mce events : bea0000000000108

seth wrote:

The GPU won't cause a reboot, the kernel would rather halt or you just get a lot of oopses from the GPU module.
The new GPU likely draws more power from the board?
Did you attempt to alter the voltage?

According to the manufacturer's documentation, my card is designed to run at 2430 MHz maximum. By limiting its frequency to 2200 MHz, my system crashes much less.

I use CoreCtrl for this. To be able to change the card's frequency, I had to pass the following kernel parameter: amdgpu.ppfeaturemask=0xffffffff

Crashes still occur, but less frequently.

I've found a way to trigger the crash reliably with the following command. It turns my main screen black and the second one green, and after rebooting I find the same MCE error (bea0000000000108):

! Warning: if you try this, your system will crash !

sudo cat /sys/kernel/debug/dri/0000:09:00.0/amdgpu_regs

If I've understood correctly, the amdgpu_regs file gives access to the GPU's MMIO registers for debugging purposes. Reading it with "cat" reads every register offset in turn and, as some of those registers don't exist, this leads to a crash.

Last edited by Nathan67 (2023-12-05 21:07:15)


#19 2023-12-06 08:21:17

seth
Member
From: Don't DM me only for attention
Registered: 2012-09-03
Posts: 69,404

Re: Random reboots with green screen and mce events : bea0000000000108

Is your PSU simply undersized? Did you forget to attach the dedicated 6/8-pin power connector(s) to the GPU? Are you using the proper PEG slot?
Did you attempt to adjust the voltage on the CPU?


#20 2023-12-10 13:55:57

micke1m
Member
Registered: 2017-08-27
Posts: 13

Re: Random reboots with green screen and mce events : bea0000000000108

I am also getting random reboots with the same MCE messages.
My most recent hardware change was the 7900 XT in January, and it worked flawlessly from then until these random reboots started on November 24.
I have tried increasing the CPU voltage from the default 1.1 V up to 1.26 V, and it still reboots randomly.

Ryzen 3900X
PowerColor Radeon RX 7900 XT
Gigabyte X570 Aorus Elite
Corsair AX860i 860W

Last edited by micke1m (2023-12-10 13:59:34)


#21 2023-12-10 14:00:19

seth
Member
From: Don't DM me only for attention
Registered: 2012-09-03
Posts: 69,404

Re: Random reboots with green screen and mce events : bea0000000000108


#22 2023-12-10 14:05:52

Nathan67
Member
Registered: 2023-11-17
Posts: 14

Re: Random reboots with green screen and mce events : bea0000000000108

seth wrote:

Is your PSU simply underdimensioned? Did you forget to attach the dedicated 6/8-pin power connector(s) to the GPU? Are you using the proper PEG slot?
Did you attempt to adjust the voltage on the CPU?

My power supply is 650 watts, which is more than enough.
All the connectors are connected to the graphics card, and it can go up to over 200 watts without any problems.
My card is in the PCIEX16_1 slot.
I've adjusted the processor voltage by setting a positive offset of 4 points, but it doesn't seem to make any difference.


#23 2023-12-10 14:06:47

Nathan67
Member
Registered: 2023-11-17
Posts: 14

Re: Random reboots with green screen and mce events : bea0000000000108

seth wrote:

For me, the problem is present on kernel 6.7.0-rc3-1-mainline with the following parameters:
BOOT_IMAGE=/vmlinuz-linux-mainline root=UUID=2e401464-c3fd-48e2-9da6-969247763033 rw loglevel=3 quiet amdgpu.ppfeaturemask=0xffffffffff
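
A side note on the line above: the mask there has ten hex digits, while the one posted earlier in the thread (0xffffffff) has eight. As far as I know amdgpu.ppfeaturemask is a 32-bit unsigned value, so the longer form would not fit; whether the kernel then rejects or ignores it is not something this thread confirms. A small Python sketch for checking a command line, with made-up helper names:

```python
def find_param(cmdline, name):
    """Return the value of the last `name=value` token on a kernel command line."""
    value = None
    for token in cmdline.split():
        key, sep, val = token.partition("=")
        if sep and key == name:
            value = val  # later occurrences override earlier ones
    return value


def fits_in_32_bits(value):
    """True if `value` (decimal or 0x-prefixed hex string) fits in an unsigned 32-bit int."""
    return 0 <= int(value, 0) <= 0xFFFFFFFF


if __name__ == "__main__":
    # Command line as posted above; on a live system read /proc/cmdline instead.
    posted = ("BOOT_IMAGE=/vmlinuz-linux-mainline "
              "root=UUID=2e401464-c3fd-48e2-9da6-969247763033 rw loglevel=3 "
              "quiet amdgpu.ppfeaturemask=0xffffffffff")
    mask = find_param(posted, "amdgpu.ppfeaturemask")
    # ASSUMPTION: the parameter is a 32-bit unsigned integer.
    print(mask, "fits in 32 bits:", fits_in_32_bits(mask))
```

If the check fails, comparing against the value the driver actually uses (readable under /sys/module/amdgpu/parameters/ on a running system) would show whether the setting took effect.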


#24 2023-12-12 10:38:26

micke1m
Member
Registered: 2017-08-27
Posts: 13

Re: Random reboots with green screen and mce events : bea0000000000108

Just had another random reboot with 6.7.0-rc4-1-mainline.


#25 2023-12-12 14:56:19

seth
Member
From: Don't DM me only for attention
Registered: 2012-09-03
Posts: 69,404

Re: Random reboots with green screen and mce events : bea0000000000108

What if you drastically limit the frequencies or use the powersave governor?
https://wiki.archlinux.org/title/CPU_fr … requencies
https://wiki.archlinux.org/title/CPU_fr … _governors
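
Before changing anything, it may help to see which scaling governor each CPU is currently using. A minimal Python sketch, assuming the usual cpufreq sysfs layout under /sys/devices/system/cpu; the function name is invented here and the root path is a parameter only so the function can be tried against a test directory:

```python
from pathlib import Path


def read_governors(cpu_root="/sys/devices/system/cpu"):
    """Map each cpuN directory that has cpufreq support to its current governor."""
    governors = {}
    pattern = "cpu[0-9]*/cpufreq/scaling_governor"
    for path in sorted(Path(cpu_root).glob(pattern)):
        cpu = path.parent.parent.name  # e.g. "cpu0"
        governors[cpu] = path.read_text().strip()
    return governors


if __name__ == "__main__":
    for cpu, governor in read_governors().items():
        print(cpu, governor)
```

Reading these files needs no special privileges; writing a different governor to them does require root.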
