#51 2024-04-27 09:28:01

noisypiano
Member
Registered: 2024-04-23
Posts: 52

Re: Random reboots with green screen and MCE events

noisypiano wrote:

Alright, this is the closest I could get. It's hard to hit exact increments of 0.1 and 0.2.

P0 - Enabled - FID = 8C - DID = 8 - VID = 48 - Ratio = 35.00 - vCore = 1.10000
P1 - Enabled - FID = 8C - DID = A - VID = 56 - Ratio = 28.00 - vCore = 1.01250
P2 - Enabled - FID = 84 - DID = C - VID = 64 - Ratio = 22.00 - vCore = 0.92500

I got a green screen shortly after setting the above voltages. Right now my voltages are:

P0 - Enabled - FID = 8C - DID = 8 - VID = 48 - Ratio = 35.00 - vCore = 1.10000
P1 - Enabled - FID = 8C - DID = A - VID = 54 - Ratio = 28.00 - vCore = 1.02500
P2 - Enabled - FID = 84 - DID = C - VID = 63 - Ratio = 22.00 - vCore = 0.93125
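
For reference, the Ratio and vCore columns above can be recomputed from the hex FID/DID/VID fields. This is only a minimal Python sketch, assuming the commonly cited Zen P-state encoding (ratio = 2 * FID / DID, vCore = 1.55 V - 0.00625 V * VID); the function names are made up for illustration and are not part of ZenStates. Since one VID step is 6.25 mV under that assumption, a 0.1 V change is exactly 16 steps, and other targets have to be rounded to the nearest step.

```python
# Decode ZenStates-style P-state fields (hex strings) into ratio and vCore.
# Assumption: ratio = 2 * FID / DID and vCore = 1.55 - 0.00625 * VID.

VID_STEP_V = 0.00625  # assumed size of one VID step, in volts
VID_BASE_V = 1.55     # assumed voltage at VID = 0


def decode_pstate(fid_hex: str, did_hex: str, vid_hex: str):
    """Return (ratio, vcore) for the given hex FID, DID and VID."""
    fid, did, vid = (int(x, 16) for x in (fid_hex, did_hex, vid_hex))
    ratio = 2 * fid / did
    vcore = VID_BASE_V - VID_STEP_V * vid
    return ratio, round(vcore, 5)


def vid_for_voltage(target_v: float) -> str:
    """Return the hex VID whose voltage is closest to target_v."""
    vid = round((VID_BASE_V - target_v) / VID_STEP_V)
    return format(vid, "X")


# The three P-states currently set in the post above.
for name, fid, did, vid in [("P0", "8C", "8", "48"),
                            ("P1", "8C", "A", "54"),
                            ("P2", "84", "C", "63")]:
    ratio, vcore = decode_pstate(fid, did, vid)
    print(f"{name}: ratio={ratio:.2f} vCore={vcore:.5f}")
```

With this, vid_for_voltage(1.05) gives the VID to enter for a 1.05 V target without searching by hand.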

Last edited by noisypiano (2024-04-27 09:37:52)

Offline

#52 2024-04-27 10:14:50

noisypiano
Member
Registered: 2024-04-23
Posts: 52

Re: Random reboots with green screen and MCE events

I got another crash with the above voltages. The very end of the journalctl log from the boot where the crash happened contains these kernel messages:

Apr 27 15:45:18 kernel: clocksource: Long readout interval, skipping watchdog check: cs_nsec: 1887037486 wd_nsec: 1887036480
Apr 27 15:45:21 kernel: amdgpu 0000:07:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:18 param:0x00000005 message:TransferTableSmu2Dram?
Apr 27 15:45:21 kernel: amdgpu 0000:07:00.0: amdgpu: Failed to export SMU metrics table!
Apr 27 15:45:22 kernel: amdgpu 0000:07:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:18 param:0x00000005 message:TransferTableSmu2Dram?
Apr 27 15:45:22 kernel: amdgpu 0000:07:00.0: amdgpu: Failed to export SMU metrics table!
Apr 27 15:45:22 kernel: amdgpu 0000:07:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:18 param:0x00000005 message:TransferTableSmu2Dram?
Apr 27 15:45:22 kernel: amdgpu 0000:07:00.0: amdgpu: Failed to export SMU metrics table!
Apr 27 15:45:22 kernel: amdgpu 0000:07:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:18 param:0x00000005 message:TransferTableSmu2Dram?
Apr 27 15:45:22 kernel: amdgpu 0000:07:00.0: amdgpu: Failed to export SMU metrics table!
Apr 27 15:45:22 kernel: amdgpu 0000:07:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:18 param:0x00000005 message:TransferTableSmu2Dram?
Apr 27 15:45:22 kernel: amdgpu 0000:07:00.0: amdgpu: Failed to export SMU metrics table!
Apr 27 15:45:22 kernel: amdgpu 0000:07:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:18 param:0x00000005 message:TransferTableSmu2Dram?
Apr 27 15:45:22 kernel: amdgpu 0000:07:00.0: amdgpu: Failed to export SMU metrics table!

Seems like the kernel reports a different type of error each time.

However, after the above failed, I have now set the following voltages:

P0 - Enabled - FID = 8C - DID = 8 - VID = 48 - Ratio = 35.00 - vCore = 1.10000
P1 - Enabled - FID = 8C - DID = A - VID = 53 - Ratio = 28.00 - vCore = 1.03125
P2 - Enabled - FID = 84 - DID = C - VID = 61 - Ratio = 22.00 - vCore = 0.94375

Offline

#53 2024-04-27 10:32:18

noisypiano
Member
Registered: 2024-04-23
Posts: 52

Re: Random reboots with green screen and MCE events

I just had another crash, but unlike the previous crash, there was no kernel message.
However, I think I have found how to reproduce the crash.
First log in to your DE, log out, log in, log out. Do it 3 or 4 times. Open Firefox or Chromium or whatever browser you use, then open a page that doesn't really do anything (a random Google search is enough) and leave your computer in that state. After 20 to 30 minutes, the computer will crash.

EDIT: After testing with other kernels, it seems this is only valid on the 6.8.7 kernel.

Last edited by noisypiano (2024-05-03 00:37:54)

Offline

#54 2024-04-27 10:43:20

noisypiano
Member
Registered: 2024-04-23
Posts: 52

Re: Random reboots with green screen and MCE events

noisypiano wrote:

However, I think I have found how to reproduce the crash.
First log in to your DE, log out, log in, log out. Do it 3 or 4 times. Open Firefox or Chromium or whatever browser you use, then open a page that doesn't really do anything (a random Google search is enough) and leave your computer in that state. After 20 to 30 minutes, the computer will crash.

100% sure. Just got another green screen following the procedure. It didn't even take 10 minutes.

Offline

#55 2024-04-27 10:55:05

noisypiano
Member
Registered: 2024-04-23
Posts: 52

Re: Random reboots with green screen and MCE events

Just got another green screen crash after following the same procedure.
This time I used these voltage settings in Zenstates:

P0 - Enabled - FID = 8C - DID = 8 - VID = 40 - Ratio = 35.00 - vCore = 1.15000
P1 - Enabled - FID = 8C - DID = A - VID = 50 - Ratio = 28.00 - vCore = 1.05000
P2 - Enabled - FID = 84 - DID = C - VID = 60 - Ratio = 22.00 - vCore = 0.95000

Seems like the voltage settings in Zenstates are being rejected by the BIOS or whatever. Or maybe I have to increase the voltage a little more? I don't know anything at this point.

However this time at the very end of journalctl I got some other kernel messages. This is the journalctl of the boot where the crash happened.

Apr 27 16:51:38 kernel: snd_hda_intel 0000:07:00.1: Unable to change power state from D3hot to D0, device inaccessible
Apr 27 16:51:41 kernel: clocksource: timekeeping watchdog on CPU10: hpet wd-wd read-back delay of 403411816ns
Apr 27 16:51:41 kernel: clocksource: wd-tsc-wd read-back delay of 403412235ns, clock-skew test skipped

Last edited by noisypiano (2024-04-27 10:59:59)

Offline

#56 2024-04-27 11:59:10

agapito
Member
From: Who cares.
Registered: 2008-11-13
Posts: 664

Re: Random reboots with green screen and MCE events

noisypiano wrote:

Seems like the voltage settings in Zenstates are being rejected by the BIOS or whatever. Or maybe I have to increase the voltage a little more? I don't know anything at this point.

The voltages are probably not being changed; in fact, you have increased them quite a bit and it hasn't worked. Stop doing that.

noisypiano wrote:

However, I think I have found how to reproduce the crash.

That's the typical situation almost all Zen 3 users have experienced: no load = cores sleep = low temps = light load arrives = high boost thanks to good temps = not enough voltage to handle that frequency = crash.

If you install Windows, try CoreCycler and you will see your cores crashing sooner or later; how long it takes to fail that test will tell you how broken your CPU is. Given how frequently you suffer reboots, I am sure that at least one of your cores would fail the test within 10 seconds.


Excuse my poor English.

Offline

#57 2024-04-28 09:45:34

noisypiano
Member
Registered: 2024-04-23
Posts: 52

Re: Random reboots with green screen and MCE events

I cannot reproduce this on Windows 11 23H2 following the same procedure as on Linux. I will report back if this happens on Windows as well.

Offline

#58 2024-04-28 11:44:31

agapito
Member
From: Who cares.
Registered: 2008-11-13
Posts: 664

Re: Random reboots with green screen and MCE events

noisypiano wrote:

I cannot reproduce this on Windows 11 23H2 following the same procedure as on Linux. I will report back if this happens on Windows as well.

Have you tried CoreCycler yet?

https://github.com/sp00n/corecycler


Excuse my poor English.

Offline

#59 2024-04-28 15:55:00

noisypiano
Member
Registered: 2024-04-23
Posts: 52

Re: Random reboots with green screen and MCE events

agapito wrote:

Have you tried CoreCycler yet?

I have already looked at that tool. But no, I don't have the sanity to run a 60+ hour test to find out if all my cores are stable. Also, it basically seems to stress each core to find out which one is faulty, but I don't have a problem with higher loads, as I have said before. I'd just use Windows the way I used to use Linux and try an idle test to reproduce this error. If my cores are bad and it happens on Linux in daily usage, it should happen on Windows that way too.

Last edited by noisypiano (2024-04-29 09:02:57)

Offline

#60 2024-04-29 05:55:11

agapito
Member
From: Who cares.
Registered: 2008-11-13
Posts: 664

Re: Random reboots with green screen and MCE events

noisypiano wrote:
agapito wrote:

Have you tried CoreCycler yet?

I have already looked at that tool. But no, I don't have the sanity to run a 60+ hour test to find out if all my cores are stable. Also, it basically seems to stress each core to find out which one is faulty, but I don't have a problem with higher loads, as I have said before. I'd just use this like I used to in Linux and try an idle test to reproduce this error. If my cores are bad and it happens on Linux in daily usage, it should happen on Windows that way too.

It is the tool that will allow you to certify that your CPU is really broken and that your restarts are not due to any other reason. Transient loads are what CoreCycler tests, not higher loads. You don't need 10 hours; you will crash in a few minutes or even seconds. If you make use of the warranty, it will be easier if you say that you get instantaneous errors in CoreCycler and that your cheap motherboard does not allow you to change the CPU parameters.


Excuse my poor English.

Offline

#61 2024-04-29 07:27:27

noisypiano
Member
Registered: 2024-04-23
Posts: 52

Re: Random reboots with green screen and MCE events

Alright I get this error in CoreCycler

ERROR: 13:21:27
ERROR: Prime95 seems to have stopped with an error!
ERROR: At Core 1 (CPU 2)
ERROR MESSAGE: The Prime95 process doesn't use enough CPU power anymore (only 0% instead of the expected 8.33%)
                 + No FFT size provided in the error message, make an educated guess.
ERROR: The last *passed* FFT size before the error was: 8960K
ERROR: Unfortunately FFT size fail detection only works for Smallest, Small or Large FFT sizes.

Every core throws an error except core 0. But I don't have any problems in daily usage. What does this error mean?
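
One possible reading of the 8.33% figure: it matches the share of total CPU time that a single fully loaded thread takes on a machine with 12 logical processors (100 / 12). The count of 12 is an assumption inferred only from the number in the message, and the helper below is a hypothetical Python illustration, not CoreCycler's own code.

```python
def expected_share_percent(logical_cpus: int, busy_threads: int = 1) -> float:
    """Share of total CPU time used by `busy_threads` fully loaded threads."""
    return round(100 * busy_threads / logical_cpus, 2)


# Assumed: 12 logical processors, one stress-test thread.
print(expected_share_percent(12))  # 8.33
```

Under that reading, "only 0%" would mean the Prime95 process stopped consuming CPU time altogether, which is what the script flags as an error.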

Last edited by noisypiano (2024-04-29 07:29:51)

Offline

#62 2024-04-29 08:18:21

agapito
Member
From: Who cares.
Registered: 2008-11-13
Posts: 664

Re: Random reboots with green screen and MCE events

noisypiano wrote:

Alright I get this error in CoreCycler

ERROR: 13:21:27
ERROR: Prime95 seems to have stopped with an error!
ERROR: At Core 1 (CPU 2)
ERROR MESSAGE: The Prime95 process doesn't use enough CPU power anymore (only 0% instead of the expected 8.33%)
                 + No FFT size provided in the error message, make an educated guess.
ERROR: The last *passed* FFT size before the error was: 8960K
ERROR: Unfortunately FFT size fail detection only works for Smallest, Small or Large FFT sizes.

Every core throws an error except core 0. But I don't have any problems in daily usage. What does this error mean?

It seems like you hit this bug: https://github.com/sp00n/corecycler/issues/49

Disable C-States and try it again or just boot in safe mode (best option).


Excuse my poor English.

Offline

#63 2024-04-29 09:00:56

noisypiano
Member
Registered: 2024-04-23
Posts: 52

Re: Random reboots with green screen and MCE events

I had debloated my Windows installation using a tool called ReviOS. I was running into a lot of problems, including this error from CoreCycler. I have now reinstalled Windows, everything is stock, and CoreCycler is passing all the tests. I'm on iteration 15 right now, with no errors reported yet. Will check more.

Offline

#64 2024-04-29 09:31:14

agapito
Member
From: Who cares.
Registered: 2008-11-13
Posts: 664

Re: Random reboots with green screen and MCE events

noisypiano wrote:

I had debloated my Windows installation using a tool called ReviOS. I was running into a lot of problems, including this error from CoreCycler. I have now reinstalled Windows, everything is stock, and CoreCycler is passing all the tests. I'm on iteration 15 right now, with no errors reported yet. Will check more.

OK, boot Windows 11 in safe mode and use CoreCycler with these settings:

FFTSize = 720-720
suspendPeriodically = 1
mode = SSE

Leave the computer alone for a couple of hours; if you come back and find no errors, then it is probably the GPU that is causing your restarts.


Excuse my poor English.

Offline

#65 2024-04-29 10:33:43

noisypiano
Member
Registered: 2024-04-23
Posts: 52

Re: Random reboots with green screen and MCE events

noisypiano wrote:

I had debloated my Windows installation using a tool called ReviOS. I was running into a lot of problems, including this error from CoreCycler. I have now reinstalled Windows, everything is stock, and CoreCycler is passing all the tests. I'm on iteration 15 right now, with no errors reported yet. Will check more.

The script ran for 01 hours, 04 minutes, 43 seconds. 51 iterations have passed. No errors.

Offline

#66 2024-04-29 10:48:51

noisypiano
Member
Registered: 2024-04-23
Posts: 52

Re: Random reboots with green screen and MCE events

agapito wrote:

OK, boot Windows 11 in safe mode and use CoreCycler with these settings:

FFTSize = 720-720
suspendPeriodically = 1
mode = SSE

I might, but I'm not sure. I think it's obvious at this point that the CPU is working fine under Windows.

agapito wrote:

Leave the computer alone for a couple of hours; if you come back and find no errors, then it is probably the GPU that is causing your restarts

I think the first thing I will throw out of the equation is the GPU. I don't think it's possible for a GPU to trigger a CPU failure. I think it's impossible. On the other hand, it's entirely possible for the CPU to trigger not only a GPU failure but a whole-system failure. I have run multiple graphics stress tests and the GPU works just fine. I have left Windows idle for 40 to 50 minutes and still had no crashes. I am not gonna fix what ain't broke. All the errors and reports point towards the CPU.

Last edited by noisypiano (2024-04-29 11:14:49)

Offline

#67 2024-04-29 10:58:43

noisypiano
Member
Registered: 2024-04-23
Posts: 52

Re: Random reboots with green screen and MCE events

The final conclusion I have come to is that, as the ArchWiki says, maybe the CPU has degraded and adding some voltage offset to all CPU cores might fix the issue. AMD is probably aware of the situation and made some driver patches that automatically do this on Windows.

I have tried the 6.6 LTS kernel but it still crashed. I have used Linux on this machine since the regular 6.6 kernel and I had no errors back then. I think that rather than using an LTS kernel, I have to use the exact kernel I had no errors on to narrow this down. I have only tried Arch Linux, but I think I should try other distros as well.

Last edited by noisypiano (2024-04-29 10:59:51)

Offline

#68 2024-04-29 11:56:26

agapito
Member
From: Who cares.
Registered: 2008-11-13
Posts: 664

Re: Random reboots with green screen and MCE events

noisypiano wrote:

The final conclusion I have come to is that, as the ArchWiki says, maybe the CPU has degraded and adding some voltage offset to all CPU cores might fix the issue.

No. If your CPU passes CoreCycler without errors, then your CPU is fine. If your CPU were broken, it would also reboot under Windows.


noisypiano wrote:

I think the first thing I will throw out of the equation is the GPU. I don't think it's possible for a GPU to trigger a CPU failure.

A bad CPU voltage setting can cause errors even on NVMe drives. It is perfectly possible that the GPU is the cause of the instability, and in this forum we have had some examples:

https://bbs.archlinux.org/viewtopic.php … 2#p2151162
https://bbs.archlinux.org/viewtopic.php … 3#p2154573

And I personally also suffered some years ago from an orange screen before reboot: https://bugzilla.kernel.org/show_bug.cgi?id=206903#c262


Excuse my poor English.

Offline

#69 2024-04-29 13:38:52

noisypiano
Member
Registered: 2024-04-23
Posts: 52

Re: Random reboots with green screen and MCE events

agapito wrote:

A bad CPU voltage setting can cause errors even on NVMe drives. It is perfectly possible that the GPU is the cause of the instability, and in this forum we have had some examples: ...

I have never tweaked my CPU or GPU.

agapito wrote:

And I personally also suffered some years ago from an orange screen before reboot: https://bugzilla.kernel.org/show_bug.cgi?id=206903#c262

I have never executed a command to modify the GPU voltages. I did not do anything. If it's a GPU issue, it might be because of the Linux kernel, not my hardware. Also, please don't tell me my CPU/GPU is not getting enough power. 650W is more than enough for the setup I have. My 245W GPU is capped at 200W in Linux, and 650W would be more than enough even at the full 245W. Also, you cannot modify the settings of 7000 series GPUs in Linux no matter what kernel parameter you set. I couldn't even modify the fan curve.
However, I am gonna install Arch and try the amdgpu.ppfeaturemask kernel parameter soon.
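
For anyone checking this: the parameter belongs to the amdgpu module (amdgpu.ppfeaturemask), and the value currently in effect should be readable from /sys/module/amdgpu/parameters/ppfeaturemask. A minimal Python sketch, assuming the overdrive feature is bit 14 of the mask (0x4000, PP_OVERDRIVE_MASK in the kernel's amdgpu headers); the function names are illustrative only.

```python
from pathlib import Path
from typing import Optional

PP_OVERDRIVE_MASK = 0x4000  # assumed: bit 14 of the amdgpu feature mask
PARAM_PATH = Path("/sys/module/amdgpu/parameters/ppfeaturemask")


def overdrive_enabled(mask: int) -> bool:
    """True if the overdrive bit is set in the given feature mask."""
    return bool(mask & PP_OVERDRIVE_MASK)


def read_mask(path: Path = PARAM_PATH) -> Optional[int]:
    """Read the mask from sysfs; None if amdgpu is not loaded or unreadable."""
    try:
        # Base 0 accepts both "0x..." and plain decimal text.
        return int(path.read_text().strip(), 0)
    except (OSError, ValueError):
        return None


mask = read_mask()
if mask is None:
    print("amdgpu ppfeaturemask is not readable on this machine")
else:
    state = "on" if overdrive_enabled(mask) else "off"
    print(f"ppfeaturemask={mask:#010x} overdrive={state}")
```

If the bit is off, tools that rely on overdrive will typically not be able to change clocks or voltages, whatever else is configured.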

Last edited by noisypiano (2024-04-29 13:49:15)

Offline

#70 2024-04-29 14:24:59

agapito
Member
From: Who cares.
Registered: 2008-11-13
Posts: 664

Re: Random reboots with green screen and MCE events

noisypiano wrote:

I have never tweaked my CPU or GPU. I have never executed a command to modify the GPU voltages. I did not do anything.

And who is talking here about you having modified the voltages? I have only shown you some examples of how the GPU can cause an automatic MCE reboot even with stock settings. Remember what I wrote in this message: https://bbs.archlinux.org/viewtopic.php … 7#p2151647

2 different causes with the same consequence (reboot).

Thanks to CoreCycler you have discarded one.

noisypiano wrote:

If it's a GPU issue it might be because of the Linux kernel not my hardware.

Really? https://bugzilla.kernel.org/show_bug.cgi?id=206903#c279

This is not an OS problem; this is broken or misconfigured hardware.

noisypiano wrote:

Also you cannot modify the settings of 7000 series GPUs in Linux no matter what kernel parameter you set. I couldn't even modify the fan curve.
However, I am gonna install Arch and try the amdgpu.ppfeaturemask kernel parameter soon.

Not entirely true: https://github.com/ilya-zlobintsev/LACT


Excuse my poor English.

Offline

#71 2024-04-29 14:50:48

noisypiano
Member
Registered: 2024-04-23
Posts: 52

Re: Random reboots with green screen and MCE events

agapito wrote:

And who is talking here about you having modified the voltages? I have only shown you some examples of how the GPU can cause an automatic MCE reboot even with stock setting

In the Bugzilla report you linked about your orange screen, I believe you were modifying the GPU voltage. That's why I said that.

noisypiano wrote:

If it's a GPU issue it might be because of the Linux kernel not my hardware.

The error clearly indicates a CPU failure. I don't understand where you got the GPU from. I also don't understand your reasoning that a broken GPU can cause a CPU to cease functioning.

noisypiano wrote:

If it's a GPU issue it might be because of the Linux kernel not my hardware.

agapito wrote:

Really? https://bugzilla.kernel.org/show_bug.cgi?id=206903#c279

THIS BUG IS CAUSED OR TRIGGERED BY THE GPU. 101% SURE, BECAUSE I JUST EXPERIENCED IT IN WINDOWS 10.

If you look at my previous posts you will see that I never suffered this error until 2 months ago, when I was trying to change my GPU overclock by echoing commands. Since then, I have not suffered from it again because I didn't touch my GPU overclock settings.

Today I was testing my CPU stability after flashing ComboAm4v2PI 1.2.0.5 firmware using Prime95 (Windows 10). 10 hours and 0 errors later I tried to change all GPU power state frequencies to stable and well-tested values using Radeon Software while Prime95 was still running, and then: black screen and spontaneous reboot to Linux. When Linux was loading I could see the infamous kernel message: mce: [Hardware Error]: CPU 0: Machine Check.

Before the Ryzen 3700X I had paired my RX570 with an Intel CPU, and I can remember my GPU crashing or freezing when I tried to change overclock settings in Windows.

This is not an OS problem; this is broken or misconfigured hardware.

Broken or misconfigured hardware that magically works on Windows without any issues and doesn't on Linux? Okay!

agapito wrote:
noisypiano wrote:

Also you cannot modify the settings of 7000 series GPUs in Linux no matter what kernel parameter you set. I couldn't even modify the fan curve.
However, I am gonna install Arch and try the amdgpu.ppfeaturemask kernel parameter soon.

Not entirely true: https://github.com/ilya-zlobintsev/LACT

I was talking about CoreCtrl. I tried adjusting my fan curve in December 2023. It seems they added support for 7000 series GPUs at the end of March 2024: https://www.phoronix.com/news/CoreCtrl-1.4-Released

Last edited by noisypiano (2024-04-29 16:10:46)

Offline

#72 2024-04-29 15:54:14

seth
Member
Registered: 2012-09-03
Posts: 51,870

Re: Random reboots with green screen and MCE events

Broken or misconfigured hardware that magically works on Windows without any issues and doesn't on Linux? Okay!

https://wiki.archlinux.org/title/Ryzen#Random_reboots

After investigating and reading reports on the Internet, It seems that out of the box, Windows seems to run the CPUs at higher voltage and lower peak frequencies, compared to the stock linux kernel, which depending on your draw from the silicon lottery could cause a host of random application crashes or hardware errors that lead to reboots.

Also, what kind of desktop session do you run and what if you try openbox (no picom)?

Offline

#73 2024-04-29 16:03:19

noisypiano
Member
Registered: 2024-04-23
Posts: 52

Re: Random reboots with green screen and MCE events

noisypiano wrote:

Broken or misconfigured hardware that magically works on Windows without any issues and doesn't on Linux? Okay!

I have already read that ArchWiki page. I know, and I already gave an explanation in post #67. I said that because agapito thinks my CPU is okay and that it's my GPU that is broken or misconfigured. I think you missed the context.

seth wrote:

Also, what kind of desktop session do you run and what if you try openbox (no picom)?

I run Hyprland and I don't have picom. I haven't tried openbox yet. I think I am gonna stay on Windows for a few days for testing purposes. Also, sorry that I have told you many times I would do it later and then didn't do anything. I really am gonna try it this time.

Last edited by noisypiano (2024-04-29 16:04:51)

Offline

#74 2024-04-29 16:13:30

agapito
Member
From: Who cares.
Registered: 2008-11-13
Posts: 664

Re: Random reboots with green screen and MCE events

noisypiano wrote:

The error clearly indicates a CPU failure. I don't understand where you got the GPU from. I also don't understand your reasoning that a broken GPU can cause a CPU to cease functioning.

Ryzen processors contain more than the cores: they also have an integrated memory controller, and they control the USB and NVMe ports and the PCI Express slot where your GPU is connected.

noisypiano wrote:

Broken or misconfigured hardware that magically works on Windows without any issues and doesn't on Linux? Okay!

So tell me why I had an orange screen of death on Windows and on the next Linux reboot I could see the same MCE error message you have seen in the last few days.

Tell me why I had those colored reboots with my Polaris card on Windows, paired with an Intel CPU first and a Ryzen CPU later, and why they stopped when I swapped that Polaris card for a Navi 2 card.

If this is a Linux Ryzen CPU bug, tell me why I haven't seen any unexpected reboot in the last two years. Why is my 5950X not affected by that bug?


Excuse my poor English.

Offline

#75 2024-04-29 16:18:50

noisypiano
Member
Registered: 2024-04-23
Posts: 52

Re: Random reboots with green screen and MCE events

agapito wrote:

Ryzen processors contain more than the cores: they also have an integrated memory controller, and they control the USB and NVMe ports and the PCI Express slot where your GPU is connected.

Not only Ryzen but Intel as well uses an integrated memory controller. A broken GPU might cause your PC to restart, but it won't make your CPU cease to function.

agapito wrote:

So tell me why I had an orange screen of death on Windows and on the next Linux reboot I could see the same MCE error message you have seen in the last few days.

Tell me why I had those colored reboots with my Polaris card on Windows, paired with an Intel CPU first and a Ryzen CPU later, and why they stopped when I swapped that Polaris card for a Navi 2 card.

I understand that you got colored reboots in Windows as well, but the thing you are missing here is that I am not having any in Windows! I am still using it for testing; if it ever happens in Windows as well, I can say it might be the GPU.

agapito wrote:

If this is a Linux Ryzen CPU bug, tell me why I haven't seen any unexpected reboot in the last two years. Why is my 5950X not affected by that bug?

I can't really say for sure.

Last edited by noisypiano (2024-04-29 16:26:14)

Offline
