After crash, mce: [Hardware Error] errors when using modesetting.

jonnyDumb · 2022-07-31 14:51:56

Hello

This post is a last-ditch attempt to save myself some money by not having to invest in a range of new hardware. Maybe more experienced users can judge the situation better, so I hope a few people could look at the issue and voice their opinion on what the problem could be, or indeed whether this is considered not fixable by software. I am stuck, as I do not know how to interpret mce errors or how exactly modesetting works such that it can trigger them.

> After a crash one evening before last weekend, somewhat more "violent" than others before, which I mindlessly attribute to glitches in the mains supply, my system kept crashing at different stages afterwards:
- Never with nomodeset, whether added to the "linux" line by pressing 'e' in the grub boot menu or selected as the nomodeset option when booting a SystemRescueCD live CD/pendrive.
- Always with modeset.
- Normally crashes some time before a login prompt is available (I use startx).
- If the login prompt is reached and I log in, it crashes if I try to run any command like vim or dmesg; startx gets as far as turning the screen white, with colorful glitches here and there.
- When crashing, it will reboot sometimes; other times it will freeze indefinitely, indicated by all fans running at full speed.
- If the computer reboots without there being another issue that forces a reset earlier, "mce: [Hardware Error]" messages are shown (see other things to note).

> Other things to note:
- By "violent" I mean that the fans started spinning at full speed, the graphics were glitching, I think I heard a pop through the headphones, and then came a reboot which required several resets to get into the BIOS setup/grub.
- There were issues before this event, I think pretty much since I bought and built the PC:
-- Every second or third boot, the fans would spin up to full speed but then not slow down as they should. I cannot get into the BIOS setup, and the screens go into power save. If this happens after a "modeset crash", no mce errors appear on the next boot, which is unfortunately rather often. As I understand it, this is because the BIOS holds this information: if another reboot happens before it is displayed, the information is lost.
-- Sometimes crashes occurred during use. This was either a freeze, dealt with using the reset button, or the PC rebooted on its own. I attributed these to glitches in the mains supply...

> Troubleshooting:
Searching for "mce Hardware error" and the like, I ran into not very comforting posts about such issues not being fixable and getting worse with time: swapping one component may break the replacement, which may in turn break the replacement for the other component now thought to be at fault. So care is key. One post that gave hope reported that nomodeset made the system usable. SystemRescueCD proved that this was the case for my system too. I ran memtest and started every task in stress except for disk I/O, as my normal disks were not mounted and I did not think to mount them. No issues found after several hours.
- Then I looked more into modesetting. To my understanding this is only relevant for the graphics driver (I use(d) xf86-video-ati for an R7 370 Gaming 2G). I hastily bought a new graphics card. Exactly the same issue.
- Removed xf86-video-ati and, after carefully looking at xorg.log, found that it was looking for fbdev, so I found and installed xf86-video-fbdev. Now I can get into the GUI again with startx, although with only one screen, and Xonotic runs at 2 fps. Supertuxkart and openmw crash and have to be stopped with killall -9 as root. What did I expect...
- Desperately updating Arch at every occasion at this point, I thought the problem might be a bug in the newest kernel, so maybe I needed an older one; the issue arrived a day after an update. Debian live to the rescue, downloading the +nonfree option after being prompted that I apparently do need non-free firmware for my graphics card. Older kernel, same issue. The live system also displays machine check errors (if the previous attempt failed), followed by a crash or unintended reboot. So the kernel version is probably not the issue.
- I had installed rasdaemon and looked at several boot cycles with journalctl -r. I found nothing of note except an error message related to the radeon driver which, going further back in time, proved to have always been there and was part of what misled me into thinking the graphics card must be at fault. I believed I had found the issue when I ran ras-mc-ctl --summary and saw plenty of disk errors. Searching online, I found that the disk error feature is apparently new and buggy, and that its disk read failures are not reliable. The fact that I am using the SSDs right now, with one backed up to the other, supports this (as does booting from a second SSD and from pendrives all yielding the same error).
- Booted from my second hard disk, with a copy of my main system, same results, every time, modeset and nomodeset.
- Unplugged one of the two RAM modules at a time to see if the issue disappears. It did not. Removing both sticks, just for fun, does not even give a BIOS message. Disappointing.
- The last desperate attempt before posting this was to take the CPU out and clean its pins with IPA, removing grime here and there from the mainboard while at it. Fresh thermal paste, extra care. Same issue.

> Conclusion so far:
This feels like one of those mind-bending riddles: what hardware is modeset most related to, but is not the graphics card? What causes plenty of "mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 5: ..." errors but is not the CPU, and is still related to modeset?
I am using the system in question to type these lines, listen to music and browse the internet. Can it be the mainboard?
From the research online, I would attempt the following: replace mainboard and CPU at once, to not risk cross-contamination. I obviously already got a newer, matching GPU. As the CPU-RAM combination and the SSDs are in continuous use, I cannot see those being involved in the failure... This is at least the plan, but there is hope that some other proposal may solve the issue, or inform me that it would be risky not to replace the RAM and SSD drives too... The brand new GPU was plugged into the current system for 10 minutes or so. I guess that is something I have to risk as well; too much paranoia is not helpful either.

> Errors:
Full mce error example. The CPU and plenty of other numbers seem to change, though they keep returning to values seen before. After a closer look just now, the PROCESSOR 2 line seems to always be the same (not just in the occurrence typed out below but over several occurrences), except that APIC and TIME change. Typed from a picture, as the message disappears rather quickly. Not all numbers may be precise, but I am happy to double-check if someone can actually interpret the information.
[ 0.422644] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 22: faa000000000080b
[ 0.422651] mce: [Hardware Error]: TSC 0 MISC d012000200000000 SYND 5d000000 IPID 1002e00000002
[ 0.422656] mce: [Hardware Error]: PROCESSOR 2:800f11 TIME 1659024464 SOCKET 0 APIC 0 microcode 8001138
[ 0.422662] mce: [Hardware Error]: CPU 2: Machine Check: 0 Bank 5: bea0000000000108
[ 0.422664] mce: [Hardware Error]: TSC 0 MISC 1ffff9c7ad0da MISC d01200010000000 SYND 4d000000 IPID 500b000000000
[ 0.422668] mce: [Hardware Error]: PROCESSOR 2:800f11 TIME 1659024464 SOCKET 0 APIC 4 microcode 8001138
[ 0.422671] mce: [Hardware Error]: CPU 4: Machine Check: 0 Bank 5: bea0000000000108
[ 0.422673] mce: [Hardware Error]: TSC 0 MISC 1ffff9c232910 MISC d01200010000000 SYND 4d000000 IPID 500b000000000
[ 0.422677] mce: [Hardware Error]: PROCESSOR 2:800F11 TIME 1659024464 SOCKET 0 APIC 8 microcode 8001138
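
For anyone trying to read these lines, here is a minimal Python sketch that pulls the generic flag bits out of the two status words in the log, read as single 16-digit values (0xfaa000000000080b for bank 22, 0xbea0000000000108 for bank 5). It assumes the digits were transcribed correctly from the photo, and it only decodes bits 57 to 63, which have the same meaning on Intel and AMD; the vendor-specific lower bits and the bank-specific error code are left alone. The helper names are made up for this example.

```python
from datetime import datetime, timezone

# Flag bits in the upper part of an x86 machine check status register.
# Only the bits shared by the Intel and AMD definitions are listed;
# everything below bit 57 is vendor-specific and not decoded here.
STATUS_FLAGS = {
    63: "VAL",    # register holds a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # error was not corrected by hardware
    60: "EN",     # reporting was enabled for this error
    59: "MISCV",  # MISC register is valid
    58: "ADDRV",  # ADDR register is valid
    57: "PCC",    # processor context may be corrupt
}


def decode_status(status: int) -> list[str]:
    """Return the names of the flag bits set in a 64-bit status word."""
    return [name for bit, name in STATUS_FLAGS.items() if status >> bit & 1]


def error_code(status: int) -> int:
    """Return the low 16 bits, which carry the MCA error code."""
    return status & 0xFFFF


# Values as typed in the post; they may be off if the photo was misread.
bank22 = 0xFAA000000000080B
bank5 = 0xBEA0000000000108

print("bank 22:", decode_status(bank22), hex(error_code(bank22)))
print("bank 5: ", decode_status(bank5), hex(error_code(bank5)))

# TIME is the wall-clock time of the event in seconds since the Unix epoch.
print(datetime.fromtimestamp(1659024464, tz=timezone.utc).isoformat())
```

With the values as typed, both words have VAL, UC, EN, MISCV and PCC set, so they describe valid, uncorrected errors; bank 22 additionally has OVER set and bank 5 has ADDRV set. The TIME field converts to 2022-07-28 16:07:44 UTC.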

# ras-mc-ctl --summary
No Memory errors.

No PCIe AER errors.

No ARM processor errors.

No Extlog errors.

No devlink errors.
Disk errors summary:
0:0 has 5 errors
0:2064 has 93 errors
0:2080 has 7 errors
0:66304 has 895 errors
No MCE errors.
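
The thread does not say what the identifiers in the disk summary mean. One possible reading, and it is only an assumption, is that the number after the colon is a packed Linux device number. The Python sketch below splits such a number into major and minor using the common 32-bit userspace layout; split_dev is a name invented for this example, and nothing here confirms that ras-mc-ctl actually formats its output this way.

```python
def split_dev(dev: int) -> tuple[int, int]:
    """Split a packed Linux device number into (major, minor).

    Assumes the 32-bit userspace layout: major in bits 8-19,
    minor in bits 0-7 plus bits 20-31.
    """
    major = (dev >> 8) & 0xFFF
    minor = (dev & 0xFF) | ((dev >> 12) & 0xFFF00)
    return major, minor


# Error counts from the summary, keyed by the number after "0:".
summary = {0: 5, 2064: 93, 2080: 7, 66304: 895}

for dev, errors in summary.items():
    major, minor = split_dev(dev)
    print(f"{dev:>6} -> major {major:>3}, minor {minor:>2}: {errors} errors")
```

Under that assumption, 2064 and 2080 come out as major 8 (the SCSI/SATA disk major, minors 16 and 32) and 66304 as major 259, minor 0 (the block-extended major that NVMe devices normally use). Treat this as a hint for further checking against lsblk output, not as a diagnosis.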

> System:
MB: ASUS TUF B350-PLUS GAMING
CPU: AMD RYZEN 7 1700
GPU: MSI R7 370 GAMING 2G
RAM: CORSAIR VENGEANCE CMK16GX4M2A2666C16
SSD1: Samsung SSD 960 EVO 250GB
SSD2: Samsung SSD 850 EVO 250GB
SSD3: Samsung SSD 860 QVO 1TB

$ uname -r
5.18.15-arch1-1

ewaller · 2022-07-31 15:24:05

That just smells like a hardware issue, especially given the machine check exceptions.

Is the ambient temperature unusually hot? Does the machine work differently if you start after the machine has had a chance to fully cool down to room temperature?
Are you overclocking or 'undervolting' your processor?
Is your power supply up to the challenge?

jonnyDumb · 2022-08-01 15:03:27

Thank you for having a look ewaller.

ewaller wrote:

Is the ambient temperature unusually hot?

When the crash happened it was somewhat warm, but not to any extreme extent: 28°C ~ 82°F. Temperatures during troubleshooting have been lower on average.

ewaller wrote:

Does the machine work differently if you start after the machine has had a chance to fully cool down to room temperature?

The boot issues I encountered from the beginning occur regularly, with no correlation with temperature. I believe those are truly random. Crashes (those I occasionally saw during normal usage before the 'big' crash) may have been more "regular" during summer, but I would not put much into that as memory may betray.
The current issue with the "modesetting" crashes has 100% reproducibility, regardless of temperature, time off or time on.
I am at least the second one on the internet who seems to experience this exact problem, but was hoping that the other guy simply gave up too soon and it is a graphics card problem... Alas, I refuse to believe that the brand new GPU, just purchased for testing, would have the exact same fault... Coincidence is one thing, but that, I feel, would be so unlikely that I am willing to take the risk and keep the new card in case I replace the other components. I had the new card in for a short time only, to prove to myself that the problem is the same. That should hopefully have ensured that another faulty part could not damage the new GPU.

ewaller wrote:

Are you overclocking or 'undervolting' your processor?

No overclocking or undervolting. While undervolting is certainly something I sought to do, I did not have the patience to figure out how the BIOS would let me achieve that... If anything, I had the power save feature of the BIOS active, which prefers lower clock speeds. I also tried to underclock the CPU, no difference (the stress testing with 'stress' ran stable for a few hours). The base clock was never touched either.

ewaller wrote:

Is your power supply up to the challenge?

I do believe so; summing the theoretical consumptions and a bit of guesstimates, I get this: MB ~20W, GP: ~75W, CPU: 65W, other stuff: 20W, total ~180W. PSU rating: 550 Watt. Even doubling the estimate, I should still be fine. The PSU has been on my mind as well. A colleague told me that the power grid sometimes switches transformer stations. If the caps in the PSU are old, the resulting glitches may propagate through, maybe? The voltages reported by the BIOS all look stable. The PSU's fan is pretty much always idle. Even though I love to save some money, I thought of replacing the PSU together with MB and CPU, just for peace of mind. It cannot be the cause of the problem right now (the correlation with modesetting is too high, and the PC runs for hours with nomodeset, incl. stress tasks), but maybe it was part of the problem in the original crash that put me into the modeset dilemma. A new PSU could maybe at least prevent that in the future.
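
The power budget above can be written out as a few lines of Python. This is only a sketch of the arithmetic: the per-component figures are the rough estimates from the post, not measurements, and the variable names are invented for the example.

```python
# Rough per-component draw in watts, as estimated in the post (not measured).
estimates_w = {"MB": 20, "GP": 75, "CPU": 65, "other": 20}
psu_rating_w = 550

total_w = sum(estimates_w.values())

print(f"estimated load: {total_w} W")
print(f"share of PSU rating: {total_w / psu_rating_w:.0%}")
print(f"doubled estimate still fits: {2 * total_w <= psu_rating_w}")
```

The estimate sums to 180 W, roughly a third of the 550 W rating, and even twice that (360 W) stays below the rating. A nameplate rating says nothing about an aged unit's behavior during mains glitches, so this only covers the steady-state budget.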

One more thing I should try is to unplug all SSDs and boot from my SystemRescueCD thumb drive again (it conveniently has both the nomodeset option and the standard boot option, which triggers the issue). The problem is weird enough to my mind that such a simple test certainly should not be omitted. The Samsung SSD is connected via M.2 to the PCIe bus if I understand correctly, so maybe there is something to be discovered.... Will report back.

Side note, as I typed this reply, the PC froze again... A 'normal' freeze. So, those seem to occur regardless of modesetting.

jonnyDumb · 2022-08-01 17:02:44

Yep, that did not go very far. Nomodesetting is still king. A gruesome one, but the throne is his nevertheless.

A few other observations:
- The system crashed twice while in use with nomodeset: first after around 2 hours, as mentioned in my previous post, then a second time 15 minutes later. When booting after the second crash, there it was, short but dreaded: 3 lines of mce errors. The system is getting worse.
- Unplugging the SSDs did nothing. Booted SystemRescue about 20 times: 100% crashes with modeset, 0% crashes with nomodeset. However, I did not see the boot problem a single time. A bit more testing, and unfortunately the 960 EVO in the M.2 slot seems to be the culprit. Well, at least that mystery may be solved (or I was just a lot more lucky the 10 times I booted with the other 2 SSDs plugged in...).

With the unlikely number of crashes and the appearance of an mce when running with nomodeset, I think the fate of this particular hardware combination is sealed. This is a problem cold hard cash has to solve. Yet, if there are other proposals, I am here to listen for a few more days before ordering, and probably also afterwards.

To venture an unscientific, unprovable guess: I was careless when assembling the PC when I first got the parts. I was too excited, made errors, had to remove components, walked around the room, the air was dry, and I did not always touch a metal part before handling parts. A little zap, degraded MOSFETs on the mainboard or the CPU, currents from a small leak growing bigger, exponentially. 4 years later it becomes a visible issue. At first, nomodeset hides the problem at the cost of significantly reduced functionality, but the problem is a ticking time bomb in its last stages. I will resort to living with my trusty old laptop until new parts arrive.

Update: while proofreading this post, a third crash, with mce at reboot. The end is near for this machine. If anybody has an opinion on whether a new PSU should be purchased and why, don't hold back!

seth · 2022-08-01 21:27:53

From the symptoms it's clearly the GPU. Even w/ "GP: ~75W" you'd need to have it in a PEG slot (consult the board manual, the regular slots only get you 25W), but google says the R7 370 draws 110W under load (not w/ nomodeset, where it's ~20W, which works in every regular PCIe slot), so there's likely a 6/8-pin connector that you forgot to attach and the GPU crashes for being underpowered.

Last edited by seth (2022-08-01 21:28:08)

jonnyDumb · 2022-08-02 14:51:32

Thank you for the suggestion seth.

110 Watts is certainly more than I thought. In either case, the 6-pin PCIe power connector is connected and working. Without it, the system does not boot at all; I tried this in the course of troubleshooting. The graphics card is inserted into the PCIe slot closest to the CPU. The only other PCIe slot is at the bottom of the mainboard, where the front panel USB connectors would interfere with the double-height cooler.

I should note that the PC is 4 years old and I regularly played The Dark Mod, Xonotic, Supertuxkart and Morrowind (using OpenMW). Until the "violent crash" outlined in my original post, that is.

The replacement GPU I bought does not require the extra 6/8-pin power connector (as it physically does not have one). It showed the same behavior, which leads me to doubt that the GPU is the culprit... Any suggestions how to further troubleshoot the GPU are welcome though.

As a side note, I believe the PCIe slot should support 75W if an x16 card is inserted and after software configuration: https://en.wikipedia.org/wiki/PCI_Express#Power The replacement GPU I bought is rated at 53 Watts if I remember correctly. I think confusing the two may have been the source of my bad memory regarding the power consumption of the R7 370.
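
To make the slot-versus-connector arithmetic concrete, here is a small Python sketch. It uses the 75 W x16 slot figure from the Wikipedia link, seth's 110 W figure for the R7 370 and the 53 W figure recalled from memory for the replacement card. The 75 W allowance for a 6-pin auxiliary connector is an assumption added here (it is the commonly cited limit), and power_available is a made-up helper name.

```python
SLOT_LIMIT_W = 75     # x16 slot budget, per the linked Wikipedia section
SIX_PIN_LIMIT_W = 75  # 6-pin auxiliary connector budget (assumption added here)


def power_available(card_w: int, has_six_pin: bool) -> tuple[int, bool]:
    """Return (watts available, whether that covers the card's draw)."""
    available = SLOT_LIMIT_W + (SIX_PIN_LIMIT_W if has_six_pin else 0)
    return available, card_w <= available


cards = {
    "R7 370 with 6-pin attached": (110, True),
    "R7 370 without 6-pin": (110, False),
    "replacement card (from memory)": (53, False),
}

for name, (watts, six_pin) in cards.items():
    available, ok = power_available(watts, six_pin)
    verdict = "ok" if ok else "short"
    print(f"{name}: needs {watts} W, has {available} W -> {verdict}")
```

On these numbers the R7 370 is only short on power if the 6-pin connector is missing, and the replacement card fits within the slot budget on its own, which is consistent with it having no auxiliary connector.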

jonnyDumb · 2022-08-16 14:30:31

For those who run into similar problems, I did not want this post to be left without some conclusion:

1. Received new Mainboard and CPU. Graphics card is ultimately new as well, but since it did not solve the issue paired with the old components, the problem must have been the CPU and/or Mainboard.

2. The boot hangs when using the Samsung SSD 960 EVO 250GB in the M.2 slot seem to be gone. Around 20 reboots are enough to indicate this, given that it failed every second or third reboot before. Same SSD, but new MB + CPU as said.

3. Bad user experience: Asus Flashback (https://www.asus.com/support/FAQ/1038568) was not just a feature but a necessity. The motherboard did not support the CPU out of the box and this was the only way to update the BIOS. After the previous problems, my heart sank when a black screen was all there was until the BIOS was updated. No warning messages, no LED error codes or 7-segment fault indication.
-> When hardware errors hit hard and problems just continued, my old laptop really was a savior. It is good to keep some usable alternative around and somewhat updated. I was also lucky enough to have had Nextcloud sync important files, mainly my Keepass database.
-> Over time, building a computer has not become easier. At least on cheaper mainboards, troubleshooting is hard or impossible. The new mainboard has a debug port. Maybe it is worth seeing if there is some useful information on there, and investing in a UART dongle or whatever it takes. The price may offset the discomfort of having no clue what's going on. From experience, I found that both RAM and CPU can cause this. A CPU paired with a BIOS that is not new enough has been an issue before. I do not envy people with less experience running into this problem.

Last edited by jonnyDumb (2022-08-16 14:32:42)
