I just built a new machine this weekend: an Intel i3-10320 CPU on an MSI MAG B560 Torpedo motherboard. I've never built a machine before. After building it, I immediately updated the BIOS. So, here is the problem: the only way I can get Arch Linux to boot, whether from a live USB or as it is currently installed on my new machine, is by adding nomodeset to the kernel boot line. Per dmesg | grep Error, I am getting the following errors at boot:
[ 0.107865] mce: [Hardware Error]: Machine check events logged
[ 0.107866] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 6: ee2000000040110a
[ 0.107869] mce: [Hardware Error]: TSC 0 ADDR fef20080 MISC 3880000086
[ 0.107872] mce: [Hardware Error]: PROCESSOR 0:a0653 TIME 1620583348 SOCKET 0 APIC 0 microcode e0
[ 0.794166] ACPI BIOS Error (bug): Could not resolve symbol [\_SB.PC00.PGON.PBGE], AE_NOT_FOUND (20210105/psargs-330)
[ 0.794174] ACPI Error: Aborting method \_SB.PC00.PGON due to previous error (AE_NOT_FOUND) (20210105/psparse-529)
[ 0.794178] ACPI Error: Aborting method \_SB.PC00.PEG1.PG01._ON due to previous error (AE_NOT_FOUND) (20210105/psparse-529)
[ 1.012493] RAS: Correctable Errors collector initialized.
I'm particularly concerned about the hardware errors. From what I've been able to gather so far (and I've done a fair bit of research over the past few days, though I should disclaim that I am a total amateur), I may actually have two issues, and I'm not sure if they are related. The hardware errors may be due to a bad CPU or mobo socket, or they may be a firmware or microcode issue. The ACPI errors may or may not be related, but are probably firmware or microcode driven. I've updated the BIOS and added the most recent intel-ucode to my kernel boot line as well, so I think I am current. It seems a number of people are having the ACPI issues at the moment. Does anyone have any insights?
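For anyone who wants to read the Bank 6 status word rather than just grep for it, here is a minimal Python sketch that splits it into the architectural flag bits of IA32_MCi_STATUS. The bit positions follow the layout documented in the Intel SDM; the function and dictionary names are made up for this example, and the sketch does not try to interpret the error codes themselves.

```python
# Architectural flag bits of IA32_MCi_STATUS (positions per the Intel SDM).
# The names FLAGS and decode_mci_status are hypothetical, chosen for this sketch.
FLAGS = {
    "VAL": 63,    # register contains a valid error
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # error was not corrected by hardware
    "EN": 60,     # reporting of this error was enabled in MCi_CTL
    "MISCV": 59,  # MISC register contents are valid
    "ADDRV": 58,  # ADDR register contents are valid
    "PCC": 57,    # processor context may be corrupt
}


def decode_mci_status(status: int) -> dict:
    """Split a 64-bit MCi_STATUS value into flags and the two 16-bit code fields."""
    decoded = {name: bool((status >> bit) & 1) for name, bit in FLAGS.items()}
    decoded["mca_error_code"] = status & 0xFFFF               # bits 15:0
    decoded["model_specific_code"] = (status >> 16) & 0xFFFF  # bits 31:16
    return decoded


if __name__ == "__main__":
    # Status value copied from the "Bank 6" line in the log above.
    for field, value in decode_mci_status(0xEE2000000040110A).items():
        shown = hex(value) if isinstance(value, int) and not isinstance(value, bool) else value
        print(f"{field:>20}: {shown}")
```

For this particular value every flag except EN comes out set. EN being clear would fit an event logged before the operating system enabled reporting for that bank, but that reading is an interpretation and not something confirmed here.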
Offline
Are you overclocking this machine or undervolting it? It's been years since I played with overclocking and undervolting, but the error sparks a memory: https://wiki.archlinux.org/title/Stress … ing_Errors
CPU-optimized Linux-ck packages @ Repo-ck • AUR packages • Zsh and other configs
Offline
Not overclocking. I don't think I'm undervolting, but I will double-check cpu power hookups to the psu just to make sure that is as it should be.
Offline
as far as i'm aware the i3-10320 is locked, so there's really no overclocking capability outside of the motherboard messing with power limits, which is common with intel boards. in the bios i would make sure the power limits are truly intel stock and not messed with. tdp for the 10320 is 65 watts for pl1, pl2 should be 90 watts, with a tau of 28 seconds.
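if you'd rather check the limits from linux than from the bios, here's a rough python sketch that reads them from the powercap sysfs interface. it assumes the intel_rapl driver is loaded and that the package zone lives at /sys/class/powercap/intel-rapl:0 (that path is an assumption and may differ or need root on your system). the expected numbers are just the ones quoted above, and the function names are made up for this example.

```python
from pathlib import Path

# Assumed location of the package-level RAPL zone (requires the intel_rapl driver).
RAPL_ZONE = Path("/sys/class/powercap/intel-rapl:0")

# Stock limits quoted above for the i3-10320, in watts.
# "long_term" corresponds to PL1 and "short_term" to PL2.
EXPECTED = {"long_term": 65.0, "short_term": 90.0}


def read_limits(zone: Path) -> dict:
    """Return {constraint name: (watts, time window in seconds or None)}."""
    limits = {}
    for name_file in sorted(zone.glob("constraint_*_name")):
        idx = name_file.name.split("_")[1]
        name = name_file.read_text().strip()
        watts = int((zone / f"constraint_{idx}_power_limit_uw").read_text()) / 1e6
        window = zone / f"constraint_{idx}_time_window_us"
        seconds = int(window.read_text()) / 1e6 if window.exists() else None
        limits[name] = (watts, seconds)
    return limits


def deviations(limits: dict, tolerance_w: float = 0.5) -> list:
    """List constraints whose power limit differs from the quoted stock value."""
    return [
        f"{name}: {limits[name][0]} W, expected {want} W"
        for name, want in EXPECTED.items()
        if name in limits and abs(limits[name][0] - want) > tolerance_w
    ]


if __name__ == "__main__":
    try:
        current = read_limits(RAPL_ZONE)
    except OSError as exc:
        print(f"could not read {RAPL_ZONE}: {exc}")
    else:
        print(current)
        print(deviations(current) or "limits match the quoted stock values")
```

the time window is only printed, not compared, since the value the kernel reports for a 28 second tau is usually a little off from a round number.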
googling i found this with a similar mce error to yours: https://community.intel.com/t5/Graphics … d-p/711594
but good ol' intel doesn't appear to offer any help as "linux isn't validated by intel."
there could be something wrong with your cpu. if you can, i would test with windows first. windows unfortunately has the better monitoring / stability testing / benchmarking tools. i'm curious to see if windows picks up any whea errors with event viewer in the system pane.
Last edited by orlfman (2021-05-10 06:13:51)
Offline
You could also run MemTest86 to check the CPU memory registers and caches for errors.
Offline
Thanks everyone for the suggestions. I have verified the hardware is hooked up as it should be. As graysky suggested, I tried to find the Fedora version of the Intel Processor Diagnostic Tool (IPDT), but all the links to the tool on the Internet appear to be broken. I did a fair bit of searching for the tool, but only hit dead ends. Per orlfman, I will probably load Windows and then get IPDT for Windows just to see if that will narrow down any issues. In the meantime, I will verify the power limits in the BIOS tonight. I may try to run stress and MemTest86 too.
Offline
That MCE hardware event in your log happened early at boot. Do you also get MCE log entries later while you are using the computer?
If it's only happening at boot, you perhaps shouldn't worry too much about it. I remember seeing other people with those kinds of mysterious MCE events that only happen at boot and not later; their computers ran fine otherwise.
The log you shared is the output of 'dmesg'? Those entries should also be in systemd's journal. You can then search for old entries from previous boots in the "journalctl" output.
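To compare boots without eyeballing the output, something like the following Python sketch could help. It assumes you have saved the kernel messages of each boot as text, for example with "journalctl -k -b -1" for the previous boot (this needs a persistent journal), and that the MCE lines look like the ones in the first post. The function names are invented for this example.

```python
import re

# Matches the "Machine Check" line format shown in the first post.
MCE_LINE = re.compile(
    r"mce: \[Hardware Error\]: CPU (\d+): Machine Check: \d+ Bank (\d+): ([0-9a-f]+)"
)


def mce_signatures(log_text: str) -> set:
    """Return the set of (cpu, bank, status) tuples found in one boot's kernel log."""
    return {(int(cpu), int(bank), status) for cpu, bank, status in MCE_LINE.findall(log_text)}


def same_across_boots(logs: dict) -> bool:
    """True if every boot in {boot label: log text} shows the same MCE signatures."""
    signatures = [mce_signatures(text) for text in logs.values()]
    return all(sig == signatures[0] for sig in signatures)


if __name__ == "__main__":
    sample = (
        "[ 0.107866] mce: [Hardware Error]: CPU 0: Machine Check: 0 "
        "Bank 6: ee2000000040110a"
    )
    print(mce_signatures(sample))
    print(same_across_boots({"boot -1": sample, "boot 0": sample}))
```

Identical signatures on every boot would point toward something systematic at startup; signatures that vary in bank or status would be more worrying.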
The "nomodeset" issue should be something else. You didn't mention what graphics card you are using.
Offline
Yes, it happens early each time I boot, and only then. I've run journalctl | grep Error and confirmed it is always the same series of messages early in boot, every boot. I'm not using a graphics card; I'm relying on the CPU's built-in graphics. FYI... I've built this machine to serve as a headless home NAS, and plan to leave it on 24/7.
I was just in the BIOS and one anomaly sticks out, the CPU IO2 reading below:
VOLTAGE
CPU Core: 0.970V
CPU IO: 0.956V
CPU IO2: 65.535V
CPU SA: 1.054V
System 3.3V: 3.360V
System 12V: 12.120V
DRAM: 1.204V
I don't even know what CPU IO2 is, but 65V seems kinda high for something on the CPU. I'm guessing it is an unconnected or disconnected sensor.
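One thing that supports the sensor guess: 65.535 V is exactly 65535 mV, which is 0xFFFF, the largest value a 16-bit unsigned field can hold. Assuming the firmware stores these readings as 16-bit millivolt values (an assumption, not something the BIOS states), an all-ones value is more likely a "no data" placeholder than a measurement. A small Python sketch of that check, with a made-up function name:

```python
UINT16_MAX = 0xFFFF  # 65535, all ones in a 16-bit field


def looks_like_sentinel(volts: float) -> bool:
    """True if the reading equals a 16-bit all-ones value expressed in millivolts."""
    return round(volts * 1000) == UINT16_MAX


if __name__ == "__main__":
    # Readings copied from the BIOS screen above.
    readings = {
        "CPU Core": 0.970,
        "CPU IO": 0.956,
        "CPU IO2": 65.535,
        "CPU SA": 1.054,
        "System 3.3V": 3.360,
        "System 12V": 12.120,
        "DRAM": 1.204,
    }
    for name, volts in readings.items():
        if looks_like_sentinel(volts):
            print(f"{name}: {volts} V looks like a placeholder, not a measurement")
```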
Offline
The error is (most likely) from the last boot.
Do you restart cleanly, does the reboot process hang w/ some errors during the shutdown and is there a difference in MCE messages between a cold and a warm reboot?
Ceterum censeo and since it was mentioned: 3rd link in my signature.
Online
The error is the same (except for the timestamp) regardless of whether it is a cold or warm boot. No errors during shutdown.
The machine seems to ***mostly run*** despite the CPU and ACPI errors listed above, but of course I have to boot with nomodeset to use a monitor. I've tried disabling fastboot: no effect. I've reinstalled Arch from scratch. I've replaced Arch with Ubuntu, then replaced Ubuntu with Linux Mint. I get the exact same messages every time at the same place during boot and have to use nomodeset in all cases. I'm in the process of loading Windows 10 to see what happens.
Offline
I stumbled onto something interesting in the kernel documentation yesterday that fits with the CPU problem here:
First, I saw this here in kernel-parameters.txt:
mce=option [X86-64] See Documentation/x86/x86_64/boot-options.rst
Then next looking through the file boot-options.rst, one of the options it describes is this:
mce=bootlog
    Enable logging of machine checks left over from booting.
    Disabled by default on AMD Fam10h and older because some BIOS
    leave bogus ones.
    If your BIOS doesn't do that it's a good idea to enable though
    to make sure you log even machine check events that result
    in a reboot. On Intel systems it is enabled by default.
mce=nobootlog
    Disable boot machine check logging.
You could try this "mce=nobootlog" kernel command line parameter and see what happens. If it hides the MCE event messages in dmesg and journalctl, that should mean the events were logged before the Linux kernel was loaded, early at boot while the UEFI was still in control of the machine.
If this "mce=nobootlog" works, I would not worry about this anymore. The documentation mentions that some BIOSes leave bogus machine checks behind at boot. I guess your machine is one of those, and there's nothing to do about it.
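How you add the parameter depends on your bootloader, so I won't guess at that part. To confirm afterwards that the running kernel really received it, you can look at /proc/cmdline. A minimal Python sketch (the function name is made up for this example):

```python
from pathlib import Path


def mce_options(cmdline: str) -> list:
    """Return the values of every mce= option found on a kernel command line."""
    return [tok.split("=", 1)[1] for tok in cmdline.split() if tok.startswith("mce=")]


if __name__ == "__main__":
    path = Path("/proc/cmdline")
    if path.exists():
        found = mce_options(path.read_text())
        print("mce options in effect:", found or "none")
```

If "nobootlog" shows up in that list and the MCE lines are gone from dmesg, the parameter was applied.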
Offline
Yes, adding mce=nobootlog as a kernel boot parameter did suppress the MCE hardware errors. The other errors remain, as expected. I have one quick question: why would suppressing these errors via the kernel boot parameters indicate that the events occurred while the UEFI was still in control of the machine, and not the kernel?
Offline
https://en.wikipedia.org/wiki/Machine-check_exception
It's either that or the error is carried over from the last boot.
If there were no issues w/ the shutdown (you shut down the system cleanly, rather than it somehow powered off out of nowhere) and the errors are reproducible (always the same), they're detected at boot.
Whether they're bogus or a genuine error can't be told, but google finds that exact error at the exact address and bank quite a few times…
Online
As far as I can determine, the machine runs fine despite the MCE error. I will proceed as if they are harmless. I have other issues (e.g. graphics problems) that I need to troubleshoot too, but those are topics for another thread.
Offline
I'm dealing with this exact same issue, and trust me when I say it's not "fine" or a bug. It carries a high risk of leading to more serious issues down the road. This error is typically going to be a bad memory controller/cache on your CPU (75% of the time), a bad chipset on your motherboard, bent pins in your CPU socket, or a bad trace on your motherboard. On rare occasions other hardware that shorts out the motherboard, such as a bad power supply, PCI card, or bad caps on the motherboard, can cause it too, but unless it's directly linked to the CPU that is extremely rare. I've decided it's time for an upgrade and to err on the side of caution and just replace my motherboard, CPUs, memory, power supply, etc. This is the second identical motherboard and CPU set I've had with memory/CPU-related issues within a few years. I see no point in penny-pinching only to cause myself headaches in the future. I can always pinpoint the issue later if I have nothing to do, and use the hardware for less important things, but as my main system it simply will not do. Bad hardware like this can also cause a daisy chain of failures: a bad CPU damages the motherboard, you replace the motherboard, the CPU damages the new motherboard, you replace the CPU, and the motherboard causes the new CPU to go bad. It's extremely rare, but I've seen it happen in a controlled engineering environment with other hardware.
At this point, with the same error, I can't even mount my root device rw: the kernel forbids it unless I force its hand, and even after a clean fsck the mount won't work. No other hardware errors show. No issues with my RAID drives. Removing all memory sticks from CPU1 banks 1, 2 & 3 and not CPU0 fixes the error (I say fixes, but it's not a fix; it just "hides" the error from the kernel, and it will still cause data corruption). However, trying another memory stick from the CPU0 bank fails every time. I tried 15 "known good" memory sticks with the same results. I can also run CPU/memory burn tests, and "most" basic tests will pass without issue.
I also wanted to say that, in my opinion, the forum sign-up on this server with the date/uname/hash should be changed, for people like me who are having hardware issues without access to another Linux system. I actually had to stop and think, notice that the security question changed every time (from hash 256 to 512, different date %V %J outputs), and make sure my time was on par with the server's, in epoch or day-of-the-month format. It really just wasted my time, which is something I really hate. Forums are usually for people who need help, which can include date/time/RTC/hardware/kernel issues with Linux systems. Only elitist n00bs who think they are supreme/clever/elite and better than other people would use that type of captcha, but that's only my opinion. Hunting down an online sha tool and trying multiple "near correct dates" hashes until I get the right one takes more time than most of the hardest captchas, and that's from someone who is very experienced with Unix/Linux systems.
It's kind of like the Arch forum admins are telling me: "Well, if you don't have current access to a Linux system with the correct date set, you're not getting into our forums without some work and wasted time finding a proper hash that works." It just leaves a bad taste in my mouth. I've been using Unix/Linux systems for over 27 years, and that's my point of view.
Last edited by zeronullity (2021-07-26 12:25:01)
Offline
I'm dealing with this same exact issue
"Bank 6: ee2000000040110a" and *only* "Bank 6: ee2000000040110a"?
For the rest of your post, please see https://gitlab.archlinux.org/archlinux/ … opicsrants
Online
Yes Seth, the same error, different address range, and yes, it will affect each system differently based on the hardware in use and the ranges affected. It's a very serious error even if you don't have issues currently. It's true you may never have an issue, but that doesn't mean the hardware doesn't have a real physical failure. How it affects your system depends on many factors. It's like me telling you my shirt is ripped and I'm going to throw it away, and you asking "but is it ripped where it can be seen?" It doesn't matter, it's ripped, and I'm throwing it or giving it away because it's most likely only going to get worse.
I'm sorry for breaking the forum rules, Seth. I just call it how I see it when things seem to be blatantly wrong from my point of view. It was more or less meant as a teaching moment to share common human kindness and hospitality instead of turning away new Linux users because they are not smart enough to answer the question or don't have easy access to checksum tools at the moment. And I was more harsh than I normally am because I saw other forum posts with members/admins boasting about this very thing, and it irked me quite a bit to see them be careless/thoughtless about "new members".
Last edited by zeronullity (2021-07-26 13:35:09)
Offline
I don't have that error, but it is all over the internet with this *specific* address (which is a bit specific and probably does matter, as will whether the address varies).
And whether there are more, different MCEs alongside it.
There're 10 hits for this forum alone (most of which do actually not concern that error but are just random dmesg posts)
Apparently it was introduced w/ linux 4.10, https://bbs.archlinux.org/viewtopic.php … 1#p1698801
I'm not saying it's harmless for sure, but having an itching toe right before your house burned down doesn't mean that the house burned down because of your itching toe…
Online
I could go into great detail explaining what the addresses mean, and how they would affect certain situations if they are on the same hardware in various scenarios, or even the same error on different hardware. It is like having a unique hardware fingerprint.
https://en.wikipedia.org/wiki/Machine-check_exception
seth: "I'm not saying it's harmless for sure, but having an itching toe right before your house burned down doesn't mean that the house burned down because of your itching toe…"
First of all, I personally wouldn't use that comparison. Even in the "rare" instances where it's a firmware/kernel/driver bug, you can have the really bad luck of having both: a firmware/kernel bug and a hardware issue too. I'm not suggesting that the OP throw his system away. I would suggest using "known good" parts to rule out the hardware. If it's NOT the hardware, then swapping out the hardware with an IDENTICAL part# with an IDENTICAL firmware version will make no difference to the error. I definitely wouldn't take the approach of "oh well, a few other people had this problem, they say it's nothing" and move on. I take the old-school approach, which rarely fails; there are not too many ways to reinvent a perfectly round circle. A square or oblong tire probably wouldn't be very good to drive on, but hey, if you have no other choice, why not, right? What's the worst that could happen?
Offline
I could go into great detail explaining
Don't let me stop you…
MCE is good to analyze HW errors after the fact and preserve errors across boots.
But if the very same error with the very same addresses occurs for many people and always during the boot when the cores are initialized and without any perceivable issues, chances are it's bogus and not a thermal issue or decay (where you'd expect more randomness)
Alternatively the OP is really active across the web ;-)
(Though there're also many dell systems affected where idk the used chipset)
Here's someone who got them when switching to the exact same chipset, https://forums.unraid.net/topic/103883- … rd-on-691/
Online
Update on my original post... I eventually got a better Mobo and CPU. That resolved the issue. I did try swapping out the original Mobo with a new, identical Mobo and got the same error. In the end, I discovered the Mobo to be very quirky in a few ways, and I don't think it was entirely compatible with the CPU (or the GPU built into the CPU). The Mobo was really built for Intel 11th gen, and I was using Intel 10th gen. It should have been backward compatible, but just wasn't.
Offline
Hopefully this thread isn't too old to bump up again, but I ran across this thread for the exact same error. Considering the sheer number of hits around this error and addresses, I'm going to let it ride. Brand new build with quality parts. Burned in the system pretty hard when I first saw the errors. Ran stress tests 24 hours running all 20 vcores pretty solid. Ran into absolutely no issues at all. I've run Windows and Linux on this hardware with no issues, other than a few video issues due to lower quality HDMI cables. The errors bother me because they're there, but knowing so many other people have them also is certainly comforting in a twisted kind of way.
Same CPU gen though, 10900 (non-K) on a Gigabyte Z590 Aorus Master. 64G of Gigabyte-listed RAM. CPU runs at 30°C or below at standard load.
Offline