[SOLVED] How does one identify the RAM bank that has ECC failures?

andrej.podzimek · 2021-01-22 05:41:21

First and foremost: The common wisdom (looking at Error Addr and matching it with dmidecode output, as described here) doesn't work on my system. This is because for some reason, perhaps due to interleaving, dmidecode reports that all 4 RAM banks cover the entire address range of 128 GB (when in fact each bank has only 32 GB).

My hardware is an ASRock X570 Creator with a Ryzen 3950X, BIOS version 3.30. It has 128 GB of ECC RAM, 4 banks of type M391A4G43MB1-CTD. In case it matters: AMD SME is enabled (mem_encrypt=on).

Here's a complete dmesg.

Here's a complete dmidecode.

After 6+ months of continuous operation without any ECC errors, roughly one correctable error per uptime started to appear. (That was already far more frequent than the commonly expected handful of correctable errors per year.) Quite recently I found the following uncorrectable errors in dmesg:

Jan 17 00:35:48 charon kernel: mce: [Hardware Error]: Machine check events logged
Jan 17 00:35:48 charon kernel: [Hardware Error]: Deferred error, no action required.
Jan 17 00:35:48 charon kernel: [Hardware Error]: CPU:0 (17:71:0) MC18_STATUS[Over|-|MiscV|AddrV|-|-|SyndV|UECC|Deferred|-|Scrub]: 0xdc2031000000011b
Jan 17 00:35:48 charon kernel: [Hardware Error]: Error Addr: 0x000000098b3e0040
Jan 17 00:35:48 charon kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x6d2e20200b800001
Jan 17 00:35:48 charon kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Jan 17 00:35:48 charon kernel: EDAC MC0: 1 UE Cannot decode normalized address on mc#0csrow#1channel#1 (csrow:1 channel:1 page:0x0 offset:0x0 grain:64)
Jan 17 00:35:48 charon kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Jan 17 00:35:48 charon kernel: mce: [Hardware Error]: Machine check events logged
Jan 17 00:35:48 charon kernel: [Hardware Error]: Deferred error, no action required.
Jan 17 00:35:48 charon kernel: [Hardware Error]: CPU:0 (17:71:0) MC18_STATUS[Over|-|MiscV|AddrV|-|-|SyndV|UECC|Deferred|-|Scrub]: 0xdc2031000000011b
Jan 17 00:35:48 charon kernel: [Hardware Error]: Error Addr: 0x000000098b3e0540
Jan 17 00:35:48 charon kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0xdad800220b800001
Jan 17 00:35:48 charon kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Jan 17 00:35:48 charon kernel: EDAC MC0: 1 UE Cannot decode normalized address on mc#0csrow#1channel#1 (csrow:1 channel:1 page:0x0 offset:0x0 grain:64)
Jan 17 00:35:48 charon kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Jan 17 00:35:48 charon kernel: [Hardware Error]: Deferred error, no action required.
Jan 17 00:35:48 charon kernel: [Hardware Error]: CPU:0 (17:71:0) MC18_STATUS[Over|-|MiscV|AddrV|-|-|SyndV|UECC|Deferred|-|Scrub]: 0xdc2031000000011b
Jan 17 00:35:48 charon kernel: [Hardware Error]: Error Addr: 0x000000098b3e0a80
Jan 17 00:35:48 charon kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x595a02220b800001
Jan 17 00:35:48 charon kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Jan 17 00:35:48 charon kernel: EDAC MC0: 1 UE Cannot decode normalized address on mc#0csrow#1channel#1 (csrow:1 channel:1 page:0x0 offset:0x0 grain:64)
Jan 17 00:35:48 charon kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Jan 17 00:35:48 charon kernel: [Hardware Error]: Deferred error, no action required.
Jan 17 00:35:48 charon kernel: [Hardware Error]: CPU:0 (17:71:0) MC18_STATUS[Over|-|MiscV|AddrV|-|-|SyndV|UECC|Deferred|-|Scrub]: 0xdc2031000000011b
Jan 17 00:35:48 charon kernel: [Hardware Error]: Error Addr: 0x000000098b3e0f40
Jan 17 00:35:48 charon kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x61b200220b800001
Jan 17 00:35:48 charon kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Jan 17 00:35:48 charon kernel: EDAC MC0: 1 UE Cannot decode normalized address on mc#0csrow#1channel#1 (csrow:1 channel:1 page:0x0 offset:0x0 grain:64)
Jan 17 00:35:48 charon kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Jan 17 00:35:48 charon kernel: [Hardware Error]: Deferred error, no action required.
Jan 17 00:35:48 charon kernel: [Hardware Error]: CPU:0 (17:71:0) MC18_STATUS[Over|-|MiscV|AddrV|-|-|SyndV|UECC|Deferred|-|Scrub]: 0xdc2031000000011b
Jan 17 00:35:48 charon kernel: [Hardware Error]: Error Addr: 0x000000098b3e1440
Jan 17 00:35:48 charon kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0xa3aa22220b800001
Jan 17 00:35:48 charon kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Jan 17 00:35:48 charon kernel: EDAC MC0: 1 UE Cannot decode normalized address on mc#0csrow#1channel#1 (csrow:1 channel:1 page:0x0 offset:0x0 grain:64)
Jan 17 00:35:48 charon kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Jan 17 00:35:48 charon kernel: [Hardware Error]: Deferred error, no action required.
Jan 17 00:35:48 charon kernel: [Hardware Error]: CPU:0 (17:71:0) MC18_STATUS[Over|-|MiscV|AddrV|-|-|SyndV|UECC|Deferred|-|Scrub]: 0xdc2031000000011b
Jan 17 00:35:48 charon kernel: [Hardware Error]: Error Addr: 0x000000098b3e19c0
Jan 17 00:35:48 charon kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x3c3a00200b800001
Jan 17 00:35:48 charon kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Jan 17 00:35:48 charon kernel: EDAC MC0: 1 UE Cannot decode normalized address on mc#0csrow#1channel#1 (csrow:1 channel:1 page:0x0 offset:0x0 grain:64)
Jan 17 00:35:48 charon kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Jan 17 00:35:48 charon kernel: [Hardware Error]: Deferred error, no action required.
Jan 17 00:35:48 charon kernel: [Hardware Error]: CPU:0 (17:71:0) MC18_STATUS[Over|-|MiscV|AddrV|-|-|SyndV|UECC|Deferred|-|Scrub]: 0xdc2031000000011b
Jan 17 00:35:48 charon kernel: [Hardware Error]: Error Addr: 0x000000098b3e1e40
Jan 17 00:35:48 charon kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0xf74700020b800001
Jan 17 00:35:48 charon kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Jan 17 00:35:48 charon kernel: EDAC MC0: 1 UE Cannot decode normalized address on mc#0csrow#1channel#1 (csrow:1 channel:1 page:0x0 offset:0x0 grain:64)
Jan 17 00:35:48 charon kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Jan 17 00:35:48 charon kernel: [Hardware Error]: Deferred error, no action required.
Jan 17 00:35:48 charon kernel: [Hardware Error]: CPU:0 (17:71:0) MC18_STATUS[Over|-|MiscV|AddrV|-|-|SyndV|UECC|Deferred|-|Scrub]: 0xdc2031000000011b
Jan 17 00:35:48 charon kernel: [Hardware Error]: Error Addr: 0x000000098b3e2240
Jan 17 00:35:48 charon kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x0b7d00200b800001
Jan 17 00:35:48 charon kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Jan 17 00:35:48 charon kernel: EDAC MC0: 1 UE Cannot decode normalized address on mc#0csrow#1channel#1 (csrow:1 channel:1 page:0x0 offset:0x0 grain:64)
Jan 17 00:35:48 charon kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Jan 17 00:35:48 charon kernel: [Hardware Error]: Deferred error, no action required.
Jan 17 00:35:48 charon kernel: [Hardware Error]: CPU:0 (17:71:0) MC18_STATUS[Over|-|MiscV|AddrV|-|-|SyndV|UECC|Deferred|-|Scrub]: 0xdc2031000000011b
Jan 17 00:35:48 charon kernel: [Hardware Error]: Error Addr: 0x000000098b3e2740
Jan 17 00:35:48 charon kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x294d02000b800001
Jan 17 00:35:48 charon kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Jan 17 00:35:48 charon kernel: EDAC MC0: 1 UE Cannot decode normalized address on mc#0csrow#1channel#1 (csrow:1 channel:1 page:0x0 offset:0x0 grain:64)
Jan 17 00:35:48 charon kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Jan 17 00:35:48 charon kernel: [Hardware Error]: Deferred error, no action required.
Jan 17 00:35:48 charon kernel: [Hardware Error]: CPU:0 (17:71:0) MC18_STATUS[Over|-|MiscV|AddrV|-|-|SyndV|UECC|Deferred|-|Scrub]: 0xdc2031000000011b
Jan 17 00:35:48 charon kernel: [Hardware Error]: Error Addr: 0x000000098b3e2c00
Jan 17 00:35:48 charon kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0xdf2102020b800001
Jan 17 00:35:48 charon kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Jan 17 00:35:48 charon kernel: EDAC MC0: 1 UE Cannot decode normalized address on mc#0csrow#1channel#1 (csrow:1 channel:1 page:0x0 offset:0x0 grain:64)
Jan 17 00:35:48 charon kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Jan 17 00:35:48 charon kernel: [Hardware Error]: Deferred error, no action required.
Jan 17 00:35:48 charon kernel: [Hardware Error]: CPU:0 (17:71:0) MC18_STATUS[Over|-|MiscV|AddrV|-|-|SyndV|UECC|Deferred|-|Scrub]: 0xdc2031000000011b
Jan 17 00:35:48 charon kernel: [Hardware Error]: Error Addr: 0x000000098b3e3100
Jan 17 00:35:48 charon kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x48ff02220b800001
Jan 17 00:35:48 charon kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Jan 17 00:35:48 charon kernel: EDAC MC0: 1 UE Cannot decode normalized address on mc#0csrow#1channel#1 (csrow:1 channel:1 page:0x0 offset:0x0 grain:64)
Jan 17 00:35:48 charon kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Jan 17 00:35:48 charon kernel: [Hardware Error]: Deferred error, no action required.
Jan 17 00:35:48 charon kernel: [Hardware Error]: CPU:0 (17:71:0) MC18_STATUS[Over|-|MiscV|AddrV|-|-|SyndV|UECC|Deferred|-|Scrub]: 0xdc2031000000011b
Jan 17 00:35:48 charon kernel: [Hardware Error]: Error Addr: 0x000000098b3e3640
Jan 17 00:35:48 charon kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x7a4100020b800001
Jan 17 00:35:48 charon kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Jan 17 00:35:48 charon kernel: EDAC MC0: 1 UE Cannot decode normalized address on mc#0csrow#1channel#1 (csrow:1 channel:1 page:0x0 offset:0x0 grain:64)
Jan 17 00:35:48 charon kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Jan 17 00:35:48 charon kernel: [Hardware Error]: Deferred error, no action required.
Jan 17 00:35:48 charon kernel: [Hardware Error]: CPU:0 (17:71:0) MC18_STATUS[Over|-|MiscV|AddrV|-|-|SyndV|UECC|Deferred|-|Scrub]: 0xdc2031000000011b
Jan 17 00:35:48 charon kernel: [Hardware Error]: Error Addr: 0x000000098b3e3b00
Jan 17 00:35:48 charon kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0xba4100000b800001
Jan 17 00:35:48 charon kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Jan 17 00:35:48 charon kernel: EDAC MC0: 1 UE Cannot decode normalized address on mc#0csrow#1channel#1 (csrow:1 channel:1 page:0x0 offset:0x0 grain:64)
Jan 17 00:35:48 charon kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Jan 17 00:38:09 charon kernel: mce_notify_irq: 11 callbacks suppressed
Jan 17 00:38:09 charon kernel: mce: [Hardware Error]: Machine check events logged
Jan 17 00:38:09 charon kernel: [Hardware Error]: Corrected error, no action required.
Jan 17 00:38:09 charon kernel: [Hardware Error]: CPU:0 (17:71:0) MC18_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|Scrub]: 0xdc2041000000011b
Jan 17 00:38:09 charon kernel: [Hardware Error]: Error Addr: 0x000000098b3e3f40
Jan 17 00:38:09 charon kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x76dd20000a800c01
Jan 17 00:38:09 charon kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Jan 17 00:38:09 charon kernel: EDAC MC0: 1 CE Cannot decode normalized address on mc#0csrow#1channel#1 (csrow:1 channel:1 page:0x0 offset:0x0 grain:64 syndrome:0x2000)
Jan 17 00:38:09 charon kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD

Unfortunately, each of the four RAM banks has the same, indistinguishable 128GB address range in dmidecode:

# dmidecode 3.3
Getting SMBIOS data from sysfs.
SMBIOS 3.2.0 present.
Table at 0xAE022000.

Handle 0x0000, DMI type 0, 26 bytes
BIOS Information
	Vendor: American Megatrends Inc.
	Version: P3.30
	Release Date: 12/01/2020
...
	BIOS Revision: 5.17
...
Handle 0x0017, DMI type 17, 84 bytes
Memory Device
	Array Handle: 0x000F
	Error Information Handle: 0x0016
	Total Width: 72 bits
	Data Width: 64 bits
	Size: 32 GB  <<<<<<<<<<<<<<<<<<<<<<<<<<<< Each of the 4 banks is 32 GB only.
	Form Factor: DIMM
	Set: None
	Locator: DIMM 0
	Bank Locator: P0 CHANNEL A
	Type: DDR4
	Type Detail: Synchronous Unbuffered (Unregistered)
	Speed: 2667 MT/s
	Manufacturer: Samsung
	Serial Number: 03CF4F25
	Asset Tag: Not Specified
	Part Number: M391A4G43MB1-CTD    
	Rank: 2
	Configured Memory Speed: 2667 MT/s
	Minimum Voltage: 1.2 V
	Maximum Voltage: 1.2 V
	Configured Voltage: 1.2 V
	Memory Technology: DRAM
	Memory Operating Mode Capability: Volatile memory
	Firmware Version: Unknown
	Module Manufacturer ID: Bank 1, Hex 0xCE
	Module Product ID: Unknown
	Memory Subsystem Controller Manufacturer ID: Unknown
	Memory Subsystem Controller Product ID: Unknown
	Non-Volatile Size: None
	Volatile Size: 32 GB
	Cache Size: None
	Logical Size: None

Handle 0x0018, DMI type 20, 35 bytes
Memory Device Mapped Address
	Starting Address: 0x00000000000
	Ending Address: 0x01FFFFFFFFF
	Range Size: 128 GB  <<<<<<<<<<<<<<<<<<<<<<<<<<<< But each has a 128 GB range.
	Physical Device Handle: 0x0017
	Memory Array Mapped Address Handle: 0x0011
	Partition Row Position: Unknown
	Interleave Position: Unknown
	Interleaved Data Depth: Unknown
...

This means that I can't find a straightforward matching between the failing address (Error Addr) and a particular bank. There is a Bank Locator in dmidecode that points at a particular socket on the motherboard and also a Serial Number that's on the RAM bank, but there seems to be no way to match the error address with any of that.
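To make the mismatch concrete, here is a minimal Python sketch (not part of the original post) that checks which mapped ranges contain the first logged Error Addr. The interleaved range is the one dmidecode printed above. The four module names and the non-interleaved layout of consecutive 32 GiB windows are assumptions for illustration only; real firmware may lay the ranges out differently.

```python
GIB = 1 << 30
ERROR_ADDR = 0x000000098B3E0040  # first "Error Addr" from the dmesg excerpt

# Hypothetical module names; the range is the one dmidecode reported for every module.
MODULES = ("module 0", "module 1", "module 2", "module 3")
interleaved = {name: (0x00000000000, 0x01FFFFFFFFF) for name in MODULES}

# Assumed non-interleaved layout: four consecutive 32 GiB windows.
separate = {
    name: (i * 32 * GIB, (i + 1) * 32 * GIB - 1)
    for i, name in enumerate(MODULES)
}

def candidates(ranges, addr):
    """Return the modules whose mapped range contains addr."""
    return [name for name, (start, end) in ranges.items() if start <= addr <= end]

print(candidates(interleaved, ERROR_ADDR))  # every module matches
print(candidates(separate, ERROR_ADDR))     # exactly one module matches
```

With the interleaved ranges the lookup returns all four modules, which is why the address alone cannot single out a bank here.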

It often takes 1 to 2 weeks of uptime before ECC errors appear, but sometimes it takes a month, which means that various RAM halving techniques are not a great option either. Also, my past experience with RAM halving (on different hardware) was so bad that I'd rather avoid it. (ECC errors always went away, no matter which half of the banks I removed.) That said, this time I would much rather investigate the issue under a realistic setup, i.e. with all 4 banks installed.

There are better investigation methods that work without halving of the memory. For example, given 4 banks:

Swap two banks and leave the two other banks in original positions.
Based on whether the error address changes or not, you'll know which pair of banks is to blame.
In the pair of banks containing the bad bank, swap one of the banks with one from the healthy pair.
Based on whether the error address changes or not, you now know precisely which bank was failing.
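The four steps above can be sketched as a small Python simulation. It is an idealized model, not a tested diagnostic: it assumes exactly one bad module, errors that reproduce reliably, and an error report that follows the physical slot. The slot names, module names, and the `observe` callback are all hypothetical.

```python
def swapped(arrangement, a, b):
    """Return a copy of the slot -> module mapping with slots a and b exchanged."""
    new = dict(arrangement)
    new[a], new[b] = arrangement[b], arrangement[a]
    return new

def find_bad_module(observe):
    """observe(arrangement) returns the slot that the error log points at."""
    base = {"S0": "m0", "S1": "m1", "S2": "m2", "S3": "m3"}
    before = observe(base)

    # Step 1: swap one pair, leave the other pair in place.
    step1 = swapped(base, "S0", "S1")
    after1 = observe(step1)
    if after1 != before:
        suspects, healthy_slot = ("m0", "m1"), "S2"
    else:
        suspects, healthy_slot = ("m2", "m3"), "S0"

    # Step 2: swap one suspect with a module from the healthy pair.
    probe = suspects[0]
    probe_slot = next(s for s, m in step1.items() if m == probe)
    step2 = swapped(step1, probe_slot, healthy_slot)
    after2 = observe(step2)

    # If the reported location moved again, the probed module is the bad one.
    return probe if after2 != after1 else suspects[1]

def make_observer(bad):
    """Simulated error log: always reports the slot holding the bad module."""
    return lambda arrangement: next(s for s, m in arrangement.items() if m == bad)

print(find_bad_module(make_observer("m2")))
```

The procedure only works when the reported location can be trusted to track a physical slot, which is exactly the assumption that interleaving breaks.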

But again, this^^^ doesn't work in my case, due to the address interleaving, so I have no clue how to interpret the error address (or its change).

I tried to switch some (or hopefully all) interleaving options off in the UEFI setup, but this did not have the desired effect of reducing the address ranges shown in dmidecode to 32 GB. I'm not even sure if interleaving can be entirely disabled on this machine. Most UEFI settings around RAM are quite poorly documented.

Questions concerning the dmesg output:

Could the magic number / bitmask at the end of the .*CPU:0.* line (e.g. 0xdc2031000000011b) possibly identify a bank?
What's IPID and Syndrome? Could it be bank-specific?
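As a rough aid for the two questions above, this Python sketch decodes the logged values. The bit positions are assumptions: the flag bits follow the AMD MCA_STATUS layout as the kernel's decoder prints it, and the channel and chip-select extraction (IPID bits 31:20, low 3 bits of the syndrome) follows what the Linux amd64_edac driver appeared to do for family 17h UMC errors around that time. Check the kernel source for your version before relying on it.

```python
# Assumed MCA_STATUS flag bits (bit number -> name).
STATUS_FLAGS = {
    63: "Val", 62: "Over", 61: "UC", 60: "En", 59: "MiscV", 58: "AddrV",
    57: "PCC", 55: "TCC", 53: "SyndV", 46: "CECC", 45: "UECC",
    44: "Deferred", 43: "Poison", 40: "Scrub",
}

def decode_status(status):
    """List the names of the flag bits that are set, highest bit first."""
    return [
        name
        for bit, name in sorted(STATUS_FLAGS.items(), reverse=True)
        if (status >> bit) & 1
    ]

def umc_location(ipid, syndrome):
    """Return (hardware id, channel, csrow) under the assumptions above."""
    hwid = (ipid >> 32) & 0xFFF          # 0x96 is the Unified Memory Controller
    channel = (ipid & 0xFFFFFFFF) >> 20  # UMC instance, reported as the channel
    csrow = syndrome & 0x7               # chip-select row
    return hwid, channel, csrow

print(decode_status(0xDC2031000000011B))
print(umc_location(0x0000009600150F00, 0x6D2E20200B800001))
```

Under these assumptions the status word only carries error-type flags (UECC, Deferred, Scrub and so on), not a location. IPID and Syndrome do carry location hints: for the values in the log they yield channel 1 and csrow 1, the same `csrow:1 channel:1` that the EDAC lines print.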

Any other debugging ideas?

Last edited by andrej.podzimek (2021-01-22 18:19:42)

loqs · 2021-01-22 06:35:17

Possibly

https://www.kernel.org/doc/html/latest/ … e/ras.html
Following the dual channel example csrow0 ch1 is DIMM_B0

Handle 0x001D, DMI type 17, 84 bytes
Memory Device
	Array Handle: 0x000F
	Error Information Handle: 0x001C
	Total Width: 72 bits
	Data Width: 64 bits
	Size: 32 GB
	Form Factor: DIMM
	Set: None
	Locator: DIMM 0
	Bank Locator: P0 CHANNEL B
	Type: DDR4
	Type Detail: Synchronous Unbuffered (Unregistered)
	Speed: 2667 MT/s
	Manufacturer: Samsung
	Serial Number: 03CF4F5B
	Asset Tag: Not Specified
	Part Number: M391A4G43MB1-CTD    
	Rank: 2
	Configured Memory Speed: 2667 MT/s
	Minimum Voltage: 1.2 V
	Maximum Voltage: 1.2 V
	Configured Voltage: 1.2 V
	Memory Technology: DRAM
	Memory Operating Mode Capability: Volatile memory
	Firmware Version: Unknown
	Module Manufacturer ID: Bank 1, Hex 0xCE
	Module Product ID: Unknown
	Memory Subsystem Controller Manufacturer ID: Unknown
	Memory Subsystem Controller Product ID: Unknown
	Non-Volatile Size: None
	Volatile Size: 32 GB
	Cache Size: None
	Logical Size: None

andrej.podzimek · 2021-01-22 08:11:55

loqs wrote:

Possibly
https://www.kernel.org/doc/html/latest/ … e/ras.html
Following the dual channel example csrow0 ch1 is DIMM_B0

Awesome, thanks a lot! It says csrow1 ch1, but IIUC, according to that dual channel table, that's also in DIMM_B0. (And these are indeed dual-rank modules, so that's hopefully a full match against the table.)

I've just managed to reproduce the problem (again); it's getting really frequent. And with a "mirrored" placement of the 4 banks, it now gives me DIMM_A1, csrow3 ch0. Which means:

I (hopefully) know which bank it is!
I know it's not the MB or the CPU, because the error moves with the bank.

Yay!

Now here comes a confusing bit: In the meantime I managed to switch off interleaving. The UEFI setup has two interleaving options, each in a different menu subtree. I finally found and disabled both and now I have 32GB mapped addresses in dmidecode.

But dmidecode says it's DIMM_B1, not DIMM_A1, based on the interval the Error Addr falls into (at least if I'm parsing this dmidecode data correctly):

	Locator: DIMM 1
	Bank Locator: P0 CHANNEL B

However, I still think that your initial conclusion based on the csrow and channel is correct, because:

When I shuffled (mirrored) the banks, a B0->A1 transition could happen, but B0->B1 could not happen.
While I did switch off interleaving, I didn't switch off bank hashing (left it on auto), so I'm assuming that address ranges shown by dmidecode can still mismatch addresses in dmesg's ECC logs due to this.

andrej.podzimek · 2021-01-22 18:19:22

Alright. Things got dramatic. As I carried out the experiments above, ECC failures started to appear a couple of hours after a reboot. It no longer took them a month to show up. A reboot later I had hundreds of ECC errors in dmesg on that one bank. Yet another reboot later I saw the first unrecoverable memory failure (likely uncorrectable ECC failures somewhere in anonymous kernel memory where stuff can't be reloaded from disk or killed).

So I had no choice but to pull out half the RAM, or else the machine wouldn't be stable and usable any more. Fortunately a replacement bank got delivered ~3 hours after I had ordered it.

I replaced bank A1 (formerly B0) with the new one, restored a full 128 GB capacity, and … while taking the risk of speaking too soon, I think it's fixed.

Folding@Home is running on all 32 CPUs, on the AMD GPU and also on the NVidia eGPU, lots of DMA everywhere.
I scrubbed some 15 TB of Btrfs.
I read ~10 TB of files, just to fill and overwrite the whole RAM with caches a few times.

So far I have zero ECC errors. I just hope, fingers crossed, that it stays that way and errors won't start appearing later during a low-power state. (I restored all RAM interleaving settings to auto, because they make the machine noticeably faster.)

In any case, the bank identified by ECC was indeed the right one to replace. The only confusing bit was the motherboard's 1-based bank numbering, so A1 identified using dmesg was labeled A2 on the motherboard.

Phew. This was an eye-opening experience. If I hadn't had ECC, this would have been slowly eating away my data and causing instability, perhaps over months. ECC saved the day. ECC rulezzz. ECC forever.

Last edited by andrej.podzimek (2021-01-24 18:50:53)

andrej.podzimek · 2021-01-24 18:13:14

So, my comment above (now quite clearly marked invalid, but preserved for future reference) is a clear example of why you should never draw conclusions from just a few hours of testing.

I'll update this thread, for the record, but maybe a month from now, once I'm reasonably sure I got it right this time. As a teaser, I would point out that the UEFI setup has options called BankGroupSwap and BankGroupSwapAlt, which can do something surprising.

Also, with ECC failures, it is often the case that any change to the bank arrangement (e.g. shuffling of the banks around) can either trigger more failures or make failures less likely. It is therefore perfectly possible to swap the wrong bank and become convinced that you solved the problem when in fact you merely made it less likely.

The key take-home message is that a combination of memory controllers on AMD and unbuffered ECC memory can exhibit behaviors different from classical Intel examples around the web.

andrej.podzimek · 2021-04-05 01:56:17

Here’s the promised update at last. It’s a quarter late rather than a month late, but with nearly 3 months of uptime without any ECC errors I finally have some confidence that my conclusions are correct.

TL;DR, long story short, on this hardware:

CH0 == B
CH1 == A

What got me so confused? Well, here’s the original RAM bank arrangement:

A0 (alpha) <<< the bank that was actually failing
A1 (beta)
B0 (gamma) <<< the misidentified failing bank, based on the first guess
B1 (delta)

Errors appeared to be in B0 initially. Which, after the “mirror” reshuffle I did to diagnose the issue further, would have ended up in A1:

A0 (delta)
A1 (gamma) <<< the misidentified failing bank, based on the first guess
B0 (beta)
B1 (alpha) <<< the bank that was actually failing

So, soon after the “mirror” reshuffle (and ECC error log messages mentioned in comments above), I just swapped A1 and claimed victory. Heck, was I wrong! Rule No. 1: RAM chip manipulation often makes ECC errors go away temporarily. One should NOT speak too soon, without at least months of error-free uptime!

As you may have guessed by now, errors came back. This time with an intensity scarier than ever. It was getting close to a situation in which the machine would not boot.

As already mentioned, channels are swapped on this motherboard, so CH0 == B and CH1 == A. How could this lead to so much confusion? The symmetry of the “mirror” reshuffle caused it.

Before the reshuffle, A0 (alpha) was actually failing (misidentified as B0 (gamma)).
After the reshuffle, B1 (alpha) was actually failing (misidentified as A1 (gamma)).

And then confirmation bias did the rest. This is because the “mirror” reshuffle and the bank identification based on it were both affected by the very same incorrect assumption that CH0 == A and CH1 == B.
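The symmetry can be checked with a short Python sketch. It takes the two arrangements listed above and the two EDAC observations from this thread (csrow 1 / channel 1 before the reshuffle, csrow 3 / channel 0 after it). The `slot` helper is hypothetical and assumes dual-rank modules with two csrows per DIMM, as in the kernel's dual-channel example table.

```python
ORIGINAL = {"A0": "alpha", "A1": "beta", "B0": "gamma", "B1": "delta"}
MIRRORED = {"A0": "delta", "A1": "gamma", "B0": "beta", "B1": "alpha"}

# (arrangement, csrow, channel) as reported by EDAC before and after the reshuffle.
OBSERVATIONS = [(ORIGINAL, 1, 1), (MIRRORED, 3, 0)]

def slot(csrow, channel, ch0_letter):
    """Map csrow/channel to a slot label, given which letter channel 0 stands for."""
    letters = ("A", "B") if ch0_letter == "A" else ("B", "A")
    return f"{letters[channel]}{csrow // 2}"  # two csrows (ranks) per DIMM

def blamed(ch0_letter):
    """Modules blamed by each observation under one channel-naming hypothesis."""
    return [arr[slot(csrow, ch, ch0_letter)] for arr, csrow, ch in OBSERVATIONS]

print(blamed("A"))  # hypothesis CH0 == A
print(blamed("B"))  # hypothesis CH0 == B
```

Each hypothesis blames a single module consistently across both observations (gamma under CH0 == A, alpha under CH0 == B), so the mirror reshuffle by itself could not expose the wrong channel assumption.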

Unsurprisingly, replacing (gamma) did not improve the situation at all. In fact (alpha) was the module that had been failing and needed to be replaced.

For a moment I was worried that I was observing a cascading failure of multiple RAM chips, but that was not the case. It’s just that on this motherboard / chipset, CH0 == B and CH1 == A. With (alpha) swapped, I haven’t seen a single ECC error.

Arch Linux
