AMD Opteron MC4 Error: ECC Error in the Probe Filter directory

richard · February 9, 2025, 8:24am

I’m running a couple of Supermicro H8SGL-F boards and recently upgraded their processors (to the 6300 series)…

The 6338P upgrade went flawlessly, but not so on the other board, with a 6380:

System:
  Host: cassini Kernel: 6.12.12-1-lts arch: x86_64 bits: 64
  Console: pty pts/0 Distro: EndeavourOS
Machine:
  Type: Server System: Supermicro product: H8SGL v: 1234567890 serial: <superuser required>
  Mobo: Supermicro model: H8SGL v: 1234567890 serial: <superuser required>
    BIOS: American Megatrends v: 3.5b date: 03/18/2016
CPU:
  Info: 8-core AMD Opteron 6380 [MT MCP] speed (MHz): avg: 1400 min/max: 1400/2500
...
Network:
  Device-1: Intel 82599ES 10-Gigabit SFI/SFP+ Network driver: ixgbe
  Device-2: Intel 82574L Gigabit Network driver: e1000e
  Device-3: Intel 82574L Gigabit Network driver: e1000e
Drives:
  Local Storage: total: raw: 473.24 GiB usable: 457.12 GiB used: 50.62 GiB (11.1%)
Info:
  Memory: total: 128 GiB note: est. available: 125.76 GiB used: 8.74 GiB (6.9%)
  Processes: 397 Uptime: 23h 10m Init: systemd Shell: Bash inxi: 3.3.37

I’m seeing the following log entry:

2025-02-09T00:59:13+01:00 kernel: mce: [Hardware Error]: Machine check events logged
2025-02-09T00:59:13+01:00 kernel: [Hardware Error]: Corrected error, no action required.
2025-02-09T00:59:13+01:00 kernel: [Hardware Error]: CPU:0 (15:2:0) MC4_STATUS[-|CE|MiscV|AddrV|-|CECC|-]: 0x9c3c44e0011f010b
2025-02-09T00:59:13+01:00 kernel: [Hardware Error]: Error Addr: 0x000000000001f140
2025-02-09T00:59:13+01:00 kernel: [Hardware Error]: MC4 Error (node 0): ECC Error in the Probe Filter directory.
2025-02-09T00:59:13+01:00 kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: GEN
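
For reference, the bracketed flags in the MC4_STATUS line can be recovered from the raw hex value. Below is a minimal Python sketch, assuming the bit positions used by the Linux kernel's AMD MCE decoder (Val 63, Over 62, UC 61, En 60, MiscV 59, AddrV 58, PCC 57, CECC 46, UECC 45, extended error code in bits 20:16). The function and table names are made up for illustration, and the layout should be checked against the BKDG for the CPU family in question.

```python
# Bit positions assumed from the Linux kernel's AMD MCE decoder;
# verify against the BKDG for your CPU family before relying on them.
STATUS_FLAGS = {
    "Val": 63,
    "Over": 62,
    "UC": 61,
    "En": 60,
    "MiscV": 59,
    "AddrV": 58,
    "PCC": 57,
    "CECC": 46,
    "UECC": 45,
}


def decode_mc_status(status: int) -> dict:
    """Split a 64-bit MCi_STATUS value into flags and error-code fields."""
    decoded = {name: bool((status >> bit) & 1) for name, bit in STATUS_FLAGS.items()}
    decoded["xec"] = (status >> 16) & 0x1F   # extended error code (assumed 5 bits wide)
    decoded["error_code"] = status & 0xFFFF  # low 16 bits: error code
    return decoded


if __name__ == "__main__":
    info = decode_mc_status(0x9C3C44E0011F010B)
    set_flags = [name for name in STATUS_FLAGS if info[name]]
    print("flags:", ", ".join(set_flags))
    print(f"xec: {info['xec']:#04x}  error code: {info['error_code']:#06x}")
```

For the value logged above, this reports Val, En, MiscV, AddrV and CECC as set, with UC and PCC clear, which agrees with the CE / MiscV / AddrV / CECC fields the kernel printed.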

Even though there is ‘no action required’, there’s nothing I hate more than finding turds in the log files.

The 6380 is one of a pair I got on eBay that were originally installed in a dual-processor system.

The only perceivable issue so far is that occasionally, when I issue a reboot, the system doesn’t come back up and I have to do a manual reset.

My question, for anybody who has come across this or knows about it: is this potentially a BIOS issue on the Supermicro side, or an AMD processor issue (or a non-issue)? Apparently it involves HT Assist, aka the Probe Filter.

1093i3511 · February 9, 2025, 1:29pm

Check if the issue migrates to the other machine, if you swap the memory modules.

richard · February 9, 2025, 4:38pm

When I get a chance, I’ll try that.
Though I did a full memtest86+ run with no apparent errors.

Indeed, that’s the reason I thought perhaps a cpu cache issue…

Also, the system doesn’t seem to log these as memory errors:

$ sudo ras-mc-ctl --error-count
Label                   CE      UE
mc#0csrow#2channel#1    0       0
mc#1csrow#2channel#1    0       0
mc#1csrow#2channel#0    0       0
mc#0csrow#2channel#0    0       0
mc#0csrow#0channel#1    0       0
mc#0csrow#1channel#0    0       0
mc#0csrow#3channel#1    0       0
mc#1csrow#0channel#1    0       0
mc#1csrow#3channel#1    0       0
mc#1csrow#1channel#0    0       0
mc#1csrow#1channel#1    0       0
mc#1csrow#3channel#0    0       0
mc#1csrow#0channel#0    0       0
mc#0csrow#3channel#0    0       0
mc#0csrow#1channel#1    0       0
mc#0csrow#0channel#0    0       0
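
To keep an eye on these counters over time, the table can be totalled with a few lines of Python. This is only a sketch: it assumes the three-column "Label CE UE" layout shown above, `parse_error_counts` is a hypothetical helper name, and the sample text is the first two rows of the output above.

```python
def parse_error_counts(text: str) -> dict:
    """Map each label to its (CE, UE) counts from `ras-mc-ctl --error-count` output."""
    counts = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip the header row and anything that is not "label CE UE".
        if len(parts) != 3 or not (parts[1].isdigit() and parts[2].isdigit()):
            continue
        counts[parts[0]] = (int(parts[1]), int(parts[2]))
    return counts


# First two rows of the output above, used as sample input.
SAMPLE = """\
Label                   CE      UE
mc#0csrow#2channel#1    0       0
mc#1csrow#2channel#1    0       0
"""

if __name__ == "__main__":
    counts = parse_error_counts(SAMPLE)
    total_ce = sum(ce for ce, _ in counts.values())
    total_ue = sum(ue for _, ue in counts.values())
    print(f"{len(counts)} labels, CE={total_ce}, UE={total_ue}")
```

In practice the text would come from the command itself (for example via `subprocess.run`) rather than a pasted sample.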

cheers

ricklinux · February 9, 2025, 4:58pm

Did you disable ECC when running memtest86+?

richard · February 9, 2025, 5:02pm

No, not in particular. I presume you mean in the BIOS prior to running the tests, yes?
I guess I presumed that memtest would check if there were ‘corrected’ errors and signal them…
I’ll try that tomorrow…

ricklinux · February 9, 2025, 5:33pm

Also, have you checked sudo dmesg and the mcelog for more info?
MCE4 is related to the CPU cache but can be caused by overheating, memory corruption, or faulty hardware such as the CPU or power supply (low idle voltage).
You may be able to address that in bios settings.

I see the bios is updated to what is available.

It could be a motherboard component issue also.
Complicated stuff.

richard · February 12, 2025, 6:49am

Yeah, dmesg is where I first discovered the subject.

Also, I’m already using rasdaemon, which apparently is necessary for these Opterons since they are not supported by mcelog.

Seems to me I saw on the supermicro page for the H8SGL-F that it was 6300P ready.

It’s possible that I misread that as 6300 series already supported, and the new 6300P is also supported now.

Indeed, the 6338P doesn’t seem to have any issue.

Well, the good news is, after disabling ECC, successfully running a full pass of memtest86+, and then re-enabling ECC, I’ve gone a couple of days without the error.

Hopefully the act of doing the above reset any cpu issue… I’ll give it some more time though.

Thanks for the hint!

cheers

1093i3511 · February 12, 2025, 8:44am

Based on this error code, a simple search turned up this thread. Maybe you can narrow it down (e.g. by verifying whether it’s an ECC error thrown by a specific RAM stick) using edac-utils.

richard · February 12, 2025, 10:05am

I did come across that thread…
On EOS, these are the last couple of errors shown with

$ sudo ras-mc-ctl  --errors
No Memory errors.

No PCIe AER errors.

No ARM processor errors.

No Extlog errors.

No devlink errors.

MCE events:
1 2024-11-19 09:29:08 +0100 error: , CPU 2, mcgcap=0x00000107, status=0xdc60400069080a13, addr=0x4289e0a80, misc=0xa01c0f0f01000000, walltime=0x673c4c55, cpu=0x00000006, cpuid=0x00600f20, apicid=0x00000006, bank=0x00000004
2 2024-11-19 09:34:22 +0100 error: , CPU 2, mcgcap=0x00000107, status=0xdc4040004e080a13, addr=0x4b1feaac0, misc=0xa01c0f4301000000, walltime=0x673c4d8f, cpu=0x00000006, cpuid=0x00600f20, apicid=0x00000006, bank=0x00000004
... 8< snip >8
420 2025-02-08 05:08:29 +0100 error: , CPU 2, mcgcap=0x00000107, status=0x9c3c44e0011f010b, addr=0x0001f140, misc=0xa01c0f0001000000, walltime=0x67a6d8be, cpuid=0x00600f20, apicid=0x00000020, socketid=0x00000001, bank=0x00000004, microcode=0x06000852
421 2025-02-08 05:44:54 +0100 error: , CPU 2, mcgcap=0x00000107, status=0x9c3c44e0011f010b, addr=0x0001f140, misc=0xa01c0f0001000000, walltime=0x67a6e147, cpuid=0x00600f20, apicid=0x00000020, socketid=0x00000001, bank=0x00000004, microcode=0x06000852
422 2025-02-09 00:59:14 +0100 error: , CPU 2, mcgcap=0x00000107, status=0x9c3c44e0011f010b, addr=0x0001f140, misc=0xa01c0f0001000000, walltime=0x67a7efd1, cpuid=0x00600f20, apicid=0x00000020, socketid=0x00000001, bank=0x00000004, microcode=0x06000852

with no errors since Sunday, after the memtest86+ run Monday morning.
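
The MCE event lines are easier to compare once the key=value fields are pulled apart. Here is a minimal Python sketch that parses them and groups events by status and address. The helper names are hypothetical, and reading `walltime` as seconds since the Unix epoch is an assumption, although for event 422 it lands on the same second as the kernel log entry quoted at the top of the thread.

```python
import re
from collections import Counter
from datetime import datetime, timezone

# Matches fields such as "status=0x9c3c44e0011f010b".
FIELD_RE = re.compile(r"(\w+)=(0x[0-9a-fA-F]+)")


def parse_mce_event(line: str) -> dict:
    """Return the key=0x... fields of one MCE event line as integers."""
    return {key: int(value, 16) for key, value in FIELD_RE.findall(line)}


def walltime_utc(fields: dict) -> datetime:
    """Interpret walltime as Unix epoch seconds (an assumption, see above)."""
    return datetime.fromtimestamp(fields["walltime"], tz=timezone.utc)


# Events 1, 421 and 422, copied from the output above.
LINES = [
    "1 2024-11-19 09:29:08 +0100 error: , CPU 2, mcgcap=0x00000107, status=0xdc60400069080a13, addr=0x4289e0a80, misc=0xa01c0f0f01000000, walltime=0x673c4c55, cpu=0x00000006, cpuid=0x00600f20, apicid=0x00000006, bank=0x00000004",
    "421 2025-02-08 05:44:54 +0100 error: , CPU 2, mcgcap=0x00000107, status=0x9c3c44e0011f010b, addr=0x0001f140, misc=0xa01c0f0001000000, walltime=0x67a6e147, cpuid=0x00600f20, apicid=0x00000020, socketid=0x00000001, bank=0x00000004, microcode=0x06000852",
    "422 2025-02-09 00:59:14 +0100 error: , CPU 2, mcgcap=0x00000107, status=0x9c3c44e0011f010b, addr=0x0001f140, misc=0xa01c0f0001000000, walltime=0x67a7efd1, cpuid=0x00600f20, apicid=0x00000020, socketid=0x00000001, bank=0x00000004, microcode=0x06000852",
]

if __name__ == "__main__":
    events = [parse_mce_event(line) for line in LINES]
    # Count how often each (status, addr) pair recurs.
    repeats = Counter((e["status"], e["addr"]) for e in events)
    for (status, addr), n in repeats.items():
        print(f"status={status:#018x} addr={addr:#x} seen {n}x")
    print("last event (UTC):", walltime_utc(events[-1]).isoformat())
```

Grouping this way makes it obvious that the recent events all carry the same status and the same address (0x1f140), while the November entries have different status values and addresses.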

Where did you find edac-utils packaged for an Arch platform?
The output is a bit different…

richard · February 15, 2025, 10:45am

Well, still getting some errors, perhaps because I set BIOS ECC to ‘good’ instead of just ‘basic’.
In any event, I’ll probably try switching memory the next time I get a chance to bring down the systems.

BTW, for grins, I downloaded and tested edac-utils… doesn’t seem to do well on this opteron…

$ /usr/local/bin/edac-util  -v
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow0: 0 Uncorrected Errors
mc0: csrow0: mc#0csrow#0channel#0: 0 Corrected Errors
mc0: csrow0: mc#0csrow#0channel#1: 0 Corrected Errors
mc0: csrow1: 0 Uncorrected Errors
mc0: csrow1: mc#0csrow#1channel#0: 0 Corrected Errors
mc0: csrow1: mc#0csrow#1channel#1: 0 Corrected Errors
mc0: csrow2: 0 Uncorrected Errors
mc0: csrow2: mc#0csrow#2channel#0: 0 Corrected Errors
mc0: csrow2: mc#0csrow#2channel#1: 0 Corrected Errors
mc0: csrow3: 0 Uncorrected Errors
mc0: csrow3: mc#0csrow#3channel#0: 0 Corrected Errors
mc0: csrow3: mc#0csrow#3channel#1: 0 Corrected Errors
mc1: 0 Uncorrected Errors with no DIMM info
mc1: 0 Corrected Errors with no DIMM info
mc1: csrow0: 0 Uncorrected Errors
mc1: csrow0: mc#1csrow#0channel#0: 0 Corrected Errors
mc1: csrow0: mc#1csrow#0channel#1: 0 Corrected Errors
mc1: csrow1: 0 Uncorrected Errors
mc1: csrow1: mc#1csrow#1channel#0: 0 Corrected Errors
mc1: csrow1: mc#1csrow#1channel#1: 0 Corrected Errors
mc1: csrow2: 0 Uncorrected Errors
mc1: csrow2: mc#1csrow#2channel#0: 0 Corrected Errors
mc1: csrow2: mc#1csrow#2channel#1: 0 Corrected Errors
mc1: csrow3: 0 Uncorrected Errors
mc1: csrow3: mc#1csrow#3channel#0: 0 Corrected Errors
mc1: csrow3: mc#1csrow#3channel#1: 0 Corrected Errors
edac-util: No errors to report.

whereas

$ uptime -s
2025-02-13 07:09:34
$ sudo ras-mc-ctl  --errors
No Memory errors.

No PCIe AER errors.

No ARM processor errors.

No Extlog errors.

No devlink errors.

MCE events:
1 2024-11-19 09:29:08 +0100 error: , CPU 2, mcgcap=0x00000107, status=0xdc60400069080a13, addr=0x4289e0a80, misc=0xa01c0f0f01000000, walltime=0x673c4c55, cpu=0x00000006, cpuid=0x00600f20, apicid=0x00000006, bank=0x00000004
2 2024-11-19 09:34:22 +0100 error: , CPU 2, mcgcap=0x00000107, status=0xdc4040004e080a13, addr=0x4b1feaac0, misc=0xa01c0f4301000000, walltime=0x673c4d8f, cpu=0x00000006, cpuid=0x00600f20, apicid=0x00000006, bank=0x00000004
8< snip >8
421 2025-02-08 05:44:54 +0100 error: , CPU 2, mcgcap=0x00000107, status=0x9c3c44e0011f010b, addr=0x0001f140, misc=0xa01c0f0001000000, walltime=0x67a6e147, cpuid=0x00600f20, apicid=0x00000020, socketid=0x00000001, bank=0x00000004, microcode=0x06000852
422 2025-02-09 00:59:14 +0100 error: , CPU 2, mcgcap=0x00000107, status=0x9c3c44e0011f010b, addr=0x0001f140, misc=0xa01c0f0001000000, walltime=0x67a7efd1, cpuid=0x00600f20, apicid=0x00000020, socketid=0x00000001, bank=0x00000004, microcode=0x06000852
423 2025-02-14 20:38:56 +0100 error: , CPU 2, mcgcap=0x00000107, status=0x9c3c44e0011f010b, addr=0x0001f140, misc=0xa01c0f0001000000, walltime=0x67af9bd2, cpuid=0x00600f20, apicid=0x00000020, socketid=0x00000001, bank=0x00000004, microcode=0x06000852
424 2025-02-15 10:45:40 +0100 error: , CPU 2, mcgcap=0x00000107, status=0x9c3c44e0011f010b, addr=0x0001f140, misc=0xa01c0f0001000000, walltime=0x67b06246, cpuid=0x00600f20, apicid=0x00000020, socketid=0x00000001, bank=0x00000004, microcode=0x06000852

richard · April 16, 2025, 6:28am

Update, finally… The good news is that it was not the memory.
I finally decided to switch out the CPU (since I happened to buy a pair on eBay).
No problems since!