EndeavourOS crashing and rebooting after new CPU has been installed

soupcookie · September 19, 2024, 12:22pm

Yesterday I replaced my Ryzen 3 2200G with a Ryzen 7 5700X and then reinstalled EndeavourOS (it was time to clean up my system anyway). Now, every couple of hours, my system crashes and reboots. This is not an issue on Windows at all, so I suspect it may be related to Linux rather than broken hardware. Before this upgrade I very rarely experienced any crashes on EOS.

After the crashes I ran journalctl | grep -i "hardware err"

And received this:

Sep 19 07:19:18 netsu-pc kernel: mce: [Hardware Error]: Machine check events logged
Sep 19 07:19:18 netsu-pc kernel: mce: [Hardware Error]: CPU 5: Machine Check: 0 Bank 5: bea0000001000108
Sep 19 07:19:18 netsu-pc kernel: mce: [Hardware Error]: TSC 0 
Sep 19 07:19:18 netsu-pc kernel: mce: [Hardware Error]: PROCESSOR 2:a20f12 TIME 1726723156 SOCKET 0 APIC a microcode a20120a
Sep 19 13:59:19 netsu-pc kernel: mce: [Hardware Error]: Machine check events logged
Sep 19 13:59:19 netsu-pc kernel: mce: [Hardware Error]: CPU 13: Machine Check: 0 Bank 5: bea0000000000108
Sep 19 13:59:19 netsu-pc kernel: mce: [Hardware Error]: TSC 0 ADDR ffffff8e26f382 MISC d012000100000000 SYND 4d000000 IPID 500b000000000 
Sep 19 13:59:19 netsu-pc kernel: mce: [Hardware Error]: PROCESSOR 2:a20f12 TIME 1726747157 SOCKET 0 APIC b microcode a20120a

Can confirm nothing is overheating (heard it may be the cause):

sensors
k10temp-pci-00c3
Adapter: PCI adapter
Tctl:         +54.2°C  
Tccd1:        +53.5°C  

nouveau-pci-0800
Adapter: PCI adapter
fan1:         745 RPM
temp1:        +35.0°C  (high = +95.0°C, hyst =  +3.0°C)
                       (crit = +105.0°C, hyst =  +5.0°C)
                       (emerg = +135.0°C, hyst =  +5.0°C)

I also tried adding processor.max_cstate=1 to GRUB_CMDLINE_LINUX_DEFAULT in the GRUB config after reading about others having similar issues. At best it may have decreased the frequency of the crashes, but most likely it did not do much, if anything.
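To confirm that a parameter added to GRUB_CMDLINE_LINUX_DEFAULT actually reached the running kernel, /proc/cmdline can be checked after regenerating the GRUB config and rebooting. A minimal sketch, assuming GRUB is the bootloader and its config lives at the usual /boot/grub/grub.cfg; the helper name has_kernel_param is made up for illustration:

```shell
# Hypothetical helper: succeeds if parameter $1 appears as a whole word
# in the kernel command line string $2.
has_kernel_param() {
    case " $2 " in
        *" $1 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

# After editing /etc/default/grub, regenerate the config and reboot:
#   sudo grub-mkconfig -o /boot/grub/grub.cfg

# Then check what the running kernel was actually booted with.
if has_kernel_param "processor.max_cstate=1" "$(cat /proc/cmdline 2>/dev/null)"; then
    echo "processor.max_cstate=1 is active"
else
    echo "processor.max_cstate=1 is NOT on the kernel command line"
fi
```

If the parameter is missing here, the setting was never applied, and any change in crash frequency cannot be credited to it.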

The crashes seem to happen most frequently when I switch applications (like VSCode to Vivaldi with alt+tab). In fact, about 70% of the crashes seem to happen when I’m using a browser, but that may just be coincidence. The CPU and RAM are not stressed at all; usage stays low the whole time. This may also be a Budgie issue, since I’ve always used XFCE with i3 and decided to test out Budgie with this new install.

Possibly Important Hardware Info:
CPU: Ryzen 7 5700X
GPU: NVIDIA 1080 TI

Software Info:
OS: EndeavourOS (Linux 6.6.51-1-lts and latest available downloadable ISO)
DE: Budgie
Common Software Open During Crashes: Vivaldi, Firefox, VSCode

fred666 · September 19, 2024, 1:17pm

The first thing that comes to mind is a BIOS upgrade.
Your new CPU may have issues with a specific BIOS version, so I would try upgrading to the latest stable version and recheck whether the issues persist.

soupcookie · September 19, 2024, 1:30pm

Thing is, I updated the BIOS to the version I am running now (from 5.10 to 7.30) about a week before I installed the CPU, so I doubt it is that.

Reference: https://www.asrock.com/mb/AMD/AB350M-HDV/index.asp#BIOS

fred666 · September 19, 2024, 3:11pm

Still, this seems to be a hardware error to me.
No overclocking involved?
Have you reset the BIOS to default settings?

Stress testing the RAM is next,
or reseating the memory modules.

ricklinux · September 19, 2024, 3:38pm

The first error looks like a memory error. Reseat all the memory and run a longer, advanced memory test on it. I would also try the latest current kernel instead of the LTS. If none of the previous suggestions help, you could also try some alternate kernel parameters: idle=nomwait or iommu=soft.

I would also say I wouldn’t be afraid to try the beta BIOS updates either, because they both have a newer AMD AGESA, which controls all the power for the CPU.

The one thing I don’t understand about ASRock’s BIOS implementations is the warning that you cannot flash back to a previous BIOS version once updated. So I understand if you’re not wanting to do that.

soupcookie · September 23, 2024, 8:01am

Hey! Thanks for the help! So far I have had no more crashes after resetting my BIOS to default settings (strange?). If I experience no crashes this week, I will consider this the solution.

Also to answer your other question, no overclocking is involved

soupcookie · September 23, 2024, 9:23am

Never mind, it happened again. I will continue applying the suggested fixes and tests in the hope of fixing the issue.

Here are more captured error logs again in case it is helpful later

Sep 19 07:19:18 netsu-pc kernel: mce: [Hardware Error]: Machine check events logged
Sep 19 07:19:18 netsu-pc kernel: mce: [Hardware Error]: CPU 5: Machine Check: 0 Bank 5: bea0000001000108
Sep 19 07:19:18 netsu-pc kernel: mce: [Hardware Error]: TSC 0 
Sep 19 07:19:18 netsu-pc kernel: mce: [Hardware Error]: PROCESSOR 2:a20f12 TIME 1726723156 SOCKET 0 APIC a microcode a20120a
Sep 19 13:59:19 netsu-pc kernel: mce: [Hardware Error]: Machine check events logged
Sep 19 13:59:19 netsu-pc kernel: mce: [Hardware Error]: CPU 13: Machine Check: 0 Bank 5: bea0000000000108
Sep 19 13:59:19 netsu-pc kernel: mce: [Hardware Error]: TSC 0 ADDR ffffff8e26f382 MISC d012000100000000 SYND 4d000000 IPID 500b000000000 
Sep 19 13:59:19 netsu-pc kernel: mce: [Hardware Error]: PROCESSOR 2:a20f12 TIME 1726747157 SOCKET 0 APIC b microcode a20120a
Sep 23 11:20:05 netsu-pc kernel: mce: [Hardware Error]: Machine check events logged
Sep 23 11:20:05 netsu-pc kernel: mce: [Hardware Error]: CPU 1: Machine Check: 0 Bank 5: bea0000000000108
Sep 23 11:20:05 netsu-pc kernel: mce: [Hardware Error]: TSC 0 ADDR ffffff9426f382 MISC d012000100000000 SYND 4d000000 IPID 500b000000000 
Sep 23 11:20:05 netsu-pc kernel: mce: [Hardware Error]: PROCESSOR 2:a20f12 TIME 1727083203 SOCKET 0 APIC 2 microcode a20120a
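With several events collected, it can help to tabulate which CPU and bank each one came from. A minimal sketch using awk; summarize_mce is a hypothetical helper name, and the pattern assumes the exact message layout shown in the logs above:

```shell
# Hypothetical helper: reads journal text on stdin and prints one line per
# machine check event with the reporting CPU, the bank, and the status word.
summarize_mce() {
    awk '/\[Hardware Error\]: CPU [0-9]+: Machine Check/ {
        # Locate fields relative to the "CPU" and "Bank" tokens so the
        # length of the journal prefix (date, host) does not matter.
        for (i = 1; i <= NF; i++) {
            if ($i == "CPU")  { cpu = $(i + 1);  sub(":", "", cpu) }
            if ($i == "Bank") { bank = $(i + 1); sub(":", "", bank); status = $(i + 2) }
        }
        print "cpu=" cpu, "bank=" bank, "status=" status
    }'
}

# Usage: journalctl --no-pager | summarize_mce | sort | uniq -c
```

Run against the log above, every event reports Bank 5 while the reporting CPU varies (5, 13, 1), so the errors are not tied to one logical CPU.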

joekamprad · September 23, 2024, 9:48am

The error message “[Hardware Error]: CPU 1: Machine Check: 0 Bank 5: bea0000000000108” indicates a hardware-related issue with the CPU you’ve installed on your Linux system. This error typically suggests a problem with the CPU’s internal circuitry or its compatibility with the motherboard.
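The status word itself can be unpacked a little further. A minimal sketch, assuming the architectural MCi_STATUS layout shared by x86 CPUs (bit 63 VAL, 62 OVER, 61 UC, 60 EN, 59 MISCV, 58 ADDRV, 57 PCC, low 16 bits the MCA error code); vendor-specific fields are not decoded, and decode_mce_status is a hypothetical helper name:

```shell
# Hypothetical helper: prints the architectural flag bits and the low
# 16-bit error code of a machine check status word given as hex.
decode_mce_status() {
    # Left-pad to 16 hex digits, then split into two 32-bit halves so the
    # arithmetic never overflows a signed 64-bit shell integer.
    hex=$(printf '%16s' "${1#0x}" | tr ' ' '0')
    hi=$((0x$(printf '%s' "$hex" | cut -c1-8)))
    lo=$((0x$(printf '%s' "$hex" | cut -c9-16)))
    # Bits 63..57 of the full word are bits 31..25 of the high half.
    printf 'VAL=%d OVER=%d UC=%d EN=%d MISCV=%d ADDRV=%d PCC=%d code=0x%04x\n' \
        $(( (hi >> 31) & 1 )) $(( (hi >> 30) & 1 )) $(( (hi >> 29) & 1 )) \
        $(( (hi >> 28) & 1 )) $(( (hi >> 27) & 1 )) $(( (hi >> 26) & 1 )) \
        $(( (hi >> 25) & 1 )) $(( lo & 0xffff ))
}

decode_mce_status bea0000000000108
# prints: VAL=1 OVER=0 UC=1 EN=1 MISCV=1 ADDRV=1 PCC=1 code=0x0108
```

UC=1 and PCC=1 mark the error as uncorrected with processor context corrupt, which fits the immediate reboots. The flags alone do not say whether the CPU, the board, the memory, or a setting is at fault.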

soupcookie · September 23, 2024, 10:55am

Does this mean there can be an issue with the CPU itself? Or possibly an issue with the motherboard’s CPU support (as some have suggested, fixed by further updating the BIOS)?

The thing I don’t understand, though, is why it only affects Linux. This weekend I was on Windows for 3 days straight, primarily gaming (CPU- and RAM-intensive games I could not play with my old CPU) and browsing, and there was not one crash, hang or issue. Then I boot up Linux, and when I use it for longer than an hour it gives a short hang or two, then shortly after it reboots itself… Does Windows just have some sort of fallback code that prevents this crash, even though it is a hardware issue?

@fred666
Following your recommendation to stress test, I have installed and used a couple of stress-testing tools. Please excuse any missing logs or flags in the commands; I would gladly rerun them if required.

I also kept an eye on my temps, CPU generally hovered here:
k10temp-pci-00c3
Adapter: PCI adapter
Tctl: +56.8°C
Tccd1: +53.0°C

And during the stress tests it went up to (highest values seen):
Tctl: +94.9°C
Tccd1: +94.8°C
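
To see how the temperature behaves over a whole stress run rather than at a glance, the Tctl reading can be logged at intervals. A minimal sketch, assuming lm_sensors output in the k10temp format shown above; tctl_of is a hypothetical helper name and tctl.log is an arbitrary file name:

```shell
# Hypothetical helper: reads `sensors` output on stdin and prints the
# numeric Tctl value (sign and unit stripped).
tctl_of() {
    awk '$1 == "Tctl:" { gsub(/[^0-9.]/, "", $2); print $2 }'
}

# Usage: log a timestamped reading every 2 seconds while a test runs
# (stop with Ctrl+C):
#   while sleep 2; do
#       printf '%s %s\n' "$(date +%T)" "$(sensors | tctl_of)"
#   done >> tctl.log
```

A log like this shows whether the peak was a brief spike or sustained, and the timestamps can be lined up against any crash.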

stressapptest https://github.com/stressapptest/stressapptest

stressapptest -s 600 -W -l stress.log

2024/09/23-11:30:48(SAST) Log: Commandline - stressapptest -s 60 -W -l stress.log
2024/09/23-11:30:48(SAST) Stats: SAT revision 1.0.11_autoconf, 64 bit binary
2024/09/23-11:30:48(SAST) Log: reproducible @ reproducible on Mon Sep 23 09:27:43 UTC 2024 from open source release
2024/09/23-11:30:48(SAST) Log: 1 nodes, 16 cpus.
2024/09/23-11:30:48(SAST) Log: Defaulting to 16 copy threads
2024/09/23-11:30:48(SAST) Log: Total 15914 MB. Free 12552 MB. Hugepages 0 MB. Targeting 14926 MB (93%)
2024/09/23-11:30:48(SAST) Log: Prefer plain malloc memory allocation.
2024/09/23-11:30:48(SAST) Log: Using mmap() allocation at 0x73023b51f000.
2024/09/23-11:30:48(SAST) Stats: Starting SAT, 14926M, 60 seconds
2024/09/23-11:30:51(SAST) Log: Region mask: 0x1
2024/09/23-11:31:01(SAST) Log: Seconds remaining: 600
----
2024/09/23-11:31:41(SAST) Log: Seconds remaining: 10
2024/09/23-12:50:59(SAST) Stats: Found 0 hardware incidents
2024/09/23-12:50:59(SAST) Stats: Completed: 6076082.00M in 600.50s 10118.45MB/s, with 0 hardware incidents, 0 errors
2024/09/23-12:50:59(SAST) Stats: Memory Copy: 6076082.00M at 10131.81MB/s
2024/09/23-12:50:59(SAST) Stats: File Copy: 0.00M at 0.00MB/s
2024/09/23-12:50:59(SAST) Stats: Net Copy: 0.00M at 0.00MB/s
2024/09/23-12:50:59(SAST) Stats: Data Check: 0.00M at 0.00MB/s
2024/09/23-12:50:59(SAST) Stats: Invert Data: 0.00M at 0.00MB/s
2024/09/23-12:50:59(SAST) Stats: Disk: 0.00M at 0.00MB/s
2024/09/23-12:50:59(SAST) 
2024/09/23-12:50:59(SAST) Status: PASS - please verify no corrected errors
2024/09/23-12:50:59(SAST)

I also ran stress, GTKStressTesting and mprime

mprime - https://aur.archlinux.org/packages/mprime-bin
stress - https://www.geeksforgeeks.org/linux-stress-command-with-examples/
GST - https://gitlab.com/leinardi/gst

mprime -b
stress --vm-keep --vm 30 -t 30

Both gave logs on what they were doing (or none at all), but no logs about the system itself

CPU and RAM were pretty much maxed out in htop, especially with the first test. There were no issues to report during the tests that I could see.

@ricklinux unfortunately the same issue occurs on the latest kernel. I’ll try adding the kernel parameters you suggested and get back to you, and possibly update the BIOS to a beta version as well.

joekamprad · September 23, 2024, 11:11am

It could also be some leftover setup that worked for the old CPU but does not for the new one.

soupcookie · September 25, 2024, 5:33am

Possibly? I did however re-install both Windows and Linux when the new CPU was installed to prevent any issues such as that

joekamprad · September 25, 2024, 8:27am

So nothing should be in the way from that side.

Vermeer, Cezanne and Renoir CPUs are supported after updating the BIOS, using another CPU for that update.

But I bet your BIOS is already on the latest stable version.

I found this post:
https://www.reddit.com/r/archlinux/comments/vz5apu/a_solution_to_mce_hardware_error_reboots_on_amd/

It says to add processor.max_cstate=5 to the boot options, or to downgrade linux-firmware.

soupcookie · September 25, 2024, 9:10am

“idle=nomwait or iommu=soft”
Unfortunately it did not do the trick

soupcookie · September 25, 2024, 9:19am

Yes, I’m already on 7.3. ricklinux recommended updating further, but that will be my last resort if nothing else works.

I already added processor.max_cstate=1, which sounds like it should be more stable? But I will change it to 5 if you believe it will help.

I’ll try downgrading the firmware next and hope for the best. I currently have:

Name            : linux-firmware
Version         : 20240909.552ed9b8-1

What would you recommend I go down to, 2024.08.01?

joekamprad · September 25, 2024, 9:30am

One version down. There is also linux-firmware-git in the AUR, which would be the newer version and could help too.
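
A minimal sketch of going one version down from the local pacman cache, assuming the older package is still in the default cache directory /var/cache/pacman/pkg. The helper name list_cached is hypothetical, and the file name in the final comment is a placeholder, not a real version:

```shell
# Hypothetical helper: lists cached package files for package $1,
# oldest version first. $2 optionally overrides the cache directory.
list_cached() {
    ls "${2:-/var/cache/pacman/pkg}" 2>/dev/null \
        | grep -E "^$1-[0-9].*\.pkg\.tar\.zst\$" \
        | sort -V
}

list_cached linux-firmware

# Install the entry just above the current version in that list:
#   sudo pacman -U /var/cache/pacman/pkg/<older linux-firmware file>
# To stop the next update from replacing it, add to /etc/pacman.conf:
#   IgnorePkg = linux-firmware
```

If the list shows only the current version, the older package is no longer cached and has to come from elsewhere.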


soupcookie · September 26, 2024, 8:05am

Interesting developments so far, so yesterday I got this

Looping back and forth between each other, I believe this is an NVIDIA driver issue, not sure if it is related in any way, here is the journalctl:

Sep 25 14:10:09 netsu-pc kernel: nouveau 0000:08:00.0: gr: GPC3/TPC0/MP trap: global 00000000 [] warp 0001 [STACK_ERROR]
Sep 25 14:10:09 netsu-pc kernel: nouveau 0000:08:00.0: gr: GPC3/TPC1/MP trap: global 00000000 [] warp 0001 [STACK_ERROR]
Sep 25 14:10:09 netsu-pc kernel: nouveau 0000:08:00.0: gr: GPC3/TPC2/MP trap: global 00000000 [] warp 0001 [STACK_ERROR]
Sep 25 14:10:09 netsu-pc kernel: nouveau 0000:08:00.0: gr: GPC3/TPC3/MP trap: global 00000000 [] warp 0001 [STACK_ERROR]
Sep 25 14:10:09 netsu-pc kernel: nouveau 0000:08:00.0: gr: GPC3/TPC4/MP trap: global 00000000 [] warp 0001 [STACK_ERROR]
Sep 25 14:11:27 netsu-pc kernel: nouveau 0000:08:00.0: gr: GPC3/TPC0/MP trap: global 00000000 [] warp 0001 [STACK_ERROR]
Sep 25 14:11:27 netsu-pc kernel: nouveau 0000:08:00.0: gr: GPC3/TPC1/MP trap: global 00000000 [] warp 0001 [STACK_ERROR]
Sep 25 14:11:27 netsu-pc kernel: nouveau 0000:08:00.0: gr: GPC3/TPC2/MP trap: global 00000000 [] warp 0001 [STACK_ERROR]
Sep 25 14:11:27 netsu-pc kernel: nouveau 0000:08:00.0: gr: GPC3/TPC3/MP trap: global 00000000 [] warp 0001 [STACK_ERROR]
Sep 25 14:11:27 netsu-pc kernel: nouveau 0000:08:00.0: gr: GPC3/TPC4/MP trap: global 00000000 [] warp 0001 [STACK_ERROR]

I have seen this in the past, I believe, but that was a pretty long time ago, so not an issue I see happening often, so possibly irrelevant.

Anyhow, after that incident yesterday, I decided out of interest to go back to XFCE. Surprisingly, I’ve been using Linux pretty much the whole time since then, and it only crashed this morning. Unlike the other crashes, my system didn’t reboot and the screen didn’t turn off; after a decent wait with nothing happening, I had to hold in the power button to get it to reboot. (I have had these types of crashes before, and going to the latest kernel usually fixes them, so I will be trying that now.) I still got the pre-hang that happens right before a crash, but this time, after manually rebooting, journalctl had no new error logs to offer. Could this crash be related to something else? Could the previous mce errors have been related to the Budgie DE somehow? Maybe a driver or library that was having trouble working?

I have not yet downgraded to a previous Linux kernel, but if the crashes continue, I will do so.

Also as a side note, probably not related to the XFCE crash, but will put here just in case, since you never know

The crash happened exactly at 59 seconds. If the crashes continue, there is a very slight possibility of a time-related issue. This was probably just a coincidence, but I will keep my eyes open just in case.

soupcookie · September 27, 2024, 7:44am

Downgraded, but unfortunately no luck. My system actually held up really well for a while after switching to XFCE, so I thought it might be good now, but the issues continued today, with the system crashing and rebooting itself.

Sep 27 09:41:43 netsu-pc kernel: mce: [Hardware Error]: Machine check events logged
Sep 27 09:41:43 netsu-pc kernel: mce: [Hardware Error]: CPU 3: Machine Check: 0 Bank 5: bea0000001000108
Sep 27 09:41:43 netsu-pc kernel: mce: [Hardware Error]: TSC 0 ADDR ffffff91fddd22 MISC d0130fff00000000 SYND 4d000000 IPID 500b000000000 
Sep 27 09:41:43 netsu-pc kernel: mce: [Hardware Error]: PROCESSOR 2:a20f12 TIME 1727422900 SOCKET 0 APIC 6 microcode a20120a

His_Turdness · September 27, 2024, 1:45pm

Have you tried CMOS reset yet?

ricklinux · September 27, 2024, 2:33pm

As @His_Turdness said. Try clearing the CMOS and reset it to defaults.

joekamprad · September 28, 2024, 9:48am

What is your GPU?
inxi -Gaz | eos-sendlog
It could also be an issue of some kind between the GPU and the CPU. Some newer Nvidia GPUs still have issues running on nouveau, and this could be triggering something that makes the CPU hit an issue (only a shot in the dark).
But you could try running on the proprietary Nvidia drivers:
nvidia-inst will install them for you and set the needed config. If you do not have our tool installed, install it first:
sudo pacman -S nvidia-inst
nvidia-inst
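
Before and after switching, it is worth confirming which kernel driver is actually bound to the card. A minimal sketch that filters lspci -k output; gpu_driver is a hypothetical helper name, and the match assumes the card is listed as a "VGA compatible controller" or "3D controller":

```shell
# Hypothetical helper: reads `lspci -k` output on stdin and prints the
# kernel driver in use for each graphics device.
gpu_driver() {
    awk '
        /VGA compatible controller|3D controller/ { in_gpu = 1; next }
        /^[0-9a-f]/                               { in_gpu = 0 }  # next device starts
        in_gpu && /Kernel driver in use:/         { print $NF }
    '
}

# Usage: lspci -k | gpu_driver
```

Going by the nouveau messages in the journal, this should currently print nouveau; after installing the proprietary driver and rebooting, it should print nvidia instead.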