Weird random system freeze

absuser · March 4, 2024, 11:05am

Hello all,
I’ve been experiencing some crashes that I can’t make heads or tails of. When they happen, the system freezes, sometimes allowing the mouse to move for a while (but nothing else is responsive; I can’t switch TTY, restart Xorg, or anything). They seem to occur only with certain kernel builds: for example, when I first installed EndeavourOS, my system was free of them for many days, and after an upgrade they reappeared. I thought it was a kernel problem or something I was installing, but I couldn’t find anything in common before they occurred in any journalctl -x entry. But then I noticed that some entries had Hardware Errors in them, about 7 days apart:

[Hardware Error]: System Fatal error.
Mar 04 06:46:44 user-desktop1 kernel: fbcon: Taking over console
Mar 04 06:46:44 user-desktop1 kernel: [Hardware Error]: CPU:8 (17:1:1) MC3_STATUS[Over|UE|MiscV|-|PCC|TCC|SyndV|-|-|-]: 0xfaa0000000070118
Mar 04 06:46:44 user-desktop1 kernel: [Hardware Error]: IPID: 0x000300b000000000, Syndrome: 0x000000004d000001
Mar 04 06:46:44 user-desktop1 kernel: [Hardware Error]: Decode Unit Ext. Error Code: 7, Patch RAM sequencer parity error.
Feb 28 08:38:45 user-desktop1 kernel: [Hardware Error]: System Fatal error.
Feb 28 08:38:45 user-desktop1 kernel: fbcon: Taking over console
Feb 28 08:38:45 user-desktop1 kernel: [Hardware Error]: CPU:0 (17:1:1) MC5_STATUS[-|UE|MiscV|AddrV|PCC|TCC|SyndV|-|-|-]: 0xbea0000000000108
Feb 28 08:38:45 user-desktop1 kernel: [Hardware Error]: Error Addr: 0x0001ffff8350eb06
Feb 28 08:38:45 user-desktop1 kernel: [Hardware Error]: IPID: 0x000500b000000000, Syndrome: 0x000000004d000000
Feb 28 08:38:45 user-desktop1 kernel: [Hardware Error]: Execution Unit Ext. Error Code: 0, Watchdog Timeout error.
Feb 28 08:38:45 user-desktop1 kernel: [Hardware Error]: cache level: RESV, tx: GEN, mem-tx: GEN

So I have a two-fold question:

1. How should I go about identifying the cause of the crashes? Their frequency varies: sometimes it is once or twice a day (the system is on 24 hours a day), sometimes once a week. In journalctl -x I have many KDE-related errors about window placement, but nothing that appears regularly before crashes and nothing severe enough to warrant a crash.
2. What should I do about the hardware errors? How could I identify their cause?

Thank you very much in advance. Any partial or complete answer will be greatly appreciated. Also, tell me what else I need to post (in terms of logs).
You can see my system details here: https://0x0.st/H7AT.txt (uploaded using the EOS upload tool).

thefrog · March 4, 2024, 11:25am

I’m not an expert on memory, but this looks like it could be a memory issue.

rinwolf1312 · March 4, 2024, 5:03pm

Welcome to the forum!

Are they happening during increased activity, or has it occurred at idle?

Is that 16GB memory between two modules? Ruling out issues with the memory could be done by testing both individually.

memtest86 is a good diagnostic tool, if a bit slow.

absuser · March 4, 2024, 10:51pm

Hello, thanks for the prompt replies!
Memtest didn’t find anything wrong with the memory.
They are not happening during increased activity, no. Mostly while using LibreOffice, I think, writing text.

fred666 · March 4, 2024, 11:01pm

A CPU error could indicate a CPU problem, but I would first swap the RAM with known-good modules.
Sometimes reseating the modules (or cleaning the contacts with a piece of printer paper) helps.

Update the BIOS if a newer version is available (but NOT before you are certain that the RAM and CPU are not the problem).

drunkenvicar · March 5, 2024, 2:18am

it’s hardware, dying hdd or ssd likely.
you won’t know til you get there.
don’t run S.M.A.R.T. tests, in my experience they are useless. they always say everything is ‘near death’ or ‘old’ and I stopped believing them. I bought a new rig and all hardware components were close to death or dying according to smart…

…how do you know you have a drive problem? when this stuff increases in frequency to like 3 times a week… keep your antennae up. back your stuff up. you will know when to buy another hdd/ssd.

the memory slots are as likely to be at fault as the sticks themselves. both are valid suspects, too. you can eliminate those.
/dos pesos

absuser · March 5, 2024, 5:00am

I’ll get to it ASAP. My CPU is old, so it could be reaching the end of its life. Is there any tool you’d suggest for running some tests on it?

absuser · March 5, 2024, 5:03am

I’ve had a lot of drive problems over time. This doesn’t look like one, but ofc I can’t be certain. My EFI partition is on a fairly old drive (currently 44537 hours of power-on time), but it has never shown any signs of failure.

drunkenvicar · March 5, 2024, 5:13am

If you weren’t getting the weird system freezes and hangs, then what were the signs you were getting? (re: drive failure).
We all see it different.
I’ve also had software-related freezes/hangs: usually some indexer choking the holy sh*t out of system resources (I’m looking at you Baloo/Dolphin) or a browser that starts to spawn processes until out of control.
which is some kind of memory fault.

yeah software still likely.
I have no idea what will test your hardware outside of smart which I don’t trust.

the crappy part of t-shooting software freezes is it freezes before you can htop and see the abomination causing it.

after a freeze and reboot, journalctl | tail is supposed to paint the whole picture, but there are a crapload of post-freeze/reboot journalctl commands that supposedly tell the whole story. translating that output is way above my mental grade…

dimski · March 5, 2024, 6:08am

You have a first-gen Ryzen CPU, the same generation as my laptop, which had a similar problem. I fixed mine by adding idle=nomwait to my kernel parameters. If the problem persists, you can also add processor.max_cstate=1. Read here for more info.
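
After editing the boot loader entry and rebooting, it is worth confirming that the parameters actually reached the running kernel. Below is a minimal Python sketch of such a check. The helper names (`parse_cmdline`, `missing_params`) and the `WANTED` table are made up for illustration, and the parsing is deliberately naive: it splits on whitespace and does not handle quoted values.

```python
# Sketch: check a kernel command line for the two parameters suggested above.
# The function names and the WANTED table are illustrative, not a standard tool.

WANTED = {"idle": "nomwait", "processor.max_cstate": "1"}


def parse_cmdline(cmdline: str) -> dict:
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        # Only the first '=' separates name and value (e.g. root=UUID=abc).
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def missing_params(cmdline: str, wanted: dict) -> list:
    """Return the wanted 'name=value' pairs that are absent or set differently."""
    params = parse_cmdline(cmdline)
    return [f"{k}={v}" for k, v in wanted.items() if params.get(k) != v]
```

On a running Linux system the command line can be read from /proc/cmdline, so `missing_params(open("/proc/cmdline").read(), WANTED)` should return an empty list once both parameters are in effect.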

absuser · March 5, 2024, 8:18am

My most recent drive failure started with data corruption in folders in my home dir. The one before started with a failure to boot; I managed to boot after a couple of tries, but eventually the drive just became non-bootable. Both were SSD-related. But you are right, it could be HDD/SSD-related; it’s just that I’ve changed many drives in the last few months, and these freezes persisted. The only time they seemed to stop (i.e. the system went freeze-less for a week) was when I freshly installed EOS (I think it was with the 6.7.5 kernel, but I’m not sure).

I’ve been extensively going through journalctl for the seconds, even moments, before and after crashes. After the last three, there was a common feature: the hardware errors I posted. I got another similar one yesterday. I installed the LTS kernel to see if it will make a difference, and I’m adding the nomwait parameter as dimski suggested. Thank you all for your input; I’ll let you know if the problem persists or if I discover something new.
<3
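
Going through the journal by hand gets tedious, so here is a rough Python sketch of how the hardware-error lines could be pulled out and grouped by timestamp. It assumes the default short journalctl output format seen in the excerpts in this thread (`Mon DD HH:MM:SS host kernel: ...`); the function name `group_hardware_errors` is made up for illustration.

```python
import re

# Matches kernel lines such as:
#   Mar 04 06:46:44 host kernel: [Hardware Error]: System Fatal error.
#   Mar 06 18:23:26 host kernel: mce: [Hardware Error]: Machine check events logged
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3} \d{2} \d{2}:\d{2}:\d{2}) \S+ kernel: "
    r"(?:mce: )?\[Hardware Error\]: (?P<msg>.*)$"
)


def group_hardware_errors(lines):
    """Return {timestamp: [messages]} for every [Hardware Error] line, in log order."""
    events = {}
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            events.setdefault(match["stamp"], []).append(match["msg"])
    return events
```

The input could be the saved output of `journalctl -k -b -1 --no-pager` (kernel messages from the previous boot), read line by line. Each key in the result is then one error event, which makes it easier to compare events across crashes.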

fred666 · March 5, 2024, 10:50am

mprime is a stress-test tool. If your hardware survives this test for at least 10 minutes, the hardware is most likely good.

absuser · March 6, 2024, 4:59pm

Since my last post the machine has been on, with no crashes or freezes. It seems stable.
I also ran mprime for about an hour and got one error that didn’t lead to a crash or freeze:

Mar 06 18:23:26 user-desktop1 kernel: mce: [Hardware Error]: Machine check events logged
Mar 06 18:23:26 user-desktop1 kernel: [Hardware Error]: Corrected error, no action required.
Mar 06 18:23:26 user-desktop1 kernel: [Hardware Error]: CPU:5 (17:1:1) MC0_STATUS[-|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0x9c204000000c0135
Mar 06 18:23:26 user-desktop1 kernel: [Hardware Error]: Error Addr: 0x00000003a3a1c940
Mar 06 18:23:26 user-desktop1 kernel: mce: [Hardware Error]: Machine check events logged
Mar 06 18:23:26 user-desktop1 kernel: [Hardware Error]: Corrected error, no action required.
Mar 06 18:23:26 user-desktop1 kernel: [Hardware Error]: CPU:5 (17:1:1) MC0_STATUS[-|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0x9c204000000c0135
Mar 06 18:23:26 user-desktop1 kernel: [Hardware Error]: Error Addr: 0x00000003a3a1c940
Mar 06 18:23:26 user-desktop1 kernel: [Hardware Error]: IPID: 0x000000b000000000, Syndrome: 0x000000371a1b2502
Mar 06 18:23:26 user-desktop1 kernel: [Hardware Error]: Load Store Unit Ext. Error Code: 12, DC Data error type 1 and poison consumption.
Mar 06 18:23:26 user-desktop1 kernel: [Hardware Error]: cache level: L1, tx: DATA, mem-tx: DRD
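
For anyone comparing this with the earlier errors: the bracketed flags come straight from the bits of the MCx_STATUS value, so they can be decoded by hand. Below is a small Python sketch. The bit positions are an assumption inferred from the flag order printed in these logs, and they are consistent with all three status values in this thread; check the AMD processor documentation before relying on them. The function name `decode_mca_status` is made up for illustration.

```python
# Sketch: decode the flag bits of an AMD MCx_STATUS value.
# Bit positions are assumed from the kernel's printed flag order, not from a spec.
FLAGS = [
    (62, "Over"),   # an earlier error was overwritten
    (61, "UC"),     # uncorrected; the kernel prints UE when this bit is set
    (59, "MiscV"),  # MISC register holds valid data
    (58, "AddrV"),  # error address is valid
    (57, "PCC"),    # processor context corrupt
    (55, "TCC"),    # task context corrupt
    (53, "SyndV"),  # syndrome is valid
    (46, "CECC"),   # corrected by ECC
    (45, "UECC"),   # uncorrected ECC error
]


def decode_mca_status(status: int) -> dict:
    """Return the valid bit, named flags and extended error code of a status word."""
    return {
        "valid": bool((status >> 63) & 1),
        "flags": {name: bool((status >> bit) & 1) for bit, name in FLAGS},
        # The "Ext. Error Code" in the log lines sits in bits 21:16.
        "ext_error_code": (status >> 16) & 0x3F,
    }
```

Decoding 0x9c204000000c0135 from the mprime run gives UC and PCC clear, CECC set and extended code 12, i.e. a corrected error. The two earlier values (0xfaa0000000070118 and 0xbea0000000000108) both have UC and PCC set, which matches the "System Fatal error" lines.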
