Sporadic Hard Freezes Requiring Full Power Cycle

Long time no post! Which is a good thing, Endeavour has been wonderful!

However an issue has been popping up for me with increasing frequency and I think that it’s past time to actually invest some time in hunting down the root cause.

The symptoms are very cut and dry. At a random point, typically when just sitting on the desktop idle/reading/watching content, the entire system will freeze. And I mean FREEZE. Sound (if playing at the time) goes into about a 1-2 second loop and there is no response from any input.
I have finally dug into the journal, and it looks like in the seconds before a freeze the qt.qpa.xcb module throws a fit about something… I have linked the full output prior to the hard reset in this GitHub gist, but the error line is:
Jul 22 23:16:50 VoyagerMango kwin_x11[2989]: qt.qpa.xcb: QXcbConnection: XCB error: 9 (BadDrawable), sequence: 25019, resource id: 23069412, major code: 14 (GetGeometry), minor code: 0

I first attempted to look up this module via a quick Google search, but most related posts are about specific errors with other applications, not system-wide issues. And ‘qt’ itself doesn’t look to have an official GitHub? So do any of you lovely people have any advice or experience here? Or even just a link to where I should properly report this would be great, as I don’t think it’s something Endeavour related… Thanks!

Oh and interestingly (I’m not sure if there is any connection here) this same sort of freezing can happen when connecting my PS4 controller via bluetooth after it had already been connected and then disconnected previously during that boot. I haven’t yet had this freeze occur since I started monitoring prior boot journals so I can’t say for sure if it is presenting the same messages or is just coincidentally giving the same symptom.

Edit: inxi -Faz

Can you send the output of inxi -Faz?
Please configure REISUB and reboot normally. If another freeze happens, try rebooting with SysRq+Alt+R E I S U B and then send the journal of the previous boot with

sudo journalctl -b -1

kwin_x11 shouldn’t hard freeze your system. This line looks more interesting:

kernel: WARNING: CPU: 10 PID: 2265 at kernel/sched/fair.c:7118 set_next_buddy+0x5e/0x70
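To see whether other kernel warnings in a saved journal name the same CPU, something like the following may help. It is only a sketch: the function name warning_cpus and the file name prev-boot.log are made up for illustration, and it assumes the log contains lines in the same "WARNING: CPU: N" form as the one above.

```shell
#!/usr/bin/env bash
# Prints the distinct CPU numbers named in kernel "WARNING: CPU: N" lines of a saved log.
# Usage: warning_cpus <logfile>
warning_cpus() {
    # grep -o keeps only the matched part, awk takes the number, sort -nu de-duplicates.
    grep -o 'WARNING: CPU: [0-9]*' "$1" | awk '{print $3}' | sort -nu
}

# Hypothetical workflow (the file name is an assumption):
#   journalctl -b -1 -k --no-pager > prev-boot.log
#   warning_cpus prev-boot.log
```

If every warning names the same CPU number, that is a hint worth following up, though it does not prove a hardware fault on its own.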

I have edited the op to include inxi, but here is the direct link as a reply: https://gist.github.com/RPINerd/0837932c7bfc874ccdf877f127513af1

It just happened twice in a row, here is the gist from the last crash:

Edit: Also, REISUB had no effect on the system; I was forced to power cycle it.
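When REISUB does nothing, one thing worth ruling out is that the kernel.sysrq setting only permits some of the keys. The sketch below checks a kernel.sysrq value against the bits the kernel's SysRq documentation lists for each key in the sequence. The function name reisub_allowed is made up for this example, and a freeze that also locks up keyboard handling would defeat SysRq regardless of this setting.

```shell
#!/usr/bin/env bash
# Bits of kernel.sysrq needed for the REISUB keys (per the kernel's sysrq documentation):
#   4 = keyboard control (R), 64 = process signalling (E, I),
#   16 = sync (S), 32 = remount read-only (U), 128 = reboot (B)
REISUB_MASK=$((4 + 64 + 16 + 32 + 128))

# Prints "yes" if the given kernel.sysrq value permits the whole REISUB sequence.
# A value of 1 means "all functions enabled"; larger values are a bitmask.
reisub_allowed() {
    local value=$1
    if [ "$value" -eq 1 ] || [ $(( value & REISUB_MASK )) -eq "$REISUB_MASK" ]; then
        echo yes
    else
        echo no
    fi
}

# Report on the running system when the setting is readable.
if [ -r /proc/sys/kernel/sysrq ]; then
    current=$(cat /proc/sys/kernel/sysrq)
    echo "kernel.sysrq=$current, full REISUB allowed: $(reisub_allowed "$current")"
fi
```

If this reports "no", setting kernel.sysrq = 1 in a file under /etc/sysctl.d/ or booting with sysrq_always_enabled=1 should enable all the keys.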

Did you add sysrq_always_enabled=1 to your kernel parameters? Does REISUB work if the system is still working normally? Does the caps lock of your keyboard blink on a freeze?
Also, the log files you sent contain only the last few lines. You should send them in full. Compress the log file if necessary.
The only relevant thing I see in the log files is that both errors happen on CPU 10. Try this kernel parameter:

isolcpus=6

@RPINerd
There are 7 newer UEFI BIOS updates for your motherboard. I would suggest updating to the latest version, 3.70, dated 2022/3/2.

It also looks like wireguard is showing errors and corruption, which could be a reason for the crash too. :thinking:

I did not try REISUB on a functioning system, but it looks like my rig has sensed that I’m trying to fix it, because it is now crashing regularly within just a few minutes of logging in and opening the terminal…
I will attempt to add the kernel parameter; I started with just the sysctl conf file line.
As for isolcpus, shouldn’t that be 10, not 6? I only googled the parameter so I don’t know exactly what it’s for, but if we’re going to cut out a core from interacting with the scheduler, shouldn’t it be the one throwing errors?

Edit: Oh, and no caps lock blink. The keyboard LEDs remain lit but are entirely unresponsive (i.e. if I push num or scroll lock after a freeze, the LEDs do not come on).

Wow I don’t feel like it has been that long since I last messed in the bios! I’m hesitant to do this though as it’s beginning to feel like a possible hardware failure and I don’t want to brick the mobo by having a freeze during that process…

If nothing else, it’s nice to see my board still getting regular updates! My previous platform was a Sandy Bridge 2700K, so I had not seen a BIOS update in many years haha

This is a Ryzen platform and the updates are very important as they have new AGESA which controls most of the power requirements. If you are doing a UEFI Bios update please read the instructions over here.

https://www.asrock.com/support/BIOSIG.asp?cat=BIOS10

New chipsets on motherboards will usually get many updates for a few years before they subside, especially if it is a popular platform. I have had 13 updates on my X570 board so far and the same on my Intel Z370 which is getting dated now.

The latest UEFI BIOS update includes the patch for the fTPM issue that causes stuttering and freezing.

Edit: It’s not just a Windows problem. It does happen on Linux.

The number “10” is the thread, not core. Your CPU has 8 cores and 16 threads, 2 threads per core. If CPUs start counting from 0, (this part may be wrong) 10th CPU is on 6th core. (if count starts from 1 we would disable core 5)
isolcpus needs the core argument.
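Rather than working the thread-to-core mapping out by hand, it can be read from sysfs on the affected machine. The sketch below assumes the usual Linux topology files under /sys/devices/system/cpu; the function names and the optional second argument (an alternate sysfs root) are made up for this example. Note that the kernel documentation describes the isolcpus argument as a list of CPU numbers, so it is worth checking which numbering applies before picking a value.

```shell
#!/usr/bin/env bash
# Prints the physical core id that a logical CPU number belongs to.
# The optional second argument is an alternate sysfs root (hypothetical, for testing).
cpu_core_id() {
    local cpu=$1 root=${2:-/sys/devices/system/cpu}
    cat "$root/cpu$cpu/topology/core_id"
}

# Prints the logical CPUs that share a physical core with the given one.
cpu_siblings() {
    local cpu=$1 root=${2:-/sys/devices/system/cpu}
    cat "$root/cpu$cpu/topology/thread_siblings_list"
}

# Report on CPU 10 (the one named in the kernel warning) if it exists on this machine.
if [ -r /sys/devices/system/cpu/cpu10/topology/core_id ]; then
    echo "CPU 10 is on core $(cpu_core_id 10), sharing it with CPUs: $(cpu_siblings 10)"
fi
```

The sibling list shows which logical CPUs would need to be isolated together to take a whole physical core out of the scheduler.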

Ah, duh! Thanks for clarifying with such good detail :smile:

This can also indicate something like bad RAM.

The first thing I would do is update the UEFI BIOS, because it’s way out of date for a Ryzen CPU. It has the patch for the fTPM stutter and freezing issue. I would also test the RAM, as @jonathon has pointed out.
isolcpus is deprecated and not used much anymore, as the same isolation is now handled by cpusets and cgroups.

I have -knock on wood- successfully updated the bios to the latest 3.70, what’s the current go-to for ram testing? I think last time I had to do ram diagnosing I was using a recovery boot tool burned on a CD ROM haha

I have always used memtest86+. You can install it.

@RPINerd
You get that memory tested yet?

32GB took its sweet time for a full 4-pass test lol, but no errors! The system seems stable so far. I would be thrilled if my issue was as simple as a BIOS update…
I was worried that one of my cores was dying on me and I’d need to start thinking about CPU replacement. I still have PTSD from my 980Ti dying right at the start of the pandemic and trying to get a 6000 series at launch haha