Continuous Crashing/Rebooting

Hi All,

I am seeking help because my system keeps rebooting at random. This happens when I game, but also when I am checking journal logs in a terminal or using Firefox.

From my understanding the error is related to the following:

mar 25 15:13:53 Rocinante kernel: [drm:amdgpu_job_timedout [amdgpu]] ERROR ring gfx_0.0.0 timeout, signaled seq=301443, emitted seq=301445
mar 25 15:13:53 Rocinante kernel: [drm:amdgpu_job_timedout [amdgpu]] ERROR Process information: process artofrally.x64 pid 4312 thread artofrally:cs0 pid 4314
mar 25 15:13:53 Rocinante kernel: amdgpu 0000:2f:00.0: amdgpu: GPU reset begin!
mar 25 15:13:56 Rocinante com.valvesoftware.Steam.desktop[3102]: Warning: The game hasn’t rendered a frame from us in over 10 seconds
mar 25 15:13:58 Rocinante kernel: amdgpu 0000:2f:00.0: amdgpu: SMU: I’m not done with your previous command: SMN_C2PMSG_66:0x00000029 SMN_C2PMSG_82:0x00000000
mar 25 15:13:58 Rocinante kernel: amdgpu 0000:2f:00.0: amdgpu: Failed to disable gfxoff!

From what I found in my research, I need to add the following

amdgpu.ppfeaturemask=0xfffd3fff

to the kernel parameters to fix the issue. I am not sure how to do that, which is why I am asking for advice/help.
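
For reference, a quick way to see what that mask changes is to list which bits it clears compared with an all-ones value. The sketch below does only that arithmetic; the helper name `cleared_bits` is made up for illustration. Which power feature each bit controls should be checked against the amdgpu source for your kernel version. Bit 15 is commonly described as the GFXOFF bit, which would fit the "Failed to disable gfxoff!" message above, but treat that as an assumption.

```python
def cleared_bits(mask: int, width: int = 32) -> list[int]:
    """Return the positions of the bits that are 0 in a width-bit mask."""
    return [bit for bit in range(width) if not (mask >> bit) & 1]


mask = 0xFFFD3FFF  # value quoted in the post above
bits = cleared_bits(mask)
print(bits)                         # [14, 15, 17]
print([hex(1 << b) for b in bits])  # ['0x4000', '0x8000', '0x20000']
```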

Please find my logs:

https://0x0.st/HoSJ.txt
https://0x0.st/HoSy.txt

Please let me know if you need any other information.

Thank you.

Check this article about systemd-boot, which seems to be your system's bootloader. It has instructions on how to modify kernel options.

@petsam thank you for the pointer and advice. I will play around with this and see if it fixes my issue. Thank you.

The easiest way is to add the kernel parameter to /etc/kernel/cmdline. You will still be able to edit or remove it on the fly when booting, using the “e” edit option in the systemd-boot menu.
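
A minimal sketch of that edit, for anyone who prefers to script it. It only works on the command line as a string and leaves reading and writing /etc/kernel/cmdline (as root) to you; `add_kernel_param` is a hypothetical helper, not part of any tool, and it assumes no option has a quoted value containing spaces. After changing the file, the boot entries usually have to be regenerated with whatever your distribution provides for that, and after a reboot /proc/cmdline shows what the kernel actually received.

```python
def add_kernel_param(cmdline: str, param: str) -> str:
    """Append param to a kernel command line, dropping any existing
    setting of the same key so the option appears only once."""
    key = param.split("=", 1)[0]
    kept = [p for p in cmdline.split() if p.split("=", 1)[0] != key]
    return " ".join(kept + [param])


# Example with a made-up existing command line.
current = "root=UUID=xxxx rw quiet"
print(add_kernel_param(current, "amdgpu.ppfeaturemask=0xfffd3fff"))
```

Replacing an existing entry instead of appending blindly keeps the file tidy if you later try a different mask value.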

Cheers for the advice. I did as advised, but my issue persists. I need to dig deeper to understand the issue. Thanks for the help.

If you want any advice, post technical system info and journal logs from when the issue happened.

I will do so when I get back. I ran a CPU + memory test for an hour and found no issues there. I suspect I have a faulty GPU, but I will post the journal logs as advised. Thank you for your help.

As requested here is a snippet of the log before my system rebooted:

mar 26 10:25:38 Rocinante gnome-shell[1348]: Attempting to thaw the notification queue for object ClutterText[0x5614339d5040]; Property notification does not work during instance finalization.

mar 26 10:25:38 Rocinante gnome-shell[1348]: Attempting to freeze the notification queue for object ClutterText[0x561436cf9c00]; Property notification does not work during instance finalization.

mar 26 10:25:38 Rocinante gnome-shell[1348]: Attempting to thaw the notification queue for object ClutterText[0x561436cf9c00]; Property notification does not work during instance finalization.

mar 26 10:25:38 Rocinante gnome-shell[1348]: Attempting to freeze the notification queue for object ClutterText[0x56143433c540]; Property notification does not work during instance finalization.

mar 26 10:25:38 Rocinante gnome-shell[1348]: Attempting to thaw the notification queue for object ClutterText[0x56143433c540]; Property notification does not work during instance finalization.

-- Boot d71fba72a0d246e4890ee806f94eb436 --

From what I’ve read, this is a bug that has been fixed in GNOME 43.4, but it doesn’t explain why the system rebooted.

The only other error I saw was the following MCE error:

mar 26 10:26:07 Rocinante kernel: mce: [Hardware Error]: CPU 3: Machine Check: 0 Bank 5: bea0000001000108

mar 26 10:26:07 Rocinante kernel: fbcon: Taking over console

mar 26 10:26:07 Rocinante kernel: mce: [Hardware Error]: TSC 0 ADDR 7fdd1968af7d MISC d012000100000000 SYND 4d000000 IPID 500b000000000

mar 26 10:26:07 Rocinante kernel: mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1679819166 SOCKET 0 APIC 6 microcode a201025

I’ve done my best to google these, but I lack the knowledge to fully understand the output. Again, I suspect the GPU is having issues.
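
For anyone else trying to read these lines: the long hex number after "Bank 5:" is that bank's status register, and its top bits are flags defined architecturally for x86 machine checks. A rough decoder for just those flags, as a sketch (the function name is made up, and the lower, vendor-specific bits are deliberately left uninterpreted):

```python
# Architectural flag bits in an x86 MCi_STATUS register (bits 63..57).
STATUS_FLAGS = {
    63: "VAL (register contains a valid error)",
    62: "OVER (error overflow)",
    61: "UC (uncorrected error)",
    60: "EN (error reporting enabled)",
    59: "MISCV (MISC register valid)",
    58: "ADDRV (ADDR register valid)",
    57: "PCC (processor context corrupt)",
}


def decode_mce_status(status: int) -> list[str]:
    """List the architectural flags that are set in a status value."""
    return [name for bit, name in STATUS_FLAGS.items() if (status >> bit) & 1]


status = 0xBEA0000001000108  # from the "Bank 5" line above
for flag in decode_mce_status(status):
    print(flag)
print(hex(status & 0xFFFF))  # low 16 bits: the MCA error code field
```

For the value in the log this reports every flag except OVER, so the CPU logged an uncorrected error with corrupt processor context. That would be consistent with a hard reset rather than a clean shutdown, though on its own it does not say which component is at fault.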

Please let me know if you need me to run any other commands for information. Thank you again for your help.

Did you have this in the past? Check

journalctl | grep "Hardware Error"

Please find the information requested below:

mar 24 13:31:12 Rocinante kernel: mce: [Hardware Error]: Machine check events logged

mar 24 13:31:12 Rocinante kernel: mce: [Hardware Error]: CPU 7: Machine Check: 0 Bank 5: bea0000000000108

mar 24 13:31:12 Rocinante kernel: mce: [Hardware Error]: TSC 0 ADDR ffffff93f8758c MISC d012000100000000 SYND 4d000000 IPID 500b000000000

mar 24 13:31:12 Rocinante kernel: mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1679661071 SOCKET 0 APIC e microcode a201016

mar 24 13:33:39 Rocinante kernel: mce: [Hardware Error]: Machine check events logged

mar 24 13:33:39 Rocinante kernel: mce: [Hardware Error]: CPU 15: Machine Check: 0 Bank 5: bea0000000000108

mar 24 13:33:39 Rocinante kernel: mce: [Hardware Error]: TSC 0 ADDR ffffffc1405182 MISC d012000100000000 SYND 4d000000 IPID 500b000000000

mar 24 13:33:39 Rocinante kernel: mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1679661218 SOCKET 0 APIC f microcode a201016

mar 25 09:56:32 Rocinante kernel: mce: [Hardware Error]: Machine check events logged

mar 25 09:56:32 Rocinante kernel: mce: [Hardware Error]: CPU 8: Machine Check: 0 Bank 5: bea0000001000108

mar 25 09:56:32 Rocinante kernel: mce: [Hardware Error]: TSC 0 ADDR ffffffc13430f8 MISC d012000100000000 SYND 4d000000 IPID 500b000000000

mar 25 09:56:32 Rocinante kernel: mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1679734591 SOCKET 0 APIC 1 microcode a201016

mar 25 10:00:28 Rocinante kernel: mce: [Hardware Error]: Machine check events logged

mar 25 10:00:28 Rocinante kernel: mce: [Hardware Error]: CPU 3: Machine Check: 0 Bank 5: bea0000000000108

mar 25 10:00:28 Rocinante kernel: mce: [Hardware Error]: TSC 0 ADDR 1414160b8 MISC d012000100000000 SYND 4d000000 IPID 500b000000000

mar 25 10:00:28 Rocinante kernel: mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1679734826 SOCKET 0 APIC 6 microcode a201016

mar 25 10:12:46 Rocinante kernel: mce: [Hardware Error]: Machine check events logged

mar 25 10:12:46 Rocinante kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 5: bea0000000000108

mar 25 10:12:46 Rocinante kernel: mce: [Hardware Error]: TSC 0 ADDR 3591ac26e MISC d012000100000000 SYND 4d000000 IPID 500b000000000

mar 25 10:12:46 Rocinante kernel: mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1679735565 SOCKET 0 APIC 0 microcode a201016

mar 25 15:17:31 Rocinante kernel: mce: [Hardware Error]: Machine check events logged

mar 25 15:17:31 Rocinante kernel: mce: [Hardware Error]: CPU 9: Machine Check: 0 Bank 5: bea0000001000108

mar 25 15:17:31 Rocinante kernel: mce: [Hardware Error]: TSC 0 ADDR ffffffc0d1111a

mar 25 15:17:31 Rocinante kernel: mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1679753850 SOCKET 0 APIC 3 microcode a201016

mar 25 15:55:59 Rocinante kernel: mce: [Hardware Error]: Machine check events logged

mar 25 15:55:59 Rocinante kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 5: bea0000000000108

mar 25 15:55:59 Rocinante kernel: mce: [Hardware Error]: TSC 0 ADDR 7fc05969e3a6 MISC d012000100000000 SYND 4d000000 IPID 500b000000000

mar 25 15:55:59 Rocinante kernel: mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1679756157 SOCKET 0 APIC 0 microcode a201025

mar 25 16:00:16 Rocinante kernel: mce: [Hardware Error]: Machine check events logged

mar 25 16:00:16 Rocinante kernel: mce: [Hardware Error]: CPU 7: Machine Check: 0 Bank 5: bea0000000000108

mar 25 16:00:16 Rocinante kernel: mce: [Hardware Error]: TSC 0 ADDR 7f78ba1200ee MISC d012000100000000 SYND 4d000000 IPID 500b000000000

mar 25 16:00:16 Rocinante kernel: mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1679756414 SOCKET 0 APIC e microcode a201025

mar 25 22:33:03 Rocinante kernel: mce: [Hardware Error]: Machine check events logged

mar 25 22:33:03 Rocinante kernel: mce: [Hardware Error]: CPU 5: Machine Check: 0 Bank 5: bea0000001000108

mar 25 22:33:03 Rocinante kernel: mce: [Hardware Error]: TSC 0 ADDR ffffffb6f50da2 MISC d012000100000000 SYND 4d000000 IPID 500b000000000

mar 25 22:33:03 Rocinante kernel: mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1679779981 SOCKET 0 APIC a microcode a201025

mar 25 23:04:28 Rocinante kernel: mce: [Hardware Error]: Machine check events logged

mar 25 23:04:28 Rocinante kernel: mce: [Hardware Error]: CPU 13: Machine Check: 0 Bank 5: bea0000001000108

mar 25 23:04:28 Rocinante kernel: mce: [Hardware Error]: TSC 0 ADDR ffffffc1af811a MISC d012000100000000 SYND 4d000000 IPID 500b000000000

mar 25 23:04:28 Rocinante kernel: mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1679781866 SOCKET 0 APIC b microcode a201025

mar 26 10:26:07 Rocinante kernel: mce: [Hardware Error]: Machine check events logged

mar 26 10:26:07 Rocinante kernel: mce: [Hardware Error]: CPU 3: Machine Check: 0 Bank 5: bea0000001000108

mar 26 10:26:07 Rocinante kernel: mce: [Hardware Error]: TSC 0 ADDR 7fdd1968af7d MISC d012000100000000 SYND 4d000000 IPID 500b000000000

mar 26 10:26:07 Rocinante kernel: mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1679819166 SOCKET 0 APIC 6 microcode a201025
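
With this many entries it helps to tally them. Below is a rough sketch that parses the "Machine Check" lines and counts events per CPU and per bank/status pair. The regular expression is written against the exact format shown above, `summarize_mce` is just an illustrative name, and the sample includes only three of the lines.

```python
import re
from collections import Counter

# Matches e.g. "CPU 3: Machine Check: 0 Bank 5: bea0000001000108"
MCE_RE = re.compile(r"CPU (\d+): Machine Check: \d+ Bank (\d+): ([0-9a-f]+)")


def summarize_mce(lines):
    """Count machine check events per CPU and per (bank, status) pair."""
    per_cpu, per_bank = Counter(), Counter()
    for line in lines:
        m = MCE_RE.search(line)
        if m:
            cpu, bank, status = int(m.group(1)), int(m.group(2)), m.group(3)
            per_cpu[cpu] += 1
            per_bank[(bank, status)] += 1
    return per_cpu, per_bank


sample = [
    "mar 24 13:31:12 Rocinante kernel: mce: [Hardware Error]: CPU 7: Machine Check: 0 Bank 5: bea0000000000108",
    "mar 25 09:56:32 Rocinante kernel: mce: [Hardware Error]: CPU 8: Machine Check: 0 Bank 5: bea0000001000108",
    "mar 26 10:26:07 Rocinante kernel: mce: [Hardware Error]: CPU 3: Machine Check: 0 Bank 5: bea0000001000108",
]
per_cpu, per_bank = summarize_mce(sample)
print(dict(per_cpu))
print(dict(per_bank))
```

In practice the lines could come from the `journalctl | grep "Hardware Error"` output piped in on standard input. In the full listing above every event is in Bank 5, while the CPU number changes from one event to the next.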

I’ve switched to the LTS kernel and will try gaming later on to see if that has an impact.

It’s the GPU. I tried a stress test and it crashed immediately. I’ve requested an RMA and will wait to see how the process goes. Thank you @petsam and @joekamprad for the advice and help. It is appreciated.

https://wiki.archlinux.org/title/Machine-check_exception
