Very frequent full system freezes with Gnome

Hi,
I have very frequent (1-3 times a day) full system lockups (i.e. no input is registered any more, though sound mostly continues). I don’t know how to diagnose this.

These were generated after a reboot right after a freeze:

Installed Apps: http://ix.io/4FhV
dmesg: http://ix.io/4FhW
gnome extensions: http://ix.io/4FfO
journalctl -b -1 -e: http://ix.io/4FhR
inxi -Fxz: http://ix.io/4FhS
free: http://ix.io/4FhT
swap: http://ix.io/4FfM
vm.swappiness = 60

I have a suspicion that Steam might be involved, otherwise I can’t see a pattern (yet).
Anything else that would be helpful?

What I can see is:

Sep 03 18:01:19 finn-endeavor gameoverlayui[17921]: g_object_unref: assertion ‘G_IS_OBJECT (object)’ failed
Sep 03 18:01:20 finn-endeavor pkexec[19037]: pam_unix(polkit-1:session): session opened for user root(uid=0) by finn(uid=1000)
Sep 03 18:01:20 finn-endeavor gamemoded[17294]: ERROR: Could not inspect tasks for client [17305]! Skipping ioprio optimisation.
Sep 03 18:01:29 finn-endeavor steam.desktop[2865]: reaping pid: 17921 – gameoverlayui
Sep 03 18:02:16 finn-endeavor /usr/lib/gdm-x-session[1807]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x0090, 0x00053bb0, 0x00053c68)
Sep 03 18:02:23 finn-endeavor /usr/lib/gdm-x-session[1807]: (WW) NVIDIA(0): WAIT (1-S, 17, 0x0090, 0x00053bb0, 0x00053c68)
Sep 03 18:02:25 finn-endeavor chromium.desktop[16140]: [16140:1:0903/180225.568483:ERROR:command_buffer_proxy_impl.cc(320)] GPU state invalid after WaitForGetOffsetInRange.
Sep 03 18:02:25 finn-endeavor chromium.desktop[3207]: [3207:3207:0903/180225.573528:ERROR:gpu_process_host.cc(956)] GPU process exited unexpectedly: exit_code=512
Sep 03 18:02:26 finn-endeavor /usr/lib/gdm-x-session[1807]: (WW) NVIDIA(0): WAIT (2-S, 17, 0x008f, 0x00053bb0, 0x00053c9c)

Graphics driver crash? It’s up to date…
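
To pull just the GPU-related lines out of a journal dump like the one above, a small filter along these lines might help. This is only a sketch: the patterns are guesses based on the messages quoted in this thread, not an official list of NVIDIA or Chromium error signatures, and the script name `gpu_filter.py` is made up.

```python
import re
import sys

# Assumption: these patterns are taken from the journal lines quoted in this
# thread; they are not a complete list of GPU-related messages.
GPU_PATTERNS = [
    re.compile(r"\(WW\) NVIDIA(\(\d+\))?: "),       # X server warnings from the NVIDIA driver
    re.compile(r"GPU state invalid"),               # Chromium command buffer error
    re.compile(r"GPU process exited unexpectedly"), # Chromium GPU process crash
]


def gpu_lines(lines):
    """Return only the lines that match one of the GPU-related patterns."""
    return [
        line.rstrip("\n")
        for line in lines
        if any(pattern.search(line) for pattern in GPU_PATTERNS)
    ]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python gpu_filter.py saved-journal.txt
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        for line in gpu_lines(handle):
            print(line)
```

Saving the previous boot with `journalctl -b -1 --no-pager > saved-journal.txt` and running the script on that file should leave only the NVIDIA `WAIT` warnings and the Chromium GPU errors, which makes it easier to compare timestamps between freezes.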

That is about the same crash I’ve been having…in my case, the crash is induced by shutting down BOINC…I can watch a memory corruption start right afterward…The only way I’ve been able to recover is to reboot right after BOINC stops. The crash is related to unloading a high-usage program out of VRAM.

Yes, this is a driver issue…I’ve narrowed it down to the combination of the current Mesa & Vulkan builds…I’m waiting for Mesa 23.2 to come out…should be in a couple of weeks. You could downgrade Mesa & Vulkan to see if that fixes it…I’m moving to an Intel Arc video card…I’ve fought Nvidia for far too many years now & I’m not fond of AMD. There is a thread on what is going to happen to Nvidia in the 6.6 kernel series & I don’t want to see it.

Take a look at this thread: Plans for Nvidia Proprietary Driver Going Forward from 6.6?

WTF, what kind of unacceptable bug is that? A crash after every game? Also, Nvidia drivers will stop working on Linux? Excuse me, what the fuck?

I’m not a gamer, but I suspect it may have something to do with GameMode and PAM?

Well—when I get a crash it’s not during games…it’s just when I shut off BOINC, which uses 95% of my available VRAM…the system will try to freeze. The message I get at that time is:

/usr/lib/gdm-x-session[1350]: (WW) NVIDIA: Wait for channel idle timed out.

At the same time, I can see my Conky Nvidia monitor will show the memory MHz at 0 & the GPU frequency goes to 32000 MHz…

This will happen VERY reliably when BOINC shuts off. This started happening with the current Mesa update.
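
If you don’t run Conky, the same kind of reading can probably be pulled from `nvidia-smi`. A rough sketch, assuming `nvidia-smi` is on the PATH and still answers while the GPU is in that state (it may not). The 5000 MHz ceiling is an arbitrary number I picked for illustration, not an NVIDIA spec, and the function names are made up.

```python
import subprocess

# Assumption: any graphics clock above this is treated as bogus.
# The value is illustrative only, not taken from NVIDIA documentation.
MAX_PLAUSIBLE_GPU_MHZ = 5000


def parse_clocks(csv_line):
    """Parse one 'graphics, memory' CSV line into two integers (MHz)."""
    gpu_mhz, mem_mhz = (int(field.strip()) for field in csv_line.split(","))
    return gpu_mhz, mem_mhz


def looks_wedged(gpu_mhz, mem_mhz):
    """Flag readings like the ones described above: memory at 0 or an absurd GPU clock."""
    return mem_mhz == 0 or gpu_mhz > MAX_PLAUSIBLE_GPU_MHZ


def read_clocks():
    """Ask nvidia-smi for the current graphics and memory clocks of the first GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=clocks.gr,clocks.mem",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_clocks(result.stdout.splitlines()[0])
```

Calling `looks_wedged(*read_clocks())` in a loop while shutting BOINC down would show whether the clocks go bad before the freeze. If the driver reports something non-numeric instead of a number, `parse_clocks` will raise a `ValueError`, which is itself a hint that something is off.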

I ran a 30 min stress-test in Windows to verify that I don’t have a problem with my RTX 3070…it passed without any errors.

That is what I thought…I just bought an Intel Arc A750…about time to leave Nvidia & I like the looks of the Arc series—they are just starting to optimize the driver to get good performance out of it.

Take a look: https://www.phoronix.com/review/intel-anv-mod-boost

I have an RTX 3080 FE, I will surely not “leave Nvidia” any time soon. I would rather go back to Windows.

Can you install Xfce and try it again? Because for me it’s not happening there.

Well…I’m not really interested in xfce…I’ve got a workaround that works now…and my mind is set on Arc now.

What is your workaround?

Since, in my case, BOINC causes the problem when I just stop the application while it has a high GPU load, I have found that stopping the GPU load, waiting, and then closing the application works. If I stop the GPU load, wait, and then start up Steam, that also works. (BOINC has an option to auto-stop all running instances if it sees a high-load application start, but that no longer seems to work right. It used to.)
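
The stop, wait, close sequence could be scripted roughly like this. It is a sketch under assumptions: it assumes the BOINC client can be controlled with `boinccmd` on your install (it may need to be run from the BOINC data directory or given the RPC password), and the 30 second wait is a number picked for illustration, since the right delay is not something I have measured. The function name is made up.

```python
import subprocess
import time

# boinccmd options used here: --set_gpu_mode and --quit.
GPU_OFF = ["boinccmd", "--set_gpu_mode", "never"]  # suspend GPU computation
QUIT_CLIENT = ["boinccmd", "--quit"]               # tell the client to exit


def stop_boinc_gently(wait_seconds=30, dry_run=True):
    """Suspend GPU work, wait, then quit the client.

    With dry_run=True the commands are only printed, not executed.
    Returns the list of commands in the order they were (or would be) run.
    """
    steps = [GPU_OFF, QUIT_CLIENT]
    for index, command in enumerate(steps):
        if dry_run:
            print("would run:", " ".join(command))
        else:
            subprocess.run(command, check=True)
        if index == 0:
            # Assumption: give the GPU time to drain before the client exits.
            time.sleep(wait_seconds)
    return steps
```

Try `stop_boinc_gently(wait_seconds=0)` first to see what it would do, then `stop_boinc_gently(dry_run=False)` for real. The GPU mode should stay at `never` until it is switched back with `boinccmd --set_gpu_mode auto`, so remember to re-enable it after restarting BOINC.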

I know that this won’t help in your instance—but that is what is working for me at this time.

Have you tried to downgrade Mesa to see if that works?