Hi,
I get very frequent (1-3 times a day) full system lockups: no input is registered any more, although sound mostly continues. I don’t know how to diagnose this.
These were generated after a reboot right after a freeze:
That is about the same crash I’ve been having…in my case, the crash is induced by shutting down BOINC…I can watch a memory corruption start right afterward…The only way I’ve been able to recover is to reboot right after BOINC stops. The crash is related to unloading a high-usage program out of VRAM.
Yes, this is a driver issue…I’ve narrowed it down to the combination of the current Mesa & Vulkan builds…I’m waiting for Mesa 23.2 to come out…should be in a couple of weeks. You could downgrade Mesa & Vulkan to see if that fixes it…I’m moving to an Intel Arc video card…I’ve fought Nvidia for far too many years now & I’m not fond of AMD. There is a thread on what is going to happen to Nvidia in the 6.6 kernel series & I don’t want to see it.
Well—when I get a crash it’s not during games…it’s just when I shut off BOINC, which uses 95% of my available VRAM…the system will try to freeze. The message I get at that time is:
/usr/lib/gdm-x-session[1350]: (WW) NVIDIA: Wait for channel idle timed out.
At the same time, my Conky Nvidia monitor shows the memory clock at 0 MHz & the GPU frequency jumping to 32000 MHz…
This will happen VERY reliably when BOINC shuts off. This started happening with the current Mesa update.
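If you want to check whether your freezes line up with the same warning, something like the rough sketch below could help. It only looks for the message text quoted above in whatever log lines you hand it; the function name `find_channel_timeouts` is made up for this example, and it assumes your log actually records the warning with the same wording.

```python
# Warning text quoted in the post above; matched as a plain substring.
MARKER = "NVIDIA: Wait for channel idle timed out"


def find_channel_timeouts(lines):
    """Return the log lines that contain the NVIDIA channel-idle warning.

    `lines` is any iterable of strings, e.g. the lines of a saved log file.
    """
    return [line.rstrip("\n") for line in lines if MARKER in line]
```

One way to feed it, assuming a systemd journal that keeps logs from earlier boots, is to pass in the lines of `journalctl -b -1 --no-pager` (the boot before the current one) right after rebooting from a freeze. An empty result only means this particular message was not logged, not that the driver is fine.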
I ran a 30 min stress-test in Windows to verify that I don’t have a problem with my RTX 3070…it passed without any errors.
That is what I thought…I just bought an Intel Arc A750…about time to leave Nvidia & I like the looks of the Arc series—they are just starting to optimize the driver to get good performance out of it.
Since, in my case, BOINC causes the problem when I just stop the application under a high GPU load, I have found that stopping the GPU load, waiting, then closing the application works. If I stop the GPU load, wait, then start up Steam, that also works. (BOINC has an option to auto-stop all running instances if it sees a high-load application start, but that no longer seems to work right; it used to.)
I know that this won’t help in your instance—but that is what is working for me at this time.
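For anyone who wants to script the stop, wait, then close sequence instead of doing it by hand, here is a minimal sketch using BOINC's `boinccmd` command-line tool. A few assumptions: the 30-second wait is a guess (pick whatever pause works for you), the helper name `stop_boinc_gently` is invented for this example, and `boinccmd` must be on your PATH and allowed to talk to the running client (on some installs that means running it from the BOINC data directory or passing a password).

```python
import subprocess
import time


def stop_boinc_gently(wait_seconds=30, run=subprocess.run, sleep=time.sleep):
    """Suspend BOINC GPU work, pause, then ask the client to quit.

    `run` and `sleep` are parameters only so the sequence can be
    dry-run without touching a real BOINC client.
    """
    # Step 1: stop the GPU load. As far as I know, with no duration given
    # this mode stays in effect until it is changed again.
    run(["boinccmd", "--set_gpu_mode", "never"], check=True)

    # Step 2: wait before touching the client (30 s is an assumed value).
    sleep(wait_seconds)

    # Step 3: close the application.
    run(["boinccmd", "--quit"], check=True)
```

Because the GPU mode probably stays on "never" after the client restarts, you would likely need `boinccmd --set_gpu_mode auto` afterwards to resume GPU work.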
Have you tried to downgrade Mesa to see if that works?