Random crashes amdgpu

After 4 or 5 months of these random crashes, I need help solving the issue. In February, when I tried out CachyOS for a week, I got random crashes, which persisted when I came back to EOS.

Today it has already crashed 4 times: once when I was opening a window, once when I switched icons on the taskbar, and once when the lock screen appeared and I was not doing anything at all. So it is completely random.

The output of journalctl is always the same, or nothing is logged at all:

Apr 20 15:04:05 saeniv-b660mprors systemd[850]: Started app-org.kde.konsole-8315.scope.
Apr 20 15:04:13 saeniv-b660mprors systemd[850]: Started Dolphin - File Manager.
Apr 20 15:04:23 saeniv-b660mprors kernel: amdgpu 0000:03:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data
Apr 20 15:04:23 saeniv-b660mprors kernel: amdgpu 0000:03:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data
Apr 20 15:04:23 saeniv-b660mprors kwin_wayland[890]: kwin_libinput: Libinput: event3  - Sharkoon Technologies GmbH Sharkoon Light² 100: client bug: event processing lagging behind by 334ms, your system is >
Apr 20 15:04:23 saeniv-b660mprors kwin_wayland[890]: kwin_libinput: Libinput: client bug: timer event3 debounce: scheduled expiry is in the past (-250ms), your system is too slow
Apr 20 15:04:23 saeniv-b660mprors kwin_wayland[890]: kwin_libinput: Libinput: client bug: timer event3 debounce short: scheduled expiry is in the past (-263ms), your system is too slow
Apr 20 15:04:23 saeniv-b660mprors kwin_wayland[890]: kwin_libinput: Libinput: WARNING: log rate limit exceeded (5 msgs per 3600000ms). Discarding future messages.
Apr 20 15:04:23 saeniv-b660mprors kernel: amdgpu 0000:03:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data
Apr 20 15:04:23 saeniv-b660mprors kernel: amdgpu 0000:03:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data
Apr 20 15:04:24 saeniv-b660mprors kernel: amdgpu 0000:03:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data
Apr 20 15:04:24 saeniv-b660mprors kernel: amdgpu 0000:03:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data
Apr 20 15:04:24 saeniv-b660mprors kernel: amdgpu 0000:03:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data
Apr 20 15:04:24 saeniv-b660mprors kernel: amdgpu 0000:03:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data
Apr 20 15:04:24 saeniv-b660mprors kernel: amdgpu 0000:03:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data
Apr 20 15:04:25 saeniv-b660mprors kernel: amdgpu 0000:03:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data
Apr 20 15:04:25 saeniv-b660mprors kernel: amdgpu 0000:03:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data
Apr 20 15:04:25 saeniv-b660mprors kernel: amdgpu 0000:03:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data
Apr 20 15:04:25 saeniv-b660mprors kernel: amdgpu 0000:03:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data
Apr 20 15:04:25 saeniv-b660mprors kernel: amdgpu 0000:03:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data
Apr 20 15:04:25 saeniv-b660mprors kernel: amdgpu 0000:03:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data
Apr 20 15:04:33 saeniv-b660mprors kernel: amdgpu 0000:03:00.0: amdgpu: Dumping IP State
Apr 20 15:04:36 saeniv-b660mprors kernel: amdgpu 0000:03:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000029 SMN_C2PMSG_82:0x00000000
Apr 20 15:04:36 saeniv-b660mprors kernel: amdgpu 0000:03:00.0: amdgpu: Failed to disable gfxoff!
Apr 20 15:04:37 saeniv-b660mprors kernel: snd_hda_intel 0000:03:00.1: Unable to change power state from D3hot to D0, device inaccessible
Apr 20 15:04:41 saeniv-b660mprors kernel: amdgpu 0000:03:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:41 param:0x00000000 message:DisallowGfxOff?
Apr 20 15:04:41 saeniv-b660mprors kernel: amdgpu 0000:03:00.0: amdgpu: Failed to disable gfxoff!
-- Boot f075a9441e0a479fb8a2b4b80ac890a4 --

And the output of inxi -Faz | eos-sendlog:

https://0x0.st/8O6_.txt

I updated the BIOS, tried different GPU drivers, switched from the no-name NVMe to a Samsung SSD, and tried different Mesa versions. I tried everything I found.

Maybe switching from KDE to another desktop environment would help. I am really frustrated because the PC is barely usable for daily work. Oddly, I have no crashes during games.

https://www.reddit.com/r/Fedora/comments/1gystaj/amdgpu_dmcub_error_collecting_diagnostic_data/

Seems like the same problem, and there’s a solution posted! (This was the first Google search result for “dmcub error”.)

Thank you very much. I googled so much and read so many threads, but I was searching for the wrong topic. Sorry for creating a new post.
I applied the fix and will test it out.
The same question was also already answered here:

Let us know if it works!

Crashing, randomly or at all, is no good. The output shows the following:

Apr 20 15:04:36 saeniv-b660mprors kernel: amdgpu 0000:03:00.0: amdgpu: Failed to disable gfxoff!
Apr 20 15:04:37 saeniv-b660mprors kernel: snd_hda_intel 0000:03:00.1: Unable to change power state from D3hot to D0, device inaccessible
Apr 20 15:04:41 saeniv-b660mprors kernel: amdgpu 0000:03:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:41 param:0x00000000 message:DisallowGfxOff?

amdgpu.dcdebugmask=0x10 disables PSR (Panel Self Refresh).

Not sure if that is the issue? But you’ll know soon enough.

You are correct. This was not the solution. I got 6 crashes in the last few days, the last two without any log entries, but today I got the usual:

Apr 22 17:47:47 saeniv-b660mprors kwin_wayland[924]: kwin_core: Cannot grant a token to KWin::ClientConnection(0x57fdda568ff0)
Apr 22 17:47:47 saeniv-b660mprors kwin_wayland[924]: kwin_core: Cannot grant a token to KWin::ClientConnection(0x57fdda568ff0)
Apr 22 17:47:55 saeniv-b660mprors kernel: amdgpu 0000:03:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data
Apr 22 17:47:56 saeniv-b660mprors kernel: amdgpu 0000:03:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data
Apr 22 17:47:56 saeniv-b660mprors kernel: amdgpu 0000:03:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data
Apr 22 17:47:56 saeniv-b660mprors kernel: amdgpu 0000:03:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data
Apr 22 17:47:57 saeniv-b660mprors kernel: amdgpu 0000:03:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data
Apr 22 17:47:57 saeniv-b660mprors kernel: amdgpu 0000:03:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data
Apr 22 17:47:57 saeniv-b660mprors kernel: amdgpu 0000:03:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data
Apr 22 17:47:57 saeniv-b660mprors kernel: amdgpu 0000:03:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data
Apr 22 17:47:57 saeniv-b660mprors kernel: amdgpu 0000:03:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data
Apr 22 17:47:57 saeniv-b660mprors kernel: amdgpu 0000:03:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data
Apr 22 17:47:58 saeniv-b660mprors kernel: amdgpu 0000:03:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data
Apr 22 17:47:58 saeniv-b660mprors kernel: amdgpu 0000:03:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data
Apr 22 17:48:02 saeniv-b660mprors kernel: amdgpu 0000:03:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data
Apr 22 17:48:02 saeniv-b660mprors kernel: amdgpu 0000:03:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data
Apr 22 17:48:02 saeniv-b660mprors kernel: amdgpu 0000:03:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data
Apr 22 17:48:02 saeniv-b660mprors kernel: amdgpu 0000:03:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data
Apr 22 17:48:03 saeniv-b660mprors kernel: amdgpu 0000:03:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data
Apr 22 17:48:03 saeniv-b660mprors kernel: amdgpu 0000:03:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data
Apr 22 17:48:03 saeniv-b660mprors kernel: amdgpu 0000:03:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data
Apr 22 17:48:03 saeniv-b660mprors kernel: amdgpu 0000:03:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data
Apr 22 17:48:05 saeniv-b660mprors kernel: amdgpu 0000:03:00.0: amdgpu: Dumping IP State
Apr 22 17:48:08 saeniv-b660mprors kernel: amdgpu 0000:03:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000029 SMN_C2PMSG_82:0x00000000
Apr 22 17:48:08 saeniv-b660mprors kernel: amdgpu 0000:03:00.0: amdgpu: Failed to disable gfxoff!
Apr 22 17:48:11 saeniv-b660mprors kernel: amdgpu 0000:03:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:41 param:0x00000000 message:DisallowGfxOff?
Apr 22 17:48:11 saeniv-b660mprors kernel: amdgpu 0000:03:00.0: amdgpu: Failed to disable gfxoff!
Apr 22 17:48:11 saeniv-b660mprors kernel: snd_hda_intel 0000:03:00.1: Unable to change power state from D3hot to D0, device inaccessible
Apr 22 17:48:25 saeniv-b660mprors kernel: sched: DL replenish lagged too much
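
Long runs of identical kernel messages like the ones above are easier to read when collapsed into one entry with a count. The following is only a minimal sketch, assuming journalctl's default short output format ("Mon DD HH:MM:SS host source: message"). The function name collapse_repeats is made up for illustration, and the sample lines are copied from the log above.

```python
import re
from itertools import groupby

# Assumption: default "short" journalctl format,
# i.e. "Apr 22 17:47:55 hostname source: message".
LINE_RE = re.compile(r"^(\w{3} +\d+ \d\d:\d\d:\d\d) (\S+) (.*)$")


def collapse_repeats(lines):
    """Return (first_timestamp, message, count) for each run of identical messages."""
    parsed = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:  # lines such as "-- Boot ... --" do not match and are skipped
            parsed.append((match.group(1), match.group(3)))

    runs = []
    # groupby only merges *consecutive* identical messages, so the order is kept.
    for message, group in groupby(parsed, key=lambda item: item[1]):
        items = list(group)
        runs.append((items[0][0], message, len(items)))
    return runs


sample = [
    "Apr 22 17:47:55 saeniv-b660mprors kernel: amdgpu 0000:03:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data",
    "Apr 22 17:47:56 saeniv-b660mprors kernel: amdgpu 0000:03:00.0: [drm] *ERROR* dc_dmub_srv_log_diagnostic_data: DMCUB error - collecting diagnostic data",
    "Apr 22 17:48:05 saeniv-b660mprors kernel: amdgpu 0000:03:00.0: amdgpu: Dumping IP State",
]

for first_seen, message, count in collapse_repeats(sample):
    print(f"{first_seen}  x{count}  {message}")
```

To run it on real data, pass the output lines of journalctl -k -b -1 (kernel messages from the previous boot) instead of the sample list.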

snd_hda_intel 0000:03:00.1: Unable to change power state from D3hot to D0, device inaccessible

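All the other lines come from amdgpu, but this one comes from snd_hda_intel. Despite the name, that is the generic HD Audio driver, and 0000:03:00.1 is the audio function of the same PCI device as the GPU (0000:03:00.0).

The fix from above is still applied; I checked just to make sure.

I tried the LTS kernel, but got the same issue.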

You could try the following kernel parameter.

amdgpu.ppfeaturemask=0xfff73fff

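For comparison, 0xfffd3fff means that everything is enabled except for PP_OVERDRIVE_MASK, PP_GFXOFF_MASK, and PP_STUTTER_MODE. Going by the bit definitions in the kernel's amd_shared.h, the value above (0xfff73fff) is slightly different: it clears PP_OVERDRIVE_MASK (0x4000), PP_GFXOFF_MASK (0x8000), and PP_GFX_DCS_MASK (0x80000), and leaves PP_STUTTER_MODE (0x20000) enabled.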
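
To see which features a given mask turns off, the value can be decoded bit by bit. This is only a rough sketch: the four bit values below are assumed from the kernel's amd_shared.h header and are not taken from this thread, so check them against your kernel source. The table is deliberately incomplete, and the function name cleared_features is made up for illustration.

```python
# Assumed bit assignments (verify against drivers/gpu/drm/amd/include/amd_shared.h).
# Only the bits discussed in this thread are listed; the real enum has more.
PP_FEATURE_BITS = {
    "PP_OVERDRIVE_MASK": 0x4000,
    "PP_GFXOFF_MASK": 0x8000,
    "PP_STUTTER_MODE": 0x20000,
    "PP_GFX_DCS_MASK": 0x80000,
}


def cleared_features(mask: int) -> list[str]:
    """Return the names of the known feature bits that are NOT set in mask."""
    return [name for name, bit in PP_FEATURE_BITS.items() if not mask & bit]


for value in (0xFFFD3FFF, 0xFFF73FFF, 0xFFFF7FFF):
    print(f"{value:#010x}: disabled -> {', '.join(cleared_features(value))}")
```

Decoding a value this way before putting it on the kernel command line helps catch mix-ups between similar-looking masks.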

It might be worth trying. If you use GRUB, make sure to run the GRUB update command after adding the kernel parameter to the command line in the GRUB configuration.
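
After rebooting, it is worth confirming that the parameter really reached the kernel, since a forgotten bootloader update leaves the old command line in place. A minimal sketch, assuming a Linux system where the active command line is readable from /proc/cmdline. The helper names parse_cmdline and read_active_cmdline are made up for illustration, the sample string is invented, and the simple whitespace split does not handle quoted values that contain spaces.

```python
from pathlib import Path
from typing import Dict, Optional


def parse_cmdline(text: str) -> Dict[str, Optional[str]]:
    """Split a kernel command line into {name: value}; flags without '=' map to None."""
    params: Dict[str, Optional[str]] = {}
    for token in text.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def read_active_cmdline(path: str = "/proc/cmdline") -> Dict[str, Optional[str]]:
    """Parse the command line of the currently running kernel (Linux only)."""
    return parse_cmdline(Path(path).read_text())


# Invented sample line, for illustration only.
sample = "quiet rw amdgpu.ppfeaturemask=0xfff73fff"
params = parse_cmdline(sample)
print(params.get("amdgpu.ppfeaturemask", "not set"))
```

On the affected machine, calling read_active_cmdline() instead of parsing the sample string shows what the running kernel was actually booted with.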

I added the kernel parameter and have had no crashes in the last hour. Thank you so much.

Just to make sure I understand this correctly: the parameter prevents the GPU from switching to idle, so the power consumption will be higher?

I think so… it’s preventing it from entering a low-power state at idle, or something like that? :thinking: I’m hoping it works, as I am doing a lot of guessing. It’s a bit of a trial-and-error method when one doesn’t know 100% what is causing the problem. A stab in the dark, as they say! :wink:

Thank you. I searched a little and found this.

I will report back after a few days, but currently it is still running. Thank you again.

Yeah, I had seen that before as well, but it is only for the Raven Ridge iGPU.

So how are the random crashes now after 24 hrs?

I was not at home yesterday, but today I had one crash with a different report. Still a crash, though. I hope it was just a one-time thing, but I will see tomorrow, when I will use the PC more.

You could try another kernel parameter for the split lock error.

split_lock_detect=off

Hello,
Today I have used the PC for a whole day, and it worked without issues.

amdgpu.ppfeaturemask=0xfff73fff

Works great. Thank you. The reduced version with:

amdgpu.ppfeaturemask=0xffff7fff

which should just disable PP_GFXOFF_MASK, did not work for me. So thank you very much for your advice. I will add the working value as a permanent kernel parameter to my system.
One question still crosses my mind: why is my system not working without the parameter?

I will leave the thread open for another two weeks just to be certain.

It’s likely a UEFI firmware (BIOS) issue. Not all manufacturers do a great job of implementing this. A BIOS update or a kernel update may eventually fix it. There is nothing wrong with using kernel parameters to deal with these kinds of issues.