AMD GPU Crash? Screens won't come online, reboot required

Hey,

Up front: please forgive my scrubbiness when it comes to Linux logs.

Preface: I am using corectrl to run various profiles on my 6900XT, and I have never experienced a crash first-hand with the profiles I have set up, at least not while using the PC. Screens are set to turn off after 3 minutes and sleep mode is disabled.

So yesterday I let my computer run overnight, as I was encoding with Handbrake. Today the computer wouldn’t come back online without a reboot/hard reset.

I checked the logs and it appears that the GPU “randomly died” and caused a crash cascade. Given the last output update on the encode, this happened around 00:46.

Here are some log excerpts:

First log entries around that time:

Mai 11 00:46:42 derp-ms7d13 kernel: amdgpu 0000:03:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:13 param:0x00000000 message:GetEnabledSmuFeaturesHigh?
Mai 11 00:46:42 derp-ms7d13 kernel: amdgpu 0000:03:00.0: amdgpu: Failed to retrieve enabled ppfeatures!
Mai 11 00:46:42 derp-ms7d13 kernel: amdgpu 0000:03:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:13 param:0x00000000 message:GetEnabledSmuFeaturesHigh?
Mai 11 00:46:42 derp-ms7d13 kernel: amdgpu 0000:03:00.0: amdgpu: Failed to retrieve enabled ppfeatures!
Mai 11 00:46:42 derp-ms7d13 kernel: amdgpu 0000:03:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:18 param:0x00000007 message:TransferTableSmu2Dram?
Mai 11 00:46:42 derp-ms7d13 kernel: amdgpu 0000:03:00.0: amdgpu: [sienna_cichlid_get_power_profile_mode] Failed to get activity monitor!
Mai 11 00:46:42 derp-ms7d13 kernel: amdgpu 0000:03:00.0: amdgpu: SMU: response:0xFFFFFFFF for …
Mai 11 00:46:42 derp-ms7d13 kernel: amdgpu 0000:03:00.0: amdgpu: Setting new power limit is not supported!
Mai 11 00:46:42 derp-ms7d13 kernel: amdgpu 0000:03:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:13 param:0x00000000 message:GetEnabledSmuFeaturesHigh?
Mai 11 00:46:42 derp-ms7d13 kernel: amdgpu 0000:03:00.0: amdgpu: Failed to retrieve enabled ppfeatures!
Mai 11 00:46:42 derp-ms7d13 kernel: amdgpu 0000:03:00.0: amdgpu: Setting new power limit is not supported!
Mai 11 00:46:43 derp-ms7d13 kernel: amdgpu 0000:03:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:18 param:0x00000005 message:TransferTableSmu2Dram?
Mai 11 00:46:43 derp-ms7d13 kernel: amdgpu 0000:03:00.0: amdgpu: Failed to export SMU metrics table!
Mai 11 00:46:43 derp-ms7d13 kernel: amdgpu 0000:03:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:18 param:0x00000005 message:TransferTableSmu2Dram?

Mai 11 00:46:43 derp-ms7d13 kernel: i2c-designware-pci 0000:03:00.3: Unable to change power state from D3hot to D0, device inaccessible

More stuff like this keeps coming for a while, until it eventually shifts to:

Mai 11 00:47:03 derp-ms7d13 kernel: [drm:amdgpu_job_timedout [amdgpu]] ERROR ring gfx_0.0.0 timeout, signaled seq=11786539, emitted seq=11786541
Mai 11 00:47:03 derp-ms7d13 kernel: [drm:amdgpu_job_timedout [amdgpu]] ERROR Process information: process kscreenlocker_g pid 42897 thread kscreenloc:cs0 pid 42900

Mai 11 00:47:03 derp-ms7d13 kernel: [drm:dc_dmub_srv_wait_idle [amdgpu]] ERROR Error waiting for DMUB idle: status=5
Mai 11 00:47:03 derp-ms7d13 kernel: amdgpu 0000:03:00.0: amdgpu: Failed to disable smu features.
Mai 11 00:47:03 derp-ms7d13 kernel: amdgpu 0000:03:00.0: amdgpu: Fail to disable dpm features!
Mai 11 00:47:03 derp-ms7d13 kernel: [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] ERROR suspend of IP block failed -121
Mai 11 00:47:03 derp-ms7d13 kernel: [drm:psp_ring_cmd_submit [amdgpu]] ERROR ring_buffer_start = 000000007540d1c0; ring_buffer_end = 0000000063fefc33; write_frame = 000000007a25ce68
Mai 11 00:47:03 derp-ms7d13 kernel: [drm:psp_ring_cmd_submit [amdgpu]] ERROR write_frame is pointing to address out of bounds
Mai 11 00:47:03 derp-ms7d13 kernel: [drm:psp_suspend [amdgpu]] ERROR Failed to terminate ras ta
Mai 11 00:47:03 derp-ms7d13 kernel: [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] ERROR suspend of IP block failed -22

Mai 11 00:47:04 derp-ms7d13 kernel: amdgpu 0000:03:00.0: amdgpu: ASIC reset failed with error, -121 for drm dev, 0000:03:00.0
Mai 11 00:47:15 derp-ms7d13 kernel: amdgpu 0000:03:00.0: amdgpu: GPU reset succeeded, trying to resume
Mai 11 00:47:15 derp-ms7d13 kernel: [drm] PCIE GART of 512M enabled (table at 0x0000008000300000).
Mai 11 00:47:15 derp-ms7d13 kernel: [drm] VRAM is lost due to GPU reset!
Mai 11 00:47:15 derp-ms7d13 kernel: [drm] PSP is resuming…
Mai 11 00:47:15 derp-ms7d13 kwin_x11[1188]: amdgpu: amdgpu_cs_query_fence_status failed.
Mai 11 00:47:15 derp-ms7d13 kernel: [drm:psp_hw_start [amdgpu]] ERROR PSP create ring failed!
Mai 11 00:47:15 derp-ms7d13 kernel: [drm:psp_resume [amdgpu]] ERROR PSP resume failed
Mai 11 00:47:15 derp-ms7d13 kernel: [drm:amdgpu_device_fw_loading [amdgpu]] ERROR resume of IP block failed -62

Mai 11 00:47:15 derp-ms7d13 kernel: amdgpu 0000:03:00.0: amdgpu: GPU reset(2) failed

Mai 11 00:47:15 derp-ms7d13 kernel: [drm:amdgpu_job_timedout [amdgpu]] ERROR GPU Recovery Failed: -62

Mai 11 00:47:18 derp-ms7d13 kernel: [drm:amdgpu_job_timedout [amdgpu]] ERROR ring sdma3 timeout, signaled seq=60655, emitted seq=60657
Mai 11 00:47:18 derp-ms7d13 kernel: [drm:amdgpu_job_timedout [amdgpu]] ERROR Process information: process pid 0 thread pid 0

From here on out, various UI services begin to die.

Mai 11 00:47:23 derp-ms7d13 kaccess[1474]: The X11 connection broke (error 1). Did the X11 server die?
Mai 11 00:47:23 derp-ms7d13 kded5[1331]: X connection to :0 broken (explicit kill or server shutdown).
Mai 11 00:47:23 derp-ms7d13 kdeconnectd[1419]: The X11 connection broke (error 1). Did the X11 server die?
Mai 11 00:47:23 derp-ms7d13 kded5[1187]: X connection to :0 broken (explicit kill or server shutdown).
Mai 11 00:47:23 derp-ms7d13 kded5[1187]: The X11 connection broke: I/O error (code 1)
Mai 11 00:47:23 derp-ms7d13 konsole[31870]: The X11 connection broke (error 1). Did the X11 server die?
Mai 11 00:47:23 derp-ms7d13 baloorunner[3693]: The X11 connection broke (error 1). Did the X11 server die?
Mai 11 00:47:23 derp-ms7d13 kscreen_backend_launcher[1395]: The X11 connection broke (error 1). Did the X11 server die?
Mai 11 00:47:23 derp-ms7d13 org_kde_powerdevil[2995]: The X11 connection broke (error 1). Did the X11 server die?
Mai 11 00:47:23 derp-ms7d13 xembedsniproxy[1287]: The X11 connection broke (error 1). Did the X11 server die?
Mai 11 00:47:23 derp-ms7d13 polkitd[842]: Unregistered Authentication Agent for unix-session:2 (system bus name :1.58, object path /org/kde/PolicyKit1/AuthenticationAgent, locale en_GB.UTF-8) (disconnected from bus)
Mai 11 00:47:23 derp-ms7d13 ksmserver[1185]: The X11 connection broke (error 1). Did the X11 server die?
Mai 11 00:47:23 derp-ms7d13 polkit-kde-authentication-agent-1[1281]: The X11 connection broke (error 1). Did the X11 server die?
Mai 11 00:47:23 derp-ms7d13 kwin_x11[1188]: The X11 connection broke (error 1). Did the X11 server die?
Mai 11 00:47:23 derp-ms7d13 gmenudbusmenuproxy[1280]: The X11 connection broke (error 1). Did the X11 server die?
Mai 11 00:47:23 derp-ms7d13 kactivitymanagerd[1279]: The X11 connection broke (error 1). Did the X11 server die?
Mai 11 00:47:23 derp-ms7d13 kwalletd5[1132]: The X11 connection broke (error 1). Did the X11 server die?
Mai 11 00:47:23 derp-ms7d13 kwin_x11[1188]: file:///usr/share/kwin/outline/plasma/outline.qml:14: TypeError: Cannot read property ‘longDuration’ of null
Mai 11 00:47:23 derp-ms7d13 systemd[1119]: plasma-xembedsniproxy.service: Main process exited, code=exited, status=1/FAILURE
Mai 11 00:47:23 derp-ms7d13 systemd[1119]: plasma-xembedsniproxy.service: Failed with result ‘exit-code’.
Mai 11 00:47:23 derp-ms7d13 systemd[1119]: plasma-xembedsniproxy.service: Consumed 1.049s CPU time.
Mai 11 00:47:23 derp-ms7d13 systemd[1119]: plasma-kscreen.service: Main process exited, code=exited, status=1/FAILURE

and it only goes downhill from there on out.

There are still logs from 03:00 onward, so the system was not entirely dead, but they are unrelated to the cause.

Handbrake definitely died along with this issue, though, as indicated by the abrupt end of the encode.

So my question is basically this: What has happened here and why? It strikes me as very random.

The encode was done on the CPU, so there was no GPU involvement. Furthermore, the GPU had its desktop profile enabled, with virtually zero load. I use the same profile regularly for undemanding titles and have never had the system crash on me, or my screens go black for that matter.

In case anyone is actually familiar with this particular issue: Any solution?

The system is up to date (well, as of May 10th/11th).

I can easily produce GPU crashes with corectrl even on desktop when undervolting too much.
Especially on Linux, an update changes a lot of software, which also affects which values you can apply in corectrl.

Try relaxing some of your changes by moving a bit closer to the default values, then check whether the issue happens again.
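To check whether it happens again without reading the whole journal by hand, something like the following might help. It is only a rough sketch: the three patterns are copied from the log excerpts above, while the stage names and the function `first_occurrences` are made up for illustration and are not part of any amdgpu tooling. It reports the first line matching each stage, so you can see when the SMU stopped responding relative to the ring timeout and the failed reset.

```python
import re

# Patterns taken from the excerpts in this thread; the stage names are
# hypothetical labels, not anything the driver itself uses.
STAGES = [
    ("smu_unresponsive", re.compile(r"SMU: response:0xFFFFFFFF")),
    ("ring_timeout", re.compile(r"amdgpu_job_timedout.*ring \S+ timeout")),
    ("reset_failed", re.compile(r"GPU reset\(\d+\) failed|GPU Recovery Failed")),
]


def first_occurrences(lines):
    """Return {stage: first matching line} for every stage seen in the lines."""
    found = {}
    for line in lines:
        for name, pattern in STAGES:
            if name not in found and pattern.search(line):
                found[name] = line.strip()
    return found


# Three lines copied from the post above, used as sample input.
sample = [
    "Mai 11 00:46:42 derp-ms7d13 kernel: amdgpu 0000:03:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:13 param:0x00000000 message:GetEnabledSmuFeaturesHigh?",
    "Mai 11 00:47:03 derp-ms7d13 kernel: [drm:amdgpu_job_timedout [amdgpu]] ERROR ring gfx_0.0.0 timeout, signaled seq=11786539, emitted seq=11786541",
    "Mai 11 00:47:15 derp-ms7d13 kernel: amdgpu 0000:03:00.0: amdgpu: GPU reset(2) failed",
]

for stage, line in first_occurrences(sample).items():
    print(f"{stage}: {line}")
```

For a real run you would feed it the kernel messages of the boot in question instead of `sample`, for example the output of `journalctl -k -b -1` for the previous boot. If none of the stages show up after a few days on tamer settings, that at least points towards the profile rather than the hardware.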

I’ve set up a desktop-only profile now with max UV and the lowest core clock (within what corectrl allows).

Nothing has happened yet, but we will see. As they’re 4K encodes, they will be taking their sweet time anyway. If everything looks good in a few days’ time, I will report back. If not, I will attempt a stock run, just to be on the safe side.