[HowTo] Mitigate GFX Crash/Lockup (apparent freeze) with amdgpu

There are numerous bug reports and even more possibly-related forum posts and anecdotes floating around about a surprisingly common problem with amdgpu systems.

Symptoms include an apparent ‘freeze’, though often only the graphics output stops updating: audio keeps playing, and a TTY can still be reached with some patience.

It is often triggered by H.264 video, fullscreen videos in browsers such as Firefox, or fullscreen games.


The logs typically contain messages like the following:

Pageflip timed out! This is a bug in the amdgpu kernel driver

Or

flip_done timed out
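
To check whether a saved log contains either message, a small script can scan it. This is only a minimal Python sketch: the function name `find_flip_timeouts` is made up for illustration, and it assumes the log is available as plain text lines (for example, output saved from `journalctl -b`).

```python
# Message fragments quoted above; matching is a plain substring test.
PATTERNS = ("Pageflip timed out", "flip_done timed out")


def find_flip_timeouts(lines):
    """Return the log lines that contain one of the known messages."""
    return [line.rstrip("\n") for line in lines
            if any(pattern in line for pattern in PATTERNS)]


# Illustrative input: the two messages from this post plus a filler line.
sample = [
    "Pageflip timed out! This is a bug in the amdgpu kernel driver",
    "an unrelated line",
    "flip_done timed out",
]
for hit in find_flip_timeouts(sample):
    print(hit)
```

For a real log, the lines of a saved file could be passed in instead of `sample`.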


Kernel

First, kernel 6.6 (linux66) is often comparatively reliable without any further settings, but this is not a permanent solution, nor is it even an option for some systems.




PSR

It is quite possible that these issues are related to PSR (“Panel Self Refresh”), a power-saving feature. The amdgpu.dcdebugmask kernel parameter can disable PSR and related features; its value is a bitmask built from these flags:

enum DC_DEBUG_MASK {
	DC_DISABLE_PIPE_SPLIT = 0x1,
	DC_DISABLE_STUTTER = 0x2,
	DC_DISABLE_DSC = 0x4,
	DC_DISABLE_CLOCK_GATING = 0x8,
	DC_DISABLE_PSR = 0x10,
	DC_FORCE_SUBVP_MCLK_SWITCH = 0x20,
	DC_DISABLE_MPO = 0x40,
	DC_ENABLE_DPIA_TRACE = 0x80,
	DC_ENABLE_DML2 = 0x100,
	DC_DISABLE_PSR_SU = 0x200,
	DC_DISABLE_REPLAY = 0x400,
	DC_DISABLE_IPS = 0x800,

(some definitions: Core Driver Infrastructure — The Linux Kernel documentation)

In practice, that means trying one of the following:

DC_DISABLE_PSR_SU

amdgpu.dcdebugmask=0x200

OR

DC_DISABLE_PSR_SU and DC_DISABLE_REPLAY

amdgpu.dcdebugmask=0x600

OR

DC_DISABLE_PSR (automatically also SU)

amdgpu.dcdebugmask=0x10

OR

DC_DISABLE_PSR (automatically also SU) & DC_DISABLE_STUTTER

amdgpu.dcdebugmask=0x12

These options, listed in roughly increasing order of severity, may be enough to work around the issue.
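
The values above are just the individual flags ORed together. A minimal Python sketch of that arithmetic, using only the flag values from the enum excerpt in this post (the helper name `dcdebugmask` is hypothetical, not part of any tool):

```python
# Flag values copied from the DC_DEBUG_MASK excerpt above.
DC_DEBUG_MASK = {
    "DC_DISABLE_STUTTER": 0x2,
    "DC_DISABLE_PSR": 0x10,
    "DC_DISABLE_PSR_SU": 0x200,
    "DC_DISABLE_REPLAY": 0x400,
}


def dcdebugmask(*flags):
    """OR the named flags together and format the kernel parameter."""
    mask = 0
    for name in flags:
        mask |= DC_DEBUG_MASK[name]
    return f"amdgpu.dcdebugmask={mask:#x}"


print(dcdebugmask("DC_DISABLE_PSR_SU", "DC_DISABLE_REPLAY"))  # amdgpu.dcdebugmask=0x600
print(dcdebugmask("DC_DISABLE_PSR", "DC_DISABLE_STUTTER"))    # amdgpu.dcdebugmask=0x12
```

Any other combination of flags from the enum can be worked out the same way.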




dGPU Power Management

Users of dedicated GPUs may find that the following option is useful, again at the cost of power-saving.

amdgpu.runpm=0
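
After rebooting, it is worth confirming that the options actually reached the kernel command line. A minimal Python sketch, assuming the command line text is read from `/proc/cmdline`; the function name `amdgpu_params` is made up for illustration:

```python
def amdgpu_params(cmdline):
    """Return a dict of amdgpu.* options found in a kernel command line."""
    params = {}
    for token in cmdline.split():
        if token.startswith("amdgpu.") and "=" in token:
            key, value = token.split("=", 1)
            params[key] = value
    return params


# On a running system the text would come from /proc/cmdline:
#     with open("/proc/cmdline") as f:
#         print(amdgpu_params(f.read()))
# Illustrative input only:
print(amdgpu_params("quiet amdgpu.runpm=0 amdgpu.dcdebugmask=0x10"))
# {'amdgpu.runpm': '0', 'amdgpu.dcdebugmask': '0x10'}
```

An empty result means no amdgpu options were passed at boot, so the bootloader configuration probably was not regenerated.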



Direct Scan Out

Finally, especially for integrated GPU users for whom none of the above stopped the amdgpu freeze, another possible ‘fix’ is to disable “Direct Scan-out”. This is not optimal, as direct scan-out is meant to increase performance and decrease latency. However, it is the only thing that worked for me, and it is preferable to a full graphics lockup. It can be controlled via a global environment variable (set, for example, in /etc/environment), and the variable name depends on the compositor.

For kwin:

KWIN_DRM_NO_DIRECT_SCANOUT=1

For wlroots-based Wayland compositors (such as Sway):

WLR_SCENE_DISABLE_DIRECT_SCANOUT=1
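
After logging back in, a quick check can show whether either variable is present in the session environment. This is a minimal Python sketch with a hypothetical helper name (`scanout_overrides`); it only shows what the current process inherited, which is assumed, not guaranteed, to match what the compositor was started with.

```python
import os

# The two variable names given above.
SCANOUT_VARS = (
    "KWIN_DRM_NO_DIRECT_SCANOUT",
    "WLR_SCENE_DISABLE_DIRECT_SCANOUT",
)


def scanout_overrides(env=None):
    """Return the direct scan-out variables that are set in `env`."""
    env = os.environ if env is None else env
    return {name: env[name] for name in SCANOUT_VARS if name in env}


print(scanout_overrides() or "no direct scan-out override set")
```

If nothing is reported after editing /etc/environment, a full logout or reboot is probably still needed.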


Hope that was helpful to someone else out there. <3


I think I may be experiencing this. Not ready to report yet, but I want to leave a comment so that the topic doesn’t get closed.

Am I to understand that for a working kernel, you really have to go back as far as 6.6? linux-lts has this same problem?

Okay:
The first symptom I noticed was KDE Plasma taking about a minute to start on boot. The rest of the system booted fine; I just saw a black screen with an unusable command prompt for about a minute before the desktop appeared.

Then, about two weeks ago, I started getting program freezes. Note that it is a program freeze, not a “full graphics lockup”. It seems to affect some programs, but not others. The most-affected programs were KDE-related: Spectacle, Dolphin, and the start menu. Programs which never seem to freeze include Brave, Discord, and, surprisingly, all games. There is also a time element: after a restart, freezes only start after a few hours of uptime.

As OP recommends, I tried amdgpu.dcdebugmask=0x12, then amdgpu.runpm=0. The former almost totally eliminated the desktop startup delay; now it’s about 3 seconds. It may have slightly delayed/lessened the program freezing, but not much. amdgpu.runpm=0 did nothing.

I also changed from the zen kernel to the LTS kernel, to no effect. I’m about to try KWIN_DRM_NO_DIRECT_SCANOUT=1, but I’m not expecting much, since I’m using discrete graphics.

amdgpu.dcdebugmask=0x10

This also solves odd graphical glitches that I encounter on a clean install. Without it, suspend also doesn’t work, and when the screen turns off there is corruption and glitching once it wakes up again.

This occurs on an AMD iGPU, which is odd, because they shouldn’t have this kind of issue. :sweat_smile:

Any chance this error is related?
amdgpu 0000:12:00.0: [drm] *ERROR* lttpr_caps phy_repeater_cnt is 0x0, forcing it to 0x80.

Seems unrelated but this is what I could find;
https://gitlab.freedesktop.org/drm/amd/-/issues/?sort=created_date&state=opened&search=phy_repeater_cnt

Setting amdgpu.dcdebugmask=0x200 in /etc/kernel/cmdline and running sudo dracut --force seems to have stabilised the issue for me.

I believe it was something to do with the OLED Pixel Refresh feature on my monitor as it seemed to get really unstable when the scheduled time for that was approaching (shows an OSD message to cancel if computer in use).

There seem to be several generic errors, though, stemming from instability between the latest kernels and the amdgpu driver across multiple features (VRR, PSR, etc.), so fixes may vary for others.
