Amdgpu crashes with kernel 5.10

I have an AMD Ryzen 7 3700X with an AMD Radeon RX 5600 XT graphics card. Yesterday I got annoyed by incomplete boot processes. I was logged in via ssh and rebooted. The subsequent boot stopped at a black screen, and ssh was not running. I had to go to the computer and restart it with REISUB. I had seen this behaviour before, but didn't investigate further until today.

In the journal I found this during boot:

Jan 05 22:09:16 rakete kernel: ------------[ cut here ]------------
Jan 05 22:09:16 rakete kernel: WARNING: CPU: 12 PID: 5588 at dcn20_validate_bandwidth_fp+0x8d/0xd0 [amdgpu]
Jan 05 22:09:16 rakete kernel: Modules linked in: zram z3fold cfg80211 ccm algif_aead des_generic libdes algif_skcipher cmac md4 algif_has>
Jan 05 22:09:16 rakete kernel:  crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel glue_helper crypto_simd cryptd ahci nvme xhc>
Jan 05 22:09:16 rakete kernel: CPU: 12 PID: 5588 Comm: Xorg Tainted: P           OE     5.10.4-xanmod1-cacule-MANJARO #1
Jan 05 22:09:16 rakete kernel: Hardware name: Micro-Star International Co., Ltd. MS-7C37/MPG X570 GAMING EDGE WIFI (MS-7C37), BIOS 1.D2 12>
Jan 05 22:09:16 rakete kernel: RIP: 0010:dcn20_validate_bandwidth_fp+0x8d/0xd0 [amdgpu]
Jan 05 22:09:16 rakete kernel: Code: 00 7b 35 22 85 14 1f 00 00 75 2f 31 d2 f2 0f 11 85 58 26 00 00 48 89 ee 4c 89 e7 e8 5d f6 ff ff 89 c2>
Jan 05 22:09:16 rakete kernel: RSP: 0018:ffffba56c13e7b48 EFLAGS: 00010246
Jan 05 22:09:16 rakete kernel: RAX: 0000000000000001 RBX: 4079400000000000 RCX: 0000000005d0a00c
Jan 05 22:09:16 rakete kernel: RDX: 0000000000000000 RSI: 4568acf9cb978f12 RDI: 00000000000311a0
Jan 05 22:09:16 rakete kernel: RBP: ffffa0f2fdb00000 R08: 0000000000000000 R09: 0000000000000000
Jan 05 22:09:16 rakete kernel: R10: ffffa0f2fdb001b8 R11: dead000000000100 R12: ffffa0f2ad0b0000
Jan 05 22:09:16 rakete kernel: R13: ffffa0f2fdb00000 R14: ffffa0f2aca83340 R15: ffffa0f2ad0b0000
Jan 05 22:09:16 rakete kernel: FS:  00007ff18d7322c0(0000) GS:ffffa1016ed00000(0000) knlGS:0000000000000000
Jan 05 22:09:16 rakete kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 05 22:09:16 rakete kernel: CR2: 00007ff11e7f7fe0 CR3: 000000013dbee000 CR4: 0000000000350ee0
Jan 05 22:09:16 rakete kernel: Call Trace:
Jan 05 22:09:16 rakete kernel:  dcn20_validate_bandwidth+0x24/0x40 [amdgpu]
Jan 05 22:09:16 rakete kernel:  dc_validate_global_state+0x3c3/0x4c0 [amdgpu]
Jan 05 22:09:16 rakete kernel:  amdgpu_dm_atomic_check+0xc84/0xd60 [amdgpu]
Jan 05 22:09:16 rakete kernel:  drm_atomic_check_only+0x55d/0x7c0 [drm]
Jan 05 22:09:16 rakete kernel:  drm_atomic_commit+0x13/0x50 [drm]
Jan 05 22:09:16 rakete kernel:  drm_atomic_connector_commit_dpms+0xda/0x100 [drm]
Jan 05 22:09:16 rakete kernel:  drm_mode_obj_set_property_ioctl+0x189/0x3b0 [drm]
Jan 05 22:09:16 rakete kernel:  ? drm_wait_vblank_ioctl+0x2f4/0x7c0 [drm]
Jan 05 22:09:16 rakete kernel:  ? drm_connector_set_obj_prop+0x90/0x90 [drm]
Jan 05 22:09:16 rakete kernel:  drm_connector_property_set_ioctl+0x37/0x60 [drm]
Jan 05 22:09:16 rakete kernel:  ? emulator_task_switch+0x170/0x520 [kvm]
Jan 05 22:09:16 rakete kernel:  drm_ioctl_kernel+0xb0/0x140 [drm]
Jan 05 22:09:16 rakete kernel:  drm_ioctl+0x212/0x3c0 [drm]
Jan 05 22:09:16 rakete kernel:  ? drm_connector_set_obj_prop+0x90/0x90 [drm]
Jan 05 22:09:16 rakete kernel:  amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
Jan 05 22:09:16 rakete kernel:  __x64_sys_ioctl+0x83/0xb0
Jan 05 22:09:16 rakete kernel:  do_syscall_64+0x33/0x80
Jan 05 22:09:16 rakete kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Jan 05 22:09:16 rakete kernel: RIP: 0033:0x7ff18e0fef6b
Jan 05 22:09:16 rakete kernel: Code: 89 d8 49 8d 3c 1c 48 f7 d8 49 39 c4 72 b5 e8 1c ff ff ff 85 c0 78 ba 4c 89 e0 5b 5d 41 5c c3 f3 0f 1e>
Jan 05 22:09:16 rakete kernel: RSP: 002b:00007ffe1daf30c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Jan 05 22:09:16 rakete kernel: RAX: ffffffffffffffda RBX: 00007ffe1daf3100 RCX: 00007ff18e0fef6b
Jan 05 22:09:16 rakete kernel: RDX: 00007ffe1daf3100 RSI: 00000000c01064ab RDI: 000000000000000d
Jan 05 22:09:16 rakete kernel: RBP: 00000000c01064ab R08: 0000000000000000 R09: 00007ff18e567d10
Jan 05 22:09:16 rakete kernel: R10: 00007ff18e567d20 R11: 0000000000000246 R12: 0000564308ec7e20
Jan 05 22:09:16 rakete kernel: R13: 000000000000000d R14: 0000000000000000 R15: 0000000000000000
Jan 05 22:09:16 rakete kernel: ---[ end trace 2604fae69d803995 ]---

I then checked all journals and realized that I have 56 such incidents in the logs since 16 December, when I started to use a 5.10 kernel. I do not see the issue with kernel 5.4.

It always starts with

WARNING: CPU: 12 PID: 5588 at dcn20_validate_bandwidth_fp+0x8d/0xd0 [amdgpu]

It happens either during boot or during shutdown. The warning during shutdown is obviously not a problem in itself.
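
A count like the 56 incidents mentioned above can be reproduced with a few lines of Python. This is only a minimal sketch: it assumes the journal has been exported as plain text in the default short format (for example with `journalctl --no-pager`), and the regular expression is my own guess at the line layout based on the excerpt above. The name `count_warnings` is made up for illustration.

```python
import re
from collections import Counter

# Matches the kernel WARNING line shown in the journal excerpt. The function
# name comes from the log; the rest of the pattern is an assumption about the
# default short journal format ("Mon DD HH:MM:SS host kernel: ...").
WARNING_RE = re.compile(
    r"^(?P<month>\w{3}) (?P<day>\d{2}) \S+ \S+ kernel: "
    r"WARNING: .* at dcn20_validate_bandwidth_fp\+"
)


def count_warnings(lines):
    """Return a Counter mapping 'Mon DD' to the number of matching WARNING lines."""
    per_day = Counter()
    for line in lines:
        match = WARNING_RE.match(line)
        if match:
            per_day[f"{match['month']} {match['day']}"] += 1
    return per_day
```

Passing `sys.stdin` (or an open file) to `count_warnings` gives a per-day tally. Summing the values gives the total. The `RIP:` lines that mention the same function are not counted, because they lack the `WARNING:` prefix.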

WARNING during shutdown
Jan 06 07:37:38 rakete systemd[6158]: Finished Exit the Session.
Jan 06 07:37:38 rakete systemd[6158]: Reached target Exit the Session.
Jan 06 07:37:38 rakete kernel: ------------[ cut here ]------------
Jan 06 07:37:38 rakete kernel: WARNING: CPU: 1 PID: 22 at dcn20_validate_bandwidth_fp+0x8d/0xd0 [amdgpu]
Jan 06 07:37:38 rakete kernel: Modules linked in: btrfs blake2b_generic xor raid6_pq ufs qnx4 hfsplus hfs minix ntfs msdos jfs z3fold cfg8>
Jan 06 07:37:38 rakete kernel:  zlua(OE) zavl(POE) icp(POE) hid_generic usbhid hid xfs libcrc32c crct10dif_pclmul crc32_pclmul ghash_clmul>
Jan 06 07:37:38 rakete kernel: CPU: 1 PID: 22 Comm: kworker/1:0 Tainted: P           OE     5.10.4-xanmod1-cacule-MANJARO #1
Jan 06 07:37:38 rakete kernel: Hardware name: Micro-Star International Co., Ltd. MS-7C37/MPG X570 GAMING EDGE WIFI (MS-7C37), BIOS 1.D2 12>
Jan 06 07:37:38 rakete kernel: Workqueue: events drm_mode_rmfb_work_fn [drm]
Jan 06 07:37:38 rakete kernel: RIP: 0010:dcn20_validate_bandwidth_fp+0x8d/0xd0 [amdgpu]
Jan 06 07:37:38 rakete kernel: Code: 00 7b 35 22 85 14 1f 00 00 75 2f 31 d2 f2 0f 11 85 58 26 00 00 48 89 ee 4c 89 e7 e8 5d f6 ff ff 89 c2>
Jan 06 07:37:38 rakete kernel: RSP: 0018:ffffb2ec80277c78 EFLAGS: 00010246
Jan 06 07:37:38 rakete kernel: RAX: 0000000000000001 RBX: 4079400000000000 RCX: 0000000008cc2001
Jan 06 07:37:38 rakete kernel: RDX: 0000000000000000 RSI: c027877d9f8dc980 RDI: 00000000000311a0
Jan 06 07:37:38 rakete kernel: RBP: ffff9f1eed880000 R08: 0000000000000000 R09: 0000000000000000
Jan 06 07:37:38 rakete kernel: R10: ffff9f1eed8801b8 R11: ffffb2ec80277d2f R12: ffff9f1eadd50000
Jan 06 07:37:38 rakete kernel: R13: ffff9f1eed880000 R14: ffff9f1f058cb300 R15: ffff9f1eadd50000
Jan 06 07:37:38 rakete kernel: FS:  0000000000000000(0000) GS:ffff9f2d6ea40000(0000) knlGS:0000000000000000
Jan 06 07:37:38 rakete kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 06 07:37:38 rakete kernel: CR2: 00007f82f6aa1b70 CR3: 00000005ff610000 CR4: 0000000000350ee0
Jan 06 07:37:38 rakete kernel: Call Trace:
Jan 06 07:37:38 rakete kernel:  dcn20_validate_bandwidth+0x24/0x40 [amdgpu]
Jan 06 07:37:38 rakete kernel:  dc_validate_global_state+0x3c3/0x4c0 [amdgpu]
Jan 06 07:37:38 rakete kernel:  amdgpu_dm_atomic_check+0xc84/0xd60 [amdgpu]
Jan 06 07:37:38 rakete kernel:  drm_atomic_check_only+0x55d/0x7c0 [drm]
Jan 06 07:37:38 rakete kernel:  drm_atomic_commit+0x13/0x50 [drm]
Jan 06 07:37:38 rakete kernel:  drm_framebuffer_remove+0x3f9/0x4b0 [drm]
Jan 06 07:37:38 rakete kernel:  drm_mode_rmfb_work_fn+0x3c/0x50 [drm]
Jan 06 07:37:38 rakete kernel:  process_one_work+0x1d6/0x3a0
Jan 06 07:37:38 rakete kernel:  worker_thread+0x1f9/0x460
Jan 06 07:37:38 rakete kernel:  ? process_one_work+0x3a0/0x3a0
Jan 06 07:37:38 rakete kernel:  kthread+0x181/0x1b0
Jan 06 07:37:38 rakete kernel:  ? __kthread_init_worker+0x50/0x50
Jan 06 07:37:38 rakete kernel:  ret_from_fork+0x22/0x30
Jan 06 07:37:38 rakete kernel: ---[ end trace 97537b8291c83c08 ]---

I searched the web but could not find anything meaningful about this particular dcn20_validate_bandwidth_fp issue.

Any idea what this is all about and how I can fix that?

Is your BIOS up to date?

Sometimes gpu driver updates in latest kernels require a matching BIOS update also.

Yes.

I found this which looks similar:
https://bugzilla.kernel.org/show_bug.cgi?id=210321

But it doesn't help me.

Your BIOS version is a beta. Did you try going back to the previous one?

Version
7C37v1D2(Beta version)
2021-01-04

go back to

Version
7C37v1C
2020-11-04

EDIT: you could check this thread

I just updated the BIOS to the beta version a few days ago. But I have been seeing these messages since 16 December, and that was all with the old stable BIOS.

I checked that thread, but the solution given there does not apply to me. It attributes the problem to GPU temperature. My GPU and the whole PC are mostly idle, with a GPU temperature around 55 °C and CPU temperatures around 43 °C.

You are using the 5.10.4-xanmod1-cacule-MANJARO kernel.

This is a note from xanmod

Note: The current proprietary NVIDIA, VirtualBox, VMware Workstation / Player or some other dkms modules builds may not be fully compatible with Linux 5.10 (EDGE) and RT kernels.

Are you using a custom 5.10 kernel? Would have been good to know in OP.

Do you get the same issues with linux or linux-zen 5.10 repo kernels?

The WARNING message during shutdown happens with all kernels I have tested:

linux 5.10.4.arch2-1
linux-zen 5.10.4.zen2-1
linux 5.10.2-104-tkg-bmq
linux 5.10.4-xanmod1-cacule-MANJARO

But for the more critical issue - the black screen and freezing during boot - I am not so sure anymore. This issue is not so easy to trigger.

Typically what I do to trigger the freeze during boot (and this is how I detected it):

  1. The PC is unattended. Wake up on LAN and login via ssh
  2. Open two remote sessions from laptop to PC: one for my user and one for root.
  3. Run “journalctl -f” in the user session.
  4. Run a reboot in the root session.

The journal from step #3 shows the WARNING. The PC reboots and I can ping it, but ssh is not possible. I have to go to the console and do a REISUB.
Unfortunately this does not happen on every boot. I am still investigating whether the different kernels behave differently.

So far I can confirm that the freeze/black screen during boot happens with

linux-zen 5.10.4.zen2-1 and linux-manjaro-xanmod-cacule 5.10.4-3

I haven’t seen it yet with stock arch kernel. Still trying …

I suspect it’s an incomplete fix for the amdgpu regression which was in 5.10-5.10.3. 5.10.3 was patched by Arch, upstream fixes were supposed to be included in 5.10.4, but I haven’t checked whether both patches Arch included in 5.10.3 are actually present in 5.10.4.

5.10.5 is out so it will be worth checking the changelog…

… and there are indeed three commits to amdgpu-related files:

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/diff/?id=v5.10.5&id2=v5.10.4&dt=2

So - it will be worth trying 5.10.5.

I just found out that it happens with stock arch kernel 5.10.4-arch2-1 as well.

What type of monitor do you use and how is it connected?

Maybe not meaningful for us, but that function is defined in https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/gpu/drm/amd/display/dc/dcn20/dcn20_resource.c?h=v5.10.5

There is also an upcoming revert: https://git.kernel.org/pub/scm/linux/kernel/git/stable/stable-queue.git/diff/queue-5.10/revert-drm-amd-display-fix-memory-leaks-in-s3-resume.patch?id=2f168cf391c8f615a95ab7d5264db880b225ab22

Since this morning I have been using kernel version 5.10.5-xanmod1-cacule-MANJARO. I have not had any freezes in 6 reboots. This is promising. Let's see how that develops.

That was too early. It just happened again:

kernel 5.10.5 log
Jan 07 16:26:55 rakete kernel: ------------[ cut here ]------------
Jan 07 16:26:55 rakete kernel: WARNING: CPU: 13 PID: 5260 at dcn20_validate_bandwidth_fp+0x88/0xd0 [amdgpu]
Jan 07 16:26:55 rakete kernel: Modules linked in: zram z3fold cfg80211 ccm algif_aead des_generic libdes algif_skcipher cmac md4 algif_has>
Jan 07 16:26:55 rakete kernel:  crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel glue_helper crypto_simd cryptd nvme ahci xhc>
Jan 07 16:26:55 rakete kernel: CPU: 13 PID: 5260 Comm: Xorg Tainted: P           OE     5.10.5-xanmod1-cacule-MANJARO #1
Jan 07 16:26:55 rakete kernel: Hardware name: Micro-Star International Co., Ltd. MS-7C37/MPG X570 GAMING EDGE WIFI (MS-7C37), BIOS 1.D2 12>
Jan 07 16:26:55 rakete kernel: RIP: 0010:dcn20_validate_bandwidth_fp+0x88/0xd0 [amdgpu]
Jan 07 16:26:55 rakete kernel: Code: 00 7b 35 22 85 14 1f 00 00 75 2f 31 d2 f2 0f 11 85 58 26 00 00 48 89 ee 4c 89 e7 e8 62 f6 ff ff 89 c2>
Jan 07 16:26:55 rakete kernel: RSP: 0018:ffffba8e01837b48 EFLAGS: 00010246
Jan 07 16:26:55 rakete kernel: RAX: 0000000000000001 RBX: 4079400000000000 RCX: 000000000969a00d
Jan 07 16:26:55 rakete kernel: RDX: 0000000000000000 RSI: 4b07ea3d7756ad4c RDI: 00000000000281a0
Jan 07 16:26:55 rakete kernel: RBP: ffff9fbb63b20000 R08: 0000000000000000 R09: 0000000000000000
Jan 07 16:26:55 rakete kernel: R10: ffff9fbb63b201b8 R11: dead000000000100 R12: ffff9fbac5120000
Jan 07 16:26:55 rakete kernel: R13: ffff9fbb63b20000 R14: ffff9fbac54f8f50 R15: ffff9fbac5120000
Jan 07 16:26:55 rakete kernel: FS:  00007f9056c5e2c0(0000) GS:ffff9fc9aed40000(0000) knlGS:0000000000000000
Jan 07 16:26:55 rakete kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 07 16:26:55 rakete kernel: CR2: 0000557220f9ac78 CR3: 000000013da16000 CR4: 0000000000350ee0
Jan 07 16:26:55 rakete kernel: Call Trace:
Jan 07 16:26:55 rakete kernel:  dcn20_validate_bandwidth+0x1f/0x30 [amdgpu]
Jan 07 16:26:55 rakete kernel:  dc_validate_global_state+0x3be/0x4c0 [amdgpu]
Jan 07 16:26:55 rakete kernel:  amdgpu_dm_atomic_check+0xc7f/0xd60 [amdgpu]
Jan 07 16:26:55 rakete kernel:  drm_atomic_check_only+0x558/0x7c0 [drm]
Jan 07 16:26:55 rakete kernel:  drm_atomic_commit+0xe/0x50 [drm]
Jan 07 16:26:55 rakete kernel:  drm_atomic_connector_commit_dpms+0xd5/0xf0 [drm]
Jan 07 16:26:55 rakete kernel:  drm_mode_obj_set_property_ioctl+0x184/0x3a0 [drm]
Jan 07 16:26:55 rakete kernel:  ? drm_wait_vblank_ioctl+0x2ef/0x790 [drm]
Jan 07 16:26:55 rakete kernel:  ? drm_connector_set_obj_prop+0x80/0x80 [drm]
Jan 07 16:26:55 rakete kernel:  drm_connector_property_set_ioctl+0x32/0x50 [drm]
Jan 07 16:26:55 rakete kernel:  ? mgmt_index_removed+0xb0/0x100 [bluetooth]
Jan 07 16:26:55 rakete kernel:  drm_ioctl_kernel+0xab/0x140 [drm]
Jan 07 16:26:55 rakete kernel:  ? clock_was_set_work+0x20/0x20
Jan 07 16:26:55 rakete kernel:  drm_ioctl+0x20d/0x3c0 [drm]
Jan 07 16:26:55 rakete kernel:  ? drm_connector_set_obj_prop+0x80/0x80 [drm]
Jan 07 16:26:55 rakete kernel:  amdgpu_drm_ioctl+0x44/0x80 [amdgpu]
Jan 07 16:26:55 rakete kernel:  __x64_sys_ioctl+0x7e/0xb0
Jan 07 16:26:55 rakete kernel:  do_syscall_64+0x33/0x80
Jan 07 16:26:55 rakete kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Jan 07 16:26:55 rakete kernel: RIP: 0033:0x7f905762af6b
Jan 07 16:26:55 rakete kernel: Code: 89 d8 49 8d 3c 1c 48 f7 d8 49 39 c4 72 b5 e8 1c ff ff ff 85 c0 78 ba 4c 89 e0 5b 5d 41 5c c3 f3 0f 1e>
Jan 07 16:26:55 rakete kernel: RSP: 002b:00007ffce46c8148 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Jan 07 16:26:55 rakete kernel: RAX: ffffffffffffffda RBX: 00007ffce46c8180 RCX: 00007f905762af6b
Jan 07 16:26:55 rakete kernel: RDX: 00007ffce46c8180 RSI: 00000000c01064ab RDI: 000000000000000d
Jan 07 16:26:55 rakete kernel: RBP: 00000000c01064ab R08: 0000000000000000 R09: 00007f9057a9dd10
Jan 07 16:26:55 rakete kernel: R10: 00007f9057a9dd20 R11: 0000000000000246 R12: 0000562796a8fdc0
Jan 07 16:26:55 rakete kernel: R13: 000000000000000d R14: 0000000000000000 R15: 0000000000000000
Jan 07 16:26:55 rakete kernel: ---[ end trace 7fd56cf2f2bc84ca ]---
Jan 07 16:27:06 rakete kernel: [drm] free PSP TMR buffer
Jan 07 16:28:03 rakete kernel: [drm] PCIE GART of 512M enabled (table at 0x0000008000000000).
Jan 07 16:28:03 rakete kernel: [drm] PSP is resuming...
Jan 07 16:28:03 rakete kernel: [drm] reserve 0x900000 from 0x800f400000 for PSP TMR
Jan 07 16:28:03 rakete kernel: amdgpu 0000:2f:00.0: amdgpu: RAS: optional ras ta ucode is not available
Jan 07 16:28:03 rakete kernel: amdgpu 0000:2f:00.0: amdgpu: RAP: optional rap ta ucode is not available
Jan 07 16:28:03 rakete kernel: amdgpu 0000:2f:00.0: amdgpu: SMU is resuming...
Jan 07 16:28:03 rakete kernel: amdgpu 0000:2f:00.0: amdgpu: smu driver if version = 0x00000036, smu fw if version = 0x00000037, smu fw ver>
Jan 07 16:28:03 rakete kernel: amdgpu 0000:2f:00.0: amdgpu: SMU driver if version not matched
Jan 07 16:28:05 rakete kernel: amdgpu 0000:2f:00.0: amdgpu: failed send message: EnableAllSmuFeatures (6)         param: 0x00000000 respon>
Jan 07 16:28:05 rakete kernel: amdgpu 0000:2f:00.0: amdgpu: Failed to enable requested dpm features!
Jan 07 16:28:05 rakete kernel: amdgpu 0000:2f:00.0: amdgpu: Failed to setup smc hw!
Jan 07 16:28:05 rakete kernel: [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <smu> failed -62
Jan 07 16:28:05 rakete kernel: amdgpu 0000:2f:00.0: amdgpu: amdgpu_device_ip_resume failed (-62).
Jan 07 16:28:05 rakete kernel: snd_hda_intel 0000:2f:00.1: refused to change power state from D3hot to D0
Jan 07 16:28:05 rakete kernel: amdgpu: Move buffer fallback to memcpy unavailable
Jan 07 16:28:05 rakete kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to process the buffer list -19!
Jan 07 16:28:05 rakete kernel: snd_hda_intel 0000:2f:00.1: CORB reset timeout#2, CORBRP = 65535
Jan 07 16:28:16 rakete kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=131, emitted seq=133
Jan 07 16:28:16 rakete kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid 0 thread  pid 0
Jan 07 16:28:16 rakete kernel: amdgpu 0000:2f:00.0: amdgpu: GPU reset begin!
Jan 07 16:28:16 rakete kernel: BUG: unable to handle page fault for address: ffff9fbacda7a000
Jan 07 16:28:16 rakete kernel: #PF: supervisor write access in kernel mode
Jan 07 16:28:16 rakete kernel: #PF: error_code(0x0003) - permissions violation
Jan 07 16:28:16 rakete kernel: PGD 7d2401067 P4D 7d2401067 PUD 101190063 PMD 110aff063 PTE 800000010da7a161
Jan 07 16:28:16 rakete kernel: Oops: 0003 [#1] PREEMPT SMP NOPTI
Jan 07 16:28:16 rakete kernel: CPU: 14 PID: 218 Comm: kworker/14:1 Tainted: P        W  OE     5.10.5-xanmod1-cacule-MANJARO #1
Jan 07 16:28:16 rakete kernel: Hardware name: Micro-Star International Co., Ltd. MS-7C37/MPG X570 GAMING EDGE WIFI (MS-7C37), BIOS 1.D2 12>
Jan 07 16:28:16 rakete kernel: Workqueue: events drm_sched_job_timedout [gpu_sched]
Jan 07 16:28:16 rakete kernel: RIP: 0010:kfd_gtt_sa_free+0x34/0x80 [amdgpu]
Jan 07 16:28:16 rakete kernel: Code: f5 53 48 89 fb 0f 1f 44 00 00 4c 8d a3 70 01 00 00 4c 89 e7 e8 7d 3d 99 c7 8b 45 00 3b 45 04 77 16 48>
Jan 07 16:28:16 rakete kernel: RSP: 0018:ffffba8e0073fd50 EFLAGS: 00010286
Jan 07 16:28:16 rakete kernel: RAX: 0000000046c7e000 RBX: ffff9fbac4cc1c00 RCX: 0000000046c7e000
Jan 07 16:28:16 rakete kernel: RDX: ffff9fbac4cea400 RSI: ffff9fbac63b90e0 RDI: ffff9fbac4cc1d70
Jan 07 16:28:16 rakete kernel: RBP: ffff9fbac63b90e0 R08: 0000000000000000 R09: 0000000000009ffb
Jan 07 16:28:16 rakete kernel: R10: ffffba8e0073fab0 R11: ffffba8e0073faa8 R12: ffff9fbac4cc1d70
Jan 07 16:28:16 rakete kernel: R13: ffff9fbac5140000 R14: ffff9fbac1a7b000 R15: ffff9fbac1a7b0b8
Jan 07 16:28:16 rakete kernel: FS:  0000000000000000(0000) GS:ffff9fc9aed80000(0000) knlGS:0000000000000000
Jan 07 16:28:16 rakete kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 07 16:28:16 rakete kernel: CR2: ffff9fbacda7a000 CR3: 000000013184c000 CR4: 0000000000350ee0
Jan 07 16:28:16 rakete kernel: Call Trace:
Jan 07 16:28:16 rakete kernel:  stop_cpsch+0x91/0xc0 [amdgpu]
Jan 07 16:28:16 rakete kernel:  kgd2kfd_pre_reset+0x51/0x70 [amdgpu]
Jan 07 16:28:16 rakete kernel:  amdgpu_device_gpu_recover.cold+0x36e/0x98a [amdgpu]
Jan 07 16:28:16 rakete kernel:  amdgpu_job_timedout+0x117/0x140 [amdgpu]
Jan 07 16:28:16 rakete kernel:  drm_sched_job_timedout+0x5f/0xe0 [gpu_sched]
Jan 07 16:28:16 rakete kernel:  process_one_work+0x1ae/0x300
Jan 07 16:28:16 rakete kernel:  worker_thread+0x48/0x460
Jan 07 16:28:16 rakete kernel:  ? cancel_delayed_work+0xa0/0xa0
Jan 07 16:28:16 rakete kernel:  kthread+0x177/0x1a0
Jan 07 16:28:16 rakete kernel:  ? __kthread_init_worker+0x50/0x50
Jan 07 16:28:16 rakete kernel:  ret_from_fork+0x22/0x30
Jan 07 16:28:16 rakete kernel: Modules linked in: zram z3fold cfg80211 ccm algif_aead des_generic libdes algif_skcipher cmac md4 algif_has>
Jan 07 16:28:16 rakete kernel:  crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel glue_helper crypto_simd cryptd nvme ahci xhc>
Jan 07 16:28:16 rakete kernel: CR2: ffff9fbacda7a000
Jan 07 16:28:16 rakete kernel: ---[ end trace 7fd56cf2f2bc84cb ]---
Jan 07 16:28:16 rakete kernel: RIP: 0010:kfd_gtt_sa_free+0x34/0x80 [amdgpu]
Jan 07 16:28:16 rakete kernel: Code: f5 53 48 89 fb 0f 1f 44 00 00 4c 8d a3 70 01 00 00 4c 89 e7 e8 7d 3d 99 c7 8b 45 00 3b 45 04 77 16 48>
Jan 07 16:28:16 rakete kernel: RSP: 0018:ffffba8e0073fd50 EFLAGS: 00010286
Jan 07 16:28:16 rakete kernel: RAX: 0000000046c7e000 RBX: ffff9fbac4cc1c00 RCX: 0000000046c7e000
Jan 07 16:28:16 rakete kernel: RDX: ffff9fbac4cea400 RSI: ffff9fbac63b90e0 RDI: ffff9fbac4cc1d70
Jan 07 16:28:16 rakete kernel: RBP: ffff9fbac63b90e0 R08: 0000000000000000 R09: 0000000000009ffb
Jan 07 16:28:16 rakete kernel: R10: ffffba8e0073fab0 R11: ffffba8e0073faa8 R12: ffff9fbac4cc1d70
Jan 07 16:28:16 rakete kernel: R13: ffff9fbac5140000 R14: ffff9fbac1a7b000 R15: ffff9fbac1a7b0b8
Jan 07 16:28:16 rakete kernel: FS:  0000000000000000(0000) GS:ffff9fc9aed80000(0000) knlGS:0000000000000000
Jan 07 16:28:16 rakete kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 07 16:28:16 rakete kernel: CR2: ffff9fbacda7a000 CR3: 000000013184c000 CR4: 0000000000350ee0
Jan 07 16:28:18 rakete sshd[16695]: Accepted publickey for matthias from 192.168.132.20 port 58824 ssh2: RSA SHA256:xxxx>
Jan 07 16:28:18 rakete kernel: BUG: kernel NULL pointer dereference, address: 0000000000000000
Jan 07 16:28:18 rakete kernel: #PF: supervisor instruction fetch in kernel mode
Jan 07 16:28:18 rakete kernel: #PF: error_code(0x0010) - not-present page
Jan 07 16:28:18 rakete kernel: PGD 0 P4D 0 
Jan 07 16:28:18 rakete kernel: Oops: 0010 [#2] PREEMPT SMP NOPTI
Jan 07 16:28:18 rakete kernel: CPU: 2 PID: 3423 Comm: systemd-network Tainted: P      D W  OE     5.10.5-xanmod1-cacule-MANJARO #1
Jan 07 16:28:18 rakete kernel: Hardware name: Micro-Star International Co., Ltd. MS-7C37/MPG X570 GAMING EDGE WIFI (MS-7C37), BIOS 1.D2 12>
Jan 07 16:28:18 rakete kernel: RIP: 0010:0x0
Jan 07 16:28:18 rakete kernel: Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
Jan 07 16:28:18 rakete kernel: RSP: 0018:ffffba8e00a07e50 EFLAGS: 00010282
Jan 07 16:28:18 rakete kernel: RAX: 0000000000000000 RBX: 000000007fff0000 RCX: 000000007fff0000
Jan 07 16:28:18 rakete kernel: RDX: 000000007fff0000 RSI: ffffba8e003fe038 RDI: ffffba8e00a07e98
Jan 07 16:28:18 rakete kernel: RBP: 000000007fff0000 R08: 0000000000000000 R09: 0000000000000000
Jan 07 16:28:18 rakete kernel: R10: 0000000000000011 R11: 0000000000000000 R12: ffffba8e00a07e98
Jan 07 16:28:18 rakete kernel: R13: ffffba8e00a07f58 R14: ffffba8e003fe000 R15: ffff9fbacda7b5a0
Jan 07 16:28:18 rakete kernel: FS:  00007fbc1f150ec0(0000) GS:ffff9fc9aea80000(0000) knlGS:0000000000000000
Jan 07 16:28:18 rakete kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 07 16:28:18 rakete kernel: CR2: ffffffffffffffd6 CR3: 00000001057ea000 CR4: 0000000000350ee0
Jan 07 16:28:18 rakete kernel: Call Trace:
Jan 07 16:28:18 rakete kernel:  __seccomp_filter+0x81/0x690
Jan 07 16:28:18 rakete kernel:  syscall_enter_from_user_mode+0x6e/0x1d0
Jan 07 16:28:18 rakete kernel:  ? exit_to_user_mode_prepare+0x65/0x110
Jan 07 16:28:18 rakete kernel:  do_syscall_64+0xf/0x80
Jan 07 16:28:18 rakete kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Jan 07 16:28:18 rakete kernel: RIP: 0033:0x7fbc1fab7510
Jan 07 16:28:18 rakete kernel: Code: 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 1d 45 31 c9 45 31>
Jan 07 16:28:18 rakete kernel: RSP: 002b:00007ffc77bc2f48 EFLAGS: 00000246 ORIG_RAX: 000000000000002d
Jan 07 16:28:18 rakete kernel: RAX: ffffffffffffffda RBX: 00007ffc77bc3010 RCX: 00007fbc1fab7510
Jan 07 16:28:18 rakete kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000011
Jan 07 16:28:18 rakete kernel: RBP: 0000000000000011 R08: 0000000000000000 R09: 0000000000000000
Jan 07 16:28:18 rakete kernel: R10: 0000000000000022 R11: 0000000000000246 R12: 000055631ea82d00
Jan 07 16:28:18 rakete kernel: R13: 000055631ea73e40 R14: 0000000000000011 R15: 0000000000000000
Jan 07 16:28:18 rakete kernel: Modules linked in: zram z3fold cfg80211 ccm algif_aead des_generic libdes algif_skcipher cmac md4 algif_has>
Jan 07 16:28:18 rakete kernel:  crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel glue_helper crypto_simd cryptd nvme ahci xhc>
Jan 07 16:28:18 rakete kernel: CR2: 0000000000000000
Jan 07 16:28:18 rakete kernel: ---[ end trace 7fd56cf2f2bc84cc ]---
Jan 07 16:28:18 rakete kernel: RIP: 0010:kfd_gtt_sa_free+0x34/0x80 [amdgpu]
Jan 07 16:28:18 rakete kernel: Code: f5 53 48 89 fb 0f 1f 44 00 00 4c 8d a3 70 01 00 00 4c 89 e7 e8 7d 3d 99 c7 8b 45 00 3b 45 04 77 16 48>
Jan 07 16:28:18 rakete kernel: RSP: 0018:ffffba8e0073fd50 EFLAGS: 00010286
Jan 07 16:28:18 rakete kernel: RAX: 0000000046c7e000 RBX: ffff9fbac4cc1c00 RCX: 0000000046c7e000
Jan 07 16:28:18 rakete kernel: RDX: ffff9fbac4cea400 RSI: ffff9fbac63b90e0 RDI: ffff9fbac4cc1d70
Jan 07 16:28:18 rakete kernel: RBP: ffff9fbac63b90e0 R08: 0000000000000000 R09: 0000000000009ffb
Jan 07 16:28:18 rakete kernel: R10: ffffba8e0073fab0 R11: ffffba8e0073faa8 R12: ffff9fbac4cc1d70
Jan 07 16:28:18 rakete kernel: R13: ffff9fbac5140000 R14: ffff9fbac1a7b000 R15: ffff9fbac1a7b0b8
Jan 07 16:28:18 rakete kernel: FS:  00007fbc1f150ec0(0000) GS:ffff9fc9aea80000(0000) knlGS:0000000000000000
Jan 07 16:28:18 rakete kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 07 16:28:18 rakete kernel: CR2: ffffffffffffffd6 CR3: 00000001057ea000 CR4: 0000000000350ee0
Jan 07 16:28:18 rakete kernel: note: systemd-network[3423] exited with preempt_count 1
Jan 07 16:28:18 rakete kernel: BUG: kernel NULL pointer dereference, address: 0000000000000060
Jan 07 16:28:18 rakete kernel: #PF: supervisor read access in kernel mode
Jan 07 16:28:18 rakete kernel: #PF: error_code(0x0000) - not-present page
Jan 07 16:28:18 rakete kernel: PGD 0 P4D 0

Not even a REISUB was able to recover the PC. A hard reset was needed. … -> I am back to 5.4 LTS :wink:

:point_up_2:

Also, what refresh rate are you running at?

The monitor is an Eizo CS2730 connected via a DisplayPort cable and running at 60 Hz.

PS
I switched to HDMI now. Seeing if that makes a difference.

Probably because the kernel oopsed:

but only after amdgpu couldn’t be reset or recovered.

Looks indeed like a kernel bug…

It happens with HDMI and stock arch kernel 5.10.5 as well.

I have a theory:
This seems to happen after the GPU has been idle for 10 minutes. I do not see freezes during interactive sessions. But when I log out and leave the computer idle for 10 minutes, it freezes when I try to use it again. The same is true for the unattended PC, wake-on-LAN and ssh use case I described earlier. In that case the GPU is idle too, and after about 10 minutes the computer freezes.

How can I prevent the sleep/wake up cycle? What could I do to keep the GPU awake?
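
One way to test this theory is to check whether the kernel is allowed to runtime-suspend the GPU at all. The sketch below only reads the standard PCI power attributes in sysfs (`power/control` and `power/runtime_status`). It assumes the GPU sits at PCI address `0000:2f:00.0`, as in the log above, and the function names are made up for illustration. Whether runtime power management is actually what triggers the failed resume here is an unconfirmed assumption.

```python
from pathlib import Path


def read_runtime_pm(pci_address, sysfs_root="/sys/bus/pci/devices"):
    """Return the runtime power management attributes of one PCI device.

    Reads the sysfs files power/control ("auto" or "on") and
    power/runtime_status; files that cannot be read are reported as None.
    """
    power_dir = Path(sysfs_root) / pci_address / "power"
    state = {}
    for name in ("control", "runtime_status"):
        try:
            state[name] = (power_dir / name).read_text().strip()
        except OSError:
            state[name] = None
    return state


def runtime_pm_allowed(state):
    """True if the kernel may runtime-suspend the device (control == "auto")."""
    return state.get("control") == "auto"
```

Calling `read_runtime_pm("0000:2f:00.0")` while the machine sits idle shows whether the device ever reports `suspended`. If `control` is `auto`, writing `on` to that file as root keeps the device from being runtime-suspended until the next boot. The amdgpu module also has a `runpm` parameter. Both are things to try, not a confirmed fix for this bug.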