Started from a clean install with the bare minimum of packages needed to get a test game running (Windrose, though the behaviour occurs in a lot of others). I acquired a 7900 XTX as there seemed to be positive responses to using it under Linux; only now am I finding that it presents this error:
May 06 21:50:42 fluffys-pc kernel: amdgpu 0000:03:00.0: [drm] device wedged, but recovered through reset
May 06 21:50:42 fluffys-pc kernel: amdgpu 0000:03:00.0: Ring gfx_0.0.0 reset succeeded
May 06 21:50:42 fluffys-pc kernel: amdgpu 0000:03:00.0: Starting gfx_0.0.0 ring reset
May 06 21:50:42 fluffys-pc kernel: amdgpu 0000:03:00.0: Process GameThread pid 6845 thread vkd3d_queue pid 6892
May 06 21:50:42 fluffys-pc kernel: amdgpu 0000:03:00.0: ring gfx_0.0.0 timeout, signaled seq=1885012, emitted seq=1885014
May 06 21:50:42 fluffys-pc kernel: amdgpu 0000:03:00.0: [drm] Check your /sys/class/drm/card1/device/devcoredump/data
May 06 21:50:42 fluffys-pc kernel: amdgpu 0000:03:00.0: [drm] AMDGPU device coredump file has been created
Does anyone know of a resolution to this fault, or what further information I could provide to help resolve it? It’s getting pretty tedious. I have seen some suggestions of mesa-git, but there seem to be caveats to that which I am unable to grasp.
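On the question of extra information: the log itself points at /sys/class/drm/card1/device/devcoredump/data. Below is a minimal sketch for copying that dump somewhere persistent so it can be attached to a bug report. The function name, destination directory and output file name are placeholders I made up; reading the file will most likely need root, and my understanding is that the kernel discards the dump after a short while, so it is worth copying soon after a hang.

```python
import shutil
from pathlib import Path


def save_coredump(source, dest_dir, name="amdgpu-devcoredump.bin"):
    """Copy a devcoredump data file into dest_dir; return the new path or None."""
    source = Path(source)
    if not source.exists():
        # No dump is pending (nothing hung yet, or it was already cleared).
        return None
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / name  # hypothetical file name, choose your own
    # Stream the content: sysfs files do not always report a useful size.
    with source.open("rb") as src, dest.open("wb") as dst:
        shutil.copyfileobj(src, dst)
    return dest
```

Usage would be something like `save_coredump("/sys/class/drm/card1/device/devcoredump/data", "/var/tmp/gpu-dumps")`, where the second path is only an example.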
Thanks for sharing those details. I don’t think the first one tells us all that much. The “device wedged, but recovered through reset” line is essentially letting us know that the GPU had a fault (a hang) and reset itself to recover.
Unfortunately, what’s not mentioned is what caused it or why, aside from perhaps the process ID.
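What little the log does say can be pulled out mechanically. Here is a rough sketch that extracts the hung ring, the owning process and the gap between the emitted and signaled sequence numbers from lines like the ones quoted above. The regular expressions are written against that exact wording and may not match other kernel versions; the function and key names are my own.

```python
import re

# Patterns follow the wording of the journal lines quoted in this thread.
TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)
PROCESS_RE = re.compile(
    r"Process (?P<process>\S+) pid (?P<pid>\d+) "
    r"thread (?P<thread>\S+) pid (?P<tid>\d+)"
)


def summarize_hang(journal_text):
    """Collect ring, process and sequence-number gap from amdgpu log lines."""
    summary = {}
    for line in journal_text.splitlines():
        m = TIMEOUT_RE.search(line)
        if m:
            summary["ring"] = m["ring"]
            # Submissions emitted to the ring that never signaled completion.
            summary["unsignaled"] = int(m["emitted"]) - int(m["signaled"])
        m = PROCESS_RE.search(line)
        if m:
            summary["process"] = m["process"]
            summary["pid"] = int(m["pid"])
            summary["thread"] = m["thread"]
    return summary


SAMPLE = """\
May 06 21:50:42 fluffys-pc kernel: amdgpu 0000:03:00.0: Process GameThread pid 6845 thread vkd3d_queue pid 6892
May 06 21:50:42 fluffys-pc kernel: amdgpu 0000:03:00.0: ring gfx_0.0.0 timeout, signaled seq=1885012, emitted seq=1885014
"""

print(summarize_hang(SAMPLE))
```

For the first log this reports the gfx_0.0.0 ring, the GameThread process and its vkd3d_queue thread, which at least confirms the hang comes from the game's D3D12 translation path rather than from the desktop.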
You mentioned you “acquired” this 7900XTX. Is it 2nd hand?
I’d also suggest simply re-seating it, being mindful to remove any possible dust in the slot, or on the card before re-seating it firmly, but carefully. Also re-check your power connectors for it.
Yeah, that’s kinda why I’m on the forums hoping someone has seen such a useless error that might know anything at all. The internet so far has proven incredibly useless, describing behaviours alongside the error message. Essentially when running any recent dx12 software, it kills itself after a given period.
Just incredibly tired; was bought from the store.
I am unfamiliar with hardware to know how to conduct these checks. I’ll look it up.
Yes, I paid the local computer store to install it to avoid any strife from poor installation (I have not worked with them prior to this, but by all reports they’re competent). However, I have just conducted the checks outlined and I am going to try to establish the fault again.
This behaviour has been persistent for about a week, and I have given up and legged it to the forums, for clarity. Hence the nuked installation to have a clean slate to figure out the issue from.
All good. Just good to know there’s that fallback available, and also helps to know if a mate or family member might have done it, in which case, it’s worth re-checking.
I’d hope a store knew what they were doing though.
Given it was freshly installed, sometimes these simple checks can solve the issue without wasting hours of time fiddling with drivers and packages that actually had nothing to do with it.
So to clarify, this issue has persisted from before to after a fresh install? (That’d be another reason to check hardware first.)
Spat out a bit more this time that doesn’t seem to say much else, but here it is anyway.
May 06 23:10:47 fluffys-pc kwin_wayland_wrapper[1566]: amdgpu: The CS has cancelled because the context is lost. This context is innocent.
May 06 23:10:46 fluffys-pc kernel: amdgpu 0000:03:00.0: [drm] device wedged, but recovered through reset
May 06 23:10:46 fluffys-pc kernel: amdgpu 0000:03:00.0: [drm] *ERROR* Failed to initialize parser -125!
May 06 23:10:46 fluffys-pc kernel: amdgpu 0000:03:00.0: GPU reset(1) succeeded!
May 06 23:10:46 fluffys-pc firefox[4979]: amdgpu: The CS has cancelled because the context is lost. This context is innocent.
May 06 23:10:46 fluffys-pc steam[5617]: radv/amdgpu: The CS has been cancelled because the context is lost. This context is innocent.
May 06 23:10:46 fluffys-pc kernel: amdgpu 0000:03:00.0: ring mes_kiq_3.1.0 uses VM inv eng 14 on hub 0
May 06 23:10:46 fluffys-pc kernel: amdgpu 0000:03:00.0: ring jpeg_dec uses VM inv eng 4 on hub 8
May 06 23:10:46 fluffys-pc kernel: amdgpu 0000:03:00.0: ring vcn_unified_1 uses VM inv eng 1 on hub 8
May 06 23:10:46 fluffys-pc kernel: amdgpu 0000:03:00.0: ring vcn_unified_0 uses VM inv eng 0 on hub 8
May 06 23:10:46 fluffys-pc kernel: amdgpu 0000:03:00.0: ring sdma1 uses VM inv eng 13 on hub 0
May 06 23:10:46 fluffys-pc kernel: amdgpu 0000:03:00.0: ring sdma0 uses VM inv eng 12 on hub 0
May 06 23:10:46 fluffys-pc kernel: amdgpu 0000:03:00.0: ring comp_1.3.1 uses VM inv eng 11 on hub 0
May 06 23:10:46 fluffys-pc kernel: amdgpu 0000:03:00.0: ring comp_1.2.1 uses VM inv eng 10 on hub 0
May 06 23:10:46 fluffys-pc kernel: amdgpu 0000:03:00.0: ring comp_1.1.1 uses VM inv eng 9 on hub 0
May 06 23:10:46 fluffys-pc kernel: amdgpu 0000:03:00.0: ring comp_1.0.1 uses VM inv eng 8 on hub 0
May 06 23:10:46 fluffys-pc kernel: amdgpu 0000:03:00.0: ring comp_1.3.0 uses VM inv eng 7 on hub 0
May 06 23:10:46 fluffys-pc kernel: amdgpu 0000:03:00.0: ring comp_1.2.0 uses VM inv eng 6 on hub 0
May 06 23:10:46 fluffys-pc kernel: amdgpu 0000:03:00.0: ring comp_1.1.0 uses VM inv eng 4 on hub 0
May 06 23:10:46 fluffys-pc kernel: amdgpu 0000:03:00.0: ring comp_1.0.0 uses VM inv eng 1 on hub 0
May 06 23:10:46 fluffys-pc kernel: amdgpu 0000:03:00.0: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
May 06 23:10:46 fluffys-pc kernel: amdgpu 0000:03:00.0: [drm] DMUB hardware initialized: version=0x07002F00
May 06 23:10:46 fluffys-pc kernel: amdgpu 0000:03:00.0: SMU is resumed successfully!
May 06 23:10:46 fluffys-pc kernel: amdgpu 0000:03:00.0: SMU driver if version not matched
May 06 23:10:46 fluffys-pc kernel: amdgpu 0000:03:00.0: smu driver if version = 0x0000003d, smu fw if version = 0x00000040, smu fw program = 0, smu fw version = 0x004e8300 (78.131.0)
May 06 23:10:46 fluffys-pc kernel: amdgpu 0000:03:00.0: SMU is resuming...
May 06 23:10:46 fluffys-pc kernel: amdgpu 0000:03:00.0: SECUREDISPLAY: optional securedisplay ta ucode is not available
May 06 23:10:46 fluffys-pc kernel: amdgpu 0000:03:00.0: RAP: optional rap ta ucode is not available
May 06 23:10:46 fluffys-pc kernel: amdgpu 0000:03:00.0: reserve 0x1300000 from 0x85fc000000 for PSP TMR
May 06 23:10:46 fluffys-pc kernel: amdgpu 0000:03:00.0: PSP is resuming...
May 06 23:10:46 fluffys-pc kernel: amdgpu 0000:03:00.0: VRAM is lost due to GPU reset!
May 06 23:10:46 fluffys-pc kernel: amdgpu 0000:03:00.0: [drm] PCIE GART of 512M enabled (table at 0x0000008000300000).
May 06 23:10:46 fluffys-pc kernel: amdgpu 0000:03:00.0: GPU reset succeeded, trying to resume
May 06 23:10:45 fluffys-pc kernel: amdgpu 0000:03:00.0: GPU smu mode1 reset
May 06 23:10:45 fluffys-pc kernel: amdgpu 0000:03:00.0: GPU mode1 reset
May 06 23:10:45 fluffys-pc kernel: amdgpu 0000:03:00.0: MODE1 reset
May 06 23:10:45 fluffys-pc kernel: [drm:gfx_v11_0_hw_fini [amdgpu]] *ERROR* failed to halt cp gfx
May 06 23:10:45 fluffys-pc kernel: amdgpu 0000:03:00.0: failed to unmap legacy queue
May 06 23:10:45 fluffys-pc kernel: amdgpu 0000:03:00.0: MES failed to respond to msg=REMOVE_QUEUE
May 06 23:10:43 fluffys-pc kernel: amdgpu 0000:03:00.0: GPU reset begin!. Source: 1
May 06 23:10:43 fluffys-pc kernel: amdgpu 0000:03:00.0: Ring gfx_0.0.0 reset failed
May 06 23:10:43 fluffys-pc kernel: amdgpu 0000:03:00.0: The CPFW hasn't support pipe reset yet.
May 06 23:10:43 fluffys-pc kernel: amdgpu 0000:03:00.0: reset via MES failed and try pipe reset -110
May 06 23:10:43 fluffys-pc kernel: amdgpu 0000:03:00.0: failed to reset legacy queue
May 06 23:10:43 fluffys-pc kernel: amdgpu 0000:03:00.0: MES failed to respond to msg=RESET
May 06 23:10:41 fluffys-pc kernel: [drm:gfx_v11_0_bad_op_irq [amdgpu]] *ERROR* Illegal opcode in command stream
May 06 23:10:41 fluffys-pc kernel: amdgpu 0000:03:00.0: Starting gfx_0.0.0 ring reset
May 06 23:10:41 fluffys-pc kernel: amdgpu 0000:03:00.0: Process GameThread pid 6806 thread vkd3d_queue pid 6849
May 06 23:10:41 fluffys-pc kernel: amdgpu 0000:03:00.0: ring gfx_0.0.0 timeout, signaled seq=9863636, emitted seq=9863638
May 06 23:10:41 fluffys-pc kernel: amdgpu 0000:03:00.0: [drm] Check your /sys/class/drm/card1/device/devcoredump/data
May 06 23:10:41 fluffys-pc kernel: amdgpu 0000:03:00.0: [drm] AMDGPU device coredump file has been created
May 06 23:10:41 fluffys-pc kernel: amdgpu 0000:03:00.0: Dumping IP State Completed
May 06 23:10:41 fluffys-pc kernel: amdgpu 0000:03:00.0: Dumping IP State
Going to check bios update rubbish now and see if that brings it up to snuff. Also did not resolve it.
Would it be possible for you to please try booting into an LTS kernel, i.e. 6.18.x or 6.12.x, and see if you still face the issue? Please note that Linux LTS 6.12.x is available from the AUR and will require you to compile it, which is going to take a lot of time. On my under-powered system it took a few hours. This will show whether the problem is specific to Linux kernel 7.0.3 or affects other branches too.
Secondly, can you please try setting the environment variable
RADV_DEBUG=nocompute
reboot the system and then check if you are still facing the same issue while running firefox and steam?
Try this with the Linux LTS 6.18.x kernel as well as Linux LTS 6.12.x kernel too.
I am assuming that you have not overclocked your AMD GPU and that it is running on its factory settings. If you have, can you please reset it to the default values. To check the various frequency values, please look at the files present in the directory /sys/class/drm/card1/device/ or /sys/class/drm/card0/device/. You are looking for the files pp_power_profile_mode, pp_od_clk_voltage and power_dpm_force_performance_level.
You are using an Intel i7-13700K processor, which has an inbuilt GPU, i.e. Intel UHD Graphics 770. So basically you have two GPUs. Now, the inxi command output given above does not list the Intel GPU, i.e. UHD Graphics 770. Just to confirm that this is the case, would it be possible for you to please run the following command
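If it helps, here is a small sketch that dumps those three files in one go. It assumes the card is card1, as in the log above; the function name is made up, and on some setups pp_od_clk_voltage may be missing or unreadable, which the sketch reports instead of failing.

```python
from pathlib import Path

# The three sysfs files named in the post above.
POWER_FILES = (
    "power_dpm_force_performance_level",
    "pp_power_profile_mode",
    "pp_od_clk_voltage",
)


def read_power_settings(device_dir):
    """Return {file name: content} for the power-related sysfs files."""
    settings = {}
    for name in POWER_FILES:
        path = Path(device_dir) / name
        try:
            settings[name] = path.read_text().strip()
        except OSError as exc:
            # Missing file or insufficient permissions: record it and move on.
            settings[name] = f"<unreadable: {exc.strerror}>"
    return settings


if __name__ == "__main__":
    # Assumption: the AMD card is card1; adjust to card0 if needed.
    for name, value in read_power_settings("/sys/class/drm/card1/device").items():
        print(f"--- {name} ---")
        print(value)
```

On a card left at defaults I would expect power_dpm_force_performance_level to read `auto`; anything else suggests something has changed the performance level.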
lspci -k
and check that only a single VGA/Graphics device is listed and that it is AMD. Also check what is listed in the output as the kernel driver in use. For Intel it will be i915 or equivalent, while for AMD it will be amdgpu.
There are 11 newer BIOS updates for this motherboard. I suggest you download the latest UEFI BIOS update and use M Flash in the BIOS settings screen to update it.
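To make that check less error-prone, here is a rough sketch that picks the graphics devices and their drivers out of `lspci -k` output. The sample text is invented for illustration and is not the output from this system. The sketch also looks at "Display controller" and "3D controller" entries, on the assumption that a secondary GPU may be listed under one of those classes rather than as VGA.

```python
import re

# lspci class names under which a GPU can appear (assumption: no -nn flag).
GPU_CLASSES = ("VGA compatible controller", "Display controller", "3D controller")
DEVICE_RE = re.compile(r"^(?P<slot>\S+) (?P<cls>[^:]+): (?P<name>.+)$")


def gpus_from_lspci(output):
    """Return a list of GPU entries with slot, class, name and driver."""
    gpus = []
    current = None
    for line in output.splitlines():
        if line and not line[0].isspace():
            # An unindented line starts a new device block.
            current = None
            m = DEVICE_RE.match(line)
            if m and m["cls"] in GPU_CLASSES:
                current = {
                    "slot": m["slot"],
                    "class": m["cls"],
                    "name": m["name"],
                    "driver": None,
                }
                gpus.append(current)
        elif current is not None and "Kernel driver in use:" in line:
            current["driver"] = line.split(":", 1)[1].strip()
    return gpus


# Illustrative sample only, not taken from the machine in this thread.
SAMPLE = """\
00:02.0 Display controller: Example Intel iGPU
\tKernel driver in use: i915
03:00.0 VGA compatible controller: Example AMD GPU
\tKernel driver in use: amdgpu
03:00.1 Audio device: Example HDMI audio
\tKernel driver in use: snd_hda_intel
"""

for gpu in gpus_from_lspci(SAMPLE):
    print(gpu["slot"], gpu["class"], gpu["driver"])
```

To run it on real data, pass in the text returned by `subprocess.run(["lspci", "-k"], capture_output=True, text=True).stdout`.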
The behaviour became so prevalent that I had to uninstall the hardware until I had more time to debug it, I will be reinstalling it shortly to continue debugging it. Give me a couple hours to get back to you on this.
This update was conducted but the behaviour persisted enough that I had to swap back to my old card until I had time to sit down again and deal with it.
Response to steps:
lspci -k returns only one VGA/Graphics line, pointing to AMD. Kernel driver in use lists both i915 and amdgpu.
Using LTS 6.12 resulted in the same faults. Setting RADV_DEBUG=nocompute caused the failure to present more rapidly.
Using LTS 6.18 resulted in the same behaviours as 6.12.
Going to try rolling back drivers to see if that does anything. Well, that’s not working as it seems I’m misremembering how.
If the output of lspci -k shows only one GPU, i.e. AMD, then the driver in use cannot be i915. i915 is an Intel driver. You are running a 13th Gen Intel Core i7-13700K CPU, which has its own inbuilt GPU (iGPU), i.e. Intel® UHD Graphics 770.
Do you have a udev rule or some service or some timer which forces the Intel GPU to be switched off or disabled?
Good, so you have both the AMD GPU and the Intel iGPU installed and configured with the proper drivers running. The Intel iGPU is using the i915 xe drivers. The AMD GPU is using amdgpu.
Both of them are active.
Have you updated all the firmware as requested by @ricklinux ?
This shows something interesting. It shows the AMD GPU has been wedged, that is, the GPU got reset due to some issue. More information on this over here. AMD and Intel GPU drivers have both implemented this capability to inform the user about a wedged GPU.
@PocketDragon what you will need to do is look at the logs above the timestamp May 06 23:10:47 and find out why the AMD GPU is getting wedged. Typical reasons might be:
Faulty Drivers and firmware - So update to latest AMD drivers and BIOS firmware.
Monitor the VRAM utilization of AMD GPU and see if it spikes.
Monitor the temp of the AMD GPU. If it spikes or goes on the higher side then that also might lead to this problem. In this case you will have to look at cooling the GPU. Or see if there is any dust or blockage in the vent and air gaps. This looks like a desktop, so be prepared that you might have to go for better or bigger cooling solution.
Intel CPU/GPUs typically can handle for some time 90-100 degrees Centigrade temp. AMD CPU/GPU typically max out at 80-85 degrees Centigrade.
Reduce the refresh rate of the games/Steam and of the monitor to, say, 60 Hz or 90 Hz. See at which refresh rate the issue stops occurring.
Also refer to this Reddit thread and turn on “enable unified gpu usage monitoring” in the MSI firmware that you have. This will have to be done AFTER you have updated your BIOS firmware, not before. If this does not work, then please set “unified gpu usage monitoring” back to its original value.
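For the VRAM and temperature points above, a minimal sketch of reading both from sysfs is below. It assumes the amdgpu files mem_info_vram_used and mem_info_vram_total (in bytes) and a hwmon temp1_input file (in millidegrees Celsius) exist under the card's device directory; the function names are my own, and any missing file simply yields None.

```python
from pathlib import Path


def read_int(path):
    """Read an integer from a sysfs-style file, or None if unavailable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def gpu_snapshot(device_dir):
    """Return VRAM usage in MiB and temperature in degrees Celsius."""
    device = Path(device_dir)
    used = read_int(device / "mem_info_vram_used")
    total = read_int(device / "mem_info_vram_total")
    # Assumption: the first hwmon entry under the device belongs to the GPU.
    sensors = sorted(device.glob("hwmon/hwmon*/temp1_input"))
    millideg = read_int(sensors[0]) if sensors else None
    mib = 1024 * 1024
    return {
        "vram_used_mib": None if used is None else used // mib,
        "vram_total_mib": None if total is None else total // mib,
        "temp_c": None if millideg is None else millideg / 1000,
    }
```

Calling `gpu_snapshot("/sys/class/drm/card1/device")` once a second in a loop while the game runs, and logging the result to a file, should show whether VRAM or temperature climbs just before the hang.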
Full up to date. Though it seems I am not the only one experiencing deficiencies on the new AMD drivers. However, rolling back also has not worked at all.
I have yet to observe a spike when this occurs.
I haven’t observed the Intel side as I was focused on the AMD issues, but the hardware has only really been sitting near the 70-75 mark under load. I can investigate a better cooling solution regardless. That said, the issue has presented itself even before the card has been under load for any length of time, crashing almost as soon as load is placed on it.
I have checked this on a fair number of things and the behaviour presents itself regardless.
I will check these in order and report back.
This does not present anywhere in bios. I also cannot find documentation on it that does not strictly refer to a windows specific utility. Do you have documentation on how to go about this? I may be overlooking something but I could not find anything useful on it.