Once stable AMDGPU now crashing while playing Fallout 76

adamis · October 16, 2022, 5:41am

I’ve been playing Fallout 76 for close to a year now. Up until a couple of weeks ago, it was rock solid and I could play for hours on end. I took a break for about a week, and now it crashes every time the game is loaded: I can get through the menus and start a game, but it crashes as soon as the world loads or shortly after. Because I wasn’t playing during that week, I don’t know which of the updates I installed, if any, started the issue.

So far I’ve tried different kernels (Zen, normal and LTS) to no avail. I’ve tried the mesa-git packages with no luck. I’ve tried downgrading Mesa to versions 22.1.0 and 22.0 with no luck. I’ve tried downgrading linux-firmware to a version from back in March. I’ve tried running a non-Wayland session, and I’ve also tried different versions of Proton, all with no difference in behavior. The crash results in the screenshot below. I assume this is some sort of memory corruption.

I’ve checked the journal and I have the following errors below.

I’ve done a lot of searching and I cannot seem to find where or what this issue is. Is it the kernel? Mesa? vulkan-radeon? Something else? How might I go about troubleshooting this?

Thanks in advance for any assistance.

Oct 15 22:24:20 robert-desktop kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=21435, emitted seq=21437
Oct 15 22:24:20 robert-desktop kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Fallout76.exe pid 3610 thread Fallout76.exe pid 3610
Oct 15 22:24:20 robert-desktop kernel: amdgpu 0000:0c:00.0: amdgpu: GPU reset begin!
Oct 15 22:24:21 robert-desktop kernel: amdgpu 0000:0c:00.0: amdgpu: BACO reset
Oct 15 22:24:21 robert-desktop kernel: amdgpu 0000:0c:00.0: amdgpu: GPU reset succeeded, trying to resume
Oct 15 22:24:22 robert-desktop kernel: amdgpu 0000:0c:00.0: amdgpu: ring gfx uses VM inv eng 0 on hub 0
Oct 15 22:24:22 robert-desktop kernel: amdgpu 0000:0c:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
Oct 15 22:24:22 robert-desktop kernel: amdgpu 0000:0c:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
Oct 15 22:24:22 robert-desktop kernel: amdgpu 0000:0c:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
Oct 15 22:24:22 robert-desktop kernel: amdgpu 0000:0c:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
Oct 15 22:24:22 robert-desktop kernel: amdgpu 0000:0c:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
Oct 15 22:24:22 robert-desktop kernel: amdgpu 0000:0c:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
Oct 15 22:24:22 robert-desktop kernel: amdgpu 0000:0c:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
Oct 15 22:24:22 robert-desktop kernel: amdgpu 0000:0c:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
Oct 15 22:24:22 robert-desktop kernel: amdgpu 0000:0c:00.0: amdgpu: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
Oct 15 22:24:22 robert-desktop kernel: amdgpu 0000:0c:00.0: amdgpu: ring sdma0 uses VM inv eng 0 on hub 1
Oct 15 22:24:22 robert-desktop kernel: amdgpu 0000:0c:00.0: amdgpu: ring page0 uses VM inv eng 1 on hub 1
Oct 15 22:24:22 robert-desktop kernel: amdgpu 0000:0c:00.0: amdgpu: ring sdma1 uses VM inv eng 4 on hub 1
Oct 15 22:24:22 robert-desktop kernel: amdgpu 0000:0c:00.0: amdgpu: ring page1 uses VM inv eng 5 on hub 1
Oct 15 22:24:22 robert-desktop kernel: amdgpu 0000:0c:00.0: amdgpu: ring uvd_0 uses VM inv eng 6 on hub 1
Oct 15 22:24:22 robert-desktop kernel: amdgpu 0000:0c:00.0: amdgpu: ring uvd_enc_0.0 uses VM inv eng 7 on hub 1
Oct 15 22:24:22 robert-desktop kernel: amdgpu 0000:0c:00.0: amdgpu: ring uvd_enc_0.1 uses VM inv eng 8 on hub 1
Oct 15 22:24:22 robert-desktop kernel: amdgpu 0000:0c:00.0: amdgpu: ring vce0 uses VM inv eng 9 on hub 1
Oct 15 22:24:22 robert-desktop kernel: amdgpu 0000:0c:00.0: amdgpu: ring vce1 uses VM inv eng 10 on hub 1
Oct 15 22:24:22 robert-desktop kernel: amdgpu 0000:0c:00.0: amdgpu: ring vce2 uses VM inv eng 11 on hub 1
Oct 15 22:24:22 robert-desktop kernel: amdgpu 0000:0c:00.0: amdgpu: recover vram bo from shadow start
Oct 15 22:24:22 robert-desktop kernel: amdgpu 0000:0c:00.0: amdgpu: recover vram bo from shadow done
Oct 15 22:24:22 robert-desktop kernel: amdgpu 0000:0c:00.0: amdgpu: GPU reset(2) succeeded!
Oct 15 22:24:22 robert-desktop kernel: Modules linked in: hid_logitech_hidpp mousedev uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videobuf2_common videodev joydev hid_logitech_dj snd_usb_audio eeepc_wmi intel_rapl_msr intel_rapl_common snd_usbmidi_lib edac_mce_amd kvm_amd asus_wmi snd_rawmidi ccp sparse_keymap snd_hda_codec_realtek uas snd_seq_device platform_profile usbhid mc usb_storage asus_wmi_sensors video snd_hda_codec_generic rng_core wmi_bmof rfkill mxm_wmi asus_ec_sensors ledtrig_audio snd_hda_codec_hdmi snd_hda_intel kvm amdgpu snd_intel_dspcfg snd_intel_sdw_acpi irqbypass snd_hda_codec crct10dif_pclmul crc32_pclmul snd_hda_core polyval_clmulni btrfs snd_hwdep polyval_generic gpu_sched gf128mul ghash_clmulni_intel drm_buddy snd_pcm blake2b_generic drm_ttm_helper aesni_intel xor ttm raid6_pq snd_timer crypto_simd libcrc32c drm_display_helper igb cryptd snd rapl sp5100_tco pcspkr cec soundcore k10temp i2c_piix4 dca mac_hid gpio_amdpt wmi gpio_generic acpi_cpufreq vboxnetflt(OE)
Oct 15 22:24:22 robert-desktop kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Oct 15 22:24:22 robert-desktop plasmashell[1427]: amdgpu: amdgpu_cs_query_fence_status failed.
Oct 15 22:24:22 robert-desktop plasmashell[1427]: amdgpu: The CS has been cancelled because the context is lost.
Oct 15 22:24:22 robert-desktop kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Oct 15 22:24:22 robert-desktop kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Oct 15 22:24:22 robert-desktop kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
Oct 15 22:24:22 robert-desktop kernel: amdgpu 0000:0c:00.0: amdgpu: [gfxhub0] no-retry page fault (src_id:0 ring:220 vmid:4 pasid:32774, for process plasmashell pid 1427 thread plasmashel:cs0 pid 1458)
Oct 15 22:24:22 robert-desktop kernel: amdgpu 0000:0c:00.0: amdgpu:   in page starting at address 0x0000800000000000 from IH client 0x1b (UTCL2)
Oct 15 22:24:22 robert-desktop kernel: amdgpu 0000:0c:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x004009B8
Oct 15 22:24:22 robert-desktop kernel: amdgpu 0000:0c:00.0: amdgpu:          Faulty UTCL2 client ID: CPF (0x4)
Oct 15 22:24:22 robert-desktop kernel: amdgpu 0000:0c:00.0: amdgpu:          MORE_FAULTS: 0x0
Oct 15 22:24:22 robert-desktop kernel: amdgpu 0000:0c:00.0: amdgpu:          WALKER_ERROR: 0x4
Oct 15 22:24:22 robert-desktop kernel: amdgpu 0000:0c:00.0: amdgpu:          PERMISSION_FAULTS: 0xb
Oct 15 22:24:22 robert-desktop kernel: amdgpu 0000:0c:00.0: amdgpu:          MAPPING_ERROR: 0x1
Oct 15 22:24:22 robert-desktop kernel: amdgpu 0000:0c:00.0: amdgpu:          RW: 0x0
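
For anyone comparing logs like the one above, here is a minimal Python sketch that pulls the ring name, fence sequence numbers and offending process out of the `amdgpu_job_timedout` lines. The function name `parse_timeouts` and the `pending` field are my own labels, not anything from the driver, and reading `emitted - signaled` as "jobs submitted but not finished" is an assumption about what those counters mean. It only summarizes the log; it does not diagnose the cause.

```python
import re

# "ring <name> timeout, signaled seq=N, emitted seq=M" as seen in the log above.
TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)
# Follow-up line naming the process whose job hung.
PROCESS_RE = re.compile(
    r"Process information: process (?P<name>.+?) pid (?P<pid>\d+)"
)


def parse_timeouts(lines):
    """Return one dict per ring timeout found in the given journal lines."""
    events = []
    for line in lines:
        if "amdgpu_job_timedout" not in line:
            continue
        timeout = TIMEOUT_RE.search(line)
        if timeout:
            signaled = int(timeout["signaled"])
            emitted = int(timeout["emitted"])
            events.append({
                "ring": timeout["ring"],
                "signaled": signaled,
                "emitted": emitted,
                # Assumption: fences emitted but not yet signaled.
                "pending": emitted - signaled,
                "process": None,
                "pid": None,
            })
            continue
        process = PROCESS_RE.search(line)
        # Attach the process line to the timeout that immediately preceded it.
        if process and events and events[-1]["process"] is None:
            events[-1]["process"] = process["name"]
            events[-1]["pid"] = int(process["pid"])
    return events


# Two lines copied from the journal excerpt above.
SAMPLE = [
    "Oct 15 22:24:20 robert-desktop kernel: [drm:amdgpu_job_timedout [amdgpu]] "
    "*ERROR* ring gfx timeout, signaled seq=21435, emitted seq=21437",
    "Oct 15 22:24:20 robert-desktop kernel: [drm:amdgpu_job_timedout [amdgpu]] "
    "*ERROR* Process information: process Fallout76.exe pid 3610 "
    "thread Fallout76.exe pid 3610",
]

for event in parse_timeouts(SAMPLE):
    print(event)
```

Running the same function over several crashes makes it easier to see whether it is always the same ring and the same process that hangs.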

KDen · October 16, 2022, 5:52am

Reading that, it looks like it could be a plasmashell or compositing error. KDE recently had an update to 5.26, which could be the cause. I’m not too familiar with KDE, so I don’t know off the top of my head if there’s a way to disable its compositor. You could try installing a basic window manager like openbox or awesomewm; they come with easy-to-use defaults for those not familiar with setting up a window manager.

Here are some articles on EndeavourOS Discovery that will help you set up and tweak the EOS themes:
https://discovery.endeavouros.com/category/window-tiling-managers/

adamis · October 16, 2022, 4:41pm

Sorry, I posted the wrong section of the log file. I have corrected the log file snippet in my original post to show the section related to Fallout.

KDen · October 16, 2022, 5:06pm

That looks like it could potentially be an old bug. People are still posting in that thread. You may find a solution that works for you in it.
https://bugzilla.kernel.org/show_bug.cgi?id=205089

Another place to look:
https://gitlab.freedesktop.org/drm/amd/-/issues/?search=Failed%20to%20initialize%20parser&sort=created_date&state=opened&first_page_size=20

And of course, a general internet search brings up a boatload of hits as well:
https://html.duckduckgo.com/html?q=%5Bamdgpu%5D%5D%20ERROR%20Failed%20to%20initialize%20parser

GolDNenex · October 16, 2022, 5:15pm

Just experienced the same thing with Age of Empires 4.

oct. 16 17:02:43 EOSR4 kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=13937308, emitted seq=13937311
oct. 16 17:02:43 EOSR4 kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process RelicCardinal.e pid 51724 thread RelicCardinal.e pid 51724
oct. 16 17:02:43 EOSR4 kernel: amdgpu 0000:04:00.0: amdgpu: GPU reset begin!
oct. 16 17:02:44 EOSR4 kernel: amdgpu 0000:04:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
oct. 16 17:02:44 EOSR4 kernel: [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* KGQ disable failed
oct. 16 17:02:44 EOSR4 kernel: [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* failed to halt cp gfx
oct. 16 17:02:44 EOSR4 kernel: [drm] free PSP TMR buffer
oct. 16 17:02:44 EOSR4 kernel: amdgpu 0000:04:00.0: amdgpu: MODE1 reset
oct. 16 17:02:44 EOSR4 kernel: amdgpu 0000:04:00.0: amdgpu: GPU mode1 reset
oct. 16 17:02:44 EOSR4 kernel: amdgpu 0000:04:00.0: amdgpu: GPU smu mode1 reset
oct. 16 17:02:44 EOSR4 kernel: amdgpu 0000:04:00.0: amdgpu: GPU reset succeeded, trying to resume
oct. 16 17:02:44 EOSR4 kernel: [drm] PCIE GART of 512M enabled (table at 0x0000008000300000).
oct. 16 17:02:44 EOSR4 kernel: [drm] VRAM is lost due to GPU reset!
oct. 16 17:02:44 EOSR4 kernel: [drm] PSP is resuming...

Verifying the files with Steam fixed the problem. Did you try that?

Edit: Sadly, today I have the same crash as before. Weird, because it worked great for 4 or 5 games of 30 to 50 minutes each. The verification didn’t find any files to change this time, so it was probably not related. My methodology seemed good, though.

Schlaefer · October 16, 2022, 5:29pm

Just got a ring gfx timeout crash. First crash in many months. My bet would be on an AMDGPU/Mesa 22.2 issue.

ricklinux · October 16, 2022, 5:30pm

Has anyone tried downgrading Mesa?

Schlaefer · October 16, 2022, 5:59pm

I tried to replicate the situation, and in one out of three attempts it crashed again. Quitting a Proton game that ran in gamescope fullscreen mode triggered it. I’m also on Plasma Wayland here.

Alas, I don’t have the time to look into it further at the moment. Maybe next weekend.

adamis · October 16, 2022, 8:39pm

Yes, in my post I wrote that I tried downgrading Mesa and still had the same issue.

mihalycsaba · October 16, 2022, 10:11pm

There was a vulkan-radeon update a few days ago… maybe try amdvlk, and don’t use gamescope.

Echoa · October 16, 2022, 10:32pm

Do you have a 5000 series GPU? My 5700 XT started doing this a lot before dying outright (it was even artifacting in the BIOS).

adamis · October 17, 2022, 2:53am

I have a Vega 56.

Echoa · October 17, 2022, 3:43am

Those have a tendency for the VRAM to fail (most of the AMD HBM cards had this issue), from what I’ve heard in overclocking/enthusiast circles. I would look into testing your VRAM, because my 5700 XT had its VRAM fail completely, and the first signs were what you’re seeing now.

TBH, stress testing on Linux isn’t as easy as on Windows, but you might try looping the Heaven/Valley/Superposition benchmarks and checking the systemd journal when you get a crash. I’d guess the lockup will usually occur on the exact same core each time (mine was core 0). I thought it was just the usual driver bugs/RDNA1 issues, but the card just up and failed.

If that doesn’t help, try posting a bug report to the Mesa git with as much information as you can to replicate the issue. Maybe it is just a driver issue?
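
A rough Python sketch of the "check the journal after each crash" step, in case it helps. It assumes `journalctl -k -b --no-pager` (kernel messages from the current boot) is readable by your user, and the helper names `kernel_log_lines` and `tally_ring_timeouts` are made up for this example. It only counts ring timeouts per ring name, so you can see whether repeated crashes look alike; it says nothing about whether the VRAM is actually failing.

```python
import re
import subprocess
from collections import Counter

RING_TIMEOUT_RE = re.compile(r"ring (\S+) timeout")


def kernel_log_lines():
    """Return this boot's kernel messages as a list of lines.

    Assumes journalctl is installed and the journal is readable.
    """
    result = subprocess.run(
        ["journalctl", "-k", "-b", "--no-pager"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()


def tally_ring_timeouts(lines):
    """Count amdgpu job timeouts per ring name."""
    counts = Counter()
    for line in lines:
        # Only the amdgpu_job_timedout messages describe a hung ring.
        if "amdgpu_job_timedout" not in line:
            continue
        match = RING_TIMEOUT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

After a crash and GPU reset, something like `tally_ring_timeouts(kernel_log_lines()).most_common()` lists the rings that timed out during that boot. If the machine had to be rebooted, the previous boot’s log is available with `journalctl -k -b -1` instead.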

Zesko · October 17, 2022, 6:53am

Most likely kernel 5.19+ has this issue. Some people have said that kernel 5.18 does not.

Echoa · October 17, 2022, 8:30am

He’s on Vega, which is GCN, not RDNA; neither of those issues is related to GCN-based cards.

It’s difficult to relate driver issues between GCN-based cards, let alone between RDNA and GCN. GCN has far more mature support, especially Polaris through Vega 56/64. RDNA1 GPUs have also suffered fence timeouts, ring errors, freezes, etc. since day 1 on Windows and Linux, and nobody has solved it for sure. Of those two, the RDNA2 issue isn’t as common, though.

adamis · October 17, 2022, 3:25pm

I can’t say for certain, but I don’t think that is the issue. I ran the Superposition benchmark and it ran just fine. With Fallout, it crashes within a second or two of loading into the game. If the VRAM were failing, it isn’t failing consistently across these two workloads.

Echoa · October 17, 2022, 5:00pm

Do you have any more dmesg or journal output to check?

Try rolling back the drivers and using the LTS kernel to test whether it stops. It was this week that we updated to Mesa 22.2.

GolDNenex · October 17, 2022, 5:05pm

Following the tips in https://gitlab.freedesktop.org/drm/amd/-/issues/2135:

I tried installing and launching AoE 4 with the Flatpak version of Steam, and that fixed my issue.

Tonight I will try to see if the problem is really Steam, or if it’s a problem with my Steam skin or MangoHud.

Do you also use one or both, @adamis?

Echoa · October 17, 2022, 5:12pm

Flatpak is using Mesa 22.1.7. Did you try just rolling back your Mesa before swapping to Flatpak? It’s unlikely to be your Steam skin; MangoHud sometimes, but not usually.
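
To make the version comparison concrete, here is a small Python sketch that extracts the Mesa version from a driver version string and compares two of them. The sample strings are hypothetical and only mimic the usual "OpenGL version string" line; substitute whatever your host and your Flatpak runtime actually report. The helper name `mesa_version` is made up for this example.

```python
import re

MESA_RE = re.compile(r"Mesa (\d+(?:\.\d+)+)")


def mesa_version(text):
    """Return the Mesa version in text as a 3-part tuple of ints, or None."""
    match = MESA_RE.search(text)
    if match is None:
        return None
    parts = [int(p) for p in match.group(1).split(".")]
    # Pad short versions such as "22.2" to (22, 2, 0) so tuples compare cleanly.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts[:3])


# Hypothetical sample strings, not taken from anyone's system in this thread.
host = "OpenGL version string: 4.6 (Compatibility Profile) Mesa 22.2.1"
flatpak = "OpenGL version string: 4.6 (Compatibility Profile) Mesa 22.1.7"

if mesa_version(flatpak) < mesa_version(host):
    print("Flatpak runtime ships an older Mesa than the host")
```

If the game only works where the older Mesa is in use, that points toward the Mesa update rather than Steam itself, which is why rolling back Mesa on the host is the cleaner test.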

GolDNenex · October 17, 2022, 5:22pm

About the skin, you’re probably right. I was thinking about that because I was looking for differences between my normal setup and the Flatpak one.

I was going to downgrade Mesa, but trying the Flatpak version of Steam was faster. But you’re right, I should check whether that fixes the problem (it looks like it should).

Also, I forgot to give my hardware: it’s a 6700 XT.