Ran into an interesting issue this morning and figured I’d run it by people who know more than I do to see if y’all can make sense of it, because I can’t.
There are two systems involved. One is a Ryzen 9 3900X, X570 mainboard, 64GB DDR4-2400. The other is an i9-10980XE, X299 mainboard, 128GB DDR4-3200. The rest of the parts (apart from the ones built into the mainboard) are common: AMD FirePro W4100 GPU, Samsung 980 Pro boot/root volume, Intel 660p /home volume, two additional Micron 5210 ION SATA SSD data volumes (/home/user/Downloads, /home/user/VMs), one 25" 1440p display, and two 32" 2160p displays.
EndeavourOS w/GNOME was initially installed on the Ryzen-based system. Works perfectly for weeks at a time if I’m too lazy to reboot. I constantly run two VMs on it (KVM) with hardware passed through. Only ever had to add the amd_iommu and iommu kernel parameters.
I took the four SSDs and GPU out, put them into the i9’s mainboard, and reinstalled. I know it’s not strictly necessary, but I wanted to make sure I got a fresh start with the OS – same initial configuration as on the Ryzen-based system, just with intel-ucode instead of amd-ucode.
Within half an hour, the system started hanging, and the displays would turn off and on every couple of minutes. I was able to ssh in from another machine and saw something like this in the logs:
[ xxx.xxxxxxx] radeon 0000:65:00.0: ring 0 stalled for more than 29900msec
…over and over. Now, the GPU I’m using is ancient, but it does what I need, and I’d rather not use any of the nvidia GPUs lying around the office.
After some searching I added the radeon.dpm=0 kernel parameter and rebooted. The system was back up and running, but the desktop felt like it was running at 20fps. That’s definitely not acceptable. Looking at the Arch wiki, I ran lspci -k | grep -A 3 -E "(VGA|3D)" and noted that the system was using the radeon driver; however, I also know that the amdgpu driver supports first-generation GCN cards, so I wanted to switch. To do this I replaced the radeon.dpm=0 kernel parameter with radeon.si_support=0 amdgpu.si_support=1 (W4100 is Cape Verde, Southern Islands/SI, not Sea Islands/CIK), blacklisted the radeon driver, reinstalled kernels, and rebooted again. Since then, the system has not experienced a hang.
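For anyone who wants to check the bound driver without grepping lspci output, here is a rough sketch that reads the driver symlink from sysfs. The function name bound_driver is just something I made up for illustration, and 0000:65:00.0 is the address from the log line on the i9 system; the card will almost certainly sit at a different address on another board.

```python
import os


def bound_driver(pci_addr, sysfs_root="/sys/bus/pci/devices"):
    """Return the kernel driver bound to a PCI device, or None if unbound.

    pci_addr is the full address, e.g. "0000:65:00.0".
    sysfs_root is a parameter only so the function can be pointed at a
    different directory for testing.
    """
    # Each PCI device directory has a "driver" symlink when a driver is bound.
    link = os.path.join(sysfs_root, pci_addr, "driver")
    if not os.path.islink(link):
        return None
    # The symlink target's last path component is the driver name.
    return os.path.basename(os.path.realpath(link))


# Example (address taken from the log on the i9 system; adjust per machine):
# print(bound_driver("0000:65:00.0"))
```

Running this on both machines should show whether each one ends up on radeon or amdgpu.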
I was only able to use it in the office for 5 hours in that state, though, so I’m not going to call the problem solved.
What’s driving me nuts is this: why is it that the Ryzen-based system, with the exact same GPU, behaved so much better than the i9-based system with a fresh OS installation? I had no reason to check which driver was being used by the Ryzen system but I suspect it was amdgpu given the stability.
I should note that I ran stress-ng against the CPU/RAM of the i9-10980XE system – 36 CPU hogs, 32 RAM threads against 120GB, for 72 hours – and it ran without issue. I’ve also replaced GNOME with KDE for no reason other than “I felt like it.”
Bearing in mind that both of these were fresh OS installs, what determines which GPU driver a fresh install uses when it first boots? Is it possible that the X570 and X299 platforms would end up choosing different GPU drivers for the exact same physical card?
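One thing I can compare between the two machines is what each driver reports for its si_support parameter. This is only a sketch, and it assumes both radeon and amdgpu expose that parameter under /sys/module/<module>/parameters/, which I believe is the case but haven't verified on every kernel build. The function name si_support_settings is made up for illustration.

```python
import os


def si_support_settings(module_root="/sys/module"):
    """Read si_support for radeon and amdgpu, where the kernel exposes it.

    Returns a dict like {"radeon": "0", "amdgpu": "1"}; a value of None
    means the module is not loaded or the parameter file is not present.
    module_root is a parameter only so the function can be tested.
    """
    result = {}
    for module in ("radeon", "amdgpu"):
        path = os.path.join(module_root, module, "parameters", "si_support")
        try:
            with open(path) as f:
                result[module] = f.read().strip()
        except OSError:
            # Module not loaded, or this kernel does not expose the parameter.
            result[module] = None
    return result


# Example: run on both systems and compare the output.
# print(si_support_settings())
```

If the two systems print different values with identical kernel command lines, that would at least narrow down where the difference comes from.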