Amdgpu crashes with kernel 5.10

Try amdgpu.gpu_recovery=1 to force-enable automatic GPU recovery, although I think it’s already the default on Navi.

Other parameters that could help are amdgpu.dpm=0 (to disable dynamic power management) and amdgpu.aspm=0 (to disable ASPM, i.e. PCIe Active State Power Management).
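
To check which of these options are actually active after a reboot, you can look at the kernel command line. Here is a minimal sketch in Python, assuming the options were passed on the command line at boot (not via modprobe.d); the function name amdgpu_params is made up for illustration:

```python
def amdgpu_params(cmdline: str) -> dict:
    """Return the amdgpu.* options found on a kernel command line as {name: value}."""
    params = {}
    for token in cmdline.split():
        # Only tokens of the form amdgpu.<name>=<value> are of interest here.
        if token.startswith("amdgpu.") and "=" in token:
            name, value = token[len("amdgpu."):].split("=", 1)
            params[name] = value
    return params
```

Feed it the contents of /proc/cmdline, which holds the command line of the running kernel. An empty result only means nothing was set there, so the driver defaults apply.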

Does it always happen at exactly the 10-minute mark? In that case, you might want to disable monitor power saving as well, if it is configured to power down after 10 minutes.

If everything else fails, it might be a good idea to report the issue directly to the amdgpu developers.

EDIT: If you report the bug, make sure you use a stock kernel.
I ran into problems with the CacULE scheduler in my tests a few weeks ago. I would not use that kernel just yet. But that’s just my personal take.

I had random GPU black-screen issues on an AMD system when waking from sleep, which required REISUB; amd_iommu=off fixed that.

But this was pre-5.10, so it is probably a different issue.

That was a good hint. I disabled IOMMU in the BIOS and this actually prevents the freezes so far. I have run two tests and will keep testing.

What are the implications of disabling IOMMU? Is there any negative impact? At least my VirtualBox VMs are still running.

That could hint at a BIOS issue.
I have very similar hardware (3900+Navi10) and no crashes so far. It’s an Asus board though.

Possibly a performance and/or security impact but I’m not sure.

I now believe I have a BIOS issue. Sorry, I said otherwise earlier.

The WARNING messages like

also show up with the previous BIOS version 1.C0. But as I have now found out, these warnings do not cause the freezes. They are independent and possibly harmless.

I downgraded from BIOS 1.D2 (beta) to 1.C0 and that fixes the issue. No more freezes. With BIOS 1.C0 I still get these messages:

Jan 08 14:47:02 rakete kernel: __common_interrupt: 1.55 No irq handler for vector
Jan 08 14:47:02 rakete kernel: __common_interrupt: 2.55 No irq handler for vector
Jan 08 14:47:02 rakete kernel: __common_interrupt: 3.55 No irq handler for vector
Jan 08 14:47:02 rakete kernel: __common_interrupt: 4.55 No irq handler for vector
Jan 08 14:47:02 rakete kernel: __common_interrupt: 5.55 No irq handler for vector
Jan 08 14:47:02 rakete kernel: __common_interrupt: 6.55 No irq handler for vector
Jan 08 14:47:02 rakete kernel: __common_interrupt: 7.55 No irq handler for vector
Jan 08 14:47:02 rakete kernel: __common_interrupt: 8.55 No irq handler for vector
Jan 08 14:47:02 rakete kernel: __common_interrupt: 9.55 No irq handler for vector
Jan 08 14:47:02 rakete kernel: __common_interrupt: 10.55 No irq handler for vector

These messages were my main reason for going to BIOS 1.D2, because they do not appear there. But obviously BIOS 1.D2 introduces an IOMMU bug which causes my system to freeze after 10 minutes of idle time.
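
When comparing BIOS versions it helps to extract these messages from the journal automatically, so nothing has to be eyeballed. A rough sketch in Python: as far as I can tell from the kernel source, the number before the dot is the CPU and the number after it is the vector, but treat that reading as an assumption. The function name is made up.

```python
import re

# Matches kernel lines such as:
#   "__common_interrupt: 3.55 No irq handler for vector"
PATTERN = re.compile(r"__common_interrupt: (\d+)\.(\d+) No irq handler for vector")


def spurious_vectors(log_text: str) -> list:
    """Return one (cpu, vector) pair per matching log line, in log order."""
    return [(int(cpu), int(vec)) for cpu, vec in PATTERN.findall(log_text)]
```

Pass it the text output of journalctl for the current boot. An empty list means the message did not show up during that boot.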

Long story short:
With BIOS 1.D2 I need to turn off IOMMU in the BIOS to have a stable system.
With BIOS 1.C0 I can leave IOMMU turned on but I have to live with the error messages in the logs.

I will stick with BIOS 1.C0 for now.

Thank you all for your help on this. The hint regarding IOMMU from @otherbarry pointed me in the right direction.

The freezes most certainly come from the two “BUG” messages in your log; that is when the kernel oopses.

But good to know that it is a BIOS issue. Can be helpful for other users as well.

Thanks for your feedback @mbod, great that you got it working :rocket:
I remember a bug with IOMMU on my dad’s Deepin (Debian) system, on a previous version with a Ryzen 2200G, and just like you I had to turn IOMMU off.

Maybe you have to wait until the BIOS 1.D2 beta becomes a stable release for IOMMU to work without error messages!

From my understanding, the main implication on a Linux desktop system is that PCI passthrough will no longer work, which I’ve never used anyway.
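
If you want to verify whether the IOMMU is actually active after changing the BIOS setting, the kernel lists IOMMU groups under /sys/kernel/iommu_groups. A small Python sketch, assuming that sysfs path is available on your kernel; the function name is made up for illustration:

```python
import os


def iommu_groups(sysfs_dir: str = "/sys/kernel/iommu_groups") -> list:
    """Return the IOMMU group numbers listed in sysfs, sorted numerically.

    An empty list suggests that no IOMMU is active (or the path is missing).
    """
    if not os.path.isdir(sysfs_dir):
        return []
    return sorted(int(name) for name in os.listdir(sysfs_dir) if name.isdigit())
```

With IOMMU disabled in the BIOS I would expect an empty list. Passthrough setups rely on these groups, which fits the point above.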

Add “noapic” to the kernel command line to suppress the “x.55 No irq handler for vector” messages.

That does not work. Tried it already. The only thing that helps is to disable the IOMMU in the BIOS. But I am hesitant to do that.

EDIT:
I was wrong. noapic does work. I thought I had already tested this option several months ago, but either I did not test it or the behaviour has changed. noapic certainly eliminates these messages. Thank you!

Are there any negative side effects regarding noapic?

The kernel keeps checking the APIC, but doesn’t trust everything the APIC reports.

sudo journalctl -b0 -g api
sudo journalctl -b0 -g interrupt
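
Besides the journal, /proc/interrupts shows which interrupt controller each IRQ is routed through, which gives a quick view of what noapic changes. A hedged Python sketch: the controller names below are the ones I would expect to see there, and the function name is made up.

```python
from collections import Counter

# Substrings expected in the controller column of /proc/interrupts.
# "IO-APIC" also matches remapped entries such as "IR-IO-APIC".
CONTROLLERS = ("IO-APIC", "XT-PIC", "PCI-MSI")


def controller_usage(interrupts_text: str) -> Counter:
    """Count the rows of /proc/interrupts per interrupt controller name."""
    usage = Counter()
    for line in interrupts_text.splitlines():
        for name in CONTROLLERS:
            if name in line:
                usage[name] += 1
                break  # count each row once
    return usage
```

Run it on the contents of /proc/interrupts before and after adding noapic. My expectation is that the IO-APIC rows turn into XT-PIC rows while MSI interrupts stay as they are, but check this on your own system.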

Maybe you can review the output of my system with apic=debug. Is there anything odd?

journalctl -b0 -g api
-- Journal begins at Sun 2020-12-06 07:56:30 CET, ends at Sat 2021-01-09 09:22:34 CET. --
Jan 09 09:14:46 rakete kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-manjaro-xanmod root=UUID=0a765f87-6eca-4e05-bd1a-36ac4ba3fb8f rw audit=0 nmi_watchdog=0 apic=debug quiet loglevel=3
Jan 09 09:14:46 rakete kernel: ACPI: APIC 0x00000000CB2DA000 00015E (v03 ALASKA A M I    01072009 AMI  00010013)
Jan 09 09:14:46 rakete kernel: ACPI: Local APIC address 0xfee00000
Jan 09 09:14:46 rakete kernel: mapped APIC to ffffffffff5fd000 (        fee00000)
Jan 09 09:14:46 rakete kernel: ACPI: Local APIC address 0xfee00000
Jan 09 09:14:46 rakete kernel: ACPI: LAPIC_NMI (acpi_id[0xff] high edge lint[0x1])
Jan 09 09:14:46 rakete kernel: IOAPIC[0]: apic_id 17, version 33, address 0xfec00000, GSI 0-23
Jan 09 09:14:46 rakete kernel: IOAPIC[1]: apic_id 18, version 33, address 0xfec01000, GSI 24-55
Jan 09 09:14:46 rakete kernel: Int: type 0, pol 0, trig 0, bus 00, IRQ 00, APIC ID 11, APIC INT 02
Jan 09 09:14:46 rakete kernel: Int: type 0, pol 3, trig 3, bus 00, IRQ 09, APIC ID 11, APIC INT 09
Jan 09 09:14:46 rakete kernel: Int: type 0, pol 0, trig 0, bus 00, IRQ 01, APIC ID 11, APIC INT 01
Jan 09 09:14:46 rakete kernel: Int: type 0, pol 0, trig 0, bus 00, IRQ 03, APIC ID 11, APIC INT 03
Jan 09 09:14:46 rakete kernel: Int: type 0, pol 0, trig 0, bus 00, IRQ 04, APIC ID 11, APIC INT 04
Jan 09 09:14:46 rakete kernel: Int: type 0, pol 0, trig 0, bus 00, IRQ 05, APIC ID 11, APIC INT 05
Jan 09 09:14:46 rakete kernel: Int: type 0, pol 0, trig 0, bus 00, IRQ 06, APIC ID 11, APIC INT 06
Jan 09 09:14:46 rakete kernel: Int: type 0, pol 0, trig 0, bus 00, IRQ 07, APIC ID 11, APIC INT 07
Jan 09 09:14:46 rakete kernel: Int: type 0, pol 0, trig 0, bus 00, IRQ 08, APIC ID 11, APIC INT 08
Jan 09 09:14:46 rakete kernel: Int: type 0, pol 0, trig 0, bus 00, IRQ 0a, APIC ID 11, APIC INT 0a
Jan 09 09:14:46 rakete kernel: Int: type 0, pol 0, trig 0, bus 00, IRQ 0b, APIC ID 11, APIC INT 0b
Jan 09 09:14:46 rakete kernel: Int: type 0, pol 0, trig 0, bus 00, IRQ 0c, APIC ID 11, APIC INT 0c
Jan 09 09:14:46 rakete kernel: Int: type 0, pol 0, trig 0, bus 00, IRQ 0d, APIC ID 11, APIC INT 0d
Jan 09 09:14:46 rakete kernel: Int: type 0, pol 0, trig 0, bus 00, IRQ 0e, APIC ID 11, APIC INT 0e
Jan 09 09:14:46 rakete kernel: Int: type 0, pol 0, trig 0, bus 00, IRQ 0f, APIC ID 11, APIC INT 0f
Jan 09 09:14:46 rakete kernel: mapped IOAPIC to ffffffffff5fc000 (fec00000)
Jan 09 09:14:46 rakete kernel: mapped IOAPIC to ffffffffff5fb000 (fec01000)
Jan 09 09:14:46 rakete kernel: Kernel command line: BOOT_IMAGE=/boot/vmlinuz-manjaro-xanmod root=UUID=0a765f87-6eca-4e05-bd1a-36ac4ba3fb8f rw audit=0 nmi_watchdog=0 apic=debug quiet loglevel=3
Jan 09 09:14:46 rakete kernel: APIC: Switch to symmetric I/O mode setup
Jan 09 09:14:46 rakete kernel: Switched APIC routing to physical flat.
Jan 09 09:14:46 rakete kernel: ENABLING IO-APIC IRQs
Jan 09 09:14:46 rakete kernel: init IO_APIC IRQs
Jan 09 09:14:46 rakete kernel:  apic 17 pin 0 not connected
Jan 09 09:14:46 rakete kernel: IOAPIC[0]: Set routing entry (17-1 -> 0xef -> IRQ 1 Mode:0 Active:0 Dest:0)
Jan 09 09:14:46 rakete kernel: IOAPIC[0]: Set routing entry (17-2 -> 0x30 -> IRQ 0 Mode:0 Active:0 Dest:0)
Jan 09 09:14:46 rakete kernel: IOAPIC[0]: Set routing entry (17-3 -> 0xef -> IRQ 3 Mode:0 Active:0 Dest:0)
Jan 09 09:14:46 rakete kernel: IOAPIC[0]: Set routing entry (17-4 -> 0xef -> IRQ 4 Mode:0 Active:0 Dest:0)
Jan 09 09:14:46 rakete kernel: IOAPIC[0]: Set routing entry (17-5 -> 0xef -> IRQ 5 Mode:0 Active:0 Dest:0)
Jan 09 09:14:46 rakete kernel: IOAPIC[0]: Set routing entry (17-6 -> 0xef -> IRQ 6 Mode:0 Active:0 Dest:0)
Jan 09 09:14:46 rakete kernel: IOAPIC[0]: Set routing entry (17-7 -> 0xef -> IRQ 7 Mode:0 Active:0 Dest:0)
Jan 09 09:14:46 rakete kernel: IOAPIC[0]: Set routing entry (17-8 -> 0xef -> IRQ 8 Mode:0 Active:0 Dest:0)
Jan 09 09:14:46 rakete kernel: IOAPIC[0]: Set routing entry (17-9 -> 0xef -> IRQ 9 Mode:1 Active:1 Dest:0)
Jan 09 09:14:46 rakete kernel: IOAPIC[0]: Set routing entry (17-10 -> 0xef -> IRQ 10 Mode:0 Active:0 Dest:0)
Jan 09 09:14:46 rakete kernel: IOAPIC[0]: Set routing entry (17-11 -> 0xef -> IRQ 11 Mode:0 Active:0 Dest:0)
Jan 09 09:14:46 rakete kernel: IOAPIC[0]: Set routing entry (17-12 -> 0xef -> IRQ 12 Mode:0 Active:0 Dest:0)
Jan 09 09:14:46 rakete kernel: IOAPIC[0]: Set routing entry (17-13 -> 0xef -> IRQ 13 Mode:0 Active:0 Dest:0)
Jan 09 09:14:46 rakete kernel: IOAPIC[0]: Set routing entry (17-14 -> 0xef -> IRQ 14 Mode:0 Active:0 Dest:0)
Jan 09 09:14:46 rakete kernel: IOAPIC[0]: Set routing entry (17-15 -> 0xef -> IRQ 15 Mode:0 Active:0 Dest:0)
Jan 09 09:14:46 rakete kernel:  apic 17 pin 16 not connected
Jan 09 09:14:46 rakete kernel:  apic 17 pin 17 not connected
Jan 09 09:14:46 rakete kernel:  apic 17 pin 18 not connected
Jan 09 09:14:46 rakete kernel:  apic 17 pin 19 not connected
Jan 09 09:14:46 rakete kernel:  apic 17 pin 20 not connected
Jan 09 09:14:46 rakete kernel:  apic 17 pin 21 not connected
Jan 09 09:14:46 rakete kernel:  apic 17 pin 22 not connected
Jan 09 09:14:46 rakete kernel:  apic 17 pin 23 not connected
Jan 09 09:14:46 rakete kernel:  apic 18 pin 0 not connected
Jan 09 09:14:46 rakete kernel:  apic 18 pin 1 not connected
Jan 09 09:14:46 rakete kernel:  apic 18 pin 2 not connected
Jan 09 09:14:46 rakete kernel:  apic 18 pin 3 not connected
Jan 09 09:14:46 rakete kernel:  apic 18 pin 4 not connected
Jan 09 09:14:46 rakete kernel:  apic 18 pin 5 not connected
Jan 09 09:14:46 rakete kernel:  apic 18 pin 6 not connected
Jan 09 09:14:46 rakete kernel:  apic 18 pin 7 not connected
Jan 09 09:14:46 rakete kernel:  apic 18 pin 8 not connected
Jan 09 09:14:46 rakete kernel:  apic 18 pin 9 not connected
Jan 09 09:14:46 rakete kernel:  apic 18 pin 10 not connected
Jan 09 09:14:46 rakete kernel:  apic 18 pin 11 not connected
Jan 09 09:14:46 rakete kernel:  apic 18 pin 12 not connected
Jan 09 09:14:46 rakete kernel:  apic 18 pin 13 not connected
Jan 09 09:14:46 rakete kernel:  apic 18 pin 14 not connected
Jan 09 09:14:46 rakete kernel:  apic 18 pin 15 not connected
Jan 09 09:14:46 rakete kernel:  apic 18 pin 16 not connected
Jan 09 09:14:46 rakete kernel:  apic 18 pin 17 not connected
Jan 09 09:14:46 rakete kernel:  apic 18 pin 18 not connected
Jan 09 09:14:46 rakete kernel:  apic 18 pin 19 not connected
Jan 09 09:14:46 rakete kernel:  apic 18 pin 20 not connected
Jan 09 09:14:46 rakete kernel:  apic 18 pin 21 not connected
Jan 09 09:14:46 rakete kernel:  apic 18 pin 22 not connected
Jan 09 09:14:46 rakete kernel:  apic 18 pin 23 not connected
Jan 09 09:14:46 rakete kernel:  apic 18 pin 24 not connected
Jan 09 09:14:46 rakete kernel:  apic 18 pin 25 not connected
Jan 09 09:14:46 rakete kernel:  apic 18 pin 26 not connected
Jan 09 09:14:46 rakete kernel:  apic 18 pin 27 not connected
Jan 09 09:14:46 rakete kernel:  apic 18 pin 28 not connected
Jan 09 09:14:46 rakete kernel:  apic 18 pin 29 not connected
Jan 09 09:14:46 rakete kernel:  apic 18 pin 30 not connected
Jan 09 09:14:46 rakete kernel:  apic 18 pin 31 not connected
Jan 09 09:14:46 rakete kernel: ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
Jan 09 09:14:46 rakete kernel: Using local APIC timer interrupts.
                               calibrating APIC timer ...
Jan 09 09:14:46 rakete kernel: ... lapic delta = 625003
Jan 09 09:14:46 rakete kernel: ACPI: Using IOAPIC for interrupt routing
Jan 09 09:14:46 rakete kernel: pps_core: LinuxPPS API ver. 1 registered
Jan 09 09:14:46 rakete kernel: IOAPIC[1]: Set routing entry (18-4 -> 0xef -> IRQ 24 Mode:1 Active:1 Dest:0)
Jan 09 09:14:46 rakete kernel: IOAPIC[1]: Set routing entry (18-6 -> 0xef -> IRQ 25 Mode:1 Active:1 Dest:0)
Jan 09 09:14:46 rakete kernel: IOAPIC[1]: Set routing entry (18-11 -> 0xef -> IRQ 25 Mode:1 Active:1 Dest:0)
Jan 09 09:14:46 rakete kernel: IOAPIC[1]: Set routing entry (18-15 -> 0xef -> IRQ 26 Mode:1 Active:1 Dest:0)
Jan 09 09:14:46 rakete kernel:  PPR X2APIC NX GT IA GA PC GA_vAPIC
Jan 09 09:14:46 rakete kernel: AMD-Vi: Virtual APIC enabled
Jan 09 09:14:46 rakete kernel: AMD-Vi: X2APIC enabled
Jan 09 09:14:46 rakete kernel: fuse: init (API version 7.32)
Jan 09 09:14:46 rakete kernel: IOAPIC[1]: Set routing entry (18-10 -> 0xef -> IRQ 30 Mode:1 Active:1 Dest:0)
Jan 09 09:14:46 rakete kernel: IOAPIC[1]: Set routing entry (18-5 -> 0xef -> IRQ 38 Mode:1 Active:1 Dest:0)
Jan 09 09:14:46 rakete kernel: IOAPIC[1]: Set routing entry (18-6 -> 0xef -> IRQ 40 Mode:1 Active:1 Dest:0)
Jan 09 09:14:46 rakete kernel: IOAPIC[1]: Set routing entry (18-30 -> 0xef -> IRQ 42 Mode:1 Active:1 Dest:0)
Jan 09 09:14:46 rakete kernel: ... APIC ID:      00000000 (0)
Jan 09 09:14:46 rakete kernel: ... APIC VERSION: 80050010
Jan 09 09:14:46 rakete kernel: number of IO-APIC #17 registers: 24.
Jan 09 09:14:46 rakete kernel: number of IO-APIC #18 registers: 32.
Jan 09 09:14:46 rakete kernel: testing the IO APIC.......................
Jan 09 09:14:46 rakete kernel: IO APIC #17......
Jan 09 09:14:46 rakete kernel: .......    : physical APIC id: 11
Jan 09 09:14:46 rakete kernel: .......     : IO APIC version: 21
Jan 09 09:14:46 rakete kernel: IOAPIC 0:
Jan 09 09:14:46 rakete kernel: IO APIC #18......
Jan 09 09:14:46 rakete kernel: .......    : physical APIC id: 12
Jan 09 09:14:46 rakete kernel: .......     : IO APIC version: 21
Jan 09 09:14:46 rakete kernel: IOAPIC 1:
Jan 09 09:14:46 rakete dracut-cmdline[318]: Using kernel command line parameters: audit=0 nmi_watchdog=0 quiet loglevel=3 rd.luks=0 rd.lvm=0 rd.md=0 rd.dm=0 BOOT_IMAGE=/boot/vmlinuz-manjaro-xanmod root=UUID=0a765f87-6eca-4e05-bd1a-36ac4ba3fb8f rw audit=0 nmi_watchdog=0 apic=debug quiet loglevel=3
Jan 09 09:14:46 rakete kernel: IOAPIC[1]: Set routing entry (18-0 -> 0xef -> IRQ 46 Mode:1 Active:1 Dest:0)
Jan 09 09:14:46 rakete kernel: IOAPIC[1]: Set routing entry (18-15 -> 0xef -> IRQ 76 Mode:1 Active:1 Dest:0)
Jan 09 09:14:47 rakete kernel: ata1.00: ATAPI: HL-DT-ST BD-RE  BH16NS40, 1.03, max UDMA/100
Jan 09 09:14:54 rakete kernel: RAPL PMU: API unit is 2^-32 Joules, 1 fixed counters, 163840 ms ovfl timer
Jan 09 09:14:54 rakete kernel: IOAPIC[1]: Set routing entry (18-31 -> 0xef -> IRQ 166 Mode:1 Active:1 Dest:0)
Jan 09 09:14:54 rakete kernel: IOAPIC[1]: Set routing entry (18-12 -> 0xef -> IRQ 168 Mode:1 Active:1 Dest:0)

journalctl -b0 -g interrupt
-- Journal begins at Sun 2020-12-06 07:56:30 CET, ends at Sat 2021-01-09 09:17:02 CET. --
Jan 09 09:14:46 rakete kernel: Using local APIC timer interrupts.
                               calibrating APIC timer ...
Jan 09 09:14:46 rakete kernel: __common_interrupt: 1.55 No irq handler for vector
Jan 09 09:14:46 rakete kernel: __common_interrupt: 2.55 No irq handler for vector
Jan 09 09:14:46 rakete kernel: __common_interrupt: 3.55 No irq handler for vector
Jan 09 09:14:46 rakete kernel: __common_interrupt: 4.55 No irq handler for vector
Jan 09 09:14:46 rakete kernel: __common_interrupt: 5.55 No irq handler for vector
Jan 09 09:14:46 rakete kernel: __common_interrupt: 6.55 No irq handler for vector
Jan 09 09:14:46 rakete kernel: __common_interrupt: 7.55 No irq handler for vector
Jan 09 09:14:46 rakete kernel: __common_interrupt: 8.55 No irq handler for vector
Jan 09 09:14:46 rakete kernel: __common_interrupt: 9.55 No irq handler for vector
Jan 09 09:14:46 rakete kernel: __common_interrupt: 10.55 No irq handler for vector
Jan 09 09:14:46 rakete kernel: ACPI: Using IOAPIC for interrupt routing
Jan 09 09:14:46 rakete kernel: ACPI: PCI Interrupt Link [LNKA] (IRQs 4 5 7 10 11 14 15) *0
Jan 09 09:14:46 rakete kernel: ACPI: PCI Interrupt Link [LNKB] (IRQs 4 5 7 10 11 14 15) *0
Jan 09 09:14:46 rakete kernel: ACPI: PCI Interrupt Link [LNKC] (IRQs 4 5 7 10 11 14 15) *0
Jan 09 09:14:46 rakete kernel: ACPI: PCI Interrupt Link [LNKD] (IRQs 4 5 7 10 11 14 15) *0
Jan 09 09:14:46 rakete kernel: ACPI: PCI Interrupt Link [LNKE] (IRQs 4 5 7 10 11 14 15) *0
Jan 09 09:14:46 rakete kernel: ACPI: PCI Interrupt Link [LNKF] (IRQs 4 5 7 10 11 14 15) *0
Jan 09 09:14:46 rakete kernel: ACPI: PCI Interrupt Link [LNKG] (IRQs 4 5 7 10 11 14 15) *0
Jan 09 09:14:46 rakete kernel: ACPI: PCI Interrupt Link [LNKH] (IRQs 4 5 7 10 11 14 15) *0
Jan 09 09:14:46 rakete kernel: AMD-Vi: Interrupt remapping enabled
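
One way to scan an apic=debug dump like this for oddities is to collect the “Set routing entry” lines and look for IO-APIC pins that were routed to more than one IRQ. The following is only a sketch in Python with a made-up function name, and whether such a pin is actually a problem is a separate question:

```python
import re
from collections import defaultdict

# Matches e.g.
#   "IOAPIC[1]: Set routing entry (18-6 -> 0xef -> IRQ 25 Mode:1 Active:1 Dest:0)"
ROUTE = re.compile(r"Set routing entry \((\d+)-(\d+) -> 0x[0-9a-f]+ -> IRQ (\d+)")


def pins_with_multiple_irqs(log_text: str) -> dict:
    """Map (apic, pin) to its IRQ list, keeping only pins routed to more than one IRQ."""
    routes = defaultdict(list)
    for apic, pin, irq in ROUTE.findall(log_text):
        routes[(int(apic), int(pin))].append(int(irq))
    return {pin: irqs for pin, irqs in routes.items() if len(set(irqs)) > 1}
```

On the journal output above, this would report pins 18-6 (IRQ 25 and 40) and 18-15 (IRQ 26 and 76). I can’t say whether that is expected behaviour or a hint at the BIOS problem.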

I’m now getting this on my system as well. Is there any real fix for it? I did find this: https://bbs.archlinux.org/viewtopic.php?id=256227

It only started happening once I updated my BIOS; everything seems to run fine without issue.

Just found this: https://github.com/torvalds/linux/commit/fa5e5c409213265da8a188b4a5e4e641b1382eb4

Edit: It appears to be a kernel/AMD issue, so I’m guessing we’ll have to wait lol. Anyone else getting this issue?

5.11 is in the repos now so that will be worth trying.

I’m running Kernel: 5.11.1-arch1-1

New kernel, new problem, new thread.

Is the issue that you have a crash as in the OP, or that you are getting “No irq handler for vector” messages as in the Arch forum thread? Those seem to be two different issues?

Given that the majority of the Arch forum replies towards the end of the thread are about the issue being solved via a BIOS update, this might be something you need to talk to the motherboard manufacturer about.

Dang, I posted in the wrong thread, my fault for having 45 tabs open lol.

Alternatively, you could disable IOMMU in your BIOS; that should work as well.

On my X570 board I recently updated my UEFI firmware again and the “No irq handler for vector” message is gone. This is an MSI board with an AMD Ryzen.

Edit: I think I just followed you on this. :thinking: