Nvidia flickering, X11 and idea for best setup

Hi everyone.
Once in a while (every 2-3 years) I give Linux another chance as my main working / gaming OS.
So it's time to try again. I used to be a big fan of Manjaro, but after a while I always ran into problems with things breaking, so I thought I would try EndeavourOS.

Long story, short…

Like a lot of people, I have problems with Wayland and flickering. While everything seems to work pretty well and smoothly, I get a lot of flickering and black apps on resizing.
There were many “guides”, so I ended up using X11, which seems to be stable. It just feels a bit more “clunky”, and the bigger problem is that my GPU always runs at higher frequencies, which results in 40 W more power consumption at idle.

My System:

  • AMD 7950x3d
  • 128 GB DDR5
  • RTX 2080ti (nvidia 550.78)

As always, I am also in a dilemma, because for different purposes I need my GPU either on the host or in guests.
I need a Windows VM for Visual Studio 2022 that should still feel at least reasonably responsive (sadly, virtio OpenGL + 3D acceleration of course does not work with NVIDIA…), and also a VM with my GPU passed through for gaming / 3D modeling (Fusion 360) and such.

For my “normal” Windows VM, which covers coding and everyday stuff, a VM set up with Fmstrat / winapps works pretty well.
For my gaming / 3D VM I would need my NVIDIA GPU. Because I also want to try Proton and such on my host, I would still want to be able to play on my Linux machine. Also for testing. Dual boot is just annoying, and I would finally love to have one machine for everything, with some sort of isolation between workloads.

Long story short again…

  • Is there any new information on how to use Wayland with NVIDIA without flickering?
  • How can I resolve my high idle power consumption under X11? I did not have this problem on Wayland.
  • Would it work to write something like a script to attach and detach the NVIDIA GPU on the fly? I would want to use the iGPU for my system as the main OS and the NVIDIA card for gaming under Linux and LLMs (LM Studio), but also have the option to detach it when it is not needed and attach it to my VM.

I have a KVM switch, so I can easily switch between the mainboard and GPU outputs.

I know these use cases are pretty “extreme”, but that's my life and I try to get the best out of it. That's why it was always hard to switch away from macOS (working + Parallels) and Windows (gaming + LLM).

Can you report the output of:

inxi -Fza 
cpupower frequency-info
lscpu -ae

amd-pstate-epp should be working by default

As I said, at the moment it is the NVIDIA card that does not use power management correctly under X11.
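To put numbers on the idle behaviour, nvidia-smi can report the performance state, power draw and clocks while the desktop is doing nothing (P0 is maximum performance, larger numbers are lower-power states). A minimal sketch, assuming the proprietary driver is loaded; the function names are made up for illustration, and the available field names can be listed with nvidia-smi --help-query-gpu:

```bash
#!/usr/bin/env bash
# Sketch: read the GPU's performance state and power draw at idle.
# query_gpu_power and pstate_of are hypothetical helper names.

# One CSV line: P-state, power draw, graphics clock, memory clock.
query_gpu_power() {
    nvidia-smi --query-gpu=pstate,power.draw,clocks.gr,clocks.mem \
               --format=csv,noheader
}

# Extract the P-state (the first CSV field) from one line of that output.
pstate_of() {
    local line="$1"
    echo "${line%%,*}"
}
```

Running pstate_of "$(query_gpu_power)" once under X11 and once under Wayland should show whether the card really stays in a higher performance state in one of the two sessions.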

System:
  Kernel: 6.8.9-arch1-2 arch: x86_64 bits: 64 compiler: gcc v: 14.1.1
    clocksource: tsc avail: hpet,acpi_pm
    parameters: BOOT_IMAGE=/@/boot/vmlinuz-linux
    root=UUID=a2834319-8e6d-4945-92aa-cb74b48a02ce rw rootflags=subvol=@
    nowatchdog nvme_load=YES nvidia_drm.modeset=1 loglevel=3
    nvidia-drm.modeset=1
  Desktop: KDE Plasma v: 6.0.4 tk: Qt v: N/A info: frameworks v: 6.1.0
    wm: kwin_x11 vt: 2 dm: SDDM Distro: EndeavourOS base: Arch Linux
Machine:
  Type: Desktop System: ASUS product: N/A v: N/A serial: <superuser required>
  Mobo: ASUSTeK model: ROG STRIX X670E-E GAMING WIFI v: Rev 1.xx
    serial: <superuser required> part-nu: SKU uuid: <superuser required>
    UEFI: American Megatrends v: 1709 date: 09/28/2023
Battery:
  ID-1: hidpp_battery_0 charge: 57% condition: N/A volts: 3.8 min: N/A
    model: Logitech G502 LIGHTSPEED Wireless Gaming Mouse type: N/A
    serial: <filter> status: discharging
CPU:
  Info: model: AMD Ryzen 9 7950X3D bits: 64 type: MT MCP arch: Zen 4 gen: 5
    level: v4 note: check built: 2022+ process: TSMC n5 (5nm) family: 0x19 (25)
    model-id: 0x61 (97) stepping: 2 microcode: 0xA601206
  Topology: cpus: 1x cores: 16 tpc: 2 threads: 32 smt: enabled cache:
    L1: 1024 KiB desc: d-16x32 KiB; i-16x32 KiB L2: 16 MiB desc: 16x1024 KiB
    L3: 128 MiB desc: 1x32 MiB, 1x96 MiB
  Speed (MHz): avg: 1351 high: 4198 min/max: 400/5759 scaling:
    driver: amd-pstate-epp governor: powersave cores: 1: 400 2: 400 3: 3563
    4: 400 5: 400 6: 400 7: 400 8: 400 9: 3158 10: 400 11: 400 12: 400
    13: 4198 14: 400 15: 400 16: 400 17: 400 18: 400 19: 400 20: 400 21: 400
    22: 3385 23: 4015 24: 400 25: 400 26: 400 27: 4151 28: 4004 29: 3840
    30: 400 31: 3737 32: 400 bogomips: 268406
  Flags: avx avx2 ht lm nx pae sse sse2 sse3 sse4_1 sse4_2 sse4a ssse3 svm
  Vulnerabilities:
  Type: gather_data_sampling status: Not affected
  Type: itlb_multihit status: Not affected
  Type: l1tf status: Not affected
  Type: mds status: Not affected
  Type: meltdown status: Not affected
  Type: mmio_stale_data status: Not affected
  Type: reg_file_data_sampling status: Not affected
  Type: retbleed status: Not affected
  Type: spec_rstack_overflow mitigation: Safe RET
  Type: spec_store_bypass mitigation: Speculative Store Bypass disabled via
    prctl
  Type: spectre_v1 mitigation: usercopy/swapgs barriers and __user pointer
    sanitization
  Type: spectre_v2 mitigation: Enhanced / Automatic IBRS; IBPB:
    conditional; STIBP: always-on; RSB filling; PBRSB-eIBRS: Not affected;
    BHI: Not affected
  Type: srbds status: Not affected
  Type: tsx_async_abort status: Not affected
Graphics:
  Device-1: NVIDIA TU102 [GeForce RTX 2080 Ti Rev. A] driver: nvidia v: 550.78
    alternate: nouveau,nvidia_drm non-free: 550.xx+ status: current (as of
    2024-04; EOL~2026-12-xx) arch: Turing code: TUxxx process: TSMC 12nm FF
    built: 2018-2022 pcie: gen: 3 speed: 8 GT/s lanes: 16 ports: active: none
    off: DP-2,DP-3 empty: DP-1,HDMI-A-1,Unknown-1 bus-ID: 01:00.0
    chip-ID: 10de:1e07 class-ID: 0300
  Device-2: AMD Raphael vendor: ASUSTeK driver: amdgpu v: kernel
    arch: RDNA-2 code: Navi-2x process: TSMC n7 (7nm) built: 2020-22 pcie:
    gen: 4 speed: 16 GT/s lanes: 16 ports: active: none
    empty: DP-4,HDMI-A-2,Writeback-1 bus-ID: 12:00.0 chip-ID: 1002:164e
    class-ID: 0300 temp: 41.0 C
  Display: x11 server: X.Org v: 21.1.13 with: Xwayland v: 23.2.6
    compositor: kwin_x11 driver: X: loaded: amdgpu,nvidia unloaded: modesetting
    alternate: fbdev,nouveau,nv,vesa dri: radeonsi gpu: nvidia,nvidia-nvswitch
    display-ID: :0 screens: 1
  Screen-1: 0 s-res: 6000x1440 s-dpi: 107 s-size: 1425x342mm (56.10x13.46")
    s-diag: 1465mm (57.7")
  Monitor-1: DP-2 note: disabled pos: primary,right model: Dell AW3420DW
    serial: <filter> built: 2020 res: 3440x1440 dpi: 109 gamma: 1.2
    size: 798x335mm (31.42x13.19") diag: 865mm (34.1") modes: max: 3440x1440
    min: 640x480
  Monitor-2: DP-3 mapped: DP-4 note: disabled pos: left
    model: ViewSonic VX3276-QHD serial: <filter> built: 2020 res: 2560x1440
    dpi: 93 gamma: 1.2 size: 698x393mm (27.48x15.47") diag: 801mm (31.5")
    ratio: 16:9 modes: max: 2560x1440 min: 640x480
  API: EGL v: 1.5 hw: drv: nvidia drv: amd radeonsi platforms: device: 0
    drv: nvidia device: 2 drv: radeonsi device: 3 drv: swrast gbm: drv: nvidia
    surfaceless: drv: nvidia x11: drv: nvidia inactive: wayland,device-1
  API: OpenGL v: 4.6.0 compat-v: 4.5 vendor: nvidia mesa v: 550.78
    glx-v: 1.4 direct-render: yes renderer: NVIDIA GeForce RTX 2080 Ti/PCIe/SSE2
    memory: 10.74 GiB
  API: Vulkan v: 1.3.279 layers: 5 device: 0 type: discrete-gpu name: NVIDIA
    GeForce RTX 2080 Ti driver: nvidia v: 550.78 device-ID: 10de:1e07
    surfaces: xcb,xlib
Audio:
  Device-1: NVIDIA TU102 High Definition Audio driver: snd_hda_intel v: kernel
    pcie: gen: 3 speed: 8 GT/s lanes: 16 bus-ID: 01:00.1 chip-ID: 10de:10f7
    class-ID: 0403
  Device-2: AMD Rembrandt Radeon High Definition Audio vendor: ASUSTeK
    driver: snd_hda_intel v: kernel pcie: gen: 4 speed: 16 GT/s lanes: 16
    bus-ID: 12:00.1 chip-ID: 1002:1640 class-ID: 0403
  Device-3: Astro Gaming A50 driver: hid-generic,snd-usb-audio,usbhid
    type: USB rev: 2.0 speed: 12 Mb/s lanes: 1 mode: 1.1 bus-ID: 5-5.1:4
    chip-ID: 9886:002c class-ID: 0300
  Device-4: Conexant Systems (Rockwell) USB Audio
    driver: hid-generic,snd-usb-audio,usbhid type: USB rev: 2.0 speed: 12 Mb/s
    lanes: 1 mode: 1.1 bus-ID: 5-5.3.4:13 chip-ID: 0572:14c1 class-ID: 0300
    serial: <filter>
  API: ALSA v: k6.8.9-arch1-2 status: kernel-api
    tools: alsactl,alsamixer,amixer
  Server-1: PipeWire v: 1.0.5 status: active with: 1: pipewire-pulse
    status: active 2: wireplumber status: active 3: pipewire-alsa type: plugin
    4: pw-jack type: plugin tools: pactl,pw-cat,pw-cli,wpctl
Network:
  Device-1: Intel Ethernet I225-V vendor: ASUSTeK driver: igc v: kernel pcie:
    gen: 2 speed: 5 GT/s lanes: 1 port: N/A bus-ID: 09:00.0 chip-ID: 8086:15f3
    class-ID: 0200
  IF: eno1 state: up speed: 1000 Mbps duplex: full mac: <filter>
  IF-ID-1: virbr0 state: down mac: <filter>
  Info: services: NetworkManager,systemd-timesyncd
Bluetooth:
  Device-1: Intel AX210 Bluetooth driver: btusb v: 0.8 type: USB rev: 2.0
    speed: 12 Mb/s lanes: 1 mode: 1.1 bus-ID: 3-9:6 chip-ID: 8087:0032
    class-ID: e001
  Report: btmgmt ID: hci0 rfk-id: 0 state: up address: <filter> bt-v: 5.3
    lmp-v: 12 status: discoverable: no pairing: no class-ID: 6c0104
Drives:
  Local Storage: total: 2.31 TiB used: 46.88 GiB (2.0%)
  SMART Message: Unable to run smartctl. Root privileges required.
  ID-1: /dev/nvme0n1 maj-min: 259:0 vendor: Samsung model: SSD 980 PRO 2TB
    size: 1.82 TiB block-size: physical: 512 B logical: 512 B speed: 63.2 Gb/s
    lanes: 4 tech: SSD serial: <filter> fw-rev: 5B2QGXA7 temp: 45.9 C
    scheme: GPT
  ID-2: /dev/nvme1n1 maj-min: 259:3 vendor: Samsung model: SSD 970 PRO 512GB
    size: 476.94 GiB block-size: physical: 512 B logical: 512 B speed: 31.6 Gb/s
    lanes: 4 tech: SSD serial: <filter> fw-rev: 1B2QEXP7 temp: 36.9 C
    scheme: GPT
  ID-3: /dev/sda maj-min: 8:0 vendor: SanDisk model: USB 3.2Gen1
    size: 28.65 GiB block-size: physical: 512 B logical: 512 B type: USB
    rev: 3.2 spd: 5 Gb/s lanes: 1 mode: 3.2 gen-1x1 tech: N/A serial: <filter>
    fw-rev: 1.00 scheme: MBR
  SMART Message: Unknown USB bridge. Flash drive/Unsupported enclosure?
Partition:
  ID-1: / raw-size: 1.82 TiB size: 1.82 TiB (100.00%) used: 46.88 GiB (2.5%)
    fs: btrfs dev: /dev/nvme0n1p2 maj-min: 259:2
  ID-2: /boot/efi raw-size: 1024 MiB size: 1022 MiB (99.80%)
    used: 584 KiB (0.1%) fs: vfat dev: /dev/nvme0n1p1 maj-min: 259:1
  ID-3: /home raw-size: 1.82 TiB size: 1.82 TiB (100.00%)
    used: 46.88 GiB (2.5%) fs: btrfs dev: /dev/nvme0n1p2 maj-min: 259:2
  ID-4: /var/log raw-size: 1.82 TiB size: 1.82 TiB (100.00%)
    used: 46.88 GiB (2.5%) fs: btrfs dev: /dev/nvme0n1p2 maj-min: 259:2
Swap:
  Alert: No swap data was found.
Sensors:
  System Temperatures: cpu: 51.2 C mobo: N/A
  Fan Speeds (rpm): N/A
  GPU: device: nvidia screen: :0.0 temp: 32 C fan: 0% device: amdgpu
    temp: 41.0 C
  Power: 12v: 12.08 5v: N/A 3.3v: N/A vbat: N/A
Info:
  Memory: total: 128 GiB available: 125.38 GiB used: 7.56 GiB (6.0%)
  Processes: 578 Power: uptime: 2h 34m states: freeze,mem,disk suspend: deep
    avail: s2idle wakeups: 0 hibernate: platform avail: shutdown, reboot,
    suspend, test_resume image: 50.11 GiB services: org_kde_powerdevil,
    power-profiles-daemon, upowerd Init: systemd v: 255 default: graphical
    tool: systemctl
  Packages: 1213 pm: pacman pkgs: 1204 libs: 327 tools: yay pm: flatpak
    pkgs: 9 Compilers: gcc: 14.1.1 Shell: Bash v: 5.2.26 running-in: konsole
    inxi: 3.3.34

If you have wlroots installed, maybe have a look at this:

aur/wlroots0.16-nvidia 0.16.2-1 (+1 0.21) 
    Modular Wayland compositor library, with fix for nvidia flickering, downgraded to version 0.16

Thanks for your reply. I think writing this post also helped me sort out my thoughts a bit.

After many tests, it feels like the AMD iGPU together with Wayland is just the way to go. It simply feels better than X11 and I do not have any problems this way.
In the upcoming tests I will also try Fmstrat / winapps again, in the hope that it works better with the AMD iGPU.
After a while, and after understanding what this “dracut” is, I was also able to pass through the NVIDIA GPU, and it works perfectly.

I will also do some performance tests to see how good it is compared to bare metal.

My only “problem” at the moment is… still the power consumption. Because no NVIDIA driver is controlling the clocks, the GPU just consumes 40 W more than normal at idle. Is there a way to somehow set the GPU to the lowest possible power consumption? Or maybe something like a soft attach to the NVIDIA driver when no VM is running? Or any script or whatever?

Ideally I would like to be able to really use the GPU on my host as well when no VM is running, without needing a reboot. Maybe by setting and forcing interrupts?

It seems that unbinding it like this:

echo "0000:01:00.0" | sudo tee /sys/bus/pci/drivers/vfio-pci/unbind
echo "0000:01:00.1" | sudo tee /sys/bus/pci/drivers/vfio-pci/unbind
echo "0000:01:00.2" | sudo tee /sys/bus/pci/drivers/vfio-pci/unbind
echo "0000:01:00.3" | sudo tee /sys/bus/pci/drivers/vfio-pci/unbind

makes it possible to run nvidia-smi again, which brings the NVIDIA driver back up on my host and also applies the power profiles as it should.

But now it seems that it is not possible to bind it back to vfio-pci, even with

echo "0000:01:00.0" | sudo tee /sys/bus/pci/drivers/vfio-pci/bind
echo "0000:01:00.1" | sudo tee /sys/bus/pci/drivers/vfio-pci/bind
echo "0000:01:00.2" | sudo tee /sys/bus/pci/drivers/vfio-pci/bind
echo "0000:01:00.3" | sudo tee /sys/bus/pci/drivers/vfio-pci/bind

I have to reboot to make it work again. This is what I need to solve.
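A quick way to narrow this down is to check which driver owns each function after the unbind. If nvidia has already claimed 01:00.0 (for example after running nvidia-smi), writing the address into vfio-pci/bind will likely fail until that driver lets go of the device. A minimal sketch, assuming the PCI addresses from this thread; SYSFS_ROOT and the function names are made up for illustration and only exist so the check can be tried against a test directory:

```bash
#!/usr/bin/env bash
# Sketch: print which kernel driver currently owns each function of the GPU.
# SYSFS_ROOT is a hypothetical override; on a real system it stays /sys.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# Print the driver name for one PCI address, or "none" if it is unbound.
current_driver() {
    local dev="$1"
    local link="$SYSFS_ROOT/bus/pci/devices/$dev/driver"
    if [ -L "$link" ]; then
        basename "$(readlink "$link")"
    else
        echo "none"
    fi
}

# Walk the four functions of the card at 0000:01:00.
show_gpu_drivers() {
    local fn
    for fn in 0 1 2 3; do
        printf '%s %s\n' "0000:01:00.$fn" "$(current_driver "0000:01:00.$fn")"
    done
}
```

Calling show_gpu_drivers before and after each bind or unbind step shows at a glance whether a function is still held by nvidia, snd_hda_intel or xhci_hcd.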

I got it to work the way I wanted. I think I will create a script or something similar later on. If anyone is interested:

Note: nvidia-inst was used to install the latest driver on my PC.

Pre install (AMD + Nvidia)

sudo nano /etc/default/grub

GRUB_CMDLINE_LINUX_DEFAULT="nowatchdog nvme_load=YES loglevel=3 amd_iommu=on iommu=pt nvidia-drm.modeset=1"

sudo nano /etc/modprobe.d/vfio.conf

options vfio-pci ids=10de:1e07,10de:10f7,10de:1ad6,10de:1ad7
softdep nvidia pre: vfio-pci

sudo nano /etc/dracut.conf.d/10-vfio.conf

force_drivers+=" vfio_pci vfio vfio_iommu_type1 "

sudo grub-mkconfig -o /boot/grub/grub.cfg
sudo dracut-rebuild
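After the reboot it may be worth checking that the IOMMU is actually active and that the GPU functions sit in their own group. A minimal sketch that lists every IOMMU group with the PCI addresses in it, reading the standard /sys/kernel/iommu_groups tree; SYSFS_ROOT and list_iommu_groups are hypothetical names used only for this example:

```bash
#!/usr/bin/env bash
# Sketch: list IOMMU groups and the PCI devices inside each one.
# SYSFS_ROOT is a hypothetical override; on a real system it stays /sys.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

list_iommu_groups() {
    local group dev
    for group in "$SYSFS_ROOT"/kernel/iommu_groups/*; do
        [ -d "$group" ] || continue      # nothing here: IOMMU is probably off
        echo "IOMMU group ${group##*/}:"
        for dev in "$group"/devices/*; do
            [ -e "$dev" ] || continue
            echo "  ${dev##*/}"
        done
    done
}
```

If list_iommu_groups prints nothing, the kernel parameters most likely did not take effect. Ideally the four 0000:01:00.x functions show up together in one group with no unrelated devices next to them.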

Output after reboot (fresh start)

lspci -nnk | grep -A 3 "NVIDIA"

01:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti Rev. A] [10de:1e07] (rev a1)
        Subsystem: NVIDIA Corporation Device [10de:12fa]
        Kernel driver in use: vfio-pci
        Kernel modules: nouveau, nvidia_drm, nvidia
01:00.1 Audio device [0403]: NVIDIA Corporation TU102 High Definition Audio Controller [10de:10f7] (rev a1)
        Subsystem: NVIDIA Corporation Device [10de:12fa]
        Kernel driver in use: vfio-pci
        Kernel modules: snd_hda_intel
01:00.2 USB controller [0c03]: NVIDIA Corporation TU102 USB 3.1 Host Controller [10de:1ad6] (rev a1)
        Subsystem: NVIDIA Corporation Device [10de:12fa]
        Kernel driver in use: vfio-pci
        Kernel modules: xhci_pci
01:00.3 Serial bus controller [0c80]: NVIDIA Corporation TU102 USB Type-C UCSI Controller [10de:1ad7] (rev a1)
        Subsystem: NVIDIA Corporation Device [10de:12fa]
        Kernel driver in use: vfio-pci
        Kernel modules: i2c_nvidia_gpu
  • VM working as expected
  • nvidia-smi not working, which is expected while vfio-pci owns the GPU:
    • “NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.”

vfio to host

echo "0000:01:00.0" | sudo tee /sys/bus/pci/drivers/vfio-pci/unbind
echo "0000:01:00.1" | sudo tee /sys/bus/pci/drivers/vfio-pci/unbind
echo "0000:01:00.2" | sudo tee /sys/bus/pci/drivers/vfio-pci/unbind
echo "0000:01:00.3" | sudo tee /sys/bus/pci/drivers/vfio-pci/unbind
sudo modprobe nvidia
sudo modprobe nvidia_modeset
sudo modprobe nvidia_drm
sudo modprobe nvidia_uvm

echo "0000:01:00.0" | sudo tee /sys/bus/pci/drivers/nvidia/bind

sudo modprobe snd_hda_intel
echo "0000:01:00.1" | sudo tee /sys/bus/pci/drivers/snd_hda_intel/bind
sudo modprobe xhci_pci
echo "0000:01:00.2" | sudo tee /sys/bus/pci/drivers/xhci_hcd/bind

For the Serial Bus Controller (01:00.3): This device is typically not manually bound like the others since its functionality is often managed internally by the kernel or other modules. If needed, ensure i2c_nvidia_gpu is loaded (though this is not commonly bound/unbound manually).
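The "vfio to host" steps above could be wrapped into one function along these lines. This is only a sketch under the assumptions of this thread (card at 0000:01:00, same drivers and module order); DRY_RUN and the helper names are invented for the example, and by default it only prints what it would do:

```bash
#!/usr/bin/env bash
# Sketch: hand the GPU from vfio-pci back to the host drivers.
# DRY_RUN and all function names are hypothetical.
GPU=0000:01:00
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the steps, 0 = execute them

# Write a PCI address into a sysfs bind/unbind file (real mode uses sudo tee).
sysfs_write() {
    local value="$1" path="$2"
    if [ "$DRY_RUN" = 1 ]; then
        echo "echo $value > $path"
    else
        echo "$value" | sudo tee "$path" > /dev/null
    fi
}

load_module() {
    if [ "$DRY_RUN" = 1 ]; then echo "modprobe $1"; else sudo modprobe "$1"; fi
}

gpu_to_host() {
    local fn mod
    for fn in 0 1 2 3; do
        sysfs_write "$GPU.$fn" /sys/bus/pci/drivers/vfio-pci/unbind
    done
    # Same load order as in the manual steps.
    for mod in nvidia nvidia_modeset nvidia_drm nvidia_uvm; do
        load_module "$mod"
    done
    sysfs_write "$GPU.0" /sys/bus/pci/drivers/nvidia/bind
    load_module snd_hda_intel
    sysfs_write "$GPU.1" /sys/bus/pci/drivers/snd_hda_intel/bind
    load_module xhci_pci
    sysfs_write "$GPU.2" /sys/bus/pci/drivers/xhci_hcd/bind
}
```

Calling gpu_to_host prints the thirteen steps for review; DRY_RUN=0 gpu_to_host would run them. The function does not stop on errors, because a bind can fail harmlessly when a freshly loaded driver has already claimed the device on its own.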

lspci -nnk | grep -A 3 "NVIDIA"

01:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti Rev. A] [10de:1e07] (rev a1)
        Subsystem: NVIDIA Corporation Device [10de:12fa]
        Kernel driver in use: nvidia
        Kernel modules: nouveau, nvidia_drm, nvidia
01:00.1 Audio device [0403]: NVIDIA Corporation TU102 High Definition Audio Controller [10de:10f7] (rev a1)
        Subsystem: NVIDIA Corporation Device [10de:12fa]
        Kernel driver in use: snd_hda_intel
        Kernel modules: snd_hda_intel
01:00.2 USB controller [0c03]: NVIDIA Corporation TU102 USB 3.1 Host Controller [10de:1ad6] (rev a1)
        Subsystem: NVIDIA Corporation Device [10de:12fa]
        Kernel driver in use: xhci_hcd
        Kernel modules: xhci_pci
01:00.3 Serial bus controller [0c80]: NVIDIA Corporation TU102 USB Type-C UCSI Controller [10de:1ad7] (rev a1)
        Subsystem: NVIDIA Corporation Device [10de:12fa]
        Kernel modules: i2c_nvidia_gpu
  • VM not working, which is expected while the GPU is bound to the host
  • nvidia-smi and so on are working as expected.

host to vfio

echo "0000:01:00.2" | sudo tee /sys/bus/pci/drivers/xhci_hcd/unbind
echo "0000:01:00.1" | sudo tee /sys/bus/pci/drivers/snd_hda_intel/unbind
sudo rmmod nvidia_uvm
sudo rmmod nvidia_drm
sudo rmmod nvidia_modeset
sudo rmmod nvidia
echo "0000:01:00.0" | sudo tee /sys/bus/pci/drivers/vfio-pci/bind
echo "0000:01:00.1" | sudo tee /sys/bus/pci/drivers/vfio-pci/bind
echo "0000:01:00.2" | sudo tee /sys/bus/pci/drivers/vfio-pci/bind
echo "0000:01:00.3" | sudo tee /sys/bus/pci/drivers/vfio-pci/bind
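The "host to vfio" direction can be sketched the same way. Again this assumes the addresses and drivers from this thread; DRY_RUN and the helper names are invented for the example, and the default is to print the steps only. Unlike the manual list, the sketch stops as soon as a module refuses to unload, since rmmod fails while something on the host is still using the GPU:

```bash
#!/usr/bin/env bash
# Sketch: take the GPU away from the host drivers and give it to vfio-pci.
# DRY_RUN and all function names are hypothetical.
GPU=0000:01:00
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the steps, 0 = execute them

# Write a PCI address into a sysfs bind/unbind file (real mode uses sudo tee).
sysfs_write() {
    local value="$1" path="$2"
    if [ "$DRY_RUN" = 1 ]; then
        echo "echo $value > $path"
    else
        echo "$value" | sudo tee "$path" > /dev/null
    fi
}

unload_module() {
    if [ "$DRY_RUN" = 1 ]; then echo "rmmod $1"; else sudo rmmod "$1"; fi
}

gpu_to_vfio() {
    local fn mod
    sysfs_write "$GPU.2" /sys/bus/pci/drivers/xhci_hcd/unbind
    sysfs_write "$GPU.1" /sys/bus/pci/drivers/snd_hda_intel/unbind
    # Reverse of the load order; abort if a module is still in use.
    for mod in nvidia_uvm nvidia_drm nvidia_modeset nvidia; do
        unload_module "$mod" || return 1
    done
    for fn in 0 1 2 3; do
        sysfs_write "$GPU.$fn" /sys/bus/pci/drivers/vfio-pci/bind
    done
}
```

Calling gpu_to_vfio prints the ten steps; DRY_RUN=0 gpu_to_vfio would execute them. Binding by address works here only because vfio-pci was given the four device IDs in /etc/modprobe.d/vfio.conf.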

lspci -nnk | grep -A 3 "NVIDIA"

01:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti Rev. A] [10de:1e07] (rev a1)
        Subsystem: NVIDIA Corporation Device [10de:12fa]
        Kernel driver in use: vfio-pci
        Kernel modules: nouveau, nvidia_drm, nvidia
01:00.1 Audio device [0403]: NVIDIA Corporation TU102 High Definition Audio Controller [10de:10f7] (rev a1)
        Subsystem: NVIDIA Corporation Device [10de:12fa]
        Kernel driver in use: vfio-pci
        Kernel modules: snd_hda_intel
01:00.2 USB controller [0c03]: NVIDIA Corporation TU102 USB 3.1 Host Controller [10de:1ad6] (rev a1)
        Subsystem: NVIDIA Corporation Device [10de:12fa]
        Kernel driver in use: vfio-pci
        Kernel modules: xhci_pci
01:00.3 Serial bus controller [0c80]: NVIDIA Corporation TU102 USB Type-C UCSI Controller [10de:1ad7] (rev a1)
        Subsystem: NVIDIA Corporation Device [10de:12fa]
        Kernel driver in use: vfio-pci
        Kernel modules: i2c_nvidia_gpu

It seems virt-manager has to be restarted afterwards (reopen it if it was open before).

  • VM working as expected
  • nvidia-smi not working, which is expected while vfio-pci owns the GPU:
    • “NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.”

Notes:

modprobe has to be executed in this order!

sudo modprobe nvidia
sudo modprobe nvidia_modeset
sudo modprobe nvidia_drm
sudo modprobe nvidia_uvm

rmmod has to be executed in this order!

sudo rmmod nvidia_uvm
sudo rmmod nvidia_drm
sudo rmmod nvidia_modeset
sudo rmmod nvidia
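The two lists mirror each other: a module can only be removed once nothing that depends on it is still loaded, so the unload order is simply the load order reversed. A small sketch that derives one from the other, so only a single list has to be maintained; NVIDIA_MODULES and unload_order are illustrative names:

```bash
#!/usr/bin/env bash
# Load order used in this thread; the unload order is derived from it.
NVIDIA_MODULES=(nvidia nvidia_modeset nvidia_drm nvidia_uvm)

# Print the module names in reverse order, one per line.
unload_order() {
    local i
    for (( i = ${#NVIDIA_MODULES[@]} - 1; i >= 0; i-- )); do
        echo "${NVIDIA_MODULES[i]}"
    done
}
```

A loop such as for m in $(unload_order); do sudo rmmod "$m"; done would then unload the modules in the order given above.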