GPU Fans Ramp to the Moon And Stay There Without Reason

Any insight or ideas would be highly appreciated. I’ve had this issue for a few weeks now. Hardware: GTX 1660 Super, Ryzen 7 5800X. I suspect the issue may be related to the fact that my system seems unaware of my GPU’s fan RPM, but I don’t know how to resolve that or whether it’s the cause. inxi output (https://clbin.com/kxLGy):

System:
  Kernel: 5.15.13-arch1-1 x86_64 bits: 64 compiler: gcc v: 11.1.0
    parameters: BOOT_IMAGE=/boot/vmlinuz-linux root=UUID=747fa68f-2809-49b6-96ba-f0ded570b8b4 rw
    quiet loglevel=3 nowatchdog nvme_load=YES nvidia-drm.modeset=1
  Console: pty pts/0 DM: SDDM Distro: EndeavourOS base: Arch Linux
Machine:
  Type: Desktop System: ASUS product: N/A v: N/A serial: <superuser required>
  Mobo: ASUSTeK model: PRIME B550M-A (WI-FI) v: Rev X.0x serial: <superuser required>
    UEFI: American Megatrends v: 2423 date: 08/09/2021
CPU:
  Info: model: AMD Ryzen 7 5800X bits: 64 type: MT MCP arch: Zen 3 family: 0x19 (25)
    model-id: 0x21 (33) stepping: 0 microcode: 0xA201016
  Topology: cpus: 1x cores: 8 tpc: 2 threads: 16 smt: enabled cache: L1: 512 KiB
    desc: d-8x32 KiB; i-8x32 KiB L2: 4 MiB desc: 8x512 KiB L3: 32 MiB desc: 1x32 MiB
  Speed (MHz): avg: 2753 high: 3094 min/max: 2200/4850 boost: enabled scaling:
    driver: acpi-cpufreq governor: schedutil cores: 1: 2960 2: 2570 3: 2724 4: 2762 5: 2877 6: 2402
    7: 2593 8: 2435 9: 3094 10: 2871 11: 2873 12: 2874 13: 2870 14: 2870 15: 2869 16: 2414
    bogomips: 121424
  Flags: avx avx2 ht lm nx pae sse sse2 sse3 sse4_1 sse4_2 sse4a ssse3 svm
  Vulnerabilities:
  Type: itlb_multihit status: Not affected
  Type: l1tf status: Not affected
  Type: mds status: Not affected
  Type: meltdown status: Not affected
  Type: spec_store_bypass mitigation: Speculative Store Bypass disabled via prctl and seccomp
  Type: spectre_v1 mitigation: usercopy/swapgs barriers and __user pointer sanitization
  Type: spectre_v2
    mitigation: Full AMD retpoline, IBPB: conditional, IBRS_FW, STIBP: always-on, RSB filling
  Type: srbds status: Not affected
  Type: tsx_async_abort status: Not affected
Graphics:
  Device-1: NVIDIA TU116 [GeForce GTX 1660 SUPER] vendor: Gigabyte driver: nvidia v: 495.46
    alternate: nouveau,nvidia_drm bus-ID: 0a:00.0 chip-ID: 10de:21c4 class-ID: 0300
  Display: server: X.org 1.21.1.3 compositor: kwin_x11 driver: loaded: nvidia
    unloaded: modesetting alternate: fbdev,nouveau,nv,vesa tty: 102x27
  Message: Advanced graphics data unavailable in console. Try -G --display
Audio:
  Device-1: NVIDIA TU116 High Definition Audio vendor: Gigabyte driver: snd_hda_intel v: kernel
    bus-ID: 0a:00.1 chip-ID: 10de:1aeb class-ID: 0403
  Device-2: AMD Starship/Matisse HD Audio vendor: ASUSTeK driver: snd_hda_intel v: kernel
    bus-ID: 0c:00.4 chip-ID: 1022:1487 class-ID: 0403
  Sound Server-1: ALSA v: k5.15.13-arch1-1 running: yes
  Sound Server-2: JACK v: 1.9.19 running: no
  Sound Server-3: PulseAudio v: 15.0 running: no
  Sound Server-4: PipeWire v: 0.3.43 running: yes
Network:
  Device-1: Intel Wi-Fi 6 AX200 driver: iwlwifi v: kernel bus-ID: 08:00.0 chip-ID: 8086:2723
    class-ID: 0280
  IF: wlan0 state: up mac: <filter>
  Device-2: Realtek RTL8111/8168/8411 PCI Express Gigabit Ethernet
    vendor: ASUSTeK PRIME B450M-A driver: r8169 v: kernel port: f000 bus-ID: 09:00.0
    chip-ID: 10ec:8168 class-ID: 0200
  IF: enp9s0 state: down mac: <filter>
  IF-ID-1: wg-mullvad state: unknown speed: N/A duplex: N/A mac: N/A
Bluetooth:
  Device-1: Intel AX200 Bluetooth type: USB driver: btusb v: 0.8 bus-ID: 1-5:2 chip-ID: 8087:0029
    class-ID: e001
  Report: rfkill ID: hci0 rfk-id: 0 state: up address: see --recommends
Drives:
  Local Storage: total: 21.11 TiB used: 8.05 TiB (38.1%)
  SMART Message: Unable to run smartctl. Root privileges required.
  ID-1: /dev/sda maj-min: 8:0 vendor: PNY model: SSD2SC120G1SA754D117-820 size: 111.79 GiB
    block-size: physical: 512 B logical: 512 B speed: 6.0 Gb/s type: SSD serial: <filter> rev: 0A
    scheme: GPT
  ID-2: /dev/sdb maj-min: 8:16 vendor: Samsung model: SSD 850 PRO 1TB size: 953.87 GiB
    block-size: physical: 512 B logical: 512 B speed: 6.0 Gb/s type: SSD serial: <filter> rev: 2B6Q
    scheme: GPT
  ID-3: /dev/sdc maj-min: 8:32 vendor: Seagate model: ST10000NM0086-2AA101 size: 9.1 TiB
    block-size: physical: 4096 B logical: 512 B speed: 6.0 Gb/s type: HDD rpm: 7200
    serial: <filter> rev: SN05 scheme: GPT
  ID-4: /dev/sdd maj-min: 8:48 vendor: Utania model: OOS10000G size: 9.1 TiB block-size:
    physical: 4096 B logical: 512 B speed: 6.0 Gb/s type: HDD rpm: 7200 serial: <filter> rev: OOS1
    scheme: GPT
  ID-5: /dev/sde maj-min: 8:64 type: USB vendor: Lexar model: JumpDrive size: 59.62 GiB
    block-size: physical: 512 B logical: 512 B type: N/A serial: <filter> rev: 1.00 scheme: MBR
  SMART Message: Unknown USB bridge. Flash drive/Unsupported enclosure?
  ID-6: /dev/sdf maj-min: 8:80 type: USB vendor: Western Digital model: WD My Passport 25E1
    size: 1.82 TiB block-size: physical: 512 B logical: 512 B type: N/A serial: <filter> rev: 1021
    scheme: GPT
Partition:
  ID-1: / raw-size: 111.49 GiB size: 109.18 GiB (97.93%) used: 19.74 GiB (18.1%) fs: ext4
    dev: /dev/sda2 maj-min: 8:2
  ID-2: /boot/efi raw-size: 300 MiB size: 299.4 MiB (99.80%) used: 296 KiB (0.1%) fs: vfat
    dev: /dev/sda1 maj-min: 8:1
Swap:
  Alert: No swap data was found.
Sensors:
  System Temperatures: cpu: N/A mobo: N/A gpu: nvidia temp: 30 C
  Fan Speeds (RPM): N/A gpu: nvidia fan: 0%
Info:
  Processes: 331 Uptime: 4h 41m wakeups: 0 Memory: 31.32 GiB used: 1.87 GiB (6.0%) Init: systemd
  v: 250 tool: systemctl Compilers: gcc: 11.1.0 Packages: pacman: 1252 lib: 419 Shell: Bash
  v: 5.1.16 running-in: pty pts/0 (SSH) inxi: 3.3.11

When playing any game (Yuzu, Genshin Impact, GTAV), after a certain amount of time my GPU fans instantly spike to 100% and stay there. Even if I close the game/app, the fans stay maxed out until I restart the computer. I have MangoHUD running in all of my games, and I can see that my GPU temperatures never get anywhere near high enough to warrant this. It seems like once the card reaches a certain (relatively low) temperature threshold, maybe around 55c, it just goes into killer mode and stays there for no reason.

I reinstalled Windows to test, and that was the only solution I’ve found for this issue (no fan problems on Win10).

Issue is present:

  • On Manjaro, Endeavour, Pop!_OS
  • On kernels 5.10, 5.13, and 5.15
  • On nvidia drivers 495, 475, and 470

I’ve also tried setting the “Coolbits” option to various values; however, utilities like GreenWithEnvy seem unable to see the fan RPM. I know the Coolbits option is otherwise working, because it did unlock the overclocking options. Attempting to set a manual fan profile has no effect.
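For reference, a manual fan request can also be issued from the command line, which takes GUI tools like GreenWithEnvy out of the picture. Below is a minimal sketch that only builds and runs the nvidia-settings command. It assumes Coolbits includes the fan-control bit, that DISPLAY points at the running X session, and that the card's fan is addressed as fan 0 on GPU 0; the function names are made up for this example.

```python
import subprocess


def fan_control_args(percent, gpu=0, fan=0):
    """Build the nvidia-settings argv that requests a fixed fan speed."""
    if not 0 <= percent <= 100:
        raise ValueError("fan speed must be between 0 and 100 percent")
    return [
        "nvidia-settings",
        "-a", f"[gpu:{gpu}]/GPUFanControlState=1",        # switch to manual control
        "-a", f"[fan:{fan}]/GPUTargetFanSpeed={percent}",  # requested speed in percent
    ]


def release_fan_args(gpu=0):
    """Build the argv that hands fan control back to the driver."""
    return ["nvidia-settings", "-a", f"[gpu:{gpu}]/GPUFanControlState=0"]


def apply(args):
    """Run the command and return whatever nvidia-settings printed."""
    done = subprocess.run(args, capture_output=True, text=True)
    return done.stdout + done.stderr
```

Calling print(apply(fan_control_args(40))) while the fans are stuck would show whether the driver accepts the assignment at all. If the assignment is reported as successful and the fans still don't change, that would point further away from the GUI tools.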

Please let me know if there are any tests I should run, or logs I should collect when launching some of these games. I’m not entirely sure where or how to capture information about what’s happening.
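One way to capture what happens around the spike is to log the driver's own readings while a game runs. The following is a rough sketch, not a polished tool: it polls nvidia-smi every couple of seconds and appends temperature, reported fan percentage, load, power draw and performance state to a CSV file. The file name gpu_log.csv and the function names are arbitrary choices, and some fields may come back as N/A on a given card or driver.

```python
import csv
import os
import subprocess
import time

# Fields accepted by `nvidia-smi --query-gpu`; `nvidia-smi --help-query-gpu`
# lists what the installed driver supports.
FIELDS = ["timestamp", "temperature.gpu", "fan.speed",
          "utilization.gpu", "power.draw", "pstate"]


def parse_row(line):
    """Split one CSV line from nvidia-smi into a {field: value} dict."""
    values = [v.strip() for v in line.split(",")]
    return dict(zip(FIELDS, values))


def sample():
    """Take one reading from the first GPU via nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=" + ",".join(FIELDS),
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_row(out.splitlines()[0])


def log_forever(path="gpu_log.csv", interval=2.0):
    """Append one reading every `interval` seconds until interrupted."""
    new_file = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        if new_file:
            writer.writeheader()
        while True:
            writer.writerow(sample())
            fh.flush()  # keep the file current in case of a hard reboot
            time.sleep(interval)
```

Start log_forever() in a terminal before launching a game and stop it with Ctrl+C after the fans spike. The rows around the spike should show the temperature and performance state the driver reported at that moment.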

My first question would be: do you have the fan connected to the proper CPU fan header on the motherboard?

These are the fans on the shroud of the 1660S, so the only plug on the card is the 6 pin PCI-E plug. I did open the case and confirm everything was seated correctly and plugged in without issue.

I installed Win10 to make sure it wasn’t a hardware issue, after the problem affected all the Linux flavors I listed. On the same hardware, with nothing touched, Win10 didn’t have the issue. I was also able to run this card on Linux for the last year or more; the problem only began a few weeks ago, and I don’t know what started it.

I’ve also checked BIOS settings and there’s nothing there that I can see that should impact this.

So it’s the GPU fans you were talking about. Funny you mentioned it, because I have a GTX 1060 and I have never seen its fans run. Supposedly they don’t even come on until it reaches 50 degrees? I don’t know if they even work. :thinking: It seems bizarre, but I have no issues with it paired with a 5 GHz Intel i7-8086K. It’s very quiet, even with the case fans. I don’t think it has ever gotten hot enough.

Edit: I also just wanted to mention that on my Ryzen 3800X I have a Gigabyte RX 590 8 GB card, and it has lighting on the top of the card that says Gigabyte and Fan Stop. The RGB lighting lights up Fan Stop when the fans are off.

I think the fans are running when it’s under 50c, pretty sure. I’m able to run the computer without issue until I start stressing the graphics card hard enough by playing a game.

It’s happened before in Yuzu, but it’s actually very rare for Yuzu to spike it. It seems more GPU demanding things are quicker to set it off - seems like it’s hitting a threshold, spiking the fans to 100%, and then staying there for some reason. I know for sure I’ve never seen MangoHUD show the temperature go over 60c, on anything.

I feel like this is possibly related, but don’t know the cause:

Sensors:
System Temperatures: cpu: N/A mobo: N/A gpu: nvidia temp: 30 C
Fan Speeds (RPM): N/A gpu: nvidia fan: 0%

GreenWithEnvy and other utilities have also been unable to identify the fan speed/usage.

That is strange, and so is my Nvidia card, whose fans aren’t even running as far as I can tell. One of these days I’m going to look and see if they actually are. :thinking: I don’t play games, so the card is really overkill considering the processor also has Intel HD 630 graphics.

I’d check for a BIOS update, and a VBIOS update, and if it happens with several different distros (and in Windows?) then it might be a hardware issue and therefore worth an RMA.

The BIOS is the latest on the motherboard. Maybe the graphics card can be updated? That’s one thing I have never done in all my years.

Wait - are you having a problem with your system?

No, I just meant his motherboard is already on the latest BIOS update. Sorry, I should have worded that differently. But I am interested in the graphics card update, as that is something I have never done, not being a gamer and all.

I’m not sure I can update the firmware on the graphics card directly. After searching Nvidia’s website and doing a quick online search, I don’t think there’s anything available for the 1660 Super. When I went back to Windows 10 for a day to test, I downloaded the latest 497 drivers from Nvidia’s website and installed them. I’m not sure whether a firmware update could have been bundled with that, but regardless, the issue was still present when I returned to EndeavourOS.

Jonathon, just to clarify, Windows doesn’t exhibit this problem. For right now it’s been every Linux distro I’ve attempted (Manjaro, Endeavour, Pop!)

I’m really curious if there’s some sort of package, configuration, or whatever I could use for Linux to see the fan RPM off of the graphics card. I have no way of knowing for sure but I feel like that could be related.
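As far as I know, nvidia-settings can be queried from a terminal for the same fan attributes its GUI shows, which makes it easy to check them repeatedly before and after the spike. Here is a small sketch under the assumption that the card's fan is exposed as fan 0 and that DISPLAY points at the running X session. The attribute names are the ones listed by nvidia-settings -q all, and the helper names are invented for illustration.

```python
import subprocess

# Attribute names as listed by `nvidia-settings -q all`; which ones a
# given card and driver actually report can vary.
FAN_ATTRS = ["GPUCurrentFanSpeed", "GPUCurrentFanSpeedRPM", "GPUTargetFanSpeed"]


def query_args(attr, fan=0):
    """Build the argv for a terse (-t) query of one fan attribute."""
    return ["nvidia-settings", "-t", "-q", f"[fan:{fan}]/{attr}"]


def read_fan(fan=0):
    """Return {attribute: int or None} for the given fan index."""
    result = {}
    for attr in FAN_ATTRS:
        done = subprocess.run(query_args(attr, fan),
                              capture_output=True, text=True)
        text = done.stdout.strip()
        # Anything that is not a plain number is treated as "not reported".
        result[attr] = int(text) if text.isdigit() else None
    return result
```

If read_fan() keeps returning 0 or None even while the fans are audibly at full speed, the driver itself is not getting a reading from the card. That would match what GreenWithEnvy and inxi show.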

What do you see in nvidia-settings? This is what mine looks like:

Have you installed sensors-detect?

Do you mean lm_sensors?

Might be worth a try. After installing the package run sensors-detect as root, so all sensors are being detected and a config file is written.

As with GreenWithEnvy and the inxi output, nvidia-settings is also unable to determine the fan speed. I get a generic error when attempting to upload my screenshot, but my temperature was 31c, and under “fan information” the ID, RPM, and % are all 0. My “Control type” and “Cooling target” are both the same as yours. The GPU fan settings below are disabled, and unlike yours, mine also shows “0” for fan 0 speed.

I confirmed I had lm_sensors installed and ran through sensors-detect as well, answering “yes” to all and rebooting.

Did you also execute sudo systemctl enable sensord.service before rebooting?
This service is necessary for monitoring sensors.

Ok, well this is interesting, after making sure the service was enabled and rebooting, this is what I get when checking the status of sensord.service

× sensord.service - Log hardware monitoring data
     Loaded: loaded (/usr/lib/systemd/system/sensord.service; enabled; vendor preset: disabled)
     Active: failed (Result: exit-code) since Wed 2022-01-12 06:59:32 CST; 6min ago
    Process: 696 ExecStart=/usr/sbin/sensord -i $INTERVAL -l $LOG_INTERVAL -f daemon (code=exited, status=127)
        CPU: 367us

Jan 12 06:59:32 HTPC systemd[1]: Starting Log hardware monitoring data...
Jan 12 06:59:32 HTPC sensord[696]: /usr/sbin/sensord: error while loading shared libraries: librrd.so.8: cannot open shared object file: No such file or directory
Jan 12 06:59:32 HTPC systemd[1]: sensord.service: Control process exited, code=exited, status=127/n/a
Jan 12 06:59:32 HTPC systemd[1]: sensord.service: Failed with result 'exit-code'.
Jan 12 06:59:32 HTPC systemd[1]: Failed to start Log hardware monitoring data.

Looks like “librrd” is an issue with this service, currently trying to see if there’s a dependency or something I’m missing.

Edit: Installing “rrdtool” resolved the service status error

● sensord.service - Log hardware monitoring data
     Loaded: loaded (/usr/lib/systemd/system/sensord.service; enabled; vendor preset: disabled)
     Active: active (running) since Wed 2022-01-12 07:11:31 CST; 4s ago
    Process: 2632 ExecStart=/usr/sbin/sensord -i $INTERVAL -l $LOG_INTERVAL -f daemon (code=exited, status=0/SUCCESS)
   Main PID: 2633 (sensord)
      Tasks: 1 (limit: 38391)
     Memory: 1.2M
        CPU: 4ms
     CGroup: /system.slice/sensord.service
             └─2633 /usr/sbin/sensord -i 60s -l 30m -f daemon

Jan 12 07:11:31 HTPC systemd[1]: Starting Log hardware monitoring data...
Jan 12 07:11:31 HTPC sensord[2633]: sensord started
Jan 12 07:11:31 HTPC sensord[2633]: Chip: k10temp-pci-00c3
Jan 12 07:11:31 HTPC sensord[2633]: Adapter: PCI adapter
Jan 12 07:11:31 HTPC sensord[2633]:   Tctl: 26.6 C
Jan 12 07:11:31 HTPC sensord[2633]:   Tccd1: 35.0 C
Jan 12 07:11:31 HTPC sensord[2633]: Chip: iwlwifi_1-virtual-0
Jan 12 07:11:31 HTPC sensord[2633]: Adapter: Virtual device
Jan 12 07:11:31 HTPC sensord[2633]:   temp1: 49.0 C
Jan 12 07:11:31 HTPC systemd[1]: Started Log hardware monitoring data.
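The sensord output above lists only k10temp (CPU) and iwlwifi, which fits my understanding that lm_sensors reads the kernel's hwmon devices and that the proprietary NVIDIA driver does not register one. Treat that last part as an assumption rather than a confirmed fact. A quick way to see exactly which hwmon chips and fan inputs the kernel exposes is sketched below; the function name is made up for this example.

```python
from pathlib import Path


def list_hwmon(root="/sys/class/hwmon"):
    """Return a list of (chip name, [fan input files]) for each hwmon device."""
    chips = []
    for dev in sorted(Path(root).glob("hwmon*")):
        name_file = dev / "name"
        # Fall back to the directory name if the chip has no "name" file.
        name = name_file.read_text().strip() if name_file.exists() else dev.name
        fans = sorted(p.name for p in dev.glob("fan*_input"))
        chips.append((name, fans))
    return chips
```

Running print(list_hwmon()) on the affected machine should list the same chips as sensord. If no entry belongs to the GPU, lm_sensors has nothing to read for the card's fans, and the fan data has to come from the NVIDIA tools instead.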

I found and attempted this fix, in addition to running nvidia-xconfig --cool-bits=12, but there is still no improvement: https://github.com/NVIDIA/nvidia-settings/issues/65#issuecomment-832921061
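For anyone wondering what a Coolbits value of 12 actually enables: the option is a bitmask, and the short sketch below decodes it. The bit meanings are paraphrased from my recollection of NVIDIA's driver README, so check them against the README shipped with your driver version.

```python
# Coolbits bit values and what they unlock (paraphrased, see the lead-in).
COOLBITS = {
    1: "clock controls for older (pre-Fermi) GPUs",
    2: "SLI between GPUs with different amounts of video memory",
    4: "manual GPU fan speed control",
    8: "extra overclocking settings on the PowerMizer page",
    16: "overvoltage control via nvidia-settings",
}


def decode_coolbits(value):
    """Return the descriptions of every feature bit set in `value`."""
    return [desc for bit, desc in COOLBITS.items() if value & bit]
```

decode_coolbits(12) returns the entries for 4 and 8, i.e. manual fan control plus the PowerMizer overclocking settings. So 12 should already cover fan control, and raising the value further is unlikely to be the missing piece.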

I’m going to undo those changes to Xwrapper since it didn’t resolve the problem and I also don’t want X running as root every time I turn the PC on.

To make things more confusing/weird, sometimes after the fan has spiked and I reboot, the fan noticeably drops in RPM but still runs high (maybe 50-70% fan speed). Even after that reboot, the fan sometimes stays at that speed until I reboot once again, at which point I can hear it go silent.

Absolutely no idea what’s going on here, I really don’t want to format this PC again much less stick it back on Windows…

This could also be a driver bug - try the nvidia-beta 510 in the AUR.

I can/will attempt that, but it still seems odd considering I’ve tried the 470, 475, and 495 drivers without change.

Away from the house right now but will test when back.