I updated my system today using yay -Syu, and ever since, my Nvidia drivers don’t start on boot. I get the dreaded NVIDIA-SMI message:
NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
This has happened to me before: it was a day when I updated in the very short window between a new Linux kernel release and a new driver release. Waiting a few hours and updating again fixed it.
However, this is not one of those times. I’ve waited for like 7 hours and there might be something wrong with my system, especially considering the fact that my friend who has the same hardware, the same linux kernel and the same nvidia-utils version doesn’t have any issues.
The most disturbing part, to me, is that my drivers work if I reinstall any of the nvidia packages (e.g. yay -Syu nvidia-utils). So it’s just that the drivers don’t properly start on boot.
I haven’t found anything online, so I turned to gulp AI (any solution is a good solution when you really wanna play deadlock) and it told me to modify my dracut and grub configs to force nvidia drivers to be started when booting. It did not work (shocker…).
Do you have any idea how to fix this? Do I just wait for another update? If you need more information, please ask.
With my dracut and grub configurations edited: https://0x0.st/PM8_.txt (I sifted through it, and it has more “no nvidia gpu found” errors, plus a little more at the bottom)
The first boot log would seem to treat the RTX 3060 as nonexistent, which is strange.
I ask about dual booting, because the Wiki mentions having Eco Mode active in Windows may cause the GPU to be undetectable.
Note: When dual booting on a system with hybrid graphics, enabling Windows or third-party apps Eco mode (like ASUS Eco mode) may fully disable the NVIDIA discrete GPU, making it undetectable.
That’s interesting, though it’s perhaps unexpected that the issue would only crop up now if, as you say, you haven’t booted into Windows in a year.
Did you make any changes to the BIOS?
I also wonder if you might try booting the USB live ISO, selecting the Nvidia boot option, and seeing whether it correctly loads the official Nvidia drivers on startup.
If it fails there, then that might confirm an issue outside of the OS at least.
I haven’t touched my BIOS since installing EOS, no. I don’t think I even have Eco mode enabled on Windows, though it may enable itself when I unplug the cable. So the hypothesis would be that, when booting, Linux acts as if the computer is on battery and enables Windows’ Eco mode?
About the USB live ISO, can I boot it without a flash drive? I don’t think I have one; the one I used to install EOS was lent to me last year. I might be able to get another one tomorrow.
It’s also possible this is an Nvidia Optimus issue, in which case what you’re encountering may be normal; I’m not sure. I’ll have to admit I’m not experienced with Nvidia Optimus.
In short, it’s using the integrated AMD GPU on boot, as the more efficient GPU. The Nvidia one may only come online when accessed?
Edit: You could try (temporarily) disabling or lowering the priority of the integrated GPU in the BIOS, if that’s an option available to you?
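One way to check the "only comes online when accessed" idea is to look at the runtime power-management state the kernel reports for the card. The following is a minimal sketch, not a confirmed diagnosis: it assumes the standard sysfs layout (vendor, power/runtime_status and power/control under /sys/bus/pci/devices), and the function name and its optional root argument are just illustrative helpers.

```shell
#!/usr/bin/env bash
# List NVIDIA PCI devices and their runtime power-management state.
# The optional argument exists only so the function can be pointed at a
# test directory; on a real system the default /sys is what you want.
nvidia_power_state() {
    local sysfs_root="${1:-/sys}"
    local dev vendor
    for dev in "$sysfs_root"/bus/pci/devices/*; do
        [ -r "$dev/vendor" ] || continue
        vendor=$(cat "$dev/vendor")
        # 0x10de is NVIDIA's PCI vendor ID.
        [ "$vendor" = "0x10de" ] || continue
        printf '%s status=%s control=%s\n' \
            "$(basename "$dev")" \
            "$(cat "$dev/power/runtime_status" 2>/dev/null || echo unknown)" \
            "$(cat "$dev/power/control" 2>/dev/null || echo unknown)"
    done
}

nvidia_power_state
```

If the card shows status=suspended right after boot and flips to active once something touches it, that would be consistent with the GPU simply being powered down while idle. Note that the card's HDMI audio function may show up as a second NVIDIA device.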
Okay, so new information: I don’t actually need to reinstall nvidia-utils for it to work; that was a coincidence. I just need to wait for the drivers to come up. I was checking before and after reinstalling, and the reinstall obviously took some time, enough time for the drivers to come up.
HOWEVER, they still take like 1 to 5 minutes to start, which, even if not a dealbreaker, is pretty suspicious and may indicate something’s wrong with them? I always check my nvidia drivers first thing after rebooting after a big update (out of trauma from that one time they broke), so I definitely hit nvidia-smi in the time window they were down.
I restarted and tried immediately launching Zed, which runs on my GPU, and it couldn’t start. Just a few seconds later my fastfetch was showing my GPU… This is so weird. I don’t know what’s causing this inconsistency in delay for my GPU to be detected.
Maybe you’re right and the GPU is off until something tries to use it? But in that case the first attempt fails and only then does the GPU power up, in a “whoopsie-daisy” manner?
EDIT: I tried again more rigorously, and the GPU does in fact boot up IMMEDIATELY after Zed fails to start. So it IS the driver not detecting the GPU until it is needed somewhere…
I’m kind of lost, but at least the problem, though still there, is not as bad as it first seemed. Thanks for your time anyway!! At least I can sleep soundly knowing my system is not irreparably broken.
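For anyone wanting to put a number on that delay instead of eyeballing it, here is a rough sketch of a polling helper. The function name wait_until_ok is made up for this example; it retries any command once per second and reports how long it took. Keep in mind that the probe itself counts as an access, so with nvidia-smi this measures how long the GPU takes to wake after being poked, not how long it would have stayed asleep.

```shell
#!/usr/bin/env bash
# Poll a command once per second until it succeeds or the timeout expires.
# Usage: wait_until_ok TIMEOUT_SECONDS COMMAND [ARGS...]
# Prints how long it took; returns 0 on success, 1 on timeout.
wait_until_ok() {
    local timeout="$1"; shift
    local waited=0
    until "$@" >/dev/null 2>&1; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "still failing after ${waited}s"
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    echo "succeeded after ${waited}s"
}

# Example: run right after login, before anything else touches the GPU.
# wait_until_ok 300 nvidia-smi
```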
Thanks for the resources, I read through them and tried implementing as much as I could (editing the dracut configuration again, setting up Nvidia Optimus rules) and nothing changed.
Ultimately, I fixed my issue by just slamming nvidia-smi into my .bash_profile; that way, my GPU gets woken up immediately after login.
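For reference, a slightly more defensive version of that workaround might look like the excerpt below. This is a sketch under a few assumptions: the wrapper function name is invented here, and running nvidia-smi in the background and discarding its output are additions so that a slow wake-up does not hold up the login. It also only helps if your session actually reads ~/.bash_profile, which bash does for login shells.

```shell
# ~/.bash_profile (excerpt)
# Touch the NVIDIA driver once at login so the GPU is awake before the first
# real application needs it. Does nothing if nvidia-smi is not installed.
wake_nvidia_gpu() {
    command -v nvidia-smi >/dev/null 2>&1 || return 0
    # Background subshell: a slow wake-up should not delay the login prompt.
    ( nvidia-smi >/dev/null 2>&1 & )
    return 0
}

wake_nvidia_gpu
```

This papers over the symptom and does not explain why the first access fails, so it is worth removing again if a later driver update fixes the underlying behaviour.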