Nvidia again

So after reinstalling my Asus ROG Strix 3 laptop, I was working with optimus-manager… until it stopped working. It simply refused to load nvidia, or the GUI for optimus-manager. So, looking at the wiki, I followed the instructions to set up optimus-switch. Using this I can switch to Intel, and it will start sddm and I can see something. But when running the nvidia driver I just get a plain black screen from sddm.


Edit: and whilst running on the Intel GPU it is getting really hot, like hot enough to have to take it off my lap.

Tried the LTS kernel as well as standard 5.14.3. No difference.

And whilst looking at the logs, I noticed another issue: the CRC errors on the 2TB SSD are stacking up. In the reinstall I set it to btrfs, encrypted; could that be a cause? It is only the /home drive, so that shouldn’t affect the system’s nvidia loading.

KDE Partition Manager: SMART Status Report

Date: 15/09/21 10:27 PM
Program version: 21.08.1
Backend: pmsfdiskbackendplugin (1)
KDE Frameworks version: 5.85.0
Machine: Linux onyx-rogstrixg531gt 5.14.3-arch1-1 #1 SMP PREEMPT Sun, 12 Sep 2021 20:51:34 +0000 x86_64

SMART status: good
Model: CT2000MX500SSD1
Serial number: 1929E21426BD
Firmware revision: M3CR023
Temperature: 36° C / 97° F
Bad sectors: none
Powered on for: 7157:00:00
Power cycles: 2,465
Self tests: Success
Overall assessment: Healthy

1 Read Error Rate
Rate of hardware read errors while reading data from the disk surface.
Pre-Failure Online 100 100 0 0x000000000000 N/A 0
5 Reallocated Sectors Count
Count of reallocated sectors. When the hard drive finds a read/write/verification error, it marks this sector as "reallocated" and transfers data to a special reserved area (spare area).
Old-Age Online 100 100 10 0x000000000000 good 0 sectors
9 Power-On Hours
Count of hours in power-on state.
Old-Age Online 100 100 0 0x000000001bf5 N/A 7157:00:00
12 Power Cycle Count
Count of full hard disk power on/off cycles.
Old-Age Online 100 100 0 0x0000000009a1 N/A 2,465
171 SSD Program Fail Count
Number of flash program operation failures since the drive was deployed.
Old-Age Online 100 100 0 0x000000000000 N/A 0
172 SSD Erase Fail Count
Number of flash erase operation failures since the drive was deployed.
Old-Age Online 100 100 0 0x000000000000 N/A 0
173 SSD Wear Leveling Count
Counts the maximum worst erase count on any block.
Old-Age Online 99 99 0 0x000000000012 N/A N/A
174 SSD Unexpected power loss count
Number of shutdowns without STANDBY IMMEDIATE as the last command (regardless of PLI activity using capacitor power). Normalized value is always 100.
Old-Age Online 100 100 0 0x000000000083 N/A N/A
180 SSD Unused Reserved Block Count Total
"Pre-Fail" HP attribute.
Pre-Failure Online -1 -1 0 0x00000000004f N/A 79
183 SATA Downshift Error Count
Western Digital and Samsung attribute.
Old-Age Online 100 100 0 0x00000000000e N/A 14
184 End-to-End Error
Part of HP's SMART IV technology: After transferring through the cache RAM data buffer the parity data between the host and the hard drive did not match.
Old-Age Online 100 100 0 0x000000000000 N/A 0
187 Reported Uncorrectable Errors
Count of errors that could not be recovered using hardware ECC.
Old-Age Online 100 100 0 0x000000000000 N/A 0 sectors
194 Temperature
Current internal temperature.
Old-Age Online 35 64 0 0x004100000024 N/A 36° C / 97° F
196 Reallocation Event Count
Count of remap operations. The raw value of this attribute shows the total number of attempts to transfer data from reallocated sectors to a spare area.
Old-Age Online 100 100 0 0x000000000000 N/A 0
197 Current Pending Sector Count
Number of "unstable" sectors (waiting to be remapped, because of read errors).
Old-Age Online 100 100 0 0x000000000000 N/A 0 sectors
198 Uncorrectable Sector Count
Count of uncorrectable errors when reading/writing a sector.
Old-Age Offline 100 100 0 0x000000000000 N/A 0 sectors
199 UltraDMA CRC Error Count
Count of errors in data transfer via the interface cable as determined by ICRC.
Old-Age Online 100 100 0 0x00000000006a N/A 106
202 Data Address Mark Errors
Number of Data Address Mark errors (or vendor-specific).
Old-Age Offline 99 99 1 0x000000000001 good 1
206 Flying Height
Height of heads above the disk surface. A flying height that is too low increases the chances of a head crash while a flying height that is too high increases the chances of a read/write error.
Old-Age Online 100 100 0 0x000000000000 N/A N/A
210
Old-Age Online 100 100 0 0x000000000000 N/A N/A
246
Old-Age Online 100 100 0 0x0004f8e9807c N/A N/A
247
Old-Age Online 100 100 0 0x000014b0a5c6 N/A N/A
248
Old-Age Online 100 100 0 0x000021bc4726 N/A N/A
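
For anyone reading the raw column in the report above: the raw values are hexadecimal, and the last column is their decimal form. Here is a minimal sketch in Python using values copied from this report. The helper names `decode_raw` and `low_byte` are made up for illustration, and reading the temperature from the lowest byte is an assumption that happens to match the 36° C shown, not something the report documents.

```python
# Raw SMART values copied from the report above (attribute id -> raw hex string).
RAW_VALUES = {
    9: "0x000000001bf5",    # Power-On Hours
    12: "0x0000000009a1",   # Power Cycle Count
    183: "0x00000000000e",  # SATA Downshift Error Count
    199: "0x00000000006a",  # UltraDMA CRC Error Count
    194: "0x004100000024",  # Temperature
}


def decode_raw(raw_hex):
    """Convert a raw hex string to an integer (hypothetical helper)."""
    return int(raw_hex, 16)


def low_byte(raw_hex):
    """Return the lowest byte; ASSUMED here to hold the current temperature."""
    return decode_raw(raw_hex) & 0xFF


# Decode every plain counter; the temperature attribute is handled separately.
decoded = {attr: decode_raw(raw) for attr, raw in RAW_VALUES.items() if attr != 194}
temperature_c = low_byte(RAW_VALUES[194])

for attr, value in sorted(decoded.items()):
    print(f"attribute {attr}: {value}")
print(f"attribute 194 (low byte): {temperature_c} C")
```

Going by the description in the report itself, attribute 199 counts errors in data transfer over the interface, so a rising count there points more towards the link to the drive than towards btrfs or encryption.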

Getting late and I need my beauty sleep. 'Night all :purple_heart:

Well I have wasted too many hours on this, but at least the SSD is not throwing CRC errors now, as the machine hasn’t been on long enough to get hot.
Tried to go back to scratch for Optimus Manager, as that did work once upon a time.

The current situation:
Still no nvidia working.

[onyx@onyx-rogstrixg531gt etc]$ sudo dmesg -T | grep nvidia
[sudo] password for onyx: 
[Thu Sep 16 14:19:00 2021] nvidia: module license 'NVIDIA' taints kernel.
[Thu Sep 16 14:19:00 2021] nvidia-nvlink: Nvlink Core is being initialized, major device number 235
[Thu Sep 16 14:19:00 2021] nvidia 0000:01:00.0: enabling device (0006 -> 0007)
[Thu Sep 16 14:19:00 2021] nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[Thu Sep 16 14:19:00 2021] nvidia: probe of 0000:01:00.0 failed with error -1
[Thu Sep 16 14:19:00 2021] nvidia-nvlink: Unregistered the Nvlink Core, major device number 235
[Thu Sep 16 14:19:11 2021] nvidia-nvlink: Nvlink Core is being initialized, major device number 235
[Thu Sep 16 14:19:11 2021] nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=none,decodes=none:owns=none
[Thu Sep 16 14:19:11 2021] nvidia: probe of 0000:01:00.0 failed with error -1
[Thu Sep 16 14:19:11 2021] nvidia-nvlink: Unregistered the Nvlink Core, major device number 235
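
The line that matters in the output above is the repeated "probe of 0000:01:00.0 failed with error -1". Here is a small sketch for pulling just those lines out of a longer log. It assumes the `dmesg -T` line format shown above; the sample text is a subset of the lines pasted in this post, and `find_probe_failures` is a hypothetical helper name.

```python
import re

# A subset of the dmesg lines pasted above.
DMESG = """\
[Thu Sep 16 14:19:00 2021] nvidia: module license 'NVIDIA' taints kernel.
[Thu Sep 16 14:19:00 2021] nvidia 0000:01:00.0: enabling device (0006 -> 0007)
[Thu Sep 16 14:19:00 2021] nvidia: probe of 0000:01:00.0 failed with error -1
[Thu Sep 16 14:19:11 2021] nvidia: probe of 0000:01:00.0 failed with error -1
"""

# Matches the timestamp, PCI address and error code of a failed nvidia probe.
PROBE_FAILED = re.compile(
    r"^\[(?P<when>[^\]]+)\] nvidia: probe of (?P<pci>\S+) "
    r"failed with error (?P<code>-?\d+)$"
)


def find_probe_failures(log):
    """Return (timestamp, pci_address, error_code) for each failed probe line."""
    failures = []
    for line in log.splitlines():
        match = PROBE_FAILED.match(line)
        if match:
            failures.append(
                (match.group("when"), match.group("pci"), int(match.group("code")))
            )
    return failures


failures = find_probe_failures(DMESG)
for when, pci, code in failures:
    print(f"{when}: device {pci} probe failed ({code})")
```

This only shows that the module loads and then fails to bind to the device each time; it says nothing about why the probe fails.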

[onyx@onyx-rogstrixg531gt etc]$ inxi -G --display
Graphics:  Device-1: Intel CoffeeLake-H GT2 [UHD Graphics 630] driver: i915 v: kernel 
           Device-2: NVIDIA TU117M [GeForce GTX 1650 Mobile / Max-Q] driver: N/A 
           Display: x11 server: X.org 1.20.13 driver: loaded: intel unloaded: fbdev,modesetting,vesa 
           resolution: <missing: xdpyinfo> 
           OpenGL: renderer: Mesa Intel UHD Graphics 630 (CFL GT2) v: 4.6 Mesa 21.2.1 

[onyx@onyx-rogstrixg531gt etc]$ pacman -Q | grep nvidia
lib32-nvidia-utils 470.63.01-1
nvidia-dkms 470.63.01-1
nvidia-installer-db 2.4.25-1
nvidia-installer-dkms 3.3.9-1
nvidia-settings 470.63.01-1
nvidia-utils 470.63.01-1
[onyx@onyx-rogstrixg531gt etc]$ optimus-manager --switch hybrid
ERROR: the latest GPU setup attempt failed at Xorg pre-start hook.
Log at /var/log/optimus-manager/switch/switch-20210916T141900.log
Cannot execute command because of previous errors.
[onyx@onyx-rogstrixg531gt etc]$ cat /var/log/optimus-manager/switch/switch-20210916T141900.log
[10] INFO: # Xorg pre-start hook
[10] INFO: Previous state was: {'type': 'pending_pre_xorg_start', 'requested_mode': 'nvidia', 'current_mode': None}
[10] INFO: Requested mode is: nvidia
[869] INFO: Available modules: ['nouveau', 'bbswitch', 'acpi_call', 'nvidia', 'nvidia_drm', 'nvidia_modeset', 'nvidia_uvm']
[869] INFO: Unloading modules ['nouveau'] (if loaded)
[872] INFO: Loading module acpi_call
[875] INFO: Setting GPU power to ON via acpi_call
[876] INFO: Found saved ACPI strings
[876] INFO: Sending ACPI string \_SB.PCI0.PEG0.PEGP._ON
[876] INFO: ACPI string \_SB.PCI0.PEG0.PEGP._ON works, saving
[896] INFO: Loading module nvidia
[1698] ERROR: Xorg pre-start setup error
Traceback (most recent call last):
  File "/usr/lib/python3.9/site-packages/optimus_manager/kernel.py", line 245, in _load_module
    subprocess.check_call(
  File "/usr/lib/python3.9/subprocess.py", line 373, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'modprobe nvidia NVreg_UsePageAttributeTable=1' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3.9/site-packages/optimus_manager/hooks/pre_xorg_start.py", line 51, in main
    setup_kernel_state(config, prev_state, requested_mode)
  File "/usr/lib/python3.9/site-packages/optimus_manager/kernel.py", line 22, in setup_kernel_state
    _nvidia_up(config, hybrid=(requested_mode == "hybrid"))
  File "/usr/lib/python3.9/site-packages/optimus_manager/kernel.py", line 95, in _nvidia_up
    _load_nvidia_modules(config, available_modules)
  File "/usr/lib/python3.9/site-packages/optimus_manager/kernel.py", line 164, in _load_nvidia_modules
    _load_module(available_modules, "nvidia", options=nvidia_options)
  File "/usr/lib/python3.9/site-packages/optimus_manager/kernel.py", line 249, in _load_module
    raise KernelSetupError(f"Error running modprobe for {module}: {e.stderr}") from e
optimus_manager.kernel.KernelSetupError: Error running modprobe for nvidia: None
[1699] INFO: Removing /etc/X11/xorg.conf.d/10-optimus-manager.conf (if present)
[1699] INFO: Writing state {'type': 'pre_xorg_start_failed', 'switch_id': '20210916T141900', 'requested_mode': 'nvidia'}
[onyx@onyx-rogstrixg531gt etc]$ sudo modprobe nvidia
[sudo] password for onyx: 
modprobe: ERROR: could not insert 'nvidia': No such device
[onyx@onyx-rogstrixg531gt etc]$ uname -a
Linux onyx-rogstrixg531gt 5.10.64-1-lts #1 SMP Sun, 12 Sep 2021 09:11:21 +0000 x86_64 GNU/Linux

Anybody have a clue what is going on?

That’s not great!

Before that can you try:

prime-offload

I had a similar problem yesterday and this worked for switching. But after an update optimus also crashed. I decided to go with the nouveau driver :wink:

Thanks, but no difference, nothing reported.

[onyx@onyx-rogstrixg531gt etc]$ prime-offload
[onyx@onyx-rogstrixg531gt etc]$ sudo dmesg -T | grep nvidia
[sudo] password for onyx: 
[Thu Sep 16 14:19:00 2021] nvidia: module license 'NVIDIA' taints kernel.
[Thu Sep 16 14:19:00 2021] nvidia-nvlink: Nvlink Core is being initialized, major device number 235
[Thu Sep 16 14:19:00 2021] nvidia 0000:01:00.0: enabling device (0006 -> 0007)
[Thu Sep 16 14:19:00 2021] nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[Thu Sep 16 14:19:00 2021] nvidia: probe of 0000:01:00.0 failed with error -1
[Thu Sep 16 14:19:00 2021] nvidia-nvlink: Unregistered the Nvlink Core, major device number 235
[Thu Sep 16 14:19:11 2021] nvidia-nvlink: Nvlink Core is being initialized, major device number 235
[Thu Sep 16 14:19:11 2021] nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=none,decodes=none:owns=none
[Thu Sep 16 14:19:11 2021] nvidia: probe of 0000:01:00.0 failed with error -1
[Thu Sep 16 14:19:11 2021] nvidia-nvlink: Unregistered the Nvlink Core, major device number 235
[Thu Sep 16 14:27:05 2021] nvidia-nvlink: Nvlink Core is being initialized, major device number 235
[Thu Sep 16 14:27:05 2021] nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=none,decodes=none:owns=none
[Thu Sep 16 14:27:05 2021] nvidia: probe of 0000:01:00.0 failed with error -1
[Thu Sep 16 14:27:05 2021] nvidia-nvlink: Unregistered the Nvlink Core, major device number 235


Hmmm, so are we dealing with a kernel vs nvidia arm-wrestle then?
Edit: I am on the LTS kernel currently, so that shouldn’t be the case?

After prime-offload, do you still get an error for this?

optimus-manager --switch hybrid

Yeh, same as before.

Bummer! I assume you are currently on the linux kernel, not LTS, and/or have installed the corresponding linux-headers and linux-lts-headers.

I saw you installed nvidia-dkms

To be honest this part of nvidia install was always dubious to me.

I am currently in the LTS, running on the Intel built-in. Both headers are installed; LTS- and standard.
I understood from the wiki that I needed DKMS if I was going to use the LTS.
But thanks for checking!
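
As a rough illustration of the headers check discussed here: DKMS builds the module against the headers of each installed kernel, so each kernel needs its matching headers package. The sketch below maps a `uname -r` string to the headers package name. It assumes only the stock `linux` and `linux-lts` kernels, and the installed set is taken from what is stated in this post, not from real pacman output. `headers_package_for` is a hypothetical helper.

```python
def headers_package_for(kernel_release):
    """Guess the Arch headers package for a `uname -r` string.

    ASSUMPTION: only the stock `linux` and `linux-lts` kernels are considered.
    """
    if kernel_release.endswith("-lts"):
        return "linux-lts-headers"
    return "linux-headers"


# Kernel releases that appear in this thread.
KERNELS = ["5.10.64-1-lts", "5.14.3-arch1-1"]

# Taken from the statement in this post that both headers packages are installed.
INSTALLED = {"linux-headers", "linux-lts-headers"}

needed = {release: headers_package_for(release) for release in KERNELS}
missing = [pkg for pkg in needed.values() if pkg not in INSTALLED]

for release, pkg in needed.items():
    state = "installed" if pkg in INSTALLED else "MISSING"
    print(f"{release}: needs {pkg} ({state})")
```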

I am wondering if something went wrong during your kernel upgrade. I just read something about pacman hook and nvidia that may be useful.

https://wiki.archlinux.org/title/NVIDIA

Pacman hook
To avoid the possibility of forgetting to update initramfs after an NVIDIA driver upgrade, you may want to use a pacman hook:

/etc/pacman.d/hooks/nvidia.hook
[Trigger]
Operation=Install
Operation=Upgrade
Operation=Remove
Type=Package
Target=nvidia
Target=linux
# Change the linux part above and in the Exec line if a different kernel is used

[Action]
Description=Update Nvidia module in initcpio
Depends=mkinitcpio
When=PostTransaction
NeedsTargets
Exec=/bin/sh -c 'while read -r trg; do case $trg in linux) exit 0; esac; done; /usr/bin/mkinitcpio -P'

Make sure the Target package set in this hook is the one you have installed in steps above (e.g. nvidia, nvidia-dkms, nvidia-lts or nvidia-ck-something).
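
To illustrate that last point, here is a minimal sketch that compares the hook's Target lines against the nvidia packages listed earlier in this thread. It only does exact-name comparison (glob patterns in Target are not handled), the Exec line is left out of the sample, and the helper names are hypothetical.

```python
# Hook text as quoted above (Exec line omitted for brevity).
HOOK = """\
[Trigger]
Operation=Install
Operation=Upgrade
Operation=Remove
Type=Package
Target=nvidia
Target=linux

[Action]
Description=Update Nvidia module in initcpio
Depends=mkinitcpio
When=PostTransaction
NeedsTargets
"""

# Output of `pacman -Q | grep nvidia` as posted earlier in the thread.
PACMAN_Q = """\
lib32-nvidia-utils 470.63.01-1
nvidia-dkms 470.63.01-1
nvidia-installer-db 2.4.25-1
nvidia-installer-dkms 3.3.9-1
nvidia-settings 470.63.01-1
nvidia-utils 470.63.01-1
"""


def hook_targets(hook_text):
    """Collect every Target= value; the key repeats, so configparser is avoided."""
    targets = []
    for line in hook_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "Target":
            targets.append(value.strip())
    return targets


def installed_names(pacman_q_text):
    """Return the package names from `pacman -Q` style lines."""
    return {line.split()[0] for line in pacman_q_text.splitlines() if line.strip()}


targets = hook_targets(HOOK)
installed = installed_names(PACMAN_Q)

# The grep output only covers nvidia packages, so only driver targets are compared.
driver_targets = [t for t in targets if t.startswith("nvidia")]
unmatched = [t for t in driver_targets if t not in installed]

print("hook targets:", targets)
print("driver targets not installed under that exact name:", unmatched)
```

With the package list from this thread, the unmodified wiki hook would name `nvidia` while the installed driver package is `nvidia-dkms`, which is the kind of mismatch the note above warns about.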

I was just looking at preloading the module with mkinitcpio!
Thanks, will let you know…

Documenting the journey…
I added this to the /etc/mkinitcpio.conf:
MODULES="nvidia nvidia_modeset nvidia_uvm nvidia_drm"
then:

[onyx@onyx-rogstrixg531gt etc]$ sudo mkinitcpio -P
==> Building image from preset: /etc/mkinitcpio.d/linux-lts.preset: 'default'
  -> -k /boot/vmlinuz-linux-lts -c /etc/mkinitcpio.conf -g /boot/initramfs-linux-lts.img
==> Starting build: 5.10.64-1-lts
  -> Running build hook: [base]
  -> Running build hook: [udev]
  -> Running build hook: [autodetect]
  -> Running build hook: [modconf]
  -> Running build hook: [block]
==> WARNING: Possibly missing firmware for module: xhci_pci
  -> Running build hook: [keyboard]
  -> Running build hook: [keymap]
  -> Running build hook: [encrypt]
  -> Running build hook: [filesystems]
  -> Running build hook: [fsck]
==> Generating module dependencies
==> Creating zstd-compressed initcpio image: /boot/initramfs-linux-lts.img
==> Image generation successful
==> Building image from preset: /etc/mkinitcpio.d/linux-lts.preset: 'fallback'
  -> -k /boot/vmlinuz-linux-lts -c /etc/mkinitcpio.conf -g /boot/initramfs-linux-lts-fallback.img -S autodetect
==> Starting build: 5.10.64-1-lts
  -> Running build hook: [base]
  -> Running build hook: [udev]
  -> Running build hook: [modconf]
  -> Running build hook: [block]
==> WARNING: Possibly missing firmware for module: aic94xx
==> WARNING: Possibly missing firmware for module: wd719x
==> WARNING: Possibly missing firmware for module: xhci_pci
  -> Running build hook: [keyboard]
  -> Running build hook: [keymap]
  -> Running build hook: [encrypt]
  -> Running build hook: [filesystems]
  -> Running build hook: [fsck]
==> Generating module dependencies
==> Creating zstd-compressed initcpio image: /boot/initramfs-linux-lts-fallback.img
==> Image generation successful
==> Building image from preset: /etc/mkinitcpio.d/linux.preset: 'default'
  -> -k /boot/vmlinuz-linux -c /etc/mkinitcpio.conf -g /boot/initramfs-linux.img
==> Starting build: 5.14.3-arch1-1
  -> Running build hook: [base]
  -> Running build hook: [udev]
  -> Running build hook: [autodetect]
  -> Running build hook: [modconf]
  -> Running build hook: [block]
==> WARNING: Possibly missing firmware for module: xhci_pci
  -> Running build hook: [keyboard]
  -> Running build hook: [keymap]
  -> Running build hook: [encrypt]
==> WARNING: Possibly missing firmware for module: qat_4xxx
  -> Running build hook: [filesystems]
  -> Running build hook: [fsck]
==> Generating module dependencies
==> Creating zstd-compressed initcpio image: /boot/initramfs-linux.img
==> Image generation successful
==> Building image from preset: /etc/mkinitcpio.d/linux.preset: 'fallback'
  -> -k /boot/vmlinuz-linux -c /etc/mkinitcpio.conf -g /boot/initramfs-linux-fallback.img -S autodetect
==> Starting build: 5.14.3-arch1-1
  -> Running build hook: [base]
  -> Running build hook: [udev]
  -> Running build hook: [modconf]
  -> Running build hook: [block]
==> WARNING: Possibly missing firmware for module: aic94xx
==> WARNING: Possibly missing firmware for module: wd719x
==> WARNING: Possibly missing firmware for module: xhci_pci
  -> Running build hook: [keyboard]
  -> Running build hook: [keymap]
  -> Running build hook: [encrypt]
==> WARNING: Possibly missing firmware for module: qat_4xxx
  -> Running build hook: [filesystems]
  -> Running build hook: [fsck]
==> Generating module dependencies
==> Creating zstd-compressed initcpio image: /boot/initramfs-linux-fallback.img
==> Image generation successful

The pacman hook was already in place, very similar to the one mentioned here: https://wiki.archlinux.org/title/NVIDIA, except its target was nvidia-dkms.
Will reboot shortly.
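
A quick way to double-check the MODULES line added above is to parse it back out of the file. This is a simplified sketch, not a full shell evaluation: it accepts both the quoted form and the parenthesised array form, ignores trailing comments, and runs on a sample fragment instead of the real /etc/mkinitcpio.conf. `parse_modules` is a hypothetical helper.

```python
# Sample fragment only, not the full file from this machine.
SAMPLE_CONF = """\
# sample fragment
MODULES="nvidia nvidia_modeset nvidia_uvm nvidia_drm"
"""

# The four modules named in the post above.
EXPECTED = ["nvidia", "nvidia_modeset", "nvidia_uvm", "nvidia_drm"]


def parse_modules(conf_text):
    """Return module names from the last uncommented MODULES= line.

    The last assignment is used because a later assignment would override
    earlier ones when the file is sourced by the shell.
    """
    modules = []
    for line in conf_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("MODULES="):
            value = stripped[len("MODULES="):]
            # Drop surrounding parentheses or quotes, then split on whitespace.
            modules = value.strip("()\"'").split()
    return modules


found = parse_modules(SAMPLE_CONF)
missing_modules = [name for name in EXPECTED if name not in found]

print("modules configured:", found)
print("expected but missing:", missing_modules)
```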

You could throw Nvidia in the trash and start again with a fresh install of nvidia. Or, as @Zircon34 did, just stick with Nouveau.

sudo nvidia-installer-dkms -n

Just as a short note: in my opinion, the best way to install nvidia when starting from scratch is with sudo nvidia-installer-dkms. Then, before rebooting, add all the additional enhancements from the wiki, which includes regenerating the kernel modules and adding the hook or the hook package from the EndeavourOS repo… Then reboot and install optimus-manager.

Edit: Looks like maybe you got it back working!

What is the difference from this?

sudo pacman -S nvidia-dkms

Just wondering, because I did not know it was available through pacman, but I more or less discovered that it was.

Now I’m confused about what you’re asking.

Edit: I think that installs the installer package from the eos repo.

Actually it is in the eos repo. Shoot, I tried to run the command in a terminal to check what package it is, and now it’s installing nvidia over nouveau.

Ok, so after reboot, nothing different. Will rip out nvidia as you suggest, and go at it again.

I was just wondering about the difference between nvidia-installer-dkms, which can be installed directly from the eos repo, and pacman -S nvidia-dkms.

nvidia-installer-dkms is a script package in the eos repo that installs nvidia-dkms. Or you can do it the other way. That is my understanding, anyway.