Ladies, Gentlemen, Others, I have seen Hell. I have been fighting with this system for two months, and I have done absolutely everything I can think of. I have never had this level of problems, and I’ve used Linux exclusively for a decade. Here’s… just everything I can think of.
The current situation: My DisplayPort monitor does not function, and my Nvidia graphics card is stuck drawing only 45W of power. These issues are not present on Windows. They are present on Endeavour (w/ and w/o nvidia drivers), Nobara-Live, and Endeavour-Live (w/ and w/o nvidia drivers); only the DisplayPort issue is present on PopOS-Live.
Attempting to connect the DP monitor results in a constant connect/disconnect/reconnect loop, rendering the system unusable. When the DP output is disabled, the system correctly detects that the DP monitor is a copy of the HDMI monitor (1080p resolution on both). Connecting it limits the output to 1024x768, and no video ever shows on the screen regardless of the resolution set.
Output of
inxi -FAZ --no-host | eos-sendlog
Issue initially presented as only being the inability to draw more than 45W of power. A brief summary of previous troubleshooting for that:
- Installed and tested every major version of nvidia-dkms and nvidia-open-dkms.
- Installed and tested every major kernel release in conjunction with the above, no resolution.
- Installed Windows. (Eewww.) Confirmed issue is present on Windows.
- Dell provided a motherboard replacement. Temporarily resolved issue.
- Dell provided a battery replacement. Resolved issue on Windows, but introduced overheating issue.
- Dell provided a heatsink replacement. Resolved issue on Windows. Initially also resolved on Linux… sort of.
Following this, Linux worked correctly, but only seemingly at random: multiple reboots would be necessary, but eventually the system would boot up and everything would be completely functional. This lasted ~3 days. Now my system is again stuck at 45W of power (as reported by nvidia-smi). Additionally, my DisplayPort external monitor no longer functions. I have been completely unable to fix this on Linux. Windows works perfectly, and the motherboard, heatsink and battery have all been replaced… so I’m utterly baffled.
Current TS:
- Reinstalled and tested the following kernel/nvidia driver combinations:
  - 6.1.67.1-lts/nvidia-dkms 545.29.06 (previously working configuration)
  - 6.6.6/nvidia-dkms 545.29.06
  - 6.6.7/nvidia-dkms 545.29.06
  - 6.6.7/nvidia-dkms 525.85.05
  - 6.6.6/nvidia-dkms 525.85.05
  - 6.1.67.1-lts/nvidia-dkms 525.85.05
  - 6.1.67.1-lts/nouveau
Kernels installed via AKM. Nvidia packages via:
sudo downgrade nvidia-dkms nvidia-utils nvidia-settings lib32-nvidia-utils libxnvctrl nvidia-prime
The journal has errors. The most relevant-looking ones appear to be the following (it’s possible I’m missing something; I’m no expert):
Dec 19 04:56:27 zeta-gehenna kernel: i915 0000:00:02.0: [drm] *ERROR* Link Training Unsuccessful
and
Dec 19 04:53:05 zeta-gehenna (udev-worker)[642]: nvidia: Process '/usr/bin/bash -c '/usr/bin/mknod -Z -m 666 /dev/nvidiactl c $(grep nvidia-frontend /proc/devices | cut -d \ -f 1) 255'' failed with exit code 1.
Dec 19 04:53:05 zeta-gehenna (udev-worker)[639]: nvidia_modeset: Process '/usr/bin/bash -c '/usr/bin/mknod -Z -m 666 /dev/nvidia-modeset c $(grep nvidia-frontend /proc/devices | cut -d \ -f 1) 254'' failed with exit code 1.
Dec 19 04:53:05 zeta-gehenna (udev-worker)[642]: nvidia: Process '/usr/bin/bash -c 'for i in $(cat /proc/driver/nvidia/gpus/*/information | grep Minor | cut -d \ -f 4); do /usr/bin/mknod -Z -m 666 /dev/nvidia${i} c $(grep nvidia-frontend /proc/devices | cut -d \ -f 1) ${i}; done'' failed with exit code 1.
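For anyone wanting to pull the same kind of lines out of a long journal dump, here is a minimal Python sketch. It assumes the journal has already been saved as plain text, and it only looks for the two substrings that appear in the errors quoted above. The names `MARKERS` and `suspicious_lines` are made up for illustration, and the second sample line is invented filler, not taken from my journal.

```python
# Substrings taken from the journal lines quoted above; extend as needed.
MARKERS = ("*ERROR*", "failed with exit code")


def suspicious_lines(journal_text):
    """Return the journal lines that contain any of the MARKERS substrings."""
    return [
        line
        for line in journal_text.splitlines()
        if any(marker in line for marker in MARKERS)
    ]


# First line is quoted from the journal; the second is invented filler.
sample = (
    "Dec 19 04:56:27 zeta-gehenna kernel: i915 0000:00:02.0: [drm] *ERROR* Link Training Unsuccessful\n"
    "Dec 19 04:56:28 zeta-gehenna systemd[1]: Started example.service.\n"
)

for line in suspicious_lines(sample):
    print(line)  # prints only the i915 "Link Training Unsuccessful" line
```

This is a plain substring filter, so it will miss anything that doesn't happen to use those exact phrases. It is a way to shorten the haystack, not a diagnosis.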
Output of uname -r:
[zeta@zeta-gehenna ~]$ uname -r
6.1.67-1-lts
Output of nvidia-smi:
Tue Dec 19 06:01:48 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.29.06 Driver Version: 545.29.06 CUDA Version: 12.3 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3070 ... On | 00000000:01:00.0 On | N/A |
| N/A 39C P8 15W / 45W | 35MiB / 8192MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 4246 G /usr/lib/Xorg 14MiB |
+---------------------------------------------------------------------------------------+
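To compare the power cap across boots or distros without eyeballing the table each time, something like the following Python sketch could be used. It assumes the Pwr:Usage/Cap cell looks like the one in the output above (two whole numbers, each followed by W). The function name `power_usage_and_cap` is hypothetical, and it returns None if the cell is missing or shows N/A.

```python
import re

# Matches the "Pwr:Usage/Cap" cell, e.g. "15W / 45W".
POWER_CELL = re.compile(r"(\d+)W\s*/\s*(\d+)W")


def power_usage_and_cap(smi_text):
    """Return (usage_watts, cap_watts) from nvidia-smi table text, or None."""
    match = POWER_CELL.search(smi_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# Row copied from the nvidia-smi output above.
sample = "| N/A 39C P8 15W / 45W | 35MiB / 8192MiB | 0% Default |"
print(power_usage_and_cap(sample))  # prints (15, 45)
```

On the broken boots the second number is 45; on the PopOS live session it was in the 115-140 range, so the cap value alone is enough to tell a good boot from a bad one.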
Output of xrandr (w/ DP connected):
Screen 0: minimum 16 x 16, current 5504 x 1600, maximum 32767 x 32767
HDMI-A-1 connected primary 1920x1080+0+0 (normal left inverted right x axis y axis) 600mm x 340mm
1920x1080 59.96*+
1440x1080 59.99
1400x1050 59.98
1280x1024 59.89
1280x960 59.94
1152x864 59.96
1024x768 59.92
800x600 59.86
640x480 59.38
320x240 59.52
1680x1050 59.95
1440x900 59.89
1280x800 59.81
1152x720 59.97
960x600 59.63
928x580 59.88
800x500 59.50
768x480 59.90
720x480 59.71
640x400 59.95
320x200 58.96
1600x900 59.95
1368x768 59.88
1280x720 59.86
1024x576 59.90
864x486 59.92
720x400 59.55
640x350 59.77
eDP-2 connected 2560x1600+1920+0 (normal left inverted right x axis y axis) 340mm x 220mm
2560x1600 164.97*+
2048x1536 164.95
1920x1440 164.87
1600x1200 164.88
1440x1080 164.85
1400x1050 164.77
1280x1024 164.81
1280x960 164.80
1152x864 164.77
1024x768 164.99
800x600 164.72
640x480 164.26
320x240 163.69
1920x1200 164.87
1680x1050 164.82
1440x900 164.74
1280x800 164.81
1152x720 164.77
960x600 164.72
928x580 164.51
800x500 164.66
768x480 164.47
720x480 164.67
640x400 164.67
320x200 161.83
2560x1440 164.90
2048x1152 164.90
1920x1080 164.83
1600x900 164.77
1368x768 164.98
1280x720 164.71
1024x576 164.94
864x486 164.71
720x400 164.23
640x350 164.46
DP-1 connected 1024x768+4480+0 (normal left inverted right x axis y axis) 600mm x 340mm
1024x768 59.92*+
800x600 59.86
640x480 59.38
320x240 59.52
960x600 59.63
928x580 59.88
800x500 59.50
768x480 59.90
720x480 59.71
640x400 59.95
320x200 58.96
1024x576 59.90
864x486 59.92
720x400 59.55
640x350 59.77
Note: Internal monitor, which is also driven by the Intel GPU, works perfectly fine provided the DP Monitor is not connected.
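The xrandr listing above is long, so here is a small Python sketch that reduces it to one entry per connected output with its current resolution. It assumes the header lines have the same shape as the ones above (name, "connected", optional "primary", then WxH+X+Y). The names `OUTPUT_LINE` and `active_outputs` are made up for illustration.

```python
import re

# Header line of each output, e.g. "DP-1 connected 1024x768+4480+0 (...)".
OUTPUT_LINE = re.compile(
    r"^(?P<name>\S+) connected (?:primary )?(?P<w>\d+)x(?P<h>\d+)\+\d+\+\d+"
)


def active_outputs(xrandr_text):
    """Map each connected, active output name to its current (width, height)."""
    outputs = {}
    for line in xrandr_text.splitlines():
        match = OUTPUT_LINE.match(line.strip())
        if match:
            outputs[match["name"]] = (int(match["w"]), int(match["h"]))
    return outputs


# Header lines copied from the xrandr output above (mode lines omitted).
sample = """\
HDMI-A-1 connected primary 1920x1080+0+0 (normal left inverted right x axis y axis) 600mm x 340mm
eDP-2 connected 2560x1600+1920+0 (normal left inverted right x axis y axis) 340mm x 220mm
DP-1 connected 1024x768+4480+0 (normal left inverted right x axis y axis) 600mm x 340mm
"""

for name, (width, height) in active_outputs(sample).items():
    print(f"{name}: {width}x{height}")
```

With the output above this makes the mismatch easy to see: the HDMI monitor comes up at 1920x1080 while the same kind of monitor on DP-1 comes up at 1024x768.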
The full (mostly unedited) journal from my most recent boot (the one I’m writing this on) can be found here: https://pastebin.com/XN1yqS5s
- Attempted blacklisting i915 with
i915.modeset=0
This fixes the journal entry errors, but the screen remains unusable and the Nvidia GPU is still stuck at 45W.
- Wayland/X11 doesn’t make a difference.
- Confirmed that nvidia-powerd.service is enabled and running, no resolution.
- Tried several different Live USBs (via Ventoy). None of them fix the DP monitor, but PopOS actually gives me a correctly detected 115-140W of power to the Nvidia GPU.
- Booted Windows again. (shudders in disgust). Windows works correctly on all fronts. DP Monitor works perfectly, Nvidia GPU correctly utilizes power.
- Upgraded BIOS from 1.20.0 to 1.21.0, no resolution.
- Downgraded BIOS from 1.21.0 to 1.14.0, no resolution. Fewer issues in journal though.
- Removed all nvidia-related software with ‘pacman -Rnsc’, reinstalled. No resolution.
- Confirmed monitor in question does work correctly in Linux if connected via HDMI (Nvidia driven) port.
- Removed docking station from mix, no effect on issue one way or another. (Docking station is not driving any display, to be clear; it’s essentially a glorified USB hub.)
- Googled absolutely everything I can think of, including ArchWiki, these forums, Arch Forums (they’re not as nice over there I’ve noticed), Reddit. Haven’t found anything that seems to fit.
- Changed BIOS options to drive external displays with Nvidia GPU. No resolution. Does break things even harder though.
- In conjunction with the above used
envycontrol -s nvidia
envycontrol --dm sddm
No resolution. Set back to Hybrid mode to get back to a bootable system.
I’ve… I’ve done literally everything I can think of to do. At this point I’d suspect another hardware issue, but the motherboard (and therefore the CPU/GPU, given how Dell replaces), heatsink and battery are all brand new. And Windows works perfectly. I don’t understand how I can suddenly just… start having issues. Only on Linux. If I wanted to go with Windows, I could have a working system, but I’ve been a Linux Only soul for a decade now, and frankly it just doesn’t make sense.
Any ideas or help would be appreciated. I’m fresh out. (Edit: Apologies for the edits. I keep remembering things I forgot to mention, and TS I’ve tried. It’s been a long two months.)