My recently built PC randomly reboots without warning. So far I’ve only seen this in EndeavourOS (EOS), not (yet) in my backup OS, MX Linux, which is on a different SSD.
I can give more specs if requested, but let’s start with this:
i9-13900K (with iGPU)
64 GB RAM
RTX 4070 (not Ti, not OC edition)
850 W Titanium PSU
Nothing is overclocked. The XMP profile is enabled so the RAM runs at its rated speed, not above it.
Nothing is getting hot. Neither sensors nor nvidia-smi ever shows more than barely warm, nowhere near “high.”
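To check whether temperatures or GPU power draw spike in the seconds before a reboot, readings could be appended to a file and synced to disk on every write. The following is only a sketch under assumptions: it assumes nvidia-smi and a version of lm-sensors that supports `sensors -j` (JSON output) are on the PATH, and the file name `sensor-log.jsonl` and the 5-second interval are arbitrary choices.

```python
import json
import os
import shutil
import subprocess
import time


def run_tool(cmd):
    """Run a command and return its stdout, or None if it is missing or fails."""
    if shutil.which(cmd[0]) is None:
        return None
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=10)
    except (subprocess.TimeoutExpired, OSError):
        return None
    return result.stdout.strip() if result.returncode == 0 else None


def read_gpu():
    """GPU temperature (C) and power draw (W) as a CSV string from nvidia-smi."""
    return run_tool(["nvidia-smi", "--query-gpu=temperature.gpu,power.draw",
                     "--format=csv,noheader,nounits"])


def read_sensors():
    """Parsed JSON from `sensors -j` (assumes lm-sensors with JSON support)."""
    out = run_tool(["sensors", "-j"])
    try:
        return json.loads(out) if out else None
    except json.JSONDecodeError:
        return None


def append_record(path, record):
    """Append one JSON line and fsync so it survives a sudden reset."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
        f.flush()
        os.fsync(f.fileno())


def main(path="sensor-log.jsonl", interval=5):
    """Log forever; the last line in the file is the last reading before a reboot."""
    while True:
        append_record(path, {
            "time": time.strftime("%Y-%m-%dT%H:%M:%S"),
            "gpu": read_gpu(),
            "sensors": read_sensors(),
        })
        time.sleep(interval)
```

Calling `main()` in a terminal and leaving it running would do; after a reboot, the timestamp of the last line also shows roughly when the machine went down.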
Plenty of fans. Arctic Liquid Freezer II 280 for CPU cooler. Working great.
I am using the CPU’s iGPU for my desktop (Plasma/X11). The 4070 is currently only used in containers for AI/ML work. I do not want to disable the iGPU.
Reboot symptoms:
Sometimes it happens after hours of use; the other day it happened 30 seconds after I logged in. There is no warning, and I can’t find any logs after the reboot, if there are any. It has always happened at idle or near idle, never (yet) under any significant load.
But when I boot into MX Linux (where I haven’t installed the NVIDIA drivers), I haven’t had any reboots (yet). Yesterday I let it run for 12 hours in MX. It probably would have rebooted under EOS in that time. That’s impossible to prove, since the reboots are frequent but don’t happen every day.
Googling this, the usual first suspect is “bad hardware.” I hope not, I don’t have money to throw at new parts.
I used the PC for a month or so (in EOS) before I installed the 4070. I don’t recall any reboots until after it was installed. I can’t remember whether the reboots started before or after I installed nvidia-dkms, which I didn’t get to immediately.
I’ve checked/reseated all the connections to the 4070. Ran memtest until it showed PASS. Looking for ideas.
Is it possible for spontaneous reboots to be software-related rather than a hardware problem? Or, if there is a hardware problem such as a poorly seated connection, could it stay hidden as long as the hardware is not accessed? (E.g. no NVIDIA drivers in MX.)
The 4070 is on a 6" riser cable. I am not able to seat it directly in the motherboard due to other case/component modifications. I realize risers can be suspect; that would be one piece of hardware I could potentially replace, if I had a good reason to suspect it.
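One cheap check on the riser would be to see whether the card negotiated its full PCIe link width. The kernel exposes this in sysfs as current_link_width, max_link_width, current_link_speed and max_link_speed under the device directory. The sketch below just reads those files; the PCI address in the usage note is hypothetical and would need to be replaced with the 4070's real address from lspci.

```python
from pathlib import Path

LINK_FILES = ("current_link_speed", "max_link_speed",
              "current_link_width", "max_link_width")


def read_link(device_dir):
    """Read PCIe link speed/width attributes from a sysfs device directory."""
    device_dir = Path(device_dir)
    return {name: (device_dir / name).read_text().strip() for name in LINK_FILES}


def width_degraded(link):
    """True if fewer lanes were negotiated than the device itself supports."""
    return int(link["current_link_width"]) < int(link["max_link_width"])
```

For example, `width_degraded(read_link("/sys/bus/pci/devices/0000:01:00.0"))` (hypothetical address). A reduced width is only a hint, not proof: the slot itself may provide fewer lanes than the card supports. The speed values are left uncompared on purpose, because GPUs commonly drop link speed at idle to save power.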
Could it be software getting confused about which GPU to use? I did have some early problems where, if my monitor was off when I started the PC, I’d never get video when I turned the monitor on. My guess is that it tried to use the 4070 (which is not connected to a monitor). I fixed that by adding the video= kernel parameter in GRUB to specify the output and mode. But Plasma sometimes has trouble waking the monitor after screen locking (so I’ve now disabled the lock timeouts).
Of course, I don’t know for sure it’s related to the 4070. But that seems like a good place to start. Open to ideas, especially any way to log what’s happening so I can check it after the reboot.
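For checking after a reboot, the end of the previous boot's journal can be pulled with journalctl -b -1. The sketch below assumes the systemd journal is persistent (otherwise the previous boot is not kept). The two marker strings are a heuristic assumption about what a normal shutdown leaves behind, not an exhaustive list.

```python
import subprocess

# Heuristic markers of an orderly shutdown (assumed, not exhaustive).
CLEAN_MARKERS = ("Journal stopped", "Shutting down.")


def previous_boot_tail(count=50):
    """Return the last `count` journal lines from the previous boot."""
    result = subprocess.run(
        ["journalctl", "-b", "-1", "-n", str(count), "--no-pager", "-o", "short-iso"],
        capture_output=True, text=True,
    )
    result.check_returncode()
    return result.stdout.splitlines()


def ended_cleanly(lines):
    """True if any line contains a marker of an orderly shutdown."""
    return any(marker in line for line in lines for marker in CLEAN_MARKERS)
```

If `ended_cleanly(previous_boot_tail())` is False, the journal simply stops mid-stream, which points to a hard reset (power, hardware fault, or a kernel lockup) rather than anything that asked the system to reboot. The last few lines before the cutoff are then the ones worth reading.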