Random restarts (but maybe not on other OS?)

My recently-built PC randomly reboots without warning. But I’ve only ever noticed this in EOS, not (yet) in my backup OS, MX Linux, on a different SSD.

I can give more specs if requested, but let’s start with this:

i9-13900KF (with iGPU)
64GB ram
RTX 4070 (not ti, not OC edition)
850W Titanium PS

Nothing is overclocked. XMP profile to run RAM at rated speed, not over.

Nothing is getting hot. sensors and nvidia-smi never show more than barely warm, not even close to “high.”

Plenty of fans. Arctic Liquid Freezer II 280 for CPU cooler. Working great.

I am using the CPU’s iGPU for my desktop (Plasma/X11.) The 4070 is currently only being used in containers for AI/ML stuff. I do not want to disable the iGPU.

Reboot symptoms:

Sometimes it happens after hours of use; the other day it happened 30 seconds after I logged in. There’s no warning, and no logs I can find after the reboot, if there are any. It happens at idle or near idle, and has never yet happened while there was any significant load.

But when I boot to MX Linux (where I haven’t installed nvidia drivers,) I haven’t had any reboots (yet.) Yesterday I let it run 12 hours in MX. Probably would have rebooted under Eos in that time. Impossible to prove since it doesn’t happen every day, but frequently.

Googling this, the usual first suspect is “bad hardware.” I hope not, I don’t have money to throw at new parts.

I used the PC for a month or so (in Eos) before I installed the 4070. I don’t recall any reboots until after it was installed. Can’t remember whether that was before or after installing nvidia-dkms, which I didn’t get to immediately.

I’ve checked/reseated all the connections to the 4070. Ran memtest until it showed PASS. Looking for ideas.

Is it possible for spontaneous reboots to be software-related, not hardware problems? Or if there are hardware problems such as poorly-seated connections, that it wouldn’t be a problem if the hardware is not accessed? (E.g. no nvidia drivers in MX.)

The 4070 is on a 6" riser cable. I am not able to seat it directly in the motherboard due to other case/component modifications. I realize risers can be suspect, that would be one piece of hardware I could potentially replace, if I had a good reason to suspect it.

Could it be software getting confused about which GPU to use? I did have some early problems, where if my monitor was off when I started the PC, I’d never get video when I turned the monitor on. Guessing it tried to use the 4070 (which is not connected to a monitor.) I fixed that by adding the video kernel param in GRUB to specify the output and mode. But sometimes Plasma has trouble waking up the monitor after screen locking (which I’ve now disabled timeouts for.)

Of course, I don’t know for sure it’s related to the 4070. But that seems like a good place to start. Open to ideas, especially any way to log what’s happening so I can check it after the reboot.

From a hardware perspective, since you suspect the 4070 and it’s not being used as your primary display output, my first port of call would be to pull the 4070 and run for a while with just the iGPU to see if the reboots persist.

See this video: https://www.youtube.com/watch?v=D3StcUhVRWQ
Recheck all the options in your motherboard’s UEFI and keep only the Intel profile.

Thanks, I saw that just after Jay posted it, but was in my UEFI earlier and realized even though I said “no overclocking” that there are some things set to “auto.” I’ll review the video again.

Actually I have an ASRock mobo (was thinking about the 4070, which is ASUS.)

Did you recheck the RAM to make sure it’s seated properly?

Usually auto settings are default UEFI Bios settings that would not include any overclocking.

Is your motherboard UEFI firmware (BIOS) up to date?

Just to second this, “auto” settings in the BIOS are often good, particularly when it comes to CPU and memory settings. Auto would typically pick sane values for the installed components there.

Of course, the exception would be any “auto” settings specifically related to overclocking and those may exist, particularly on enthusiast level motherboards, but hopefully these are clearly labelled as such.

ASRock Z690M PG Riptide/D5

Yes, I’ve checked seating of RAM/power cables/SSDs etc.

“Auto” for things like “Base Frequency Boost” and “CPU Vcore Compensation” where it doesn’t tell you what “Auto” does or what the other choices for those do.

One would generally assume “auto” should be sane defaults, except after watching the video posted above where ASUS has some pretty insane “defaults.” Granted, I don’t know that ASRock does this stuff, but their UEFI settings are not well-explained in the manual, and I can’t find a better reference online.

For now, I’m leaving UEFI settings alone. Don’t want to throw too many things into the mix.

When I say the reboot happens “randomly” it means sometimes every other day and sometimes every other week. I don’t leave the PC on when I’m not using it, so it’s really hard to tell. I’ve been leaving it on all day to see if it restarts, but it has to be turned off at night to not light up the bedroom. (Because of the GPU LEDs that can’t be disabled, though I will probably disconnect them later.)

Are there any good diagnostic tools for Linux that will show/log things like voltages and the other values the UEFI settings control? Most things I find are Windows-only. I may have to install Windows on another drive just to get some more answers.
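
For logging under Linux, one option is to sample the kernel’s hwmon interface directly, the same sysfs source that lm-sensors’ `sensors` reads from. Below is a minimal Python sketch, not a polished tool: it walks /sys/class/hwmon, reads every *_input file (voltages in millivolts, temperatures in millidegrees C, fans in RPM) and appends a timestamped line to a file, syncing each write so the last sample has a chance of surviving an abrupt reset. The function names, the hwmon.log path and the 5-second interval are placeholders chosen for illustration; which sensors actually appear depends on the drivers the kernel has loaded for the board.

```python
import glob
import json
import os
import time


def read_hwmon(root="/sys/class/hwmon"):
    """Return {"hwmonN:chip/sensor_input": raw_int} for every *_input file."""
    readings = {}
    for chip_dir in sorted(glob.glob(os.path.join(root, "hwmon*"))):
        node = os.path.basename(chip_dir)
        try:
            with open(os.path.join(chip_dir, "name")) as f:
                chip = f.read().strip()
        except OSError:
            chip = "unknown"
        for path in sorted(glob.glob(os.path.join(chip_dir, "*_input"))):
            try:
                with open(path) as f:
                    value = int(f.read().strip())
            except (OSError, ValueError):
                continue  # some sensors return errors or non-numeric data
            readings[f"{node}:{chip}/{os.path.basename(path)}"] = value
    return readings


def append_snapshot(log_path, readings):
    """Append one timestamped JSON line and force it out to disk."""
    with open(log_path, "a") as f:
        f.write(json.dumps({"time": time.time(), "readings": readings}) + "\n")
        f.flush()
        os.fsync(f.fileno())  # don't leave the sample sitting in a cache


def log_forever(log_path="hwmon.log", interval=5.0):
    """Hypothetical driver loop: sample until interrupted (or until a reset)."""
    while True:
        append_snapshot(log_path, read_hwmon())
        time.sleep(interval)
```

Calling `log_forever()` in a terminal or from a service and then reading the last few lines of hwmon.log after a reboot would show what the sensors reported just before the reset. A short interval costs little because the files are tiny, but it still can’t catch a dip that happens between two samples.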

You might try running MemTest86+ overnight. Even if it presents no issues, at least you can rule that out.

The other thing that comes to mind, is quality of power. Your PSU being a “850W Titanium” should be ample for the task, and I would assume, reliable. So no immediate flag there.

However, I just thought I’d check, is the quality of power to your premises good? Do dips in power occur, or is the PC plugged into an overloaded or old powerboard?

You don’t need to change default UEFI settings but you should make sure the UEFI Firmware (Bios) is up to date.

OK, BIOS is updated. Flashed, reset, reloaded what I had before (from photos of every UEFI page,) including “Load Intel Base Power Limit Settings” Enabled.

Went to the dark side. Installed Win11 on another drive so I could install some tools there. This is with the GPU removed. I want to disconnect the RGB LEDs before I reinstall it. But anyway…

Intel Processor Diagnostic Tool passes everything. Speccy shows CPU barely hit 66C during the CPU load test. And drops almost immediately to 30C when the test stops. So I doubt CPU overheating is part of the issue.

I’ll continue testing once I get the GPU back in.

Try also testing under endeavour without the GPU, if you truly want to eliminate the 4070 as a potential problem

That is true, but frustrating since it sometimes goes longer than a week without a problem, and I want to use the GPU.

It’s frustrating, I agree, but it’s the only way to rule it out entirely :)

I would have just loaded and saved the default settings after installing the new BIOS, to see whether anything differed from what you originally had set, and then tested with the defaults.

To be clear, I didn’t retype all old values from the photos. I loaded defaults, made sure it was set to “Load Intel Base Power Limit Settings,” and changed a few things such as XMP profile to run RAM at rated speed, turning off RGB, etc.

But comparing my photos, there were MANY new default values for timings/voltages.

Since the UEFI upgrade, it has not rebooted. I’ve been leaving it on all day/night, using it when I want to, just not shutting down.

It’s a good sign, but still too soon to tell if it’s “solved” after only 3 days.

I’ll be back…

I was planning to come back today and say “It’s been a week since I upgraded BIOS and there have been no reboots, despite leaving it on all day and overnight most of the time, so I’m marking it solved.”

Sigh.

This morning it had restarted. I still don’t know whether there are any logs that would capture such a problem. If it spontaneously reboots, it’s over before anything can be logged, right?
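
On the logging question: an abrupt reset usually leaves no shutdown messages at all, so the previous boot’s journal simply stops, but the last lines before the cut-off can still be worth reading. Here is a minimal sketch, assuming systemd’s journal is stored persistently (for example, /var/log/journal exists). The helper names are made up for illustration; the journalctl flags used are -b -1 (previous boot), -n (last N lines), -k (kernel messages only) and --no-pager.

```python
import subprocess


def journal_cmd(boot_offset=-1, lines=40, kernel_only=False):
    """Build a journalctl command for the tail of an earlier boot's log."""
    cmd = ["journalctl", "--no-pager", "-b", str(boot_offset), "-n", str(lines)]
    if kernel_only:
        cmd.append("-k")  # restrict output to kernel messages
    return cmd


def previous_boot_tail(**kwargs):
    """Run journalctl and return whatever it printed (stdout, else stderr)."""
    result = subprocess.run(
        journal_cmd(**kwargs), capture_output=True, text=True, check=False
    )
    return result.stdout or result.stderr
```

After a surprise reboot, `print(previous_boot_tail())` shows where the log stopped, and `print(previous_boot_tail(kernel_only=True))` narrows it to kernel messages. If the tail shows ordinary activity with no shutdown sequence, that at least confirms a hard reset rather than an OS-initiated restart.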

Of course, it could always be a power blip/brownout, especially when it happens overnight.

Still happening occasionally.

Then there’s this… ugh.

To clarify, is it happening while the GPU is still installed?

The suggestion to remove it and test was a good one. If the issue doesn’t appear while it’s removed, there are two possibilities I’d put forward:

  • the GPU is directly responsible; or
  • the power draw with the GPU installed is sufficient to cause other issues.

Yes, it is a good suggestion, but one I’d been avoiding because the whole point (main point anyway) of this machine is to use the GPU for AI/ML stuff. But I pulled it two days ago. I’ll let it run at least a week or two to see if it reboots.

I doubt it’s power draw. It reboots at idle/light load (GPU installed.) I have the system plugged into a watt meter (added AFTER the reboots started in case you suspect the meter) and it was drawing 60-75 watts with the GPU, and if I ran UNIGINE Superposition or similar would get to 270W with an occasional peak of 300-350W. (But again, it never rebooted under load.) I notice without the GPU, the system draws about 35W at idle.

If it doesn’t eventually reboot, I’ll try the GPU again, but double-check and/or replace connectors.
