With the GPU removed, it ran 24/7 for about a week without restarting.
Two days ago, I put the GPU back, with these changes:
Disconnected and reseated the GPU power cables at the PSU end, and of course paid close attention when seating all other GPU connections
Flipped PSU switch from ECO to AUTO
The first (reseating) isn’t a “change” per se, but of course may be important if I’d missed something loose before.
The second, changing the PSU mode…
I have found search results where people said taking their PSU out of ECO mode solved a rebooting problem, but in the same threads other people say it shouldn’t matter.
Supposedly ECO only affects the fan, not power delivery. But it is an easy thing to test, so I flipped it and will let it run in AUTO for a while.
I have never heard the PSU fan kick in while in ECO mode. In AUTO, it seems to run all the time. (The fan is louder than all my other case fans, despite being marketed as “quiet.” But it’s not terrible.)
If it runs a week like this, I may try flipping back to ECO and see if the restarts resume.
Well, I didn’t wait a week to flip it back to ECO, as it was too noisy for me and had been running fine. It’s not rebooted since I flipped it back (and not since I had removed and then reinstalled the GPU.)
So it seems the moral of the story is “Always check/reseat connections even if you are ‘sure’ they are OK.”
In my case, the cables between the PSU and GPU, and GPU riser cable. ricklinux had mentioned reseating RAM but I didn’t do that. (But I did run memtest86+ for over 9 hours with no errors.)
I’m calling it done. I appreciate all the replies, but I’m not sure if any individual one is exactly “the solution.”
FYI this isn’t over, it’s still restarting, and I’m not sure what to try next. Maybe see if I can get all new PSU/GPU cables, maybe I bent some a bit much in my zeal for “cable management” (i.e. “making it pretty” lol.) Other than that, no idea. I’m already on the latest BIOS with Intel’s “fix” so it’s either not that, or the CPU is already damaged. But I’ve never read of spontaneous reboots being a symptom of that problem.
Replaced PSU and all cables, different brand, still spontaneously reboots
Removed GPU and riser cable, still spontaneously reboots
Both of those today.
What’s left:
Motherboard
CPU, but I don’t think Intel’s symptoms included spontaneous reboots, just application crashes, AFAIK
RAM, I tested and it has passed so far (9 hours memtest no errors)
Anything I forgot? SSDs are the only other hardware installed, but I’d assume read/write errors if that was a problem
Also, just in case, ran memtest another 4 hours tonight and it passed, then turned off XMP and booted to OS, and it just rebooted a few minutes ago, so I think RAM is not the problem.
Would it be possible to get another CPU for testing? What kind of mainboard do you have, and have you already installed the latest BIOS update (microcode 0x129)?
Note:
Even if you now have the latest bios, the CPU may already be irreversibly damaged.
This might be a very long shot, but did you connect the reset button from your PC case to the motherboard?
If so, remove the cable from the motherboard and see if that helps.
There is no reset button, only power (and power LED, which is disconnected.)
As previously noted, already on the latest bios having the 0x129 microcode update. No, can’t (cheaply) get another CPU for testing.
As far as Intel CPU instability, I have not been able to find a site listing “spontaneous restart” as a symptom, only that “applications crash” or “blue screen” (Windows, of course.) But I have a few Linux installs and Windows I can multiboot, and the symptom is the same in all of them, screen goes black and then it restarts.
So the issue is now present across both Windows and Linux OSes?
It is 100% your hardware if that’s the case. Unfortunately it’s an incredibly non-specific symptom, and you already seem to have covered off the most obvious potential causes.
One of the less-likely-but-still-possible causes could be a failing USB device causing an electrical issue and forcing the system to hard reset - perhaps try switching out any USB devices, KB/M, etc.
I’m sorry that I didn’t read properly what has already been tried and what has not.
Unfortunately it is quite difficult to narrow down the problem if there is no other mainboard or processor to test.
Can you possibly test the stability with XTU? Please also observe the temperatures of the CPU during the stress test so that it doesn’t get too hot.
Maybe offtopic:
I had similar problems (random restarts) with a defective Ryzen 9 5900X in April 2021. There were rarely blue screens; mostly the system just rebooted out of nowhere. Even with a blue screen (usually WHEA Error 18), it sometimes simply restarted shortly afterwards. Memtests were without errors.
In the end the CPU was then replaced by AMD and the problems were gone.
As to how I got there, I looked at journalctl, didn’t see any specific errors relating to restarts (still probably restarted before any could be written,) but I did see some errors from Brave Browser (several minutes before the restart,) and realized it seemed to restart more frequently when I was actively browsing, opening/switching/closing tabs.
So I guessed. Used Firefox a few days, no restarts. Used Brave again, restarted within an hour or two. Noted I had the Flatpak version, so replaced that with the AUR version, and it hasn’t restarted since. If anything changes (further restarts,) I’ll update here.
If that was it, and it’s a software-only problem, I’m still very curious how a user space program could do that. Immediate black screen and restart without logging what actually happened. Wouldn’t that point toward a kernel bug, that Flatpak or the FlatPak/Brave combo “hit” under the right conditions? What else?
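For anyone retracing this step, here is a minimal sketch of pulling the tail end of the previous boot's journal, which is where any last-second errors would be if they made it to disk at all. It assumes a systemd machine with a persistent journal; `previous_boot_cmd` and `tail_of_previous_boot` are names made up for illustration, and only standard journalctl options (`-b`, `-p`, `-n`, `--no-pager`) are used.

```python
import subprocess


def previous_boot_cmd(boot_offset=-1, priority="warning", lines=200):
    """Build a journalctl command for an earlier boot (hypothetical helper)."""
    return [
        "journalctl",
        "-b", str(boot_offset),  # -1 = the boot before the current one
        "-p", priority,          # this priority and anything more severe
        "-n", str(lines),        # only the last N matching entries
        "--no-pager",
    ]


def tail_of_previous_boot(**kwargs):
    """Run the command and return whatever journalctl printed."""
    result = subprocess.run(
        previous_boot_cmd(**kwargs),
        capture_output=True,
        text=True,
        check=False,  # a missing earlier boot should not raise here
    )
    return result.stdout or result.stderr
```

Calling `print(tail_of_previous_boot())` right after an unexpected restart shows the last warnings and errors that were recorded. On a hard reset the final seconds may never be flushed, so an empty or unremarkable tail does not rule anything out.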
If you install Flatpak, it is added to the kernel, as a module I guess (I’m not exactly sure how this works).
So if an installed Flatpak app causes issues, it might be possible that Flatpak itself causes a problem in the kernel, which causes a reboot?
I have never seen that before but that might explain this reboot issue.
It is interesting for sure though
System’s been stable since my last post, until two days ago, when it rebooted under Windows, and today, rebooted under EOS.
Grasping at straws here. Could virtualization settings cause instability?
When it crashed under Windows, I was running WSL2 (virtual machine running Linux.)
Today, I was running a machine in Incus.
Previously, I said I’d been using the Flatpak of Brave, but switched to the native one. I know Flatpak containerizes apps, but I haven’t looked into whether it uses virtualization. I would guess yes; why wouldn’t it use available features?
And once the other month, I was trying to install some pinokio apps, and it crashed multiple times that night as I kept retrying.
Coincidence? Maybe. I’ve already looked at UEFI and don’t see any “tuning” for virtualization, just on or off. But I guess this isn’t over just yet.
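One way to turn this hunch into an on/off test is to confirm from inside Linux whether hardware virtualization is actually exposed after toggling it in UEFI. The sketch below is only that, a sketch: `virt_flags` and `kvm_available` are hypothetical helper names, and the `vmx`/`svm` flag alone may not reflect the UEFI toggle on every kernel, which is why it also looks for `/dev/kvm`. As far as I can tell, Flatpak's sandbox is built on kernel namespaces (via bubblewrap) rather than hardware virtualization, so the Flatpak case may not belong in the same bucket as WSL2.

```python
from pathlib import Path


def virt_flags(cpuinfo_text):
    """Return the hardware-virtualization flags present in /proc/cpuinfo text."""
    found = set()
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            flags = line.partition(":")[2].split()
            # vmx = Intel VT-x, svm = AMD-V
            found.update(flag for flag in flags if flag in ("vmx", "svm"))
    return found


def kvm_available(dev_path="/dev/kvm"):
    """True if the KVM device node exists, i.e. the kernel can use VT-x/AMD-V."""
    return Path(dev_path).exists()


def report():
    """Summarise both checks for the running system (Linux only)."""
    text = Path("/proc/cpuinfo").read_text()
    return {"cpu_flags": sorted(virt_flags(text)), "kvm": kvm_available()}
```

Running `print(report())` once with virtualization enabled and once with it disabled shows whether the UEFI switch really took effect before starting a multi-day stability run.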
Yes, I know this is going on a long time. No, I don’t expect answers here. I am updating for future readers searching for similar problems.
It’s not Brave Browser. I’ve been using Firefox only for a while. Not booting it often, but only using Firefox when I do. Well, it just rebooted while running Firefox.
It does seem to happen when I have a lot of tabs open with YouTube videos, but that’s probably coincidental since I do that all the time.
So just to recap:
It’s not PSU, I replaced it and all cables
It’s probably not RAM, it’s passed many hours of memtest86
It’s not the GPU, still reboots with the GPU removed
Reboots under Windows too (not recently) so probably hardware/bios issue
No solution, not asking for one, just logging current state.
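Since the journal never seems to capture the moment of the restart, a crude heartbeat file can at least pin down when each one happened, which makes it easier to check hunches like the many-tabs one against actual times. This is a minimal sketch, not a polished tool: the file name, the 10-second interval, and the function names are all arbitrary choices for illustration.

```python
import os
import time


def write_heartbeat(path, now=None):
    """Append one timestamp and force it to disk so it survives a hard reset."""
    now = time.time() if now is None else now
    with open(path, "a") as f:
        f.write(f"{now:.0f}\n")
        f.flush()
        os.fsync(f.fileno())


def find_gaps(timestamps, interval=10, slack=3):
    """Return (last_seen, next_seen) pairs where beats stopped for too long."""
    limit = interval * slack
    return [(a, b) for a, b in zip(timestamps, timestamps[1:]) if b - a > limit]


def run(path="heartbeat.log", interval=10):
    """Write a heartbeat forever; stop with Ctrl+C."""
    while True:
        write_heartbeat(path)
        time.sleep(interval)
```

After a restart, feeding the logged timestamps to `find_gaps` gives the last moment the machine was known to be alive before each outage. Normal shutdowns and suspends show up as gaps too, so those have to be discounted by hand.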
Rethinking this, might it be related to the many problems Intel has had with 13th-generation processors?
If your CPU is still under warranty (they did extend it to 4 years for boxed), return it and get a new one.
Next would be to update the BIOS and load defaults. Motherboard manufacturers have long set power limits too high, and they have been correcting that with BIOS updates that lower the limits.
After my last reply, I started an Intel warranty claim. It took a month. I installed the replacement CPU, but the same restarting problem came back within a couple of hours.
Not much left but motherboard and drives, so I went back to recheck the RAM one more time to be sure.
64GB DDR5 5600 as matched set of two 32GB sticks in slots A2/B2.
Passed all tests, as before.
Removed B2, ran on one stick in A2 a few days, seemed stable.
Swapped the other stick into A2, also stable a few days.
Reinstalled remaining stick into B2, and it’s been running 24/7 for a week.
Most likely, RAM wasn’t seated properly despite multiple prior reseatings. Unless anybody has heard of individual sticks being finicky about which slot they are in. (And I mean the “correct” slots according to your mobo manual, usually A2/B2.)
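For anyone repeating the one-stick-at-a-time test, the combinations can be written out as a checklist so none gets skipped. The sketch below also includes each stick alone in B2, which I did not try and which would speak to the "finicky about slot" question. The stick labels are made up, and some boards may want a single stick in one particular slot, so check the manual before running every row.

```python
from itertools import product


def isolation_plan(sticks, slots):
    """List single-stick configurations first, then the full matched set."""
    plan = [{slot: stick} for stick, slot in product(sticks, slots)]
    plan.append(dict(zip(slots, sticks)))  # final step: everything installed
    return plan


def describe(plan):
    """Render the plan as numbered lines to tick off over several days."""
    lines = []
    for number, config in enumerate(plan, start=1):
        layout = ", ".join(f"{slot}={stick}" for slot, stick in config.items())
        lines.append(f"{number}. {layout}")
    return "\n".join(lines)
```

With `isolation_plan(["stick1", "stick2"], ["A2", "B2"])` this yields four single-stick runs and one final run with both sticks installed; `print(describe(...))` prints them as a numbered list.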
If this is the solution, I guess the moral is to double, triple check connections and seatings. I really hope this is the end of the problem. I’ve been avoiding doing real work on this PC for fear of losing unsaved data when it restarts.