Random reboots once a day across several kernel versions

I am very frustrated. I have been having random reboots for months now, and everything I have done to track them down has failed. Looking for some advice…

PROBLEM: Reboots happen about once a day. I first thought it only happened when I was on the command line, but then it happened while I was using Firefox. The closest I have come to reproducing it is in System Profiler and Benchmark: if I click “CPU Blowfish”, it has crashed several times, suggesting it’s CPU related. The PC does not have to be under load for this to happen; I am not running any games and the fans are hardly spinning.

SYSTEM INFO:
Kernel Linux 6.6.10-arch1-1 (x86_64)
Version #1 SMP PREEMPT_DYNAMIC Fri, 05 Jan 2024 16:20:41 +0000
C Library GNU C Library / (GNU libc) 2.38
Distribution EndeavourOS Linux
Processor 13th Gen Intel(R) Core™ i9-13900K
Memory 65648MB (2125MB used)
GPU: NVIDIA GeForce RTX 3090/PCIe/SSE2 (BIOS 94.024b.00.0b)
Mother Board: Asus Prime Z790-A Wifi (BIOS 1604)

THINGS I HAVE TRIED:

  1. I updated to the latest motherboard BIOS and GPU VBIOS; neither stopped the reboots.
  2. I monitored temps while idle and under load; the CPU never got above 65 °C. It also didn’t reboot under load.
  3. I ran Memtest86+ for 30 min; all tests passed.
  4. I have searched every log I can find; the only things I have found are in journalctl -b and are as follows:
[    0.000000] x86/split lock detection: #AC: crashing the kernel on kernel split_locks and warning on user-space split_locks

I have turned this off from GRUB with “split_lock_detect=off”; this did not stop the random reboots.

Jan 11 19:36:38 batcave kernel: ACPI BIOS Error (bug): Failure creating named object [\_SB.PC00.PEG1.PEGP._DSM.USRG], AE_ALREADY_EXISTS (20230628/dsfield-184)
Jan 11 19:36:38 batcave kernel: ACPI Error: AE_ALREADY_EXISTS, CreateBufferField failure (20230628/dswload2-477)
Jan 11 19:36:38 batcave kernel: ACPI Error: Aborting method \_SB.PC00.PEG1.PEGP._DSM due to previous error (AE_ALREADY_EXISTS) (20230628/psparse-529)

I haven’t figured out how to make these stop, but it’s my understanding that they can be ignored. There are about 20-25 of these messages.

  5. When I run “last reboot” right after a crash, I see two running kernels; I believe the second one is just the crashed boot that the OS never shut down.
reboot   system boot  6.6.10-arch1-1   Sun Jan 14 13:25   still running
reboot   system boot  6.6.10-arch1-1   Sun Jan 14 13:23   still running
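A rough sketch of how to pick out the boots that never recorded a shutdown from the “last reboot” output. It assumes the newest entry is listed first (so the first “still running” line is the current boot) and that a crashed boot also shows “still running”, as in the output above; some versions of last may label such boots differently. The function name unclean_boots is made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `last reboot` output on stdin and print every
# "still running" entry except the first one, which is assumed to be the
# boot that is currently running.
unclean_boots() {
    awk '/still running/ { if (seen++) print }'
}

# Usage (not run here):
#   last reboot | unclean_boots
```

Each line it prints gives a boot time to look up in the journal.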

My first reboot was in kernel 6.0.8-arch1-1

MY GUT:
When I first built the computer I was not having this issue; it seems to have “developed”. The only new hardware I have added is a Keychron Q6 keyboard. My power supply is a beefy Seasonic 1000 W PSU.
I think this is some kernel process that wakes up, maybe related to CPU usage, because “CPU Blowfish” seems to trigger it fairly consistently. But it’s not from load; it never gets that chance. It’s like it tries to start something and crashes.

Any help appreciated. I feel like I am looking in the wrong logs, and that there has to be something somewhere that says what is happening…

I’m not an expert but have you tried using the LTS kernel?

What are the last entries in the journal before it reboots?

You have 64 GB of RAM. Are you sure that memtest completes all tests in 30 min? That seems pretty fast.

Please do a stress test with mprime -t and let this run for several minutes, at least until test 10. What does that do to the CPU temperature?

If mprime shows an error or forces a reboot you can try to increase the voltage for the CPU a bit and see if that helps.

Have you seen this? Similar issue:

https://www.reddit.com/r/MSI_Gaming/comments/ziugk2/z690m_i913900k_msi_4090_random_reboots/

I also had some rare random reboots for months, mostly while the system was under heavy gaming load but also when it was only compiling. I could not find the cause because the logs contained absolutely nothing.
I thought I had solved them by undervolting the CPU less and by putting my power supply into single-rail mode, which makes the power supply less prone to power spikes (bequiet can do that by activating “OC-mode”, which only switches on the single-rail function in the power supply). In the long run, though, I realized that the power supply was faulty. I replaced it, and now my system hasn’t randomly rebooted in months.

I would definitely consider that before replacing any hardware!

I have not rolled back to LTS; perhaps I will try. Honestly, I have never rolled back a kernel on Arch.

I have seen this thread. Basically it came down to a faulty motherboard. I have had this thought too, but haven’t found many issues with this particular MB online. Replacing the MB is kind of the worst case, but it might come to that.

You don’t roll back to LTS. You simply install the LTS kernel in addition ( sudo pacman -S linux-lts linux-lts-headers + maybe your nvidia packages depending on which you use) so that you have a failsafe kernel with a different major release version installed. You can then choose it at boot to check whether the issues also happen with it.
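After picking the LTS entry at boot, it is worth confirming that it really is the running kernel before drawing conclusions from a crash (or the lack of one). A small sketch, assuming that the release string reported by uname -r ends in “-lts” for Arch’s linux-lts package; the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the given kernel release string looks like
# an Arch linux-lts kernel.
# ASSUMPTION: linux-lts release strings end in "-lts".
is_lts_kernel() {
    case "$1" in
        *-lts) return 0 ;;
        *)     return 1 ;;
    esac
}

# Usage (not run here):
#   is_lts_kernel "$(uname -r)" && echo "running LTS" || echo "not LTS"
```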

I just ran mprime -t for 11 tests…

[Worker #2 Jan 15 08:16] Worker stopped.
[Worker #1 Jan 15 08:16] Torture Test completed 11 tests in 6 minutes - 0 errors, 0 warnings.

Here is a screenshot of psensor while on test 11…

Hottest temps I have seen, but they should be par for the course on this CPU. Surprising that I didn’t crash; however, the computer has never been under load when previous crashes have occurred.

Yes, I am positive one full pass of memtest was completed. It might have been 33 min; I was just rounding. It’s DDR5 6000 memory.

I am going to have to wait for next crash on this one. I have rebooted so many times while I was upgrading the vbios that I don’t know which boot was the crash. I will get it though.

When you know the time the crash occurred, you can tell journalctl to start looking at that time. Example:

journalctl --since "2024-01-04 13:50:00"

or just from the beginning of that day:

journalctl --since "2024-01-04"
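To narrow the output to a few minutes either side of a crash, --since can be paired with --until. A rough sketch, assuming GNU date is available; the function name journal_window and the five-minute default are made up for illustration, and it only prints the command so you can check it before running it.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a journalctl command covering N minutes before
# and after a timestamp (default: 5 minutes each side).
# ASSUMPTION: GNU date, for the -d option and "@epoch" input.
journal_window() {
    local ts="$1" minutes="${2:-5}" epoch from to
    # Convert to epoch seconds first so the arithmetic is unambiguous.
    epoch=$(date -d "$ts" +%s) || return 1
    from=$(date -d "@$((epoch - minutes * 60))" '+%Y-%m-%d %H:%M:%S')
    to=$(date -d "@$((epoch + minutes * 60))" '+%Y-%m-%d %H:%M:%S')
    printf 'journalctl --since "%s" --until "%s"\n' "$from" "$to"
}

# Usage (not run here):
#   journal_window "2024-01-04 13:50:00"
```

With the timestamp from the example above, this prints a command covering 13:45:00 to 13:55:00 on that day.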

I doubt the LTS kernel will result in any change, because the LTS kernel is on 6.6, which he already used while having this problem. Kernel 6.7 is inside the repos now. That may actually change things up.

This is not what I understood. The reddit suggests that it is an nvidia bug for RTX 4090 (which seems to be close to your RTX 3090).

The fix is to go to Nvidia Control panel under Manage 3D settings, set Power Management Mode to Prefer Maximum Performance.

For me it would crash whenever I would let it go to low power and then do something to make it go to a higher power mode, like sit at the desktop for 30 seconds then drag or resize a window quickly. Heck, even the wallpaper slide show would do it.

Forcing the GPU to stay in high power mode removes the switching and stops the crashing.
But it will make the GPU use more power and run warmer even when idling - not ideal.

It seems every other motherboard manufacturer has a BIOS update that fixes this issue, except MSI, so I guess at some point it will be fixed, but not yet.

Here are the last lines from journalctl -b-1. I was able to trigger the reboot by once again running HardInfo 0.6-alpha and clicking through the benchmarks; this time the reboot happened when I clicked “CPU Zlib”.

Jan 15 12:52:21 batcave systemd[1]: Stopping User Manager for UID 970...
Jan 15 12:52:21 batcave systemd[1038]: Activating special unit Exit the Session...
Jan 15 12:52:21 batcave systemd[1038]: Stopped target Main User Target.
Jan 15 12:52:21 batcave systemd[1038]: Stopping Accessibility services bus...
Jan 15 12:52:21 batcave dbus-broker[1081]: Dispatched 20 messages @ 10(±15)μs / message.
Jan 15 12:52:21 batcave systemd[1038]: Stopping dbus-:1.6-org.a11y.atspi.Registry@0.service...
Jan 15 12:52:21 batcave systemd[1038]: Stopping User preferences database...
Jan 15 12:52:21 batcave systemd[1038]: Stopping Virtual filesystem service...
Jan 15 12:52:21 batcave systemd[1038]: Stopped Accessibility services bus.
Jan 15 12:52:21 batcave systemd[1038]: Stopped dbus-:1.6-org.a11y.atspi.Registry@0.service.
Jan 15 12:52:21 batcave systemd[1038]: Stopped User preferences database.
Jan 15 12:52:21 batcave systemd[1038]: Removed slice Slice /app/dbus-:1.6-org.a11y.atspi.Registry.
Jan 15 12:52:21 batcave systemd[1]: run-user-970-gvfs.mount: Deactivated successfully.
Jan 15 12:52:21 batcave systemd[1038]: Stopped Virtual filesystem service.
Jan 15 12:52:21 batcave systemd[1038]: Stopped target Basic System.
Jan 15 12:52:21 batcave systemd[1038]: Stopped target Paths.
Jan 15 12:52:21 batcave systemd[1038]: Stopped target Sockets.
Jan 15 12:52:21 batcave systemd[1038]: Stopped target Timers.
Jan 15 12:52:21 batcave systemd[1038]: Closed GnuPG network certificate management daemon.
Jan 15 12:52:21 batcave systemd[1038]: Closed GCR ssh-agent wrapper.
Jan 15 12:52:21 batcave systemd[1038]: Closed GnuPG cryptographic agent and passphrase cache (access for web browsers).
Jan 15 12:52:21 batcave systemd[1038]: Closed GnuPG cryptographic agent and passphrase cache (restricted).
Jan 15 12:52:21 batcave systemd[1038]: Closed GnuPG cryptographic agent (ssh-agent emulation).
Jan 15 12:52:21 batcave systemd[1038]: Closed GnuPG cryptographic agent and passphrase cache.
Jan 15 12:52:21 batcave systemd[1038]: Closed GnuPG public key management service.
Jan 15 12:52:21 batcave systemd[1038]: Closed p11-kit server.
Jan 15 12:52:21 batcave systemd[1038]: Closed PipeWire PulseAudio.
Jan 15 12:52:21 batcave systemd[1038]: Closed PipeWire Multimedia System Sockets.
Jan 15 12:52:21 batcave dbus-broker[1055]: Dispatched 542 messages @ 3(±5)μs / message.
Jan 15 12:52:21 batcave systemd[1038]: Stopping D-Bus User Message Bus...
Jan 15 12:52:21 batcave systemd[1038]: Stopped D-Bus User Message Bus.
Jan 15 12:52:21 batcave systemd[1038]: Removed slice User Core Session Slice.
Jan 15 12:52:21 batcave systemd[1038]: Closed D-Bus User Message Bus Socket.
Jan 15 12:52:21 batcave systemd[1038]: Removed slice User Application Slice.
Jan 15 12:52:21 batcave systemd[1038]: Reached target Shutdown.
Jan 15 12:52:21 batcave systemd[1038]: Finished Exit the Session.
Jan 15 12:52:21 batcave systemd[1038]: Reached target Exit the Session.
Jan 15 12:52:21 batcave systemd[1]: NetworkManager-dispatcher.service: Deactivated successfully.
Jan 15 12:52:21 batcave (sd-pam)[1040]: pam_warn(systemd-user:setcred): function=[pam_sm_setcred] flags=0x8004 service=[systemd-user] terminal=[] us>
Jan 15 12:52:21 batcave systemd[1]: user@970.service: Deactivated successfully.
Jan 15 12:52:21 batcave systemd[1]: Stopped User Manager for UID 970.
Jan 15 12:52:21 batcave systemd[1]: Stopping User Runtime Directory /run/user/970...
Jan 15 12:52:21 batcave systemd[1]: run-user-970.mount: Deactivated successfully.
Jan 15 12:52:21 batcave systemd[1]: user-runtime-dir@970.service: Deactivated successfully.
Jan 15 12:52:21 batcave systemd[1]: Stopped User Runtime Directory /run/user/970.
Jan 15 12:52:21 batcave systemd[1]: Removed slice User Slice of UID 970.
Jan 15 12:52:21 batcave systemd[1]: user-970.slice: Consumed 18.579s CPU time.
Jan 15 12:52:27 batcave systemd-timesyncd[839]: Contacted time server 69.164.213.136:123 (0.arch.pool.ntp.org).
Jan 15 12:52:27 batcave systemd-timesyncd[839]: Initial clock synchronization to Mon 2024-01-15 12:52:27.770585 CST.
Jan 15 12:52:27 batcave systemd[1]: Starting Update locate database...
Jan 15 12:52:31 batcave systemd[1]: updatedb.service: Deactivated successfully.
Jan 15 12:52:31 batcave systemd[1]: Finished Update locate database.
Jan 15 12:52:31 batcave systemd[1]: updatedb.service: Consumed 1.265s CPU time.
Jan 15 12:52:42 batcave cinnamon-session[1166]: WARNING: t+30.95360s: Detected that screensaver has left the bus
Jan 15 12:52:43 batcave systemd[1]: blueman-mechanism.service: Deactivated successfully.

I just tried this, but for me it was Nvidia X Server Settings → PowerMizer → Preferred Mode → Prefer Maximum Performance.

Will have to see if I still crash…

From my point of view this journal shows a regular shutdown / reboot. Are these the last messages, or is there more to come before the end? I am missing entries like “Syncing filesystems” or “Journal stopped”.
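A quick way to check a previous boot’s journal for those markers, as a rough sketch. The two strings are the ones quoted above; the function name ended_cleanly is hypothetical, and the exact wording of the messages may differ between systemd versions.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read journal text on stdin and report whether it
# contains either of the clean-shutdown markers mentioned above.
ended_cleanly() {
    if grep -qE 'Syncing filesystems|Journal stopped'; then
        echo "clean shutdown markers found"
    else
        echo "no clean shutdown markers: boot likely ended abruptly"
        return 1
    fi
}

# Usage (not run here):
#   journalctl -b -1 --no-pager | ended_cleanly
```

If neither marker shows up for the boot that crashed, the machine most likely cut out or reset without the OS taking part.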

Those were the last rows of the journal for that boot. I agree it looks pretty normal. I can post the whole journal if you think that would help. Just wasn’t sure how much this forum could take so I just gave you everything from the first stop.

I got the reboot even with GPU “Prefer Maximum Performance” on. I thought it might be working because it took me a few tries clicking around in HardInfo’s benchmarks, but the third time I tried, it tripped when I clicked “CPU Zlib” again. It really feels like either something is wrong with the CPU/MB or a bad instruction is being sent to the CPU.

I am not overclocking the CPU; the only BIOS change I have ever made is to turn XMP on so the memory runs at 6000 MHz.

You described reboots, but what exactly happens in that case?

Does your PC cut out directly, or does it gracefully shut down and reboot?

To me, mprime shows a high temp (too high for normal use, but i9s are hot-headed), but it didn’t reboot, as you say, so the basic hardware seems to be fine.

You can try to remove the graphics card and replace it with a spare (if you have one) or, if you have a CPU with graphics support, use the onboard graphics to see if that makes any difference.