Random restarts on my computer

flirora · March 26, 2025, 2:19pm

I’ve had an issue with random restarts for a while, but lately they’ve been happening every few days.

inxi -Fxxc0z | eos-sendlog: https://0x0.st/8j7V.txt

journalctl -k -b -2 | eos-sendlog: https://0x0.st/8j74.txt – no logs about the crash here

lsblk -o name,type,size,PTTYPE,FSTYPE:

NAME        TYPE   SIZE PTTYPE FSTYPE
nvme0n1     disk 931.5G gpt    
├─nvme0n1p1 part   300M gpt    vfat
├─nvme0n1p2 part 922.4G gpt    ext4
└─nvme0n1p3 part   8.8G gpt    swap

I did manage to capture a kdump for this crash using kdumpst: https://0x0.st/8j7y.txt

I’ve run memtest86+ several times but have yet to find any errors.

xircon · March 26, 2025, 2:30pm

What are your temperatures like?

sudo sensors

flirora · March 26, 2025, 2:41pm

Right now, it shows the following:

iwlwifi_1-virtual-0
Adapter: Virtual device
temp1:        +35.0°C  

nvme-pci-0400
Adapter: PCI adapter
Composite:    +31.9°C  (low  =  -5.2°C, high = +79.8°C)
                       (crit = +84.8°C)

amdgpu-pci-0500
Adapter: PCI adapter
vddgfx:        1.32 V  
vddnb:       768.00 mV 
edge:         +45.0°C  
PPT:           6.00 W  

k10temp-pci-00c3
Adapter: PCI adapter
Tctl:         +52.6°C  

BAT0-acpi-0
Adapter: ACPI interface
in0:          16.25 V  
curr1:         0.00 A

flirora · March 26, 2025, 2:49pm

This just happened again while the temperatures were hardly higher, so it doesn’t seem to be a temperature issue. Unfortunately, I didn’t get a kdump for this one.
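Since the reading above was taken at idle, one option is to log the hottest sensor periodically so the last value before a reboot survives on disk. This is only a rough sketch: `hottest_sensor` is a hypothetical helper name, and it assumes `sensors` prints readings in the `label: +NN.N°C` form shown above.

```shell
# Hypothetical helper: read `sensors` output on stdin and print the label
# and value of the highest temperature reading. Limits such as
# "(crit = ...)" are ignored because only the first value after the
# colon is considered.
hottest_sensor() {
    awk -F: '
        NF >= 2 {
            split($2, field, " ")
            # A reading looks like +52.6°C: sign, digits, degree sign, C.
            if (field[1] ~ /^[+-][0-9.]+[^0-9 ]+C$/) {
                value = field[1] + 0
                if (!seen || value > max) { max = value; label = $1; seen = 1 }
            }
        }
        END { if (seen) printf "%s %.1f\n", label, max }
    '
}

# Possible use (not run here): append one timestamped line per minute.
#   while sleep 60; do
#       printf '%s %s\n' "$(date)" "$(sensors | hottest_sensor)" >> "$HOME/temps.log"
#   done
```

If the log ends with normal values right before a restart, that would support the conclusion that heat is not the cause.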

xircon · March 26, 2025, 2:50pm

Yeah, temps look fine.

flirora · March 26, 2025, 3:13pm

Here are some earlier logs captured by kdumpst from the same system, if that happens to be helpful:

xircon · March 26, 2025, 4:30pm

Have you installed/enabled all the System76 services?

flirora · March 26, 2025, 4:47pm

I have system76.service enabled, but:

system76-firmware-daemon.service is loaded but disabled:

$ systemctl status system76-firmware-daemon.service
○ system76-firmware-daemon.service - System76 Firmware Daemon
   Loaded: loaded (/usr/lib/systemd/system/system76-firmware-daemon.service; disabled; preset: disabled)
   Active: inactive (dead)

So I enabled and started it now.

system76-power.service is not found:

$ systemctl status system76-power.service 
Unit system76-power.service could not be found.

xircon · March 26, 2025, 5:08pm

1 aur/system76-power 1.2.3-1 (+13 0.06)
    System76 Power Management

TBH, I don't know much about System76.

flirora · March 26, 2025, 5:18pm

Strange; I have the system76-power package from AUR installed, but system76-power.service doesn’t appear on the list when I run systemctl, and systemctl status system76-power.service complains about not finding it as I mentioned.

Output for querying for system76-power

$ pacman -Qi system76-power
Name            : system76-power
Version         : 1.2.3-1
Description     : System76 Power Management
Architecture    : any
URL             : https://github.com/pop-os/system76-power
Licenses        : GPL
Groups          : None
Provides        : power-profiles-daemon
Depends On      : dbus  libusb  polkit
Optional Deps   : system76-acpi-dkms: some systems might need this driver [installed]
                  system76-dkms: some systems might need this driver [installed]
                  system76-io-dkms: some systems might need this driver
Required By     : gnome-shell-extension-system76-power-git
Optional For    : gnome-shell  system76-driver
Conflicts With  : None
Replaces        : None
Installed Size  : 5.55 MiB
Packager        : Unknown Packager
Build Date      : Wed 26 Feb 2025 02:32:19 PM EST
Install Date    : Wed 26 Feb 2025 02:33:57 PM EST
Install Reason  : Explicitly installed
Install Script  : Yes
Validated By    : None
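A package does not have to name its unit after itself, so it may help to list the unit files the package actually installs instead of guessing the name. The sketch below filters `pacman -Ql` output; `unit_files` is a hypothetical helper name. As far as I know, upstream system76-power ships its unit as com.system76.PowerDaemon.service, but treat that as an assumption to verify with the listing.

```shell
# Hypothetical helper: read `pacman -Ql <package>` lines ("pkgname /path")
# on stdin and print only the paths that are systemd unit files.
unit_files() {
    awk '$2 ~ /\/systemd\/system\/.+\.(service|socket|timer)$/ { print $2 }'
}

# Possible use on the affected machine (not run here):
#   pacman -Ql system76-power | unit_files
```

Whatever name shows up there is the one to pass to systemctl status and systemctl enable --now.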

flirora · May 22, 2025, 7:05am

This is still happening to me. Latest successful kdumpst capture

I’ll try running memtest86+ again.

MyNameIsRichard · May 22, 2025, 7:12am

Have you tried the LTS kernel if you’re running the latest, or vice versa?

flirora · May 22, 2025, 3:51pm

I’m using latest right now, but this problem has persisted for a few years, so I don’t think switching to LTS is likely to help.

While running memtest86+, I did find a memory error at address 0x9d4d48, so I added memmap=4K$0x0000009d4000 to the kernel command line to reserve the memory around that. Now I’ll have to see if that improves the situation.
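For anyone repeating this, the page-aligned base address for memmap can be derived from the failing address by clearing the low 12 bits (4 KiB pages). A minimal sketch, with `memmap_for_addr` as a hypothetical helper name:

```shell
# Hypothetical helper: print a memmap= parameter that reserves the 4 KiB
# page containing the given physical address.
memmap_for_addr() {
    # Clear the low 12 bits to round down to the start of the page.
    base=$(( $1 & ~0xFFF ))
    printf 'memmap=4K$0x%012x\n' "$base"
}

memmap_for_addr 0x9d4d48
```

For the address above this prints the same parameter used in the post. The kernel parameter documentation notes that some bootloaders, GRUB 2 among them, need the `$` escaped, so it is worth confirming the final value in /proc/cmdline after rebooting.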

flirora · May 24, 2025, 7:33pm

Update (since I can’t edit my last post anymore): Doesn’t seem that this solved the problem, as I got another random reboot. I tried running memtest86+ again, but after 12 hours of testing, I didn’t find any errors. I’ll try the LTS kernel if this problem occurs again.

wmorrison · May 25, 2025, 5:26pm

I had a similar problem. For me, eventually replacing the motherboard seems to have been the fix. Not saying you should jump to that. But search my post in this subforum in case it gives any insights. A frustrating problem for sure.

bitterhalt · May 25, 2025, 5:28pm

Sounds like a dying power supply or motherboard. If those are good, I would check the RAM sticks.

Yes, it is a pain to debug if you don’t have spare ones to test with.

flirora · May 30, 2025, 6:47pm

I can confirm that the same problems occur on the LTS kernel.

I’ll try getting some RAM to try replacing. I hope it’s only a RAM problem; I don’t think it’s possible to replace the motherboard on my laptop.

flirora · June 5, 2025, 1:23am

I’ve replaced my RAM now (64 GB now from 32, to boot!). I was worried when my laptop kept shutting down while running memtest86+, but when I booted my OS, I found that the problem was the battery being empty. (Maybe the new RAM is a lot more power-hungry?) I hope I don’t run into any problems now that I’ve replaced the RAM, but the only way to find out is to wait.

Edit: nope, it crashed after 1 hour of running. See most recent log from kdumpst. Maybe it’s because I bought two single DDR4 modules instead of a pair. Perhaps I’ll try using only one of them; I can live with 32 GB.

flirora · July 15, 2025, 6:47pm

I tried running memtester (not memtest86+) to see whether I could detect any memory problems while working on other things. Strangely, my system stayed up for 39 days before crashing, far longer than any previous uptime since this problem started. The crash log captured by kdumpst looks a lot like the earlier ones, but here it is just in case.

wolfn · July 17, 2025, 9:57am

Have you checked this:
https://unix.stackexchange.com/questions/709278/get-rid-of-rebooting-messages-and-watchdog

If you google for linux & watchdog you will find much more.

I’ve struggled with this for some days (again), because my mobo (ASUS B360M) presents a HW watchdog to every system, and it caused random reboots…
You should check (I am on Intel; AMD may use a different module):
# lsmod | grep wdt
# cat /proc/sys/kernel/nmi_watchdog (should be 0)
# ps -ef | grep watch
If you find something here, maybe you should stop it…
I had to disable both the HW watchdog and the NMI watchdog; since then, the problem is gone.

Add the kernel parameters “nowatchdog nmi_watchdog=0” at boot, then blacklist the HW watchdog (& MEI stuff).
/etc/modprobe.d/watchdog_disable.conf should contain:

    # Disable intel mei (including mei_watchdog).
    blacklist intel_pmc_bxt
    blacklist iTCO_vendor_support

    # Do not load the 'iTCO_wdt' watchdog module on boot.
    blacklist iTCO_wdt

If the NMI watchdog is still active, create /etc/sysctl.d/disable_watchdog.conf with this line:
kernel.nmi_watchdog = 0

Maybe your BIOS has an option to disable the watchdog; mine doesn’t…
This helped me to get it quiet - HTH.
Good luck!
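The checks above can be bundled into two small helpers, sketched here under some assumptions: the function names are hypothetical, and the module name patterns are only a heuristic. On AMD systems the hardware watchdog driver is usually sp5100_tco rather than iTCO_wdt, as far as I know, which is why the filter also looks for "tco".

```shell
# Hypothetical helper: read `lsmod` output on stdin and print the names of
# loaded modules that look like watchdog drivers (heuristic name match).
watchdog_modules() {
    awk 'NR > 1 && tolower($1) ~ /wdt|tco|watchdog/ { print $1 }'
}

# Hypothetical helper: report the NMI watchdog state. The file path can be
# overridden as the first argument, mainly to make the function testable.
nmi_watchdog_state() {
    file=${1:-/proc/sys/kernel/nmi_watchdog}
    if [ -r "$file" ]; then
        case $(cat "$file") in
            0) echo disabled ;;
            *) echo enabled ;;
        esac
    else
        echo unknown
    fi
}

# Possible use on the affected machine (not run here):
#   lsmod | watchdog_modules
#   nmi_watchdog_state
```

If the first command prints nothing and the second prints "disabled", neither watchdog is active and the reboots most likely have another cause.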