This just happened again while the temperatures were hardly higher, so it doesn’t seem to be a temperature issue. Unfortunately, I didn’t get a kdump for this one.
Strange; I have the system76-power package from AUR installed, but system76-power.service doesn’t appear on the list when I run systemctl, and systemctl status system76-power.service complains about not finding it as I mentioned.
Output of querying pacman for system76-power:
$ pacman -Qi system76-power
Name : system76-power
Version : 1.2.3-1
Description : System76 Power Management
Architecture : any
URL : https://github.com/pop-os/system76-power
Licenses : GPL
Groups : None
Provides : power-profiles-daemon
Depends On : dbus libusb polkit
Optional Deps : system76-acpi-dkms: some systems might need this driver [installed]
system76-dkms: some systems might need this driver [installed]
system76-io-dkms: some systems might need this driver
Required By : gnome-shell-extension-system76-power-git
Optional For : gnome-shell system76-driver
Conflicts With : None
Replaces : None
Installed Size : 5.55 MiB
Packager : Unknown Packager
Build Date : Wed 26 Feb 2025 02:32:19 PM EST
Install Date : Wed 26 Feb 2025 02:33:57 PM EST
Install Reason : Explicitly installed
Install Script : Yes
Validated By : None
I’m using the latest kernel right now, but this problem has persisted for a few years, so I don’t think switching to LTS is likely to help.
While running memtest86+, I did find a memory error at address 0x9d4d48, so I added memmap=4K$0x0000009d4000 to the kernel command line to reserve the memory around that. Now I’ll have to see if that improves the situation.
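For anyone wanting to do the same, here is a minimal sketch of how the memmap value can be derived from a failing address: round the address down to a 4 KiB boundary and reserve one page from there. The helper name memmap_arg is made up for illustration and is not part of any tool.

```shell
#!/bin/sh
# Hypothetical helper: print a memmap= argument that reserves the 4 KiB page
# containing the given bad address (hex like 0x9d4d48 is accepted).
memmap_arg() (
    addr=$(( $1 ))
    base=$(( addr & ~0xfff ))   # clear the low 12 bits: round down to 4 KiB
    # Single quotes keep the "$" literal; the base is zero-padded to 12 hex digits.
    printf 'memmap=4K$0x%012x\n' "$base"
)

memmap_arg 0x9d4d48
```

Some boot loaders treat $ specially, so it may need escaping in the boot loader config. After a reboot, cat /proc/cmdline shows what the kernel actually received.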
Update (since I can’t edit my last post anymore): This doesn’t seem to have solved the problem, as I got another random reboot. I ran memtest86+ again, but after 12 hours of testing it didn’t find any errors. I’ll try the LTS kernel if this problem occurs again.
I had a similar problem. For me, eventually replacing the motherboard seems to have been the fix. Not saying you should jump to that. But search my post in this subforum in case it gives any insights. A frustrating problem for sure.
I’ve replaced my RAM now (64 GB now from 32, to boot!). I was worried when my laptop kept shutting down while running memtest86+, but when I booted my OS, I found that the problem was the battery being empty. (Maybe the new RAM is a lot more power-hungry?) I hope I don’t run into any problems now that I’ve replaced the RAM, but the only way to find out is to wait.
Edit: nope, it crashed after 1 hour of running. See most recent log from kdumpst. Maybe it’s because I bought two single DDR4 modules instead of a pair. Perhaps I’ll try using only one of them; I can live with 32 GB.
I tried running memtester (not memtest86+) to see if I could detect any memory problems while working on other things, and strangely, my system stayed up for 39 days before crashing – a lot longer than at any other time since this problem started. The crash log captured by kdumpst looks a lot like the earlier ones, but here it is just in case.
If you google for linux & watchdog you will find much more.
I’ve struggled with this for some days (again), because my mobo (ASUS B360M) presents a HW watchdog to every system, and it caused random reboots…
You should check (I am on Intel, AMD may use other module):
# lsmod | grep wdt
# cat /proc/sys/kernel/nmi_watchdog (should be 0)
# ps -af | grep watch
If you find something here, maybe you should stop it…
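The checks above can be rolled into one small script. This is only a sketch: the function name nmi_watchdog_state is made up, and its optional file argument exists only so it can be tried against a test file instead of the real procfs entry.

```shell
#!/bin/sh
# Hypothetical helper: report the NMI watchdog state from a procfs-style file.
nmi_watchdog_state() {
    file=${1:-/proc/sys/kernel/nmi_watchdog}
    [ -r "$file" ] || { echo unknown; return 0; }
    case "$(cat "$file")" in
        0) echo disabled ;;
        1) echo enabled ;;
        *) echo unknown ;;
    esac
}

echo "NMI watchdog: $(nmi_watchdog_state)"
# Loaded watchdog drivers, if any (module names differ between Intel and AMD).
lsmod 2>/dev/null | grep -i wdt || echo "no *wdt* module loaded"
```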
I had to disable both the HW watchdog and the NMI watchdog; since then, the problem is gone.
Add the kernel parameters “nowatchdog nmi_watchdog=0” to the boot command line.
Blacklist HW-watchdog (& mei-stuff):
/etc/modprobe.d/watchdog_disable.conf should contain:
# Disable intel mei (including mei_watchdog).
blacklist intel_pmc_bxt
blacklist iTCO_vendor_support
# Do not load the 'iTCO_wdt' watchdog module on boot.
blacklist iTCO_wdt
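After a reboot, it may be worth confirming that the blacklisted modules really stayed unloaded. A rough sketch, assuming the file path given above; the function name blacklisted_modules is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the module name from every "blacklist <name>"
# line of a modprobe.d file; comment lines are skipped automatically.
blacklisted_modules() {
    awk '$1 == "blacklist" { print $2 }' "$1"
}

conf=/etc/modprobe.d/watchdog_disable.conf
if [ -r "$conf" ]; then
    for m in $(blacklisted_modules "$conf"); do
        # lsmod prints the module name in its first column.
        if lsmod | awk '{ print $1 }' | grep -qx "$m"; then
            echo "$m: still loaded"
        else
            echo "$m: not loaded"
        fi
    done
fi
```

If a module shows up as still loaded, something else may be pulling it in, for example an initramfs built before the file was added.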
If the NMI-wd is still present:
create /etc/sysctl.d/disable_watchdog.conf with this line:
kernel.nmi_watchdog = 0
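A sketch of creating that file and applying it without a reboot. The function name write_nmi_sysctl and its directory argument are made up for illustration; the commented commands need root.

```shell
#!/bin/sh
# Hypothetical helper: write the one-line sysctl drop-in into the given
# directory (defaults to /etc/sysctl.d, which requires root).
write_nmi_sysctl() {
    dir=${1:-/etc/sysctl.d}
    printf 'kernel.nmi_watchdog = 0\n' > "$dir/disable_watchdog.conf"
}

# Typical use, as root:
#   write_nmi_sysctl            # creates /etc/sysctl.d/disable_watchdog.conf
#   sysctl --system             # reloads all sysctl configuration files
#   sysctl kernel.nmi_watchdog  # should now report 0
```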
Maybe your BIOS has an option to disable the watchdog; mine doesn’t…
This is what got my system to quiet down. HTH.
Good luck!