System freezing or randomly restarting

SvalbardDistrict · February 26, 2024, 9:50am

Hi, I searched for solutions to this on the forum, but no threads seem to have anything specifically relevant to my case. I hope I’m opening the thread in the correct section - I assume the issue is related either to kernel or hardware.

I came back to EndeavourOS a few weeks ago with a new PC build, but now I am experiencing the same issue I had with my previous system. Back then I did not find any solutions to this. My system keeps freezing or restarting, and I’m not sure which logs to look at – or what to pay attention to in the logs – given my inexperience with this.

I will post some outputs as recommended here:

inxi -Fxxc0z | eos-sendlog: https://0x0.st/HRXS.txt

journalctl -k -b -0 | eos-sendlog: https://0x0.st/HRXM.txt

lsblk -o name,type,size,PTTYPE,FSTYPE:

NAME        TYPE   SIZE PTTYPE FSTYPE
nvme1n1     disk 476,9G gpt
├─nvme1n1p1 part  1000M gpt    vfat
├─nvme1n1p2 part 442,4G gpt    ext4
└─nvme1n1p3 part  33,6G gpt    swap
nvme0n1     disk 931,5G gpt
├─nvme0n1p1 part    16M gpt
├─nvme0n1p2 part   100M gpt    vfat
├─nvme0n1p3 part 930,9G gpt    ntfs
└─nvme0n1p4 part   509M gpt    ntfs

journalctl -b -0 # last boot:

თებ 26 13:27:29 SSD-DTP-LNX kernel: Linux version 6.6.18-1-lts (linux-lts@archlinux) (gcc (GCC) 13.2.1 20230801, GNU ld (GNU Binutils) 2.42.0) #1 SMP PREEMPT_DYNAMIC Fri, 23 Feb 2024 14:58:40 +0000
თებ 26 13:27:29 SSD-DTP-LNX kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-linux-lts root=UUID=a84707c1-01de-4a58-944f-9e4808826c2d rw nowatchdog nvme_load=YES resume=UUID=36f0c337-2635-4717-82b2-67a7c4f6219a loglevel=3
თებ 26 13:27:29 SSD-DTP-LNX kernel: BIOS-provided physical RAM map:
თებ 26 13:27:29 SSD-DTP-LNX kernel: BIOS-e820: [mem 0x0000000000000000-0x000000000009ffff] usable
თებ 26 13:27:29 SSD-DTP-LNX kernel: BIOS-e820: [mem 0x00000000000a0000-0x00000000000fffff] reserved
თებ 26 13:27:29 SSD-DTP-LNX kernel: BIOS-e820: [mem 0x0000000000100000-0x0000000009afefff] usable
თებ 26 13:27:29 SSD-DTP-LNX kernel: BIOS-e820: [mem 0x0000000009aff000-0x0000000009ffffff] reserved
თებ 26 13:27:29 SSD-DTP-LNX kernel: BIOS-e820: [mem 0x000000000a000000-0x000000000a1fffff] usable
თებ 26 13:27:29 SSD-DTP-LNX kernel: BIOS-e820: [mem 0x000000000a200000-0x000000000a20ffff] ACPI NVS
თებ 26 13:27:29 SSD-DTP-LNX kernel: BIOS-e820: [mem 0x000000000a210000-0x000000000affffff] usable
თებ 26 13:27:29 SSD-DTP-LNX kernel: BIOS-e820: [mem 0x000000000b000000-0x000000000b020fff] reserved
თებ 26 13:27:29 SSD-DTP-LNX kernel: BIOS-e820: [mem 0x000000000b021000-0x000000008a4c0fff] usable
თებ 26 13:27:29 SSD-DTP-LNX kernel: BIOS-e820: [mem 0x000000008a4c1000-0x00000000904c0fff] reserved
თებ 26 13:27:29 SSD-DTP-LNX kernel: BIOS-e820: [mem 0x00000000904c1000-0x000000009067efff] ACPI data
თებ 26 13:27:29 SSD-DTP-LNX kernel: BIOS-e820: [mem 0x000000009067f000-0x000000009267efff] ACPI NVS
თებ 26 13:27:29 SSD-DTP-LNX kernel: BIOS-e820: [mem 0x000000009267f000-0x000000009a7fefff] reserved
თებ 26 13:27:29 SSD-DTP-LNX kernel: BIOS-e820: [mem 0x000000009a7ff000-0x000000009bff8fff] usable
თებ 26 13:27:29 SSD-DTP-LNX kernel: BIOS-e820: [mem 0x000000009bff9000-0x000000009bffdfff] reserved
თებ 26 13:27:29 SSD-DTP-LNX kernel: BIOS-e820: [mem 0x000000009bffe000-0x000000009bffffff] usable
თებ 26 13:27:29 SSD-DTP-LNX kernel: BIOS-e820: [mem 0x000000009c000000-0x000000009cffffff] reserved
თებ 26 13:27:29 SSD-DTP-LNX kernel: BIOS-e820: [mem 0x000000009d7f3000-0x000000009fffffff] reserved
თებ 26 13:27:29 SSD-DTP-LNX kernel: BIOS-e820: [mem 0x00000000e0000000-0x00000000efffffff] reserved
თებ 26 13:27:29 SSD-DTP-LNX kernel: BIOS-e820: [mem 0x00000000f7000000-0x00000000ffffffff] reserved
თებ 26 13:27:29 SSD-DTP-LNX kernel: BIOS-e820: [mem 0x0000000100000000-0x000000083defffff] usable
თებ 26 13:27:29 SSD-DTP-LNX kernel: BIOS-e820: [mem 0x000000083ef40000-0x00000008801fffff] reserved
თებ 26 13:27:29 SSD-DTP-LNX kernel: BIOS-e820: [mem 0x000000fd00000000-0x000000ffffffffff] reserved
თებ 26 13:27:29 SSD-DTP-LNX kernel: NX (Execute Disable) protection: active
თებ 26 13:27:29 SSD-DTP-LNX kernel: APIC: Static calls initialized
თებ 26 13:27:29 SSD-DTP-LNX kernel: e820: update [mem 0x80619018-0x80634657] usable ==> usable
თებ 26 13:27:29 SSD-DTP-LNX kernel: e820: update [mem 0x80619018-0x80634657] usable ==> usable
თებ 26 13:27:29 SSD-DTP-LNX kernel: e820: update [mem 0x80609018-0x80618057] usable ==> usable
თებ 26 13:27:29 SSD-DTP-LNX kernel: e820: update [mem 0x80609018-0x80618057] usable ==> usable
თებ 26 13:27:29 SSD-DTP-LNX kernel: e820: update [mem 0x805fe018-0x80608e57] usable ==> usable
თებ 26 13:27:29 SSD-DTP-LNX kernel: e820: update [mem 0x805fe018-0x80608e57] usable ==> usable
თებ 26 13:27:29 SSD-DTP-LNX kernel: extended physical RAM map:
თებ 26 13:27:29 SSD-DTP-LNX kernel: reserve setup_data: [mem 0x0000000000000000-0x000000000009ffff] usable
თებ 26 13:27:29 SSD-DTP-LNX kernel: reserve setup_data: [mem 0x00000000000a0000-0x00000000000fffff] reserved
თებ 26 13:27:29 SSD-DTP-LNX kernel: reserve setup_data: [mem 0x0000000000100000-0x0000000009afefff] usable
თებ 26 13:27:29 SSD-DTP-LNX kernel: reserve setup_data: [mem 0x0000000009aff000-0x0000000009ffffff] reserved
თებ 26 13:27:29 SSD-DTP-LNX kernel: reserve setup_data: [mem 0x000000000a000000-0x000000000a1fffff] usable
თებ 26 13:27:29 SSD-DTP-LNX kernel: reserve setup_data: [mem 0x000000000a200000-0x000000000a20ffff] ACPI NVS
თებ 26 13:27:29 SSD-DTP-LNX kernel: reserve setup_data: [mem 0x000000000a210000-0x000000000affffff] usable
თებ 26 13:27:29 SSD-DTP-LNX kernel: reserve setup_data: [mem 0x000000000b000000-0x000000000b020fff] reserved
თებ 26 13:27:29 SSD-DTP-LNX kernel: reserve setup_data: [mem 0x000000000b021000-0x00000000805fe017] usable
თებ 26 13:27:29 SSD-DTP-LNX kernel: reserve setup_data: [mem 0x00000000805fe018-0x0000000080608e57] usable
თებ 26 13:27:29 SSD-DTP-LNX kernel: reserve setup_data: [mem 0x0000000080608e58-0x0000000080609017] usable
თებ 26 13:27:29 SSD-DTP-LNX kernel: reserve setup_data: [mem 0x0000000080609018-0x0000000080618057] usable
თებ 26 13:27:29 SSD-DTP-LNX kernel: reserve setup_data: [mem 0x0000000080618058-0x0000000080619017] usable
თებ 26 13:27:29 SSD-DTP-LNX kernel: reserve setup_data: [mem 0x0000000080619018-0x0000000080634657] usable
თებ 26 13:27:29 SSD-DTP-LNX kernel: reserve setup_data: [mem 0x0000000080634658-0x000000008a4c0fff] usable
თებ 26 13:27:29 SSD-DTP-LNX kernel: reserve setup_data: [mem 0x000000008a4c1000-0x00000000904c0fff] reserved
თებ 26 13:27:29 SSD-DTP-LNX kernel: reserve setup_data: [mem 0x00000000904c1000-0x000000009067efff] ACPI data
თებ 26 13:27:29 SSD-DTP-LNX kernel: reserve setup_data: [mem 0x000000009067f000-0x000000009267efff] ACPI NVS
თებ 26 13:27:29 SSD-DTP-LNX kernel: reserve setup_data: [mem 0x000000009267f000-0x000000009a7fefff] reserved

Let me know if there are other commands I could run to print useful information.

chikenf00t · February 26, 2024, 3:20pm

Looks like you might be using a hybrid GPU setup? You could see whether working with PRIME helps. Maybe also check whether switching power profiles does anything.

SvalbardDistrict · February 26, 2024, 3:23pm

Thank you for the reply. Is that known to cause this kind of problem? And if I disable the integrated GPU in BIOS, would that work as an alternative to this workaround? I will attempt what you suggested as well, but disabling the iGPU would be simpler.

P.S. I do not see any power profiles in Power Management, unless I should look for that elsewhere.

chikenf00t · February 26, 2024, 3:30pm

I’m not too sure, all I know is that hybrid graphics tend to cause issues at times. You could try disabling it and see if that at least isolates the problem.

As for the power profiles, you could give this a try. I’m on Nvidia, so unfortunately I can’t do much testing on my end.

anthony93 · February 26, 2024, 4:56pm

Do you mind posting the full logs? You claimed that your system was randomly restarting, yet I see no trace of said behavior inside the logs you posted (most likely because you didn’t post the log for the problematic boot). Remove the -k flag because that will filter the logs to only include kernel messages. If the system was randomly rebooting, something in the full log could help us identify the trigger.

Also, please identify the problematic boot. If your system had randomly restarted on its own, then using 0 with journalctl makes no sense because 0 means the current boot. By the time you were running the journalctl command, your system must have rebooted at least once already (due to the issue you mentioned). If that’s the case, the problematic boot must be 1, 2, etc. depending on how many times you have rebooted since the initial post was created. It simply cannot be 0.
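
To make the boot numbering concrete, here is a minimal Python sketch. The helper name `journal_command` is made up for illustration; it only assumes the standard `-b`, `-k` and `--no-pager` options of journalctl. Running `journalctl --list-boots` first shows which boot offsets actually exist on your system.

```python
def journal_command(boots_back: int, kernel_only: bool = False) -> list[str]:
    """Build a journalctl command for the boot `boots_back` reboots ago.

    boots_back=0 is the current boot, 1 is the previous boot, and so on.
    (Hypothetical helper, not part of any journalctl tooling.)
    """
    if boots_back < 0:
        raise ValueError("boots_back must be 0 or greater")
    cmd = ["journalctl", "-b", f"-{boots_back}", "--no-pager"]
    if kernel_only:
        cmd.append("-k")  # -k restricts the output to kernel messages
    return cmd


if __name__ == "__main__":
    # Full log of the previous boot: the one that ended in the crash,
    # provided the machine has booted exactly once since then.
    print(" ".join(journal_command(1)))
```

If the machine has been rebooted twice since the crash, the same helper with `2` gives the log of the boot before that.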

SvalbardDistrict · February 26, 2024, 5:19pm

Makes sense. The next time it reboots or freezes, I will post the full log with 1 and without the k flag. Thank you for the reply.

SvalbardDistrict · February 27, 2024, 11:30am

Well, switching the iGPU off apparently has no bearing on this - just had another random reboot while watching a video on Twitch.

New dump:

journalctl -b -1 | eos-sendlog: https://0x0.st/HRmX.txt

I can see hardware errors throughout this, especially toward the end, but no idea what those are referring to.

anthony93 · February 27, 2024, 11:45am

თებ 27 15:07:05 SSD-DTP-LNX kernel: mce: [Hardware Error]: Machine check events logged
თებ 27 15:07:05 SSD-DTP-LNX kernel: [Hardware Error]: Corrected error, no action required.
თებ 27 15:07:05 SSD-DTP-LNX kernel: [Hardware Error]: CPU:3 (19:61:2) MC3_STATUS[-|CE|MiscV|-|-|-|SyndV|-|-|-]: 0x9820000000010150
თებ 27 15:07:05 SSD-DTP-LNX kernel: [Hardware Error]: IPID: 0x000300b020069600, Syndrome: 0x000000001a000236
თებ 27 15:07:05 SSD-DTP-LNX kernel: [Hardware Error]: Decode Unit Ext. Error Code: 1, Micro-op cache data parity error.
თებ 27 15:07:05 SSD-DTP-LNX kernel: [Hardware Error]: cache level: RESV, tx: INSN, mem-tx: IRD

These look like bad news. A machine-check exception is raised when the CPU detects a hardware error. Have you done any memtests lately?
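
If you want to see where the flags in the `MC3_STATUS[...]` line come from, the raw value can be picked apart by hand. Below is a rough Python sketch. The bit positions are my reading of the `MCI_STATUS_*` definitions in the Linux kernel sources, so treat them as an assumption and check AMD's documentation for your CPU family; the function name `decode_status` is made up for this example.

```python
# Bit positions assumed from the Linux kernel's MCI_STATUS_* definitions.
STATUS_BITS = {
    "Val": 63,       # register holds a valid error
    "Over": 62,      # an earlier error was overwritten
    "UC": 61,        # uncorrected error
    "En": 60,        # error reporting enabled
    "MiscV": 59,     # MISC register valid
    "AddrV": 58,     # ADDR register valid
    "PCC": 57,       # processor context corrupt
    "TCC": 55,       # task context corrupt (AMD)
    "SyndV": 53,     # syndrome register valid (AMD)
    "Deferred": 44,  # deferred error (AMD)
}


def decode_status(status: int) -> dict:
    """Split a raw MCA status value into flags and error-code fields."""
    flags = {name: bool((status >> bit) & 1) for name, bit in STATUS_BITS.items()}
    return {
        "flags": flags,
        # Corrected means: valid, not uncorrected and not deferred.
        "corrected": flags["Val"] and not flags["UC"] and not flags["Deferred"],
        "ext_error_code": (status >> 16) & 0x3F,  # assumed 6-bit field
        "error_code": status & 0xFFFF,
    }


if __name__ == "__main__":
    info = decode_status(0x9820000000010150)  # value from the log above
    set_flags = [name for name, on in info["flags"].items() if on]
    print(set_flags, info["corrected"], info["ext_error_code"])
```

For the value in the log this gives MiscV and SyndV set, UC clear, and an extended error code of 1, which lines up with the "CE", "MiscV", "SyndV" and "Ext. Error Code: 1" parts of the kernel message.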

You can also use a tool called mcelog to find out more about the error. mcelog is available on the AUR.

https://aur.archlinux.org/packages/mcelog

Caveat:

It is also necessary to compile your kernel with CONFIG_X86_MCELOG_LEGACY=y in order to use mcelog.

fred666 · February 27, 2024, 11:46am

If there is an updated BIOS, update first, as that might fix the MCE hardware error.

You state that this is the same problem as with your previous computer? Did you reuse any hardware from your previous computer?

SvalbardDistrict · February 27, 2024, 12:01pm

BIOS is fully up to date. The only part I’m reusing from the previous system is the NVMe the distro is installed on.

chikenf00t · February 27, 2024, 3:47pm

Have you tested that drive to make sure there are no errors present?

SvalbardDistrict · February 27, 2024, 4:04pm

I just noticed this reply. I have not done any recent memtests. I am also running my RAM overclocked - could that be the issue?

To reply to the other post, I hadn’t done a drive check until now, as I wasn’t having this issue with a previous distro.

I will look up how to do memtests and drive tests, and also check that AUR package. Will also look into rolling back some of the RAM overclocking in BIOS and see if that changes anything.

I appreciate everyone’s time and help with this.

anthony93 · February 27, 2024, 4:07pm

I think it’s a possibility. I’ve seen reports of users getting MCEs when they do things to mess with their hardware (like overclocking, undervolting, etc.)

Try to test your system without overclocking the RAM and see if the issue pops up.

ricklinux · February 27, 2024, 4:51pm

Definitely could be caused by this.

SvalbardDistrict · February 27, 2024, 5:22pm

I had done three things for RAM overclocking - enabled the 6000-speed preset, Low Latency Support and High Bandwidth Support. I remember reading that many people have stability issues running with both of the latter enabled. I just disabled HBS - will test it like this, then if the issue persists also disable LLS, etc. It will probably take a few days to see definite results, but will report back. I will also look into how to do a memtest and drive test.

anthony93 · February 27, 2024, 5:24pm

Best of luck. Hopefully, the culprit is among those things, not faulty hardware.

Guan · February 28, 2024, 9:24am

I had the same issue with AMD and EXPO enabled with RAM at 6000 MT/s. I changed it to the 5600 MT/s EXPO profile, which seems to have fixed the issue, but I still don’t know why.

SvalbardDistrict · February 28, 2024, 9:36am

Thank you for that feedback - it’s helpful to know. My RAM only seems to have the 6000 preset, but maybe I can manually tweak it, if it comes to that.

SvalbardDistrict · March 2, 2024, 10:20am

Disabling High Bandwidth Support had no effect on the issue, additionally disabling Low Latency Support did not solve it, and resetting the RAM to stock speeds did not stop the random reboots/freezes either. The most recent log is this: https://0x0.st/H7zT.txt

I did a drive test by installing smartmontools and running sudo smartctl -a /dev/nvme1n1p2 on the partition where EndeavourOS is installed. It returned the following:

smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.6.18-1-lts] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       TEAM TM8FP6512G
Serial Number:                      TPBF2206230030401269
Firmware Version:                   SN08868
PCI Vendor/Subsystem ID:            0x1e4b
IEEE OUI Identifier:                0x3a5a27
Total NVM Capacity:                 512.110.190.592 [512 GB]
Unallocated NVM Capacity:           0
Controller ID:                      0
NVMe Version:                       1.4
Number of Namespaces:               1
Namespace 1 Size/Capacity:          512.110.190.592 [512 GB]
Namespace 1 Utilization:            113.782.514.176 [113 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            3a5a27 032000031e
Local Time is:                      Sat Mar  2 14:12:59 2024 +04
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0007):   Security Format Frmw_DL
Optional NVM Commands (0x001f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat
Log Page Attributes (0x03):         S/H_per_NS Cmd_Eff_Lg
Maximum Data Transfer Size:         128 Pages
Warning  Comp. Temp. Threshold:     90 Celsius
Critical Comp. Temp. Threshold:     95 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     6.50W       -        -    0  0  0  0        0       0
 1 +     5.80W       -        -    1  1  1  1        0       0
 2 +     3.60W       -        -    2  2  2  2        0       0
 3 -   0.7460W       -        -    3  3  3  3     5000   10000
 4 -   0.7260W       -        -    4  4  4  4     8000   45000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        36 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    8%
Data Units Read:                    20.702.537 [10,5 TB]
Data Units Written:                 78.264.975 [40,0 TB]
Host Read Commands:                 208.119.645
Host Write Commands:                464.149.798
Controller Busy Time:               1.707
Power Cycles:                       1.216
Power On Hours:                     7.020
Unsafe Shutdowns:                   130
Media and Data Integrity Errors:    0
Error Information Log Entries:      2.345
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               42 Celsius
Temperature Sensor 2:               42 Celsius
Temperature Sensor 3:               42 Celsius
Temperature Sensor 4:               42 Celsius
Temperature Sensor 5:               42 Celsius
Temperature Sensor 6:               42 Celsius
Temperature Sensor 7:               42 Celsius
Temperature Sensor 8:               42 Celsius

Error Information (NVMe Log 0x01, 16 of 64 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS  Message
  0       2345     0  0x3015  0x2002  0x000            0     0     -  Invalid Command Opcode

Self-tests not supported

Not sure if the SMART test passing means this NVMe should not be the cause of the issue, or if logs could have answers. I don’t know where to find those either. And should I run the test on other partitions as well (swap, etc)?
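
As far as I understand, SMART health data is reported for the drive as a whole, not per partition, so the other partitions on the same NVMe should return the same numbers. To keep an eye on the fields that seem most relevant, here is a small Python sketch that pulls them out of saved `smartctl -a` output. The function name `parse_smart` and the choice of fields are my own assumptions, and it treats "." and "," as thousands separators because that is how my locale prints the counters.

```python
import re

# Fields to watch; this selection is an assumption, not an official list.
WATCHED = (
    "Critical Warning",
    "Available Spare",
    "Percentage Used",
    "Unsafe Shutdowns",
    "Media and Data Integrity Errors",
    "Error Information Log Entries",
)


def parse_smart(text: str) -> dict:
    """Extract the watched integer fields from smartctl -a text output."""
    result = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        key = key.strip()
        if not sep or key not in WATCHED:
            continue
        value = value.strip()
        if value.startswith("0x"):
            result[key] = int(value, 16)
        else:
            # Drop thousands separators ("2.345") and a trailing "%".
            result[key] = int(re.sub(r"[.,%\s]", "", value))
    return result


if __name__ == "__main__":
    sample = (
        "Critical Warning:                   0x00\n"
        "Unsafe Shutdowns:                   130\n"
        "Media and Data Integrity Errors:    0\n"
        "Error Information Log Entries:      2.345\n"
    )
    print(parse_smart(sample))
```

Comparing these values between runs should at least show whether the error counters keep growing after each freeze or reboot.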

I also did two full passes of memtest86 on the overclocked 6000-speed preset of my RAM, without any errors being shown.

Now I have installed the latest stable kernel (6.7.6) and switched to it, to see if that makes any difference.

Guan · March 4, 2024, 7:34am

Did resetting/updating the BIOS help here? Or did you check whether the CPU overheated?