Half of my reboots lead to a black screen

Kundalini · October 17, 2023, 10:39pm

Hi,

I’m not sure what has changed in the last two weeks or so, but suddenly about half of my reboots end in a black screen, and the machine stays unresponsive until I do a hard reboot. Sometimes it only gets as far as the screen asking which kernel to load; other times it gets as far as asking for my encrypted drive password. The only workaround I have is to keep rebooting until it suddenly works again after a few tries. I cannot think of anything I would have done as a user to cause this.

Is this experienced by anyone else? Is it a known issue? What can I do about it?

emk2203 · October 18, 2023, 7:11am

Sounds like a failing system drive.

What’s the output of smartctl -a <your system drive>?

Kundalini · October 18, 2023, 8:17pm

Can you be more specific about how to use that command? How do I get my system drive name?

emk2203 · October 19, 2023, 8:40am

Run lsblk and look for the drive that holds the root partition (MOUNTPOINT /). The drive at the top of that tree (not the partition itself) is the system drive. Use its device name (/dev/sda, /dev/nvme0n1 or whatever it is) in the smartctl command.
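
For reference, the lookup described above can also be scripted. This is only a sketch: it assumes findmnt and lsblk from util-linux are installed, and the helper names pick_disk and system_drive are made up for illustration.

```shell
#!/bin/sh
# pick_disk (hypothetical helper): reads "NAME TYPE" pairs, as printed by
# `lsblk -nrso NAME,TYPE <device>`, and prints the whole-disk device(s).
pick_disk() {
    awk '$2 == "disk" { print "/dev/" $1 }'
}

# system_drive (hypothetical helper): resolve the device backing / and walk
# up the inverse lsblk tree to its parent disk.
# -v keeps findmnt from appending a btrfs subvolume such as [/@] to SOURCE.
system_drive() {
    root_src=$(findmnt -nvo SOURCE /) || return 1
    lsblk -nrso NAME,TYPE "$root_src" | pick_disk
}
```

Calling system_drive should print something like /dev/nvme0n1, including when the root filesystem sits on an encrypted mapping, because lsblk -s walks from the mapped device back up to the physical disk.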

manuel · October 19, 2023, 9:22am

As @emk2203 seems to suspect, this may be caused by a disk problem.
So I suggest that, as the first thing, you back up all important data on that disk to an external drive.

Kundalini · October 19, 2023, 8:16pm

I asked just to be sure, and it turns out I get the same result I had before I asked. Is this correct?

Kundalini · October 19, 2023, 8:26pm

Thankfully, I always keep all important data external

emk2203 · October 19, 2023, 9:07pm

For smartctl, you need to use sudo smartctl -a /dev/nvme0n1. Sorry if that wasn’t clear.

Kundalini · October 21, 2023, 7:48am

It’s happening more and more. Not sure why this is occurring suddenly, but it’s very annoying, and I need to get to the bottom of it.

smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.5.8-arch1-1] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       Samsung SSD 970 EVO Plus 2TB
Serial Number:                      S59CNM0W426593T
Firmware Version:                   2B2QEXM7
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 2,000,398,934,016 [2.00 TB]
Unallocated NVM Capacity:           0
Controller ID:                      4
NVMe Version:                       1.3
Number of Namespaces:               1
Namespace 1 Size/Capacity:          2,000,398,934,016 [2.00 TB]
Namespace 1 Utilization:            395,703,070,720 [395 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            002538 5431925499
Local Time is:                      Sat Oct 21 14:26:51 2023 PDT
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x03):         S/H_per_NS Cmd_Eff_Lg
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     85 Celsius
Critical Comp. Temp. Threshold:     85 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     7.50W       -        -    0  0  0  0        0       0
 1 +     5.90W       -        -    1  1  1  1        0       0
 2 +     3.60W       -        -    2  2  2  2        0       0
 3 -   0.0700W       -        -    3  3  3  3      210    1200
 4 -   0.0050W       -        -    4  4  4  4     2000    8000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        50 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    2,318,247 [1.18 TB]
Data Units Written:                 5,922,962 [3.03 TB]
Host Read Commands:                 15,693,109
Host Write Commands:                37,412,269
Controller Busy Time:               146
Power Cycles:                       56
Power On Hours:                     105
Unsafe Shutdowns:                   17
Media and Data Integrity Errors:    0
Error Information Log Entries:      447
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               50 Celsius
Temperature Sensor 2:               41 Celsius

Error Information (NVMe Log 0x01, 16 of 64 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS  Message
  0        447     0  0x001c  0x4004      -            0     0     -  Invalid Field in Command

Self-test Log (NVMe Log 0x06)
Self-test status: No self-test in progress
No Self-tests Logged

pebcak · October 21, 2023, 7:58am

Please post terminal output as formatted text and not screenshots!

This will make it easier to read, copy, paste, quote and comment on.

It also makes the text searchable with the forum search function.

Thanks in advance!

emk2203 · October 21, 2023, 7:59am

It’s a good SSD brand and brand new, and the only logged errors are Invalid Field in Command, so it looks like the SSD is healthy but something else in the system is causing problems.
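
To make that reading easier to repeat, here is a small sketch that pulls just the relevant counters out of the smartctl text. It assumes the NVMe field labels shown in the output above; the helper name smart_flags and the key names it prints are made up for illustration.

```shell
#!/bin/sh
# smart_flags (hypothetical helper): reads `smartctl -a` text for an NVMe
# drive on stdin and prints only the counters discussed in this thread.
# Fields are split on the colon and the padding spaces that follow it.
smart_flags() {
    awk -F': *' '
        /^Critical Warning:/                { print "critical_warning=" $2 }
        /^Media and Data Integrity Errors:/ { print "media_errors=" $2 }
        /^Unsafe Shutdowns:/                { print "unsafe_shutdowns=" $2 }
        /^Error Information Log Entries:/   { print "error_log_entries=" $2 }
    '
}
```

It would be used as sudo smartctl -a /dev/nvme0n1 | smart_flags. For the output posted above, the critical warning is 0x00 and the media error count is 0, while the 17 unsafe shutdowns would be consistent with the hard reboots described in the first post.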

Faulty RAM? Faulty connector or controller? Difficult to say. Run memtest86+ from the boot menu to be sure that the RAM is OK.

One further avenue to explore could be nvme self-test-log /dev/nvme0, to try to see the errors in more detail. I don’t know whether it will prove insightful enough to be worth the work.

It’s
sudo nvme device-self-test /dev/nvme0n1 -s 1
followed by
sudo nvme self-test-log --dst-entries 1 -v /dev/nvme0

And if you post again, post diagnostics as text, not image.

If I were you, I would move the SSD to a different system and put a different SSD in the problem system, and then test both systems under load.

Kundalini · October 21, 2023, 9:28pm

Haha, that’s twice you got me… I will learn

Kundalini · October 21, 2023, 9:30pm

I’ll have to find the time to do all this as I am super busy with work, but I can say for sure that all my equipment is brand new, so I would be surprised if there were suddenly an issue. I mean it’s possible, but what I’m getting at is that it is all brand-new hardware from reputable name brands. I could be wrong, but I suspect it has something to do with the OS and an update at some point. The reason I say this is that I had a similar issue previously that magically disappeared, and now it appears to have resurfaced, but worse than before.

emk2203 · October 22, 2023, 6:42am

Hardware failure rates usually follow a bathtub curve. You can expect new equipment to fail more often early on (“infant mortality” failures).

This is why people often do a burn-in and run the equipment under stress for some time before using it in production.

Your failures look a lot more like hardware issues than software issues.

Kundalini · October 23, 2023, 3:10am

It’s an Intel NUC 12 Enthusiast device. No issues booting up anything else. It only happens on EndeavourOS.

emk2203 · October 23, 2023, 4:32am

What do you mean by ‘anything else’? Windows only? At the system level we are talking about, there shouldn’t be a difference between Linux distributions if they run a similar kernel. NVMe access always goes through the same driver.

Linux and Windows are different, though. Windows is better at hiding flaws in the system.

Kundalini · October 24, 2023, 8:00pm

For the sake of knowing for sure, I am going to install a different distro for a day and run several reboot tests to see if I can replicate it. If it doesn’t happen, then wouldn’t it mean it’s not the hardware?
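
A rough way to judge how many clean reboots such a test needs: if the fault really hits about half of all boots, as in my first post, and each boot is assumed to be independent of the others (an assumption, not something established in this thread), the chance of n clean boots in a row by luck alone is 0.5^n. A minimal sketch of that arithmetic, with a made-up helper name:

```shell
#!/bin/sh
# clean_streak_chance (hypothetical helper): probability of seeing N clean
# boots in a row by luck alone, if each boot fails with probability P.
# Assumes boots are independent of each other.
# Usage: clean_streak_chance P N
clean_streak_chance() {
    awk -v p="$1" -v n="$2" 'BEGIN { printf "%.4f\n", (1 - p) ^ n }'
}
```

For example, clean_streak_chance 0.5 3 prints 0.1250 and clean_streak_chance 0.5 7 prints 0.0078, so three clean reboots say little, while seven or more in a row would be unlikely if the fault were still present at the same rate. Even then, a clean run only makes a hardware fault less likely; it does not rule one out.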

Kundalini · October 25, 2023, 1:36am

Turns out you were right. I swapped both the SSD and the RAM at the same time and that problem is now gone. Only now I am unsure if the problem was the SSD or the RAM. Which do you think it most likely was? Thanks for your help btw.

emk2203 · October 25, 2023, 11:00am

I had the same thing happen to me years ago with a new build, and it was the RAM. My issues surfaced only after some hours of use, so it was somewhat different, but I would try the RAM first.

There’s a reason why manufacturers make compatibility lists for RAM but rarely for other components.

Kundalini · October 25, 2023, 7:38pm

Never mind. I spoke too soon. It suddenly happened again today. I’m going to RMA this Intel NUC. Well, at least you helped me figure out it’s faulty while still within the warranty period. Hopefully the fault really is in this unit and it doesn’t also happen with a replacement.