NVME Boot Drive Dead?

I had gotten the error in the attached picture a few times at boot before. Each time I was able to power down, reseat the M.2 drive, and things went back to working again. I assumed this meant the drive was failing, so the last few times I was able to boot into the system I made a point of copying everything important off the drive (I’m hoping). I’m now at the point where, no matter what I try with the drive, I can’t get the system to boot. Does this point to a failing drive, a bad M.2 slot (there is only one on the motherboard to try), or something else? Any other suggestions on what to try, or on ways to make sure I got what I needed off the drive?

It could indicate a failing drive or it could be something else altogether.

Boot off the ISO and see if you can mount the partition and take a look around.
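A minimal sketch of that step, assuming the live session sees the drive as /dev/nvme0n1 with the root filesystem on partition 2 (the device name and the /mnt/rescue mountpoint are assumptions; check `lsblk` first). It mounts read-only so nothing gets written to a possibly failing drive. By default it only prints the commands; set `DRY_RUN=0` to actually run them.

```shell
#!/bin/sh
# Sketch: inspect a suspect partition from a live ISO without writing to it.
# DEVICE and MOUNTPOINT are assumptions; adjust them to what lsblk reports.
DEVICE="${DEVICE:-/dev/nvme0n1p2}"
MOUNTPOINT="${MOUNTPOINT:-/mnt/rescue}"

# Print each command instead of running it unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    else
        echo "+ $*"
    fi
}

inspect_partition() {
    run lsblk -f                                  # list block devices and filesystems
    run sudo mkdir -p "$MOUNTPOINT"
    run sudo mount -o ro "$DEVICE" "$MOUNTPOINT"  # read-only mount
    run ls "$MOUNTPOINT"
}

inspect_partition
```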


I thought that might be the thing to do. Sadly I’m the only human without a USB drive lying around. Looks like I’m going to Target tomorrow morning. This will be new for me, so I’ll get into the system, do what I can after some research, and post back with questions. Thanks.


I think a USB drive ($5 or so) is part of the essential toolkit 🙂


The screenshot says: “After logging in, type ‘journalctl -xb’ to view system logs”

Did you do that? Could give valuable information.
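Since `journalctl -xb` prints a lot, a small filter can help narrow it down. This is only a sketch: the function name and the pattern list are assumptions, not an exhaustive set of failure messages.

```shell
#!/bin/sh
# Sketch: pull likely storage-related lines out of a boot log read from stdin.
# The patterns are a guess at what is relevant here, not a complete list.
filter_storage_errors() {
    grep -i -E 'timed out waiting for device|nvme|i/o error|btrfs'
}

# Typical use on the affected system (current boot, warnings and worse):
#   journalctl -b -p warning --no-pager | filter_storage_errors
# Kernel messages only:
#   journalctl -b -k --no-pager | filter_storage_errors
```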

Can you run smartctl -a /dev/device? Or nvme smart-log /dev/device?


This is very true and all on me. To think, I was even a Boy Scout…


I ran the journalctl -xb command, but didn’t see anything that stuck out in that wall of data. The errors I saw were along the same lines as the “timed out waiting for…” message that’s in the screenshot. Output of the smartctl command is below; the other gave me an error when I tried to run it.

```
Model Number:                       WDC WDS256G1X0C-00ENX0
Serial Number:                      175015425063
Firmware Version:                   B35900WD
PCI Vendor ID:                      0x15b7
PCI Vendor Subsystem ID:            0x1b4b
IEEE OUI Identifier:                0x001b44
Controller ID:                      0
NVMe Version:                       1.2
Number of Namespaces:               1
Namespace 1 Size/Capacity:          256,060,514,304 [256 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            001b44 4a44459ee0
Local Time is:                      Wed Nov 29 18:37:08 2023 EST
Firmware Updates (0x14):            2 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x0016):     Wr_Unc DS_Mngmt Sav/Sel_Feat
Log Page Attributes (0x02):         Cmd_Eff_Lg
Maximum Data Transfer Size:         32 Pages
Warning  Comp. Temp. Threshold:     85 Celsius
Critical Comp. Temp. Threshold:     88 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     8.25W    5.00W       -    0  0  0  0        0       0
 1 +     8.25W    3.50W       -    1  1  1  1        0       0
 2 +     8.25W    3.00W       -    2  2  2  2        0       0
 3 -   0.0500W       -        -    0  0  0  0    51000   10000
 4 -   0.0055W       -        -    0  0  0  0  7000000  100000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0
 1 -    4096       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        45 Celsius
Available Spare:                    100%
Available Spare Threshold:          4%
Percentage Used:                    8%
Data Units Read:                    29,144,407 [14.9 TB]
Data Units Written:                 22,568,643 [11.5 TB]
Host Read Commands:                 236,476,007
Host Write Commands:                420,713,610
Controller Busy Time:               136,843
Power Cycles:                       146
Power On Hours:                     42,649
Unsafe Shutdowns:                   87
Media and Data Integrity Errors:    0
Error Information Log Entries:      1
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0

Error Information (NVMe Log 0x01, 16 of 64 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS  Message
  0          1     -       -  0x800c      -            0     0     -  Internal Error

Self-test Log (NVMe Log 0x06)
Self-test status: No self-test in progress
No Self-tests Logged
```

I am currently in the live environment (I like KDE more than I remember) and I am able to see all of the drives I have installed in the system and access them without issue. I’m in the process of moving a few key things off of the NVMe drive just in case, but so far I haven’t had any issues getting to files on any drive. I do think I might have something set up a little weird though, since I see both /home and home listed under devices in Dolphin. Let me know what other info would be of use.

```
NAME        MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
loop0         7:0    0   2.2G  1 loop /run/archiso/airootfs
sda           8:0    0 931.5G  0 disk
└─sda1        8:1    0 931.5G  0 part /run/media/liveuser/home
sdb           8:16   0   1.8T  0 disk
└─sdb1        8:17   0   1.8T  0 part /run/media/liveuser/a55e6246-430f-438a-8c80-9e1f9586b1bd
sdc           8:32   1  28.9G  0 disk
├─sdc1        8:33   1  24.9G  0 part /run/archiso/bootmnt
└─sdc2        8:34   1     4G  0 part /run/media/liveuser/persistence
nvme0n1     259:0    0 238.5G  0 disk
├─nvme0n1p1 259:1    0     1G  0 part
├─nvme0n1p2 259:2    0  97.7G  0 part /run/media/liveuser/_
└─nvme0n1p3 259:3    0 139.8G  0 part /run/media/liveuser/_home
```

Have you tried using a pencil eraser on the pins that plug into the motherboard?


This is eerily similar to what I recently experienced. For a couple of weeks, my boots into EndeavourOS were randomly going straight to a black screen, without any pattern I could notice. Each time I was able to fix it by booting a live environment and checking the drive for errors with gparted, which would fix it and allow a successful boot again. Luckily, after the most recent fix I backed up some recently created files, because the next time the failure happened gparted couldn’t fix it anymore. After plenty of troubleshooting hours, it turned out that my SSD was failing. As I’m typing this, my new SSD is out for delivery from Amazon.

Because of Black Friday sales, I wasn’t left with many options, so I went with SK Hynix Platinum. I never purchased a product of theirs before, but benchmarks and Reddit say it’s one of the best, so I am excited for it to arrive. I hope you figure yours out.

I will give this a try next time I shut the machine down. Thanks for the suggestion.

No guarantees. I have used it on memory before, so I don’t see much difference for your situation.

Did you have to install gparted on the live environment (is that even possible?)? I’m running off of the latest ISO right now and don’t see it on here. I’m trying the “check and repair” in KDE disk utilities right now though and will see how that goes.

Hmmm, I got errors with that:

Job: Check file system on partition ‘/dev/nvme0n1p2’ 
Command: btrfs check --repair /dev/nvme0n1p2 
Check file system on partition ‘/dev/nvme0n1p2’: Error
Check and repair partition ‘/dev/nvme0n1p2’ (97.66 GiB, btrfs): Error
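A more cautious sketch than running `--repair` directly: `btrfs check` is read-only by default, and its documentation warns against `--repair` unless someone who knows the filesystem advises it. It also expects the filesystem to be unmounted, and the earlier lsblk output shows /dev/nvme0n1p2 mounted in the live session, so that may be worth ruling out first. The device name is taken from the job output above; by default the script only prints the commands, set `DRY_RUN=0` to run them.

```shell
#!/bin/sh
# Sketch: check a btrfs partition without modifying it.
# DEVICE comes from the job output in this thread; adjust as needed.
DEVICE="${DEVICE:-/dev/nvme0n1p2}"

# Print each command instead of running it unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    else
        echo "+ $*"
    fi
}

check_btrfs_readonly() {
    # btrfs check wants the filesystem unmounted; umount just reports an
    # error and the script carries on if it was not mounted.
    run sudo umount "$DEVICE"
    run sudo btrfs check --readonly "$DEVICE"
}

check_btrfs_readonly
```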

Each DE has a disk management application, so it depends on which distro you’re using in the live environment. XFCE uses gparted, KDE uses KDE Partition Manager, and so on. You can use any distro’s live environment disk software to repair the disk. In one instance, the EndeavourOS USB stopped booting for me as well, so I switched over to a Fedora ISO in recovery mode and repaired it successfully one last time.

You did good by backing up your stuff that one opportunity you had. We both got lucky on that front.

This should not be there.

Can you do a

nvme smart-log <device>
and
nvme error-log <device>

Both of those give me an error.

[liveuser@eos-2023.11.17 ~]$ nvme error-log nvme0n1
bash: nvme: command not found
[liveuser@eos-2023.11.17 ~]$ nvme smart-log nvme0n1
bash: nvme: command not found

You need to have package nvme-cli installed.

Then you can list all nvme devices with sudo nvme list
This will give you the device names to be used with sudo nvme smart-log ...
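A quick sketch for checking whether the tools mentioned in this thread are present in the live session. The `check_tool` helper is hypothetical, and the package names are the Arch/EndeavourOS ones (an assumption on other distros).

```shell
#!/bin/sh
# Sketch: report which of the diagnostic tools used in this thread are missing.
check_tool() {
    # $1 = command name, $2 = package that provides it
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: missing, try: sudo pacman -Sy $2"
    fi
}

check_tool nvme nvme-cli
check_tool smartctl smartmontools

# With nvme-cli present, the commands from the posts above would be:
#   sudo nvme list
#   sudo nvme smart-log /dev/nvme0n1
#   sudo nvme error-log /dev/nvme0n1
```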

The output for smart-log is below. The error-log shows 64 entries so it’s long, I’ll try to post that somewhere and share the link. This link should do the trick.

Edit: added link

[liveuser@eos-2023.11.17 ~]$ sudo nvme smart-log /dev/nvme0n1
Smart Log for NVME device:nvme0n1 namespace-id:ffffffff
critical_warning                        : 0
temperature                             : 44 °C (317 K)
available_spare                         : 100%
available_spare_threshold               : 4%
percentage_used                         : 8%
endurance group critical warning summary: 0
Data Units Read                         : 29147489 (14.92 TB)
Data Units Written                      : 22568660 (11.56 TB)
host_read_commands                      : 236490109
host_write_commands                     : 420713686
controller_busy_time                    : 136844
power_cycles                            : 146
power_on_hours                          : 42662
unsafe_shutdowns                        : 87
media_errors                            : 0
num_err_log_entries                     : 1
Warning Temperature Time                : 0
Critical Composite Temperature Time     : 0
Thermal Management T1 Trans Count       : 0
Thermal Management T2 Trans Count       : 0
Thermal Management T1 Total Time        : 0
Thermal Management T2 Total Time        : 0
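For reference, the TB figures in that log follow from the raw counters: NVMe reports data units in thousands of 512-byte blocks, so one unit is 512,000 bytes. A small sketch of the conversion (the function name is made up for illustration):

```shell
#!/bin/sh
# Sketch: convert NVMe "Data Units" to decimal terabytes.
# One data unit is 1000 blocks of 512 bytes, i.e. 512,000 bytes.
data_units_to_tb() {
    # awk does the floating-point math; $1 is the raw unit count
    awk -v units="$1" 'BEGIN { printf "%.2f\n", units * 512000 / 1000000000000 }'
}

data_units_to_tb 29147489   # Data Units Read    -> 14.92
data_units_to_tb 22568660   # Data Units Written -> 11.56
```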

kde-partitionmanager is installed and offers almost the same options.
But it should be possible to install gparted too, I bet, as long as the ISO is “fresh”.

and this

unsafe_shutdowns                        : 87

makes me nervous. Apparently the disk has been shut down uncleanly many times, possibly through loss of power?
Also, percentage_used is already at 8% (available_spare is still at 100%) with 42662 power-on hours and 11.56 TB written, which seems high.

So two things could be happening:
Either the disk itself is failing, or your install is doing something to the disk that is not a normal operation, hence the damaged file system.
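To put those counters in perspective, here is some rough arithmetic on the values from the smart-log output in this thread. It is only a sketch: it says nothing about what counts as "too high", and the per-day figure is per powered-on day, not per calendar day.

```shell
#!/bin/sh
# Values copied from the nvme smart-log output in this thread.
POWER_ON_HOURS=42662
POWER_CYCLES=146
UNSAFE_SHUTDOWNS=87
DATA_UNITS_WRITTEN=22568660   # one unit = 512,000 bytes

summarize_usage() {
    # $1 = power-on hours, $2 = power cycles, $3 = unsafe shutdowns,
    # $4 = data units written
    awk -v h="$1" -v c="$2" -v u="$3" -v w="$4" 'BEGIN {
        days = h / 24
        printf "power-on time:       %.1f years\n", days / 365
        printf "unsafe shutdowns:    %.1f%% of power cycles\n", u * 100 / c
        printf "average written/day: %.1f GB\n", w * 512000 / 1000000000 / days
    }'
}

summarize_usage "$POWER_ON_HOURS" "$POWER_CYCLES" "$UNSAFE_SHUTDOWNS" "$DATA_UNITS_WRITTEN"
```

With these inputs it reports about 4.9 years of power-on time, 59.6% of power cycles ending in an unsafe shutdown, and roughly 6.5 GB written per powered-on day.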