Cannot boot after updating (disk failure)

anon49550902 · January 2, 2024, 11:36pm

I don’t think this is related to the mesa problems that others have been experiencing over the past few days.

I’ve been using EOS as my daily driver for well over a year with very few problems. I usually update once a week, and today (02 January 2024) is the first time in over 15 years that an update has killed any of my Linux systems in this way. This is a Linux-only machine, with no dual booting.

Potentially helpful information:
LUKS protected 2TB nvme ssd (purchased January 2023, only ever had EOS installed)
KDE Plasma

All system information from live session.

inxi -Fxa
System:
  Host: EndeavourOS Kernel: 6.6.1-arch1-1 arch: x86_64 bits: 64 compiler: gcc
    v: 13.2.1 clocksource: tsc available: acpi_pm
    parameters: initrd=\arch\boot\intel-ucode.img
    initrd=\arch\boot\amd-ucode.img
    initrd=\arch\boot\x86_64\initramfs-linux.img archisobasedir=arch
    archisolabel=EOS_202311 cow_spacesize=10G copytoram=n
    module_blacklist=nvidia,nvidia_modeset,nvidia_uvm,nvidia_drm,pcspkr
    nouveau.modeset=1 i915.modeset=1 radeon.modeset=1 nvme_load=yes
  Desktop: KDE Plasma v: 5.27.9 tk: Qt v: 5.15.11 wm: kwin_x11 vt: 1
    dm: SDDM Distro: EndeavourOS base: Arch Linux
Machine:
  Type: Laptop System: Dell product: XPS 15 9570 v: N/A
    serial: <superuser required> Chassis: type: 10 serial: <superuser required>
  Mobo: Dell model: 02MJVY v: A00 serial: <superuser required> UEFI: Dell
    v: 1.31.0 date: 03/13/2023
Battery:
  ID-1: BAT0 charge: 45.1 Wh (100.0%) condition: 45.1/56.0 Wh (80.5%)
    volts: 12.8 min: 11.4 model: LGC-LGC4.86 DELL 62MJV type: Li-ion serial: 84
    status: full
  Device-1: hidpp_battery_0 model: Logitech G305 Lightspeed Wireless Gaming
    Mouse serial: 08-7e-91-52 charge: 100% (should be ignored)
    rechargeable: yes status: discharging
CPU:
  Info: model: Intel Core i7-8750H bits: 64 type: MT MCP arch: Coffee Lake
    gen: core 8 level: v3 note: check built: 2018 process: Intel 14nm family: 6
    model-id: 0x9E (158) stepping: 0xA (10) microcode: 0xF4
  Topology: cpus: 1x cores: 6 tpc: 2 threads: 12 smt: enabled cache:
    L1: 384 KiB desc: d-6x32 KiB; i-6x32 KiB L2: 1.5 MiB desc: 6x256 KiB
    L3: 9 MiB desc: 1x9 MiB
  Speed (MHz): avg: 814 high: 880 min/max: 800/4100 scaling:
    driver: intel_pstate governor: powersave cores: 1: 800 2: 800 3: 800 4: 800
    5: 880 6: 862 7: 800 8: 834 9: 800 10: 800 11: 800 12: 800 bogomips: 52815
  Flags: avx avx2 ht lm nx pae sse sse2 sse3 sse4_1 sse4_2 ssse3 vmx
  Vulnerabilities:
  Type: gather_data_sampling mitigation: Microcode
  Type: itlb_multihit status: KVM: VMX disabled
  Type: l1tf mitigation: PTE Inversion; VMX: conditional cache flushes, SMT
    vulnerable
  Type: mds mitigation: Clear CPU buffers; SMT vulnerable
  Type: meltdown mitigation: PTI
  Type: mmio_stale_data mitigation: Clear CPU buffers; SMT vulnerable
  Type: retbleed mitigation: IBRS
  Type: spec_rstack_overflow status: Not affected
  Type: spec_store_bypass mitigation: Speculative Store Bypass disabled via prctl
  Type: spectre_v1 mitigation: usercopy/swapgs barriers and _user pointer sanitization
  Type: spectre_v2 mitigation: IBRS, IBPB: conditional, STIBP: conditional,
    RSB filling, PBRSB-eIBRS: Not affected
  Type: srbds mitigation: Microcode
  Type: tsx_async_abort status: Not affected
Graphics:
  Device-1: Intel CoffeeLake-H GT2 [UHD Graphics 630] vendor: Dell
    driver: i915 v: kernel arch: Gen-9.5 process: Intel 14nm built: 2016-20
    ports: active: DP-1,DP-3 off: eDP-1 empty: DP-2 bus-ID: 00:02.0
    chip-ID: 8086:3e9b class-ID: 0300
  Device-2: NVIDIA GP107M [GeForce GTX 1050 Ti Mobile] vendor: Dell
    driver: nouveau v: kernel alternate: nvidia_drm,nvidia non-free: 545.xx+
    status: current (as of 2023-10; EOL~2026-12-xx) arch: Pascal code: GP10x
    process: TSMC 16nm built: 2016-2021 pcie: gen: 1 speed: 2.5 GT/s lanes: 16
    link-max: gen: 3 speed: 8 GT/s bus-ID: 01:00.0 chip-ID: 10de:1c8c
    class-ID: 0302 temp: 33.0 C
  Device-3: Microdia Integrated_Webcam_HD driver: uvcvideo type: USB
    rev: 2.0 speed: 480 Mb/s lanes: 1 mode: 2.0 bus-ID: 1-12:7
    chip-ID: 0c45:671d class-ID: 0e02
  Display: server: X.Org v: 21.1.9 compositor: kwin_x11 driver: X:
    loaded: modesetting alternate: fbdev,intel,nouveau,nv,vesa dri: iris,nouveau
    gpu: i915 display-ID: :0 screens: 1
  Screen-1: 0 s-res: 3000x1920 s-dpi: 96 s-size: 791x506mm (31.14x19.92")
    s-diag: 939mm (36.97")
  Monitor-1: DP-1 pos: top-left model: Philips PHL 246E9Q serial: 15478
    built: 2018 res: 1080x1920 hz: 60 dpi: 93 gamma: 1.2
    size: 296x527mm (11.65x20.75") diag: 604mm (23.8") ratio: 16:9 modes:
    max: 1920x1080 min: 720x400
  Monitor-2: DP-3 pos: primary,bottom-r model: Philips PHL 271E1
    serial: 29559 built: 2022 res: 1920x1080 hz: 60 dpi: 82 gamma: 1.2
    size: 598x336mm (23.54x13.23") diag: 686mm (27") ratio: 16:9 modes:
    max: 1920x1080 min: 720x400
  Monitor-3: eDP-1 note: disabled model: Sharp 0x149a built: 2018
    res: 1920x1080 dpi: 142 gamma: 1.2 size: 344x194mm (13.54x7.64")
    diag: 395mm (15.5") ratio: 16:9 modes: 1920x1080
  API: EGL v: 1.5 hw: drv: intel iris drv: nvidia nouveau platforms:
    device: 0 drv: nouveau device: 1 drv: iris device: 2 drv: swrast gbm:
    drv: nouveau surfaceless: drv: iris x11: drv: iris inactive: wayland
  API: OpenGL v: 4.6 compat-v: 4.3 vendor: intel mesa v: 23.2.1-arch1.2
    glx-v: 1.4 direct-render: yes renderer: Mesa Intel UHD Graphics 630 (CFL
    GT2) device-ID: 8086:3e9b memory: 30.27 GiB unified: yes
  API: Vulkan Message: No Vulkan data available.
Audio:
  Device-1: Intel Cannon Lake PCH cAVS vendor: Dell driver: snd_hda_intel
    v: kernel alternate: snd_soc_skl,snd_sof_pci_intel_cnl bus-ID: 00:1f.3
    chip-ID: 8086:a348 class-ID: 0403
  Device-2: AIAIAI ApS Tracks 2.0 USB-C
    driver: hid-generic,snd-usb-audio,usbhid type: USB rev: 2.0 speed: 12 Mb/s
    lanes: 1 mode: 1.1 bus-ID: 1-5.2.1:9 chip-ID: 30a5:0201 class-ID: 0300
    serial: 0000000000001114
  Device-3: DisplayLink Dell Universal Dock D6000
    driver: cdc_ncm,snd-usb-audio type: USB rev: 3.2 speed: 5 Gb/s lanes: 1
    mode: 3.2 gen-1x1 bus-ID: 4-1.1:3 chip-ID: 17e9:6006 class-ID: 0a00
    serial: 2006015267
  API: ALSA v: k6.6.1-arch1-1 status: kernel-api
    tools: alsactl,alsamixer,amixer
  Server-1: PipeWire v: 0.3.85 status: active with: 1: pipewire-pulse
    status: active 2: wireplumber status: active 3: pipewire-alsa type: plugin
    4: pw-jack type: plugin tools: pactl,pw-cat,pw-cli,wpctl
Network:
  Device-1: Intel Wi-Fi 6 AX210/AX211/AX411 160MHz driver: iwlwifi v: kernel
    modules: wl pcie: gen: 2 speed: 5 GT/s lanes: 1 bus-ID: 3b:00.0
    chip-ID: 8086:2725 class-ID: 0280
  IF: wlan0 state: down mac: c2:17:fb:4f:0a:13
  IF-ID-1: enp58s0u1u1i5 state: up speed: 1000 Mbps duplex: half
    mac: 0c:37:96:08:68:f6
Bluetooth:
  Device-1: Intel AX210 Bluetooth driver: btusb v: 0.8 type: USB rev: 2.0
    speed: 12 Mb/s lanes: 1 mode: 1.1 bus-ID: 1-4:3 chip-ID: 8087:0032
    class-ID: e001
  Report: btmgmt ID: hci0 rfk-id: 0 state: up address: F0:D4:15:50:ED:D9
    bt-v: 5.3 lmp-v: 12 status: discoverable: no pairing: no class-ID: 7c010c
Drives:
  Local Storage: total: 494.67 GiB used: 0 KiB (0.0%)
  SMART Message: Unable to run smartctl. Root privileges required.
  ID-1: /dev/sda maj-min: 8:0 vendor: Samsung model: SSD 860 EVO 500GB
    size: 465.76 GiB block-size: physical: 512 B logical: 512 B speed: 6.0 Gb/s
    tech: SSD serial: S3Z1NB0K867233V fw-rev: 1B6Q scheme: GPT
  ID-2: /dev/sdb maj-min: 8:16 vendor: Lexar model: USB Flash Drive
    size: 28.91 GiB block-size: physical: 512 B logical: 512 B type: USB
    rev: 2.0 spd: 480 Mb/s lanes: 1 mode: 2.0 tech: SSD
    serial: 04GTPVGQ50MCCK1K fw-rev: 1100 scheme: MBR
  SMART Message: Unknown USB bridge. Flash drive/Unsupported enclosure?
Partition:
  ID-1: / raw-size: N/A size: 10 GiB used: 95 MiB (0.9%) fs: overlay
    source: ERR-102
Swap:
  Alert: No swap data was found.
Sensors:
  System Temperatures: cpu: 46.0 C pch: 66.0 C mobo: 33.0 C gpu: nouveau
    temp: 37.0 C
  Fan Speeds (rpm): cpu: 2525 fan-2: 2523
Info:
  Processes: 290 Uptime: 5m wakeups: 0 Memory: total: 32 GiB note: est.
  available: 30.99 GiB used: 2.63 GiB (8.5%) Init: systemd v: 254
  default: multi-user tool: systemctl Compilers: gcc: 13.2.1 Packages:
  pm: pacman pkgs: 982 libs: 221 tools: yay Shell: Bash v: 5.2.15
  running-in: konsole inxi: 3.3.31

A couple of things to note about the information above. To reiterate, that output is from a live USB session and not from the installed system. My installed kernel at the time of the problems (as seen below) is EndeavourOS Kernel: 6.6.8-arch1-1. Also note that there are 2 separate disks in this computer, and I have UEFI and AHCI enabled, NOT RAID. The other disk has another Linux OS installed for testing; that is the one visible in inxi. The disk encountering problems is not even seen by the inxi call. And yes, I was running the latest proprietary NVIDIA drivers without any problems.

I ran pacman -Syyu as I always do and, after it completed successfully, I was notified of the recommended reboot. It soon became clear that something was off: the machine had not rebooted after several minutes, when it normally takes seconds. I then did a hard reboot of the computer. Initially, the normal EOS boot selection was displayed.

From here, things quickly went badly. I was presented with the following:

(Image of screen with error message: “Failed to open random seed file: Device error” and “Error opening root path: Out of resources.”)

From there, it just boots to my BIOS. I have searched for several hours here on the forum (and elsewhere online) and have not been successful in finding anything relevant or sufficient to resolve the problem. I am unable to access TTY at any time during the brief process.

As you can see, lsblk -f will list the device (nvme drive) but it does not show any file system, type, or UUID.

[liveuser@eos-2023.11.17 ~]$ lsblk -f
NAME     FSTYPE      FSVER        LABEL       UUID                                 FSAVAIL FSUSE% MOUNTPOINTS
loop0    squashfs    4.0                                                                 0   100% /run/archiso/airootfs
sda                                                                                               
├─sda1   vfat        FAT32                    EE0B-3F7D                                           
└─sda2   crypto_LUKS 1                        dfc5a40d-5409-435e-9039-f5edcb337584                
sdb      iso9660     Joliet Exten EOS_202311  2023-11-17-11-34-04-00                              
├─sdb1   iso9660     Joliet Exten EOS_202311  2023-11-17-11-34-04-00                     0   100% /run/archiso/bootmnt
└─sdb2   vfat        FAT16        ARCHISO_EFI 0354-96F3                                           
nvme0n1                                                                                           
├─nvme0n1p1
│                                                                                                 
└─nvme0n1p2

When attempting to mount either partition I get the following error:
mount: /mnt: can't read superblock on /dev/nvme0n1p2.

The only interesting information in the dmesg output is this, which I’ve never seen before.

[ 1297.887946] Buffer I/O error on dev nvme0n1p2, logical block 0, async page read
[ 1297.887950] Buffer I/O error on dev nvme0n1p2, logical block 1, async page read
[ 1297.887952] Buffer I/O error on dev nvme0n1p2, logical block 2, async page read
[ 1297.887953] Buffer I/O error on dev nvme0n1p2, logical block 3, async page read
[ 1297.887954] Buffer I/O error on dev nvme0n1p2, logical block 4, async page read
[ 1297.887955] Buffer I/O error on dev nvme0n1p2, logical block 5, async page read
[ 1297.887957] Buffer I/O error on dev nvme0n1p2, logical block 6, async page read
[ 1297.887958] Buffer I/O error on dev nvme0n1p2, logical block 7, async page read
[ 1297.888960] isofs_fill_super: bread failed, dev=nvme0n1p2, iso_blknum=16, block=32
[ 1297.889033] SQUASHFS error: Failed to read block 0x0: -5
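
To summarize errors like these, a small filter over the kernel log can count how many buffer I/O errors each device reported. This is only a minimal sketch: the function name `count_io_errors` is hypothetical, and it assumes the message format shown in the lines above.

```shell
#!/usr/bin/env bash
# Count "Buffer I/O error" lines per device from kernel log text on stdin.
# count_io_errors is a hypothetical helper name, not a standard tool.
count_io_errors() {
    # Assumed message format:
    #   "... Buffer I/O error on dev <name>, logical block N, ..."
    grep -o 'Buffer I/O error on dev [^,]*' \
        | awk '{ print $NF }' \
        | sort \
        | uniq -c \
        | awk '{ print $2 ": " $1 }'
}

# Typical use from a live session (dmesg may need root on some systems):
#   sudo dmesg | count_io_errors
```

Errors on every one of the first logical blocks of a partition, as seen here, suggest the device is not answering reads at all rather than that a single file system structure is damaged.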

Because I cannot mount the drive, I cannot chroot to attempt to fix it, unless there is another way to do so that I’m not aware of at this time.

If you need additional information, please ask.

Any assistance is greatly appreciated.

Just for giggles, I tried starting the process to re-install EOS on the drive and when selecting the nvme drive, it states “2Tb, 0B unknown partition table”. I ran the smartctl health check on the nvme drive recently (about 2 weeks ago) when I was testing drives in my other machines, and the health was excellent: just over 2000 hours of power-on time, and no failed sectors or any errors. Not to mention, only about 3 TB of read/write IO.
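
For anyone repeating that kind of health check, the sketch below shows one way to test the overall result reported by `smartctl -H` (from smartmontools). The helper name `smart_health_ok` is hypothetical, the device path is only an example, and the matched status line is an assumption about smartctl's usual wording. A drive that no longer answers reads may not return any SMART data at all.

```shell
#!/usr/bin/env bash
# Sketch: decide whether `smartctl -H` output reports a passing health status.
# smart_health_ok is a hypothetical helper; it reads smartctl output on stdin.
smart_health_ok() {
    # Assumed status line: "SMART overall-health self-assessment test result: PASSED"
    grep -q 'self-assessment test result: PASSED'
}

# Typical use (root required; the device path is an example):
#   sudo smartctl -H /dev/nvme0n1 | smart_health_ok && echo "health: PASSED"
#   sudo smartctl -a /dev/nvme0n1   # full report, including the error log
```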

dalto · January 2, 2024, 11:41pm

If they are encrypted with luks, you need to unlock them and then mount the newly created device.

That being said, that looks like a hardware issue at first glance.
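
For reference, the unlock-then-mount sequence could be sketched roughly as follows. The function name `unlock_and_mount`, the mapper name `cryptroot`, and the `DRY_RUN` switch are all hypothetical choices for illustration. Only `cryptsetup isLuks`, `cryptsetup open`, and `mount` are real commands, and the partition path is taken from the lsblk output above.

```shell
#!/usr/bin/env bash
# Sketch: unlock a LUKS partition and mount the resulting mapper device.
# unlock_and_mount, the mapper name, and DRY_RUN are hypothetical choices.

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

unlock_and_mount() {
    local part="$1"               # encrypted partition, e.g. /dev/nvme0n1p2
    local name="${2:-cryptroot}"  # arbitrary name for the /dev/mapper entry
    local target="${3:-/mnt}"     # mount point

    # Check for a LUKS header first; an unreadable disk fails here.
    run cryptsetup isLuks "$part" || { echo "not a LUKS device: $part" >&2; return 1; }
    # Prompts for the passphrase and creates /dev/mapper/$name.
    run cryptsetup open "$part" "$name" || return 1
    # Mount the unlocked device, not the raw partition.
    run mount "/dev/mapper/$name" "$target"
}

# Example (as root, from the live session):
#   unlock_and_mount /dev/nvme0n1p2
```

Note that the target is the partition (`nvme0n1p2`) rather than the whole disk (`nvme0n1`), since the LUKS header lives at the start of the encrypted partition.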

anon49550902 · January 3, 2024, 12:18am

Hmmm, I’d not thought about that. During the boot process, I thought boot begins and then the request to decrypt occurs before boot is finished. Not to mention, you can see above that /dev/sda is the second drive I referred to and it is listed as a LUKS file system. I’ll give it a try, but am skeptical it will do anything.

I don’t necessarily disagree. But I don’t understand why? I used the computer earlier in the day before the update without any problems. Besides, when I ran the health checks a couple weeks ago, all looked good.

dalto · January 3, 2024, 12:45am

Years ago, I worked for a company that repaired heaters. When people would call, they would almost always say the same thing. “I don’t understand what happened, it was working fine yesterday”.

Basically, things often work until they stop working.

Unlike with HDDs, SSD failures are often sudden, especially when the controller fails. In that case, there is rarely any advance notice of failure.

That being said, it isn’t definitively a failed nvme drive, that just seems like the most likely cause based on the symptoms. I have also seen things like chipset failures cause similar behavior. However, those are much, much rarer than drive failures.

Since you are seeing the same issue from the live environment, it seems unlikely that it was related to an update.

Do you have any other nvme drives you can swap in to test?

anon49550902 · January 3, 2024, 1:01am

Yes, I’ve been around IT long enough to know this happens. But it’s supposed to happen to other people, not me.

Okay, I tried decryption and the output is: “/dev/nvme0n1 is not a valid LUKS device.”

Admittedly, I’m also upset at myself because I’ve not performed a backup for several months.

Ironically, I just bought a small one to replace the drive in my router box. I can put it in and see what happens.

If it’s unrelated to the update, I don’t know what else it is. I’ve searched for far too many hours today and that’s all I can come to, a failed disk.

anon49550902 · January 3, 2024, 1:40am

@dalto So, I tried two more things and I’m not sure if I should feel better or worse… Firstly, I opened a brand new 256 GB nvme drive and put it into the computer. I booted into both an EndeavourOS live environment and the testing drive I have installed (ironically, I already knew that SATA SSD was in a failing state, though oddly it still boots, which is why I bought the nvme last year). I ran a few commands similar to those above, and also attempted to install EOS, all with the exact same results as with the nvme on which I have EOS installed.

So that makes me begin to question (though I have no idea what!) if something happened to the motherboard.

Whilst I know the following often works partially at best and can downright fail most of the time, I did try nonetheless. I put the nvme into another computer (okay, okay, I put it into two others) to see if it would be recognized or possibly even boot. Yes, I know the latter was unrealistic. Anyhow, as I suspected, neither of the other two machines even recognized a disk was installed. And honestly, I have no more answers than before.

If anyone has any other suggestions or things to try, I’m open to hearing them. Even if it only gets the data off the disk, it would help.

anon49550902 · January 4, 2024, 12:34am

Well, I did a little more searching and couldn’t come up with anything new. At this point, my working theory is a failed drive. I’ve not come up with anything substantial outside of that. I’ve purchased a new nvme drive from a more reputable manufacturer and thus far, a new installation of EOS is going fine - as I’d expect based on my experience over the past year and a half.

Just a friendly reminder: if you’ve not taken a backup recently, stop and do it now. It only takes a few minutes and will save some pain in the future.