Random crashes on fresh install: EXT4-fs error

mate · October 19, 2021, 4:07pm

fsck gives me:

e2fsck 1.46.4 (18-Aug-2021)
/dev/nvme0n1p8: clean, 318510/3932160 files, 2919514/15728640 blocks

After reading the link on setting up periodic TRIM, I’m still not sure how to check whether my drive supports trimming.
sudo hdparm -I /dev/nvme0n1 | grep TRIM does not return anything. Can I just enable the timer anyway, or could that be problematic?

joekamprad · October 19, 2021, 4:44pm

hdparm does not work with NVMe drives the way it does with SATA SSDs…
AFAIK, TRIM should work on all NVMe drives.
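
For reference, a minimal sketch of one way to check discard (TRIM) support without hdparm. lsblk --discard prints DISC-GRAN and DISC-MAX columns, and non-zero values there mean the kernel sees the device as supporting discard. The helper name has_discard is made up for illustration; the device path is the one from this thread.

```shell
#!/bin/sh
# Reads `lsblk --discard` output on stdin and succeeds if any listed device
# reports a non-zero discard granularity (third column, DISC-GRAN).
has_discard() {
    awk 'NR > 1 && $3 != "0B" && $3 != "0" { found = 1 }
         END { exit (found ? 0 : 1) }'
}

# Usage on a real system:
#   lsblk --discard /dev/nvme0n1 | has_discard && echo "discard supported"
# The periodic timer can then be inspected with:
#   systemctl status fstrim.timer
```

The check only reads what the kernel reports; it does not issue any TRIM itself.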

mate · October 19, 2021, 4:49pm

Okay, I have enabled it now. Will let you know the outcome.

mate · October 19, 2021, 6:20pm

It ran fine for the last 1.5 hours, but I just got a crash with TRIM enabled.

manuel · October 19, 2021, 6:24pm

You can also disable TRIM for now. It doesn’t typically need to be run very often, maybe once a week or month, depending on your disk usage.

So if you disable it fully, does that help anything? If it doesn’t, then likely TRIM is not the culprit.

mate · October 19, 2021, 6:28pm

It was disabled before, but I will see if it still crashes after disabling it again. It now takes a few hours in between crashes most of the time.

manuel · October 19, 2021, 6:35pm

Solving this strange problem requires eliminating potential reasons one by one.

ricklinux · October 19, 2021, 6:57pm

Have you looked at dmesg?

mate · October 19, 2021, 7:04pm

Would this help?

journalctl -b -1 -o short-precise -k -p 4

-- Journal begins at Sun 2021-10-17 23:29:43 CEST, ends at Tue 2021-10-19 20:59:33 CEST. --
Okt 19 18:46:54.639881 matebook kernel: AMD-Vi: [Firmware Bug]: : IOAPIC[4] not in IVRS table
Okt 19 18:46:54.639891 matebook kernel: AMD-Vi: [Firmware Bug]: : IOAPIC[5] not in IVRS table
Okt 19 18:46:54.639899 matebook kernel: AMD-Vi: [Firmware Bug]: : No southbridge IOAPIC found
Okt 19 18:46:54.639907 matebook kernel: AMD-Vi: Disabling interrupt remapping
Okt 19 18:46:54.640107 matebook kernel: APIC calibration not consistent with PM-Timer: 363ms instead of 100ms
Okt 19 18:46:54.661168 matebook kernel: pci 0000:00:00.2: can't derive routing for PCI INT A
Okt 19 18:46:54.661276 matebook kernel: pci 0000:00:00.2: PCI INT A: not connected
Okt 19 18:46:54.784168 matebook kernel: i2c_hid i2c-ELAN2203:00: supply vdd not found, using dummy regulator
Okt 19 18:46:54.784373 matebook kernel: i2c_hid i2c-ELAN2203:00: supply vddl not found, using dummy regulator
Okt 19 18:46:54.804079 matebook kernel: tpm_crb MSFT0101:00: can't request region for resource [mem 0x8f7a2000-0x8f7a5fff]
Okt 19 18:46:54.804406 matebook kernel: tpm_crb: probe of MSFT0101:00 failed with error -16
Okt 19 18:46:55.474051 matebook kernel: snd_pci_acp3x 0000:03:00.5: Invalid ACP audio mode : 1
Okt 19 18:46:55.604233 matebook kernel: thermal thermal_zone0: failed to read out thermal zone (-61)
Okt 19 18:46:56.844419 matebook kernel: Bluetooth: hci0: Reading supported features failed (-16)
Okt 19 18:46:58.894078 matebook kernel: kauditd_printk_skb: 33 callbacks suppressed
Okt 19 18:47:04.554041 matebook kernel: kauditd_printk_skb: 8 callbacks suppressed
Okt 19 18:47:15.054174 matebook kernel: kauditd_printk_skb: 7 callbacks suppressed
Okt 19 19:02:14.368856 matebook kernel: kauditd_printk_skb: 19 callbacks suppressed
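
A small sketch for narrowing kernel logs like the one above down to storage-related lines. The pattern list is only an assumption about which strings are worth looking for (EXT4-fs, nvme, I/O error), and storage_errors is a made-up helper name.

```shell
#!/bin/sh
# Keeps only log lines that mention the filesystem, the NVMe driver,
# or generic block I/O errors. Reads log text on stdin.
storage_errors() {
    grep -iE 'EXT4-fs|nvme|I/O error'
}

# Usage on a real system, for the kernel log of the previous boot:
#   journalctl -b -1 -k -o short-precise | storage_errors
```

Dropping the -p 4 priority filter, as in the usage line, also shows informational messages, which can help when the interesting line was not logged as a warning.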

ricklinux · October 19, 2021, 7:07pm

Have you tried any kernel parameters?

Edit: Do you have the latest UEFI BIOS?

mate · October 19, 2021, 7:19pm

The only kernel parameter I have changed is adding acpi_backlight=vendor since I had an issue with the backlight after the initial install.
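
As a side note, a minimal sketch for confirming that a kernel parameter is actually active on the running system, by looking at /proc/cmdline. The helper name has_kernel_param is hypothetical; the parameter is the one mentioned in this post.

```shell
#!/bin/sh
# $1 = exact parameter to look for; stdin = kernel command line text.
# Splits the command line on spaces and looks for an exact match.
has_kernel_param() {
    tr ' ' '\n' | grep -Fxq -- "$1"
}

# Usage on a real system:
#   has_kernel_param acpi_backlight=vendor < /proc/cmdline && echo "active"
```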
I have the latest BIOS version from the vendor installed, which is sadly not actually very recent.

mate · October 21, 2021, 5:40am

So far I get a crash every few hours, which is definitely better than at the start, but not ideal.

ricklinux · October 21, 2021, 3:00pm

mate · October 21, 2021, 5:24pm

Since updating the motherboard is sadly not an option I will try the workaround and let you know how it went. Thanks!

ricklinux · October 21, 2021, 5:36pm

There are a number of error messages in your dmesg output. This one may or may not have anything to do with the crashes; I don’t know. One of the other error messages has to do with TPM security, so maybe turn TPM off in the BIOS settings. You may be able to get rid of some of the remaining messages, but probably not all of them: some are just informational and may not affect performance.

ycom1 · October 21, 2021, 7:27pm

Have you checked the SMART status? It goes something like:

sudo smartctl -x /dev/sda

mate · October 22, 2021, 6:30am

I have not been able to test it long enough to see if the system is now stable, but I don’t get these errors in my dmesg output anymore. Will test further.

mate · October 22, 2021, 6:32am

smart status

sudo smartctl -x /dev/nvme0n1
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.10.75-1-lts] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       WDC WDS500G2B0C-00PXH0
Serial Number:                      2004DQ460107
Firmware Version:                   211070WD
PCI Vendor/Subsystem ID:            0x15b7
IEEE OUI Identifier:                0x001b44
Total NVM Capacity:                 500.107.862.016 [500 GB]
Unallocated NVM Capacity:           0
Controller ID:                      1
NVMe Version:                       1.4
Number of Namespaces:               1
Namespace 1 Size/Capacity:          500.107.862.016 [500 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            001b44 4a44049d33
Local Time is:                      Fri Oct 22 08:31:17 2021 CEST
Firmware Updates (0x14):            2 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x1e):         Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg Pers_Ev_Lg
Maximum Data Transfer Size:         128 Pages
Warning  Comp. Temp. Threshold:     80 Celsius
Critical Comp. Temp. Threshold:     85 Celsius
Namespace 1 Features (0x02):        NA_Fields

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     3.50W    2.10W       -    0  0  0  0        0       0
 1 +     2.40W    1.60W       -    0  0  0  0        0       0
 2 +     1.90W    1.50W       -    0  0  0  0        0       0
 3 -   0.0200W       -        -    3  3  3  3     3900   11000
 4 -   0.0050W       -        -    4  4  4  4     5000   39000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         2
 1 -    4096       0         1

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        35 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    7.002.959 [3,58 TB]
Data Units Written:                 7.983.703 [4,08 TB]
Host Read Commands:                 75.609.874
Host Write Commands:                55.307.514
Controller Busy Time:               251
Power Cycles:                       609
Power On Hours:                     151
Unsafe Shutdowns:                   102
Media and Data Integrity Errors:    0
Error Information Log Entries:      1
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0

Error Information (NVMe Log 0x01, 16 of 256 entries)
No Errors Logged

Seems to be good.
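
For anyone skimming a report like this, a minimal sketch that pulls out a few of the health fields. Which fields matter is a judgment call, and smart_summary is a made-up helper name; it only parses text in the format shown above.

```shell
#!/bin/sh
# Reads `smartctl -x` output for an NVMe drive on stdin and prints
# a few health fields as key=value lines, in the order they appear.
smart_summary() {
    awk -F':[ \t]+' '
        /^Critical Warning/                { print "critical_warning=" $2 }
        /^Power Cycles/                    { print "power_cycles=" $2 }
        /^Unsafe Shutdowns/                { print "unsafe_shutdowns=" $2 }
        /^Media and Data Integrity Errors/ { print "media_errors=" $2 }
    '
}

# Usage on a real system:
#   sudo smartctl -x /dev/nvme0n1 | smart_summary
```

In the output above, 102 unsafe shutdowns out of 609 power cycles fits a machine that keeps crashing, while a 0x00 critical warning and zero media errors suggest the drive itself is not reporting problems.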

mate · October 22, 2021, 10:13am

After a few hours I got another crash.
I guess next I will try turning off TPM.

mbod · October 22, 2021, 11:06am

These pictures imply that it is always the same inodes causing the crash. Can you confirm that?
Can you find out which files own these inodes:

What happens if you access those files then? Does read/write to those files trigger the crash?
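
One possible way to map an inode number to a path, as a sketch only: it is not necessarily the command originally suggested here. It assumes a find that supports -inum (GNU, BSD and BusyBox do); files_for_inode is a hypothetical helper name, and the inode number in the usage line is a placeholder.

```shell
#!/bin/sh
# $1 = directory to search (stays on that filesystem because of -xdev)
# $2 = inode number taken from the EXT4-fs error message
files_for_inode() {
    find "$1" -xdev -inum "$2" 2>/dev/null
}

# Usage on a real system (replace 123456 with the inode from the error):
#   sudo sh -c '. ./this_script.sh; files_for_inode / 123456'
# On ext4, debugfs can do the same lookup directly on the partition:
#   sudo debugfs -R "ncheck 123456" /dev/nvme0n1p8
```

Walking the whole tree with find can take a while on a large filesystem; the debugfs lookup reads the filesystem metadata directly instead.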