OS failure upon reboot after hanging processes

Soon after the latest (presumably kernel) updates I notice the following:

  1. An application/process “hangs” (notably Firefox, but also Thunar, PulseAudio, etc.)
  2. Trying to kill the application fails (kill -9 pid)
  3. Soon other applications/processes also hang in the same way and the system becomes unusable.
  4. Reboot fails
  5. So I restart the machine.
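
For anyone landing here with similar symptoms: one common reason kill -9 appears to do nothing is that the process is in uninterruptible sleep (state D), blocked inside the kernel (often on disk or network I/O), so the signal is not acted on until the wait ends. Nothing in this thread confirms that this was the case here; the following is only a minimal sketch for checking. The helper name filter_d_state is made up for illustration.

```shell
# Keep the header line plus every process whose STAT column starts with "D"
# (uninterruptible sleep). Expects "PID STAT WCHAN COMMAND" lines on stdin.
filter_d_state() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# Guarded so the snippet does nothing on a system without ps.
if command -v ps >/dev/null 2>&1; then
    # wchan shows the kernel function the process is sleeping in, if known.
    ps -eo pid,stat,wchan:32,comm | filter_d_state
fi
```

If the hung applications show up in this list, the WCHAN column may hint at what they are waiting on.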

Then booting from initramfs fails and I have to boot from the fallback initramfs.
Following the “Random System Freeze” topic, I have tried sudo grub-mkconfig -o /boot/grub/grub.cfg with no success.

Some logs are:

inxi -Fxxxc0

System:
Host: lightweight Kernel: 5.16.4-arch1-1 x86_64 bits: 64 compiler: gcc
v: 11.1.0 Desktop: Xfce 4.16.0 tk: Gtk 3.24.29 info: xfce4-panel
wm: xfwm 4.16.1 vt: 7 dm: LightDM 1.30.0 Distro: EndeavourOS
base: Arch Linux
Machine:
Type: Laptop System: Dell product: Inspiron 5558 v: N/A
serial: Chassis: type: 9 serial:
Mobo: Dell model: 0YM0G8 v: A00 serial: UEFI: Dell
v: A16 date: 09/27/2018
Battery:
ID-1: BAT0 charge: 30.7 Wh (100.0%) condition: 30.7/41.4 Wh (74.0%)
volts: 16.9 min: 14.8 model: SMP DELL VN3N047E584 type: Li-ion serial: 6066
status: Full
CPU:
Info: dual core model: Intel Core i5-5250U bits: 64 type: MT MCP
smt: enabled arch: Broadwell rev: 4 cache: L1: 128 KiB L2: 512 KiB
L3: 3 MiB
Speed (MHz): avg: 1608 high: 1643 min/max: 500/2700 cores: 1: 1597
2: 1597 3: 1596 4: 1643 bogomips: 12773
Flags: avx avx2 ht lm nx pae sse sse2 sse3 sse4_1 sse4_2 ssse3 vmx
Graphics:
Device-1: Intel HD Graphics 6000 vendor: Dell driver: i915 v: kernel
bus-ID: 00:02.0 chip-ID: 8086:1626 class-ID: 0300
Device-2: Suyin Integrated_Webcam_HD type: USB driver: uvcvideo
bus-ID: 2-5:3 chip-ID: 064e:920b class-ID: 0e02
serial: HF1026-T838-HN01-2-REV4261
Display: x11 server: X.Org 1.21.1.3 compositor: xfwm4 v: 4.16.1 driver:
loaded: intel unloaded: fbdev,modesetting,vesa resolution: 1: 1366x768~60Hz
2: 1920x1080~60Hz s-dpi: 96
Message: Unable to show advanced data. Required tool glxinfo missing.
Audio:
Device-1: Intel Broadwell-U Audio vendor: Dell driver: snd_hda_intel
v: kernel bus-ID: 00:03.0 chip-ID: 8086:160c class-ID: 0403
Device-2: Intel Wildcat Point-LP High Definition Audio vendor: Dell
driver: snd_hda_intel v: kernel bus-ID: 00:1b.0 chip-ID: 8086:9ca0
class-ID: 0403
Sound Server-1: ALSA v: k5.16.4-arch1-1 running: yes
Sound Server-2: JACK v: 1.9.20 running: no
Sound Server-3: PulseAudio v: 15.0 running: yes
Network:
Device-1: Intel Wireless 3160 driver: iwlwifi v: kernel bus-ID: 06:00.0
chip-ID: 8086:08b3 class-ID: 0280
Device-2: Realtek RTL810xE PCI Express Fast Ethernet vendor: Dell
driver: N/A port: e000 bus-ID: 07:00.0 chip-ID: 10ec:8136 class-ID: 0200
Device-3: Qualcomm Mobile Router type: USB driver: cdc_ether
bus-ID: 2-3.4:8 chip-ID: 05c6:f00e class-ID: 0a00 serial: 0123456789ABCDEF
IF: enp0s20u3u4 state: up speed: 425 Mbps duplex: half
mac: fc:de:56:ff:01:06
Bluetooth:
Device-1: Intel Bluetooth wireless interface type: USB driver: btusb v: 0.8
bus-ID: 2-6:5 chip-ID: 8087:07dc class-ID: e001
Report: bt-adapter ID: hci0 rfk-id: 1 state: up
address: B4:6D:83:EB:1F:DF
Drives:
Local Storage: total: 223.58 GiB used: 62.18 GiB (27.8%)
ID-1: /dev/sda vendor: Western Digital model: WDS240G2G0A-00JH30
size: 223.58 GiB speed: 6.0 Gb/s type: SSD serial: 182838805181 rev: 0000
scheme: GPT
Partition:
ID-1: / size: 210.11 GiB used: 62.18 GiB (29.6%) fs: ext4 dev: /dev/sda2
ID-2: /boot/efi size: 299.4 MiB used: 280 KiB (0.1%) fs: vfat
dev: /dev/sda1
Swap:
ID-1: swap-1 type: partition size: 8.8 GiB used: 0 KiB (0.0%) priority: -2
dev: /dev/sda3
Sensors:
System Temperatures: cpu: 48.0 C mobo: 25.0 C sodimm: SODIMM C
Fan Speeds (RPM): cpu: 0 fan-2: 0 fan-3: 0
Info:
Processes: 198 Uptime: 1h 40m wakeups: 1 Memory: 7.68 GiB
used: 2.35 GiB (30.6%) Init: systemd v: 250 Compilers: gcc: 11.1.0
Packages: pacman: 972 Shell: Bash v: 5.1.16 running-in: xfce4-terminal
inxi: 3.3.12

sudo dmesg -l warn,err,crit

[ 0.026134] ACPI: NMI not connected to LINT 1!
[ 0.026136] ACPI: NMI not connected to LINT 1!
[ 0.026137] ACPI: NMI not connected to LINT 1!
[ 0.026139] ACPI: NMI not connected to LINT 1!
[ 0.117637] MDS CPU bug present and SMT on, data leak possible. See https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/mds.html for more details.
[ 0.117637] #3
[ 0.121265] ENERGY_PERF_BIAS: Set to 'normal', was 'performance'
[ 0.211781] ACPI BIOS Error (bug): Could not resolve symbol [^CPU0._PPC], AE_NOT_FOUND (20210930/psargs-330)
[ 0.211796] ACPI Error: Aborting method _PR.PPCE due to previous error (AE_NOT_FOUND) (20210930/psparse-529)
[ 0.211804] ACPI Error: Aborting method \EV11 due to previous error (AE_NOT_FOUND) (20210930/psparse-529)
[ 0.211812] ACPI Error: Aborting method \SMIE due to previous error (AE_NOT_FOUND) (20210930/psparse-529)
[ 0.211819] ACPI Error: Aborting method \NEVT due to previous error (AE_NOT_FOUND) (20210930/psparse-529)
[ 0.211826] ACPI Error: Aborting method _SB.PCI0.LPCB.ECDV._Q66 due to previous error (AE_NOT_FOUND) (20210930/psparse-529)
[ 2.901888] wmi_bus wmi_bus-PNP0C14:00: WQBC data block query control method not found
[ 4.136920] systemd-journald[256]: Configuration file /etc/systemd/journald.conf.d/volatile-storage.conf is marked executable. Please remove executable permission bits. Proceeding anyway.
[ 4.895899] i915 0000:00:02.0: can't derive routing for PCI INT A
[ 4.895905] i915 0000:00:02.0: PCI INT A: not connected
[ 4.941377] dell-smbios A80593CE-A997-11DA-B012-B622A1EF5492: WMI SMBIOS userspace interface not supported(0), try upgrading to a newer BIOS
[ 5.128472] kauditd_printk_skb: 8 callbacks suppressed
[ 5.206026] at24 2-0050: supply vcc not found, using dummy regulator
[ 5.257205] urandom_read: 2 callbacks suppressed
[ 6.712533] Bluetooth: hci0: unexpected event for opcode 0xfc2f
[ 7.033002] usbhid 2-3.2:1.1: couldn't find an input interrupt endpoint
[ 11.633496] kauditd_printk_skb: 113 callbacks suppressed
[ 17.678991] kauditd_printk_skb: 14 callbacks suppressed

Sorry, can’t fit journalctl -b -0 in here.

Use a pastebin site. Also, -b -1 would provide the previous (failed?) boot.

Welcome to the forum @vdecomarmond :partying_face::tada::balloon:

journalctl -b -0 | eos-sendlog
creates a pastebin and gives you a short URL you can post


I think in the case of a failed previous boot that -b -1 might be more useful? So

journalctl -b -1 | eos-sendlog
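
A minimal sketch of how the previous boot could be narrowed down before uploading, assuming journald has kept logs across reboots (otherwise -b -1 has nothing to show). The function names and the JOURNALCTL override variable are hypothetical conveniences for illustration; the journalctl options themselves are standard.

```shell
# JOURNALCTL can be overridden for a dry run (hypothetical convenience).
JOURNALCTL=${JOURNALCTL:-journalctl}

# List the boots the journal knows about, with their offsets (0, -1, ...).
list_boots() {
    "$JOURNALCTL" --list-boots --no-pager
}

# Messages from the previous boot (-b -1) at priority "err" or more severe.
prev_boot_errors() {
    "$JOURNALCTL" -b -1 -p err --no-pager
}
```

For example, prev_boot_errors | eos-sendlog would upload only the error-level messages from the previous boot.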

Exactly … I was only giving an example…


Hello everyone

Thank you for the replies.
Unfortunately I don’t have access to the machine in question right now.
I’ll post the journalctl results once I can.

Sorry about that

Thank you

I’m going to guess it’s related to "PSA: Linux 5.16 has major regression in btrfs causing extreme IO load" .

Don’t think so … I’m not using btrfs (well unless I’m very much mistaken).

Apologies to everyone.
It turns out that I did not have persistent storage enabled for the journal, so I cannot recover the logs from the previous failure :cry: :cry:

I’ve enabled that now … I’m stuck until another failure occurs.

Caught the error in the log this time.
The pastebin link is here:
https://clbin.com/iCGW1

On this particular instance my internet was offline when I booted, but otherwise the behavior seemed identical.

Note also: Although in /etc/systemd/journald.conf I had:

[Journal]
Storage=auto
...

and /var/log/journal/ looks like this:

drwxr-sr-x+ 3 root systemd-journal        4096 Sep 20  2020 ./
drwxr-xr-x  8 root root                   4096 Feb  8 20:05 ../
drwxr-sr-x  2 root systemd-journal-remote 4096 Sep 20  2020 remote/

I got this:

journalctl -b -1
Specifying boot ID or boot offset has no effect, no persistent journal was found.

I did journalctl -b -0 > file.log when the failure happened which is what I sent above.

According to https://www.freedesktop.org/software/systemd/man/journald.conf.html

  • … data will be stored preferably on disk, i.e. below the /var/log/journal hierarchy (which is created if needed), with a fallback to /run/log/journal (which is created if needed), during early boot and if the disk is not writable …

This suggests to me that the disk was not writeable during early boot. Just adding this here in case it helps.
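
Another possibility that may be worth ruling out: the dmesg output in the first post mentions a drop-in, /etc/systemd/journald.conf.d/volatile-storage.conf. Files in journald.conf.d take precedence over journald.conf, so if that file sets Storage= (an assumption based only on its name; its contents are not shown in this thread), the Storage=auto above would not be the effective value. A rough sketch for checking, where effective_storage is a made-up helper name:

```shell
# Print the last Storage= assignment found on stdin. Input is expected to be
# the merged configuration, main file first and drop-ins after it, so the
# last assignment is the one that wins. Commented-out lines are ignored.
effective_storage() {
    awk -F= '/^[[:space:]]*Storage[[:space:]]*=/ {
                 gsub(/[[:space:]]/, "", $2); value = $2
             }
             END { if (value != "") print value }'
}

# systemd-analyze cat-config prints journald.conf followed by its drop-ins.
if command -v systemd-analyze >/dev/null 2>&1; then
    systemd-analyze cat-config systemd/journald.conf 2>/dev/null | effective_storage
fi
```

If this prints nothing, no Storage= line is set explicitly and journald falls back to its built-in default.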

With auto, the journal directory may not be re-created if it is missing (I am not sure about that), so to be certain, set Storage=persistent, as the ArchWiki suggests.
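
A minimal sketch of one way to do that with a drop-in instead of editing journald.conf directly. The function name and the file name 90-persistent.conf are hypothetical; drop-ins are applied in lexicographic order, so another drop-in that sorts later and also sets Storage= would still override this one.

```shell
# Write a journald drop-in that forces persistent storage.
#   $1: target directory (on a real system: /etc/systemd/journald.conf.d)
#   $2: file name (hypothetical default: 90-persistent.conf)
write_persistent_dropin() {
    dir=$1
    name=${2:-90-persistent.conf}
    mkdir -p "$dir" || return 1
    printf '[Journal]\nStorage=persistent\n' > "$dir/$name"
}
```

Run from a root shell, this would be write_persistent_dropin /etc/systemd/journald.conf.d followed by systemctl restart systemd-journald so the change takes effect.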

Maybe the issue you are experiencing causes the filesystem to be remounted read-only (ro) after some time.

Edit: After inspecting the posted journal, the most probable explanation seems to be that the filesystem is mounted ro and the system is running as if on an overlay FS. I have no real experience with such situations, but there are several signs that point to what I suspected.
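
A quick way to test the read-only suspicion while the system is misbehaving might look like the sketch below. The helper name has_ro_option is made up; findmnt is part of util-linux.

```shell
# Succeed if a comma-separated mount option list contains the option "ro".
# Options such as errors=remount-ro do not count, only a bare "ro".
has_ro_option() {
    case ",$1," in
        *,ro,*) return 0 ;;
        *)      return 1 ;;
    esac
}

if command -v findmnt >/dev/null 2>&1; then
    # Current mount options of the root filesystem, without a header line.
    opts=$(findmnt -no OPTIONS / 2>/dev/null || true)
    if [ -z "$opts" ]; then
        echo "could not determine mount options for /"
    elif has_ro_option "$opts"; then
        echo "/ is currently mounted read-only"
    else
        echo "/ is currently mounted read-write"
    fi
fi
```

If / turns out to be read-only, the kernel log (journalctl -k, or dmesg) from that boot is the place to look for the filesystem error that triggered the remount.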

If it was my system, I would definitely reinstall, unless there was some better idea/advice.

I just wonder how this happened… :thinking:

Thank you. I did this after the last time it didn’t capture the logs.

Looks like a good lead. I have no idea how it happened, though.

Will probably have to reinstall :frowning_face:.
Found something worthwhile here: https://www.toolbox.com/tech/operating-systems/question/file-system-runs-into-read-only-mode-061114/

Now I’m worried my hard drive may be starting to die.
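
For what it's worth, a rough sketch of how the drive's own health report could be queried, assuming the smartmontools package is installed and the disk is /dev/sda as in the inxi output above. The function name and the SMARTCTL override variable are hypothetical; smartctl -H asks the drive for its overall health self-assessment and normally needs root.

```shell
# Ask a drive for its SMART overall health self-assessment.
#   $1: block device (defaults to /dev/sda, taken from the inxi output)
# SMARTCTL can be overridden for a dry run (hypothetical convenience).
check_drive_health() {
    dev=${1:-/dev/sda}
    tool=${SMARTCTL:-smartctl}
    if ! command -v "$tool" >/dev/null 2>&1; then
        echo "$tool not found (on Arch it is in the smartmontools package)" >&2
        return 127
    fi
    "$tool" -H "$dev"
}
```

From a root shell, check_drive_health /dev/sda prints the verdict; smartctl -a /dev/sda gives the full attribute table if more detail is wanted. A passing result does not rule out filesystem corruption from other causes.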


Just an update

I reinstalled :sob:
Everything seems to be working normally again :crossed_fingers: .

I do think the installer needs some polishing, but that’s an entirely separate issue.

Thanks for all the help everybody.

The only working solution I found was this:

This would be the root cause; there was a bug in iwd recently that has since been fixed.

