Systemd-boot -> refind hangs on start job for partition

ser · June 30, 2024, 1:55am

Hi,

systemd-boot was working fine, but I’d like to have access to snapshot boots, so I’m trying to switch to refind. I get the refind menu, and I’ve configured things far enough that the EndeavourOS logo shows up; I’m pretty sure it’s pointed at the correct linux and initrd.

There is no encryption, or other complications; it’s a fairly standard btrfs install on a single NVMe with a partition layout that looks (more or less, from memory) like this:

nvme0n1p1    vfat    /efi    
nvme0n1p2    btrfs    subvol=/@        /
nvme0n1p2    btrfs    subvol=/@home    /home

Everything is in /efi; there are some xen images in /boot – which is a dir, not a partition – and refind puts some config file in there, but /boot doesn’t otherwise seem to be used. I’ve enabled all of the naming changes in the refind configs from vmlinuz to linux, which I think is dracut’s doing, and provided full paths (minus the leading /efi, as instructed on one site). The refind-btrfs.efi(?) file was dutifully copied to the /efi directory as instructed (refind-install did it, actually; I just verified that it was there).

My issue is that, shortly into the boot process, I get the dreaded

A start job is running for /dev/disk/by-uuid/<uuid> (16min 35s / no limit)

Yeah, that’s where it is right now on my most recent attempt. I have a rescue EOS USB stick I’m booting into, mounting the nvme btrfs subvols and then arch-chroot-ing into.

None of the search results seem to match my situation; folks have problems with encrypted partitions and so on, and a lot of folks have problems in general with the migration… but it’s showing me the correct UUID (triple-checked) – it just never starts up the root partition.

FWIW, here’s the mostly useless screen shot of the boot:

I’ll have to reboot again into the rescue USB to get at the refind configs, but I’m already late for bed and starting to nod off here, so if they’re crucial I’ll do it tomorrow.

I would greatly appreciate any pointers. I suspect that the easiest fix to get this back to booting is just to uninstall refind and re-enable systemd-boot, but I’d really like to have access to refind-btrfs, which will allow me to boot from snapshots – which systemd-boot does not allow.

I have considered re-installing EndeavourOS with no boot manager; I’ve read that doing this and then installing refind before exiting the install media works flawlessly; but I’d obviously prefer to not have to go through re-installing and reconfiguring everything. This ironically is what the snapshots were supposed to prevent.

Anyway, has anyone else encountered this? Did I miss a crucial step? Why is it unable to initialize/mount the partition? This feels as if I’m just missing one tiny piece.

Thank you.

dalto · June 30, 2024, 2:19am

You need to remove kernel-install-for-dracut and then install eos-dracut.

Then reinstall your kernel packages.

From then on, your kernels and initramfs images will be in /boot
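
For anyone following along, here is a minimal sketch of those three steps, meant to be run as root (for example inside the arch-chroot). The package names come from the post above; the `run` wrapper and the `APPLY` variable are purely illustrative scaffolding, added so that the commands are only printed unless you opt in. These are not necessarily the exact commands dalto had in mind.

```shell
#!/bin/sh
# Hypothetical helper: print each command unless APPLY=1 is set.
run() {
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

switch_to_eos_dracut() {
    # 1. Remove the kernel-install based dracut integration.
    run pacman -R kernel-install-for-dracut
    # 2. Install the variant that, per the post, puts kernels and initramfs images in /boot.
    run pacman -S eos-dracut
    # 3. Reinstall the kernel packages so the images are regenerated.
    run pacman -S linux linux-headers
}

switch_to_eos_dracut
```

Run as-is it only lists the commands; `APPLY=1 sh script.sh` would execute them. Adjust the last line if you use a kernel other than `linux`.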

ser · June 30, 2024, 11:38am

I’ll give that a shot as soon as the darn cat gets off my lap. Until then, I’m immobile.

I’d seen the advice to remove kernel-install-for-dracut and had done that; the eos-dracut was the part I was missing.

Thank you!

ser · June 30, 2024, 1:21pm

I manually created an entry to include the EndeavourOS icon, and set initrd=/boot/initramfs-linux.img, which eos-dracut created. This is:

menuentry EndeavourOS {
  icon /EFI/refind/icons/os_endeavouros.png
  volume <partition UUID>
  loader /boot/vmlinuz-linux
  initrd /boot/initramfs-linux.img
  options "nvme_load=YES rw rootflags=subvol=/@ root=UUID=<partition UUID> systemd.machine_id=<copied from old systemd-boot>"
  enabled
}

When I boot that configuration, I get an Invalid loader file and Error: Not Found while loading vmlinuz-linux. The generated load options are as above, with the initrd added as initrd=\boot\initramfs-linux.img. ~~It’s obviously finding the vmlinuz-linux file… what is it not finding after that?~~

Is it not finding the vmlinuz-linux?
Is the root=UUID=<partition GUID> redundant, because it’s specified in the volume attribute above?
Do I need to move /efi to /boot, and move /boot to the vfat partition?
Should I be adding /boot/amd_ucode.img before /boot/initramfs-linux.img? Does dracut automatically pick it up? The Arch wiki says that dracut will embed the microcode in the initramfs, but I haven’t changed any dracut configuration to tell it to do so.
I do have use_nvram true in the refind config, FWIW; I don’t think that would make a difference?

Update

So, reading the refind.conf more closely, the volume apparently has to be the partition GUID. I replaced the volume with the PARTUUID output from blkid, which didn’t help. I changed it in both the volume attribute, and in the root=UUID=... argument in options. Next I’m going to try putting root=UUID back, because systemd-boot never used the PARTUUID; this is the first time I’ve ever used it, but the refind.conf instructions are pretty clear about using the Partition GUID.

Strangely, I also tried booting with minimal options – ro root=/dev/nvme0n1p2 – which got much further than any other configuration, but still got hung on the original problem: job stuck on /dev/nvme0n1p2 (this time a path because the options were different).

TL;DR: I think refind can’t find vmlinuz-linux, and my suspicion is because it and the initramfs are in /boot which is a directory in a btrfs subvolume and somehow refind isn’t able to access that. I did verify that /efi/EFI/refind/drivers_x64/btrfs_x64.efi exists, so refind should have access to btrfs volumes; refind-install states that it’s installing it in any case.

I have one other concern: despite mounting the proper partition and arch-chrooting into it, if I move /boot/refind_linux.conf and call refind-install, it creates a wacko refind_linux.conf based on the rescue USB I booted from, which makes me wonder what it might be doing weird b/c of my rescue boot. Also, dracut keeps trying to write initrds to /efi/<somehashcode>/, which I thought was a systemd-boot thing.

The next thing I’m going to try is moving vmlinuz-linux and initramfs-linux to /efi and see if I can boot them from there, since refind is obviously able to access that; but I’m not sure what then to do with the various partition UUID settings, and it’s still not going to be right when I upgrade in the future. I think my main question still remains: is my layout wrong; does /boot need to be on the vfat partition and /efi in boot, like /boot/efi?

dalto · June 30, 2024, 3:33pm

Did you reinstall your kernel packages after installing eos-dracut? If not, your kernels won’t be in the correct place.

Once you do that, refind should automatically find your kernels without the need for a manual stanza.

No, if you do that, you will break snapshot booting.

No

Yes

It shouldn’t. Make sure you use sudo dracut-rebuild if you want to manually run dracut.

Don’t do that. Even if it worked, it would break your snapshot booting.

No

ser · June 30, 2024, 3:48pm

Edit 2
Yeah, I think I have all of the paths right now; it’s no longer complaining early about not finding vmlinuz or whatnot. However, whether I use UUID or PARTUUID, the boot process continues to hang at

A start job is running for /dev/disk/by-uuid/...  (... / no limit)
      or
A start job is running for /dev/disk/by-partuuid/...  (... / no limit)

Both IDs are correct (confirmed with blkid) and refer to the btrfs partition containing /boot in the @ subvolume, which itself is specified in the boot options. I’m back at the beginning.

Edit Sorry, I forgot to address your entire comment. I only ran dracut manually in a Pollock-esque desperation.

Did you reinstall your kernel packages after installing eos-dracut? If not, your kernels won’t be in the correct place.

I did re-install linux and linux-headers.

Once you do that, refind should automatically find your kernels without the need for a manual stanza.

You mean, I shouldn’t need the menus in refind.conf? Or that I won’t need the lines in /boot/refind_linux.conf? I did try moving /boot/refind_linux.conf out of the way, but refind tried to create a configuration based on the rescue USB, so I think I have to wait until I can boot before trying to let it create that itself.

Ok, thanks for the answers.

I found, eventually, that I could make some progress by giving the correct paths to the files; setting initrd=\@\boot\initramfs-linux.img got one of the configs to boot further, but it’s still getting stuck at that original “A start job is running” for the disk partition. Waiting, with no limit, to infinity and beyond. So, I think I know how to set the correct paths to vmlinuz-linux and initramfs-linux; I still do not know why it’s getting stuck at that start job.

Progress, in that I think I’ve exorcised the systemd-boot parts, but it still just puts me back at the beginning. Am I supposed to use the PARTUUID for volume? Am I supposed to use it in the root option?

ser · June 30, 2024, 4:47pm

Hmmm. I took out the nvme_load=YES and noticed something flash by about NVRAM overflow and microcode not being loaded. And now there’s an error near the top about nvme nvme0: missing or invalid SUBNQN field. Maybe it was always reporting this, and I didn’t see it until I turned off the NVME load option? I don’t know. In any case, it’s still not able to get past that darned start job for the drive. It keeps getting hung up in the same place, no matter what I try.

I’ve re-installed the kernel from a rescue chroot several times now; I’ve run refind-install a couple of times – it mostly just says things are already there and it’s not doing anything.

Why can’t it settle the NVMe and mount it?

dalto · June 30, 2024, 4:48pm

How are you building your chroot?

ser · June 30, 2024, 4:53pm

How are you building your chroot?

Boot from rescue USB, then:

sudo mount --options subvol=/@ /dev/nvme0n1p2 /mnt
sudo mount --options subvol=/@home /dev/nvme0n1p2 /mnt/home  # I know this is entirely unnecessary
sudo su -
arch-chroot /mnt
mount /dev/nvme0n1p1 /efi

The rescue USB is an older EndeavourOS USB I used to originally install this machine; it’s the 6.1.9-arch1-2 kernel.

I said somewhere above that I changed the volume to be the PARTUUID because of a comment in the config. Does that make a difference? Will the UUID work? Does it matter whether, in the options, I use root=UUID=... or root=PARTUUID=..., so long as they have the correct values? Could any of these be making a difference?

It sort of feels as if the right modules are not being included in the kernel or initrd images. That’s the only thing I can think of why it’d get stuck “starting” the root partition.

Update I messed around with dracut, added btrfs and btrfs-subvol to the add_modules, re-installed linux, and ensured dracut said they were being pulled in. Rebooted, got stuck on starting that device, went to lunch and it was still stuck 48 minutes later.

BluishHumility · June 30, 2024, 9:32pm

The volume entry is the partition where the kernels and images are stored (i.e. the Btrfs partition). You must describe the volume with either the filesystem label, the partition label, or the PARTUUID, not the filesystem UUID.
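
To make the UUID versus PARTUUID distinction concrete, here is a small sketch that pulls both values out of one line of blkid output and prints where each would go. The `blkid_tag` helper is a made-up name, and the IDs in the sample line are placeholders, not values from this thread.

```shell
#!/bin/sh
# Hypothetical helper: extract one tag (e.g. UUID or PARTUUID) from a blkid line.
blkid_tag() {
    # $1 = tag name, $2 = one line of blkid output
    printf '%s\n' "$2" | sed -n "s/.* $1=\"\([^\"]*\)\".*/\1/p"
}

# Sample line with placeholder IDs; on a real system: line=$(blkid /dev/nvme0n1p2)
line='/dev/nvme0n1p2: UUID="11111111-2222-3333-4444-555555555555" TYPE="btrfs" PARTUUID="aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee"'

# The rEFInd "volume" token takes the partition GUID (PARTUUID) ...
echo "volume $(blkid_tag PARTUUID "$line")"
# ... while the kernel's root=UUID= option takes the filesystem UUID.
echo "root=UUID=$(blkid_tag UUID "$line")"
```

The leading space in the sed pattern is what keeps `UUID` from also matching inside `PARTUUID`.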

Maybe you can post your stanza into the thread so we can take a look.

ser · June 30, 2024, 10:58pm

With a lot of help from @iam_tj:matrix.org, we tracked the issue down to the fact that the BusyBox included with dracut does not include zstd. Since all of the modules are zstd, the BusyBox modprobe was unable to insmod any of the kernel modules, including btrfs.

I’m still in the process of figuring out what to do with dracut – if it’s a configuration issue, or what – such that it produces a bootable system. Note, the kernel itself contained zstd, but since BusyBox is one monolithic tool that pretends to be many things, and since modprobe is just a symlink to BusyBox, having zstd in the kernel doesn’t help.
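
A quick way to see which binary is really behind a command name is to follow the symlink. The sketch below assumes `readlink -f` is available in the shell you are in (it is in GNU coreutils and in typical BusyBox builds); `resolve_tool` is just an illustrative name.

```shell
#!/bin/sh
# Hypothetical helper: print the name of the binary a command path resolves to.
resolve_tool() {
    basename "$(readlink -f "$1")"
}

# "busybox" means the BusyBox applet is in use; "kmod" means the full tool is.
if command -v modprobe >/dev/null 2>&1; then
    resolve_tool "$(command -v modprobe)"
fi
```

On the affected initramfs described here, one would expect this to report busybox rather than kmod.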

I’m assuming there’s some configuration option for dracut that’ll allow the boot process to load zstd compressed kernel modules, since that seems to be the EndeavourOS standard; the linux channel felt that dracut needed to be patched, but certainly I’m not the only person on EndeavourOS using dracut.

Anyway, by using rd.break=pre-mount in the boot options, I was able to determine that kmod has zstd (and BusyBox does not), and I was able to confirm that I could mount the root partition. Now I just have to figure out what config options include zstd and so create a bootable process.

c.f. https://github.com/dracutdevs/dracut/issues/945

dalto · June 30, 2024, 11:04pm

That doesn’t make a lot of sense. If that was the issue, how were you able to boot with systemd-boot?

ser · June 30, 2024, 11:10pm

That doesn’t make a lot of sense. If that was the issue, how were you able to boot with systemd-boot?

Beats me. I didn’t have this issue until I switched to refind, but maybe I was using mkinitcpio before? According to the ticket I linked above (in an edit, sorry) if you omit BusyBox from dracut, it boots using kmod and can load zstd-compressed modules. Include BusyBox, and it breaks.

The last comment on that ticket was 2021; I don’t know how relevant it is, but I’m going to try the simple solution and exclude BusyBox and see if it boots. I am afraid that if I try to patch dracut to use a… less minimalist… BusyBox, then it’ll break again on some future upgrade. There must either be a dracut configuration to get a better BusyBox that supports zstd, or exclude it.

ser · June 30, 2024, 11:35pm

The patch provided in the Linux channel was to update the dracut modules to include proper insmod, rmmod, modinfo, and modprobe and install them. This overwrites BusyBox’s minimal links and allows modprobing zstd-compressed modules.

Oh, and the dracut in EndeavourOS is broken with busybox; it was patched to replace a call to sh -i -l with a call to sulogin -e, except that the BusyBox sulogin doesn’t accept -e and so it breaks. Changing this back to sh allows interrupting the boot process into a shell.

However, if you want to see this yourself, and you’re using dracut, this is how we tracked this down:

1. Boot, but in the menu, edit the boot options and add rd.break=pre-mount
2. Boot with those options. You will see an error as BusyBox complains that sulogin doesn’t understand -e, leaving the boot in a hung state.
3. Reboot. Edit /usr/lib/dracut/modules.d/98dracut-systemd/dracut-emergency.sh, and on line 38 change exec sulogin -e to exec sh -i -l. You can try just removing the -e; I have not tested that.
4. Reboot, and re-do from step 1. You will then get a shell just before mount, which is where mine is hanging.
5. Check your modules, and confirm that you have zstd compressed modules. Try to modprobe one. Notice how it fails. In particular, try to modprobe btrfs, and you won’t be able to.
6. Now, find kmod. This is copied into the image by dracut, and properly contains zstd. To make it act like modprobe, make a symlink: ln -s $(which kmod) ./modprobe. Note, you cannot overwrite any of the BusyBox symlinks, even if you do remount /usr as rw; something about BusyBox takes those over, so you just have to symlink somewhere else.
7. Run ./modprobe btrfs. Notice that it does work, and that you are now able to mount btrfs partitions.
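
The file edit, the module check, and the kmod symlink from the walkthrough above can be sketched as three small shell functions. The function names are invented for illustration, and `sed -i` is assumed to behave as in GNU or BusyBox sed. Treat this as a rough outline rather than a tested recipe.

```shell
#!/bin/sh
# Swap the sulogin call for a plain shell in dracut's emergency script.
patch_emergency_sh() {
    # $1 = path to dracut-emergency.sh
    sed -i 's/exec sulogin -e/exec sh -i -l/' "$1"
}

# Count zstd-compressed kernel modules under a module tree.
count_zst_modules() {
    # $1 = module directory, e.g. /lib/modules/$(uname -r)
    find "$1" -type f -name '*.ko.zst' | wc -l | tr -d ' '
}

# Expose kmod under the name "modprobe" in a directory of your choice.
link_kmod_as_modprobe() {
    # $1 = path to kmod, $2 = directory that will hold the new symlink
    ln -s "$1" "$2/modprobe"
}

# Intended use (first line on the installed system, the rest in the rd.break shell):
#   patch_emergency_sh /usr/lib/dracut/modules.d/98dracut-systemd/dracut-emergency.sh
#   count_zst_modules "/lib/modules/$(uname -r)"
#   link_kmod_as_modprobe "$(command -v kmod)" . && ./modprobe btrfs
```

The symlink trick works because kmod picks its behaviour from the name it is invoked under. After editing dracut-emergency.sh, the initramfs presumably has to be regenerated before the change shows up at boot.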

Something is borked in the EOS dracut/refind/BusyBox ecosystem; I don’t know enough to know what, but I’m in the process of searching for a way to configure dracut to use kmod to load modules instead of BusyBox tools; and failing that, I’m going to disable BusyBox in dracut, which will then default to kmod. I expect that the boot process will then be able to load modules, mount my root partition, and boot.

ser · June 30, 2024, 11:46pm

Removing busybox fixed it. BusyBox overwrites kmod with symlinks to itself, and it’s not capable of handling zstd-compressed modules. The fix was merely to remove /etc/dracut.conf.d/busybox.conf (or comment out the add_dracutmodules line, or remove “busybox” from the list). I re-installed linux and linux-headers to ensure the initramfs was regenerated, rebooted, and the system came right up.
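
If you are not sure whether something similar is lurking in your own setup, a sketch like the following lists any dracut config files that actively add the busybox module. `find_busybox_conf` is a made-up helper name; the directory is the one mentioned above.

```shell
#!/bin/sh
# Hypothetical helper: list config files with an uncommented add_dracutmodules
# line that mentions busybox.
find_busybox_conf() {
    # $1 = config directory, normally /etc/dracut.conf.d
    grep -lE '^[[:space:]]*add_dracutmodules\+?=.*busybox' "$1"/*.conf 2>/dev/null
}

find_busybox_conf /etc/dracut.conf.d || echo "no busybox module configured"
```

Any file it prints is a candidate for removal or editing, followed by a kernel reinstall so the initramfs is rebuilt. As an untested suggestion, `lsinitrd /boot/initramfs-linux.img | grep -i busybox` should then come back empty.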

dalto · July 1, 2024, 12:24am

That isn’t installed by default. That must be something you added

ser · July 2, 2024, 1:35pm

Maybe. After installing refind, I had a non-bootable system for a couple of days and was trying various things to get it to boot.

busybox is a standard dracut module; adding it overwrites kmod symlinks and breaks the ability to boot systems with zstd compressed kernel modules. Surely, that’s not expected behavior.

system · July 4, 2024, 1:36pm

This topic was automatically closed 2 days after the last reply. New replies are no longer allowed.