Unofficial zfs on root installation instructions

Yes, I’m going to do the same.

In my temporary setup with an ext4 /boot I found that GRUB fails to probe a ZFS root filesystem with the encryption feature enabled, resulting in a missing zpool name in the root= kernel option.

The ugly workaround until GRUB is fixed is to put the whole root= option into /etc/default/grub. Options there are appended to the kernel command line and the last root= option wins. Let’s hope all software parsing kernel command line options behaves the same way :slight_smile:
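As an illustration, a sketch of that workaround in /etc/default/grub (the dataset name is a placeholder; use whatever zpool/dataset your root actually lives on):

# /etc/default/grub
# Append an explicit root=; the last root= on the kernel command line wins.
GRUB_CMDLINE_LINUX_DEFAULT="loglevel=3 root=ZFS=zpendeavouros/ROOT/eos/root"

Then regenerate the config with grub-mkconfig -o /boot/grub/grub.cfg.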

Oh, by the way, when the install finishes, DO NOT REBOOT. Open a terminal, switch to root and issue killall gpg-agent ; umount /tmp/calamares........ ; zpool export -a, otherwise the boot process will fail (it refuses to import a dirty pool).
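In other words, something like this as root (the calamares mountpoint path is truncated above; use whatever the installer actually mounted):

killall gpg-agent
umount /tmp/calamares...    # installer's temporary mountpoint, path truncated in the post above
zpool export -a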

I, too, am looking for a way to easily install ZFS root. However, Cachy has some mods that I don’t like, though it’s still very fast and usable. I like BTRFS, yet I don’t think it’s as stable as ZFS.

I did try a few scripts, but there are some problems and they error out before completing the install.

This fork of Stev’s stuff didn’t fix the install for me: https://github.com/atomjack/arch-linux-configuration

There are some good instructions here:
https://docs.zfsbootmenu.org/en/v2.2.x/

I found these instructions fairly good; they worked for me. You need to check /etc/kernel/cmdline after install to make sure you don’t have a duplicate, bad root= parameter. Probably before reboot would be good, and before a Linux kernel upgrade it is mandatory, or you might get the wait-forever-for-UUID issue, but at least now we know you can just edit the kernel cmdline and remove the trailing root=UUID parameter.

I did the no-desktop install as I was interested in the ZFS install, and a desktop can always be added later. Quicker too. A test install inside a VM is always a good idea for something complex. For a ZFS install, I often install inside a VM, then shut down the VM, use qemu-nbd to create an nbd device (or losetup & kpartx), and then mirror or zfs send/recv the pool/datasets I want, but I have to integrate the booting myself. If you are after a simple minimal background, then don’t do that, but the test install in a VM is almost always a good idea.
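A rough sketch of the qemu-nbd part of that workflow (image name, nbd device and pool name are examples; the zfs send/recv and boot-integration steps are omitted):

modprobe nbd max_part=16                         # nbd driver with partition scanning
qemu-nbd --connect=/dev/nbd0 eos-zfs-test.qcow2  # expose the VM image as a block device
zpool import -d /dev -R /mnt zpendeavouros       # import the pool from the nbd partitions under /mnt
# ... zfs snapshot / zfs send | zfs recv to the real pool goes here ...
zpool export zpendeavouros
qemu-nbd --disconnect /dev/nbd0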

So, now I have this disk setup:

## bpool ZFS mirror ##################################
#                                                    #
#    == MD RAID0 (HDD) (/dev/md/boot) ============   #
#    |   -----------------   -----------------   |   #
#    |   |   /dev/sdb3   |   |   /dev/sdc3   |   |   #
#    |   -----------------   -----------------   |   #
#    =============================================   #
#                                                    #
#        -- (SSD) ----------------------------       #
#        |             /dev/sda3             |       #
#        -------------------------------------       #
#                                                    #
######################################################


## MD RAID1 (HDD) (/dev/md/mswap) ####################
#                                                    #
#    == MD RAID0 (HDD) (/dev/md/swap) ============   #
#    |   -----------------   -----------------   |   #
#    |   |   /dev/sdb4   |   |   /dev/sdc4   |   |   #
#    |   -----------------   -----------------   |   #
#    =============================================   #
#                                                    #
#        -- (SSD) ----------------------------       #
#        |             /dev/sda4             |       #
#        -------------------------------------       #
#                                                    #
######################################################


## zpendeavouros ZFS mirror ##########################
#                                                    #
#    == MD RAID0 (HDD) (/dev/md/root) ============   #
#    |   -----------------   -----------------   |   #
#    |   |   /dev/sdb5   |   |   /dev/sdc5   |   |   #
#    |   -----------------   -----------------   |   #
#    =============================================   #
#                                                    #
#        -- (SSD) ----------------------------       #
#        |             /dev/sda5             |       #
#        -------------------------------------       #
#                                                    #
######################################################

EOS falls into GRUB rescue, complaining that it does not know the hostdisk//dev/sda,gpt3 disk:

prefix='(hostdisk//dev/sda,gpt3)/BOOT/eos/boot@/grub'
root='hostdisk//dev/sda,gpt3'

However if I follow these steps in GRUB rescue:

set prefix='(hd0,gpt3)/BOOT/eos/boot@/grub'
set root='(hd0,gpt5)/ROOT/eos/root'
insmod normal
normal

I am able to boot. If I set up bpool as a single disk only (/dev/sda3) it boots fine. Not sure now if GRUB does not like the ZFS mirror, the MD RAID0, or the combination of the two… :confused:

Booting from raid0 sounds like a terrible idea to me. I want more reliability, not twice the likelihood of breakage, especially for boot, and performance isn’t important for boot. Same for swap and the system root: reliability over performance. I would be interested to know whether in general use it even makes any difference, or if you have a specific use case in mind.

In the comments you have mirror and raid0 referring to the same devices? Mirror != raid0. Do you mean raid1? What did you actually set up, 0 or 1?

Anyway, the grub config needs to load the appropriate mdraid module before it can access the MD data. I don’t know if grub supports raid0. It should be auto-detected, but you didn’t show that part of the config. UUIDs should match too.

But if you can load it manually in grub, it would seem the mdraid modules are loaded.

So, you have two HDD partitions in RAID0 in a zfs mirror with an SSD? Out of curiosity, why are you using that configuration?

It looks like the RAID 0 config is half of the zfs mirror. I have never seen a config like that before.

Ah … the perils of interpreting large ASCII art on a phone screen : - )

Maybe an attempt to have more IOPS for the HDD side of the mirror? I still wonder if it will make much of a difference in normal use, and complexity is the enemy of stability.

I’m not sure the second root= is correct; root= in grub means the boot partition with the grub modules, the initrd and the Linux kernel. I wish the grub people had used boot rather than root to specify the boot device, far fewer confused people over the years, but maybe it’s an artifact from when boot was part of root. I wonder what happens if you don’t specify the second root= in grub rescue, and whether the second root= does anything at all since you have prefix fully specified.

I’ve never seen the hostdisk syntax before, I wonder if it is breakage of some kind. https://forums.linuxmint.com/viewtopic.php?t=358323 ???

I would be ensuring you have GPT partitioning, but …gpt3 suggests that anyway.

On my physical machine I have two older 500 GB HDDs and one new 1 TB SSD. The RAID0 is not primarily for speeding up the HDDs; I could just concatenate them, but since both of them will form one virtual disk to mirror the SSD anyway, I can use whatever layout I like, and RAID0 speeds it up a bit.

The idea behind this combination is that when you buy several disks at a time, you usually get ones with serial numbers close to each other. That means they were built to the same design, from the same materials, under the same conditions, and that means they will likely fail AT THE SAME TIME (confirmed many times in reality). One might say “buy disks from different manufacturers”, but I wouldn’t be surprised if most of them are manufactured at the same Taiwan factory, labeling them “SEAGATE” in the morning, “Western Digital” in the afternoon and “HITACHI” (or whatever) during the night. My approach is more failproof: use completely different storage technology :slight_smile:

Although there is no equivalent of MD RAID’s "write-mostly" in ZFS, I assume the ZFS I/O scheduler will utilize the SSD more often, because thanks to its speed it will become ready for the next I/O much sooner than the HDDs, and as usual desktop work is mostly disk reads (roughly guessed ratio is 90:10) it should still be fast enough. Writes will of course be held back by the HDDs, but as that’s only 10% of I/O it doesn’t bother me much, I prefer data safety :slight_smile:
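For contrast, a hypothetical pure-MD version of this mirror using mdadm’s write-mostly flag would look roughly like this (/dev/md/hddstripe is a placeholder for the HDD RAID0; this is exactly the behaviour ZFS cannot express):

# RAID1 of the SSD partition and the HDD stripe; devices listed after
# --write-mostly still receive all writes but are avoided for reads
mdadm --create /dev/md/root --level=1 --raid-devices=2 \
      /dev/sda5 --write-mostly /dev/md/hddstripe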

Tried that and yup, fixing only prefix= is enough. Yes, I have GPT layout with hybrid MBR.

I think the assumption that the SSD is faster, so use the SSD by preference, may be wrong. IIRC Jim Salter did some testing of various hybrid SSD/HDD pools, and I don’t think the result was as you surmise. I don’t recall the results or whether he did both read and write testing, but he did test and did publish the results. Suggest you search out Jim’s blog where he published them, lots of interesting ZFS stuff there.

I wonder if the md layer is equivalent to a hw raid controller, and if it will interfere with ZFS’s idea of data committed to storage, and possibly open a corruption window in the sync-data-to-disk path.

I’ve considered doing something like this in the past, gluing 2x 2TB disks to mirror up with a 4TB disk, but never went that way.

Did you check grub.cfg to see if you are loading the mdraid module as part of the setup for the boot/root device? Seems like this would have to be the case.

I don’t think hybrid is the issue; maybe the Mint user had MBR only? It was just that it popped up with the hostdisk and grub cross reference, and seemed similar.

### BEGIN /etc/grub.d/00_header ###
insmod part_gpt
insmod part_msdos
insmod mdraid1x
insmod zfs
insmod zfscrypt
insmod zfsinfo
if [ -s $prefix/grubenv ]; then
  load_env

To me it seems that this error happens before GRUB is able to read grub.cfg, otherwise it would know the correct location. I don’t know how that GRUB stage is configured (where the prefix= is stored for that stage), though, so I might be wrong; GRUB is a complicated mess.

One thing that confuses me here is this:

[root@nbpg0603vm grub]# lsblk -o NAME,ALIGNMENT,MIN-IO,OPT-IO,PHY-SEC,LOG-SEC,ROTA,SCHED,RQ-SIZE,RA,WSAME,UUID
NAME                                              ALIGNMENT MIN-IO OPT-IO PHY-SEC LOG-SEC ROTA SCHED       RQ-SIZE  RA WSAME UUID
sda                                                       0    512      0     512     512    1 mq-deadline      64 128    0B 
├─sda1                                                    0    512      0     512     512    1 mq-deadline      64 128    0B 
├─sda2                                                    0    512      0     512     512    1 mq-deadline      64 128    0B 6B3A-0752
├─sda3                                                    0    512      0     512     512    1 mq-deadline      64 128    0B 10221501392988466463
├─sda4                                                    0    512      0     512     512    1 mq-deadline      64 128    0B 4e70ad0c-e1a6-48bb-8b71-b1444bed2afc
└─sda5                                                    0    512      0     512     512    1 mq-deadline      64 128    0B 17044796439083906037
sdb                                                       0    512      0     512     512    1 mq-deadline      64 128    0B 
├─sdb1                                                    0    512      0     512     512    1 mq-deadline      64 128    0B 
├─sdb2                                                    0    512      0     512     512    1 mq-deadline      64 128    0B C087-DDF8
├─sdb3                                                    0    512      0     512     512    1 mq-deadline      64 128    0B 9d11d677-faea-251f-0d8b-841c1496d2a7
│ └─md127                                                 0  65536 131072     512     512    1                 128 256    0B 
│   ├─md127p1                                             0  65536 131072     512     512    1                 128 256    0B 10221501392988466463
│   └─md127p9                                             0  65536 131072     512     512    1                 128 256    0B 
├─sdb4                                                    0    512      0     512     512    1 mq-deadline      64 128    0B 03f6c2b3-e894-f3e1-72c9-e1ee1d7cc11e
│ └─md125                                                 0  65536 131072     512     512    1                 128 256    0B a6431182-8735-7b14-0748-1c66848b028e
│   └─md124                                               0  65536 131072     512     512    1                 128 256    0B 4b5e64ff-ef6d-46ce-bf64-be4bc42e92e4
│     └─luks-4e70ad0c-e1a6-48bb-8b71-b1444bed2afc         0  65536 131072     512     512    1                 128 256    0B 69dfa4e5-aad8-4d4f-9619-58f0bcd5a0d8
└─sdb5                                                    0    512      0     512     512    1 mq-deadline      64 128    0B 84cceaac-c7ec-91cd-dc7d-11e02e7897f3
  └─md126                                                 0  65536 131072     512     512    1                 128 256    0B 
    ├─md126p1                                             0  65536 131072     512     512    1                 128 256    0B 17044796439083906037
    └─md126p9                                             0  65536 131072     512     512    1                 128 256    0B 
sdc                                                       0    512      0     512     512    1 mq-deadline      64 128    0B 
├─sdc1                                                    0    512      0     512     512    1 mq-deadline      64 128    0B 
├─sdc2                                                    0    512      0     512     512    1 mq-deadline      64 128    0B C0C9-70EA
├─sdc3                                                    0    512      0     512     512    1 mq-deadline      64 128    0B 9d11d677-faea-251f-0d8b-841c1496d2a7
│ └─md127                                                 0  65536 131072     512     512    1                 128 256    0B 
│   ├─md127p1                                             0  65536 131072     512     512    1                 128 256    0B 10221501392988466463
│   └─md127p9                                             0  65536 131072     512     512    1                 128 256    0B 
├─sdc4                                                    0    512      0     512     512    1 mq-deadline      64 128    0B 03f6c2b3-e894-f3e1-72c9-e1ee1d7cc11e
│ └─md125                                                 0  65536 131072     512     512    1                 128 256    0B a6431182-8735-7b14-0748-1c66848b028e
│   └─md124                                               0  65536 131072     512     512    1                 128 256    0B 4b5e64ff-ef6d-46ce-bf64-be4bc42e92e4
│     └─luks-4e70ad0c-e1a6-48bb-8b71-b1444bed2afc         0  65536 131072     512     512    1                 128 256    0B 69dfa4e5-aad8-4d4f-9619-58f0bcd5a0d8
└─sdc5                                                    0    512      0     512     512    1 mq-deadline      64 128    0B 84cceaac-c7ec-91cd-dc7d-11e02e7897f3
  └─md126                                                 0  65536 131072     512     512    1                 128 256    0B 
    ├─md126p1                                             0  65536 131072     512     512    1                 128 256    0B 17044796439083906037
    └─md126p9                                             0  65536 131072     512     512    1                 128 256    0B 
sr0                                                       0   2048      0    2048    2048    1 mq-deadline       2 128    0B 2023-05-28-11-02-36-00

In the grub.cfg there is

        set root='hd0,gpt3'
        if [ x$feature_platform_search_hint = xy ]; then
          search --no-floppy --fs-uuid --set=root --hint-bios=hd0,gpt3 --hint-efi=hd0,gpt3 --hint-baremetal=ahci0,gpt3  --hint='mduuid/9d11d677faea251f0d8b841c1496d2a7,gpt1'  8dda115c703ced1f
        else
          search --no-floppy --fs-uuid --set=root 8dda115c703ced1f
        fi

I understand this as: it should search for /boot on the filesystem with UUID 8dda115c703ced1f, but I don’t see such a UUID in lsblk’s output. IMO it should be 10221501392988466463.
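For what it’s worth, those two numbers appear to be the same pool GUID in different bases; grub.cfg uses hex while lsblk prints it in decimal:

printf '%x\n' 10221501392988466463    # -> 8dda115c703ced1f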

If you have BIOS/legacy booting enabled with GPT, then you should have an EF02-type partition of about 1 MB. I usually use sgdisk -a 128 -n1:128:+960k -t1:EF02 -c1:BIOS /dev/disk-device. Then you do grub-install --bios-specific-args-i-dont-remember /dev/disk-device. You probably have to make the BIOS partition bootable too. grub-install loads the grub bootloader into that partition and it should know enough from the install to find grub.cfg. If I use mdadm for boot it is usually with metadata=1.0, so the metadata is at the end of the partition and the front of the partition looks just like a normal filesystem, and raid1 rather than raid0, so blocks are sequential.

For UEFI boot, the grub bootloader is stored on the EFI System Partition (ESP), type EF00, vfat/fat32, with some configuration pointing to the rest of the boot process.
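For completeness, the usual grub-install invocations for the two cases (the exact flags weren’t quoted above; the device name and ESP mountpoint are placeholders):

grub-install --target=i386-pc /dev/sdX                                        # legacy BIOS on GPT: embeds core.img into the EF02 partition
grub-install --target=x86_64-efi --efi-directory=/efi --bootloader-id=GRUB    # UEFI: installs the grub EFI binary to the ESP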

I’m not even sure that grub supports raid0 as a boot device, which might be why you are having problems compared to raid1. Have you checked that mdraid raid0 is supported by grub as a boot device?

Ah … I see you are using LUKS-encrypted boot too. I know nothing about how that interacts with grub, and it adds another layer of complexity and potential confusion.

I would just be trying to get ext4 on mdadm raid1 working first, then after verifying that grub supports mdadm raid0 boot, get ext4 on raid0 working, and only then try to get bpool running over raid0.
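A minimal sketch of that first step, reusing partitions from the layout above (metadata 1.0 keeps the superblock at the end of the partition, as mentioned earlier):

mdadm --create /dev/md/boot --level=1 --metadata=1.0 --raid-devices=2 /dev/sdb3 /dev/sdc3
mkfs.ext4 -L boot /dev/md/boot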

sda:

Number  Start   End     Size    File system  Name                 Flags
 1      24,6kB  1049kB  1024kB               BIOS boot partition  bios_grub
 2      1049kB  538MB   537MB   fat32                             msftdata
 3      538MB   1610MB  1072MB
 4      1610MB  2681MB  1072MB
 5      2681MB  10,2GB  7512MB

sdb:

Number  Start   End     Size    File system  Name                  Flags
 1      24,6kB  1049kB  1024kB               BIOS boot partition   bios_grub
 2      1049kB  538MB   537MB   fat32        EFI system partition  boot, esp
 3      538MB   1075MB  537MB                Linux RAID            raid
 4      1075MB  1612MB  537MB                Linux RAID            raid
 5      1612MB  5368MB  3756MB               Linux RAID            raid

sdc:

Number  Start   End     Size    File system  Name                  Flags
 1      24,6kB  1049kB  1024kB               BIOS boot partition   bios_grub
 2      1049kB  538MB   537MB   fat32        EFI system partition  boot, esp
 3      538MB   1075MB  537MB                Linux RAID            raid
 4      1075MB  1612MB  537MB                Linux RAID            raid
 5      1612MB  5369MB  3757MB               Linux RAID            raid

Partition flags/types on sda were set by calamares (well, I chose more flags (raid, root, legacy_boot, boot) but obviously not all were written (dunno where that msftdata comes from)); sdb and sdc were set manually by me.

So I guess raid = GPT type FD00.
Can you use sgdisk -p /dev/sda etc.? That will give sector counts, not rounded values.

I would be interested in the preamble before the actual partition table too.

sda:

Number  Start (sector)    End (sector)  Size       Code  Name
   1              48            2047   1000.0 KiB  EF02  BIOS boot partition
   2            2048         1050623   512.0 MiB   0700  
   3         1050624         3143679   1022.0 MiB  8300  
   4         3143680         5236735   1022.0 MiB  8300  
   5         5236736        19908607   7.0 GiB     8300  

sdb:

Number  Start (sector)    End (sector)  Size       Code  Name
   1              48            2047   1000.0 KiB  EF02  BIOS boot partition
   2            2048         1050623   512.0 MiB   EF00  EFI system partition
   3         1050624         2099199   512.0 MiB   FD00  Linux RAID
   4         2099200         3147775   512.0 MiB   FD00  Linux RAID
   5         3147776        10483711   3.5 GiB     FD00  Linux RAID

sdc:

Number  Start (sector)    End (sector)  Size       Code  Name
   1              48            2047   1000.0 KiB  EF02  BIOS boot partition
   2            2048         1050623   512.0 MiB   EF00  EFI system partition
   3         1050624         2099199   512.0 MiB   FD00  Linux RAID
   4         2099200         3147775   512.0 MiB   FD00  Linux RAID
   5         3147776        10485726   3.5 GiB     FD00  Linux RAID

I just noticed the EFI partition on sda having type 0700. I’d swear it wasn’t there. Fixed with gdisk:

Number  Start (sector)    End (sector)  Size       Code  Name
   1              48            2047   1000.0 KiB  EF02  BIOS boot partition
   2            2048         1050623   512.0 MiB   EF00  
   3         1050624         3143679   1022.0 MiB  8300  
   4         3143680         5236735   1022.0 MiB  8300  
   5         5236736        19908607   7.0 GiB     8300  
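For reference, the same type change can also be done non-interactively with sgdisk (partition number from the table above):

sgdisk -t 2:EF00 /dev/sda    # set partition 2's GPT type code to EFI System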

NOTE: I am now trying to create MD RAID1 instead of ZPOOL for /boot to check if that RAID0 is the problem… So far

grub-install: warning: Couldn't find physical volume `(null)'. Some modules may be missing from core image..

I’m going to reboot, fingers crossed :smiley:

So having MD RAID1 instead of a zpool for /boot resulted in booting without a glitch, so the RAID0 member doesn’t seem to be the problem. Probably the zpool mirror isn’t supported.

The alignment is better than I had expected.
A start sector of 48 is much better than 33 or 34, which are common old values, and end sector 2047 means the next partition starts at 2048, i.e. 1 MiB alignment, which is good.

I’m not sure FD00 is needed these days; I thought it was an artifact of kernel autodetect and mdraid metadata 0.90. You are loading mdraid1x, which is metadata 1.0/1.1/1.2.

Yes, all the MD RAIDs are metadata 1.2.

Funny, I would not draw that conclusion.

The OpenZFS Root on ZFS guide has plenty of examples of a mirrored bpool.
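Roughly what that guide’s mirrored bpool creation looks like (a trimmed sketch assuming OpenZFS 2.1+ with its grub2 compatibility file; see the guide for the full property list and by-id device paths):

zpool create -o ashift=12 -o compatibility=grub2 \
    -O mountpoint=/boot -R /mnt \
    bpool mirror /dev/sdX3 /dev/sdY3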

Mirrored on top of mdraid device, not so much though.

Possibly worth trying a mirrored bpool on top of md raid1, not that it makes sense, just that it gives sequential blocks, not a stripe.

Or you could try booting ext4 on raid0, and see if that works.

Maybe some problems with knowing exactly which disk is being used, and thus which blocks are being accessed, when you have multiple ESPs and grub is referencing raid devices.

It would be so much easier to test if it were in a VM : - ) Could easily detach disks then.