My testing of bcachefs

DISCLAIMER: I have spent limited time with bcachefs so it is possible I am misunderstanding some of it
DISCLAIMER 2: When I describe how things work I am referring to how they present in user space and am not implying anything about the underlying implementation
DISCLAIMER 3: The focus of my investigation was subvolume and snapshot handling because that is important to my use case. However, there are lots of other important things about a filesystem in addition to this one topic.

I recently took a look at bcachefs because I was hoping it might be an improvement over btrfs in subvolume and snapshot handling. Ultimately, I wanted to see if it could be a replacement for zfs. If you want the short version, it isn’t. If you have the time and interest for the longer version, feel free to keep reading.

First, I need to explain my personal challenges with btrfs snapshots to provide some perspective.

Btrfs snapshots store minimal metadata and are only loosely hierarchical. This presents several problems in practical use.

  • In many situations it is impossible to tell which subvolume the snapshot came from
  • There is no practical difference between a snapshot and a subvolume; snapshots basically are subvolumes
  • Snapshots can only be created at a location within the mounted filesystem
  • You can only take snapshots of mounted subvolumes

From a management perspective this creates challenges. For example:

  • It is impossible to create generic tooling to restore snapshots (because there is no way to know where they should be restored to)
  • To take snapshots you may need to dynamically mount parts of the filesystem the user doesn’t want mounted

Just to get an idea of what I am referring to, consider a situation where there are two subvolumes, each with three snapshots

Now, let’s say we restore snapshot 2 of subvolume 1 and then take a new snapshot. You might expect the result to be something like this

Unfortunately, the end result in btrfs is this

Why is that? Because btrfs subvolumes only track their immediate parent. Once you remove the parent, the children all become orphaned.
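
To make that concrete, here is roughly what the snapshot-based restore I am describing looks like (the paths are made up for illustration):

# replace subvolume1 with a writable snapshot of its second snapshot
$ sudo btrfs subvolume delete /mnt/pool/subvolume1
$ sudo btrfs subvolume snapshot /mnt/pool/snapshots/subvol1-snap2 /mnt/pool/subvolume1

Any remaining snapshots of the old subvolume1 now reference a parent that no longer exists, which is exactly the orphaning described above.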

If we restore a snapshot in subvolume 2 we end up with this

It is still pretty clear what happened due to the color coding. However, that was just there to make the diagram easier to understand. In reality, that differentiation isn’t there, leaving us with this

[diagram: the same layout as above, but without the color coding]

What are those snapshots of? No idea.

So what happens in practice? Each snapshot management solution for btrfs creates a different method for “tracking” snapshots. For example, Snapper creates a .snapshots subvolume and keeps all the snapshots for a subvolume inside it. It also adds a file with additional metadata along with each snapshot. Of course, every other solution does something completely different, which only increases the divergence.
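
For example, a typical Snapper layout looks roughly like this (the paths are illustrative):

/home/.snapshots                 the .snapshots subvolume Snapper creates
/home/.snapshots/1/snapshot      the actual snapshot (itself a subvolume)
/home/.snapshots/1/info.xml      Snapper’s extra metadata (creation date, type, description)
/home/.snapshots/2/snapshot
/home/.snapshots/2/info.xml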

NOTE: In these diagrams I am assuming that we are using snapshots to do the restores but even if you use other methods, the diagrams end up just as messy. They are just messy in different ways.

I was hoping that bcachefs would put a modern spin on this and increase usability. Here is what I found.

First, let’s talk about subvolumes in bcachefs. Subvolumes present as directories in bcachefs, which means they share all the disadvantages of btrfs subvolumes but also have some new ones. Subvolumes can’t be mounted, and there is no user-space tool to list them. That means:

  • Subvolumes are basically indistinguishable from directories. I hope you took good notes when you were creating them.
  • You can’t use flat subvolume layouts. The closest you can get is to mount the root somewhere and then use bind mounts. That will require the root to always be mounted somewhere though.
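
As a rough sketch of that bind-mount workaround (the mount points and subvolume names here are just examples):

# create a subvolume inside the mounted filesystem
$ sudo bcachefs subvolume create /mnt/pool/home
# subvolumes can't be mounted directly, so expose it with a bind mount
$ sudo mount --bind /mnt/pool/home /home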

So what about snapshots? Well, since bcachefs snapshots are presented through subvolumes they effectively have all the same limitations as btrfs snapshots but they get some new ones added in.

  • You can’t get a list of subvolumes
  • There is no way to see how much space each snapshot is uniquely consuming
  • There is no way to see even what the immediate parent is
  • There is no way to directly boot off a snapshot since they can’t be mounted
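
For reference, this is roughly how taking a snapshot looks with bcachefs-tools as I tested it (the exact syntax may differ between versions):

$ sudo bcachefs subvolume snapshot /mnt/pool/home /mnt/pool/home-snap1

The destination has to be a path inside the mounted filesystem, just like with btrfs.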

Essentially, it is just an unmanaged mess. I have always thought that whoever designed btrfs snapshots didn’t really consider all the use cases for them. bcachefs, though, feels like the developers focused on the efficiency of the implementation and completely ignored usability.

To be fair, some of the issues will probably be fixed as the tools mature. For example, I expect that there will eventually be ways to list subvolumes and snapshots. However, some of these issues seem more foundational. Not allowing subvolumes to be mounted directly and not storing good, usable metadata with snapshots seem like problems not likely to be solved unless the developers start thinking of things differently.

My hope that bcachefs could be a viable replacement for zfs is, unfortunately, not likely to be realized.

Of course, I would love to be wrong, so if I have made mistakes on any of this, please let me know.

9 Likes

Wow, that was a really thoughtful explanation.

In what ways does zfs handle snapshotting better than btrfs or bcachefs? Sorry if that is a stupid question; I haven’t taken the time to learn much about zfs yet.

4 Likes

First, there are nomenclature differences. The rough equivalent of a subvolume in zfs is a dataset, although zfs datasets are a more flexible concept than subvolumes.

In zfs, snapshots are part of the dataset they came from, so there is never any doubt about their origin. You also know when they were created. If you want a list of a given dataset’s snapshots with their creation times, you can always get that. You can also take snapshots of datasets that aren’t mounted.
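
For example (the pool and dataset names are just illustrative):

# take a snapshot of a dataset, mounted or not
$ sudo zfs snapshot tank/home@before-upgrade
# list that dataset's snapshots along with their creation times
$ zfs list -t snapshot -o name,creation -r tank/home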

If you want to “restore” a snapshot, there are two ways to do that depending on what you want the outcome to look like.

  • Rollback - This returns the filesystem to the point in time when the snapshot was taken. This also means that snapshots after that point are destroyed.
  • Clone/Promote - You can take a clone of a snapshot and then promote that clone. In this case, the promoted clone holds the cloned snapshot and any prior snapshots and the original dataset contains snapshots that occurred after that point.
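
A rough sketch of both approaches (dataset and snapshot names are hypothetical):

# rollback: return tank/home to snap2; -r destroys any snapshots taken after it
$ sudo zfs rollback -r tank/home@snap2

# clone/promote: keep the later snapshots on the original dataset instead
$ sudo zfs clone tank/home@snap2 tank/home_restored
$ sudo zfs promote tank/home_restored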

Of course, you can also mount a snapshot and simply copy the data out of it. This is true of all the filesystems though.
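
With zfs that can be done through the hidden .zfs directory of the mounted dataset, for example:

$ cp -a /tank/home/.zfs/snapshot/snap2/important-file /tank/home/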

Honestly, although I find this method far superior from a usability perspective, it isn’t perfect either. Zfs is a much older filesystem, which is why I was hoping that bcachefs would improve on it.

3 Likes

Hmm, that does sound way better. Easier to understand, and more features. It seems like Btrfs should have been able to put together something more polished like this, but then again I am talking out of my hat a bit.

So…when is daltofs due to be released? :eyes: :grin:

5 Likes

Thanks for the investigation - and the explanation! I heard of it, and considered investigating, but gave up on it when I got lost! Now at least I know enough to drop the idea completely!

I guess I just creep along slowly trying to learn enough zfs to get by…
(currently using it, but avoiding the snaps entirely!)

The output of btrfs subvolume list is unclear and confuses a lot of people. However, there are parent and child IDs that are important to the tree structure (B-tree).

Each snapshot/subvolume always has two IDs: its own ID and a level ID (= parent ID). This parent/level ID points to its parent subvolume/snapshot.

You can check

$ sudo btrfs subvolume list / | grep <Level ID>

This tells you which subvolume the snapshots with that level ID belong to.

I don’t see the parent ID being deleted after restoring any snapshot. Snapshots don’t lose their parent ID at all; their parent subvolumes are always there.


Edit://

My guess:
After restoring a snapshot, a new subvolume is created with its own new ID and takes on the same level ID as the old parent subvolume when it is mounted. The old parent subvolume becomes an invisible, hidden subvolume, but it has to stay as long as its children (snapshots / subvolumes) still depend on it in the B-tree.
But btrfs subvolume list does not show you where the hidden original parent subvolume is located.

If you want to delete the hidden parent subvolume, you have to delete all of its children (snapshots and sub-subvolumes), then run btrfs balance, which would delete this unused parent subvolume. That’s my guess.

This is how the B-tree works.

If you know Git well, I would compare Btrfs and Zfs snapshots with Git. I did not try bcachefs (it is still experimental and in development).

The behaviour of Zfs snapshots is like Git commits.
The behaviour of a Zfs dataset is like a Git branch.

The behaviour of Btrfs subvolumes is like Git branches.
The behaviour of Btrfs snapshots is like Git branches, but not Git commits.

This is not really true. As soon as you delete the parent subvolume, the parent IDs are no longer useful.

The restore method I am describing above is one where you delete the original subvol and replace it with a snapshot of the snapshot. There are other ways to do it, but unless you keep every snapshot and subvol forever, they are equally as messy over time.

Again, as I stated before I started, I am not talking about the underlying implementation, but how they present in user space. Even if it is theoretically possible to access the data, if there are no tools that expose it, it is irrelevant to my needs.

Sorry for my unclear English.
I mean that the level ID of a snapshot cannot be deleted after restoring; this level ID always points to the parent subvolume’s ID.

What I thought before:

  • Level ID of snapshot = ID of parent subvolume
  • Level ID of snapshot cannot be changed.

I could be wrong, because I didn’t test restoring any snapshots today.

I’m wondering what happens to the level ID of the snapshot after restoring if the old original parent subvolume is gone.

  • If the level ID of the snapshot is changed, then it is no longer the original parent ID.

Btrfs does.

I had time today and created 3 subvolumes with some random snapshots, all on the same top level 5, without using any third-party tool (e.g. snapper).

ID 262 gen 1653 top level 5 path subvolume1
ID 263 gen 1656 top level 5 path subvolume2
ID 264 gen 1659 top level 5 path subvolume3
ID 265 gen 1651 top level 5 path snapshot_ba1
ID 266 gen 1652 top level 5 path snapshot_btk
ID 267 gen 1653 top level 5 path snapshot_7f2
ID 268 gen 1654 top level 5 path snapshot_9ge
ID 269 gen 1655 top level 5 path snapshot_b8i
ID 270 gen 1656 top level 5 path snapshot_b2z
ID 271 gen 1657 top level 5 path snapshot_nma
ID 272 gen 1658 top level 5 path snapshot_x6p
ID 273 gen 1690 top level 5 path snapshot_ha2

From this output alone, you do not know which snapshots belong to subvolume1.

A solution is here:

Run $ sudo btrfs subvolume list -uq / to see that the parent_uuid of each snapshot matches the uuid of its source subvolume.

ID 262 gen 1653 top level 5 parent_uuid -                                    uuid 5225a8e3-d2d3-6d4f-97af-0e2b6dc87d6a path subvolume1
ID 263 gen 1656 top level 5 parent_uuid -                                    uuid d7680198-06e3-1a4b-bb14-051986627f7d path subvolume2
ID 264 gen 1659 top level 5 parent_uuid -                                    uuid 837b6d00-8a78-fa4f-b847-8828cd1e09fd path subvolume3
ID 265 gen 1651 top level 5 parent_uuid 5225a8e3-d2d3-6d4f-97af-0e2b6dc87d6a uuid 68d371d3-4619-0940-a80c-6081a1113455 path snapshot_ba1
ID 266 gen 1652 top level 5 parent_uuid 5225a8e3-d2d3-6d4f-97af-0e2b6dc87d6a uuid a144c04a-5e0d-644c-ac3d-b1f76dbd383c path snapshot_btk
ID 267 gen 1653 top level 5 parent_uuid 5225a8e3-d2d3-6d4f-97af-0e2b6dc87d6a uuid 989c071a-1247-df40-a5e9-b177bcd4d8d4 path snapshot_7f2
ID 268 gen 1654 top level 5 parent_uuid d7680198-06e3-1a4b-bb14-051986627f7d uuid 424d3a9c-56c1-164f-9ccd-8fc203617f0d path snapshot_9ge
ID 269 gen 1655 top level 5 parent_uuid d7680198-06e3-1a4b-bb14-051986627f7d uuid 4bf0471d-209c-6741-8d67-026b30d4e9f7 path snapshot_b8i
ID 270 gen 1656 top level 5 parent_uuid d7680198-06e3-1a4b-bb14-051986627f7d uuid 5a26b420-8ca7-4f41-9e11-e2d26ae6fbca path snapshot_b2z
ID 271 gen 1657 top level 5 parent_uuid 837b6d00-8a78-fa4f-b847-8828cd1e09fd uuid e589d111-146b-db46-a064-b623d50270e8 path snapshot_nma
ID 272 gen 1658 top level 5 parent_uuid 837b6d00-8a78-fa4f-b847-8828cd1e09fd uuid 85795713-0df5-4e48-9812-5fdffaf36b34 path snapshot_x6p
ID 273 gen 1690 top level 5 parent_uuid 837b6d00-8a78-fa4f-b847-8828cd1e09fd uuid 1ea47aa6-f8f6-e447-94d6-5bf4aedb5616 path snapshot_ha2

I deleted subvolume1; its UUID is still there and cannot be deleted because its snapshots need it.

ID 263 gen 1800 top level 5 parent_uuid -                                    uuid d7680198-06e3-1a4b-bb14-051986627f7d path subvolume2
ID 264 gen 1800 top level 5 parent_uuid -                                    uuid 837b6d00-8a78-fa4f-b847-8828cd1e09fd path subvolume3
ID 265 gen 1800 top level 5 parent_uuid 5225a8e3-d2d3-6d4f-97af-0e2b6dc87d6a uuid 68d371d3-4619-0940-a80c-6081a1113455 path snapshot_ba1
ID 266 gen 1824 top level 5 parent_uuid 5225a8e3-d2d3-6d4f-97af-0e2b6dc87d6a uuid a144c04a-5e0d-644c-ac3d-b1f76dbd383c path snapshot_btk
ID 267 gen 1800 top level 5 parent_uuid 5225a8e3-d2d3-6d4f-97af-0e2b6dc87d6a uuid 989c071a-1247-df40-a5e9-b177bcd4d8d4 path snapshot_7f2
ID 268 gen 1800 top level 5 parent_uuid d7680198-06e3-1a4b-bb14-051986627f7d uuid 424d3a9c-56c1-164f-9ccd-8fc203617f0d path snapshot_9ge
ID 269 gen 1800 top level 5 parent_uuid d7680198-06e3-1a4b-bb14-051986627f7d uuid 4bf0471d-209c-6741-8d67-026b30d4e9f7 path snapshot_b8i
ID 270 gen 1800 top level 5 parent_uuid d7680198-06e3-1a4b-bb14-051986627f7d uuid 5a26b420-8ca7-4f41-9e11-e2d26ae6fbca path snapshot_b2z
ID 271 gen 1800 top level 5 parent_uuid 837b6d00-8a78-fa4f-b847-8828cd1e09fd uuid e589d111-146b-db46-a064-b623d50270e8 path snapshot_nma
ID 272 gen 1800 top level 5 parent_uuid 837b6d00-8a78-fa4f-b847-8828cd1e09fd uuid 85795713-0df5-4e48-9812-5fdffaf36b34 path snapshot_x6p
ID 273 gen 1800 top level 5 parent_uuid 837b6d00-8a78-fa4f-b847-8828cd1e09fd uuid 1ea47aa6-f8f6-e447-94d6-5bf4aedb5616 path snapshot_ha2

I think you are missing the point.

The point is that we want to know that those snapshots belong to “subvolume1”.

Even though those snapshots carry the UUID that formerly belonged to subvolume1, there is no longer an association between subvolume1 and that UUID, so it is data that isn’t really very useful.

What you are describing here is exactly what I have described above. Snapshots have a relationship to their immediate parent.

I don’t think this is quite right. The subvolume that had that UUID was deleted. However, the children are still holding a reference to that now invalid UUID.

1 Like

I have a really surface level understanding of these filesystems.

But why isn’t there a registry of these deleted subvolumes and their UUIDs? Or at least a helper script.

My personal feeling is that while that would help a little, the real problem is that, architecturally, snapshots are separated from their original subvolumes.

I dabbled a little bit with it today, setting up a simple two-device SSD-cache-frontend/HDD-backend combo. Nothing important on it.

One noteworthy and very cool feature is that you can set cache/backend configs at the filesystem level. So you can, for example, say “the SSD should cache the HDD, but not this folder”. That’s very handy.
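
For anyone who wants to try something similar, a minimal sketch of a two-device format (the device names are placeholders, and the flags may differ between bcachefs-tools versions):

$ sudo bcachefs format \
    --label=ssd.ssd1 /dev/nvme0n1 \
    --label=hdd.hdd1 /dev/sda \
    --foreground_target=ssd \
    --promote_target=ssd \
    --background_target=hdd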

1 Like