"PSA: Linux 5.16 has major regression in btrfs causing extreme IO load"

On Linux 5.16, the btrfs file system may cause significant SSD wear if the autodefrag mount option is enabled.

Linux 5.16 saw some changes in the btrfs defragmentation code which introduced a regression that causes extreme write IO (and associated single-core 100% CPU load by a kernel thread named btrfs-cleaner). Apparently it’s defragmenting in an infinite loop.

It affects volumes mounted with -o ssd,autodefrag. autodefrag is off by default in btrfs itself, but some distribution installers add it to /etc/fstab.

This has the potential to wear out an SSD in a matter of weeks: on my Samsung PM981 Polaris 512GB this led to 188 TB of writes in 10 days or so. That’s several years of endurance gone. About 370 full drive overwrites.
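
The figures above can be sanity-checked with a bit of arithmetic. This is only a rough sketch: it assumes decimal units (1 TB = 1000 GB) and takes "10 days or so" as exactly 10 days. The function names are made up for illustration.

```python
def full_drive_writes(written_tb: float, capacity_gb: float) -> float:
    """How many times the whole drive was overwritten (decimal units assumed)."""
    return written_tb * 1000 / capacity_gb


def tb_per_day(written_tb: float, days: float) -> float:
    """Average write volume per day."""
    return written_tb / days


overwrites = full_drive_writes(188, 512)  # figures quoted in the post
daily = tb_per_day(188, 10)               # "10 days or so", taken as exactly 10
print(f"{overwrites:.0f} full drive overwrites, {daily:.1f} TB written per day")
```

That works out to roughly 367 overwrites, which matches the "about 370" quoted in the post.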

The workaround is to disable autodefrag until this is resolved: mount -o remount,noautodefrag / and change autodefrag to noautodefrag in your /etc/fstab.
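
The remount takes effect immediately, while the fstab change covers the next boot. For anyone who would rather script the fstab part, here is a minimal sketch that only transforms text and never touches the real file. The function name and the sample entries (including the UUIDs) are hypothetical, and it assumes the usual whitespace-separated fstab layout with the options in the fourth field.

```python
import re


def disable_autodefrag(fstab_text: str) -> str:
    """Return fstab text with autodefrag replaced by noautodefrag on btrfs entries."""
    result = []
    for line in fstab_text.splitlines(keepends=True):
        # Split on whitespace but keep the separators so the layout is preserved.
        parts = re.split(r"(\s+)", line)
        fields = [i for i, p in enumerate(parts) if p and not p.isspace()]
        is_entry = len(fields) >= 4 and not parts[fields[0]].startswith("#")
        if is_entry and parts[fields[2]] == "btrfs":
            opts = parts[fields[3]].split(",")
            opts = ["noautodefrag" if o == "autodefrag" else o for o in opts]
            parts[fields[3]] = ",".join(opts)
        result.append("".join(parts))
    return "".join(result)


# Hypothetical sample; a real /etc/fstab will look different.
sample = (
    "# <device> <mount point> <type> <options> <dump> <pass>\n"
    "UUID=1234-ABCD  /      btrfs  subvol=/@,noatime,ssd,autodefrag  0 0\n"
    "UUID=5678-EF01  /boot  vfat   defaults                          0 2\n"
)
print(disable_autodefrag(sample))
```

Review the output and keep a backup of /etc/fstab before writing anything back; a broken fstab can leave the system unbootable.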

Relevant Linux mailing list thread: https://lore.kernel.org/linux-btrfs/CAEwRaO4y3PPPUdwYjNDoB9m9CLzfd3DFFk2iK1X6OyyEWG5-mg@mail.gmail.com/

Original post: www.reddit.com

:clown_face: fs

Well, now I’m happy I was too lazy to switch to btrfs on Friday. Being lazy sometimes works out…

Thanks for posting this!
I have been watching the terminal output of sudo iotop -a on and off for the past couple of weeks.
I did notice an unusually high write rate associated with another btrfs process as well, namely btrfs-transaction (?).
I wanted to post about it on the forum and ask about it, but I didn’t get around to doing it.

Below is a screenshot of the terminal output of sudo iotop -a after 5 minutes of uptime with nothing but the system running:

Here is another one after 20 minutes of uptime, with Firefox (one tab) running in the background:

Does anyone know if this is also related to the above-mentioned regression?

I need to add that this is after I applied the workaround posted above.

Thx, I was looking for a similar tool on Linux for a long time! :+1:

I found this post from almost 5 years (!) ago:

So, it seems the issue with autodefrag and btrfs-transacti is not a new one?

I would appreciate it if forum mates with more knowledge about the ins and outs of btrfs could shed some light on this.

I checked sudo iotop -a, and indeed btrfs-transaction and btrfs-cleaner were there…
Then I removed ssd,autodefrag from the btrfs entry in fstab and rebooted, and now btrfs-transaction and btrfs-cleaner no longer show up…

Thirty seconds is the default btrfs commit interval; I assume this is what this user was observing and curious about. These issues are definitely new in 5.16.

I did change autodefrag to noautodefrag as mentioned in OP. After reboot I still got btrfs-transacti running and writing :thinking:

systemd-journald seems to do a whole lot of writing as was mentioned in the post I linked to above.

The proposed solution:

One thing that Fedora does, though I’m not sure about Ubuntu, is mark the /var/log/journal directory and all subdirectories and files with chattr +C. The ‘C’ attribute marks it as not copy-on-write, which means btrfs reads and writes it much as ext4 or xfs would.

Is this the thing that needs to be done then?
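
Before changing anything, it may be worth checking whether the attribute is already set. As far as I know, on btrfs the 'C' flag only applies to new or empty files, and files created in a directory inherit it from that directory. Below is a small sketch that reads one line of lsattr output and reports whether 'C' appears in the flags column. The function name and the sample lines are hypothetical, and the exact width of the flags column varies between systems.

```python
def has_nocow(lsattr_line: str) -> bool:
    """True if the flags column of one `lsattr` output line contains 'C' (No_COW)."""
    fields = lsattr_line.split(maxsplit=1)
    return bool(fields) and "C" in fields[0]


# Hypothetical lines in the shape printed by `lsattr -d /var/log/journal`.
print(has_nocow("---------------C------ /var/log/journal"))
print(has_nocow("---------------------- /var/log/journal"))
```

Only the first column is inspected, so a capital C somewhere in the path does not produce a false positive.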

There’s some good detail about the increased IO in the kernel mailing list thread.

There are two main issues:

  1. Defragmenting 1 byte files will cause an infinite loop;
  2. The algorithm looks for candidates more correctly than in 5.15, therefore identifying more opportunities for correct defragmentation.

Case 1 is obviously a bug. Case 2 is less of a bug, and an older BTRFS filesystem will see much more defragmentation activity because it hasn’t been defragmented fully since changes in 5.10 vastly reduced the scope of candidate data.

There are some patches in the mailing list thread that should be making their way into the kernel at some point; they aren’t present as of 5.15.4.

Does noautodefrag have the same effect as removing autodefrag from the fstab?

It will depend on the defaults. noautodefrag will make the option explicit.

It seems removing autodefrag is enough for me, thanks jonathon.

$ sudo mount | grep btrfs
/dev/sda2 on / type btrfs (rw,noatime,compress=zstd:3,ssd,discard=async,space_cache=v2,subvolid=256,subvol=/@)
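
The same check can be done programmatically. This is a minimal sketch that assumes the `device on mountpoint type fstype (options)` line format shown above; the function name is made up, and it works on text you pass in rather than calling mount itself.

```python
import re
from typing import List

# Matches lines such as: /dev/sda2 on / type btrfs (rw,noatime,...)
MOUNT_LINE = re.compile(r"^(\S+) on (.+) type (\S+) \((.*)\)$")


def btrfs_mounts_with_autodefrag(mount_output: str) -> List[str]:
    """Return mount points of btrfs filesystems whose options include autodefrag."""
    hits = []
    for line in mount_output.splitlines():
        m = MOUNT_LINE.match(line.strip())
        if not m:
            continue  # skip lines that are not in the expected format
        _device, mount_point, fs_type, options = m.groups()
        # Compare whole options so that noautodefrag is not mistaken for autodefrag.
        if fs_type == "btrfs" and "autodefrag" in options.split(","):
            hits.append(mount_point)
    return hits


# The line from the output above: autodefrag is absent, so nothing is reported.
line = (
    "/dev/sda2 on / type btrfs (rw,noatime,compress=zstd:3,ssd,"
    "discard=async,space_cache=v2,subvolid=256,subvol=/@)"
)
print(btrfs_mounts_with_autodefrag(line))
```

An empty list means no btrfs mount in the given text has autodefrag active.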

Don’t know.
I wonder how come in your case btrfs-transacti is also gone.
So if I interpret @jonathon correctly, both removing the option and making it explicit should have the same result. :thinking:

Thanks for the information, turned off autodefrag

If journald writes a lot of log data, I would look into why it writes so much. One can make a case for handling /var/log somewhat differently - and some installations do - but on a normal desktop system that’s more of an advanced optimization than a necessity. If you have never tinkered with btrfs subvolumes and have no other reason to do so, I would recommend not bothering.

The impact of this is more dramatic than the title suggests. From the Reddit post:

This has the potential to wear out an SSD in a matter of weeks: on my Samsung PM981 Polaris 512GB this led to 188 TB of writes in 10 days or so. That’s several years of endurance gone. 370 full drive overwrites.

Potential hardware damage and complete data loss. This should not go unnoticed by any EndeavourOS user. But I am afraid that many of them do not follow this forum.

I am wondering if there is any good way to fix all(!) EndeavourOS installations with a mandatory package. Or at least show a note in BIG LETTERS when a 5.16 kernel is installed or updated.

The only way is not to use :clown_face: fs, I’m afraid (since it has been known to have these kinds of problems from time to time). And even if there were some centralized backdoor in EndeavourOS (@joekamprad :shushing_face:), I doubt it would be a good idea to knowingly have such an ability, even if it were used with good intentions…