From 9fd2a3f381162b1ce8a0b458962aaa9985a4acea Mon Sep 17 00:00:00 2001
From: Zygo Blaxell
Date: Thu, 7 Sep 2023 11:59:34 -0400
Subject: [PATCH] btrfs: fix max_stripe_size for regular (non-zoned) chunk
 allocation

In commit f6fca3917b4d ("btrfs: store chunk size in space-info struct"),
the num_bytes parameter given to find_free_dev_extent for data block
groups on non-zoned filesystems changed from SZ_1G to
space_info->chunk_size, which is 10 GiB by default.

This is a unit mismatch: max_stripe_size is measured in physical bytes
in the device address space (dup is an exception here, but that is
already handled in find_free_dev_extent), while space_info->chunk_size
is measured in logical bytes in the btrfs virtual address space.

The change results in some undesirable behavior from functions that use
alloc_ctl->max_stripe_size. Some of the problems have already been
addressed in commits such as 5da431b71d4b ("btrfs: fix the max chunk
size and stripe length calculation"), which restores the size of data
block groups to 1 GiB.

find_free_dev_extent will now search for 10 GiB dev_extent holes by
default to satisfy new chunk allocations, but the allocator will only
ever allocate the first 1 GiB of the space that is found. That mismatch
causes a regression for multi-device filesystems: if a small device has
large contiguous holes, it will steal allocations from large devices
whose holes are smaller but still 1 GiB or larger. The filesystem
allocates space sub-optimally and runs out of space too early.

Consider a filesystem using the raid1 profile with 3 devices. After
some balances, it has the following contiguous unallocated spaces:

	Device 1: 10x 1 GiB
	Device 2: 10x 1.01 GiB
	Device 3: 10x 1.01 GiB

Before kernel v6.0, the allocator would fill these optimally by
allocating:

	5 pairs of 1.0 GiB on devices 1,2
	5 pairs of 1.0 GiB on devices 2,3
	5 pairs of 1.0 GiB on devices 3,1
	10 pairs of 0.01 GiB on devices 2,3 (after all holes 1.0 GiB or
	longer are filled)

This allocates all of the unallocated space. After kernel 6.0, the
allocator prefers the larger extents over the emptier devices, so it
allocates all of those first:

	10 pairs of 1 GiB on devices 2,3
	10 pairs of 0.01 GiB on devices 1,2
	10 pairs of 0.01 GiB on devices 1,3
	9.8 GiB unallocated on device 1

and the filesystem is now full, with 9.8 GiB of space wasted.

To fix the regression, set ctl->max_stripe_size back to the original
SZ_1G, or to space_info->chunk_size if that is smaller, when allocating
chunks for regular (non-zoned) filesystems.

I'm pretty sure chunk_size shouldn't be settable on regular filesystems
at all. The size of a chunk is a computed value based on the number of
data devices and max_stripe_size, which has a hardcoded maximum of
1 GiB.

Being able to set max_stripe_size to values larger than 1 GiB, or to
the same values as the metadata and system chunk sizes, would be
useful: a larger, uniform stripe size could cut down on block group
counts on large filesystems and prevent explosive dev_extent
fragmentation when balancing filesystems that use striped profiles.
Setting a limit on the number of devices used in striped RAID chunks
could also be useful. These would have to be expressed as two separate
sysfs parameters, neither of which is chunk_size nor a value that can
be computed from chunk_size.
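The arithmetic above can be double-checked with a small userspace
sketch. This is hypothetical code, not the kernel allocator; it only
imitates the one detail that changed, namely whether devices are ranked
by the hole size reported for a 1 GiB request (pre-6.0) or for a 10 GiB
request (post-6.0), while the chunk actually allocated stays capped at
1 GiB. Under those assumptions it reproduces both outcomes: 15.1 GiB of
raid1 data with nothing left over, versus 10.2 GiB of data with 9.8 GiB
stranded on device 1.

/* Toy userspace model of the ranking change described above; this is a
 * hypothetical sketch, not kernel code.  Sizes are in units of 0.01 GiB. */
#include <stdio.h>

#define NDEV	3
#define NHOLES	10
#define CHUNK	100	/* the 1 GiB that is actually allocated per stripe */

struct toy_dev { int holes[NHOLES]; };	/* contiguous free extents */

static int largest_hole(const struct toy_dev *d)
{
	int i, best = 0;

	for (i = 0; i < NHOLES; i++)
		if (d->holes[i] > best)
			best = d->holes[i];
	return best;
}

static int total_free(const struct toy_dev *d)
{
	int i, sum = 0;

	for (i = 0; i < NHOLES; i++)
		sum += d->holes[i];
	return sum;
}

/* Carve @len units out of the largest hole of @d. */
static void carve(struct toy_dev *d, int len)
{
	int i, best = 0;

	for (i = 1; i < NHOLES; i++)
		if (d->holes[i] > d->holes[best])
			best = i;
	d->holes[best] -= len;
}

/* rank_cap is the dev_extent size the search is asked to find. */
static void simulate(const char *name, int rank_cap)
{
	struct toy_dev devs[NDEV];
	int i, allocated = 0;

	/* Device 1: ten 1.00 GiB holes; devices 2, 3: ten 1.01 GiB holes. */
	for (i = 0; i < NHOLES; i++) {
		devs[0].holes[i] = 100;
		devs[1].holes[i] = 101;
		devs[2].holes[i] = 101;
	}

	for (;;) {
		int key[NDEV], a = -1, b = -1, stripe = CHUNK;

		/* Rank devices by the hole size the search reports (capped at
		 * rank_cap), break ties by total free space, and take two. */
		for (i = 0; i < NDEV; i++) {
			key[i] = largest_hole(&devs[i]);
			if (key[i] > rank_cap)
				key[i] = rank_cap;
			if (key[i])
				key[i] = key[i] * 10000 + total_free(&devs[i]);
		}
		for (i = 0; i < NDEV; i++) {
			if (!key[i])
				continue;
			if (a < 0 || key[i] > key[a]) {
				b = a;
				a = i;
			} else if (b < 0 || key[i] > key[b]) {
				b = i;
			}
		}
		if (b < 0)
			break;	/* raid1 needs two devices with free space */

		/* Whatever the search asked for, at most 1 GiB is taken. */
		if (largest_hole(&devs[a]) < stripe)
			stripe = largest_hole(&devs[a]);
		if (largest_hole(&devs[b]) < stripe)
			stripe = largest_hole(&devs[b]);
		carve(&devs[a], stripe);
		carve(&devs[b], stripe);
		allocated += stripe;
	}

	printf("%s: %d.%02d GiB raid1 data, unallocated:", name,
	       allocated / 100, allocated % 100);
	for (i = 0; i < NDEV; i++)
		printf(" dev%d=%d.%02d", i + 1,
		       total_free(&devs[i]) / 100, total_free(&devs[i]) % 100);
	printf(" GiB\n");
}

int main(void)
{
	simulate("pre-6.0  (search asks for 1 GiB)", CHUNK);
	simulate("post-6.0 (search asks for 10 GiB)", 1000);
	return 0;
}

The pair-by-pair allocation order in this model differs from the
hand-worked sequence above because its tie-breaking is arbitrary, but
the end state is the same.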
Fixes: f6fca3917b4d ("btrfs: store chunk size in space-info struct")
Signed-off-by: Zygo Blaxell
(cherry picked from commit 6f2b19b175c0507e72da6c20f63e68da7387d394)
---
 fs/btrfs/volumes.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 422b26d92c1f..3bfb3d402f19 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5138,7 +5138,7 @@ static void init_alloc_chunk_ctl_policy_regular(
 	ASSERT(space_info);
 
 	ctl->max_chunk_size = READ_ONCE(space_info->chunk_size);
-	ctl->max_stripe_size = ctl->max_chunk_size;
+	ctl->max_stripe_size = min_t(u64, ctl->max_chunk_size, SZ_1G);
 
 	if (ctl->type & BTRFS_BLOCK_GROUP_SYSTEM)
 		ctl->devs_max = min_t(int, ctl->devs_max, BTRFS_MAX_DEVS_SYS_CHUNK);
-- 
2.39.5