From 9fd2a3f381162b1ce8a0b458962aaa9985a4acea Mon Sep 17 00:00:00 2001
From: Zygo Blaxell
Date: Thu, 7 Sep 2023 11:59:34 -0400
Subject: [PATCH] btrfs: fix max_stripe_size for regular (non-zoned) chunk
 allocation

In commit f6fca3917b4d ("btrfs: store chunk size in space-info struct"),
the num_bytes parameter given to find_free_dev_extent for data block
groups on non-zoned filesystems changed from SZ_1G to
space_info->chunk_size, which is 10 GiB by default.

This is a unit mismatch: max_stripe_size is measured in physical bytes
in the device address space (dup is an exception here, but that is
already handled in find_free_dev_extent), while space_info->chunk_size
is measured in logical bytes in the btrfs virtual address space.

The change results in some undesirable behavior from functions that use
alloc_ctl->max_stripe_size. Some of the problems have already been
addressed in commits such as 5da431b71d4b ("btrfs: fix the max chunk
size and stripe length calculation"), which restores the size of data
block groups to 1 GiB.

find_free_dev_extent will now search for 10 GiB dev_extent holes by
default to satisfy new chunk allocations, but the allocator will only
ever allocate the first 1 GiB of the space that is found. That mismatch
causes a regression for multi-device filesystems: if a small device has
large contiguous holes, it will steal allocations from large devices
whose holes are smaller but still 1 GiB or larger. The filesystem
allocates space sub-optimally and runs out of space too early.

Consider a filesystem using the raid1 profile with 3 devices. After
some balances, it has the following contiguous unallocated spaces:

	Device 1: 10x 1 GiB
	Device 2: 10x 1.01 GiB
	Device 3: 10x 1.01 GiB

Before kernel v6.0, the allocator would fill these optimally by
allocating:

	5 pairs of 1.0 GiB on devices 1,2
	5 pairs of 1.0 GiB on devices 2,3
	5 pairs of 1.0 GiB on devices 3,1
	10 pairs of 0.01 GiB on devices 2,3 (after all holes 1.0 GiB or
	longer are filled)

This allocates all of the unallocated space. After kernel 6.0, the
allocator prefers the larger extents over the emptier devices, so it
allocates all of those first:

	10 pairs of 1 GiB on devices 2,3
	10 pairs of 0.01 GiB on devices 1,2
	10 pairs of 0.01 GiB on devices 1,3
	9.8 GiB unallocated on device 1

and the filesystem is now full, with 9.8 GiB of space wasted.

To fix the regression, set ctl->max_stripe_size back to the original
SZ_1G, or to space_info->chunk_size if that is smaller, when allocating
chunks for regular (non-zoned) filesystems.

I'm pretty sure chunk_size shouldn't be settable on regular filesystems
at all. The size of a chunk is a computed value based on the number of
data devices and max_stripe_size, which has a hardcoded maximum of
1 GiB.

Being able to set max_stripe_size to values larger than 1 GiB, or to
the same values as the metadata and system chunk sizes, would be
useful: a larger, uniform stripe size could cut down on block group
counts on large filesystems and prevent explosive dev_extent
fragmentation when balancing filesystems that use striped profiles.
Setting a limit on the number of devices used in striped RAID chunks
could also be useful. These would have to be expressed as two separate
sysfs parameters, neither of which is chunk_size nor a value that can
be computed from chunk_size.
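The arithmetic above can be double-checked with a small userspace
sketch. This is hypothetical code, not the kernel allocator; it only
imitates the one detail that changed, namely whether devices are ranked
by the hole size reported for a 1 GiB request (pre-6.0) or for a 10 GiB
request (post-6.0), while the chunk actually allocated stays capped at
1 GiB. Under those assumptions it reproduces both outcomes: 15.1 GiB of
raid1 data with nothing left over, versus 10.2 GiB of data with 9.8 GiB
stranded on device 1.

/* Toy userspace model of the ranking change described above; this is a
 * hypothetical sketch, not kernel code.  Sizes are in units of 0.01 GiB. */
#include <stdio.h>

#define NDEV	3
#define NHOLES	10
#define CHUNK	100	/* the 1 GiB that is actually allocated per stripe */

struct toy_dev { int holes[NHOLES]; };	/* contiguous free extents */

static int largest_hole(const struct toy_dev *d)
{
	int i, best = 0;

	for (i = 0; i < NHOLES; i++)
		if (d->holes[i] > best)
			best = d->holes[i];
	return best;
}

static int total_free(const struct toy_dev *d)
{
	int i, sum = 0;

	for (i = 0; i < NHOLES; i++)
		sum += d->holes[i];
	return sum;
}

/* Carve @len units out of the largest hole of @d. */
static void carve(struct toy_dev *d, int len)
{
	int i, best = 0;

	for (i = 1; i < NHOLES; i++)
		if (d->holes[i] > d->holes[best])
			best = i;
	d->holes[best] -= len;
}

/* rank_cap is the dev_extent size the search is asked to find. */
static void simulate(const char *name, int rank_cap)
{
	struct toy_dev devs[NDEV];
	int i, allocated = 0;

	/* Device 1: ten 1.00 GiB holes; devices 2, 3: ten 1.01 GiB holes. */
	for (i = 0; i < NHOLES; i++) {
		devs[0].holes[i] = 100;
		devs[1].holes[i] = 101;
		devs[2].holes[i] = 101;
	}

	for (;;) {
		int key[NDEV], a = -1, b = -1, stripe = CHUNK;

		/* Rank devices by the hole size the search reports (capped at
		 * rank_cap), break ties by total free space, and take two. */
		for (i = 0; i < NDEV; i++) {
			key[i] = largest_hole(&devs[i]);
			if (key[i] > rank_cap)
				key[i] = rank_cap;
			if (key[i])
				key[i] = key[i] * 10000 + total_free(&devs[i]);
		}
		for (i = 0; i < NDEV; i++) {
			if (!key[i])
				continue;
			if (a < 0 || key[i] > key[a]) {
				b = a;
				a = i;
			} else if (b < 0 || key[i] > key[b]) {
				b = i;
			}
		}
		if (b < 0)
			break;	/* raid1 needs two devices with free space */

		/* Whatever the search asked for, at most 1 GiB is taken. */
		if (largest_hole(&devs[a]) < stripe)
			stripe = largest_hole(&devs[a]);
		if (largest_hole(&devs[b]) < stripe)
			stripe = largest_hole(&devs[b]);
		carve(&devs[a], stripe);
		carve(&devs[b], stripe);
		allocated += stripe;
	}

	printf("%s: %d.%02d GiB raid1 data, unallocated:", name,
	       allocated / 100, allocated % 100);
	for (i = 0; i < NDEV; i++)
		printf(" dev%d=%d.%02d", i + 1,
		       total_free(&devs[i]) / 100, total_free(&devs[i]) % 100);
	printf(" GiB\n");
}

int main(void)
{
	simulate("pre-6.0  (search asks for 1 GiB)", CHUNK);
	simulate("post-6.0 (search asks for 10 GiB)", 1000);
	return 0;
}

The pair-by-pair allocation order in this model differs from the
hand-worked sequence above because its tie-breaking is arbitrary, but
the end state is the same.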
Fixes: f6fca3917b4d ("btrfs: store chunk size in space-info struct")
Signed-off-by: Zygo Blaxell
(cherry picked from commit 6f2b19b175c0507e72da6c20f63e68da7387d394)
---
 fs/btrfs/volumes.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 422b26d92c1f..3bfb3d402f19 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5138,7 +5138,7 @@ static void init_alloc_chunk_ctl_policy_regular(
 	ASSERT(space_info);
 
 	ctl->max_chunk_size = READ_ONCE(space_info->chunk_size);
-	ctl->max_stripe_size = ctl->max_chunk_size;
+	ctl->max_stripe_size = min_t(u64, ctl->max_chunk_size, SZ_1G);
 
 	if (ctl->type & BTRFS_BLOCK_GROUP_SYSTEM)
 		ctl->devs_max = min_t(int, ctl->devs_max, BTRFS_MAX_DEVS_SYS_CHUNK);
-- 
2.39.5