Filipe Manana [Fri, 6 Dec 2019 11:13:31 +0000 (11:13 +0000)]
Btrfs: fix removal logic of the tree mod log that leads to use-after-free issues
When a tree mod log user no longer needs to use the tree it calls
btrfs_put_tree_mod_seq() to remove itself from the list of users and
delete all no longer used elements of the tree's red black tree, which
should be all elements with a sequence number less then our equals to
the caller's sequence number. However the logic is broken because it
can delete and free elements from the red black tree that have a
sequence number greater then the caller's sequence number:
1) At a point in time we have sequence numbers 1, 2, 3 and 4 in the
tree mod log;
2) The task which got assigned the sequence number 1 calls
btrfs_put_tree_mod_seq();
3) Sequence number 1 is deleted from the list of sequence numbers;
4) The current minimum sequence number is computed to be the sequence
number 2;
5) A task using sequence number 2 is at tree_mod_log_rewind() and gets
a pointer to one of its elements from the red black tree through
a call to tree_mod_log_search();
6) The task with sequence number 1 iterates the red black tree of tree
modification elements and deletes (and frees) all elements with a
sequence number less then or equals to 2 (the computed minimum sequence
number) - it ends up only leaving elements with sequence numbers of 3
and 4;
7) The task with sequence number 2 now uses the pointer to its element,
already freed by the other task, at __tree_mod_log_rewind(), resulting
in a use-after-free issue. When CONFIG_DEBUG_PAGEALLOC=y it produces
a trace like the following:
Fix this by making btrfs_put_tree_mod_seq() skip deletion of elements that
have a sequence number equals to the computed minimum sequence number, and
not just elements with a sequence number greater then that minimum.
Fixes: bd989ba359f2ac ("Btrfs: add tree modification log functions") Signed-off-by: Filipe Manana <fdmanana@suse.com>
(cherry picked from commit 58630a8c28664b8e826b2a096412107ca77d4c90)
Zygo Blaxell [Fri, 6 Dec 2019 16:46:26 +0000 (11:46 -0500)]
zygo: btrfs: more debugging from Qu
Mind to apply this snippet?
It's not a poking around fixes, but a debug output for:
- Which block group is being balanced, and when it finishes
- Which subvolume (including reloc tree) is being dropped and when it
finishes
It's going to hit all those bugs again, but with the debug patch, I can
see how impossible things happens.
After some discussion in SUSE internal IRC, it proves one nightmare:
- root->reloc_root assignment is not atomic
- fs_info->reloc_cntl assigment is not atomic
So despite that, would you like to test it on a single core VM or
disabling all other cores of a baremetal to see if it can be reproduced?
Qu Wenruo [Tue, 3 Dec 2019 06:42:52 +0000 (14:42 +0800)]
btrfs: relocation: Check cancel request after each data page read
When relocating a data extents with large large data extents, we spend
most of our time in relocate_file_extent_cluster() at stage "moving data
extents":
1) | btrfs_relocate_block_group [btrfs]() {
1) | relocate_file_extent_cluster [btrfs]() {
1) $ 6586769 us | }
1) + 18.260 us | relocate_file_extent_cluster [btrfs]();
1) + 15.770 us | relocate_file_extent_cluster [btrfs]();
1) $ 8916340 us | }
1) | btrfs_relocate_block_group [btrfs]() {
1) | relocate_file_extent_cluster [btrfs]() {
1) $ 11611586 us | }
1) + 16.930 us | relocate_file_extent_cluster [btrfs]();
1) + 15.870 us | relocate_file_extent_cluster [btrfs]();
1) $ 14986130 us | }
So to make data relocation cancelling quicker, here add extra balance
cancelling check after each page read in relocate_file_extent_cluster().
Qu Wenruo [Fri, 29 Nov 2019 04:40:59 +0000 (12:40 +0800)]
btrfs: relocation: Output current relocation stage at btrfs_relocate_block_group()
There are several reports of hanging relocation, populating the dmesg
with things like:
BTRFS info (device dm-5): found 1 extents
The investigation is still on going, but will never hurt to output a
little more info.
This patch will also output the current relocation stage, making that
output something like:
BTRFS info (device dm-5): balance: start -d -m -s
BTRFS info (device dm-5): relocating block group 30408704 flags metadata|dup
BTRFS info (device dm-5): found 2 extents, stage: move data extents
BTRFS info (device dm-5): relocating block group 22020096 flags system|dup
BTRFS info (device dm-5): found 1 extents, stage: move data extents
BTRFS info (device dm-5): relocating block group 13631488 flags data
BTRFS info (device dm-5): found 1 extents, stage: move data extents
BTRFS info (device dm-5): found 1 extents, stage: update data pointers
BTRFS info (device dm-5): balance: ended with status: 0
This patch will not increase the number of lines, but with extra info
for us to debug the reported problem.
(Although it's very likely the bug is sticking at "update data pointers"
stage, even without the patch)
Josef Bacik [Thu, 5 Dec 2019 14:38:18 +0000 (06:38 -0800)]
btrfs: make the extent buffer leak check per fs info
I'm going to make the entire destruction of btrfs_root's controlled by
their refcount, so it will be helpful to notice if we're leaking their
eb's on umount.
Josef Bacik [Fri, 22 Nov 2019 20:28:14 +0000 (12:28 -0800)]
btrfs: add a leak check for roots
Now that we're going to start relying on getting ref counting right for
roots, add a list to track allocated roots and print out any roots that
aren't free'd up at free_fs_info time. Hide this behind
CONFIG_BTRFS_DEBUG because this will just be used for developers to
verify they aren't breaking things.
Josef Bacik [Mon, 2 Dec 2019 17:09:14 +0000 (09:09 -0800)]
btrfs: move fs_info init work into it's own helper function
open_ctree mixes initialization of fs stuff and fs_info stuff, which
makes it confusing when doing things like adding the root leak
detection. Make a separate function that init's all the static
structures inside of the fs_info needed for the fs to operate, and then
call that before we start setting up the fs_info to be mounted.
Josef Bacik [Mon, 2 Dec 2019 16:56:06 +0000 (08:56 -0800)]
btrfs: free more things in btrfs_free_fs_info
Things like the percpu_counters, the mapping_tree, and the csum hash can
all be free'd at btrfs_free_fs_info time, since the helpers all check if
the structure has been init'ed already. This significantly cleans up
the error cases in open_ctree.
Josef Bacik [Wed, 20 Nov 2019 20:49:19 +0000 (12:49 -0800)]
btrfs: push btrfs_grab_fs_root into btrfs_get_fs_root
Now that all callers of btrfs_get_fs_root are subsequently calling
btrfs_grab_fs_root and handling dropping the ref when they are done
appropriately, go ahead and push btrfs_grab_fs_root up into
btrfs_get_fs_root.
Josef Bacik [Thu, 21 Nov 2019 21:47:53 +0000 (13:47 -0800)]
btrfs: hold a ref on the root in open_ctree
We lookup the fs_root and put it in our fs_info directly, we should hold
a ref on this root for the lifetime of the fs_info. Rework the free'ing
function so that it calls btrfs_put_fs_root() on the root at free time
with all of the other global trees.
Josef Bacik [Wed, 20 Nov 2019 19:57:47 +0000 (11:57 -0800)]
btrfs: hold a ref on the root in btrfs_recover_relocation
We look up the fs root in various places in here when recovering from a
crashed relcoation. Make sure we hold a ref on the root whenever we
look them up.
Josef Bacik [Wed, 20 Nov 2019 19:42:34 +0000 (11:42 -0800)]
btrfs: hold a ref on the root in build_backref_tree
This is trickier than the previous conversions. We have backref_node's
that need to hold onto their root for their lifetime. Do the read of
the root and grab the ref. If at any point we don't use the root we
discard it, however if we use it in our backref node we don't free it
until we free the backref node. Any time we switch the root's for the
backref node we need to drop our ref on the old root and grab the ref on
the new root, and if we dupe a node we need to get a ref on the root
there as well.
Josef Bacik [Wed, 20 Nov 2019 18:50:21 +0000 (10:50 -0800)]
btrfs: hold a ref on the root in btrfs_search_path_in_tree_user
We can wander into a different root, so grab a ref on the root we look
up. Later on we make root = fs_info->tree_root so we need this separate
out label to make sure we do the right cleanup only in the case we're
looking up a different root.
Josef Bacik [Wed, 20 Nov 2019 18:41:45 +0000 (10:41 -0800)]
btrfs: hold a ref on the root in search_ioctl
We lookup a arbitrary fs root, we need to hold a ref on that root. If
we're using our own inodes root then grab a ref on that as well to make
the cleanup easier.
Josef Bacik [Wed, 20 Nov 2019 15:47:14 +0000 (07:47 -0800)]
btrfs: hold a ref on fs roots while they're in the radix tree
If we're sitting in the radix tree, we should probably have a ref for
the radix tree. Grab a ref on the root when we insert it, and drop it
when it gets deleted.
Josef Bacik [Thu, 21 Nov 2019 16:21:25 +0000 (08:21 -0800)]
btrfs: handle NULL roots in btrfs_put/btrfs_grab_fs_root
We want to use this for dropping all roots, and in some error cases we
may not have a root, so handle this to make the cleanup code easier.
Make btrfs_grab_fs_root the same so we can use it in cases where the
root may not exist (like the quota root).
Josef Bacik [Wed, 20 Nov 2019 14:56:46 +0000 (06:56 -0800)]
btrfs: kill the btrfs_read_fs_root_no_name helper
All this does is call btrfs_get_fs_root() with check_ref == true. Just
use btrfs_get_fs_root() so we don't have a bunch of different helpers
that do the same thing.
Josef Bacik [Wed, 20 Nov 2019 14:37:00 +0000 (06:37 -0800)]
btrfs: make relocation use btrfs_read_tree_root()
Relocation has it's special roots, we don't want to save these in the
root cache either, so swap it to use btrfs_read_tree_roo(). However the
reloc root does need REF_COWS set, so make sure we set it everywhere we
use this helper, as it no longer does the REF_COWS setting.
Josef Bacik [Wed, 20 Nov 2019 14:33:49 +0000 (06:33 -0800)]
btrfs: export and use btrfs_read_tree_root
log-tree uses btrfs_read_fs_root to load its log, but this just calls
btrfs_read_tree_root. We don't save the log roots in our root cache, so
just export this helper and use it in the logging code.
Josef Bacik [Wed, 20 Nov 2019 14:24:27 +0000 (06:24 -0800)]
btrfs: make btrfs_find_orphan_roots use btrfs_get_fs_root
btrfs_find_orphan_roots has this weird thing where it looks up the root
in cache to see if it is there before just reading the root. But the
read it uses just reads the root, it doesn't do any of the init work, we
do that by hand here. But this is unnecessary, all we really want is to
see if the root still exists and add it to the dead roots list to be
cleaned up, otherwise we delete the orphan item.
Fix this by just using btrfs_get_fs_root directly with check_ref set to
false so we get the orphan root items. Then we just handle in cache and
out of cache roots the same, add them to the dead roots list and carry
on.
Josef Bacik [Wed, 20 Nov 2019 14:13:23 +0000 (06:13 -0800)]
btrfs: move fs root init stuff into btrfs_init_fs_root
We have a helper for reading fs roots that just reads the fs root off
the disk and then sets REF_COWS and init's the inheritable flags. Move
this into btrfs_init_fs_root so we can later get rid of this helper and
consolidate all of the fs root reading into one helper.
Josef Bacik [Tue, 26 Nov 2019 19:24:59 +0000 (11:24 -0800)]
btrfs: push __setup_root into btrfs_alloc_root
There's no reason to not init the root at alloc time, and with later
patches it actually causes problems if we error out mounting the fs
before the tree_root is init'ed because we expect it to have a valid ref
count. Fix this by pushing __setup_root into btrfs_alloc_root.
Josef Bacik [Fri, 22 Nov 2019 18:21:24 +0000 (10:21 -0800)]
btrfs: do not leak reloc root if we fail to read the fs root
If we fail to read the fs root corresponding with a reloc root we'll
just break out and free the reloc roots. But we remove our current
reloc_root from this list higher up, which means we'll leak this
reloc_root. Fix this by adding ourselves back to the reloc_roots list
so we are properly cleaned up.
Josef Bacik [Mon, 2 Dec 2019 19:12:13 +0000 (11:12 -0800)]
btrfs: skip log replay on orphaned roots
My fsstress modifications coupled with generic/475 uncovered a failure
to mount and replay the log if we hit a orphaned root. We do not want
to replay the log for an orphan root, but it's completely legitimate to
have an orphaned root with a log attached. Fix this by simply skipping
replaying the log. We still need to pin it's root node so that we do
not overwrite it while replaying other logs, as we re-read the log root
at every stage of the replay.
Josef Bacik [Wed, 4 Dec 2019 16:18:08 +0000 (08:18 -0800)]
btrfs: handle ENOENT in btrfs_uuid_tree_iterate
If we get an -ENOENT back from btrfs_uuid_iter_rem when iterating the
uuid tree we'll just continue and do btrfs_next_item(). However we've
done a btrfs_release_path() at this point and no longer have a valid
path. So increment the key and go back and do a normal search.
Josef Bacik [Wed, 4 Dec 2019 14:08:54 +0000 (06:08 -0800)]
btrfs: drop log root for dropped roots
If we fsync on a subvolume and create a log root for that volume, and
then later delete that subvolume we'll never clean up its log root. Fix
this by making switch_commit_roots free the log for any dropped roots we
encounter.
Josef Bacik [Tue, 19 Nov 2019 18:45:22 +0000 (10:45 -0800)]
btrfs: do not call synchronize_srcu() in inode_tree_del
Testing with the new fsstress uncovered a pretty nasty deadlock with
lookup and snapshot deletion.
Process A
unlink
-> final iput
-> inode_tree_del
-> synchronize_srcu(subvol_srcu)
Process B
btrfs_lookup <- srcu_read_lock() acquired here
-> btrfs_iget
-> find inode that has I_FREEING set
-> __wait_on_freeing_inode()
We're holding the srcu_read_lock() while doing the iget in order to make
sure our fs root doesn't go away, and then we are waiting for the inode
to finish freeing. However because the free'ing process is doing a
synchronize_srcu() we deadlock.
Fix this by dropping the synchronize_srcu() in inode_tree_del(). We
don't need people to stop accessing the fs root at this point, we're
only adding our empty root to the dead roots list.
A larger much more invasive fix is forthcoming to address how we deal
with fs roots, but this fixes the immediate problem.
Josef Bacik [Tue, 19 Nov 2019 16:13:31 +0000 (08:13 -0800)]
btrfs: don't double lock the subvol_sem
If we're rename exchanging two subvols we'll try to lock this lock
twice, which is bad. Just lock once if either of the ino's are subvols.
cc: stable@vger.kernel.org Fixes: cdd1fedf8261 ("btrfs: add support for RENAME_EXCHANGE and RENAME_WHITEOUT") Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Josef Bacik [Mon, 18 Nov 2019 16:19:30 +0000 (08:19 -0800)]
btrfs: handle error in btrfs_cache_block_group
We have a BUG_ON(ret < 0) in find_free_extent from
btrfs_cache_block_group. If we fail to allocate our ctl we'll just
panic, which is not good. Instead just go on to another block group.
If we fail to find a block group we don't want to return ENOSPC, because
really we got a ENOMEM and that's the root of the problem. Save our
return from btrfs_cache_block_group(), and then if we still fail to make
our allocation return that ret so we get the right error back.
Josef Bacik [Fri, 15 Nov 2019 20:43:06 +0000 (15:43 -0500)]
btrfs: do not corrupt the fs with rename exchange on a subvol
Testing with the new fsstress support for subvolumes uncovered a pretty
bad problem with rename exchange on subvolumes. We're modifying two
different subvolumes, but we only start the transaction on one of them,
so the other one is not added to the dirty root list. This is caught by
btrfs_cow_block() with a warning because the root has not been updated,
however if we do not modify this root again we'll end up pointing at an
invalid root because the root item is never updated.
Fix this by making sure we add the destination root to the trans list,
the same as we do with normal renames. This fixes the corruption.
Fixes: cdd1fedf8261 ("btrfs: add support for RENAME_EXCHANGE and RENAME_WHITEOUT") Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Filipe Manana <fdmanana@suse.com>
David Sterba [Tue, 29 Oct 2019 18:20:18 +0000 (19:20 +0100)]
btrfs: rename btrfs_block_group_cache
The type name is misleading, a single entry is named 'cache' while this
normally means a collection of objects. Rename that everywhere. Also the
identifier was quite long, making function prototypes harder to format.
Suggested-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Tue, 5 Nov 2019 01:35:35 +0000 (09:35 +0800)]
btrfs: block-group: Reuse the item key from caller of read_one_block_group()
For read_one_block_group(), its only caller has already got the item key
to search next block group item.
So we can use that key directly without doing our own convertion on
stack.
Also, since that key used in btrfs_read_block_groups() is vital for
block group item search, add 'const' keyword for that parameter to
prevent read_one_block_group() to modify it.
Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Thu, 10 Oct 2019 22:03:14 +0000 (00:03 +0200)]
btrfs: access eb::blocking_writers according to ACCESS_ONCE policies
A nice writeup of the LKMM (Linux Kernel Memory Model) rules for access
once policies can be found here
https://lwn.net/Articles/799218/#Access-Marking%20Policies .
The locked and unlocked access to eb::blocking_writers should be
annotated accordingly, following this:
Writes:
- locked write must use ONCE, may use plain read
- unlocked write must use ONCE
Reads:
- unlocked read must use ONCE
- locked read may use plain read iff not mixed with unlocked read
- unlocked read then locked must use ONCE
There's one difference on the assembly level, where
btrfs_tree_read_lock_atomic and btrfs_try_tree_read_lock used the cached
value and did not reevaluate it after taking the lock. This could have
missed some opportunities to take the lock in case blocking writers
changed between the calls, but the window is just a few instructions
long. As this is in try-lock, the callers handle that.
David Sterba [Thu, 10 Oct 2019 21:31:19 +0000 (23:31 +0200)]
btrfs: set blocking_writers directly, no increment or decrement
The increment and decrement was inherited from previous version that
used atomics, switched in commit 06297d8cefca ("btrfs: switch
extent_buffer blocking_writers from atomic to int"). The only possible
values are 0 and 1 so we can set them directly.
The generated assembly (gcc 9.x) did the direct value assignment in
btrfs_set_lock_blocking_write (asm diff after change in 06297d8cefca):
5d: test %eax,%eax
5f: je 62 <btrfs_set_lock_blocking_write+0x22>
61: retq
The part in btrfs_tree_unlock did a decrement because
BUG_ON(blockers > 1) is probably not a strong hint for the compiler, but
otherwise the output looks safe:
David Sterba [Thu, 10 Oct 2019 21:29:21 +0000 (23:29 +0200)]
btrfs: merge blocking_writers branches in btrfs_tree_read_lock
There are two ifs that use eb::blocking_writers. As this is a variable
modified inside and outside of locks, we could minimize number of
accesses to avoid problems with getting different results at different
times.
The access here is locked so this can only race with btrfs_tree_unlock
that sets blocking_writers to 0 without lock and unsets the lock owner.
The first branch is taken only if the same thread already holds the
lock, the second if checks for blocking writers. Here we'd either unlock
and wait, or proceed. Both are valid states of the locking protocol.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Thu, 31 Oct 2019 14:52:01 +0000 (15:52 +0100)]
btrfs: drop incompat bit for raid1c34 after last block group is gone
When there are no raid1c3 or raid1c4 block groups left after balance
(either convert or with other filters applied), remove the incompat bit.
This is already done for RAID56, do the same for RAID1C34.
David Sterba [Tue, 10 Jul 2018 16:15:05 +0000 (18:15 +0200)]
btrfs: add incompat for raid1 with 3, 4 copies
The new raid1c3 and raid1c4 profiles are backward incompatible and the
name shall be 'raid1c34', the status can be found in the global
supported features in /sys/fs/btrfs/features or in the per-filesystem
directory.
David Sterba [Fri, 2 Mar 2018 21:56:53 +0000 (22:56 +0100)]
btrfs: add support for 4-copy replication (raid1c4)
Add new block group profile to store 4 copies in a simliar way that
current RAID1 does. The profile attributes and constraints are defined
in the raid table and used by the same code that already handles the 2-
and 3-copy RAID1.
The minimum number of devices is 4, the maximum number of devices/chunks
that can be lost/damaged is 3. There is no comparable traditional RAID
level, the profile is added for future needs to accompany triple-parity
and beyond.
David Sterba [Fri, 2 Mar 2018 21:56:53 +0000 (22:56 +0100)]
btrfs: add support for 3-copy replication (raid1c3)
Add new block group profile to store 3 copies in a simliar way that
current RAID1 does. The profile attributes and constraints are defined
in the raid table and used by the same code that already handles the
2-copy RAID1.
The minimum number of devices is 3, the maximum number of devices/chunks
that can be lost/damaged is 2. Like RAID6 but with 33% space
utilization.
David Sterba [Tue, 29 Oct 2019 17:28:57 +0000 (18:28 +0100)]
btrfs: sink write flags to cow_file_range_async
In commit "Btrfs: use REQ_CGROUP_PUNT for worker thread submitted bios",
cow_file_range_async gained wbc as a parameter and this makes passing
write flags redundant. Set it inside the function and remove the
parameter.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Wed, 30 Oct 2019 12:23:01 +0000 (12:23 +0000)]
Btrfs: send, skip backreference walking for extents with many references
Backreference walking, which is used by send to figure if it can issue
clone operations instead of write operations, can be very slow and use
too much memory when extents have many references. This change simply
skips backreference walking when an extent has more than 64 references,
in which case we fallback to a write operation instead of a clone
operation. This limit is conservative and in practice I observed no
signicant slowdown with up to 100 references and still low memory usage
up to that limit.
This is a temporary workaround until there are speedups in the backref
walking code, and as such it does not attempt to add extra interfaces or
knobs to tweak the threshold.
Filipe Manana [Wed, 30 Oct 2019 12:23:11 +0000 (12:23 +0000)]
Btrfs: send, allow clone operations within the same file
For send we currently skip clone operations when the source and
destination files are the same. This is so because clone didn't support
this case in its early days, but support for it was added back in May
2013 by commit a96fbc72884fcb ("Btrfs: allow file data clone within a
file"). This change adds support for it.
Example:
$ mkfs.btrfs -f /dev/sdd
$ mount /dev/sdd /mnt/sdd
Without this change file foobar at the destination has a single 128Kb
extent:
$ filefrag -v /mnt/sde/snap/foobar
Filesystem type is: 9123683e
File size of /mnt/sde/snap/foobar is 131072 (32 blocks of 4096 bytes)
ext: logical_offset: physical_offset: length: expected: flags:
0: 0.. 31: 0.. 31: 32: last,unknown_loc,delalloc,eof
/mnt/sde/snap/foobar: 1 extent found
With this we get a single 64Kb extent that is shared at file offsets 0
and 64K, just like in the source filesystem:
$ filefrag -v /mnt/sde/snap/foobar
Filesystem type is: 9123683e
File size of /mnt/sde/snap/foobar is 131072 (32 blocks of 4096 bytes)
ext: logical_offset: physical_offset: length: expected: flags:
0: 0.. 15: 3328.. 3343: 16: shared
1: 16.. 31: 3328.. 3343: 16: 3344: last,shared,eof
/mnt/sde/snap/foobar: 2 extents found
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Wed, 23 Oct 2019 13:57:27 +0000 (21:57 +0800)]
btrfs: Ensure we trim ranges across block group boundary
[BUG]
When deleting large files (which cross block group boundary) with
discard mount option, we find some btrfs_discard_extent() calls only
trimmed part of its space, not the whole range:
type: bbio->map_type, in above case, it's SINGLE DATA.
start: Logical address of this trim
len: Logical length of this trim
trimmed: Physically trimmed bytes
ratio: trimmed / len
Thus leaving some unused space not discarded.
[CAUSE]
When discard mount option is specified, after a transaction is fully
committed (super block written to disk), we begin to cleanup pinned
extents in the following call chain:
However, pinned extents are recorded in an extent_io_tree, which can
merge adjacent extent states.
When a large file gets deleted and it has adjacent file extents across
block group boundary, we will get a large merged range like this:
|<--- BG1 --->|<--- BG2 --->|
|//////|<-- Range to discard --->|/////|
To discard that range, we have the following calls:
btrfs_discard_extent()
|- btrfs_map_block()
| Returned bbio will end at BG1's end. As btrfs_map_block()
| never returns result across block group boundary.
|- btrfs_issuse_discard()
Issue discard for each stripe.
So we will only discard the range in BG1, not the remaining part in BG2.
Furthermore, this bug is not that reliably observed, for above case, if
there is no other extent in BG2, BG2 will be empty and btrfs will trim
all space of BG2, covering up the bug.
[FIX]
- Allow __btrfs_map_block_for_discard() to modify @length parameter
btrfs_map_block() uses its @length paramter to notify the caller how
many bytes are mapped in current call.
With __btrfs_map_block_for_discard() also modifing the @length,
btrfs_discard_extent() now understands when to do extra trim.
- Call btrfs_map_block() in a loop until we hit the range end Since we
now know how many bytes are mapped each time, we can iterate through
each block group boundary and issue correct trim for each range.
Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>