git.hungrycats.org Git

Btrfs: fix removal logic of the tree mod log that leads to use-after-free issues

When a tree mod log user no longer needs to use the tree it calls
btrfs_put_tree_mod_seq() to remove itself from the list of users and
delete all no longer used elements of the tree's red black tree, which
should be all elements with a sequence number less then our equals to
the caller's sequence number. However the logic is broken because it
can delete and free elements from the red black tree that have a
sequence number greater then the caller's sequence number:

1) At a point in time we have sequence numbers 1, 2, 3 and 4 in the
   tree mod log;

2) The task which got assigned the sequence number 1 calls
   btrfs_put_tree_mod_seq();

3) Sequence number 1 is deleted from the list of sequence numbers;

4) The current minimum sequence number is computed to be the sequence
   number 2;

5) A task using sequence number 2 is at tree_mod_log_rewind() and gets
   a pointer to one of its elements from the red black tree through
   a call to tree_mod_log_search();

6) The task with sequence number 1 iterates the red black tree of tree
   modification elements and deletes (and frees) all elements with a
   sequence number less then or equals to 2 (the computed minimum sequence
   number) - it ends up only leaving elements with sequence numbers of 3
   and 4;

7) The task with sequence number 2 now uses the pointer to its element,
   already freed by the other task, at __tree_mod_log_rewind(), resulting
   in a use-after-free issue. When CONFIG_DEBUG_PAGEALLOC=y it produces
   a trace like the following:

  [16804.546854] general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
  [16804.547451] CPU: 0 PID: 28257 Comm: pool Tainted: G        W         5.4.0-rc8-btrfs-next-51 #1
  [16804.548059] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
  [16804.548666] RIP: 0010:rb_next+0x16/0x50
  (...)
  [16804.550581] RSP: 0018:ffffb948418ef9b0 EFLAGS: 00010202
  [16804.551227] RAX: 6b6b6b6b6b6b6b6b RBX: ffff90e0247f6600 RCX: 6b6b6b6b6b6b6b6b
  [16804.551873] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff90e0247f6600
  [16804.552504] RBP: ffff90dffe0d4688 R08: 0000000000000001 R09: 0000000000000000
  [16804.553136] R10: ffff90dffa4a0040 R11: 0000000000000000 R12: 000000000000002e
  [16804.553768] R13: ffff90e0247f6600 R14: 0000000000001663 R15: ffff90dff77862b8
  [16804.554399] FS:  00007f4b197ae700(0000) GS:ffff90e036a00000(0000) knlGS:0000000000000000
  [16804.555039] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [16804.555683] CR2: 00007f4b10022000 CR3: 00000002060e2004 CR4: 00000000003606f0
  [16804.556336] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  [16804.556968] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  [16804.557583] Call Trace:
  [16804.558207]  __tree_mod_log_rewind+0xbf/0x280 [btrfs]
  [16804.558835]  btrfs_search_old_slot+0x105/0xd00 [btrfs]
  [16804.559468]  resolve_indirect_refs+0x1eb/0xc70 [btrfs]
  [16804.560087]  ? free_extent_buffer.part.19+0x5a/0xc0 [btrfs]
  [16804.560700]  find_parent_nodes+0x388/0x1120 [btrfs]
  [16804.561310]  btrfs_check_shared+0x115/0x1c0 [btrfs]
  [16804.561916]  ? extent_fiemap+0x59d/0x6d0 [btrfs]
  [16804.562518]  extent_fiemap+0x59d/0x6d0 [btrfs]
  [16804.563112]  ? __might_fault+0x11/0x90
  [16804.563706]  do_vfs_ioctl+0x45a/0x700
  [16804.564299]  ksys_ioctl+0x70/0x80
  [16804.564885]  ? trace_hardirqs_off_thunk+0x1a/0x20
  [16804.565461]  __x64_sys_ioctl+0x16/0x20
  [16804.566020]  do_syscall_64+0x5c/0x250
  [16804.566580]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
  [16804.567153] RIP: 0033:0x7f4b1ba2add7
  (...)
  [16804.568907] RSP: 002b:00007f4b197adc88 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
  [16804.569513] RAX: ffffffffffffffda RBX: 00007f4b100210d8 RCX: 00007f4b1ba2add7
  [16804.570133] RDX: 00007f4b100210d8 RSI: 00000000c020660b RDI: 0000000000000003
  [16804.570726] RBP: 000055de05a6cfe0 R08: 0000000000000000 R09: 00007f4b197add44
  [16804.571314] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f4b197add48
  [16804.571905] R13: 00007f4b197add40 R14: 00007f4b100210d0 R15: 00007f4b197add50
  (...)
  [16804.575623] ---[ end trace 87317359aad4ba50 ]---

Fix this by making btrfs_put_tree_mod_seq() skip deletion of elements that
have a sequence number equals to the computed minimum sequence number, and
not just elements with a sequence number greater then that minimum.

Fixes: bd989ba359f2ac ("Btrfs: add tree modification log functions")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
(cherry picked from commit 58630a8c28664b8e826b2a096412107ca77d4c90)

Revert "Btrfs: do not start a transaction at iterate_extent_inodes()"

This reverts commit bfc61c36260ca990937539cd648ede3cd749bc10.

(cherry picked from commit 9ad6e14f3f1ad6a90c8530b8c8f925e99dd4062f)

zygo: btrfs: more debugging from Qu

Mind to apply this snippet?

It's not a poking around fixes, but a debug output for:
- Which block group is being balanced, and when it finishes
- Which subvolume (including reloc tree) is being dropped and when it
finishes

It's going to hit all those bugs again, but with the debug patch, I can
see how impossible things happens.

After some discussion in SUSE internal IRC, it proves one nightmare:
- root->reloc_root assignment is not atomic
- fs_info->reloc_cntl assigment is not atomic

So despite that, would you like to test it on a single core VM or
disabling all other cores of a baremetal to see if it can be reproduced?

Thanks,
Qu

(cherry picked from commit 77fc783e4d70918b453c7f2ba03f5d18d3a31ba3)

zygo: btrfs: get rid of this WARN_ON, it just triggers useless KASANs and slab-oobs

(cherry picked from commit e211f1a58b80b8b752f0e6054af945d20f1a2f86)

zygo: btrfs: get rid of this WARN_ON, it just triggers useless KASANs

(cherry picked from commit 5ff6289d3448990376cff625882caff854f13363)

btrfs: relocation: Introduce error injection points for cancelling balance

Introduce a new error injection point, should_cancel_balance().

It's just a wrapper of atomic_read(&fs_info->balance_cancel_req), but
allows us to override the return value.

Currently there are only one locations using this function:
- btrfs_balance()
It checks cancel before each block group.

There are other locations checking fs_info->balance_cancel_req, but they
are not used as an indicator to exit, so there is no need to use the
wrapper.

But there will be more locations coming, and some locations can cause
kernel panic if not handled properly.

So introduce this error injection to provide better test interface.

Signed-off-by: Qu Wenruo <wqu@suse.com>
(cherry picked from commit 6d2751364569f885ca795e37882690a78ac9f149)

btrfs: relocation: Work around dead relocation stage loop

There are some reports of dead relocation stage loop, where dmesg is
flooded by "Found X extents".

The root cause of it is still uncertain, but we can work around such bug
by checking cancelling request so user can at least cancel such dead
loop.

Signed-off-by: Qu Wenruo <wqu@suse.com>
(cherry picked from commit 51e1987039e8138615d7da1e52aa53a3fb332612)

btrfs: relocation: Check cancel request after each extent found

When relocating data block groups with tons of small extents, or
large metadata block groups, there can be over 200,000 extents.

We will iterate all extents of such block group in relocate_block_group(),
where iteration itself can be kinda time-consuming.

So when user want to cancel the balance, the extent iteration loop can
be another target.

This patch will add the cancelling check in the extent iteration loop of
relocate_block_group() to make balance cancelling faster.

Signed-off-by: Qu Wenruo <wqu@suse.com>
(cherry picked from commit 400faee53af2f9f1d0d4a0e0f92ea3d08fe53d59)

btrfs: relocation: Check cancel request after each data page read

When relocating a data extents with large large data extents, we spend
most of our time in relocate_file_extent_cluster() at stage "moving data
extents":
1)               |  btrfs_relocate_block_group [btrfs]() {
1)               |    relocate_file_extent_cluster [btrfs]() {
1) $ 6586769 us  |    }
1) + 18.260 us   |    relocate_file_extent_cluster [btrfs]();
1) + 15.770 us   |    relocate_file_extent_cluster [btrfs]();
1) $ 8916340 us  |  }
1)               |  btrfs_relocate_block_group [btrfs]() {
1)               |    relocate_file_extent_cluster [btrfs]() {
1) $ 11611586 us |    }
1) + 16.930 us   |    relocate_file_extent_cluster [btrfs]();
1) + 15.870 us   |    relocate_file_extent_cluster [btrfs]();
1) $ 14986130 us |  }

So to make data relocation cancelling quicker, here add extra balance
cancelling check after each page read in relocate_file_extent_cluster().

Signed-off-by: Qu Wenruo <wqu@suse.com>
(cherry picked from commit 34bb874930d1d6fecec7927101041884fe695aa7)

zygo: btrfs: check for DEAD_RELOC_TREE in fs/btrfs/relocation.c

(cherry picked from commit de97fa43b71a6515cb574419e97c921878547752)
(cherry picked from commit 92038017ebf650b7dacce333ff7e6b91d296997c)

btrfs: relocation: Output current relocation stage at btrfs_relocate_block_group()

There are several reports of hanging relocation, populating the dmesg
with things like:
  BTRFS info (device dm-5): found 1 extents

The investigation is still on going, but will never hurt to output a
little more info.

This patch will also output the current relocation stage, making that
output something like:

  BTRFS info (device dm-5): balance: start -d -m -s
  BTRFS info (device dm-5): relocating block group 30408704 flags metadata|dup
  BTRFS info (device dm-5): found 2 extents, stage: move data extents
  BTRFS info (device dm-5): relocating block group 22020096 flags system|dup
  BTRFS info (device dm-5): found 1 extents, stage: move data extents
  BTRFS info (device dm-5): relocating block group 13631488 flags data
  BTRFS info (device dm-5): found 1 extents, stage: move data extents
  BTRFS info (device dm-5): found 1 extents, stage: update data pointers
  BTRFS info (device dm-5): balance: ended with status: 0

This patch will not increase the number of lines, but with extra info
for us to debug the reported problem.
(Although it's very likely the bug is sticking at "update data pointers"
stage, even without the patch)

Signed-off-by: Qu Wenruo <wqu@suse.com>
(cherry picked from commit b04bb3a0ee5916885cb989f904d0168ac181e649)

zygo: btrfs: reloc_root might be NULL, adapt, #2

(cherry picked from commit 3acb2fb64807af3fddc5a5bccb86449b6d2bbac3)

zygo: btrfs: reloc_root might be NULL, adapt

(cherry picked from commit 33ef2d3ce4c1830291a9bcbee499230b1d5937e8)

zygo: btrfs: check for dead roots

(cherry picked from commit 1d9c5ddd8e5e71c9aeeca8cc516ed713ce0281b9)

zygo: btrfs: add some WARN_ONs to chase KASAN stack traces

(cherry picked from commit 7bc46196ca4d31da287f151b905babbdb429c444)

zygo: config: add KASAN and page poisoning

(cherry picked from commit 31ab8afa446ef1504d95ec2cb8997f717ea059f3)

zygo: move VIRTIO_FS to builtin

(cherry picked from commit 4319e8c0671717369d9e266760defc384228f40a)

zygo: make oldconfig for v5.4

(cherry picked from commit cb3c402a018565579c86dec3f4c8249edabc37a7)

zygo: add 'config-kernel' script to run 'make oldconfig' with KCONFIG_NOTIMESTAMP

(cherry picked from commit 4cc38d103bc42929ef42f3df5a344b27ab3fda45)

zygo: make oldconfig for v5.3.11

(cherry picked from commit 611173fe9f440ddc1e685590a1d9ae6605c7feff)
(cherry picked from commit 6e759716c74585121d6bf5d16196f3595d499207)
(cherry picked from commit 78f29c5cce696e0bce8eac427295c5a3baecec4f)

zygo: make oldconfig for v5.3.0

(cherry picked from commit 1c95c736e902a7000cf56e54a601ec9c582eceeb)

zygo: make oldconfig for 5.3.11

(cherry picked from commit 40d85aaa82b74e98ae431b0f861e918a4f9f4162)
(cherry picked from commit 72baa51e35dc3560c14af9248b19f11533651c27)

zygo: silence WARN_ON(!rwsem_is_locked(&sb->s_umount));

(cherry picked from commit f4220b603990b1404d138ba3e99af7992d6def82)
(cherry picked from commit df85af7244c128d115ff46a669e51f3923ede522)
(cherry picked from commit 6fa149dea2f57bff5c5c4416aa42f8089e743d17)
(cherry picked from commit 056b77c2bd566a07e9c4dcb12c1304d1f01d63c5)
(cherry picked from commit 69602a72b10db4dc103c977629557ba9526db6ad)

Revert "docs: Makefile: remove no-ops targets"

This reverts commit 18afab8c1d3c2a463eece561e9f15a1704b5eff9.

(cherry picked from commit 697451f51e136db9a8d373726701ffbe2440ecf0)
(cherry picked from commit 250887f1296220d5dd87eb312de6bbe9083d8c18)

zygo: btrfs: get a call trace when we hit ghost parent transid verify failures

(cherry picked from commit 5abbed1af5570f1317f31736e3862e8b7df1ca8b)
(cherry picked from commit bdfb7a5429475b12a8a1c817b4e9e452c1fc9bea)
(cherry picked from commit abe0b4452dd22410eeda535bcfab25f5a6e8b6de)
(cherry picked from commit 983d9f3319e1e38dd85bcc66cfc04fdbf7627fc0)

zygo: btrfs: set super_num_devices

(cherry picked from commit ac5c7d500c60c460ebc659bb794c2b46c1a31497)
(cherry picked from commit 253cbcb7e19750240362a156c517b6213b958d4d)
(cherry picked from commit 8fa40c4f82c637969b2bac80ef5c524017ce054e)
(cherry picked from commit edf2d8af02e7fb727b466e1018b5c5d2fc48f5a0)
(cherry picked from commit 62f701457f2a1f4cad83353cde134caefbb1a354)

zygo: from zygo-5.0.x-zb64

zygo: from zygo-4.17.x-zb64

(cherry picked from commit c45b6fb9ce9315c056288bfb5a755fd7eec9a197)

zygo: make oldconfig for v4.18

(cherry picked from commit cc3d6d815e6415c1157435e3f63270e4137b6608)
(cherry picked from commit f44fa222fc6a7df75b02a77e1d663bd8bee67fc5)

zygo: make oldconfig for 4.19.1

(cherry picked from commit 9c6ad550267932f5a4992c5d1f46980793c6a9a6)

zygo: make oldconfig for 4.19.7

(cherry picked from commit 002c486977a9a83a1e1385ec6de2746668de02f9)
(cherry picked from commit 63f6b068f316598eade19e31a1095ad7dbc700ba)

zygo: make oldconfig for v4.20

(cherry picked from commit 9a1ea99f7eada69738307d2898161a82e27310e2)
(cherry picked from commit db51f3f667ac3d5fa0851e4a49fce64f2a46ebf6)
(cherry picked from commit e0d2aa5082fcad29770f5786ae2a41fd100bdcde)
(cherry picked from commit d117937ed23d27b41faf7a4c2d16ebe463459d21)
(cherry picked from commit fedf2d15fd89d82cd73d431cae24f9318be6f869)
(cherry picked from commit d70c0dcaa12a9f5e1f483367b5798136a2d4dff5)

btrfs: make the extent buffer leak check per fs info

I'm going to make the entire destruction of btrfs_root's controlled by
their refcount, so it will be helpful to notice if we're leaking their
eb's on umount.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>

btrfs: rename btrfs_put_fs_root and btrfs_grab_fs_root

We are now using these for all roots, rename them to btrfs_put_root()
and btrfs_grab_root();

Signed-off-by: Josef Bacik <josef@toxicpanda.com>

btrfs: add a leak check for roots

Now that we're going to start relying on getting ref counting right for
roots, add a list to track allocated roots and print out any roots that
aren't free'd up at free_fs_info time. Hide this behind
CONFIG_BTRFS_DEBUG because this will just be used for developers to
verify they aren't breaking things.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>

btrfs: move fs_info init work into it's own helper function

open_ctree mixes initialization of fs stuff and fs_info stuff, which
makes it confusing when doing things like adding the root leak
detection. Make a separate function that init's all the static
structures inside of the fs_info needed for the fs to operate, and then
call that before we start setting up the fs_info to be mounted.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>

btrfs: free more things in btrfs_free_fs_info

Things like the percpu_counters, the mapping_tree, and the csum hash can
all be free'd at btrfs_free_fs_info time, since the helpers all check if
the structure has been init'ed already. This significantly cleans up
the error cases in open_ctree.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>

btrfs: push btrfs_grab_fs_root into btrfs_get_fs_root

Now that all callers of btrfs_get_fs_root are subsequently calling
btrfs_grab_fs_root and handling dropping the ref when they are done
appropriately, go ahead and push btrfs_grab_fs_root up into
btrfs_get_fs_root.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>

btrfs: use btrfs_put_fs_root to free roots always

If we are going to track leaked roots we need to free them all the same
way, so don't kfree() roots directly, use btrfs_put_fs_root.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>

btrfs: hold a ref on the root in open_ctree

We lookup the fs_root and put it in our fs_info directly, we should hold
a ref on this root for the lifetime of the fs_info. Rework the free'ing
function so that it calls btrfs_put_fs_root() on the root at free time
with all of the other global trees.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>

btrfs: hold a ref on the root in btrfs_check_uuid_tree_entry

We lookup the uuid of arbitrary subvolumes, hold a ref on the root while
we're doing this.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>

btrfs: hold a ref on the root in btrfs_recover_log_trees

We replay the log into arbitrary fs roots, hold a ref on the root while
we're doing this.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>

btrfs: hold a ref on the root in create_pending_snapshot

We create the snapshot and then use it for a bunch of things, we need to
hold a ref on it while we're messing with it.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>

btrfs: hold a ref on the root in get_subvol_name_from_objectid

We lookup the name of a subvol which means we'll cross into different
roots. Hold a ref while we're doing the look ups in the fs_root we're
searching.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>

btrfs: hold a ref on the root in btrfs_ioctl_send

We lookup all the clone roots and the parent root for send, so we need
to hold refs on all of these roots while we're processing them.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>

btrfs: hold a ref on the root in scrub_print_warning_inode

We look up the root for the bytenr that is failing, so we need to hold a
ref on the root for that operation.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>

btrfs: hold a ref for the root in btrfs_find_orphan_roots

We lookup roots for every orphan item we have, we need to hold a ref on
the root while we're doing this work.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>

btrfs: push grab_fs_root into read_fs_root

All of relocation uses read_fs_root to lookup fs roots, so push the
btrfs_grab_fs_root() up into that helper and remove the individual
calls.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>

btrfs: hold a ref on the root in btrfs_recover_relocation

We look up the fs root in various places in here when recovering from a
crashed relcoation. Make sure we hold a ref on the root whenever we
look them up.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>

btrfs: hold a ref on the root in create_reloc_inode

We're creating a reloc inode in the data reloc tree, we need to hold a
ref on the root while we're doing that.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>

btrfs: hold a ref on the root in find_data_references

We're looking up the data references for the bytenr in a root, we need
to hold a ref on that root while we're doing that.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>

btrfs: hold a ref on the root in record_reloc_root_in_trans

We are recording this root in the transaction, so we need to hold a ref
on it until we do that.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>

btrfs: hold a ref on the root in merge_reloc_roots

We look up the corresponding root for the reloc root, we need to hold a
ref while we're messing with it.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>

btrfs: hold a ref on the root in prepare_to_merge

We look up the reloc roots corresponding root, we need to hold a ref on
that root.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>

btrfs: hold a ref on the root in build_backref_tree

This is trickier than the previous conversions.  We have backref_node's
that need to hold onto their root for their lifetime.  Do the read of
the root and grab the ref.  If at any point we don't use the root we
discard it, however if we use it in our backref node we don't free it
until we free the backref node.  Any time we switch the root's for the
backref node we need to drop our ref on the old root and grab the ref on
the new root, and if we dupe a node we need to get a ref on the root
there as well.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>

btrfs: hold ref on root in btrfs_ioctl_default_subvol

We look up an arbitrary fs root here, we need to hold a ref on the root
for the duration.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>

btrfs: hold a ref on the root in btrfs_ioctl_get_subvol_info

We look up whatever root userspace has given us, we need to hold a ref
throughout this operation.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>

btrfs: hold a ref on the root in btrfs_search_path_in_tree_user

We can wander into a different root, so grab a ref on the root we look
up. Later on we make root = fs_info->tree_root so we need this separate
out label to make sure we do the right cleanup only in the case we're
looking up a different root.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>

btrfs: hold a ref on the root in btrfs_search_path_in_tree

We look up an arbitrary fs root, we need to hold a ref on it while we're
doing our search.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>

btrfs: hold a ref on the root in search_ioctl

We lookup a arbitrary fs root, we need to hold a ref on that root. If
we're using our own inodes root then grab a ref on that as well to make
the cleanup easier.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>

btrfs: hold a ref on the root in create_subvol

We're creating the new root here, but we should hold the ref until after
we've initialized the inode for it.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>

btrfs: hold a ref on the root in fixup_tree_root_location

Looking up the inode from an arbitrary tree means we need to hold a ref
on that root.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>

btrfs: grab a ref on the root in relink_extent_backref

We're doing an arbitrary lookup on an fs_root, we need to hold a ref on
that root.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>

btrfs: hold a ref for the root in record_one_backref

We're looking up in an arbitrary root, we need to hold a ref on that
root.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>

btrfs: hold a ref on the root in __btrfs_run_defrag_inode

We are looking up an arbitrary inode, we need to hold a ref on the root
while we're doing this.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>

btrfs: hold a root ref in btrfs_get_dentry

Looking up the inode we need to search the root, make sure we hold a
reference on that root while we're doing the lookup.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>

btrfs: hold a ref on the root in resolve_indirect_ref

We're looking up a random root, we need to hold a ref on it while we're
using it.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>

btrfs: hold a ref on fs roots while they're in the radix tree

If we're sitting in the radix tree, we should probably have a ref for
the radix tree. Grab a ref on the root when we insert it, and drop it
when it gets deleted.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>

btrfs: handle NULL roots in btrfs_put/btrfs_grab_fs_root

We want to use this for dropping all roots, and in some error cases we
may not have a root, so handle this to make the cleanup code easier.
Make btrfs_grab_fs_root the same so we can use it in cases where the
root may not exist (like the quota root).

Signed-off-by: Josef Bacik <josef@toxicpanda.com>

btrfs: make the fs root init functions static

Now that the orphan cleanup stuff doesn't use this directly we can just
make them static.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>

btrfs: kill the btrfs_read_fs_root_no_name helper

All this does is call btrfs_get_fs_root() with check_ref == true. Just
use btrfs_get_fs_root() so we don't have a bunch of different helpers
that do the same thing.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>

btrfs: kill btrfs_read_fs_root

All helpers should either be using btrfs_get_fs_root() or
btrfs_read_tree_root().

Signed-off-by: Josef Bacik <josef@toxicpanda.com>

btrfs: make relocation use btrfs_read_tree_root()

Relocation has it's special roots, we don't want to save these in the
root cache either, so swap it to use btrfs_read_tree_roo(). However the
reloc root does need REF_COWS set, so make sure we set it everywhere we
use this helper, as it no longer does the REF_COWS setting.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>

btrfs: export and use btrfs_read_tree_root

log-tree uses btrfs_read_fs_root to load its log, but this just calls
btrfs_read_tree_root. We don't save the log roots in our root cache, so
just export this helper and use it in the logging code.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>

btrfs: make btrfs_find_orphan_roots use btrfs_get_fs_root

btrfs_find_orphan_roots has this weird thing where it looks up the root
in cache to see if it is there before just reading the root.  But the
read it uses just reads the root, it doesn't do any of the init work, we
do that by hand here.  But this is unnecessary, all we really want is to
see if the root still exists and add it to the dead roots list to be
cleaned up, otherwise we delete the orphan item.

Fix this by just using btrfs_get_fs_root directly with check_ref set to
false so we get the orphan root items.  Then we just handle in cache and
out of cache roots the same, add them to the dead roots list and carry
on.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>

btrfs: move fs root init stuff into btrfs_init_fs_root

We have a helper for reading fs roots that just reads the fs root off
the disk and then sets REF_COWS and init's the inheritable flags. Move
this into btrfs_init_fs_root so we can later get rid of this helper and
consolidate all of the fs root reading into one helper.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>

btrfs: push __setup_root into btrfs_alloc_root

There's no reason to not init the root at alloc time, and with later
patches it actually causes problems if we error out mounting the fs
before the tree_root is init'ed because we expect it to have a valid ref
count. Fix this by pushing __setup_root into btrfs_alloc_root.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>

btrfs: do not leak reloc root if we fail to read the fs root

If we fail to read the fs root corresponding with a reloc root we'll
just break out and free the reloc roots. But we remove our current
reloc_root from this list higher up, which means we'll leak this
reloc_root. Fix this by adding ourselves back to the reloc_roots list
so we are properly cleaned up.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>

btrfs: skip log replay on orphaned roots

My fsstress modifications coupled with generic/475 uncovered a failure
to mount and replay the log if we hit a orphaned root.  We do not want
to replay the log for an orphan root, but it's completely legitimate to
have an orphaned root with a log attached.  Fix this by simply skipping
replaying the log.  We still need to pin it's root node so that we do
not overwrite it while replaying other logs, as we re-read the log root
at every stage of the replay.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>

btrfs: handle ENOENT in btrfs_uuid_tree_iterate

If we get an -ENOENT back from btrfs_uuid_iter_rem when iterating the
uuid tree we'll just continue and do btrfs_next_item(). However we've
done a btrfs_release_path() at this point and no longer have a valid
path. So increment the key and go back and do a normal search.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>

btrfs: don't BUG_ON in create_subvol

We can just abort the transaction here, and in fact do that for every
other failure in this function except these two cases.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>

btrfs: drop log root for dropped roots

If we fsync on a subvolume and create a log root for that volume, and
then later delete that subvolume we'll never clean up its log root. Fix
this by making switch_commit_roots free the log for any dropped roots we
encounter.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>

btrfs: do not call synchronize_srcu() in inode_tree_del

Testing with the new fsstress uncovered a pretty nasty deadlock with
lookup and snapshot deletion.

Process A
unlink
-> final iput
   -> inode_tree_del
     -> synchronize_srcu(subvol_srcu)

Process B
btrfs_lookup  <- srcu_read_lock() acquired here
  -> btrfs_iget
    -> find inode that has I_FREEING set
      -> __wait_on_freeing_inode()

We're holding the srcu_read_lock() while doing the iget in order to make
sure our fs root doesn't go away, and then we are waiting for the inode
to finish freeing.  However because the free'ing process is doing a
synchronize_srcu() we deadlock.

Fix this by dropping the synchronize_srcu() in inode_tree_del().  We
don't need people to stop accessing the fs root at this point, we're
only adding our empty root to the dead roots list.

A larger much more invasive fix is forthcoming to address how we deal
with fs roots, but this fixes the immediate problem.

Fixes: 76dda93c6ae2 ("Btrfs: add snapshot/subvolume destroy ioctl")
Signed-off-by: Josef Bacik <josef@toxicpanda.com>

btrfs: don't double lock the subvol_sem

If we're rename exchanging two subvols we'll try to lock this lock
twice, which is bad. Just lock once if either of the ino's are subvols.

cc: stable@vger.kernel.org
Fixes: cdd1fedf8261 ("btrfs: add support for RENAME_EXCHANGE and RENAME_WHITEOUT")
Signed-off-by: Josef Bacik <josef@toxicpanda.com>

btrfs: handle error in btrfs_cache_block_group

We have a BUG_ON(ret < 0) in find_free_extent from
btrfs_cache_block_group.  If we fail to allocate our ctl we'll just
panic, which is not good.  Instead just go on to another block group.
If we fail to find a block group we don't want to return ENOSPC, because
really we got a ENOMEM and that's the root of the problem.  Save our
return from btrfs_cache_block_group(), and then if we still fail to make
our allocation return that ret so we get the right error back.

Tested with inject-error.py from bcc.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>

Revert "btrfs: drop bio_set_dev where not needed"

This reverts commit 9016e5bc872d52f56093fcb09cca965d7124853b.

Revert "btrfs: remove extent_map::bdev"

This reverts commit 1b58ca8cc85f1aabb1b7d723d32d861771a3a414.

Revert "btrfs: drop bdev argument from submit_extent_page"

This reverts commit 7f0788834c9803deaa861edfe2ec726572cdc38c.

btrfs: do not corrupt the fs with rename exchange on a subvol

Testing with the new fsstress support for subvolumes uncovered a pretty
bad problem with rename exchange on subvolumes.  We're modifying two
different subvolumes, but we only start the transaction on one of them,
so the other one is not added to the dirty root list.  This is caught by
btrfs_cow_block() with a warning because the root has not been updated,
however if we do not modify this root again we'll end up pointing at an
invalid root because the root item is never updated.

Fix this by making sure we add the destination root to the trans list,
the same as we do with normal renames.  This fixes the corruption.

Fixes: cdd1fedf8261 ("btrfs: add support for RENAME_EXCHANGE and RENAME_WHITEOUT")
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>

btrfs: change btrfs_fs_devices::rotating to bool

struct btrfs_fs_devices::rotating currently is declared as an integer
variable but only used as a boolean.

Change the variable definition to bool and update to code touching it to
set 'true' and 'false'.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: change btrfs_fs_devices::seeding to bool

struct btrfs_fs_devices::seeding currently is declared as an integer
variable but only used as a boolean.

Change the variable definition to bool and update to code touching it to
set 'true' and 'false'.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: rename btrfs_block_group_cache

The type name is misleading, a single entry is named 'cache' while this
normally means a collection of objects. Rename that everywhere. Also the
identifier was quite long, making function prototypes harder to format.

Suggested-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: block-group: Reuse the item key from caller of read_one_block_group()

For read_one_block_group(), its only caller has already got the item key
to search next block group item.

So we can use that key directly without doing our own convertion on
stack.

Also, since that key used in btrfs_read_block_groups() is vital for
block group item search, add 'const' keyword for that parameter to
prevent read_one_block_group() to modify it.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: block-group: Refactor btrfs_read_block_groups()

Refactor the work inside the loop of btrfs_read_block_groups() into one
separate function, read_one_block_group().

This allows read_one_block_group to be reused for later BG_TREE feature.

The refactor does the following extra fix:
- Use btrfs_fs_incompat() to replace open-coded feature check

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: document extent buffer locking

Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: access eb::blocking_writers according to ACCESS_ONCE policies

A nice writeup of the LKMM (Linux Kernel Memory Model) rules for access
once policies can be found here
https://lwn.net/Articles/799218/#Access-Marking%20Policies .

The locked and unlocked access to eb::blocking_writers should be
annotated accordingly, following this:

Writes:

- locked write must use ONCE, may use plain read
- unlocked write must use ONCE

Reads:

- unlocked read must use ONCE
- locked read may use plain read iff not mixed with unlocked read
- unlocked read then locked must use ONCE

There's one difference on the assembly level, where
btrfs_tree_read_lock_atomic and btrfs_try_tree_read_lock used the cached
value and did not reevaluate it after taking the lock. This could have
missed some opportunities to take the lock in case blocking writers
changed between the calls, but the window is just a few instructions
long. As this is in try-lock, the callers handle that.

Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: set blocking_writers directly, no increment or decrement

The increment and decrement was inherited from previous version that
used atomics, switched in commit 06297d8cefca ("btrfs: switch
extent_buffer blocking_writers from atomic to int"). The only possible
values are 0 and 1 so we can set them directly.

The generated assembly (gcc 9.x) did the direct value assignment in
btrfs_set_lock_blocking_write (asm diff after change in 06297d8cefca):

     5d:   test   %eax,%eax
     5f:   je     62 <btrfs_set_lock_blocking_write+0x22>
     61:   retq

  -  62:   lock incl 0x44(%rdi)
  -  66:   add    $0x50,%rdi
  -  6a:   jmpq   6f <btrfs_set_lock_blocking_write+0x2f>

  +  62:   movl   $0x1,0x44(%rdi)
  +  69:   add    $0x50,%rdi
  +  6d:   jmpq   72 <btrfs_set_lock_blocking_write+0x32>

The part in btrfs_tree_unlock did a decrement because
BUG_ON(blockers > 1) is probably not a strong hint for the compiler, but
otherwise the output looks safe:

  - lock decl 0x44(%rdi)

  + sub    $0x1,%eax
  + mov    %eax,0x44(%rdi)

Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: merge blocking_writers branches in btrfs_tree_read_lock

There are two ifs that use eb::blocking_writers. As this is a variable
modified inside and outside of locks, we could minimize number of
accesses to avoid problems with getting different results at different
times.

The access here is locked so this can only race with btrfs_tree_unlock
that sets blocking_writers to 0 without lock and unsets the lock owner.

The first branch is taken only if the same thread already holds the
lock, the second if checks for blocking writers. Here we'd either unlock
and wait, or proceed. Both are valid states of the locking protocol.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: drop incompat bit for raid1c34 after last block group is gone

When there are no raid1c3 or raid1c4 block groups left after balance
(either convert or with other filters applied), remove the incompat bit.
This is already done for RAID56, do the same for RAID1C34.

Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: add incompat for raid1 with 3, 4 copies

The new raid1c3 and raid1c4 profiles are backward incompatible and the
name shall be 'raid1c34', the status can be found in the global
supported features in /sys/fs/btrfs/features or in the per-filesystem
directory.

Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: add support for 4-copy replication (raid1c4)

Add new block group profile to store 4 copies in a simliar way that
current RAID1 does. The profile attributes and constraints are defined
in the raid table and used by the same code that already handles the 2-
and 3-copy RAID1.

The minimum number of devices is 4, the maximum number of devices/chunks
that can be lost/damaged is 3. There is no comparable traditional RAID
level, the profile is added for future needs to accompany triple-parity
and beyond.

Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: add support for 3-copy replication (raid1c3)

Add new block group profile to store 3 copies in a simliar way that
current RAID1 does. The profile attributes and constraints are defined
in the raid table and used by the same code that already handles the
2-copy RAID1.

The minimum number of devices is 3, the maximum number of devices/chunks
that can be lost/damaged is 2. Like RAID6 but with 33% space
utilization.

Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: sink write flags to cow_file_range_async

In commit "Btrfs: use REQ_CGROUP_PUNT for worker thread submitted bios",
cow_file_range_async gained wbc as a parameter and this makes passing
write flags redundant. Set it inside the function and remove the
parameter.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: sink write_flags to __extent_writepage_io

__extent_writepage reads write flags from wbc and passes both to
__extent_writepage_io. This makes write_flags redundant and we can
remove it.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Btrfs: send, skip backreference walking for extents with many references

Backreference walking, which is used by send to figure if it can issue
clone operations instead of write operations, can be very slow and use
too much memory when extents have many references. This change simply
skips backreference walking when an extent has more than 64 references,
in which case we fallback to a write operation instead of a clone
operation. This limit is conservative and in practice I observed no
signicant slowdown with up to 100 references and still low memory usage
up to that limit.

This is a temporary workaround until there are speedups in the backref
walking code, and as such it does not attempt to add extra interfaces or
knobs to tweak the threshold.

Reported-by: Atemu <atemu.main@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CAE4GHgkvqVADtS4AzcQJxo0Q1jKQgKaW3JGp3SGdoinVo=C9eQ@mail.gmail.com/T/#me55dc0987f9cc2acaa54372ce0492c65782be3fa
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Btrfs: send, allow clone operations within the same file

For send we currently skip clone operations when the source and
destination files are the same. This is so because clone didn't support
this case in its early days, but support for it was added back in May
2013 by commit a96fbc72884fcb ("Btrfs: allow file data clone within a
file"). This change adds support for it.

Example:

  $ mkfs.btrfs -f /dev/sdd
  $ mount /dev/sdd /mnt/sdd

  $ xfs_io -f -c "pwrite -S 0xab -b 64K 0 64K" /mnt/sdd/foobar
  $ xfs_io -c "reflink /mnt/sdd/foobar 0 64K 64K" /mnt/sdd/foobar

  $ btrfs subvolume snapshot -r /mnt/sdd /mnt/sdd/snap

  $ mkfs.btrfs -f /dev/sde
  $ mount /dev/sde /mnt/sde

  $ btrfs send /mnt/sdd/snap | btrfs receive /mnt/sde

Without this change file foobar at the destination has a single 128Kb
extent:

  $ filefrag -v /mnt/sde/snap/foobar
  Filesystem type is: 9123683e
  File size of /mnt/sde/snap/foobar is 131072 (32 blocks of 4096 bytes)
   ext:     logical_offset:        physical_offset: length:   expected: flags:
     0:        0..      31:          0..        31:     32:             last,unknown_loc,delalloc,eof
  /mnt/sde/snap/foobar: 1 extent found

With this we get a single 64Kb extent that is shared at file offsets 0
and 64K, just like in the source filesystem:

  $ filefrag -v /mnt/sde/snap/foobar
  Filesystem type is: 9123683e
  File size of /mnt/sde/snap/foobar is 131072 (32 blocks of 4096 bytes)
   ext:     logical_offset:        physical_offset: length:   expected: flags:
     0:        0..      15:       3328..      3343:     16:             shared
     1:       16..      31:       3328..      3343:     16:       3344: last,shared,eof
  /mnt/sde/snap/foobar: 2 extents found

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: Ensure we trim ranges across block group boundary

[BUG]
When deleting large files (which cross block group boundary) with
discard mount option, we find some btrfs_discard_extent() calls only
trimmed part of its space, not the whole range:

  btrfs_discard_extent: type=0x1 start=19626196992 len=2144530432 trimmed=1073741824 ratio=50%

type: bbio->map_type, in above case, it's SINGLE DATA.
start: Logical address of this trim
len: Logical length of this trim
trimmed: Physically trimmed bytes
ratio: trimmed / len

Thus leaving some unused space not discarded.

[CAUSE]
When discard mount option is specified, after a transaction is fully
committed (super block written to disk), we begin to cleanup pinned
extents in the following call chain:

btrfs_commit_transaction()
|- btrfs_finish_extent_commit()
   |- find_first_extent_bit(unpin, 0, &start, &end, EXTENT_DIRTY);
   |- btrfs_discard_extent()

However, pinned extents are recorded in an extent_io_tree, which can
merge adjacent extent states.

When a large file gets deleted and it has adjacent file extents across
block group boundary, we will get a large merged range like this:

      |<---    BG1    --->|<---      BG2     --->|
      |//////|<--   Range to discard   --->|/////|

To discard that range, we have the following calls:

  btrfs_discard_extent()
  |- btrfs_map_block()
  |  Returned bbio will end at BG1's end. As btrfs_map_block()
  |  never returns result across block group boundary.
  |- btrfs_issuse_discard()
     Issue discard for each stripe.

So we will only discard the range in BG1, not the remaining part in BG2.

Furthermore, this bug is not that reliably observed, for above case, if
there is no other extent in BG2, BG2 will be empty and btrfs will trim
all space of BG2, covering up the bug.

[FIX]
- Allow __btrfs_map_block_for_discard() to modify @length parameter
  btrfs_map_block() uses its @length paramter to notify the caller how
  many bytes are mapped in current call.
  With __btrfs_map_block_for_discard() also modifing the @length,
  btrfs_discard_extent() now understands when to do extra trim.

- Call btrfs_map_block() in a loop until we hit the range end Since we
  now know how many bytes are mapped each time, we can iterate through
  each block group boundary and issue correct trim for each range.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>