git.hungrycats.org Git

]> git.hungrycats.org Git - linux/log

Zygo Blaxell [Sat, 25 Apr 2015 05:36:10 +0000 (01:36 -0400)]

zygo: cherry-picked up to linus/master --remotes=btrfs-next/* --remotes=mason/* --after=now - 90 days -- fs/btrfs

Commits included (++++) and excluded (----):

---- 9f43323 Btrfs: fix inode cache writeout
---- 85db36c Btrfs: fix inode cache writeout
---- a3bdccc Btrfs: prevent list corruption during free space cache processing

commit | commitdiff | tree

Zygo Blaxell [Sat, 25 Apr 2015 04:23:50 +0000 (00:23 -0400)]

zygo: allow access to /dev/mem to chase bugs

commit | commitdiff | tree

Zygo Blaxell [Sat, 25 Apr 2015 03:36:06 +0000 (23:36 -0400)]

zygo: try with memcg and cpusets disabled

commit | commitdiff | tree

Chris Mason [Fri, 24 Apr 2015 18:00:00 +0000 (11:00 -0700)]

Btrfs: prevent list corruption during free space cache processing

__btrfs_write_out_cache is holding the ctl->tree_lock while it prepares
a list of bitmaps to record in the free space cache. It was dropping
the lock while it worked on other components, which made a window for
free_bitmap() to free the bitmap struct without removing it from the
list.

This changes things to hold the lock the whole time, and also makes sure
we hold the lock during enospc cleanup.

Reported-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>

commit | commitdiff | tree

Chris Mason [Thu, 23 Apr 2015 15:02:49 +0000 (08:02 -0700)]

Btrfs: fix inode cache writeout

The code to fix stalls during free spache cache IO wasn't using
the correct root when waiting on the IO for inode caches. This
is only a problem when the inode cache is enabled with

mount -o inode_cache

This fixes the inode cache writeout to preserve any error values and
makes sure not to override the root when inode cache writeout is done.

Reported-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>

commit | commitdiff | tree

Chris Mason [Thu, 23 Apr 2015 15:02:49 +0000 (08:02 -0700)]

commit | commitdiff | tree

Zygo Blaxell [Thu, 23 Apr 2015 02:30:36 +0000 (22:30 -0400)]

zygo: cherry-picked up to linus/master --remotes=btrfs-next/* --remotes=mason/* --after=now - 90 days -- fs/btrfs

Commits included (++++) and excluded (----):

++++ c1e31ff Btrfs: don't check for delalloc_bytes in cache_save_setup
++++ e47aabed Btrfs: fix find_free_dev_extent() malfunction in case device tree has hole
---- e98fb71 btrfs: handle ENOMEM in btrfs_alloc_tree_block
++++ 6e8f302 btrfs: fix race on ENOMEM in alloc_extent_buffer
---- 2f23947 btrfs: check io_ctl_prepare_pages return in __btrfs_write_out_cache
++++ f5b5132 btrfs: unlock i_mutex after attempting to delete subvolume during send
++++ c7dd5f4 Btrfs: fill ->last_trans for delayed inode in btrfs_fill_inode.

commit | commitdiff | tree

Yang Dongsheng [Thu, 9 Apr 2015 04:08:43 +0000 (12:08 +0800)]

Btrfs: fill ->last_trans for delayed inode in btrfs_fill_inode.

We need to fill inode when we found a node for it in delayed_nodes_tree.
But we did not fill the ->last_trans currently, it will cause the test
of xfstest/generic/311 fail. Scenario of the 311 is shown as below:

Problem:
(1). test_fd = open(fname, O_RDWR|O_DIRECT)
(2). pwrite(test_fd, buf, 4096, 0)
(3). close(test_fd)
(4). drop_all_caches() <-------- "echo 3 > /proc/sys/vm/drop_caches"
(5). test_fd = open(fname, O_RDWR|O_DIRECT)
(6). fsync(test_fd);
<-------- we did not get the correct log entry for the file
Reason:
When we re-open this file in (5), we would find a node
in delayed_nodes_tree and fill the inode we are lookup with the
information. But the ->last_trans is not filled, then the fsync()
will check the ->last_trans and found it's 0 then say this inode
is already in our tree which is commited, not recording the extents
for it.

Fix:
This patch fill the ->last_trans properly and set the
runtime_flags if needed in this situation. Then we can get the
log entries we expected after (6) and generic/311 passed.

Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
Reviewed-by: Miao Xie <miaoxie@huawei.com>
Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit c7dd5f4d39048dc075796b53aede1c5d7711c561)

commit | commitdiff | tree

Omar Sandoval [Fri, 10 Apr 2015 21:20:40 +0000 (14:20 -0700)]

btrfs: unlock i_mutex after attempting to delete subvolume during send

Whenever the check for a send in progress introduced in commit
521e0546c970 (btrfs: protect snapshots from deleting during send) is
hit, we return without unlocking inode->i_mutex. This is easy to see
with lockdep enabled:

[  +0.000059] ================================================
[  +0.000028] [ BUG: lock held when returning to user space! ]
[  +0.000029] 4.0.0-rc5-00096-g3c435c1 #93 Not tainted
[  +0.000026] ------------------------------------------------
[  +0.000029] btrfs/211 is leaving the kernel with locks still held!
[  +0.000029] 1 lock held by btrfs/211:
[  +0.000023]  #0:  (&type->i_mutex_dir_key){+.+.+.}, at: [<ffffffff8135b8df>] btrfs_ioctl_snap_destroy+0x2df/0x7a0

Make sure we unlock it in the error path.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Cc: stable@vger.kernel.org
Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit f5b5132ae00b3d62c31532064c0995ad02d92787)

commit | commitdiff | tree

Omar Sandoval [Tue, 24 Feb 2015 10:47:05 +0000 (02:47 -0800)]

btrfs: fix race on ENOMEM in alloc_extent_buffer

Consider the following interleaving of overlapping calls to
alloc_extent_buffer:

Call 1:

- Successfully allocates a few pages with find_or_create_page
- find_or_create_page fails, goto free_eb
- Unlocks the allocated pages

Call 2:
- Calls find_or_create_page and gets a page in call 1's extent_buffer
- Finds that the page is already associated with an extent_buffer
- Grabs a reference to the half-written extent_buffer and calls
mark_extent_buffer_accessed on it

mark_extent_buffer_accessed will then try to call mark_page_accessed on
a null page and panic.

The fix is to decrement the reference count on the half-written
extent_buffer before unlocking the pages so call 2 won't use it. We
should also set exists = NULL in the case that we don't use exists to
avoid accidentally returning a freed extent_buffer in an error case.

Signed-off-by: Omar Sandoval <osandov@osandov.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit 6e8f3022bf1c226e1325b81ff19d0640bcafa204)

commit | commitdiff | tree

Forrest Liu [Mon, 9 Feb 2015 09:30:47 +0000 (17:30 +0800)]

Btrfs: fix find_free_dev_extent() malfunction in case device tree has hole

If device tree has hole, find_free_dev_extent() cannot find available
address properly.

The problem can be reproduce by following script.

    mntpath=/btrfs
    loopdev=/dev/loop0
    filepath=/home/forrest/image

    umount $mntpath
    losetup -d $loopdev
    truncate --size 100g $filepath
    losetup $loopdev $filepath
    mkfs.btrfs -f $loopdev
    mount $loopdev $mntpath

    # make device tree with one big hole
    for i in `seq 1 1 100`; do
        fallocate -l 1g $mntpath/$i
    done
    sync
    for i in `seq 1 1 95`; do
        rm $mntpath/$i
    done
    sync

    # wait cleaner thread remove unused block group
    sleep 300

    fallocate -l 1g $mntpath/aaa

    # failed to allocate new chunk
    fallocate -l 1g $mntpath/bbb

Above script will make device tree with one big hole, and can only allocate
just one chunk in a transaction, so failed to allocate new chunk for $mntpath/bbb

    item 8 key (1 DEV_EXTENT 2185232384) itemoff 15859 itemsize 48
        dev extent chunk_tree 3
        chunk objectid 256 chunk offset 106292051968 length 1073741824
    item 9 key (1 DEV_EXTENT 104190705664) itemoff 15811 itemsize 48
        dev extent chunk_tree 3
        chunk objectid 256 chunk offset 103108575232 length 1073741824

Signed-off-by: Forrest Liu <forrestl@synology.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit e47aabedc3cabec5178d0f0ae83a6fbe7bd26cad)

commit | commitdiff | tree

Chris Mason [Sat, 18 Apr 2015 12:22:48 +0000 (05:22 -0700)]

Btrfs: don't check for delalloc_bytes in cache_save_setup

Now that we're doing free space cache writeback outside the critical
section in the commit, there is a bigger window for delalloc_bytes to
be added after a cache has been written. find_free_extent may do this
without putting the block group back into the dirty list, and also
without a transaction running.

Checking for delalloc_bytes in cache_save_setup means we might leave the
cache marked as written without invalidating it. Consistency checks
during mount will toss the cache, but it's better to get rid of the
check in cache_save_setup and let it get invalidated by the checks
already done during cache write out.

Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit c1e31ffc317e4c28d242b1d961c9c6fe673c0377)

commit | commitdiff | tree

Zygo Blaxell [Thu, 23 Apr 2015 02:28:58 +0000 (22:28 -0400)]

Merge tag 'v3.18.12' into zygo-3.18.12-zb64-btrfs

Linux 3.18.12

# gpg: Signature made Mon Apr 20 15:48:50 2015 EDT using RSA key ID 97772CDC
# gpg: Can't check signature: public key not found

commit | commitdiff | tree

Zygo Blaxell [Sun, 19 Apr 2015 16:10:40 +0000 (12:10 -0400)]

zygo: printf uses %llu for type u64
(cherry picked from commit bfdec521bc2b0f75c2fd4678936ec923ec60f17a)

zygo: printk needs \n
(cherry picked from commit efdf5d8e19e6fa3a8fe0cf032ff8ef6beb6075f2)

zygo: OK we do not need all the E2BIG reports
(cherry picked from commit 686e2e428fee51b9515ca6a1323431031c6fad1a)

commit | commitdiff | tree

Yang Dongsheng [Thu, 9 Apr 2015 04:08:43 +0000 (12:08 +0800)]

commit | commitdiff | tree

Omar Sandoval [Fri, 10 Apr 2015 21:20:40 +0000 (14:20 -0700)]

commit | commitdiff | tree

Omar Sandoval [Tue, 24 Feb 2015 10:47:06 +0000 (02:47 -0800)]

btrfs: check io_ctl_prepare_pages return in __btrfs_write_out_cache

If io_ctl_prepare_pages fails, the pages in io_ctl.pages are not valid.
When we try to access them later, things will blow up in various ways.

Also fix the comment about the return value, which is an errno on error,
not -1, and update the cases where it was not.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Chris Mason <clm@fb.com>

commit | commitdiff | tree

Omar Sandoval [Tue, 24 Feb 2015 10:47:05 +0000 (02:47 -0800)]

commit | commitdiff | tree

Omar Sandoval [Tue, 24 Feb 2015 10:47:04 +0000 (02:47 -0800)]

btrfs: handle ENOMEM in btrfs_alloc_tree_block

This is one of the first places to give out when memory is tight. Handle
it properly rather than with a BUG_ON.

Also fix the comment about the return value, which is an ERR_PTR, not
NULL, on error.

Signed-off-by: Omar Sandoval <osandov@osandov.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>

commit | commitdiff | tree

Forrest Liu [Mon, 9 Feb 2015 09:30:47 +0000 (17:30 +0800)]

commit | commitdiff | tree

Zygo Blaxell [Mon, 20 Apr 2015 01:54:36 +0000 (21:54 -0400)]

zygo: pick-kernel: fix pickList

commit | commitdiff | tree

Sasha Levin [Mon, 20 Apr 2015 19:48:02 +0000 (15:48 -0400)]

Linux 3.18.12

Signed-off-by: Sasha Levin <sasha.levin@oracle.com>

commit | commitdiff | tree

Zygo Blaxell [Sun, 19 Apr 2015 20:00:27 +0000 (16:00 -0400)]

zygo: built virtio devices into kernel

commit | commitdiff | tree

Zygo Blaxell [Sun, 19 Apr 2015 04:20:26 +0000 (00:20 -0400)]

zygo: fs/btrfs/zlib: and where do we get EIO?

commit | commitdiff | tree

Zygo Blaxell [Sun, 19 Apr 2015 03:48:32 +0000 (23:48 -0400)]

zygo: fs/btrfs/zlib: but why do we get EIO?

commit | commitdiff | tree

Zygo Blaxell [Sun, 19 Apr 2015 03:41:13 +0000 (23:41 -0400)]

zygo: printks should have a KERN_* label

commit | commitdiff | tree

Zygo Blaxell [Sat, 18 Apr 2015 23:58:54 +0000 (19:58 -0400)]

zygo: printk does not supply newlines

commit | commitdiff | tree

Zygo Blaxell [Sat, 18 Apr 2015 23:58:26 +0000 (19:58 -0400)]

zygo: suppress EIO from zlib_compress_pages

commit | commitdiff | tree

Zygo Blaxell [Sat, 18 Apr 2015 18:31:48 +0000 (14:31 -0400)]

zygo: trace where EIO comes from

commit | commitdiff | tree

Chris Mason [Sat, 18 Apr 2015 12:22:48 +0000 (05:22 -0700)]

commit | commitdiff | tree

Zygo Blaxell [Fri, 17 Apr 2015 16:32:59 +0000 (12:32 -0400)]

Merge commit '77f51b4266f94a1d72e00a64458d39aefd701985' into zygo-3.18.11-zb64-btrfs

This commit seems to disappear:

commit ac4b500278bff89780c9fa01f2d6eb3053ad7e51
Author: David Sterba <dsterba@suse.cz>
Date:   Fri Jan 2 19:36:14 2015 +0100

    btrfs: expand btrfs_find_item if found_key is NULL

    If the found_key is NULL, then btrfs_find_item becomes a verbose wrapper
    for simple btrfs_search_slot.

    After we've removed all such callers, passing a NULL key is not valid
    anymore.

Signed-off-by: David Sterba <dsterba@suse.cz>
    (cherry picked from commit 1d4c08e0a60be356134d0c466744d0d4e16ebab0)

commit | commitdiff | tree

Zygo Blaxell [Fri, 17 Apr 2015 16:29:50 +0000 (12:29 -0400)]

zygo: pull config from 77f51b4266f94a1d72e00a64458d39aefd701985 too

commit | commitdiff | tree

Zygo Blaxell [Fri, 17 Apr 2015 16:22:19 +0000 (12:22 -0400)]

Revert "btrfs: fix raid56 scrub failed in xfstests btrfs/072"

This reverts commit 3434fa5e07d61bf7fcb8e24609e6d622dc20d6dc.

commit | commitdiff | tree

Zygo Blaxell [Fri, 17 Apr 2015 16:21:51 +0000 (12:21 -0400)]

Revert "Btrfs: fix ASSERT(list_empty(&cur_trans->dirty_bgs_list)"

This reverts commit ed08da33bcd6741bfbdecd8bd1aa0c29cb507fe4.

commit | commitdiff | tree

Zygo Blaxell [Fri, 17 Apr 2015 16:21:15 +0000 (12:21 -0400)]

Revert "btrfs: Fix NO_SPACE bug caused by delayed-iput"

This reverts commit 8a143d4748470acc8b7bb9fb4c820cb46cc3fbd4.

commit | commitdiff | tree

Zygo Blaxell [Fri, 17 Apr 2015 16:20:41 +0000 (12:20 -0400)]

Revert "Btrfs: incremental send, don't rename a directory too soon"

This reverts commit 885816eb5326ea776902896e24fb539aea4bab98.

commit | commitdiff | tree

Zygo Blaxell [Fri, 17 Apr 2015 16:20:40 +0000 (12:20 -0400)]

Revert "Btrfs: incremental send, remove dead code"

This reverts commit 42122522086cd5d151a6a1bd78968e7768f77f8d.

commit | commitdiff | tree

Zygo Blaxell [Fri, 17 Apr 2015 16:16:19 +0000 (12:16 -0400)]

Revert "Btrfs: fix fsync log replay for inodes with a mix of regular refs and extrefs"

This reverts commit baf8df51f7a6b36222a37b3c9b03a34ccddfac3d.

commit | commitdiff | tree

Zygo Blaxell [Fri, 17 Apr 2015 16:12:08 +0000 (12:12 -0400)]

Revert "Btrfs: prepare block group cache before writing"

This reverts commit 6272a918e8c8176312015598e99e67559dbfbb44.

commit | commitdiff | tree

Zygo Blaxell [Fri, 17 Apr 2015 16:01:27 +0000 (12:01 -0400)]

zygo: cherry-picked up to 4eb221b e082f56 e1cbbfa 464919a 83b879f e4a58b7 7031479 d17d264 3c46900 d05a2b4 a0f3ef5 babe65a 3ba9cf2 eb97581 bbdfab9 c4dc06b bc18922 0530403 1f44513 3d93ece 115930c f9d31a3 fcc3b2b 1cf3368 43dbbbc e4eae51 90e8d1f 6792139 72d0cef 64218e7 3c78455 f21a908
Git-Log-Args: linus/master --remotes=btrfs-next/* --remotes=mason/* --after=now - 90 days -- fs/btrfs

Commits included:

++++ e082f56 btrfs: quota: Update quota tree after qgroup relationship change.

commit | commitdiff | tree

Qu Wenruo [Fri, 27 Feb 2015 08:24:28 +0000 (16:24 +0800)]

btrfs: quota: Update quota tree after qgroup relationship change.

Previous patch modified the in memory struct but it's not written in
quota tree until next commit.
So user will still get old data using "btrfs qgroup show" after
assign/remove.

This patch will call btrfs_run_qgroups in assign ioctl so it will be
updated to in memory quota trees and user will get up-to-date results.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit e082f56313f374d723b0366978ddb062c8fe79ea)

commit | commitdiff | tree

Dongsheng Yang [Fri, 27 Feb 2015 08:24:26 +0000 (16:24 +0800)]

btrfs: qgroup: clear STATUS_FLAG_ON in disabling quota.

we forgot to clear STATUS_FLAG_ON in quota_disable(), it
will cause a problem shown as below:

# mount /dev/sdc /mnt
# btrfs quota enable /mnt
# btrfs quota disable /mnt
# btrfs quota rescan /mnt
quota rescan started <--- expecting it fail here.
# echo $?
0

Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit 8ea0ec9e011eb542a3e7b1171776aa4877cf8a90)

commit | commitdiff | tree

Qu Wenruo [Fri, 27 Feb 2015 08:24:25 +0000 (16:24 +0800)]

btrfs: Update btrfs qgroup status item when rescan is done.

Update qgroup status when rescan is done.

Before this patch, status item is not updated on rescan finish, which
causing the RESCAN and INCONSISTENT flags never cleared.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit 53b7cde9d5aa58cf7605664f0e34419156b02698)

commit | commitdiff | tree

Qu Wenruo [Fri, 27 Feb 2015 08:24:24 +0000 (16:24 +0800)]

btrfs: qgroup: Fix dead judgement on qgroup_rescan_leaf() return value.

Old qgroup_rescan_leaf() comment indicates ret == 2 as complete and
cleared INCONSISTENT flag.

This is not true since it will never return 2, and inside it no codes
will clear INCONSISTENT flag.
The flag clearance is done in btrfs_qgroup_rescan_work().
This caused the bug that INCONSISTENT flag is never cleared.

So change the comment and fix the dead judgment.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit 3393168d22fd5f1be5b5429a818c10f642e88ae3)

commit | commitdiff | tree

Qu Wenruo [Fri, 27 Feb 2015 08:24:23 +0000 (16:24 +0800)]

btrfs: Don't allow subvolid >= (1 << BTRFS_QGROUP_LEVEL_SHIFT) to be created

Btrfs will create qgroup on subvolume creation if quota is enabled, but
qgroup uses the high bits(currently 16 bits) as level, to build the
inheritance.

However it is fully possible a subvolume can be created with a
subvolumeid larger than 1 << BTRFS_QGROUP_LEVEL_SHIFT, so it will be
considered as level 1 and can't be assigned to other qgroup in level 1.

This patch will prevent such things so qgroup inheritance will not be
screwed up.
The downside is very clear, btrfs subvolume number limit will decrease
from (u64 max - 256(fisrt free objectid) - 256(last free objectid)) to
(u48 max -256(first free objectid)).
But we still have near u48(that's 15 digits in dec), so that should not
be a huge problem.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit e09fe2d2119800e6060f9b8ba71e072a0eb0fa4d)

commit | commitdiff | tree

Qu Wenruo [Fri, 27 Feb 2015 08:24:22 +0000 (16:24 +0800)]

btrfs: Check qgroup level in kernel qgroup assign.

Although we have qgroup level check in btrfs-progs, it's not enough
since other programe may still call ioctl directly not using
btrfs-progs. For example, systemd.

But it's btrfs-progs to be blame since we don't provide a
full-function(like subvolume create things) btrfs library with enough
check, and only rely on kernel ioctl.

So Add level checks in kernel too.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit 8465ecec9611d60cbbc8e374ecf68453e0dd5b50)

commit | commitdiff | tree

Dongsheng Yang [Mon, 24 Nov 2014 15:27:09 +0000 (10:27 -0500)]

btrfs: qgroup: allow to remove qgroup which has parent but no child.

When a qgroup has parents but no child, it should be removable in
Theory I think. But currently, we can not remove it when it has
either parent or child.

Example:
# btrfs quota enable /mnt
# btrfs qgroup create 1/0 /mnt
# btrfs qgroup create 2/0 /mnt
# btrfs qgroup assign 1/0 2/0 /mnt
# btrfs qgroup show -pcre /mnt
qgroupid rfer  excl  max_rfer max_excl parent  child
-------- ----  ----  -------- -------- ------  -----
0/5      16384 16384 0        0        ---     ---
1/0      0     0     0        0        2/0     ---
2/0      0     0     0        0        ---     1/0

At this time, there is no subvol or qgroup depending on it.
Just a qgroup 2/0 is its parent, but 2/0 can work well without
1/0. So I think 1/0 should be removalbe. But:
# btrfs qgroup destroy 1/0 /mnt
ERROR: unable to destroy quota group: Device or resource busy

This patch remove the check of qgroup->parent in removing it,
then we can remove a qgroup when it has a parent.

Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit f5a6b1c53bdd44f79e3904c0f5e59f956b49b2c8)

commit | commitdiff | tree

Dongsheng Yang [Tue, 11 Nov 2014 12:18:22 +0000 (07:18 -0500)]

btrfs: qgroup: return EINVAL if level of parent is not higher than child's.

When we create a subvol inheriting a qgroup, we need to check the level
of them. Otherwise, there is a chance a qgroup can inherit another qgroup
at the same level.

Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit 09870d2772b284d0061a5e4d1e1cdf6fb6764344)

commit | commitdiff | tree

Dongsheng Yang [Mon, 29 Dec 2014 11:23:05 +0000 (06:23 -0500)]

Btrfs: qgroup, Account data space in more proper timings.

Currenly, in data writing, ->reserved is accounted in
fill_delalloc(), but ->may_use is released in clear_bit_hook()
which is called by btrfs_finish_ordered_io(). That's too late,
that said, between fill_delalloc() and btrfs_finish_ordered_io(),
the data is doublely accounted by qgroup. It will cause some
unexpected -EDQUOT.

Example:
# btrfs quota enable /root/btrfs-auto-test/
# btrfs subvolume create /root/btrfs-auto-test//sub
Create subvolume '/root/btrfs-auto-test/sub'
# btrfs qgroup limit 1G /root/btrfs-auto-test//sub
dd if=/dev/zero of=/root/btrfs-auto-test//sub/file bs=1024 count=1500000
dd: error writing '/root/btrfs-auto-test//sub/file': Disk quota exceeded
681353+0 records in
681352+0 records out
697704448 bytes (698 MB) copied, 8.15563 s, 85.5 MB/s
It's (698 MB) when we got an -EDQUOT, but we limit it by 1G.

This patch move the btrfs_qgroup_reserve/free() for data from
btrfs_delalloc_reserve/release_metadata() to btrfs_check_data_free_space()
and btrfs_free_reserved_data_space(). Then the accounter in qgroup
will be updated at the same time with the accounter in space_info updated.
In this way, the unexpected -EDQUOT will be killed.

Reported-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit 237c0e9f1fbfdca7287f3539f1fa73e5063156b5)

commit | commitdiff | tree

Dongsheng Yang [Fri, 12 Dec 2014 08:44:34 +0000 (16:44 +0800)]

Btrfs: qgroup: free reserved in exceeding quota.

When we exceed quota limit in writing, we will free
some reserved extent when we need to drop but not free
account in qgroup. It means, each time we exceed quota
in writing, there will be some remain space in qg->reserved
we can not use any more. If things go on like this, the
all space will be ate up.

Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit 804ca127fb93988c6a9d5f2bf4a8f1a780c9a2d0)

commit | commitdiff | tree

Dongsheng Yang [Sun, 18 Jan 2015 15:59:23 +0000 (10:59 -0500)]

Btrfs: qgroup: cleanup, remove an unsued parameter in btrfs_create_qgroup().

Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit 4087cf24ae2af17f7dd9fd34e22fde816952d421)

commit | commitdiff | tree

Dongsheng Yang [Fri, 6 Feb 2015 16:06:25 +0000 (11:06 -0500)]

btrfs: qgroup: fix limit args override whole limit struct

btrfs_limit_group use arg limit to override the old qgroup_limit of
corresponding qgroup. However, we should override part of old qgroup_limit
according to the bit which has been set in arg limit.

Signed-off-by: Fan Chengniang <fancn.fnst@cn.fujitsu.com>
Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit 03477d945f13a284d35a757b2c2323d165d5cd81)

commit | commitdiff | tree

Dongsheng Yang [Fri, 21 Nov 2014 02:04:56 +0000 (21:04 -0500)]

btrfs: qgroup: update limit info in function btrfs_run_qgroups().

When we commit_transaction(), qgroups in btree should be updated.
But, limit info is not considered currently. It will cause a problem
when a qgroup of a snapshot inherit the limit info from srcqgroup,
then there is an inconsistency.

Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit d3001ed3a82ec2696bb13c78092d0a3460003fd7)

commit | commitdiff | tree

Dongsheng Yang [Fri, 21 Nov 2014 02:01:41 +0000 (21:01 -0500)]

btrfs: qgroup: consolidate the parameter of fucntion update_qgroup_limit_item().

Cleanup: Change the parameter of update_qgroup_limit_item() to the family of
update_qgroup_xxx_item().

Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit 1510e71c620c27ffc7540176a0689f70d6915e28)

commit | commitdiff | tree

Dongsheng Yang [Fri, 21 Nov 2014 01:58:34 +0000 (20:58 -0500)]

btrfs: qgroup: update qgroup in memory at the same time when we update it in btree.

When we call btrfs_qgroup_inherit() with BTRFS_QGROUP_INHERIT_SET_LIMITS,
btrfs will update the limit info of qgroup in btree but forget to update
the qgroup in rbtree at the same time. It obviousely will cause an inconsistency.

This patch fix it by updating the rbtree at the same time.

Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit e8c8541ac379709db8d2339e1cb720469fc2cd8f)

commit | commitdiff | tree

Dongsheng Yang [Fri, 21 Nov 2014 01:14:38 +0000 (20:14 -0500)]

btrfs: qgroup: inherit limit info from srcgroup in creating snapshot.

Currently, when we snapshot a subvol, snapshot will not copy the limits
from srcqgroup.

This patch make the qgroup in snapshot inherit the limit info when create
a snapshot.

Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit 3eeb4d597efc9e068902057f1bd040cffc69e9e6)

commit | commitdiff | tree

Zhao Lei [Mon, 2 Mar 2015 11:32:20 +0000 (19:32 +0800)]

btrfs: Support busy loop of write and delete

Reproduce:
while true; do
dd if=/dev/zero of=/mnt/btrfs/file count=[75% fs_size]
rm /mnt/btrfs/file
done
Then we can see above loop failed on NO_SPACE.

It it long-term problem since very beginning, because delayed-iput
after rm are not run.

We already have commit_transaction() in alloc_space code, but it is
not triggered in above case.
This patch trigger commit_transaction() to run delayed-iput and
reflash pinned-space to to make write success.

It is based on previous fix of delayed-iput in commit_transaction(),
need to be applied on top of:
btrfs: Fix NO_SPACE bug caused by delayed-iput

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit c99f1b0c6c45d1621f08afb1352689e24a627844)

commit | commitdiff | tree

Zhao Lei [Thu, 26 Feb 2015 02:49:20 +0000 (10:49 +0800)]

btrfs: Fix NO_SPACE bug caused by delayed-iput

Steps to reproduce:
  while true; do
    dd if=/dev/zero of=/btrfs_dir/file count=[fs_size * 75%]
    rm /btrfs_dir/file
    sync
  done

  And we'll see dd failed because btrfs return NO_SPACE.

Reason:
  Normally, btrfs_commit_transaction() call btrfs_run_delayed_iputs()
  in end to free fs space for next write, but sometimes it hadn't
  done work on time, because btrfs-cleaner thread get delayed-iputs
  from list before, but do iput() after next write.

  This is log:
  [ 2569.050776] comm=btrfs-cleaner func=btrfs_evict_inode() begin

  [ 2569.084280] comm=sync func=btrfs_commit_transaction() call btrfs_run_delayed_iputs()
  [ 2569.085418] comm=sync func=btrfs_commit_transaction() done btrfs_run_delayed_iputs()
  [ 2569.087554] comm=sync func=btrfs_commit_transaction() end

  [ 2569.191081] comm=dd begin
  [ 2569.790112] comm=dd func=__btrfs_buffered_write() ret=-28

  [ 2569.847479] comm=btrfs-cleaner func=add_pinned_bytes() 0 + 32677888 = 32677888
  [ 2569.849530] comm=btrfs-cleaner func=add_pinned_bytes() 32677888 + 23834624 = 56512512
  ...
  [ 2569.903893] comm=btrfs-cleaner func=add_pinned_bytes() 943976448 + 21762048 = 965738496
  [ 2569.908270] comm=btrfs-cleaner func=btrfs_evict_inode() end

Fix:
  Make btrfs_commit_transaction() wait current running btrfs-cleaner's
  delayed-iputs() done in end.

Test:
  Use script similar to above(more complex),
  before patch:
    7 failed in 100 * 20 loop.
  after patch:
    0 failed in 100 * 20 loop.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit d7c151717a1efe289aec29fb9f94485f64262c0b)

commit | commitdiff | tree

Zhao Lei [Wed, 25 Feb 2015 06:17:20 +0000 (14:17 +0800)]

btrfs: Set relative data on clear btrfs_block_group_cache->pinned

Bug1:
  space_info->bytes_readonly was set to very large(negative) value in
  btrfs_remove_block_group().

Reason:
  Current code set block_group_cache->pinned = 0 in btrfs_delete_unused_bgs(),
  but above space was not counted to space_info->bytes_readonly.

  Then in btrfs_remove_block_group():
    block_group->space_info->bytes_readonly -= block_group->key.offset;
  We can see following value in trace:
    btrfs_remove_block_group: pid=2677 comm=btrfs-cleaner WARNING: bytes_readonly=12582912, key.offset=134217728

Bug2:
  space_info->total_bytes_pinned grow to value larger than fs size.
  In a 1.2G fs, we can get following trace log:
  at first:
    ZL_DEBUG: add_pinned_bytes: pid=2710 comm=sync change total_bytes_pinned flags=1 869793792 + 95944704 = 965738496
  after some op:
    ZL_DEBUG: add_pinned_bytes: pid=2770 comm=sync change total_bytes_pinned flags=1 1780178944 + 95944704 = 1876123648
  after some op:
    ZL_DEBUG: add_pinned_bytes: pid=3193 comm=sync change total_bytes_pinned flags=1 2924568576 + 95551488 = 3020120064
  ...

Reason:
  Similar to bug1, we also need to adjust space_info->total_bytes_pinned
  in above code block.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit c30666d466c70a30491e45dd8d068360f9dd7693)

commit | commitdiff | tree

Zhao Lei [Tue, 17 Feb 2015 09:25:51 +0000 (17:25 +0800)]

btrfs: Adjust commit-transaction condition to avoid NO_SPACE more

If we have any chance to make a successful write, we should not give up.

This patch adjust commit-transaction condition from:
pinned >= wanted
to
left + pinned >= wanted

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit 264ca0f60becac34395f3b42171ad0cc2e53ea6e)

commit | commitdiff | tree

Zhao Lei [Mon, 16 Feb 2015 10:52:17 +0000 (18:52 +0800)]

btrfs: Fix tail space processing in find_free_dev_extent()

It is another reason for NO_SPACE case.

When we found enough free space in loop and saved them to
max_hole_start/size before, and tail space contains pending extent,
origional innocent max_hole_start/size are reset in retry.

As a result, find_free_dev_extent() returns less space than it can,
and cause NO_SPACE in user program.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit f2ab76188ec185dde84e7fe7c533ef2f5d668a32)

commit | commitdiff | tree

Zhao Lei [Sat, 14 Feb 2015 05:23:45 +0000 (13:23 +0800)]

btrfs: fix condition of commit transaction

Old code bypass commit transaction when we don't have enough
pinned space, but another case is there exist freed bgs in current
transction, it have possibility to make alloc_chunk success.

This patch modify the condition to:
if (have_free_bg || have_pinned_space) commit_transaction()

Confirmed above action by printk before and after patch.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit 94b947b2f3f84f3bba25d34c4e2a229fc2276830)

commit | commitdiff | tree

Chris Mason [Sat, 11 Apr 2015 12:09:06 +0000 (05:09 -0700)]

Btrfs: fix uninit variable in clone ioctl

Commit 0d97a64e0 creates a new variable but doesn't always set it up.
This puts it back to the original method (key.offset + 1) for the cases
not covered by Filipe's new logic.

Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit de249e66a73d696666281cd812087979c6fae552)

commit | commitdiff | tree

Filipe Manana [Mon, 30 Mar 2015 17:23:59 +0000 (18:23 +0100)]

Btrfs: fix inode eviction infinite loop after cloning into it

If we attempt to clone a 0 length region into a file we can end up
inserting a range in the inode's extent_io tree with a start offset
that is greater then the end offset, which triggers immediately the
following warning:

[ 3914.619057] WARNING: CPU: 17 PID: 4199 at fs/btrfs/extent_io.c:435 insert_state+0x4b/0x10b [btrfs]()
[ 3914.620886] BTRFS: end < start 4095 4096
(...)
[ 3914.638093] Call Trace:
[ 3914.638636]  [<ffffffff81425fd9>] dump_stack+0x4c/0x65
[ 3914.639620]  [<ffffffff81045390>] warn_slowpath_common+0xa1/0xbb
[ 3914.640789]  [<ffffffffa03ca44f>] ? insert_state+0x4b/0x10b [btrfs]
[ 3914.642041]  [<ffffffff810453f0>] warn_slowpath_fmt+0x46/0x48
[ 3914.643236]  [<ffffffffa03ca44f>] insert_state+0x4b/0x10b [btrfs]
[ 3914.644441]  [<ffffffffa03ca729>] __set_extent_bit+0x107/0x3f4 [btrfs]
[ 3914.645711]  [<ffffffffa03cb256>] lock_extent_bits+0x65/0x1bf [btrfs]
[ 3914.646914]  [<ffffffff8142b2fb>] ? _raw_spin_unlock+0x28/0x33
[ 3914.648058]  [<ffffffffa03cbac4>] ? test_range_bit+0xcc/0xde [btrfs]
[ 3914.650105]  [<ffffffffa03cb3c3>] lock_extent+0x13/0x15 [btrfs]
[ 3914.651361]  [<ffffffffa03db39e>] lock_extent_range+0x3d/0xcd [btrfs]
[ 3914.652761]  [<ffffffffa03de1fe>] btrfs_ioctl_clone+0x278/0x388 [btrfs]
[ 3914.654128]  [<ffffffff811226dd>] ? might_fault+0x58/0xb5
[ 3914.655320]  [<ffffffffa03e0909>] btrfs_ioctl+0xb51/0x2195 [btrfs]
(...)
[ 3914.669271] ---[ end trace 14843d3e2e622fc1 ]---

This later makes the inode eviction handler enter an infinite loop that
keeps dumping the following warning over and over:

[ 3915.117629] WARNING: CPU: 22 PID: 4228 at fs/btrfs/extent_io.c:435 insert_state+0x4b/0x10b [btrfs]()
[ 3915.119913] BTRFS: end < start 4095 4096
(...)
[ 3915.137394] Call Trace:
[ 3915.137913]  [<ffffffff81425fd9>] dump_stack+0x4c/0x65
[ 3915.139154]  [<ffffffff81045390>] warn_slowpath_common+0xa1/0xbb
[ 3915.140316]  [<ffffffffa03ca44f>] ? insert_state+0x4b/0x10b [btrfs]
[ 3915.141505]  [<ffffffff810453f0>] warn_slowpath_fmt+0x46/0x48
[ 3915.142709]  [<ffffffffa03ca44f>] insert_state+0x4b/0x10b [btrfs]
[ 3915.143849]  [<ffffffffa03ca729>] __set_extent_bit+0x107/0x3f4 [btrfs]
[ 3915.145120]  [<ffffffffa038c1e3>] ? btrfs_kill_super+0x17/0x23 [btrfs]
[ 3915.146352]  [<ffffffff811548f6>] ? deactivate_locked_super+0x3b/0x50
[ 3915.147565]  [<ffffffffa03cb256>] lock_extent_bits+0x65/0x1bf [btrfs]
[ 3915.148785]  [<ffffffff8142b7e2>] ? _raw_write_unlock+0x28/0x33
[ 3915.149931]  [<ffffffffa03bc325>] btrfs_evict_inode+0x196/0x482 [btrfs]
[ 3915.151154]  [<ffffffff81168904>] evict+0xa0/0x148
[ 3915.152094]  [<ffffffff811689e5>] dispose_list+0x39/0x43
[ 3915.153081]  [<ffffffff81169564>] evict_inodes+0xdc/0xeb
[ 3915.154062]  [<ffffffff81154418>] generic_shutdown_super+0x49/0xef
[ 3915.155193]  [<ffffffff811546d1>] kill_anon_super+0x13/0x1e
[ 3915.156274]  [<ffffffffa038c1e3>] btrfs_kill_super+0x17/0x23 [btrfs]
(...)
[ 3915.167404] ---[ end trace 14843d3e2e622fc2 ]---

So just bail out of the clone ioctl if the length of the region to clone
is zero, without locking any extent range, in order to prevent this issue
(same behaviour as a pwrite with a 0 length for example).

This is trivial to reproduce. For example, the steps for the test I just
made for fstests:

  mkfs.btrfs -f SCRATCH_DEV
  mount SCRATCH_DEV $SCRATCH_MNT

  touch $SCRATCH_MNT/foo
  touch $SCRATCH_MNT/bar

  $CLONER_PROG -s 0 -d 4096 -l 0 $SCRATCH_MNT/foo $SCRATCH_MNT/bar
  umount $SCRATCH_MNT

A test case for fstests follows soon.

CC: <stable@vger.kernel.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit ccccf3d67294714af2d72a6fd6fd7d73b01c9329)

commit | commitdiff | tree

Filipe Manana [Mon, 30 Mar 2015 17:26:47 +0000 (18:26 +0100)]

Btrfs: fix inode eviction infinite loop after extent_same ioctl

If we pass a length of 0 to the extent_same ioctl, we end up locking an
extent range with a start offset greater then its end offset (if the
destination file's offset is greater than zero). This results in a warning
from extent_io.c:insert_state through the following call chain:

  btrfs_extent_same()
    btrfs_double_lock()
      lock_extent_range()
        lock_extent(inode->io_tree, offset, offset + len - 1)
          lock_extent_bits()
            __set_extent_bit()
              insert_state()
                --> WARN_ON(end < start)

This leads to an infinite loop when evicting the inode. This is the same
problem that my previous patch titled
"Btrfs: fix inode eviction infinite loop after cloning into it" addressed
but for the extent_same ioctl instead of the clone ioctl.

CC: <stable@vger.kernel.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit 113e8283869b9855c8b999796aadd506bbac155f)

commit | commitdiff | tree

Filipe Manana [Tue, 31 Mar 2015 13:56:46 +0000 (14:56 +0100)]

Btrfs: fix range cloning when same inode used as source and destination

While searching for extents to clone we might find one where we only use
a part of it coming from its tail. If our destination inode is the same
the source inode, we end up removing the tail part of the extent item and
insert after a new one that point to the same extent with an adjusted
key file offset and data offset. After this we search for the next extent
item in the fs/subvol tree with a key that has an offset incremented by
one. But this second search leaves us at the new extent item we inserted
previously, and since that extent item has a non-zero data offset, it
it can make us call btrfs_drop_extents with an empty range (start == end)
which causes the following warning:

[23978.537119] WARNING: CPU: 6 PID: 16251 at fs/btrfs/file.c:550 btrfs_drop_extent_cache+0x43/0x385 [btrfs]()
(...)
[23978.557266] Call Trace:
[23978.557978]  [<ffffffff81425fd9>] dump_stack+0x4c/0x65
[23978.559191]  [<ffffffff81045390>] warn_slowpath_common+0xa1/0xbb
[23978.560699]  [<ffffffffa047f0ea>] ? btrfs_drop_extent_cache+0x43/0x385 [btrfs]
[23978.562389]  [<ffffffff8104544d>] warn_slowpath_null+0x1a/0x1c
[23978.563613]  [<ffffffffa047f0ea>] btrfs_drop_extent_cache+0x43/0x385 [btrfs]
[23978.565103]  [<ffffffff810e3a18>] ? time_hardirqs_off+0x15/0x28
[23978.566294]  [<ffffffff81079ff8>] ? trace_hardirqs_off+0xd/0xf
[23978.567438]  [<ffffffffa047f73d>] __btrfs_drop_extents+0x6b/0x9e1 [btrfs]
[23978.568702]  [<ffffffff8107c03f>] ? trace_hardirqs_on+0xd/0xf
[23978.569763]  [<ffffffff811441c0>] ? ____cache_alloc+0x69/0x2eb
[23978.570817]  [<ffffffff81142269>] ? virt_to_head_page+0x9/0x36
[23978.571872]  [<ffffffff81143c15>] ? cache_alloc_debugcheck_after.isra.42+0x16c/0x1cb
[23978.573466]  [<ffffffff811420d5>] ? kmemleak_alloc_recursive.constprop.52+0x16/0x18
[23978.574962]  [<ffffffffa0480d07>] btrfs_drop_extents+0x66/0x7f [btrfs]
[23978.576179]  [<ffffffffa049aa35>] btrfs_clone+0x516/0xaf5 [btrfs]
[23978.577311]  [<ffffffffa04983dc>] ? lock_extent_range+0x7b/0xcd [btrfs]
[23978.578520]  [<ffffffffa049b2a2>] btrfs_ioctl_clone+0x28e/0x39f [btrfs]
[23978.580282]  [<ffffffffa049d9ae>] btrfs_ioctl+0xb51/0x219a [btrfs]
(...)
[23978.591887] ---[ end trace 988ec2a653d03ed3 ]---

Then we attempt to insert a new extent item with a key that already
exists, which makes btrfs_insert_empty_item return -EEXIST resulting in
abortion of the current transaction:

[23978.594355] WARNING: CPU: 6 PID: 16251 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x52/0x114 [btrfs]()
(...)
[23978.622589] Call Trace:
[23978.623181]  [<ffffffff81425fd9>] dump_stack+0x4c/0x65
[23978.624359]  [<ffffffff81045390>] warn_slowpath_common+0xa1/0xbb
[23978.625573]  [<ffffffffa044ab6c>] ? __btrfs_abort_transaction+0x52/0x114 [btrfs]
[23978.626971]  [<ffffffff810453f0>] warn_slowpath_fmt+0x46/0x48
[23978.628003]  [<ffffffff8108a6c8>] ? vprintk_default+0x1d/0x1f
[23978.629138]  [<ffffffffa044ab6c>] __btrfs_abort_transaction+0x52/0x114 [btrfs]
[23978.630528]  [<ffffffffa049ad1b>] btrfs_clone+0x7fc/0xaf5 [btrfs]
[23978.631635]  [<ffffffffa04983dc>] ? lock_extent_range+0x7b/0xcd [btrfs]
[23978.632886]  [<ffffffffa049b2a2>] btrfs_ioctl_clone+0x28e/0x39f [btrfs]
[23978.634119]  [<ffffffffa049d9ae>] btrfs_ioctl+0xb51/0x219a [btrfs]
(...)
[23978.647714] ---[ end trace 988ec2a653d03ed4 ]---

This is wrong because we should not process the extent item that we just
inserted previously, and instead process the extent item that follows it
in the tree

For example for the test case I wrote for fstests:

   bs=$((64 * 1024))
   mkfs.btrfs -f -l $bs -O ^no-holes /dev/sdc
   mount /dev/sdc /mnt

   xfs_io -f -c "pwrite -S 0xaa $(($bs * 2)) $(($bs * 2))" /mnt/foo

   $CLONER_PROG -s $((3 * $bs)) -d $((267 * $bs)) -l 0 /mnt/foo /mnt/foo
   $CLONER_PROG -s $((217 * $bs)) -d $((95 * $bs)) -l 0 /mnt/foo /mnt/foo

The second clone call fails with -EEXIST, because when we process the
first extent item (offset 262144), we drop part of it (counting from the
end) and then insert a new extent item with a key greater then the key we
found. The next time we search the tree we search for a key with offset
262144 + 1, which leaves us at the new extent item we have just inserted
but we think it refers to an extent that we need to clone.

Fix this by ensuring the next search key uses an offset corresponding to
the offset of the key we found previously plus the data length of the
corresponding extent item. This ensures we skip new extent items that we
inserted and works for the case of implicit holes too (NO_HOLES feature).

A test case for fstests follows soon.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit df858e76723ace61342b118aa4302bd09de4e386)

commit | commitdiff | tree

Chris Mason [Tue, 7 Apr 2015 01:17:00 +0000 (18:17 -0700)]

Btrfs: fix use after free when close_ctree frees the orphan_rsv

Near the end of close_ctree, we're calling btrfs_free_block_rsv
to free up the orphan rsv. The problem is this call updates the
space_info, which has already been freed.

This adds a new __ function that directly calls kfree instead of trying
to update the space infos.

Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit cdfb080e1853660952db5e5332727e59427856df)

commit | commitdiff | tree

Chris Mason [Mon, 6 Apr 2015 14:48:20 +0000 (07:48 -0700)]

Btrfs: don't use highmem for free space cache pages

In order to create the free space cache concurrently with FS modifications,
we need to take a few block group locks.

The cache code also does kmap, which would schedule with the locks held.
Instead of going through kmap_atomic, lets just use lowmem for the cache
pages.

Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit 2b108268006e06d57ec9810f4ccf5d99d7e5b598)

commit | commitdiff | tree

Chris Mason [Mon, 6 Apr 2015 20:17:20 +0000 (13:17 -0700)]

btrfs: move struct io_ctl into ctree.h and rename it

We'll need to put the io_ctl into the block_group cache struct, so
name it struct btrfs_io_ctl and move it into ctree.h

Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit 4c6d1d85ad89fd8e32dc9204b7f944854399bda9)

commit | commitdiff | tree

Josef Bacik [Tue, 24 Feb 2015 20:35:51 +0000 (12:35 -0800)]

Btrfs: don't steal from the global reserve if we don't have the space

btrfs_evict_inode() needs to be more careful about stealing from the
global_rsv. We dont' want to end up aborting commit with ENOSPC just
because the evict_inode code was too greedy.

Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit 3bce876fd58a745b8a1bc0bd8325c3e5b4cebeb0)

commit | commitdiff | tree

Josef Bacik [Wed, 18 Feb 2015 21:58:15 +0000 (13:58 -0800)]

Btrfs: don't commit the transaction in the async space flushing

We're triggering a huge number of commits from
btrfs_async_reclaim_metadata_space. These aren't really requried,
because everyone calling the async reclaim code is going to end up
triggering a commit on their own.

Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit 365c5313776730acf433d54226f0216a1d075e9a)

commit | commitdiff | tree

Chris Mason [Wed, 17 Dec 2014 17:41:04 +0000 (09:41 -0800)]

btrfs: actively run the delayed refs while deleting large files

When we are deleting large files with large extents, we are building up
a huge set of delayed refs for processing. Truncate isn't checking
often enough to see if we need to back off and process those, or let
a commit proceed.

The end result is long stalls after the rm, and very long commit times.
During the commits, other processes back up waiting to start new
transactions and we get into trouble.

Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit 28ed1345a50491d78e1454ad4005dc5d3557a69e)

commit | commitdiff | tree

Guenter Roeck [Fri, 13 Mar 2015 08:58:46 +0000 (01:58 -0700)]

fs: btrfs: Add missing include file

Building alpha:allmodconfig fails with

fs/btrfs/inode.c: In function 'check_direct_IO':
fs/btrfs/inode.c:8050:2: error: implicit declaration of function 'iov_iter_alignment'

due to a missing include file.

Fixes: 3737c63e1fb0 ("fs: move struct kiocb to fs.h")
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Acked-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit 4a3d1caf8a2c16c55424a0768eade54ee0922341)

commit | commitdiff | tree

Chris Mason [Wed, 1 Apr 2015 15:36:05 +0000 (08:36 -0700)]

Btrfs: free and unlock our path before btrfs_free_and_pin_reserved_extent()

The error handling path for alloc_reserved_tree_block is calling
btrfs_free_and_pin_reserved_extent with a spinning tree lock held.  This
might sleep as we allocate extent_state objects:

BUG: sleeping function called from invalid context at mm/slub.c:1268
in_atomic(): 1, irqs_disabled(): 0, pid: 11093, name: kworker/u4:7
5 locks held by kworker/u4:7/11093:
  #0:  ("%s-%s""btrfs", name){++++.+}, at: [<ffffffff81091d51>] process_one_work+0x151/0x520
  #1:  ((&work->normal_work)){+.+.+.}, at: [<ffffffff81091d51>] process_one_work+0x151/0x520
  #2:  (sb_internal){++++.+}, at: [<ffffffffa003a70e>] start_transaction+0x43e/0x590 [btrfs]
  #3:  (&head_ref->mutex){+.+...}, at: [<ffffffffa0089f8c>] btrfs_delayed_ref_lock+0x4c/0x240 [btrfs]
  #4:  (btrfs-extent-00){++++..}, at: [<ffffffffa007697b>] btrfs_clear_lock_blocking_rw+0x9b/0x150 [btrfs]
CPU: 0 PID: 11093 Comm: kworker/u4:7 Tainted: G        W 4.0.0-rc6-default+ #246
Hardware name: Intel Corporation Santa Rosa platform/Matanzas, BIOS TSRSCRB1.86C.0047.B00.0610170821 10/17/06
Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [btrfs]
  00000000000004f4 ffff88006dd17848 ffffffff81ab0e3b ffff88006dd17848
  ffff88007a944760 ffff88006dd17868 ffffffff8109d516 ffff88006dd17898
  0000000000000000 ffff88006dd17898 ffffffff8109d5b2 ffffffff81aba2bb
Call Trace:
  [<ffffffff81ab0e3b>] dump_stack+0x4f/0x6c
  [<ffffffff8109d516>] ___might_sleep+0xf6/0x140
  [<ffffffff8109d5b2>] __might_sleep+0x52/0x90
  [<ffffffff81aba2bb>] ? ftrace_call+0x5/0x34
  [<ffffffff81196363>] kmem_cache_alloc+0x163/0x1b0
  [<ffffffffa0056f31>] ? alloc_extent_state+0x31/0x150 [btrfs]
  [<ffffffffa0056f20>] ? alloc_extent_state+0x20/0x150 [btrfs]
  [<ffffffffa0056f31>] alloc_extent_state+0x31/0x150 [btrfs]
  [<ffffffffa005805b>] __set_extent_bit+0x37b/0x5d0 [btrfs]
  [<ffffffff81aba2bb>] ? ftrace_call+0x5/0x34
  [<ffffffffa005888d>] ? set_extent_bit+0xd/0x30 [btrfs]
  [<ffffffffa00588a3>] set_extent_bit+0x23/0x30 [btrfs]
  [<ffffffffa0058e80>] set_extent_dirty+0x20/0x30 [btrfs]
  [<ffffffffa00195ba>] pin_down_extent+0xaa/0x170 [btrfs]
  [<ffffffffa001d8ef>] __btrfs_free_reserved_extent+0xcf/0x160 [btrfs]
  [<ffffffffa0023856>] btrfs_free_and_pin_reserved_extent+0x16/0x20 [btrfs]
  [<ffffffffa002482a>] __btrfs_run_delayed_refs+0xfca/0x1290 [btrfs]
  [<ffffffffa0026eae>] btrfs_run_delayed_refs+0x6e/0x2e0 [btrfs]
  [<ffffffffa0027378>] delayed_ref_async_start+0x48/0xb0 [btrfs]
  [<ffffffffa006c883>] normal_work_helper+0x83/0x350 [btrfs]
  [<ffffffffa006cd79>] ? btrfs_extent_refs_helper+0x9/0x20 [btrfs]
  [<ffffffffa006cd82>] btrfs_extent_refs_helper+0x12/0x20 [btrfs]
  [<ffffffff81091dcb>] process_one_work+0x1cb/0x520
  [<ffffffff81091d51>] ? process_one_work+0x151/0x520
  [<ffffffff811c7abf>] ? seq_read+0x3f/0x400
  [<ffffffff8109260b>] worker_thread+0x5b/0x4e0
  [<ffffffff81097be2>] ? __kthread_parkme+0x12/0xa0
  [<ffffffff810925b0>] ? rescuer_thread+0x450/0x450
  [<ffffffff81098686>] kthread+0xf6/0x120
  [<ffffffff81098590>] ? flush_kthread_worker+0x1b0/0x1b0
  [<ffffffff81ab8088>] ret_from_fork+0x58/0x90
  [<ffffffff81098590>] ? flush_kthread_worker+0x1b0/0x1b0
------------[ cut here ]------------

This changes things to free the path first, which will also unlock the
extent buffer.

Signed-off-by: Chris Mason <clm@fb.com>
Reported-by: Dave Sterba <dsterba@suse.cz>
Tested-by: Dave Sterba <dsterba@suse.cz>
(cherry picked from commit dd82525956744c3a5e6b7275cd468b6180cc5b72)

commit | commitdiff | tree

Liu Bo [Tue, 17 Mar 2015 06:34:16 +0000 (14:34 +0800)]

Btrfs: Remove the check for old-style mkfs

This was used to make sure that a fresh btrfs from an older mkfs.btrfs,
but it also allows us to mount a buggy btrfs if this btrfs has the right
superblock head part but has something wrong with chunk tree part[1], and
after that we can hit BUG_ON()s set in the code to prevent something
impossible.

Since David has released "Btrfs progs v3.19-rc2", just remove the check,
if anyone who wants to make a fresh btrfs, please use the latest one.

[1]: http://www.spinics.net/lists/linux-btrfs/msg42358.html

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Omar Sandoval <osandov@osandov.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit e56a951e01bf55f49533c47ad2ce61dbd613a3f3)

commit | commitdiff | tree

Jeff Mahoney [Fri, 20 Mar 2015 18:02:09 +0000 (14:02 -0400)]

btrfs: cleanup orphans while looking up default subvolume

Orphans in the fs tree are cleaned up via open_ctree and subvolume
orphans are cleaned via btrfs_lookup_dentry -- except when a default
subvolume is in use. The name for the default subvolume uses a manual
lookup that doesn't trigger orphan cleanup and needs to trigger it
manually as well. This doesn't apply to the remount case since the
subvolumes are cleaned up by walking the root radix tree.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit 727b9784b6085c99c2f836bf4fcc2848dc9cf904)

commit | commitdiff | tree

Tom Van Braeckel [Tue, 24 Mar 2015 15:35:49 +0000 (16:35 +0100)]

btrfs: explicitly set control file's private_data

The private_data member of the Btrfs control device file
(/dev/btrfs-control) is used to hold the current transaction and needs
to be initialized to NULL to signify that no transaction is in progress.

We explicitly set the control file's private_data to NULL to be
independent of whatever value the misc subsystem initializes it to.

Backstory:
----------

The misc subsystem (which is used by /dev/btrfs-control) initializes
a file's private_data to point to the misc device when a driver has
registered a custom open file operation and initializes it to NULL
when a custom open file operation has *not* been provided.

This subtle quirk is confusing, to the point where kernel code registers
*empty* file open operations to have private_data point to the misc
device structure.

And it leads to bugs, where the addition or removal of a custom open
file operation surprisingly changes the initial contents of a file's
private_data structure.

To simplify things in the misc subsystem, a patch [1] has been proposed
to *always* set private_data to point to the misc device instead of
only doing this when a custom open file operation has been registered.

But before we can fix this in the misc subsystem itself, we need to
modify the (few) drivers that rely on this very subtle behavior.

[1] https://lkml.org/lkml/2014/12/4/939

Signed-off-by: Martin Kepplinger <martink@posteo.de>
Signed-off-by: Tom Van Braeckel <tomvanbraeckel@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit d8620958296e4fa61afde421f1de16a5c2234b28)

commit | commitdiff | tree

Chengyu Song [Tue, 24 Mar 2015 22:12:56 +0000 (18:12 -0400)]

btrfs: incorrect handling for fiemap_fill_next_extent return

fiemap_fill_next_extent returns 0 on success, -errno on error, 1 if this was
the last extent that will fit in user array. If 1 is returned, the return
value may eventually returned to user space, which should not happen, according
to manpage of ioctl.

Signed-off-by: Chengyu Song <csong84@gatech.edu>
Reviewed-by: David Sterba <dsterba@suse.cz>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit 26e726afe01c1c82072cf23a5ed89ce25f39d9f2)

commit | commitdiff | tree

David Sterba [Wed, 25 Mar 2015 18:26:41 +0000 (19:26 +0100)]

btrfs: don't accept bare namespace as a valid xattr

Due to insufficient check in btrfs_is_valid_xattr, this unexpectedly
works:

$ touch file
$ setfattr -n user. -v 1 file
$ getfattr -d file
user.="1"

ie. the missing attribute name after the namespace.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=94291
Reported-by: William Douglas <william.douglas@intel.com>
CC: <stable@vger.kernel.org> # 2.6.29+
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit 3c3b04d10ff1811a27f86684ccd2f5ba6983211d)

commit | commitdiff | tree

Filipe Manana [Mon, 23 Mar 2015 14:07:40 +0000 (14:07 +0000)]

Btrfs: fix log tree corruption when fs mounted with -o discard

While committing a transaction we free the log roots before we write the
new super block. Freeing the log roots implies marking the disk location
of every node/leaf (metadata extent) as pinned before the new super block
is written. This is to prevent the disk location of log metadata extents
from being reused before the new super block is written, otherwise we
would have a corrupted log tree if before the new super block is written
a crash/reboot happens and the location of any log tree metadata extent
ended up being reused and rewritten.

Even though we pinned the log tree's metadata extents, we were issuing a
discard against them if the fs was mounted with the -o discard option,
resulting in corruption of the log tree if a crash/reboot happened before
writing the new super block - the next time the fs was mounted, during
the log replay process we would find nodes/leafs of the log btree with
a content full of zeroes, causing the process to fail and require the
use of the tool btrfs-zero-log to wipeout the log tree (and all data
previously fsynced becoming lost forever).

Fix this by not doing a discard when pinning an extent. The discard will
be done later when it's safe (after the new super block is committed) at
extent-tree.c:btrfs_finish_extent_commit().

Fixes: e688b7252f78 (Btrfs: fix extent pinning bugs in the tree log)
CC: <stable@vger.kernel.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit dcc82f4783ad91d4ab654f89f37ae9291cdc846a)

commit | commitdiff | tree

Filipe Manana [Fri, 20 Mar 2015 17:19:46 +0000 (17:19 +0000)]

Btrfs: fix metadata inconsistencies after directory fsync

We can get into inconsistency between inodes and directory entries
after fsyncing a directory. The issue is that while a directory gets
the new dentries persisted in the fsync log and replayed at mount time,
the link count of the inode that directory entries point to doesn't
get updated, staying with an incorrect link count (smaller then the
correct value). This later leads to stale file handle errors when
accessing (including attempt to delete) some of the links if all the
other ones are removed, which also implies impossibility to delete the
parent directories, since the dentries can not be removed.

Another issue is that (unlike ext3/4, xfs, f2fs, reiserfs, nilfs2),
when fsyncing a directory, new files aren't logged (their metadata and
dentries) nor any child directories. So this patch fixes this issue too,
since it has the same resolution as the incorrect inode link count issue
mentioned before.

This is very easy to reproduce, and the following excerpt from my test
case for xfstests shows how:

  _scratch_mkfs >> $seqres.full 2>&1
  _init_flakey
  _mount_flakey

  # Create our main test file and directory.
  $XFS_IO_PROG -f -c "pwrite -S 0xaa 0 8K" $SCRATCH_MNT/foo | _filter_xfs_io
  mkdir $SCRATCH_MNT/mydir

  # Make sure all metadata and data are durably persisted.
  sync

  # Add a hard link to 'foo' inside our test directory and fsync only the
  # directory. The btrfs fsync implementation had a bug that caused the new
  # directory entry to be visible after the fsync log replay but, the inode
  # of our file remained with a link count of 1.
  ln $SCRATCH_MNT/foo $SCRATCH_MNT/mydir/foo_2

  # Add a few more links and new files.
  # This is just to verify nothing breaks or gives incorrect results after the
  # fsync log is replayed.
  ln $SCRATCH_MNT/foo $SCRATCH_MNT/mydir/foo_3
  $XFS_IO_PROG -f -c "pwrite -S 0xff 0 64K" $SCRATCH_MNT/hello | _filter_xfs_io
  ln $SCRATCH_MNT/hello $SCRATCH_MNT/mydir/hello_2

  # Add some subdirectories and new files and links to them. This is to verify
  # that after fsyncing our top level directory 'mydir', all the subdirectories
  # and their files/links are registered in the fsync log and exist after the
  # fsync log is replayed.
  mkdir -p $SCRATCH_MNT/mydir/x/y/z
  ln $SCRATCH_MNT/foo $SCRATCH_MNT/mydir/x/y/foo_y_link
  ln $SCRATCH_MNT/foo $SCRATCH_MNT/mydir/x/y/z/foo_z_link
  touch $SCRATCH_MNT/mydir/x/y/z/qwerty

  # Now fsync only our top directory.
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/mydir

  # And fsync now our new file named 'hello', just to verify later that it has
  # the expected content and that the previous fsync on the directory 'mydir' had
  # no bad influence on this fsync.
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/hello

  # Simulate a crash/power loss.
  _load_flakey_table $FLAKEY_DROP_WRITES
  _unmount_flakey

  _load_flakey_table $FLAKEY_ALLOW_WRITES
  _mount_flakey

  # Verify the content of our file 'foo' remains the same as before, 8192 bytes,
  # all with the value 0xaa.
  echo "File 'foo' content after log replay:"
  od -t x1 $SCRATCH_MNT/foo

  # Remove the first name of our inode. Because of the directory fsync bug, the
  # inode's link count was 1 instead of 5, so removing the 'foo' name ended up
  # deleting the inode and the other names became stale directory entries (still
  # visible to applications). Attempting to remove or access the remaining
  # dentries pointing to that inode resulted in stale file handle errors and
  # made it impossible to remove the parent directories since it was impossible
  # for them to become empty.
  echo "file 'foo' link count after log replay: $(stat -c %h $SCRATCH_MNT/foo)"
  rm -f $SCRATCH_MNT/foo

  # Now verify that all files, links and directories created before fsyncing our
  # directory exist after the fsync log was replayed.
  [ -f $SCRATCH_MNT/mydir/foo_2 ] || echo "Link mydir/foo_2 is missing"
  [ -f $SCRATCH_MNT/mydir/foo_3 ] || echo "Link mydir/foo_3 is missing"
  [ -f $SCRATCH_MNT/hello ] || echo "File hello is missing"
  [ -f $SCRATCH_MNT/mydir/hello_2 ] || echo "Link mydir/hello_2 is missing"
  [ -f $SCRATCH_MNT/mydir/x/y/foo_y_link ] || \
      echo "Link mydir/x/y/foo_y_link is missing"
  [ -f $SCRATCH_MNT/mydir/x/y/z/foo_z_link ] || \
      echo "Link mydir/x/y/z/foo_z_link is missing"
  [ -f $SCRATCH_MNT/mydir/x/y/z/qwerty ] || \
      echo "File mydir/x/y/z/qwerty is missing"

  # We expect our file here to have a size of 64Kb and all the bytes having the
  # value 0xff.
  echo "file 'hello' content after log replay:"
  od -t x1 $SCRATCH_MNT/hello

  # Now remove all files/links, under our test directory 'mydir', and verify we
  # can remove all the directories.
  rm -f $SCRATCH_MNT/mydir/x/y/z/*
  rmdir $SCRATCH_MNT/mydir/x/y/z
  rm -f $SCRATCH_MNT/mydir/x/y/*
  rmdir $SCRATCH_MNT/mydir/x/y
  rmdir $SCRATCH_MNT/mydir/x
  rm -f $SCRATCH_MNT/mydir/*
  rmdir $SCRATCH_MNT/mydir

  # An fsck, run by the fstests framework everytime a test finishes, also detected
  # the inconsistency and printed the following error message:
  #
  # root 5 inode 257 errors 2001, no inode item, link count wrong
  #    unresolved ref dir 258 index 2 namelen 5 name foo_2 filetype 1 errors 4, no inode ref
  #    unresolved ref dir 258 index 3 namelen 5 name foo_3 filetype 1 errors 4, no inode ref

  status=0
  exit

The expected golden output for the test is:

  wrote 8192/8192 bytes at offset 0
  XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
  wrote 65536/65536 bytes at offset 0
  XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
  File 'foo' content after log replay:
  0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
  *
  0020000
  file 'foo' link count after log replay: 5
  file 'hello' content after log replay:
  0000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
  *
  0200000

Which is the output after this patch and when running the test against
ext3/4, xfs, f2fs, reiserfs or nilfs2. Without this patch, the test's
output is:

  wrote 8192/8192 bytes at offset 0
  XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
  wrote 65536/65536 bytes at offset 0
  XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
  File 'foo' content after log replay:
  0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
  *
  0020000
  file 'foo' link count after log replay: 1
  Link mydir/foo_2 is missing
  Link mydir/foo_3 is missing
  Link mydir/x/y/foo_y_link is missing
  Link mydir/x/y/z/foo_z_link is missing
  File mydir/x/y/z/qwerty is missing
  file 'hello' content after log replay:
  0000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
  *
  0200000
  rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/mydir/x/y/z': No such file or directory
  rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/mydir/x/y': No such file or directory
  rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/mydir/x': No such file or directory
  rm: cannot remove '/home/fdmanana/btrfs-tests/scratch_1/mydir/foo_2': Stale file handle
  rm: cannot remove '/home/fdmanana/btrfs-tests/scratch_1/mydir/foo_3': Stale file handle
  rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/mydir': Directory not empty

Fsck, without this fix, also complains about the wrong link count:

  root 5 inode 257 errors 2001, no inode item, link count wrong
      unresolved ref dir 258 index 2 namelen 5 name foo_2 filetype 1 errors 4, no inode ref
      unresolved ref dir 258 index 3 namelen 5 name foo_3 filetype 1 errors 4, no inode ref

So fix this by logging the inodes that the dentries point to when
fsyncing a directory.

A test case for xfstests follows.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit 2f2ff0ee5e4303e727cfd7abd4133d1a8ee68394)

commit | commitdiff | tree

Filipe Manana [Sat, 14 Mar 2015 07:03:27 +0000 (07:03 +0000)]

Btrfs: change the insertion criteria for the qgroup operations rbtree

After looking at Liu Bo's recent patch (titled
"Btrfs: fix comp_oper to get right order") I realized the search made by
qgroup_oper_exists() was buggy because its rbtree navigation comparison
function, comp_oper_exist(), only looks at the fields bytenr and ref_root
of a tree node, ignoring the seq field completely. This was wrong because
when we insert a node into the rbtree we use comp_oper(), which takes a
decision based first on bytenr, then on seq and then on the ref_root field.
That means qgroup_oper_exists() could miss the fact that at least one
operation with given bytenr and ref_root exists.

Consider the following simple example of a 3 nodes qgroup operations
rbtree (created using comp_oper before this patch), where each node's key
is a tuple with the shape (bytenr, seq, ref_root, op):

                          [ (4096, 2, 20, op X) ]
                         /                       \
                        /                         \
   [ (4096, 1, 5, op Y) ]                         [ (4096, 3, 10, op Z) ]

qgroup_oper_exists() when called to search for an existing operation for
bytenr 4096 and ref root 10 wouldn't find anything because it would go to
the left subtree instead of the right subtree, since comp_oper_exits()
ignores the seq field completely.

Fix this by changing the insertion navigation function to use the ref_root
field right after using the bytenr field and before using the seq field,
so that qgroup_oper_exists() / comp_oper_exist() work as expected.

This patch applies on top of the patch mentioned above from Liu.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit bf69196045a8c5c42b10493a26ed45c33014371e)

commit | commitdiff | tree

Filipe Manana [Thu, 12 Mar 2015 23:23:13 +0000 (23:23 +0000)]

Btrfs: add missing inode item update in fallocate()

If we fallocate(), without the keep size flag, into an area already covered
by an extent previously fallocated, we were updating the inode's i_size but
we weren't updating the inode item in the fs/subvol tree. A following umount
+ mount would result in a loss of the inode's size (and an fsync would miss
too the fact that the inode changed).

Reproducer:

  $ mkfs.btrfs -f /dev/sdd
  $ mount /dev/sdd /mnt
  $ fallocate -n -l 1M /mnt/foobar
  $ fallocate -l 512K /mnt/foobar
  $ umount /mnt
  $ mount /dev/sdd /mnt
  $ od -t x1 /mnt/foobar
  0000000

The expected result is:

  $ od -t x1 /mnt/foobar
  0000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  *
  2000000

A test case for fstests follows soon.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit 3d850dd44889d3aa67d0b8007c2cdd259bff7da4)

commit | commitdiff | tree

Filipe Manana [Thu, 12 Mar 2015 16:04:50 +0000 (16:04 +0000)]

Btrfs: incremental send, remove dead code

The logic to detect path loops when attempting to apply a pending
directory rename, introduced in commit
f959492fc15b (Btrfs: send, fix more issues related to directory renames)
is no longer needed, and the respective fstests test case for that commit,
btrfs/045, now passes without this code (as well as all the other test
cases for send/receive).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit 5f806c3ae2ff6263a10a6901f97abb74dac03d36)

commit | commitdiff | tree

Filipe Manana [Thu, 12 Mar 2015 15:16:20 +0000 (15:16 +0000)]

Btrfs: incremental send, clear name from cache after orphanization

If a directory's reference ends up being orphanized, because the inode
currently being processed has a new path that matches that directory's
path, make sure we evict the name of the directory from the name cache.
This is because there might be descendent inodes (either directories or
regular files) that will be orphanized later too, and therefore the
orphan name of the ancestor must be used, otherwise we send issue rename
operations with a wrong path in the send stream.

Reproducer:

  $ mkfs.btrfs -f /dev/sdb
  $ mount /dev/sdb /mnt

  $ mkdir -p /mnt/data/n1/n2/p1/p2
  $ mkdir /mnt/data/n4
  $ mkdir -p /mnt/data/p1/p2

  $ btrfs subvolume snapshot -r /mnt /mnt/snap1

  $ mv /mnt/data/p1/p2 /mnt/data
  $ mv /mnt/data/n1/n2/p1/p2 /mnt/data/p1
  $ mv /mnt/data/p2 /mnt/data/n1/n2/p1
  $ mv /mnt/data/n1/n2 /mnt/data/p1
  $ mv /mnt/data/p1 /mnt/data/n4
  $ mv /mnt/data/n4/p1/n2/p1 /mnt/data

  $ btrfs subvolume snapshot -r /mnt /mnt/snap2

  $ btrfs send /mnt/snap1 -f /tmp/1.send
  $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/2.send

  $ mkfs.btrfs -f /dev/sdc
  $ mount /dev/sdc /mnt2
  $ btrfs receive /mnt2 -f /tmp/1.send
  $ btrfs receive /mnt2 -f /tmp/2.send
  ERROR: rename data/p1/p2 -> data/n4/p1/p2 failed. no such file or directory

Directories data/p1 (inode 263) and data/p1/p2 (inode 264) in the parent
snapshot are both orphanized during the incremental send, and as soon as
data/p1 is orphanized, we must make sure that when orphanizing data/p1/p2
we use a source path of o263-6-o/p2 for the rename operation instead of
the old path data/p1/p2 (the one before the orphanization of inode 263).

A test case for xfstests follows soon.

Reported-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit 8996a48c0a8ed01c55f52e6794625c95a9977ee6)

commit | commitdiff | tree

Filipe Manana [Mon, 2 Mar 2015 20:53:53 +0000 (20:53 +0000)]

Btrfs: send, don't leave without decrementing clone root's send_progress

If the clone root was not readonly or the dead flag was set on it, we were
leaving without decrementing the root's send_progress counter (and before
we just incremented it). If a concurrent snapshot deletion was in progress
and ended up being aborted, it would be impossible to later attempt to
delete again the snapshot, since the root's send_in_progress counter could
never go back to 0.

We were also setting clone_sources_to_rollback to i + 1 too early - if we
bailed out because the clone root we got is not readonly or flagged as dead
we ended up later derreferencing a null pointer because we didn't assign
the clone root to sctx->clone_roots[i].root:

for (i = 0; sctx && i < clone_sources_to_rollback; i++)
btrfs_root_dec_send_in_progress(
sctx->clone_roots[i].root);

So just don't increment the send_in_progress counter if the root is readonly
or flagged as dead.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit 2f1f465ae6da244099af55c066e5355abd8ff620)

commit | commitdiff | tree

Filipe Manana [Mon, 2 Mar 2015 20:53:52 +0000 (20:53 +0000)]

Btrfs: send, add missing check for dead clone root

After we locked the root's root item, a concurrent snapshot deletion
call might have set the dead flag on it. So check if the dead flag
is set and abort if it is, just like we do for the parent root.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit 5cc2b17e80cf5770f2e585c2d90fd8af1b901258)

commit | commitdiff | tree

Filipe Manana [Mon, 23 Feb 2015 19:53:35 +0000 (19:53 +0000)]

Btrfs: remove deleted xattrs on fsync log replay

If we deleted xattrs from a file and fsynced the file, after a log replay
the xattrs would remain associated to the file. This was an unexpected
behaviour and differs from what other filesystems do, such as for example
xfs and ext3/4.

Fix this by, on fsync log replay, check if every xattr in the fs/subvol
tree (that belongs to a logged inode) has a matching xattr in the log,
and if it does not, delete it from the fs/subvol tree. This is a similar
approach to what we do for dentries when we replay a directory from the
fsync log.

This issue is trivial to reproduce, and the following excerpt from my
test for xfstests triggers the issue:

  _crash_and_mount()
  {
       # Simulate a crash/power loss.
       _load_flakey_table $FLAKEY_DROP_WRITES
       _unmount_flakey
       _load_flakey_table $FLAKEY_ALLOW_WRITES
       _mount_flakey
  }

  rm -f $seqres.full

  _scratch_mkfs >> $seqres.full 2>&1
  _init_flakey
  _mount_flakey

  # Create out test file and add 3 xattrs to it.
  touch $SCRATCH_MNT/foobar
  $SETFATTR_PROG -n user.attr1 -v val1 $SCRATCH_MNT/foobar
  $SETFATTR_PROG -n user.attr2 -v val2 $SCRATCH_MNT/foobar
  $SETFATTR_PROG -n user.attr3 -v val3 $SCRATCH_MNT/foobar

  # Make sure everything is durably persisted.
  sync

  # Now delete the second xattr and fsync the inode.
  $SETFATTR_PROG -x user.attr2 $SCRATCH_MNT/foobar
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foobar

  _crash_and_mount

  # After the fsync log is replayed, the file should have only 2 xattrs, the ones
  # named user.attr1 and user.attr3. The btrfs fsync log replay bug left the file
  # with the 3 xattrs that we had before deleting the second one and fsyncing the
  # file.
  echo "xattr names and values after first fsync log replay:"
  $GETFATTR_PROG --absolute-names --dump $SCRATCH_MNT/foobar | _filter_scratch

  # Now write some data to our file, fsync it, remove the first xattr, add a new
  # hard link to our file and commit the fsync log by fsyncing some other new
  # file. This is to verify that after log replay our first xattr does not exist
  # anymore.
  echo "hello world!" >> $SCRATCH_MNT/foobar
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foobar
  $SETFATTR_PROG -x user.attr1 $SCRATCH_MNT/foobar
  ln $SCRATCH_MNT/foobar $SCRATCH_MNT/foobar_link
  touch $SCRATCH_MNT/qwerty
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/qwerty

  _crash_and_mount

  # Now only the xattr with name user.attr3 should be set in our file.
  echo "xattr names and values after second fsync log replay:"
  $GETFATTR_PROG --absolute-names --dump $SCRATCH_MNT/foobar | _filter_scratch

  status=0
  exit

The expected golden output, which is produced with this patch applied or
when testing against xfs or ext3/4, is:

  xattr names and values after first fsync log replay:
  # file: SCRATCH_MNT/foobar
  user.attr1="val1"
  user.attr3="val3"

  xattr names and values after second fsync log replay:
  # file: SCRATCH_MNT/foobar
  user.attr3="val3"

Without this patch applied, the output is:

  xattr names and values after first fsync log replay:
  # file: SCRATCH_MNT/foobar
  user.attr1="val1"
  user.attr2="val2"
  user.attr3="val3"

  xattr names and values after second fsync log replay:
  # file: SCRATCH_MNT/foobar
  user.attr1="val1"
  user.attr2="val2"
  user.attr3="val3"

A patch with a test case for xfstests follows soon.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit 4f764e5153616fccaed38e7524d6f5f89a49a060)

commit | commitdiff | tree

Josef Bacik [Tue, 17 Mar 2015 14:52:28 +0000 (10:52 -0400)]

Btrfs: fix outstanding_extents accounting in DIO

We are keeping track of how many extents we need to reserve properly based on
the amount we want to write, but we were still incrementing outstanding_extents
if we wrote less than what we requested.  This isn't quite right since we will
be limited to our max extent size.  So instead lets do something horrible!  Keep
track of how many outstanding_extents we reserved, and decrement each time we
allocate an extent.  If we use our entire reserve make sure to jack up
outstanding_extents on the inode so the accounting works out properly.  Thanks,

Reported-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
(cherry picked from commit e1cbbfa5f5aaf40a1fe70856fac4dfcc33e0e651)

commit | commitdiff | tree

Josef Bacik [Fri, 13 Mar 2015 19:01:24 +0000 (15:01 -0400)]

Btrfs: account merges/splits properly

My fix

Btrfs: fix merge delalloc logic

only fixed half of the problems, it didn't fix the case where we have two large
extents on either side and then join them together with a new small extent.  We
need to instead keep track of how many extents we have accounted for with each
side of the new extent, and then see how many extents we need for the new large
extent.  If they match then we know we need to keep our reservation, otherwise
we need to drop our reservation.  This shows up with a case like this

[BTRFS_MAX_EXTENT_SIZE+4K][4K HOLE][BTRFS_MAX_EXTENT_SIZE+4K]

Previously the logic would have said that the number extents required for the
new size (3) is larger than the number of extents required for the largest side
(2) therefore we need to keep our reservation.  But this isn't the case, since
both sides require a reservation of 2 which leads to 4 for the whole range
currently reserved, but we only need 3, so we need to drop one of the
reservations.  The same problem existed for splits, we'd think we only need 3
extents when creating the hole but in reality we need 4.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
(cherry picked from commit ba117213554bc747561c5b7bf274d60ac93b8598)

commit | commitdiff | tree

Josef Bacik [Mon, 16 Mar 2015 21:38:52 +0000 (17:38 -0400)]

Btrfs: add sanity test for outstanding_extents accounting

I introduced a regression wrt outstanding_extents accounting. These are tricky
areas that aren't easily covered by xfstests as we could change MAX_EXTENT_SIZE
at any time. So add sanity tests to cover the various conditions that are
tricky in order to make sure we don't introduce regressions in the future.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
(cherry picked from commit 464919a88ec615bf7f22085730ac72cfe735df0d)

commit | commitdiff | tree

Josef Bacik [Mon, 16 Mar 2015 21:38:02 +0000 (17:38 -0400)]

Btrfs: just free dummy extent buffers

If we fail during our sanity tests we could get NULL deref's because we unload
the module before the dummy extent buffers are free'd via RCU. So check for
this case and just free the things directly. Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
(cherry picked from commit 2fdbf36f587e956eee26e193264b0e8d86d2fd08)

commit | commitdiff | tree

Josef Bacik [Fri, 13 Mar 2015 19:12:23 +0000 (15:12 -0400)]

Btrfs: account for the correct number of extents for delalloc reservations

Direct IO can easily pass in an buffer that is greater than
BTRFS_MAX_EXTENT_SIZE, so take this into account when reserving extents in the
delalloc reservation code. Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit 6a41dd0922e3c63e677c2d8f7906ce6a3e097af1)

commit | commitdiff | tree

Josef Bacik [Fri, 13 Mar 2015 19:12:08 +0000 (15:12 -0400)]

Btrfs: fix merge delalloc logic

My patch to properly count outstanding extents wrt MAX_EXTENT_SIZE introduced a
regression when re-dirtying already dirty areas.  We have logic in split to make
sure we are taking the largest space into account but didn't have it for merge,
so it was sometimes making us think we were turning a tiny extent into a huge
extent, when in reality we already had a huge extent and needed to use the other
side in our logic.  This fixes the regression that was reported by a user on
list.  Thanks,

Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit 8461a3de770477a9a7b8eeaebcc4804dbc26ca38)

commit | commitdiff | tree

Liu Bo [Fri, 13 Mar 2015 06:24:38 +0000 (14:24 +0800)]

Btrfs: fix comp_oper to get right order

Case (oper1->seq > oper2->seq) should differ with case (oper1->seq < oper2->seq).

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit 48da5f0a4cb2b6b44579f5737e8be888c0d02526)

commit | commitdiff | tree

Liu Bo [Fri, 6 Mar 2015 12:23:44 +0000 (20:23 +0800)]

Btrfs: catch transaction abortion after waiting for it

This problem is uncovered by a test case: http://patchwork.ozlabs.org/patch/244297.

Fsync() can report success when it actually doesn't. When we
have several threads running fsync() at the same tiem and in one fsync() we
get a transaction abortion due to some problems(in the test case it's disk
failures), and other fsync()s may return successfully which makes userspace
programs think that data is now safely flushed into disk.

It's because that after fsyncs() fail btrfs_sync_log() due to disk failures,
they get to try btrfs_commit_transaction() where it finds that there is
already a transaction being committed, and they'll just call wait_for_commit()
and return. Note that we actually check "trans->aborted" in btrfs_end_transaction,
but it's likely that the error message is still not yet throwed out and only after
wait_for_commit() we're sure whether the transaction is committed successfully.

This add the necessary check and it now passes the test.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit b4924a0fa18d7f69bde3a84521258e7a55828186)

commit | commitdiff | tree

Josef Bacik [Mon, 2 Mar 2015 21:37:31 +0000 (16:37 -0500)]

Btrfs: prepare block group cache before writing

Writing the block group cache will modify the extent tree quite a bit because it
truncates the old space cache and pre-allocates new stuff. To try and cut down
on the churn lets do the setup dance first, then later on hopefully we can avoid
looping with newly dirtied roots. Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
(cherry picked from commit aa4661f6a7aee90423d1c77ee4d492e7b1a99883)

commit | commitdiff | tree

Josef Bacik [Mon, 2 Mar 2015 17:41:33 +0000 (12:41 -0500)]

Btrfs: fix ASSERT(list_empty(&cur_trans->dirty_bgs_list)

Dave could hit this assert consistently running btrfs/078.  This is because
when we update the block groups we could truncate the free space, which would
try to delete the csums for that range and dirty the csum root.  For this to
happen we have to have already written out the csum root so it's kind of hard to
hit this case.  This patch fixes this by changing the logic to only write the
dirty block groups if the dirty_cowonly_roots list is empty.  This will get us
the same effect as before since we add the extent root last, and will cover the
case that we dirty some other root again but not the extent root.  Thanks,

Reported-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
(cherry picked from commit 9266ff8a5ad2be9f04c194f742e87964ee15bab1)

commit | commitdiff | tree

David Sterba [Tue, 24 Feb 2015 18:40:41 +0000 (19:40 +0100)]

btrfs: switch helper macros to static inlines in sysfs.h

The conversion macros use nested container_of that leads to a warning

fs/btrfs/sysfs.c: In function 'btrfs_feature_visible':
fs/btrfs/sysfs.c:183:8: warning: declaration of '__mptr' shadows a previous local
fs/btrfs/sysfs.c:183:8: warning: shadowed declaration is here

Use of functions will add proper type checking.

Signed-off-by: David Sterba <dsterba@suse.cz>
(cherry picked from commit 093adbcedf123f366e5eef0c4ccd815920f725f3)

commit | commitdiff | tree

David Sterba [Wed, 25 Feb 2015 14:47:32 +0000 (15:47 +0100)]

btrfs: use explicit initializer for seq_elem

Using {} as initializer for struct seq_elem does not properly initialize
the list_head member, but it currently works because it gets set through
btrfs_get_tree_mod_seq if 'seq' is 0.

Signed-off-by: David Sterba <dsterba@suse.cz>
(cherry picked from commit 3284da7b7b585e6e8e98f374a51d234d14c7a0a2)

Linux kernels and configurations used by Zygo.

RSS Atom