Josef Bacik [Fri, 19 Dec 2014 18:24:26 +0000 (13:24 -0500)]
vfs: make inode_sb_list_lock per sb
When doing fs_mark tests I was noticing other things getting starved out of
doing operations while the fs was unmounting. This is because we protect all
super_block's s_inodes list with a global lock, which is kind of a bummer.
There doesn't seem to be any reason we do this so make it a per-sb lock. This
makes sure that we don't add latency to anybody trying to add/remove inodes from
the per sb list while somebody else is unmounting or evicting inodes. Thanks,
Josef Bacik [Thu, 18 Dec 2014 16:34:23 +0000 (11:34 -0500)]
fs: don't softlockup when evicting inodes
If I run an fs_mark job that creates millions of empty files and then
immediately unmount the file system I will get a softlockup during unmount.
This box has ~140gb of RAM so we never hit sufficient memory pressure to evict
enough inodes during the runtime of the benchmark, which means I see around 80
million inodes being evicted at unmount time. With this patch my box no longer
softlocks up. Thanks,
Qu Wenruo [Mon, 19 Jan 2015 07:42:41 +0000 (15:42 +0800)]
btrfs: Don't call btrfs_start_transaction() on frozen fs to avoid deadlock.
Commit 6b5fe46dfa52 (btrfs: do commit in sync_fs if there are pending
changes) will call btrfs_start_transaction() in sync_fs(), to handle
some operations needed to be done in next transaction.
However this can cause deadlock if the filesystem is frozen, with the
following sys_r+w output:
[ 143.255932] Call Trace:
[ 143.255936] [<ffffffff816c0e09>] schedule+0x29/0x70
[ 143.255939] [<ffffffff811cb7f3>] __sb_start_write+0xb3/0x100
[ 143.255971] [<ffffffffa040ec06>] start_transaction+0x2e6/0x5a0
[btrfs]
[ 143.255992] [<ffffffffa040f1eb>] btrfs_start_transaction+0x1b/0x20
[btrfs]
[ 143.256003] [<ffffffffa03dc0ba>] btrfs_sync_fs+0xca/0xd0 [btrfs]
[ 143.256007] [<ffffffff811f7be0>] sync_fs_one_sb+0x20/0x30
[ 143.256011] [<ffffffff811cbd01>] iterate_supers+0xe1/0xf0
[ 143.256014] [<ffffffff811f7d75>] sys_sync+0x55/0x90
[ 143.256017] [<ffffffff816c49d2>] system_call_fastpath+0x12/0x17
[ 143.256111] Call Trace:
[ 143.256114] [<ffffffff816c0e09>] schedule+0x29/0x70
[ 143.256119] [<ffffffff816c3405>] rwsem_down_write_failed+0x1c5/0x2d0
[ 143.256123] [<ffffffff8133f013>] call_rwsem_down_write_failed+0x13/0x20
[ 143.256131] [<ffffffff811caae8>] thaw_super+0x28/0xc0
[ 143.256135] [<ffffffff811db3e5>] do_vfs_ioctl+0x3f5/0x540
[ 143.256187] [<ffffffff811db5c1>] SyS_ioctl+0x91/0xb0
[ 143.256213] [<ffffffff816c49d2>] system_call_fastpath+0x12/0x17
The reason is like the following:
(Holding s_umount)
VFS sync_fs staff:
|- btrfs_sync_fs()
|- btrfs_start_transaction()
|- sb_start_intwrite()
(Waiting thaw_fs to unfreeze)
VFS thaw_fs staff:
thaw_fs()
(Waiting sync_fs to release
s_umount)
So deadlock happens.
This can be easily triggered by fstest/generic/068 with inode_cache
mount option.
The fix is to check if the fs is frozen, if the fs is frozen, just
return and waiting for the next transaction.
Cc: David Sterba <dsterba@suse.cz> Reported-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
[enhanced comment, changed to SB_FREEZE_WRITE] Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit a53f4f8e9c8ebe6c9ee3b34c368913aae9876e22)
Qu Wenruo [Tue, 20 Jan 2015 09:05:33 +0000 (17:05 +0800)]
btrfs: Fix the bug that fs_info->pending_changes is never cleared.
Fs_info->pending_changes is never cleared since the original code uses
cmpxchg(&fs_info->pending_changes, 0, 0), which will only clear it if
pending_changes is already 0.
This will cause a lot of problem when mount it with inode_cache mount
option.
If the btrfs is mounted as inode_cache, pending_changes will always be
1, even when the fs is frozen.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit 6c9fe14f9d64cc12401a825a60ec5c5723496ca4)
Satoru Takeuchi [Thu, 25 Dec 2014 09:21:41 +0000 (18:21 +0900)]
btrfs: fix state->private cast on 32 bit machines
Suppress the following warning displayed on building 32bit (i686) kernel.
===============================================================================
...
CC [M] fs/btrfs/extent_io.o
fs/btrfs/extent_io.c: In function ‘btrfs_free_io_failure_record’:
fs/btrfs/extent_io.c:2193:13: warning: cast to pointer from integer of
different size [-Wint-to-pointer-cast]
failrec = (struct io_failure_record *)state->private;
...
===============================================================================
Signed-off-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com> Reported-by: Chris Murphy <chris@colorremedies.com> Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit 6e1103a6e9b19dbdc348077d04a546b626911fc5)
Filipe Manana [Fri, 16 Jan 2015 13:24:40 +0000 (13:24 +0000)]
Btrfs: fix race deleting block group from space_info->ro_bgs list
When removing a block group we were deleting it from its space_info's
ro_bgs list without the correct protection - the space info's spinlock.
Fix this by doing the list delete while holding the spinlock of the
corresponding space info, which is the correct lock for any operation
on that list.
This issue was introduced in the 3.19 kernel by the following change:
I ran into a kernel crash while a task was running statfs, which iterates
the space_info->ro_bgs list while holding the space info's spinlock,
and another task was deleting it from the same list, without holding that
spinlock, as part of the block group remove operation (while running the
function btrfs_remove_block_group). This happened often when running the
stress test xfstests/generic/038 I recently made.
David Sterba [Mon, 19 Jan 2015 13:21:02 +0000 (14:21 +0100)]
btrfs: sync ioctl, handle errors after transaction start
The version merged to 3.19 did not handle errors from start_trancaction
and could pass an invalid pointer to commit_transaction.
Fixes: 6b5fe46dfa52441f ("btrfs: do commit in sync_fs if there are pending changes") Reported-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit 98bd5c547ef2300f915fc1adce5b6f25c195d4d4)
Filipe Manana [Mon, 15 Dec 2014 16:04:42 +0000 (16:04 +0000)]
Btrfs: correctly get tree level in tree_backref_for_extent
If we are using skinny metadata, the block's tree level is in the offset
of the key and not in a btrfs_tree_block_info structure following the
extent item (it doesn't exist). Therefore fix it.
Besides returning the correct level in the tree, this also prevents reading
past the leaf's end in the case where the extent item is the last item in
the leaf (eb) and it has only 1 inline reference - this is because
sizeof(struct btrfs_tree_block_info) is greater than
sizeof(struct btrfs_extent_inline_ref).
Got it while running a scrub which produced the following warning:
BTRFS: checksum error at logical 42123264 on dev /dev/sde, sector 15840: metadata node (level 24) in tree 5
Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit a1317f455ab936a9447f17b08e3e874c27742870)
Wang Shilong [Wed, 24 Dec 2014 06:45:30 +0000 (14:45 +0800)]
Btrfs: call inode_dec_link_count() on mkdir error path
In btrfs_mkdir(), if it fails to create dir, we should
clean up existed items, setting inode's link properly
to make sure it could be cleaned up properly.
Signed-off-by: Wang Shilong <wangshilong1991@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit c7cfb8a5405a34777d670f7a5441bb2c7ca9730f)
Filipe Manana [Thu, 4 Dec 2014 18:38:30 +0000 (18:38 +0000)]
Btrfs: always clear a block group node when removing it from the tree
Always clear a block group's rbnode after removing it from the rbtree to
ensure that any tasks that might be holding a reference on the block group
don't end up accessing stale rbnode left and right child pointers through
next_block_group().
This is a leftover from the change titled:
"Btrfs: fix invalid block group rbtree access after bg is removed"
Filipe Manana [Thu, 4 Dec 2014 15:31:01 +0000 (15:31 +0000)]
Btrfs: ensure deletion from pinned_chunks list is protected
The call to remove_extent_mapping() actually deletes the extent map
from the list it's included in - fs_info->pinned_chunks - and that
list is protected by the chunk mutex. Therefore make that call
while holding the chunk mutex and remove the redundant list delete
call because it's a noop.
This fixes an overlook of the patch titled
"Btrfs: fix race between fs trimming and block group remove/allocation"
following the same obvervation from the patch titled
"Btrfs: fix unprotected deletion from pending_chunks list".
Filipe Manana [Tue, 2 Dec 2014 18:07:30 +0000 (18:07 +0000)]
Btrfs: fix fs mapping extent map leak
On chunk allocation error (label "error_del_extent"), after adding the
extent map to the tree and to the pending chunks list, we would leave
decrementing the extent map's refcount by 2 instead of 3 (our allocation
+ tree reference + list reference).
Also, on chunk/block group removal, if the block group was on the list
pending_chunks we weren't decrementing the respective list reference.
This applies on top (depends on) of my previous patch titled:
"Btrfs: fix race between fs trimming and block group remove/allocation"
But the issue in fact was already present before that change, it only
became easier to hit after Josef's 3.18 patch that added automatic
removal of empty block groups.
Filipe Manana [Tue, 2 Dec 2014 18:07:49 +0000 (18:07 +0000)]
Btrfs: fix unprotected deletion from pending_chunks list
On block group remove if the corresponding extent map was on the
transaction->pending_chunks list, we were deleting the extent map
from that list, through remove_extent_mapping(), without any
synchronization with chunk allocation (which iterates that list
and adds new elements to it). Fix this by ensure that this is done
while the chunk mutex is held, since that's the mutex that protects
the list in the chunk allocation code path.
This applies on top (depends on) of my previous patch titled:
"Btrfs: fix race between fs trimming and block group remove/allocation"
But the issue in fact was already present before that change, it only
became easier to hit after Josef's 3.18 patch that added automatic
removal of empty block groups.
Josef Bacik [Wed, 26 Nov 2014 16:52:54 +0000 (11:52 -0500)]
Btrfs: make get_caching_control unconditionally return the ctl
This was written when we didn't do a caching control for the fast free space
cache loading. However we started doing that a long time ago, and there is
still a small window of time that we could be caching the block group the fast
way, so if there is a caching_ctl at all on the block group just return it, the
callers all wait properly for what they want. Thanks,
Filipe Manana [Thu, 27 Nov 2014 21:14:15 +0000 (21:14 +0000)]
Btrfs: fix race between fs trimming and block group remove/allocation
Our fs trim operation, which is completely transactionless (doesn't start
or joins an existing transaction) consists of visiting all block groups
and then for each one to iterate its free space entries and perform a
discard operation against the space range represented by the free space
entries. However before performing a discard, the corresponding free space
entry is removed from the free space rbtree, and when the discard completes
it is added back to the free space rbtree.
If a block group remove operation happens while the discard is ongoing (or
before it starts and after a free space entry is hidden), we end up not
waiting for the discard to complete, remove the extent map that maps
logical address to physical addresses and the corresponding chunk metadata
from the the chunk and device trees. After that and before the discard
completes, the current running transaction can finish and a new one start,
allowing for new block groups that map to the same physical addresses to
be allocated and written to.
So fix this by keeping the extent map in memory until the discard completes
so that the same physical addresses aren't reused before it completes.
If the physical locations that are under a discard operation end up being
used for a new metadata block group for example, and dirty metadata extents
are written before the discard finishes (the VM might call writepages() of
our btree inode's i_mapping for example, or an fsync log commit happens) we
end up overwriting metadata with zeroes, which leads to errors from fsck
like the following:
checking extents
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
read block failed check_tree_block
owner ref check failed [833912832 16384]
Errors found in extent allocation tree or chunk allocation
checking free space cache
checking fs roots
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
Check tree block failed, want=833912832, have=0
read block failed check_tree_block
root 5 root dir 256 error
root 5 inode 260 errors 2001, no inode item, link count wrong
unresolved ref dir 256 index 0 namelen 8 name foobar_3 filetype 1 errors 6, no dir index, no inode ref
root 5 inode 262 errors 2001, no inode item, link count wrong
unresolved ref dir 256 index 0 namelen 8 name foobar_5 filetype 1 errors 6, no dir index, no inode ref
root 5 inode 263 errors 2001, no inode item, link count wrong
(...)
Filipe Manana [Mon, 1 Dec 2014 17:04:09 +0000 (17:04 +0000)]
Btrfs: fix race between writing free space cache and trimming
Trimming is completely transactionless, and the way it operates consists
of hiding free space entries from a block group, perform the trim/discard
and then make the free space entries visible again.
Therefore while a free space entry is being trimmed, we can have free space
cache writing running in parallel (as part of a transaction commit) which
will miss the free space entry. This means that an unmount (or crash/reboot)
after that transaction commit and mount again before another transaction
starts/commits after the discard finishes, we will have some free space
that won't be used again unless the free space cache is rebuilt. After the
unmount, fsck (btrfsck, btrfs check) reports the issue like the following
example:
*** fsck.btrfs output ***
checking extents
checking free space cache
There is no free space entry for 521764864-521781248
There is no free space entry for 521764864-1103101952
cache appears valid but isnt 29360128
Checking filesystem on /dev/sdc
UUID: b4789e27-4774-4626-98e9-ae8dfbfb0fb5
found 1235681286 bytes used err is -22
(...)
Another issue caused by this race is a crash while writing bitmap entries
to the cache, because while the cache writeout task accesses the bitmaps,
the trim task can be concurrently modifying the bitmap or worse might
be freeing the bitmap. The later case results in the following crash:
Fix this by serializing both tasks in such a way that cache writeout
doesn't wait for the trim/discard of free space entries to finish and
doesn't miss any free space entry.
Filipe Manana [Wed, 26 Nov 2014 15:28:55 +0000 (15:28 +0000)]
Btrfs: make btrfs_abort_transaction consider existence of new block groups
If the transaction handle doesn't have used blocks but has created new block
groups make sure we turn the fs into readonly mode too. This is because the
new block groups didn't get all their metadata persisted into the chunk and
device trees, and therefore if a subsequent transaction starts, allocates
space from the new block groups, writes data or metadata into that space,
commits successfully and then after we unmount and mount the filesystem
again, the same space can be allocated again for a new block group,
resulting in file data or metadata corruption.
Example where we don't abort the transaction when we fail to finish the
chunk allocation (add items to the chunk and device trees) and later a
future transaction where the block group is removed fails because it can't
find the chunk item in the chunk tree:
Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit c92f6be34c501406daf5e61f3569a1813f985393)
Filipe Manana [Wed, 26 Nov 2014 15:28:50 +0000 (15:28 +0000)]
Btrfs: fix invalid block group rbtree access after bg is removed
If we grab a block group, for example in btrfs_trim_fs(), we will be holding
a reference on it but the block group can be removed after we got it (via
btrfs_remove_block_group), which means it will no longer be part of the
rbtree.
However, btrfs_remove_block_group() was only calling rb_erase() which leaves
the block group's rb_node left and right child pointers with the same content
they had before calling rb_erase. This was dangerous because a call to
next_block_group() would access the node's left and right child pointers (via
rb_next), which can be no longer valid.
Fix this by clearing a block group's node after removing it from the tree,
and have next_block_group() do a tree search to get the next block group
instead of using rb_next() if our block group was removed.
Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit 292cbd51ecf85d73195a3e3193937fa770f6ea71)
Filipe Manana [Wed, 26 Nov 2014 15:28:51 +0000 (15:28 +0000)]
Btrfs: fix crash caused by block group removal
If we remove a block group (because it became empty), we might have left
a caching_ctl structure in fs_info->caching_block_groups that points to
the block group and is accessed at transaction commit time. This results
in accessing an invalid or incorrect block group. This issue became visible
after Josef's patch "Btrfs: remove empty block groups automatically".
So if the block group is removed make sure we don't leave a dangling
caching_ctl in caching_block_groups.
Filipe Manana [Wed, 26 Nov 2014 15:28:52 +0000 (15:28 +0000)]
Btrfs: fix freeing used extents after removing empty block group
There's a race between adding a block group to the list of the unused
block groups and removing an unused block group (cleaner kthread) that
leads to freeing extents that are in use or a crash during transaction
commmit. Basically the cleaner kthread, when executing
btrfs_delete_unused_bgs(), might catch the newly added block group to
the list fs_info->unused_bgs and clear the range representing the whole
group from fs_info->freed_extents[] before the task that added the block
group to the list (running update_block_group()) marked the last freed
extent as dirty in fs_info->freed_extents (pinned_extents).
That is:
CPU 1 CPU 2
btrfs_delete_unused_bgs()
update_block_group()
add block group to
fs_info->unused_bgs
got block group from the list
clear_extent_bits for the whole
block group range in freed_extents[]
set_extent_dirty for the
range covering the freed
extent in freed_extents[]
(fs_info->pinned_extents)
block group deleted, and a new block
group with the same logical address is
created
reserve space from the new block group
for new data or metadata - the reserved
space overlaps the range specified by
CPU 1 for set_extent_dirty()
commit transaction
find all ranges marked as dirty in
fs_info->pinned_extents, clear them
and add them to the free space cache
Alternatively, if CPU 2 doesn't create a new block group with the same
logical address, we get a crash/BUG_ON at transaction commit when unpining
extent ranges because we can't find a block group for the range marked as
dirty by CPU 1. Sample trace:
So fix this by making update_block_group() first set the range as dirty
in pinned_extents before adding the block group to the unused_bgs list.
Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit ae0ab003f2840e7ed66ed762ec0d4e298ad1e725)
Miao Xie [Thu, 23 Oct 2014 06:42:50 +0000 (14:42 +0800)]
Btrfs, scrub: repair the common data on RAID5/6 if it is corrupted
This patch implement the RAID5/6 common data repair function, the
implementation is similar to the scrub on the other RAID such as
RAID1, the differentia is that we don't read the data from the
mirror, we use the data repair function of RAID5/6.
Miao Xie [Wed, 15 Oct 2014 03:18:44 +0000 (11:18 +0800)]
Btrfs, raid56: don't change bbio and raid_map
Because we will reuse bbio and raid_map during the scrub later, it is
better that we don't change any variant of bbio and don't free it at
the end of IO request. So we introduced similar variants into the raid
bio, and don't access those bbio's variants any more.
Filipe Manana [Mon, 3 Nov 2014 14:08:39 +0000 (14:08 +0000)]
Btrfs: fix freeing used extent after removing empty block group
Due to ignoring errors returned by clear_extent_bits (at the moment only
-ENOMEM is possible), we can end up freeing an extent that is actually in
use (i.e. return the extent to the free space cache).
The sequence of steps that lead to this:
1) Cleaner thread starts execution and calls btrfs_delete_unused_bgs(), with
the goal of freeing empty block groups;
2) btrfs_delete_unused_bgs() finds an empty block group, joins the current
transaction (or starts a new one if none is running) and attempts to
clear the EXTENT_DIRTY bit for the block group's range from freed_extents[0]
and freed_extents[1] (of which one corresponds to fs_info->pinned_extents);
3) Clearing the EXTENT_DIRTY bit (via clear_extent_bits()) fails with
-ENOMEM, but such error is ignored and btrfs_delete_unused_bgs() proceeds
to delete the block group and the respective chunk, while pinned_extents
remains with that bit set for the whole (or a part of the) range covered
by the block group;
4) Later while the transaction is still running, the chunk ends up being reused
for a new block group (maybe for different purpose, data or metadata), and
extents belonging to the new block group are allocated for file data or btree
nodes/leafs;
5) The current transaction is committed, meaning that we unpinned one or more
extents from the new block group (through btrfs_finish_extent_commit() and
unpin_extent_range()) which are now being used for new file data or new
metadata (through btrfs_finish_extent_commit() and unpin_extent_range()).
And unpinning means we returned the extents to the free space cache of the
new block group, which implies those extents can be used for future allocations
while they're still in use.
Alternatively, we can hit a BUG_ON() when doing a lookup for a block group's cache
object in unpin_extent_range() if a new block group didn't end up being allocated for
the same chunk (step 4 above).
Fix this by not freeing the block group and chunk if we fail to clear the dirty bit.
Filipe Manana [Tue, 21 Oct 2014 10:11:41 +0000 (11:11 +0100)]
Btrfs: ensure send always works on roots without orphans
Move the logic from the snapshot creation ioctl into send. This avoids
doing the transaction commit if send isn't used, and ensures that if
a crash/reboot happens after the transaction commit that created the
snapshot and before the transaction commit that switched the commit
root, send will not get a commit root that differs from the main root
(that has orphan items).
Filipe Manana [Wed, 29 Oct 2014 11:57:59 +0000 (11:57 +0000)]
Btrfs: fix snapshot inconsistency after a file write followed by truncate
If right after starting the snapshot creation ioctl we perform a write against a
file followed by a truncate, with both operations increasing the file's size, we
can get a snapshot tree that reflects a state of the source subvolume's tree where
the file truncation happened but the write operation didn't. This leaves a gap
between 2 file extent items of the inode, which makes btrfs' fsck complain about it.
For example, if we perform the following file operations:
and the snapshot creation ioctl was just called before the second write, we often
can get the following inode items in the snapshot's btree:
item 120 key (257 INODE_ITEM 0) itemoff 7987 itemsize 160
inode generation 146 transid 7 size 90123 block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0 flags 0x0
item 121 key (257 INODE_REF 256) itemoff 7967 itemsize 20
inode ref index 282 namelen 10 name: foobar
item 122 key (257 EXTENT_DATA 0) itemoff 7914 itemsize 53
extent data disk byte 1104855040 nr 32768
extent data offset 0 nr 32768 ram 32768
extent compression 0
item 123 key (257 EXTENT_DATA 53248) itemoff 7861 itemsize 53
extent data disk byte 0 nr 0
extent data offset 0 nr 40960 ram 40960
extent compression 0
There's a file range, corresponding to the interval [32K; ALIGN(16K + 32770, 4096)[
for which there's no file extent item covering it. This is because the file write
and file truncate operations happened both right after the snapshot creation ioctl
called btrfs_start_delalloc_inodes(), which means we didn't start and wait for the
ordered extent that matches the write and, in btrfs_setsize(), we were able to call
btrfs_cont_expand() before being able to commit the current transaction in the
snapshot creation ioctl. So this made it possibe to insert the hole file extent
item in the source subvolume (which represents the region added by the truncate)
right before the transaction commit from the snapshot creation ioctl.
Btrfs' fsck tool complains about such cases with a message like the following:
>From a user perspective, the expectation when a snapshot is created while those
file operations are being performed is that the snapshot will have a file that
either:
1) is empty
2) only the first write was captured
3) only the 2 writes were captured
4) both writes and the truncation were captured
But never capture a state where only the first write and the truncation were
captured (since the second write was performed before the truncation).
Qu Wenruo [Thu, 30 Oct 2014 08:52:31 +0000 (16:52 +0800)]
btrfs: Fix a lockdep warning when running xfstest.
The following lockdep warning is triggered during xfstests:
[ 1702.980872] =========================================================
[ 1702.981181] [ INFO: possible irq lock inversion dependency detected ]
[ 1702.981482] 3.18.0-rc1 #27 Not tainted
[ 1702.981781] ---------------------------------------------------------
[ 1702.982095] kswapd0/77 just changed the state of lock:
[ 1702.982415] (&delayed_node->mutex){+.+.-.}, at: [<ffffffffa03b0b51>] __btrfs_release_delayed_node+0x41/0x1f0 [btrfs]
[ 1702.982794] but this lock took another, RECLAIM_FS-unsafe lock in the past:
[ 1702.983160] (&fs_info->dev_replace.lock){+.+.+.}
and interrupts could create inverse lock ordering between them.
[ 1702.984675]
other info that might help us debug this:
[ 1702.985524] Chain exists of:
&delayed_node->mutex --> &found->groups_sem --> &fs_info->dev_replace.lock
[ 1702.986799] Possible interrupt unsafe locking scenario:
It is because the btrfs_kobj_{add/rm}_device() will call memory
allocation with GFP_KERNEL,
which may flush fs page cache to free space, waiting for it self to do
the commit, causing the deadlock.
To solve the problem, move btrfs_kobj_{add/rm}_device() out of the
dev_replace lock range, also involing split the
btrfs_rm_dev_replace_srcdev() function into remove and free parts.
Now only btrfs_rm_dev_replace_remove_srcdev() is called in dev_replace
lock range, and kobj_{add/rm} and btrfs_rm_dev_replace_free_srcdev() are
called out of the lock range.
Filipe Manana [Thu, 13 Nov 2014 17:01:45 +0000 (17:01 +0000)]
Btrfs: ensure ordered extent errors aren't missed on fsync
When doing a fsync with a fast path we have a time window where we can miss
the fact that writeback of some file data failed, and therefore we endup
returning success (0) from fsync when we should return an error.
The steps that lead to this are the following:
1) We start all ordered extents by calling filemap_fdatawrite_range();
2) We do some other work like locking the inode's i_mutex, start a transaction,
start a log transaction, etc;
3) We enter btrfs_log_inode(), acquire the inode's log_mutex and collect all the
ordered extents from inode's ordered tree into a list;
4) But by the time we do ordered extent collection, some ordered extents we started
at step 1) might have already completed with an error, and therefore we didn't
found them in the ordered tree and had no idea they finished with an error. This
makes our fsync return success (0) to userspace, but has no bad effects on the log
like for example insertion of file extent items into the log that point to unwritten
extents, because the invalid extent maps were removed before the ordered extent
completed (in inode.c:btrfs_finish_ordered_io).
So after collecting the ordered extents just check if the inode's i_mapping has any
error flags set (AS_EIO or AS_ENOSPC) and leave with an error if it does. Whenever
writeback fails for a page of an ordered extent, we call mapping_set_error (done in
extent_io.c:end_extent_writepage, called by extent_io.c:end_bio_extent_writepage)
that sets one of those error flags in the inode's i_mapping flags.
This change also has the side effect of fixing the issue where for fast fsyncs we
never checked/cleared the error flags from the inode's i_mapping flags, which means
that a full fsync performed after a fast fsync could get such errors that belonged
to the fast fsync - because the full fsync calls btrfs_wait_ordered_range() which
calls filemap_fdatawait_range(), and the later checks for and clears those flags,
while for fast fsyncs we never call filemap_fdatawait_range() or anything else
that checks for and clears the error flags from the inode's i_mapping.
Filipe Manana [Thu, 13 Nov 2014 17:00:35 +0000 (17:00 +0000)]
Btrfs: collect only the necessary ordered extents on ranged fsync
Instead of collecting all ordered extents from the inode's ordered tree
and then wait for all of them to complete, just collect the ones that
overlap the fsync range.
Filipe Manana [Thu, 13 Nov 2014 16:59:53 +0000 (16:59 +0000)]
Btrfs: don't ignore log btree writeback errors
If an error happens during writeback of log btree extents, make sure the
error is returned to the caller (fsync), so that it takes proper action
(commit current transaction) instead of writing a superblock that points
to log btrees with all or some nodes that weren't durably persisted.
This leads to an ABBA pattern deadlock. To fix it,
o we just change it to an AABB pattern which means to @unlock_extent_bits()
before we @lock_page(), and in this way the @extent_read_full_page_nolock()
is no longer in an locked context, so change it back to @extent_read_full_page()
to regain protection.
o Since we @unlock_extent_bits() earlier, then before @write_page_nocow(),
the extent may not really point at the physical block we want, so we
have to check it before write.
Signed-off-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com> Tested-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit 321592427c0146126aadfab8a9b663de1875c9f4)
Filipe Manana [Sun, 9 Nov 2014 08:38:39 +0000 (08:38 +0000)]
Btrfs: make xattr replace operations atomic
Replacing a xattr consists of doing a lookup for its existing value, delete
the current value from the respective leaf, release the search path and then
finally insert the new value. This leaves a time window where readers (getxattr,
listxattrs) won't see any value for the xattr. Xattrs are used to store ACLs,
so this has security implications.
This change also fixes 2 other existing issues which were:
*) Deleting the old xattr value without verifying first if the new xattr will
fit in the existing leaf item (in case multiple xattrs are packed in the
same item due to name hash collision);
*) Returning -EEXIST when the flag XATTR_CREATE is given and the xattr doesn't
exist but we have have an existing item that packs muliple xattrs with
the same name hash as the input xattr. In this case we should return ENOSPC.
A test case for xfstests follows soon.
Thanks to Alexandre Oliva for reporting the non-atomicity of the xattr replace
implementation.
Reported-by: Alexandre Oliva <oliva@gnu.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit 5f5bc6b1e2d5a6f827bc860ef2dc5b6f365d1339)
Filipe Manana [Mon, 3 Nov 2014 14:12:57 +0000 (14:12 +0000)]
Btrfs: avoid premature -ENOMEM in clear_extent_bit()
We try to allocate an extent state structure before acquiring the extent
state tree's spinlock as we might need a new one later and therefore avoid
doing later an atomic allocation while holding the tree's spinlock. However
we returned -ENOMEM if that initial non-atomic allocation failed, which is
a bit excessive since we might end up not needing the pre-allocated extent
state at all - for the case where the tree doesn't have any extent states
that cover the input range and cover too any other range. Therefore don't
return -ENOMEM if that pre-allocation fails.
Josef Bacik [Mon, 3 Nov 2014 13:56:50 +0000 (08:56 -0500)]
Btrfs: don't take the chunk_mutex/dev_list mutex in statfs V2
Our gluster boxes get several thousand statfs() calls per second, which begins
to suck hardcore with all of the lock contention on the chunk mutex and dev list
mutex. We don't really need to hold these things, if we have transient
weirdness with statfs() because of the chunk allocator we don't care, so remove
this locking.
We still need the dev_list lock if you mount with -o alloc_start however, which
is a good argument for nuking that thing from orbit, but that's a patch for
another day. Thanks,
Josef Bacik [Fri, 31 Oct 2014 13:49:34 +0000 (09:49 -0400)]
Btrfs: move read only block groups onto their own list V2
Our gluster boxes were spending lots of time in statfs because our fs'es are
huge. The problem is statfs loops through all of the block groups looking for
read only block groups, and when you have several terabytes worth of data that
ends up being a lot of block groups. Move the read only block groups onto a
read only list and only proces that list in
btrfs_account_ro_block_groups_free_space to reduce the amount of churn. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit 633c0aad4c0243a506a3e8590551085ad78af82d)
Stefan Behrens [Fri, 17 Oct 2014 12:10:10 +0000 (14:10 +0200)]
Btrfs: check-int: don't complain about balanced blocks
The xfstest btrfs/014 which tests the balance operation caused that the
check_int module complained that known blocks changed their physical
location. Since this is not an error in this case, only print such
message if the verbose mode was enabled.
Reported-by: Wang Shilong <wangshilong1991@gmail.com> Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Tested-by: Wang Shilong <wangshilong1991@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit cf90c59e680ee83c263a60ee2cd464a7d96cc6d5)
Filipe Manana [Mon, 13 Oct 2014 11:28:37 +0000 (12:28 +0100)]
Btrfs: deal with convert_extent_bit errors to avoid fs corruption
When committing a transaction or a log, we look for btree extents that
need to be durably persisted by searching for ranges in a io tree that
have some bits set (EXTENT_DIRTY or EXTENT_NEW). We then attempt to clear
those bits and set the EXTENT_NEED_WAIT bit, with calls to the function
convert_extent_bit, and then start writeback for the extents.
That function however can return an error (at the moment only -ENOMEM
is possible, specially when it does GFP_ATOMIC allocation requests
through alloc_extent_state_atomic) - that means the ranges didn't got
the EXTENT_NEED_WAIT bit set (or at least not for the whole range),
which in turn means a call to btrfs_wait_marked_extents() won't find
those ranges for which we started writeback, causing a transaction
commit or a log commit to persist a new superblock without waiting
for the writeback of extents in that range to finish first.
Therefore if a crash happens after persisting the new superblock and
before writeback finishes, we have a superblock pointing to roots that
weren't fully persisted or roots that point to nodes or leafs that weren't
fully persisted, causing all sorts of unexpected/bad behaviour as we endup
reading garbage from disk or the content of some node/leaf from a past
generation that got cowed or deleted and is no longer valid (for this later
case we end up getting error messages like "parent transid verify failed on
X wanted Y found Z" when reading btree nodes/leafs from disk).
Filipe Manana [Mon, 13 Oct 2014 11:28:39 +0000 (12:28 +0100)]
Btrfs: avoid returning -ENOMEM in convert_extent_bit() too early
We try to allocate an extent state before acquiring the tree's spinlock
just in case we end up needing to split an existing extent state into two.
If that allocation failed, we would return -ENOMEM.
However, our only single caller (transaction/log commit code), passes in
an extent state that was cached from a call to find_first_extent_bit() and
that has a very high chance to match exactly the input range (always true
for a transaction commit and very often, but not always, true for a log
commit) - in this case we end up not needing at all that initial extent
state used for an eventual split. Therefore just don't return -ENOMEM if
we can't allocate the temporary extent state, since we might not need it
at all, and if we end up needing one, we'll do it later anyway.
Stefan Behrens [Thu, 16 Oct 2014 15:48:48 +0000 (17:48 +0200)]
Btrfs: check_int: use the known block location
The xfstest btrfs/014 which tests the balance operation caused issues with
the check_int module. The attempt was made to use btrfs_map_block() to
find the physical location for a written block. However, this was not
at all needed since the location of the written block was known since
a hook to submit_bio() was the reason for entering the check_int module.
Additionally, after a block relocation it happened that btrfs_map_block()
failed causing misleading error messages afterwards.
This patch changes the check_int module to use the known information of
the physical location from the bio.
Reported-by: Wang Shilong <wangshilong1991@gmail.com> Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Tested-by: Wang Shilong <wangshilong1991@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit f382e4653f06f4e285b7800f80e6548e026f90a8)
Filipe Manana [Fri, 10 Oct 2014 09:45:12 +0000 (10:45 +0100)]
Btrfs: report error after failure inlining extent in compressed write path
If cow_file_range_inline() failed, when called from compress_file_range(),
we were tagging the locked page for writeback, end its writeback and unlock it,
but not marking it with an error nor setting AS_EIO in inode's mapping flags.
This made it impossible for a caller of filemap_fdatawrite_range (writepages)
or filemap_fdatawait_range() to know that an error happened. And the return
value of compress_file_range() is useless because it's returned to a workqueue
task and not to the task calling filemap_fdatawrite_range (writepages).
This change applies on top of the previous patchset starting at the patch
titled:
"[1/5] Btrfs: set page and mapping error on compressed write failure"
Which changed extent_clear_unlock_delalloc() to use SetPageError and
mapping_set_error().
Since we are allocating memory for hash table array, so
it will be good if we could allocate continuous pages here.
Fix this problem by firstly trying kzalloc(), if we fail,
use vzalloc() instead.
Signed-off-by: Wang Shilong <wangshilong1991@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit 6b3a4d60dbf7a60b91d810498e8c212f26718efd)
Eryu Guan [Mon, 13 Oct 2014 04:42:12 +0000 (12:42 +0800)]
Btrfs: return failure if btrfs_dev_replace_finishing() failed
device replace could fail due to another running scrub process or any
other errors btrfs_scrub_dev() may hit, but this failure doesn't get
returned to userspace.
The following steps could reproduce this issue
mkfs -t btrfs -f /dev/sdb1 /dev/sdb2
mount /dev/sdb1 /mnt/btrfs
while true; do btrfs scrub start -B /mnt/btrfs >/dev/null 2>&1; done &
btrfs replace start -Bf /dev/sdb2 /dev/sdb3 /mnt/btrfs
# if this replace succeeded, do the following and repeat until
# you see this log in dmesg
# BTRFS: btrfs_scrub_dev(/dev/sdb2, 2, /dev/sdb3) failed -115
#btrfs replace start -Bf /dev/sdb3 /dev/sdb2 /mnt/btrfs
# once you see the error log in dmesg, check return value of
# replace
echo $?
Introduce a new dev replace result
BTRFS_IOCTL_DEV_REPLACE_RESULT_SCRUB_INPROGRESS
to catch -EINPROGRESS explicitly and return other errors directly to
userspace.
after previous steps, inode will be detected as bad compression ratio,
and NOCOMPRESS flag will be set for that inode.
Reason is that compress have a max limit pages every time(128K), if a
132k write in, it will be splitted into two write(128k+4k), this bug
is a leftover for commit 68bb462d42a(Btrfs: don't compress for a small write)
Fix this problem by checking every time before compression, if it is a
small write(<=blocksize), we bail out and fall into nocompression directly.
Signed-off-by: Wang Shilong <wangshilong1991@gmail.com> Reviewed-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit 4bcbb33255131adbe481c0467df26d654ce3bc78)
Filipe Manana [Thu, 9 Oct 2014 20:15:44 +0000 (21:15 +0100)]
Btrfs: make inode.c:compress_file_range() return void
Its return value is useless, its single caller ignores it and can't do
anything with it anyway, since it's a workqueue task and not the task
calling filemap_fdatawrite_range (writepages) nor filemap_fdatawait_range().
Failure is communicated to such functions via start and end of writeback
with the respective pages tagged with an error and AS_EIO flag set in the
inode's imapping.
Filipe Manana [Thu, 9 Oct 2014 20:18:55 +0000 (21:18 +0100)]
Btrfs: correctly flush compressed data before/after direct IO
For compressed writes, after doing the first filemap_fdatawrite_range() we
don't get the pages tagged for writeback immediately. Instead we create
a workqueue task, which is run by other kthread, and keep the pages locked.
That other kthread compresses data, creates the respective ordered extent/s,
tags the pages for writeback and unlocks them. Therefore we need a second
call to filemap_fdatawrite_range() if we have compressed writes, as this
second call will wait for the pages to become unlocked, then see they became
tagged for writeback and finally wait for the writeback to finish.
Filipe Manana [Mon, 6 Oct 2014 21:14:24 +0000 (22:14 +0100)]
Btrfs: don't leak pages and memory on compressed write error
In inode.c:submit_compressed_extents(), if we fail before calling
btrfs_submit_compressed_write(), or when that function fails, we
were freeing the async_extent structure without releasing its pages
and freeing the pages array.
Filipe Manana [Mon, 6 Oct 2014 21:14:25 +0000 (22:14 +0100)]
Btrfs: process all async extents on compressed write failure
If we had an error when processing one of the async extents from our list,
we were not processing the remaining async extents, meaning we would leak
those async_extent structs, never release the pages with the compressed
data and never unlock and clear the dirty flag from the inode's pages (those
that correspond to the uncompressed content).
Filipe Manana [Tue, 7 Oct 2014 00:48:26 +0000 (01:48 +0100)]
Btrfs: don't ignore compressed bio write errors
Our compressed bio write end callback was essentially ignoring the error
parameter. When a write error happens, it must pass a value of 0 to the
inode's write_page_end_io_hook callback, SetPageError on the respective
pages and set AS_EIO in the inode's mapping flags, so that a call to
filemap_fdatawait_range() / filemap_fdatawait() can find out that errors
happened (we surely don't want silent failures on fsync for example).
Filipe Manana [Mon, 6 Oct 2014 21:14:22 +0000 (22:14 +0100)]
Btrfs: set page and mapping error on compressed write failure
If we fail in submit_compressed_extents() before calling btrfs_submit_compressed_write(),
we start and end the writeback for the pages (clear their dirty flag, unlock them, etc)
but we don't tag the pages, nor the inode's mapping, with an error. This makes it
impossible for a caller of filemap_fdatawait_range() (fsync, or transaction commit
for e.g.) know that there was an error.
Note that the return value of submit_compressed_extents() is useless, as that function
is executed by a workqueue task and not directly by the fill_delalloc callback. This
means the writepage/s callbacks of the inode's address space operations don't get that
return value.
Filipe Manana [Mon, 6 Oct 2014 21:14:23 +0000 (22:14 +0100)]
Btrfs: fix hang on compressed write error
In inode.c:submit_compressed_extents(), before calling btrfs_submit_compressed_write()
we start writeback for all pages, clear their dirty flag, unlock them, etc, but if
btrfs_submit_compressed_write() fails (at the moment it can only fail with -ENOMEM),
we never end the writeback on the pages, so any filemap_fdatawait_range() call will
hang forever. We were also not calling the writepage end io hook, which means the
corresponding ordered extent will never complete and all its waiters will block
forever, such as a full fsync (via btrfs_wait_ordered_range()).
David Sterba [Wed, 12 Nov 2014 13:24:35 +0000 (14:24 +0100)]
btrfs: introduce pending action: commit
In some contexts, like in sysfs handlers, we don't want to trigger a
transaction commit. It's a heavy operation, we don't know what external
locks may be taken. Instead, make it possible to finish the operation
through sync syscall or SYNC_FS ioctl.
David Sterba [Fri, 28 Mar 2014 16:38:48 +0000 (17:38 +0100)]
btrfs: do commit in sync_fs if there are pending changes
If a pending change is requested, it's not processed unless there is a
transaction commit about to happen, not even after sync or SYNC_FS
ioctl. For example a remount that toggles the inode_cache option will
not take effect after sync on a quiescent filesystem.
David Sterba [Wed, 5 Feb 2014 14:26:17 +0000 (15:26 +0100)]
btrfs: switch inode_cache option handling to pending changes
The pending mount option(s) now share namespace and bits with the normal
options, and the existing one for (inode_cache) is unset unconditionally
at each transaction commit.
Introduce a separate namespace for pending changes and enhance the
descriptions of the intended change to use separate bits for each
action.
David Sterba [Wed, 5 Feb 2014 14:26:17 +0000 (15:26 +0100)]
btrfs: add support for processing pending changes
There are some actions that modify global filesystem state but cannot be
performed at the time of request, but later at the transaction commit
time when the filesystem is in a known state.
For example enabling new incompat features on-the-fly or issuing
transaction commit from unsafe contexts (sysfs handlers).
Linus Torvalds [Thu, 22 Jan 2015 18:53:06 +0000 (06:53 +1200)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 fixes from Martin Schwidefsky:
"Five more bug fixes from Michael for the s390 BPF jit"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390/bpf: Zero extend parameters before calling C function
s390/bpf: Fix sk_load_byte_msh()
s390/bpf: Fix offset parameter for skb_copy_bits()
s390/bpf: Fix skb_copy_bits() parameter passing
s390/bpf: Fix JMP_JGE_K (A >= K) and JMP_JGT_K (A > K)
Linus Torvalds [Thu, 22 Jan 2015 18:40:36 +0000 (06:40 +1200)]
Merge tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux
Pull module and param fixes from Rusty Russell:
"Surprising number of fixes this merge window :(
The first two are minor fallout from the param rework which went in
this merge window.
The next three are a series which fixes a longstanding (but never
previously reported and unlikely , so no CC stable) race between
kallsyms and freeing the init section.
Finally, a minor cleanup as our module refcount will now be -1 during
unload"
* tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
module: make module_refcount() a signed integer.
module: fix race in kallsyms resolution during module load success.
module: remove mod arg from module_free, rename module_memfree().
module_arch_freeing_init(): new hook for archs before module->module_init freed.
param: fix uninitialized read with CONFIG_DEBUG_LOCK_ALLOC
param: initialize store function to NULL if not available.
Rusty Russell [Thu, 22 Jan 2015 00:43:14 +0000 (11:13 +1030)]
module: make module_refcount() a signed integer.
James Bottomley points out that it will be -1 during unload. It's
only used for diagnostics, so let's not hide that as it could be a
clue as to what's gone wrong.
Cc: Jason Wessel <jason.wessel@windriver.com> Acked-and-documention-added-by: James Bottomley <James.Bottomley@HansenPartnership.com> Reviewed-by: Masami Hiramatsu <maasami.hiramatsu.pt@hitachi.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Linus Torvalds [Wed, 21 Jan 2015 18:26:07 +0000 (06:26 +1200)]
Merge tag 'trace-sh-3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull superh tracing fix from Steven Rostedt:
"It's been reported that function tracing does not work on the sh
architecture because gcc 4.8 for superH does not support -m32, and the
recordmcount.pl script adds "-m32" when re-compiling the object files
with the mcount locations.
I was not able to reproduce this problem, as it seems that -m32 works
fine for my cross compiler gcc 4.6.3, but I have to assume that -m32
was deprecated somewhere between 4.6 and 4.8. As it still seems to
compile fine without -m32, I have no reason not to add this patch, as
having -m32 seems to cause trouble for others"
* tag 'trace-sh-3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
scripts/recordmcount.pl: There is no -m32 gcc option on Super-H anymore
Linus Torvalds [Wed, 21 Jan 2015 08:37:25 +0000 (20:37 +1200)]
Merge tag 'sound-3.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"This batch contains two fixes for FireWire lib module and a quirk for
yet another Logitech WebCam. The former is the fixes for MIDI
handling I forgot to pick up during the merge window. All the fixed
code is pretty local and shouldn't give any regressions"
* tag 'sound-3.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
ALSA: usb-audio: Add mic volume fix quirk for Logitech Webcam C210
ALSA: firewire-lib: limit the MIDI data rate
ALSA: firewire-lib: remove rx_blocks_for_midi quirk
Linus Torvalds [Wed, 21 Jan 2015 08:23:33 +0000 (20:23 +1200)]
Merge branch 'drm-fixes' of git://people.freedesktop.org/~airlied/linux
Pull drm fixes from Dave Airlie:
"Just back from LCA + some days off, had some fixes from the past 2 weeks,
Some amdkfd code removal for a feature that wasn't ready, otherwise
just one fix for core helper sleeping, exynos, i915, and radeon fixes.
I thought I had some sti fixes but they were already in, and it
confused me for a few mins this morning"
* 'drm-fixes' of git://people.freedesktop.org/~airlied/linux:
drm: fb helper should avoid sleeping in panic context
drm/exynos: fix warning of vblank reference count
drm/exynos: remove unnecessary runtime pm operations
drm/exynos: fix reset codes for memory mapped hdmi phy
drm/radeon: use rv515_ring_start on r5xx
drm/radeon: add si dpm quirk list
drm/radeon: don't print error on -ERESTARTSYS
drm/i915: Fix mutex->owner inspection race under DEBUG_MUTEXES
drm/i915: Ban Haswell from using RCS flips
drm/i915: vlv: sanitize RPS interrupt mask during GPU idling
drm/i915: fix HW lockup due to missing RPS IRQ workaround on GEN6
drm/i915: gen9: fix RPS interrupt routing to CPU vs. GT
drm/exynos: remove the redundant machine checking code
drm/radeon: add a dpm quirk list
drm/amdkfd: Fix sparse warning (different address space)
drm/radeon: fix VM flush on CIK (v3)
drm/radeon: fix VM flush on SI (v3)
drm/radeon: fix VM flush on cayman/aruba (v3)
drm/amdkfd: Drop interrupt SW ring buffer
Linus Torvalds [Wed, 21 Jan 2015 06:29:44 +0000 (18:29 +1200)]
Merge tag 'mfd-fixes-3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd
Pull MFD fixes from Lee Jones:
- Avoid platform ID collision in da9052
- Skip caching volatile registers in tps65218
- Use correct address base in tps65218
- Repair deadlock on suspend in rtsx_usb
* tag 'mfd-fixes-3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd:
mfd: rtsx_usb: Fix runtime PM deadlock
mfd: tps65218: Make INT1 our status_base register
mfd: tps65218: Make INT[12] and STATUS registers volatile
mfd: da9052-core: Fix platform-device id collision
Qu Wenruo [Mon, 19 Jan 2015 07:42:41 +0000 (15:42 +0800)]
btrfs: Don't call btrfs_start_transaction() on frozen fs to avoid deadlock.
Commit 6b5fe46dfa52 (btrfs: do commit in sync_fs if there are pending
changes) will call btrfs_start_transaction() in sync_fs(), to handle
some operations needed to be done in next transaction.
However this can cause deadlock if the filesystem is frozen, with the
following sys_r+w output:
[ 143.255932] Call Trace:
[ 143.255936] [<ffffffff816c0e09>] schedule+0x29/0x70
[ 143.255939] [<ffffffff811cb7f3>] __sb_start_write+0xb3/0x100
[ 143.255971] [<ffffffffa040ec06>] start_transaction+0x2e6/0x5a0
[btrfs]
[ 143.255992] [<ffffffffa040f1eb>] btrfs_start_transaction+0x1b/0x20
[btrfs]
[ 143.256003] [<ffffffffa03dc0ba>] btrfs_sync_fs+0xca/0xd0 [btrfs]
[ 143.256007] [<ffffffff811f7be0>] sync_fs_one_sb+0x20/0x30
[ 143.256011] [<ffffffff811cbd01>] iterate_supers+0xe1/0xf0
[ 143.256014] [<ffffffff811f7d75>] sys_sync+0x55/0x90
[ 143.256017] [<ffffffff816c49d2>] system_call_fastpath+0x12/0x17
[ 143.256111] Call Trace:
[ 143.256114] [<ffffffff816c0e09>] schedule+0x29/0x70
[ 143.256119] [<ffffffff816c3405>] rwsem_down_write_failed+0x1c5/0x2d0
[ 143.256123] [<ffffffff8133f013>] call_rwsem_down_write_failed+0x13/0x20
[ 143.256131] [<ffffffff811caae8>] thaw_super+0x28/0xc0
[ 143.256135] [<ffffffff811db3e5>] do_vfs_ioctl+0x3f5/0x540
[ 143.256187] [<ffffffff811db5c1>] SyS_ioctl+0x91/0xb0
[ 143.256213] [<ffffffff816c49d2>] system_call_fastpath+0x12/0x17
The reason is like the following:
(Holding s_umount)
VFS sync_fs staff:
|- btrfs_sync_fs()
|- btrfs_start_transaction()
|- sb_start_intwrite()
(Waiting thaw_fs to unfreeze)
VFS thaw_fs staff:
thaw_fs()
(Waiting sync_fs to release
s_umount)
So deadlock happens.
This can be easily triggered by fstest/generic/068 with inode_cache
mount option.
The fix is to check if the fs is frozen, if the fs is frozen, just
return and waiting for the next transaction.
Cc: David Sterba <dsterba@suse.cz> Reported-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
[enhanced comment, changed to SB_FREEZE_WRITE] Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
Qu Wenruo [Tue, 20 Jan 2015 09:05:33 +0000 (17:05 +0800)]
btrfs: Fix the bug that fs_info->pending_changes is never cleared.
Fs_info->pending_changes is never cleared since the original code uses
cmpxchg(&fs_info->pending_changes, 0, 0), which will only clear it if
pending_changes is already 0.
This will cause a lot of problem when mount it with inode_cache mount
option.
If the btrfs is mounted as inode_cache, pending_changes will always be
1, even when the fs is frozen.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
Dave Airlie [Tue, 20 Jan 2015 23:26:47 +0000 (09:26 +1000)]
Merge tag 'drm-amdkfd-fixes-2015-01-13' of git://people.freedesktop.org/~gabbayo/linux into drm-fixes
- Remove the interrupt SW ring buffer impl. as it is not used by any module
in amdkfd.
- Fix a sparse warning
* tag 'drm-amdkfd-fixes-2015-01-13' of git://people.freedesktop.org/~gabbayo/linux:
drm/amdkfd: Fix sparse warning (different address space)
drm/amdkfd: Drop interrupt SW ring buffer
Dave Airlie [Tue, 20 Jan 2015 23:26:28 +0000 (09:26 +1000)]
Merge tag 'drm-intel-fixes-2015-01-15' of git://anongit.freedesktop.org/drm-intel into drm-fixes
misc i915 fixes
* tag 'drm-intel-fixes-2015-01-15' of git://anongit.freedesktop.org/drm-intel:
drm/i915: Fix mutex->owner inspection race under DEBUG_MUTEXES
drm/i915: Ban Haswell from using RCS flips
drm/i915: vlv: sanitize RPS interrupt mask during GPU idling
drm/i915: fix HW lockup due to missing RPS IRQ workaround on GEN6
drm/i915: gen9: fix RPS interrupt routing to CPU vs. GT
There's __drm_modeset_lock_all() which Daniel Vetter introduced for this
purpose. We can leverage that without reinventing anything. This patch
works with the latest kernel.
Reviewed-by: Rob Clark <robdclark@gmail.com> Tested-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Rui Wang <rui.y.wang@intel.com> Signed-off-by: Dave Airlie <airlied@redhat.com>
Dave Airlie [Tue, 20 Jan 2015 23:25:19 +0000 (09:25 +1000)]
Merge branch 'exynos-drm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/daeinki/drm-exynos into drm-fixes
This pull request includes below fixups,
- Remove duplicated machine checking.
. It seems that this code was added when you merged 'v3.18-rc7' into
drm-next. commit id : e8115e79aa62b6ebdb3e8e61ca4092cc32938afc
- Fix hdmiphy reset.
. Exynos hdmi has two interfaces to control hdmyphy, one is I2C, other
is APB bus - memory mapped I/O. So this patch makes hdmiphy reset
to be done according to interfaces, I2C or APB bus.
- And add some exception codes.
* 'exynos-drm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/daeinki/drm-exynos:
drm/exynos: fix warning of vblank reference count
drm/exynos: remove unnecessary runtime pm operations
drm/exynos: fix reset codes for memory mapped hdmi phy
drm/exynos: remove the redundant machine checking code
Dave Airlie [Tue, 20 Jan 2015 23:21:32 +0000 (09:21 +1000)]
Merge branch 'drm-fixes-3.19' of git://people.freedesktop.org/~agd5f/linux into drm-fixes
Some radeon fixes for 3.19:
- GPUVM stability fixes
- SI dpm quirks
- Regression fixes
* 'drm-fixes-3.19' of git://people.freedesktop.org/~agd5f/linux:
drm/radeon: use rv515_ring_start on r5xx
drm/radeon: add si dpm quirk list
drm/radeon: don't print error on -ERESTARTSYS
drm/radeon: add a dpm quirk list
drm/radeon: fix VM flush on CIK (v3)
drm/radeon: fix VM flush on SI (v3)
drm/radeon: fix VM flush on cayman/aruba (v3)
Linus Torvalds [Tue, 20 Jan 2015 19:54:16 +0000 (07:54 +1200)]
Merge branch 'for-3.19-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata
Pull libata fixes from Tejun Heo:
- Bartlomiej will be co-maintaining PATA portion of libata. git
workflow will stay the same.
- sata_sil24 wasn't happy with tag ordered submission. An option to
restore the old tag allocation behavior is implemented for sil24.
- a very old race condition in PIO host state machine which can trigger
BUG fixed.
- other driver-specific changes
* 'for-3.19-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata:
libata: prevent HSM state change race between ISR and PIO
libata: allow sata_sil24 to opt-out of tag ordered submission
ata: pata_at91: depend on !ARCH_MULTIPLATFORM
ahci: Remove Device ID for Intel Sunrise Point PCH
ahci: Use dev_info() to inform about the lack of Device Sleep support
libata: Whitelist SSDs that are known to properly return zeroes after TRIM
sata_dwc_460ex: fix resource leak on error path
ata: add MAINTAINERS entry for libata PATA drivers
libata: clean up MAINTAINERS entries
libata: export ata_get_cmd_descript()
ahci_xgene: Fix the DMA state machine lockup for the ATA_CMD_PACKET PIO mode command.
ahci_xgene: Fix the endianess issue in APM X-Gene SoC AHCI SATA controller driver.
Linus Torvalds [Tue, 20 Jan 2015 19:51:46 +0000 (07:51 +1200)]
Merge branch 'for-3.19-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue fix from Tejun Heo:
"The xfs folks have been running into weird and very rare lockups for
some time now. I didn't think this could have been from workqueue
side because no one else was reporting it. This time, Eric had a
kdump which we looked into and it turned out this actually was a
workqueue bug and the bug has been there since the beginning of
concurrency managed workqueue.
A worker pool ensures forward progress of the workqueues associated
with it by always having at least one worker reserved from executing
work items. When the pool is under contention, the idle one tries to
create more workers for the pool and if that doesn't succeed quickly
enough, it calls the rescuers to the pool.
This logic had a subtle race condition in an early exit path. When a
worker invokes this manager function, the function may return %false
indicating that the caller may proceed to executing work items either
because another worker is already performing the role or conditions
have changed and the pool is no longer under contention.
The latter part depended on the assumption that whether more workers
are necessary or not remains stable while the pool is locked; however,
pool->nr_running (concurrency count) may change asynchronously and it
getting bumped from zero asynchronously could send off the last idle
worker to execute work items.
The race window is fairly narrow, and, even when it gets triggered,
the pool deadlocks iff if all work items get blocked on pending work
items of the pool, which is highly unlikely but can be triggered by
xfs.
The patch removes the race window by removing the early exit path,
which doesn't server any purpose anymore anyway"
* 'for-3.19-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: fix subtle pool management issue which can stall whole worker_pool
Roger Tseng [Thu, 15 Jan 2015 07:14:44 +0000 (15:14 +0800)]
mfd: rtsx_usb: Fix runtime PM deadlock
sd_set_power_mode() in derived module drivers/mmc/host/rtsx_usb_sdmmc.c
acquires dev_mutex and then calls pm_runtime_get_sync() to make sure the
device is awake while initializing a newly inserted card. Once it is
called during suspending state and explicitly before rtsx_usb_suspend()
acquires the same dev_mutex, both routine deadlock and further hang the
driver because pm_runtime_get_sync() waits the pending PM operations.
Fix this by using an empty suspend method. mmc_core always turns the
LED off after a request is done and thus it is ok to remove the only
rtsx_usb_turn_off_led() here.
Cc: <stable@vger.kernel.org> # v3.16+ Fixes: 730876be2566 ("mfd: Add realtek USB card reader driver") Signed-off-by: Roger Tseng <rogerable@realtek.com>
[Lee: Removed newly unused variable] Signed-off-by: Lee Jones <lee.jones@linaro.org>
Felipe Balbi [Fri, 26 Dec 2014 19:28:21 +0000 (13:28 -0600)]
mfd: tps65218: Make INT1 our status_base register
If we don't tell regmap-irq that our first status
register is at offset 1, it will try to read offset
zero, which is the chipid register.
Fixes: 44b4dc6 mfd: tps65218: Add driver for the TPS65218 PMIC Cc: <stable@vger.kernel.org> # v3.15+ Signed-off-by: Felipe Balbi <balbi@ti.com> Signed-off-by: Lee Jones <lee.jones@linaro.org>
Felipe Balbi [Fri, 26 Dec 2014 19:28:20 +0000 (13:28 -0600)]
mfd: tps65218: Make INT[12] and STATUS registers volatile
STATUS register can be modified by the HW, so we
should bypass cache because of that.
In the case of INT[12] registers, they are the ones
that actually clear the IRQ source at the time they
are read. If we rely on the cache for them, we will
never be able to clear the interrupt, which will cause
our IRQ line to be disabled due to IRQ throttling.
Fixes: 44b4dc6 mfd: tps65218: Add driver for the TPS65218 PMIC Cc: <stable@vger.kernel.org> # v3.15+ Signed-off-by: Felipe Balbi <balbi@ti.com> Signed-off-by: Lee Jones <lee.jones@linaro.org>
Fabio Estevam [Wed, 10 Dec 2014 01:39:53 +0000 (23:39 -0200)]
mfd: da9052-core: Fix platform-device id collision
Allow multiple DA9052 regulators be registered by registering with
PLATFORM_DEVID_AUTO instead of PLATFORM_DEVID_NONE.
The subdevices are currently registered with PLATFORM_DEVID_NONE, which
will cause a name collision on the platform bus when multiple regulators
are registered:
[ 0.128855] da9052-regulator da9052-regulator: invalid regulator ID specified
[ 0.128973] da9052-regulator: probe of da9052-regulator failed with error -22
[ 0.129148] ------------[ cut here ]------------
[ 0.129200] WARNING: CPU: 0 PID: 1 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x5c/0x7c()
[ 0.129233] sysfs: cannot create duplicate filename '/devices/platform/soc/60000000.aips/63fc8000.i2c/i2c-0/0-0048/da9052-regulator
...
[ 0.132891] ------------[ cut here ]------------
[ 0.132924] WARNING: CPU: 0 PID: 1 at lib/kobject.c:240 kobject_add_internal+0x24c/0x2cc()
[ 0.132957] kobject_add_internal failed for da9052-regulator with -EEXIST, don't try to register things with the same name in the same directory.
...
[ 0.137000] da9052 0-0048: mfd_add_devices failed: -17
[ 0.138486] da9052: probe of 0-0048 failed with error -17
Based on the fix done by Johan Hovold at commit b6684228726cc255 ("mfd:
viperboard: Fix platform-device id collision").
Tested on a imx53-qsb board, where multiple DA9053 regulators can be
successfully probed.
Signed-off-by: Fabio Estevam <fabio.estevam@freescale.com> Signed-off-by: Lee Jones <lee.jones@linaro.org>
Linus Torvalds [Tue, 20 Jan 2015 09:23:41 +0000 (21:23 +1200)]
Merge tag 'pinctrl-v3.19-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl
Pull pin control fixes from Linus Walleij:
"Here is a (hopefully final) slew of pin control fixes for the v3.19
series. The deadlock fix is kind of serious and tagged for stable,
the rest is business as usual.
- Fix two deadlocks around the pin control mutexes, a long-standing
issue that manifest itself in plug/unplug of pin controllers.
(Tagged for stable.)
- Handle an error path with zero functions in the Qualcomm pin
controller.
- Drop a bogus second GPIO chip added in the Lantiq driver.
- Fix sudden IRQ loss on Rockchip pin controllers.
- Register the GIT tree in MAINTAINERS"
* tag 'pinctrl-v3.19-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
pinctrl: MAINTAINERS: add git tree reference
pinctrl: qcom: Don't iterate past end of function array
pinctrl: lantiq: remove bogus of_gpio_chip_add
pinctrl: Fix two deadlocks
pinctrl: rockchip: Avoid losing interrupts when supporting both edges
1) Socket addresses returned in the error queue need to be fully
initialized before being passed on to userspace, fix from Willem de
Bruijn.
2) Interrupt handling fixes to davinci_emac driver from Tony Lindgren.
3) Fix races between receive packet steering and cpu hotplug, from Eric
Dumazet.
4) Allowing netlink sockets to subscribe to unknown multicast groups
leads to crashes, don't allow it. From Johannes Berg.
5) One to many socket races in SCTP fixed by Daniel Borkmann.
6) Put in a guard against the mis-use of ipv6 atomic fragments, from
Hagen Paul Pfeifer.
7) Fix promisc mode and ethtool crashes in sh_eth driver, from Ben
Hutchings.
8) NULL deref and double kfree fix in sxgbe driver from Girish K.S and
Byungho An.
9) cfg80211 deadlock fix from Arik Nemtsov.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (36 commits)
s2io: use snprintf() as a safety feature
r8152: remove sram_read
r8152: remove generic_ocp_read before writing
bgmac: activate irqs only if there is nothing to poll
bgmac: register napi before the device
sh_eth: Fix ethtool operation crash when net device is down
sh_eth: Fix promiscuous mode on chips without TSU
ipv6: stop sending PTB packets for MTU < 1280
net: sctp: fix race for one-to-many sockets in sendmsg's auto associate
genetlink: synchronize socket closing and family removal
genetlink: disallow subscribing to unknown mcast groups
genetlink: document parallel_ops
net: rps: fix cpu unplug
net: davinci_emac: Add support for emac on dm816x
net: davinci_emac: Fix ioremap for devices with MDIO within the EMAC address space
net: davinci_emac: Fix incomplete code for getting the phy from device tree
net: davinci_emac: Free clock after checking the frequency
net: davinci_emac: Fix runtime pm calls for davinci_emac
net: davinci_emac: Fix hangs with interrupts
ip: zero sockaddr returned on error queue
...
Pull crypto fix from Herbert Xu:
"This fixes a regression that arose from the change to add a crypto
prefix to module names which was done to prevent the loading of
arbitrary modules through the Crypto API.
In particular, a number of modules were missing the crypto prefix
which meant that they could no longer be autoloaded"
Rusty Russell [Mon, 19 Jan 2015 22:37:05 +0000 (09:07 +1030)]
module: fix race in kallsyms resolution during module load success.
The kallsyms routines (module_symbol_name, lookup_module_* etc) disable
preemption to walk the modules rather than taking the module_mutex:
this is because they are used for symbol resolution during oopses.
This works because there are synchronize_sched() and synchronize_rcu()
in the unload and failure paths. However, there's one case which doesn't
have that: the normal case where module loading succeeds, and we free
the init section.
We don't want a synchronize_rcu() there, because it would slow down
module loading: this bug was introduced in 2009 to speed module
loading in the first place.
Thus, we want to do the free in an RCU callback. We do this in the
simplest possible way by allocating a new rcu_head: if we put it in
the module structure we'd have to worry about that getting freed.
Reported-by: Rui Xiang <rui.xiang@huawei.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Rusty Russell [Mon, 19 Jan 2015 22:37:05 +0000 (09:07 +1030)]
module: remove mod arg from module_free, rename module_memfree().
Nothing needs the module pointer any more, and the next patch will
call it from RCU, where the module itself might no longer exist.
Removing the arg is the safest approach.
This just codifies the use of the module_alloc/module_free pattern
which ftrace and bpf use.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Acked-by: Alexei Starovoitov <ast@kernel.org> Cc: Mikael Starvik <starvik@axis.com> Cc: Jesper Nilsson <jesper.nilsson@axis.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Ley Foon Tan <lftan@altera.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Chris Metcalf <cmetcalf@ezchip.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: x86@kernel.org Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: linux-cris-kernel@axis.com Cc: linux-kernel@vger.kernel.org Cc: linux-mips@linux-mips.org Cc: nios2-dev@lists.rocketboards.org Cc: linuxppc-dev@lists.ozlabs.org Cc: sparclinux@vger.kernel.org Cc: netdev@vger.kernel.org
Rusty Russell [Mon, 19 Jan 2015 22:37:04 +0000 (09:07 +1030)]
module_arch_freeing_init(): new hook for archs before module->module_init freed.
Archs have been abusing module_free() to clean up their arch-specific
allocations. Since module_free() is also (ab)used by BPF and trace code,
let's keep it to simple allocations, and provide a hook called before
that.
This means that avr32, ia64, parisc and s390 no longer need to implement
their own module_free() at all. avr32 doesn't need module_finalize()
either.