Filipe Manana [Fri, 13 Nov 2015 00:56:02 +0000 (00:56 +0000)]
Btrfs: fix the number of transaction units needed to remove a block group
We were using only 1 transaction unit when attempting to delete an unused
block group but in reality we need 3 units. We were accounting only for
the addition of the orphan item (for the block group's free space cache
inode) but we were not accounting that we need to delete one block group
item from the extent tree and one free space item from the tree of tree
roots.
While one unit is not enough, it worked most of the time because for each
single unit we are too pessimistic and assume an entire tree path, with
the highest possible height (8), needs to be COWed with eventual node
splits at every possible level in the tree, so there was usually enough
reserved space for removing the block group and free space items after
adding the orphan.
However after adding the orphan item, writepages() can by called by the VM
subsystem against the btree inode when we are under memory pressure, which
causes writeback to start for the nodes we COWed before, this forces the
operation to remove the free space item to COW again some (or all of) the
same nodes (in the tree of tree roots). Even without writepages() being
called, we could fail with ENOSPC because the block group item is located
in a different tree (the extent tree) and it can have a big enough heigth
and require enough node/leaf splits to blow up the reserved space that did
not ended up getting used for adding the orphan and removing the free
space cache item.
In the kernel 4.0 release, commit 3d84be799194 ("Btrfs: fix BUG_ON in
btrfs_orphan_add() when delete unused block group"), we attempted to fix
a BUG_ON due to ENOSPC when trying to add the orphan item by making the
cleaner kthread reserve one transaction unit before attempting to remove
the block group, but this was not enough. We had a couple user reports
still hitting the same BUG_ON after 4.0, like Stefan Priebe's report on
a 4.2-rc6 kernel for example:
So fix this by reserving the neceasary 3 units of metadata.
Reported-by: Stefan Priebe <s.priebe@profihost.ag> Fixes: 3d84be799194 ("Btrfs: fix BUG_ON in btrfs_orphan_add() when delete unused block group") Signed-off-by: Filipe Manana <fdmanana@suse.com>
(cherry picked from commit 4fb665f3c2fac42a516b7a0953f2750c525c13a9)
(cherry picked from commit a88e877b29d793156af9d8fc86c1cbf4de798487)
Filipe Manana [Thu, 12 Nov 2015 17:49:45 +0000 (17:49 +0000)]
Btrfs: use global reserve when deleting unused block group after ENOSPC
It's possible to reach a state where the cleaner kthread isn't able to
start a transaction to delete an unused block group due to lack of enough
free metadata space and due to lack of unallocated device space to allocate
a new metadata block group as well. If this happens try to use space from
the global block group reserve just like we do for unlink operations, so
that we don't reach a permanent state where starting a transaction for
filesystem operations (file creation, renames, etc) keeps failing with
-ENOSPC. Such an unfortunate state was observed on a machine where over
a dozen unused data block groups existed and the cleaner kthread was
failing to delete them due to ENOSPC error when attempting to start a
transaction, and even running balance with a -dusage=0 filter failed with
ENOSPC as well. Also unmounting and mounting again the filesystem didn't
help. Allowing the cleaner kthread to use the global block reserve to
delete the unused data block groups fixed the problem.
Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Jeff Mahoney <jeffm@suse.com>
(cherry picked from commit 59b6a0d1cc59b739915d7f3d5c03bad56ec4ceb8)
Reason:
btrfs balance delete last data chunk in case of no data in
the filesystem, then we can see "no data chunk" by "fi usage"
command.
And when we do write operation to fs, the only available data
profile is 0x0, result is all new chunks are allocated single type.
Fix:
Allocate a data chunk explicitly to ensure we don't lose the
raid profile for data.
Test:
Test by above script, and confirmed the logic by debug output.
Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit 2c9fe835525896077e7e6d8e416b97f2f868edef)
mount -o nospace_cache "$TEST_DEV" "$TEST_DIR"
umount "$TEST_DEV"
mount -o nospace_cache "$TEST_DEV" "$TEST_DIR"
btrfs filesystem usage $TEST_DIR
We can see the data chunk changed from raid1 to single:
# btrfs filesystem usage $TEST_DIR
Data,single: Size:8.00MiB, Used:0.00B
/dev/vdg 8.00MiB
#
Reason:
When a empty filesystem mount with -o nospace_cache, the last
data blockgroup will be auto-removed in umount.
Then if we mount it again, there is no data chunk in the
filesystem, so the only available data profile is 0x0, result
is all new chunks are created as single type.
Fix:
Don't auto-delete last blockgroup for a raid type.
Test:
Test by above script, and confirmed the logic by debug output.
Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit aefbe9a633b50a6124dbeb33a5d4efcdc6de6c30)
Zhao Lei [Wed, 19 Aug 2015 14:39:18 +0000 (22:39 +0800)]
btrfs: scrub: setup all fields for sblock_to_check
scrub_setup_recheck_block() isn't setup all necessary fields for
sblock_to_check because history reason.
So current code need more arguments in severial functions,
and more local variables, just to passing these lacked values to
necessary place.
This patch setup above fields to sblock_to_check in
scrub_setup_recheck_block(), for:
1: more cleanup for function arg, local variable
2: to make sblock_to_check complete, then we can use sblock_to_check
without concern about some uninitialized member.
Zhao Lei [Tue, 25 Aug 2015 13:31:40 +0000 (21:31 +0800)]
btrfs: scrub: set error stats when tree block spanning stripes
It is better to show error stats to user when we found tree block
spanning stripes.
On a btrfs created by old version of btrfs-convert:
Before patch:
# btrfs scrub start -B /dev/vdh
scrub done for 8b342d35-2904-41ab-b3cb-2f929709cf47
scrub started at Tue Aug 25 21:19:09 2015 and finished after 00:00:00
total bytes scrubbed: 53.54MiB with 0 errors
# dmesg
...
[ 128.711434] BTRFS error (device vdh): scrub: tree block 27054080 spanning stripes, ignored. logical=27000832
[ 128.712744] BTRFS error (device vdh): scrub: tree block 27054080 spanning stripes, ignored. logical=27066368
...
After patch:
# btrfs scrub start -B /dev/vdh
scrub done for ff7f844b-7a4e-4b1a-88a9-8252ab25be1b
scrub started at Tue Aug 25 21:42:29 2015 and finished after 00:00:00
total bytes scrubbed: 53.60MiB with 2 errors
error details:
corrected errors: 0, uncorrectable errors: 2, unverified errors: 0
ERROR: There are uncorrectable errors.
# dmesg
...omit...
#
Filipe Manana [Thu, 22 Oct 2015 08:47:34 +0000 (09:47 +0100)]
Btrfs: fix regression when running delayed references
In the kernel 4.2 merge window we had a refactoring/rework of the delayed
references implementation in order to fix certain problems with qgroups.
However that rework introduced one more regression that leads to the
following trace when running delayed references for metadata:
This happens because the new delayed references code no longer merges
delayed references that have different sequence values. The following
steps are an example sequence leading to this issue:
1) Transaction N starts, fs_info->tree_mod_seq has value 0;
2) Extent buffer (btree node) A is allocated, delayed reference Ref1 for
bytenr A is created, with a value of 1 and a seq value of 0;
3) fs_info->tree_mod_seq is incremented to 1;
4) Extent buffer A is deleted through btrfs_del_items(), which calls
btrfs_del_leaf(), which in turn calls btrfs_free_tree_block(). The
later returns the metadata extent associated to extent buffer A to
the free space cache (the range is not pinned), because the extent
buffer was created in the current transaction (N) and writeback never
happened for the extent buffer (flag BTRFS_HEADER_FLAG_WRITTEN not set
in the extent buffer).
This creates the delayed reference Ref2 for bytenr A, with a value
of -1 and a seq value of 1;
5) Delayed reference Ref2 is not merged with Ref1 when we create it,
because they have different sequence numbers (decided at
add_delayed_ref_tail_merge());
6) fs_info->tree_mod_seq is incremented to 2;
7) Some task attempts to allocate a new extent buffer (done at
extent-tree.c:find_free_extent()), but due to heavy fragmentation
and running low on metadata space the clustered allocation fails
and we fall back to unclustered allocation, which finds the
extent at offset A, so a new extent buffer at offset A is allocated.
This creates delayed reference Ref3 for bytenr A, with a value of 1
and a seq value of 2;
8) Ref3 is not merged neither with Ref2 nor Ref1, again because they
all have different seq values;
9) We start running the delayed references (__btrfs_run_delayed_refs());
10) The delayed Ref1 is the first one being applied, which ends up
creating an inline extent backref in the extent tree;
10) Next the delayed reference Ref3 is selected for execution, and not
Ref2, because select_delayed_ref() always gives a preference for
positive references (that have an action of BTRFS_ADD_DELAYED_REF);
11) When running Ref3 we encounter alreay the inline extent backref
in the extent tree at insert_inline_extent_backref(), which makes
us hit the following BUG_ON:
BUG_ON(owner < BTRFS_FIRST_FREE_OBJECTID);
This is always true because owner corresponds to the level of the
extent buffer/btree node in the btree.
For the scenario described above we hit the BUG_ON because we never merge
references that have different seq values.
We used to do the merging before the 4.2 kernel, more specifically, before
the commmits:
c6fc24549960 ("btrfs: delayed-ref: Use list to replace the ref_root in ref_head.") c43d160fcd5e ("btrfs: delayed-ref: Cleanup the unneeded functions.")
This issue became more exposed after the following change that was added
to 4.2 as well:
cffc3374e567 ("Btrfs: fix order by which delayed references are run")
Which in turn fixed another regression by the two commits previously
mentioned.
So fix this by bringing back the delayed reference merge code, with the
proper adaptations so that it operates against the new data structure
(linked list vs old red black tree implementation).
This issue was hit running fstest btrfs/063 in a loop. Several people have
reported this issue in the mailing list when running on kernels 4.2+.
Very special thanks to Stéphane Lesimple for helping debugging this issue
and testing this fix on his multi terabyte filesystem (which took more
than one day to balance alone, plus fsck, etc).
Fixes: c6fc24549960 ("btrfs: delayed-ref: Use list to replace the ref_root in ref_head.") Reported-by: Peter Becker <floyd.net@gmail.com> Reported-by: Stéphane Lesimple <stephane_btrfs@lesimple.fr> Tested-by: Stéphane Lesimple <stephane_btrfs@lesimple.fr> Reported-by: Malte Schröder <malte@tnxip.de> Reported-by: Derek Dongray <derek@valedon.co.uk> Reported-by: Erkki Seppala <flux-btrfs@inside.org> Cc: stable@vger.kernel.org # 4.2+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
(cherry picked from commit 2c3cf7d5f6105bb957df125dfce61d4483b8742d)
Filipe Manana [Fri, 23 Oct 2015 06:52:54 +0000 (07:52 +0100)]
Btrfs: fix regression running delayed references when using qgroups
In the kernel 4.2 merge window we had a big changes to the implementation
of delayed references and qgroups which made the no_quota field of delayed
references not used anymore. More specifically the no_quota field is not
used anymore as of:
commit 0ed4792af0e8 ("btrfs: qgroup: Switch to new extent-oriented qgroup mechanism.")
Leaving the no_quota field actually prevents delayed references from
getting merged, which in turn cause the following BUG_ON(), at
fs/btrfs/extent-tree.c, to be hit when qgroups are enabled:
static int run_delayed_tree_ref(...)
{
(...)
BUG_ON(node->ref_mod != 1);
(...)
}
2) Ref2 bytenr X, action = BTRFS_DROP_DELAYED_REF, no_quota = 0, added.
It's not merged with Ref1 because Ref1->no_quota != Ref2->no_quota.
3) Ref3 bytenr X, action = BTRFS_ADD_DELAYED_REF, no_quota = 1, added.
It's not merged with the reference at the tail of the list of refs
for bytenr X because the reference at the tail, Ref2 is incompatible
due to Ref2->no_quota != Ref3->no_quota.
4) Ref4 bytenr X, action = BTRFS_DROP_DELAYED_REF, no_quota = 0, added.
It's not merged with the reference at the tail of the list of refs
for bytenr X because the reference at the tail, Ref3 is incompatible
due to Ref3->no_quota != Ref4->no_quota.
5) We run delayed references, trigger merging of delayed references,
through __btrfs_run_delayed_refs() -> btrfs_merge_delayed_refs().
6) Ref1 and Ref3 are merged as Ref1->no_quota = Ref3->no_quota and
all other conditions are satisfied too. So Ref1 gets a ref_mod
value of 2.
7) Ref2 and Ref4 are merged as Ref2->no_quota = Ref4->no_quota and
all other conditions are satisfied too. So Ref2 gets a ref_mod
value of 2.
8) Ref1 and Ref2 aren't merged, because they have different values
for their no_quota field.
9) Delayed reference Ref1 is picked for running (select_delayed_ref()
always prefers references with an action == BTRFS_ADD_DELAYED_REF).
So run_delayed_tree_ref() is called for Ref1 which triggers the
BUG_ON because Ref1->red_mod != 1 (equals 2).
So fix this by removing the no_quota field, as it's not used anymore as
of commit 0ed4792af0e8 ("btrfs: qgroup: Switch to new extent-oriented
qgroup mechanism.").
The use of no_quota was also buggy in at least two places:
1) At delayed-refs.c:btrfs_add_delayed_tree_ref() - we were setting
no_quota to 0 instead of 1 when the following condition was true:
is_fstree(ref_root) || !fs_info->quota_enabled
2) At extent-tree.c:__btrfs_inc_extent_ref() - we were attempting to
reset a node's no_quota when the condition "!is_fstree(root_objectid)
|| !root->fs_info->quota_enabled" was true but we did it only in
an unused local stack variable, that is, we never reset the no_quota
value in the node itself.
This fixes the remainder of problems several people have been having when
running delayed references, mostly while a balance is running in parallel,
on a 4.2+ kernel.
Very special thanks to Stéphane Lesimple for helping debugging this issue
and testing this fix on his multi terabyte filesystem (which took more
than one day to balance alone, plus fsck, etc).
Also, this fixes deadlock issue when using the clone ioctl with qgroups
enabled, as reported by Elias Probst in the mailing list. The deadlock
happens because after calling btrfs_insert_empty_item we have our path
holding a write lock on a leaf of the fs/subvol tree and then before
releasing the path we called check_ref() which did backref walking, when
qgroups are enabled, and tried to read lock the same leaf. The trace for
this case is the following:
The problem goes away by eleminating check_ref(), which no longer is
needed as its purpose was to get a value for the no_quota field of
a delayed reference (this patch removes the no_quota field as mentioned
earlier).
Reported-by: Stéphane Lesimple <stephane_btrfs@lesimple.fr> Tested-by: Stéphane Lesimple <stephane_btrfs@lesimple.fr> Reported-by: Elias Probst <mail@eliasprobst.eu> Reported-by: Peter Becker <floyd.net@gmail.com> Reported-by: Malte Schröder <malte@tnxip.de> Reported-by: Derek Dongray <derek@valedon.co.uk> Reported-by: Erkki Seppala <flux-btrfs@inside.org> Cc: stable@vger.kernel.org # 4.2+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
(cherry picked from commit b06c4bf5c874a57254b197f53ddf588e7a24a2bf)
Filipe Manana [Mon, 9 Nov 2015 18:06:38 +0000 (18:06 +0000)]
Btrfs: fix race when listing an inode's xattrs
When listing a inode's xattrs we have a time window where we race against
a concurrent operation for adding a new hard link for our inode that makes
us not return any xattr to user space. In order for this to happen, the
first xattr of our inode needs to be at slot 0 of a leaf and the previous
leaf must still have room for an inode ref (or extref) item, and this can
happen because an inode's listxattrs callback does not lock the inode's
i_mutex (nor does the VFS does it for us), but adding a hard link to an
inode makes the VFS lock the inode's i_mutex before calling the inode's
link callback.
The race illustrated by the following sequence diagram is possible:
CPU 1 CPU 2
btrfs_listxattr()
searches for key (257 XATTR_ITEM 0)
gets path with path->nodes[0] == leaf X
and path->slots[0] == N
because path->slots[0] is >=
btrfs_header_nritems(leaf X), it calls
btrfs_next_leaf()
btrfs_next_leaf()
releases the path
adds key (257 INODE_REF 666)
to the end of leaf X (slot N),
and leaf X now has N + 1 items
searches for the key (257 INODE_REF 256),
with path->keep_locks == 1, because that
is the last key it saw in leaf X before
releasing the path
ends up at leaf X again and it verifies
that the key (257 INODE_REF 256) is no
longer the last key in leaf X, so it
returns with path->nodes[0] == leaf X
and path->slots[0] == N, pointing to
the new item with key (257 INODE_REF 666)
btrfs_listxattr's loop iteration sees that
the type of the key pointed by the path is
different from the type BTRFS_XATTR_ITEM_KEY
and so it breaks the loop and stops looking
for more xattr items
--> the application doesn't get any xattr
listed for our inode
So fix this by breaking the loop only if the key's type is greater than
BTRFS_XATTR_ITEM_KEY and skip the current key if its type is smaller.
Filipe Manana [Mon, 9 Nov 2015 00:33:58 +0000 (00:33 +0000)]
Btrfs: fix race leading to BUG_ON when running delalloc for nodatacow
If we are using the NO_HOLES feature, we have a tiny time window when
running delalloc for a nodatacow inode where we can race with a concurrent
link or xattr add operation leading to a BUG_ON.
This happens because at run_delalloc_nocow() we end up casting a leaf item
of type BTRFS_INODE_[REF|EXTREF]_KEY or of type BTRFS_XATTR_ITEM_KEY to a
file extent item (struct btrfs_file_extent_item) and then analyse its
extent type field, which won't match any of the expected extent types
(values BTRFS_FILE_EXTENT_[REG|PREALLOC|INLINE]) and therefore trigger an
explicit BUG_ON(1).
The following sequence diagram shows how the race happens when running a
no-cow dellaloc range [4K, 8K[ for inode 257 and we have the following
neighbour leafs:
(Note the implicit hole for inode 257 regarding the [0, 8K[ range)
CPU 1 CPU 2
run_dealloc_nocow()
btrfs_lookup_file_extent()
--> searches for a key with value
(257 EXTENT_DATA 4096) in the
fs/subvol tree
--> returns us a path with
path->nodes[0] == leaf X and
path->slots[0] == N
because path->slots[0] is >=
btrfs_header_nritems(leaf X), it
calls btrfs_next_leaf()
btrfs_next_leaf()
--> releases the path
hard link added to our inode,
with key (257 INODE_REF 500)
added to the end of leaf X,
so leaf X now has N + 1 keys
--> searches for the key
(257 INODE_REF 256), because
it was the last key in leaf X
before it released the path,
with path->keep_locks set to 1
--> ends up at leaf X again and
it verifies that the key
(257 INODE_REF 256) is no longer
the last key in the leaf, so it
returns with path->nodes[0] ==
leaf X and path->slots[0] == N,
pointing to the new item with
key (257 INODE_REF 500)
the loop iteration of run_dealloc_nocow()
does not break out the loop and continues
because the key referenced in the path
at path->nodes[0] and path->slots[0] is
for inode 257, its type is < BTRFS_EXTENT_DATA_KEY
and its offset (500) is less then our delalloc
range's end (8192)
the item pointed by the path, an inode reference item,
is (incorrectly) interpreted as a file extent item and
we get an invalid extent type, leading to the BUG_ON(1):
This happened because the item we were processing did not match a file
extent item (its key type != BTRFS_EXTENT_DATA_KEY), and even on this
case we cast the item to a struct btrfs_file_extent_item pointer and
then find a type field value that does not match any of the expected
values (BTRFS_FILE_EXTENT_[REG|PREALLOC|INLINE]). This scenario happens
due to a tiny time window where a race can happen as exemplified below.
For example, consider the following scenario where we're using the
NO_HOLES feature and we have the following two neighbour leafs:
Our inode 257 has an implicit hole in the range [0, 8K[ (implicit rather
than explicit because NO_HOLES is enabled). Now if our inode has an
ordered extent for the range [4K, 8K[ that is finishing, the following
can happen:
CPU 1 CPU 2
btrfs_finish_ordered_io()
insert_reserved_file_extent()
__btrfs_drop_extents()
Searches for the key
(257 EXTENT_DATA 4096) through
btrfs_lookup_file_extent()
Key not found and we get a path where
path->nodes[0] == leaf X and
path->slots[0] == N
Because path->slots[0] is >=
btrfs_header_nritems(leaf X), we call
btrfs_next_leaf()
btrfs_next_leaf() releases the path
inserts key
(257 INODE_REF 4096)
at the end of leaf X,
leaf X now has N + 1 keys,
and the new key is at
slot N
btrfs_next_leaf() searches for
key (257 INODE_REF 256), with
path->keep_locks set to 1,
because it was the last key it
saw in leaf X
finds it in leaf X again and
notices it's no longer the last
key of the leaf, so it returns 0
with path->nodes[0] == leaf X and
path->slots[0] == N (which is now
< btrfs_header_nritems(leaf X)),
pointing to the new key
(257 INODE_REF 4096)
__btrfs_drop_extents() casts the
item at path->nodes[0], slot
path->slots[0], to a struct
btrfs_file_extent_item - it does
not skip keys for the target
inode with a type less than
BTRFS_EXTENT_DATA_KEY
(BTRFS_INODE_REF_KEY < BTRFS_EXTENT_DATA_KEY)
sees a bogus value for the type
field triggering the WARN_ON in
the trace shown above, and sets
extent_end = search_start (4096)
does the if-then-else logic to
fixup 0 length extent items created
by a past bug from hole punching:
if (extent_end == key.offset &&
extent_end >= search_start)
goto delete_extent_item;
that evaluates to true and it ends
up deleting the key pointed to by
path->slots[0], (257 INODE_REF 4096),
from leaf X
The same could happen for example for a xattr that ends up having a key
with an offset value that matches search_start (very unlikely but not
impossible).
So fix this by ensuring that keys smaller than BTRFS_EXTENT_DATA_KEY are
skipped, never casted to struct btrfs_file_extent_item and never deleted
by accident. Also protect against the unexpected case of getting a key
for a lower inode number by skipping that key and issuing a warning.
Zygo Blaxell [Tue, 10 Nov 2015 14:52:05 +0000 (09:52 -0500)]
Merge tag 'v4.2.6' into zygo-4.2.6-zb64
This is the 4.2.6 stable release
# gpg: Signature made Mon Nov 9 17:38:02 2015 EST using RSA key ID 6092693E
# gpg: Good signature from "Greg Kroah-Hartman (Linux kernel stable release signing key) <greg@kroah.com>"
# gpg: WARNING: This key is not certified with a trusted signature!
# gpg: There is no indication that the signature belongs to the owner.
# Primary key fingerprint: 647F 2865 4894 E3BD 4571 99BE 38DB BDC8 6092 693E
Commit 0b34a166f291d255755be46e43ed5497cdd194f2 "x86/xen: Support
kexec/kdump in HVM guests by doing a soft reset" has been added to the
4.2-stable tree" needed to correct the CONFIG variable, as
CONFIG_KEXEC_CORE only showed up in 4.3.
Reported-by: David Vrabel <david.vrabel@citrix.com> Reported-by: Luis Henriques <luis.henriques@canonical.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The Intel Baytrail pinctrl driver implements irqchip callbacks which are
called with desc->lock raw_spinlock held. In mainline this is fine because
spinlock resolves to raw_spinlock. However, running the same code in -rt we
get:
BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:917
in_atomic(): 1, irqs_disabled(): 1, pid: 0, name: swapper/0
Preemption disabled at:[<ffffffff81092e9f>] cpu_startup_entry+0x17f/0x480
There is a hardware issue in Intel Baytrail where concurrent GPIO register
access might result reads of 0xffffffff and writes might get dropped
completely.
Prevent this from happening by taking the serializing lock in all places
where it is possible that more than one thread might be accessing the
hardware concurrently.
Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com> Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Cc: Lucas De Marchi <lucas.de.marchi@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Use is_zero_pfn() on pteval only after pte_present() check on pteval
(It might be better idea to introduce is_zero_pte() which checks
pte_present() first).
Otherwise when working on a swap or migration entry and if pte_pfn's
result is equal to zero_pfn by chance, we lose user's data in
__collapse_huge_page_copy(). So if you're unlucky, the application
segfaults and finally you could see below message on exit:
If user space calls unreference on a user_dmabuf it will typically
kill the struct ttm_base_object member which is responsible for the
user-space visibility. However the dmabuf part may still be alive and
refcounted. In some situations, like for shared guest-backed surface
referencing/opening, the driver may try to reference the
struct ttm_base_object member again, causing an immediate kernel warning
and a later kernel NULL pointer dereference.
Fix this by always maintaining a reference on the struct
ttm_base_object member, in situations where it might subsequently be
referenced.
Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com> Reviewed-by: Brian Paul <brianp@vmware.com> Reviewed-by: Sinclair Yeh <syeh@vmware.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
If the STXR instruction fails in the SWP emulation code, we leave *data
overwritten with the loaded value, therefore corrupting the data written
by a subsequent, successful attempt.
This patch re-jigs the code so that we only write back to *data once we
know that the update has happened.
Fixes: bd35a4adc413 ("arm64: Port SWP/SWPB emulation support from arm") Reported-by: Shengjiu Wang <shengjiu.wang@freescale.com> Reported-by: Vladimir Murzin <vladimir.murzin@arm.com> Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This is a workaround for KNL platform, where in some cases MPERF counter
will not have updated value before next read of MSR_IA32_MPERF. In this
case divide by zero will occur. This change ignores current sample for
busy calculation in this case.
Fixes: b34ef932d79a (intel_pstate: Knights Landing support) Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Acked-by: Kristen Carlson Accardi <kristen@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9d5142624256 ("sched/deadline: Reduce rq lock contention by eliminating locking of non-feasible target")
broke select_task_rq_dl() and find_lock_later_rq(), because it introduced
a comparison between the local task's deadline and dl.earliest_dl.curr of
the remote queue.
However, if the remote runqueue does not contain any SCHED_DEADLINE
task its earliest_dl.curr is 0 (always smaller than the deadline of
the local task) and the remote runqueue is not selected for pushing.
As a result, if an application creates multiple SCHED_DEADLINE
threads, they will never be pushed to runqueues that do not already
contain SCHED_DEADLINE tasks.
This patch fixes the issue by checking if dl.dl_nr_running == 0.
Signed-off-by: Luca Abeni <luca.abeni@unitn.it> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Wanpeng Li <wanpeng.li@linux.intel.com> Fixes: 9d5142624256 ("sched/deadline: Reduce rq lock contention by eliminating locking of non-feasible target") Link: http://lkml.kernel.org/r/1444982781-15608-1-git-send-email-luca.abeni@unitn.it Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
ib_send_cm_sidr_rep could sometimes erase the node from the sidr
(depending on errors in the process). Since ib_send_cm_sidr_rep is
called both from cm_sidr_req_handler and cm_destroy_id, cm_id_priv
could be either erased from the rb_tree twice or not erased at all.
Fixing that by making sure it's erased only once before freeing
cm_id_priv.
tags is freed in blk_mq_free_rq_map() and should not be used after that.
The problem doesn't manifest if CONFIG_CPUMASK_OFFSTACK is false because
free_cpumask_var() is nop.
tags->cpumask is allocated in blk_mq_init_tags() so it's natural to
free cpumask in its counter part, blk_mq_free_tags().
Fixes: f26cdc8536ad ("blk-mq: Shared tag enhancements") Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Cc: Keith Busch <keith.busch@intel.com> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
We have to exclude memory locations <= PAGE_SIZE from
the condition and let the kernel mode fault path catch it.
Otherwise a kernel NULL pointer exception will be reported
as a kernel user space access.
Fixes: d2313084e2c (um: Catch unprotected user memory access) Signed-off-by: Richard Weinberger <richard@nod.at> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The value of emul_con was getting overwritten if the selected soc is
SOC_ARCH_EXYNOS5260. And so as a result we were reading from the wrong
register in the case of SOC_ARCH_EXYNOS5260.
Fixes: 488c7455d74c ("thermal: exynos: Add the support for Exynos5433 TMU") Signed-off-by: Sudip Mukherjee <sudip@vectorindia.org> Reviewed-by: Krzysztof Kozlowski <k.kozlowski@samsung.com> Reviewed-by: Chanwoo Choi <cw00.choi@samsung.com> Acked-by: Lukasz Majewski <l.majewski@samsung.com> Signed-off-by: Krzysztof Kozlowski <k.kozlowski@samsung.com> Signed-off-by: Kukjin Kim <kgene@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Commit 8eb934591f8b ("btrfs: check unsupported filters in balance
arguments") adds a jump to exit label out_bargs in case the argument
check fails. At this point in addition to the bargs memory, the
memory for struct btrfs_balance_control has already been allocated.
Ownership of bctl is passed to btrfs_balance() in the good case,
thus the memory is not freed due to the introduced jump. Make sure
that the memory gets freed in any case as necessary. Detected by
Coverity CID 1328378.
Signed-off-by: Christian Engelmayer <cengelma@gmx.at> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Commit 00590fdd5be0 introduced RCU locking in list type and in
doing so introduced a memory allocation in list_set_add, which
is done in an atomic context, due to the fact that ipset rcu
list modifications are serialised with a spin lock. The reason
why we can't use a mutex is that in addition to modifying the
list with ipset commands, it's also being modified when a
particular ipset rule timeout expires aka garbage collection.
This gc is triggered from set_cleanup_entries, which in turn
is invoked from a timer thus requiring the lock to be bh-safe.
Concretely the following call chain can lead to "sleeping function
called in atomic context" splat:
call_ad -> list_set_uadt -> list_set_uadd -> kzalloc(, GFP_KERNEL).
And since GFP_KERNEL allows initiating direct reclaim thus
potentially sleeping in the allocation path.
To fix the issue change the allocation type to GFP_ATOMIC, to
correctly reflect that it is occuring in an atomic context.
Fixes: 00590fdd5be0 ("netfilter: ipset: Introduce RCU locking in list type") Signed-off-by: Nikolay Borisov <kernel@kyup.com> Acked-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When pci_pool_alloc fails in mvs_task_prep then task->lldd_task stays
NULL but it's later used in mvs_abort_task as slot which is passed
to mvs_slot_task_free causing NULL pointer dereference.
Just return from mvs_slot_task_free when passed with NULL slot.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=101891 Signed-off-by: Dāvis Mosāns <davispuh@gmail.com> Reviewed-by: Tomas Henzl <thenzl@redhat.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: James Bottomley <JBottomley@Odin.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The LIC doesn't deal with the different types of interrupts itself
but needs to forward calls to set the appropriate type to its parent
IRQ controller.
Without this fix all IRQs routed through the LIC will stay at the
initial EDGE type, while most of them should actually be level triggered.
Fixes: 1eec582158e2 "irqchip: tegra: Add Tegra210 support" Signed-off-by: Lucas Stach <dev@lynxeye.de> Cc: Stephen Warren <swarren@wwwdotorg.org> Cc: Thierry Reding <thierry.reding@gmail.com> Cc: Alexandre Courbot <gnurou@gmail.com> Cc: Jason Cooper <jason@lakedaemon.net> Cc: Marc Zyngier <marc.zyngier@arm.com> Link: http://lkml.kernel.org/r/1445787552-13062-1-git-send-email-dev@lynxeye.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
7d375bffa524 ("sb_edac: Fix support for systems with two home agents per socket")
NUM_CHANNELS was changed to 8 and the channel space was renumerated to
handle EN, EP, and EX configurations.
The *_mci_bind_devs() functions - except for sbridge_mci_bind_devs() -
got a new device presence check in the form of saw_chan_mask. However,
sbridge_mci_bind_devs() still uses the NUM_CHANNELS for loop.
With the increase in NUM_CHANNELS, this loop fails at index 4 since
SB only has 4 TADs. This results in the following error on SB machines:
EDAC sbridge: Some needed devices are missing
EDAC sbridge: Couldn't find mci handler
EDAC sbridge: Couldn't find mci handle
This patch adapts the saw_chan_mask logic for sbridge_mci_bind_devs() as
well.
After this patch:
EDAC MC0: Giving out device to module sbridge_edac.c controller Sandy Bridge Socket#0: DEV 0000:3f:0e.0 (POLLED)
EDAC MC1: Giving out device to module sbridge_edac.c controller Sandy Bridge Socket#1: DEV 0000:7f:0e.0 (POLLED)
This commit is poorly justified, I can find not discusison in email,
and it clearly causes a problem.
If a device which is being recovered fails and is subsequently
re-added to an array, there could easily have been changes to the
array *before* the point where the recovery was up to. So the
recovery must start again from the beginning.
If a spare is being recovered and fails, then when it is re-added we
really should do a bitmap-based recovery up to the recovery-offset,
and then a full recovery from there. Before this reversion, we only
did the "full recovery from there" which is not corect. After this
reversion with will do a full recovery from the start, which is safer
but not ideal.
It will be left to a future patch to arrange the two different styles
of recovery.
Reported-and-tested-by: Nate Dailey <nate.dailey@stratus.com> Signed-off-by: NeilBrown <neilb@suse.com> Fixes: 7eb418851f32 ("md: allow a partially recovered device to be hot-added to an array.") Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
After commit 566c09c53455 ("raid5: relieve lock contention in get_active_stripe()")
__find_stripe() is called under conf->hash_locks + hash.
But handle_stripe_clean_event() calls remove_hash() under
conf->device_lock.
Under some cirscumstances the hash chain can be circuited,
and we get an infinite loop with disabled interrupts and locked hash
lock in __find_stripe(). This leads to hard lockup on multiple CPUs
and following system crash.
I was able to reproduce this behavior on raid6 over 6 ssd disks.
The devices_handle_discard_safely option should be set to enable trim
support. The following script was used:
for i in `seq 1 32`; do
dd if=/dev/zero of=large$i bs=10M count=100 &
done
neilb: original was against a 3.x kernel. I forward-ported
to 4.3-rc. This verison is suitable for any kernel since
Commit: 59fc630b8b5f ("RAID5: batch adjacent full stripe write")
(v4.1+). I'll post a version for earlier kernels to stable.
Signed-off-by: Roman Gushchin <klamm@yandex-team.ru> Fixes: 566c09c53455 ("raid5: relieve lock contention in get_active_stripe()") Signed-off-by: NeilBrown <neilb@suse.com> Cc: Shaohua Li <shli@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This was introduced with 9e882242c6193ae6f416f2d8d8db0d9126bd996b
which changed the return value of submit_bio_wait() to return != 0 on
error, but didn't update the caller accordingly.
This was introduced with 9e882242c6193ae6f416f2d8d8db0d9126bd996b
which changed the return value of submit_bio_wait() to return != 0 on
error, but didn't update the caller accordingly.
Currently a number of Crypto API operations may fail when a signal
occurs. This causes nasty problems as the caller of those operations
are often not in a good position to restart the operation.
In fact there is currently no need for those operations to be
interrupted by user signals at all. All we need is for them to
be killable.
This patch replaces the relevant calls of signal_pending with
fatal_signal_pending, and wait_for_completion_interruptible with
wait_for_completion_killable, respectively.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 92bac83dd79e ("Input: alps - non interleaved V2 dualpoint has
separate stick button bits") assumes that all alps v2 non-interleaved
dual point setups have the separate stick button bits.
Later we limited this to Dell laptops only because of reports that this
broke things on non Dell laptops. Now it turns out that this breaks things
on the Dell Latitude D600 too. So it seems that only the Dell Latitude
D420/430/620/630, which all share the same touchpad / stick combo,
have these separate bits.
This patch limits the checking of the separate bits to only these models
fixing regressions with other models.
Reported-and-tested-by: Larry Finger <Larry.Finger@lwfinger.net> Tested-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Hans de Goede <hdegoede@redhat.com> Acked-By: Pali Rohár <pali.rohar@gmail.com> Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
If two overlayfs filesystems are stacked on top of each other, then we need
recursion in ovl_d_select_inode().
I guess d_backing_inode() is supposed to do that. But currently it doesn't
and that functionality is open coded in vfs_open(). This is now copied
into ovl_d_select_inode() to fix this regression.
Reported-by: Alban Crequy <alban.crequy@gmail.com> Signed-off-by: Miklos Szeredi <miklos@szeredi.hu> Fixes: 4bacc9c9234c ("overlayfs: Make f_path always point to the overlay...") Cc: David Howells <dhowells@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In ovl_copy_up_locked(), newdentry is leaked if the function exits through
out_cleanup as this just to out after calling ovl_cleanup() - which doesn't
actually release the ref on newdentry.
The out_cleanup segment should instead exit through out2 as certainly
newdentry leaks - and possibly upper does also, though this isn't caught
given the catch of newdentry.
Without this fix, something like the following is seen:
Open the lower file with O_LARGEFILE in ovl_copy_up().
Pass O_LARGEFILE unconditionally in ovl_copy_up_data() as it's purely for
catching 32-bit userspace dealing with a file large enough that it'll be
mishandled if the application isn't aware that there might be an integer
overflow. Inside the kernel, there shouldn't be any problems.
63692df103e9 ("PCI: Allow numa_node override via sysfs") didn't check that
the numa node provided by userspace is valid. Passing a node number too
high would attempt to access invalid memory and trigger a kernel panic.
Laura Abbott <labbott@redhat.com> produced a patch which lead us to
inspect symbol_put_addr(). This function has a comment claiming it
doesn't need to disable preemption around the module lookup
because it holds a reference to the module it wants to find, which
therefore cannot go away.
This is wrong (and a false optimization too, preempt_disable() is really
rather cheap, and I doubt any of this is on uber critical paths,
otherwise it would've retained a pointer to the actual module anyway and
avoided the second lookup).
While its true that the module cannot go away while we hold a reference
on it, the data structure we do the lookup in very much _CAN_ change
while we do the lookup. Therefore fix the comment and add the
required preempt_disable().
xen-blkfront will crash if the check to talk_to_blkback()
in blkback_changed()(XenbusStateInitWait) returns an error.
The driver data is freed and info is set to NULL. Later during
the close process via talk_to_blkback's call to xenbus_dev_fatal()
the null pointer is passed to and dereference in blkfront_closing.
We received several reports of systems rebooting and powering on
after an attempted shutdown. Testing showed that setting
XHCI_SPURIOUS_WAKEUP quirk in addition to the XHCI_SPURIOUS_REBOOT
quirk allowed the system to shutdown as expected for LynxPoint-LP
xHCI controllers. Set the quirk back.
Note that the quirk was originally introduced for LynxPoint and
LynxPoint-LP just for this same reason. See:
commit 638298dc66ea ("xhci: Fix spurious wakeups after S5 on Haswell")
It was later limited to only concern HP machines as it caused
regression on some machines, see both bug and commit:
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=66171
commit 6962d914f317 ("xhci: Limit the spurious wakeup fix only to HP machines")
Later it was discovered that the powering on after shutdown
was limited to LynxPoint-LP (Haswell-ULT) and that some non-LP HP
machine suffered from spontaneous resume from S3 (which should
not be related to the SPURIOUS_WAKEUP quirk at all). An attempt
to fix this then removed the SPURIOUS_WAKEUP flag usage completely.
commit b45abacde3d5 ("xhci: no switching back on non-ULT Haswell")
Current understanding is that LynxPoint-LP (Haswell ULT) machines
need the SPURIOUS_WAKEUP quirk, otherwise they will restart, and
plain Lynxpoint (Haswell) machines may _not_ have the quirk
set otherwise they again will restart.
Signed-off-by: Laura Abbott <labbott@fedoraproject.org> Cc: Takashi Iwai <tiwai@suse.de> Cc: Oliver Neukum <oneukum@suse.com>
[Added more history to commit message -Mathias] Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
If a host fails to wake up a isochronous SuperSpeed device from U1/U2
in time for a isoch transfer it will generate a "No ping response error"
Host will then move to the next transfer descriptor.
Handle this case in the same way as missed service errors, tag the
current TD as skipped and handle it on the next transfer event.
a PPC64LE kernel fails to boot when fbcon_add_cursor_timer uses an
uninitialized ops->cur_blink_jiffies. Prevent by initializing
in fbcon_init before the call to info->fbops->fb_set_par.
clk_add_alias() was not correctly handling the case where alias_dev_name
was NULL: rather than producing an entry with a NULL dev_id pointer,
it would produce a device name of (null). Fix this.
Fixes: 2568999835d7 ("clkdev: add clkdev_create() helper") Reported-by: Aaro Koskinen <aaro.koskinen@iki.fi> Tested-by: Aaro Koskinen <aaro.koskinen@iki.fi> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Commit 00d8689b85a7 ("i2c: mv64xxx: rework offload support to fix
several problems") completely reworked the offload support, but left a
debugging-related "return false" at the beginning of the
mv64xxx_i2c_can_offload() function. This has the unfortunate consequence
that offloading is in fact never used, which wasn't really the
intention.
This commit fixes that problem by removing the bogus "return false".
Fixes: 00d8689b85a7 ("i2c: mv64xxx: rework offload support to fix several problems") Signed-off-by: Hezi Shahmoon <hezi@marvell.com>
[Thomas: reworked commit log and title.] Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com> Signed-off-by: Wolfram Sang <wsa@the-dreams.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This commit prevents from sending "big" file using Bluetooth.
When sending a lot of data quickly through the Bluetooth interface, and
after a variable amount of data sent, transfer fails with error:
kernel: [ 415.247453] Bluetooth: hci0 hardware error 0x00
Found on T100TA.
After reverting this commit, send works fine for any file size.
Signed-off-by: Frederic Danis <frederic.danis@linux.intel.com> Fixes: 9119fba0cfed (serial: 8250_dma: don't bother DMA with small transfers) Reviewed-by: Heikki Krogerus <heikki.krogerus@linux.intel.com> Acked-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Compiling the nvme driver on 32-bit warns about a cast from a __u64
variable to a pointer:
drivers/block/nvme-core.c: In function 'nvme_submit_io':
drivers/block/nvme-core.c:1847:4: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
(void __user *)io.addr, length, NULL, 0);
The cast here is intentional and safe, so we can shut up the
gcc warning by adding an intermediate cast to 'uintptr_t'.
I had previously submitted a patch to fix this problem in the
nvme driver, but it was accepted on the same day that two new
warnings got added.
For clarification, I also change the third instance of this cast
to use uintptr_t instead of unsigned long now.
Signed-off-by: Arnd Bergmann <arnd@arndb.de> Fixes: d29ec8241c10e ("nvme: submit internal commands through the block layer") Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
btree_split_beneath()'s error path had an outstanding FIXME that speaks
directly to the potential for _not_ cleaning up a previously allocated
bufio-backed block.
Fix this by releasing the previously allocated bufio block using
unlock_block().
Reported-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <thornber@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
If the CLEAN_SHUTDOWN flag is not set when a cache is loaded then all cache
blocks are marked as dirty and a full writeback occurs.
__commit_transaction() is responsible for setting/clearing
CLEAN_SHUTDOWN (based the flags_mutator that is passed in).
Fix this issue, of the cache's on-disk flags being wrong, by making sure
__commit_transaction() does not reset the flags after the mutator has
altered the flags in preparation for them being serialized to disk.
Commit 4c7e309340ff ("dm btree remove: fix bug in redistribute3") wasn't
a complete fix for redistribute3().
The redistribute3 function takes 3 btree nodes and shares out the entries
evenly between them. If the three nodes in total contained
(MAX_ENTRIES * 3) - 1 entries between them then this was erroneously getting
rebalanced as (MAX_ENTRIES - 1) on the left and right, and (MAX_ENTRIES + 1) in
the center.
Fix this issue by being more careful about calculating the target number
of entries for the left and right nodes.
Unit tested in userspace using this program:
https://github.com/jthornber/redistribute3-test/blob/master/redistribute3_t.c
Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
bdi's are initialized in two steps, bdi_init() and bdi_register(), but
destroyed in a single step by bdi_destroy() which, for a bdi embedded
in a request_queue, is called during blk_cleanup_queue() which makes
the queue invisible and starts the draining of remaining usages.
A request_queue's user can access the congestion state of the embedded
bdi as long as it holds a reference to the queue. As such, it may
access the congested state of a queue which finished
blk_cleanup_queue() but hasn't reached blk_release_queue() yet.
Because the congested state was embedded in backing_dev_info which in
turn is embedded in request_queue, accessing the congested state after
bdi_destroy() was called was fine. The bdi was destroyed but the
memory region for the congested state remained accessible till the
queue got released.
a13f35e87140 ("writeback: don't embed root bdi_writeback_congested in
bdi_writeback") changed the situation. Now, the root congested state
which is expected to be pinned while request_queue remains accessible
is separately reference counted and the base ref is put during
bdi_destroy(). This means that the root congested state may go away
prematurely while the queue is between bdi_dstroy() and
blk_cleanup_queue(), which was detected by Andrey's KASAN tests.
The root cause of this problem is that bdi doesn't distinguish the two
steps of destruction, unregistration and release, and now the root
congested state actually requires a separate release step. To fix the
issue, this patch separates out bdi_unregister() and bdi_exit() from
bdi_destroy(). bdi_unregister() is called from blk_cleanup_queue()
and bdi_exit() from blk_release_queue(). bdi_destroy() is now just a
simple wrapper calling the two steps back-to-back.
While at it, the prototype of bdi_destroy() is moved right below
bdi_setup_and_register() so that the counterpart operations are
located together.
Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: a13f35e87140 ("writeback: don't embed root bdi_writeback_congested in bdi_writeback") Reported-and-tested-by: Andrey Konovalov <andreyknvl@google.com> Link: http://lkml.kernel.org/g/CAAeHK+zUJ74Zn17=rOyxacHU18SgCfC6bsYW=6kCY5GXJBwGfQ@mail.gmail.com Reviewed-by: Jan Kara <jack@suse.com> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Commit dd006da21646 ("arm64: mm: increase VA range of identity map")
introduced a mechanism to extend the virtual memory map range
to support arm64 systems with system RAM located at very high offset,
where the identity mapping used to enable/disable the MMU requires
additional translation levels to map the physical memory at an equal
virtual offset.
The kernel detects at boot time the tcr_el1.t0sz value required by the
identity mapping and sets-up the tcr_el1.t0sz register field accordingly,
any time the identity map is required in the kernel (ie when enabling the
MMU).
After enabling the MMU, in the cold boot path the kernel resets the
tcr_el1.t0sz to its default value (ie the actual configuration value for
the system virtual address space) so that after enabling the MMU the
memory space translated by ttbr0_el1 is restored as expected.
Commit dd006da21646 ("arm64: mm: increase VA range of identity map")
also added code to set-up the tcr_el1.t0sz value when the kernel resumes
from low-power states with the MMU off through cpu_resume() in order to
effectively use the identity mapping to enable the MMU but failed to add
the code required to restore the tcr_el1.t0sz to its default value, when
the core returns to the kernel with the MMU enabled, so that the kernel
might end up running with tcr_el1.t0sz value set-up for the identity
mapping which can be lower than the value required by the actual virtual
address space, resulting in an erroneous set-up.
This patchs adds code in the resume path that restores the tcr_el1.t0sz
default value upon core resume, mirroring this way the cold boot path
behaviour therefore fixing the issue.
Cc: Catalin Marinas <catalin.marinas@arm.com> Fixes: dd006da21646 ("arm64: mm: increase VA range of identity map") Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
With this patch applied, we were the only architecture making this sort
of adjustment to the PC calculation in the unwinder. This causes
problems for ftrace, where the PC values are matched against the
contents of the stack frames in the callchain and fail to match any
records after the address adjustment.
Whilst there has been some effort to change ftrace to workaround this,
those patches are not yet ready for mainline and, since we're the odd
architecture in this regard, let's just step in line with other
architectures (like arch/arm/) for now.
Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Commit 8a603f91cc48 ("ARM: 8445/1: fix vdsomunge not to depend on
glibc specific byteswap.h") unfortunately introduced a bug created but
not found during discussion and patch simplification.
Reported-by: Efraim Yawitz <efraim.yawitz@gmail.com> Signed-off-by: H. Nikolaus Schaller <hns@goldelico.com> Fixes: 8a603f91cc48 ("ARM: 8445/1: fix vdsomunge not to depend on glibc specific byteswap.h") Signed-off-by: Nathan Lynch <nathan_lynch@mentor.com> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Commit 685e2d08c54b ("ARM: OMAP1: Change interrupt numbering for
sparse IRQ") turned on SPARSE_IRQ on OMAP1, but forgot to change
the number of INT_DMA_LCD. This broke the boot at least on Nokia 770,
where the device hangs during framebuffer initialization.
Fix by defining INT_DMA_LCD like the other interrupts.
Fixes: 685e2d08c54b ("ARM: OMAP1: Change interrupt numbering for sparse IRQ") Signed-off-by: Aaro Koskinen <aaro.koskinen@iki.fi> Signed-off-by: Tony Lindgren <tony@atomide.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 1d8aca9df612f5751892fb2642d72536f2f48fd0
"ARM: ux500: fix MMC/SD card regression"
fixed broken the level shifter: it should be default ON
but became default OFF.
LDO1 regulator (VDD_SD) is connected to SoC's vddshv8. vddshv8 needs to
be kept always powered (see commit 5a0f93c6576a ("ARM: dts: Add
am57xx-beagle-x15"), but at the moment VDD_SD is enabled/disabled
depending on whether an SD card is inserted or not.
This patch sets LDO1 regulator to always-on.
This patch has a side effect of fixing another issue, HDMI DDC not
working when SD card is not inserted:
Why this happens is that the tpd12s015 (HDMI level shifter/ESD
protection chip) has LS_OE GPIO input, which needs to be enabled for the
HDMI DDC to work. LS_OE comes from gpio6_28. The pin that provides
gpio6_28 is powered by vddshv8, and vddshv8 comes from VDD_SD.
So when SD card is not inserted, VDD_SD is disabled, and LS_OE stays
off.
The proper fix for the HDMI DDC issue would be to maybe have the pinctrl
framework manage the pin specific power.
Apparently this fixes also a third issue (copy paste from Kishon's
patch):
ldo1_reg in addition to being connected to the io lines is also
connected to the card detect line. On card removal, omap_hsmmc
driver does a regulator_disable causing card detect line to be
pulled down. This raises a card insertion interrupt and once the
MMC core detects there is no card inserted, it does a
regulator disable which again raises a card insertion interrupt.
This happens in a loop causing infinite MMC interrupts.
Fixes: 5a0f93c6576a ("ARM: dts: Add am57xx-beagle-x15") Cc: Kishon Vijay Abraham I <kishon@ti.com> Signed-off-by: Tomi Valkeinen <tomi.valkeinen@ti.com> Reported-by: Louis McCarthy <compeoree@gmail.com> Acked-by: Nishanth Menon <nm@ti.com> Signed-off-by: Tony Lindgren <tony@atomide.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Currently, BG2Q shares a compatible with BG2. This is incorrect, since
BG2 and BG2Q use different USB PLL dividers. In reality, BG2Q shares a
divider with BG2CD. Change BG2Q's USB PHY compatible string to reflect
that.
Signed-off-by: Thomas Hebb <tommyhebb@gmail.com> Signed-off-by: Sebastian Hesselbarth <sebastian.hesselbarth@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This commit enables standby support on Armada 385 DB-AP board, because
the PM initalization routine requires "marvell,armada380" compatible
string for all Armada 38x-based platforms.
Beside the compatible "marvell,armada38x" was wrong and should be fixed
in the stable kernels too.
[gregory.clement@free-electrons.com: add information, about the fixes] Fixes: e5ee12817e9ea ("ARM: mvebu: Add Armada 385 Access Point
Development Board support") Signed-off-by: Marcin Wojtas <mw@semihalf.com> Signed-off-by: Gregory CLEMENT <gregory.clement@free-electrons.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
DSA expects the host_dev pointer to be the device structure associated
with the MDIO bus controller driver. First commit breaking that was c3a07134e6aa ("mv643xx_eth: convert to use the Marvell Orion MDIO
driver"), and then, it got completely under the radar for a while.
Reported-by: Frans van de Wiel <fvdw@fvdw.eu> Fixes: c3a07134e6aa ("mv643xx_eth: convert to use the Marvell Orion MDIO driver") Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Gregory CLEMENT <gregory.clement@free-electrons.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
On each next iteration of for_each_compatible_node() the reference
counter for current device node is already decreased by the loop
iterator. The manual call to of_node_get() is required only on loop
break which is not happening here.
The double of_node_get() (with enabled CONFIG_OF_DYNAMIC) lead to
decreasing the counter below expected, initial value.
Fixes: fe4034a3fad7 ("ARM: EXYNOS: Add missing of_node_put() when parsing power domains") Signed-off-by: Krzysztof Kozlowski <k.kozlowski@samsung.com> Signed-off-by: Kukjin Kim <kgene@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Mapping an image with a long parent chain (e.g. image foo, whose parent
is bar, whose parent is baz, etc) currently leads to a kernel stack
overflow, due to the following recursion in the reply path:
Currently we leak parent_spec and trigger a "parent reference
underflow" warning if rbd_dev_create() in rbd_dev_probe_parent() fails.
The problem is we take the !parent out_err branch and that only drops
refcounts; parent_spec that would've been freed had we called
rbd_dev_unparent() remains and triggers rbd_warn() in
rbd_dev_parent_put() - at that point we have parent_spec != NULL and
parent_ref == 0, so counter ends up being -1 after the decrement.
rbd requires stable pages, as it performs a crc of the page data before
they are send to the OSDs.
But since kernel 3.9 (patch 1d1d1a767206fbe5d4c69493b7e6d2a8d08cc0a0
"mm: only enforce stable page writes if the backing device requires
it") it is not assumed anymore that block devices require stable pages.
This patch sets the necessary flag to get stable pages back for rbd.
In a ceph installation that provides multiple ext4 formatted rbd
devices "bad crc" messages appeared regularly (ca 1 message every 1-2
minutes on every OSD that provided the data for the rbd) in the
OSD-logs before this patch. After this patch this messages are pretty
much gone (only ca 1-2 / month / OSD).
Signed-off-by: Ronny Hegewald <Ronny.Hegewald@online.de>
[idryomov@gmail.com: require stable pages only in crc case, changelog]
[idryomov@gmail.com: backport to 3.18-4.2: context] Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This accelerometer accidentally either emits a DRDY signal or an
IRQ signal. Accidentally I activated the IRQ signal as I thought
it was analogous to the interrupt generator on other ST
accelerometers. This was wrong. After this patch generic_buffer
gives a nice stream of accelerometer readings.
Fixes: 3acddf74f807778f "iio: st-sensors: add support for lis3lv02d accelerometer" Cc: Denis CIOCCA <denis.ciocca@st.com> Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: Jonathan Cameron <jic23@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Pinning a userptr onto the hardware raises interesting questions about
the lifetime of such a surface as the framebuffer extends that life
beyond the client's address space. That is the hardware will need to
keep scanning out from the backing storage even after the client wants
to remap its address space. As the hardware pins the backing storage,
the userptr becomes invalid and this raises a WARN when the clients
tries to unmap its address space. The situation can be even more
complicated when the buffer is passed between processes, between a
client and display server, where the lifetime and hardware access is
even more confusing. Deny it.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Cc: Michał Winiarski <michal.winiarski@intel.com> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Signed-off-by: Jani Nikula <jani.nikula@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
We accidentally lost the initial DPLL register write in 1c4e02746147 drm/i915: Fix DVO 2x clock enable on 830M
The "three times for luck" hack probably saved us from a total
disaster. But anyway, bring the initial write back so that the
code actually makes some sense.
Reported-and-tested-by: Nick Bowler <nbowler@draconx.ca>
References: http://mid.gmane.org/CAN_QmVyMaArxYgEcVVsGvsMo7-6ohZr8HmF5VhkkL4i9KOmrhw@mail.gmail.com Cc: Nick Bowler <nbowler@draconx.ca> Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com> Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch> Signed-off-by: Jani Nikula <jani.nikula@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In order to flush the results from in-batch pipecontrol writes (used for
example in glQuery) before declaring the batch complete (and so declaring
the query results coherent), we need to set the FlushEnable bit in our
flushing pipecontrol. The FlushEnable bit "waits until all previous
writes of immediate data from post-sync circles are complete before
executing the next command".
I get GPU hangs on byt without flushing these writes (running ue4).
piglit has examples where the flush is required for correct rendering.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Reviewed-by: Ville Syrjälä <ville.syrjala@linux.intel.com> Acked-by: Daniel Vetter <daniel@ffwll.ch> Signed-off-by: Jani Nikula <jani.nikula@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
On nv50+, we restrict the valid domains to just the one where the buffer
was originally created. However after the buffer is evicted to system
memory, we might move it back to a different domain that was not
originally valid. When sharing the buffer and retrieving its GEM_INFO
data, we still want the domain that will be valid for this buffer in a
pushbuf, not the one where it currently happens to be.
This resolves fdo#92504 and several others. These are due to suspend
evicting all buffers, making it more likely that they temporarily end up
in the wrong place.
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=92504 Signed-off-by: Ilia Mirkin <imirkin@alum.mit.edu> Signed-off-by: Ben Skeggs <bskeggs@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When PMU context is migrating between CPUs, interrupt affinity is set as
well. Only this should not happen when the CCN interrupt is not being
used at all (the driver is using a hrtimer tick instead).
The reason is that the stack tracer traces all function calls, and some of
those calls happen while exiting or entering user space and idle. Some of
these functions are called after RCU had already stopped watching, as RCU
does not watch userspace or idle CPUs.
If a max stack is hit, then the save_stack_trace() is called, which will
check module addresses and call module_assert_mutex_or_preempt(), and then
trigger the warning. Sad part is, the warning itself will also do a stack
trace and tigger the same warning. That probably should be fixed.
The warning was added by 0be964be0d45 "module: Sanitize RCU usage and
locking" but this bug has probably been around longer. But it's unlikely to
cause much harm, but the new warning causes the system to lock up.
Cc: Peter Zijlstra <peterz@infradead.org>
Cc:"Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>