]> git.hungrycats.org Git - linux/log
linux
2 years agozygo: config: -WARN_ALL_UNSEEDED_RANDOM, it is far too noisy zygo-5.0.x-zb64
Zygo Blaxell [Mon, 12 Sep 2022 19:50:59 +0000 (15:50 -0400)]
zygo: config: -WARN_ALL_UNSEEDED_RANDOM, it is far too noisy

(cherry picked from commit c042d30fa3b35c68703988aba2136c49c7971075)

5 years agobtrfs: do not use readahead for running delayed refs
Josef Bacik [Fri, 13 Mar 2020 21:09:53 +0000 (17:09 -0400)]
btrfs: do not use readahead for running delayed refs

Readahead will generate a lot of extra reads for adjacent nodes, but
when running delayed refs we have no idea if the next ref is going to be
adjacent or not, so this potentially just generates a lot of extra IO.
To make matters worse each ref is truly just looking for one item, it
doesn't generally search forward, so we simply don't need it here.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
(cherry-picked from commit cd22a51c6650f567f297215dd7c47c3401ee7290)

(cherry picked from commit cd308517ce6b4be33f6802f14c785a4a81affc6b)

5 years agobtrfs: reloc: reorder reservation before root selection
Josef Bacik [Fri, 13 Mar 2020 21:17:06 +0000 (17:17 -0400)]
btrfs: reloc: reorder reservation before root selection

Since we're not only checking for metadata reservations but also if we
need to throttle our delayed ref generation, reorder
reserve_metadata_space() above the select_one_root() call in
relocate_tree_block().

The reason we want this is because select_reloc_root() will mess with
the backref cache, and if we're going to bail we want to be able to
cleanly remove this node from the backref cache and come back along to
regenerate it.  Move it up so this is the first thing we do to make
restarting cleaner.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
(cherry picked from commit 5f6b2e5cd67a7bb39eb24fa3206e54803e1a25ff)
(cherry picked from commit 7cc6fc512a506bc67013849c8f56110b625805f5)

5 years agobtrfs: restart relocate_tree_blocks properly
Josef Bacik [Fri, 13 Mar 2020 21:17:07 +0000 (17:17 -0400)]
btrfs: restart relocate_tree_blocks properly

There are two bugs here, but fixing them independently would just result
in pain if you happened to bisect between the two patches.

First is how we handle the -EAGAIN from relocate_tree_block().  We don't
set error, unless we happen to be the first node, which makes no sense,
I have no idea what the code was trying to accomplish here.

We in fact _do_ want err set here so that we know we need to restart in
relocate_block_group().  Also we need finish_pending_nodes() to not
actually call link_to_upper(), because we didn't actually relocate the
block.

And then if we do get -EAGAIN we do not want to set our backref cache
last_trans to the one before ours.  This would force us to update our
backref cache if we didn't cross transaction ids, which would mean we'd
have some nodes updated to their new_bytenr, but still able to find
their old bytenr because we're searching the same commit root as the
last time we went through relocate_tree_blocks.

Fixing these two things keeps us from panicing when we start breaking
out of relocate_tree_blocks() either for delayed ref flushing or enospc.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
(cherry picked from commit 50dbbb71c79df89532ec41d118d59386e5a877e3)
(cherry picked from commit 1fbf6008e0ee4614410342854862d7106e2e6304)

5 years agobtrfs: track reloc roots based on their commit root bytenr
Josef Bacik [Fri, 13 Mar 2020 21:17:08 +0000 (17:17 -0400)]
btrfs: track reloc roots based on their commit root bytenr

We always search the commit root of the extent tree for looking up back
references, however we track the reloc roots based on their current
bytenr.

This is wrong, if we commit the transaction between relocating tree
blocks we could end up in this code in build_backref_tree

  if (key.objectid == key.offset) {
  /*
   * Only root blocks of reloc trees use backref
   * pointing to itself.
   */
  root = find_reloc_root(rc, cur->bytenr);
  ASSERT(root);
  cur->root = root;
  break;
  }

find_reloc_root() is looking based on the bytenr we had in the commit
root, but if we've COWed this reloc root we will not find that bytenr,
and we will trip over the ASSERT(root).

Fix this by using the commit_root->start bytenr for indexing the commit
root.  Then we change the __update_reloc_root() caller to be used when
we switch the commit root for the reloc root during commit.

This fixes the panic I was seeing when we started throttling relocation
for delayed refs.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
(cherry picked from commit ea287ab157c2816bf12aad4cece41372f9d146b4)
(cherry picked from commit 88aef5ba02f47e58559d7c4bd35af9bc81cf7b74)

5 years agobtrfs: do not resolve backrefs for roots that are being deleted
Josef Bacik [Fri, 13 Mar 2020 21:17:09 +0000 (17:17 -0400)]
btrfs: do not resolve backrefs for roots that are being deleted

Zygo reported a deadlock where a task was stuck in the inode logical
resolve code.  The deadlock looks like this

  Task 1
  btrfs_ioctl_logical_to_ino
  ->iterate_inodes_from_logical
   ->iterate_extent_inodes
    ->path->search_commit_root isn't set, so a transaction is started
      ->resolve_indirect_ref for a root that's being deleted
->search for our key, attempt to lock a node, DEADLOCK

  Task 2
  btrfs_drop_snapshot
  ->walk down to a leaf, lock it, walk up, lock node
   ->end transaction
    ->start transaction
      -> wait_cur_trans

  Task 3
  btrfs_commit_transaction
  ->wait_event(cur_trans->write_wait, num_writers == 1) DEADLOCK

We are holding a transaction open in btrfs_ioctl_logical_to_ino while we
try to resolve our references.  btrfs_drop_snapshot() holds onto its
locks while it stops and starts transaction handles, because it assumes
nobody is going to touch the root now.  Commit just does what commit
does, waiting for the writers to finish, blocking any new trans handles
from starting.

Fix this by making the backref code not try to resolve backrefs of roots
that are currently being deleted.  This will keep us from walking into a
snapshot that's currently being deleted.

This problem was harder to hit before because we rarely broke out of the
snapshot delete halfway through, but with my delayed ref throttling code
it happened much more often.  However we've always been able to do this,
so it's not a new problem.

Fixes: 8da6d5815c59 ("Btrfs: added btrfs_find_all_roots()")
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
(cherry picked from commit 39dba8739c4e360d7d1b27119c728791e68b0448)
(cherry picked from commit b62796815ded6cb71914000a019bd6909913f867)

5 years agobtrfs: use nofs allocations for running delayed items (v2, fixed merge splat)
Josef Bacik [Thu, 19 Mar 2020 14:11:32 +0000 (10:11 -0400)]
btrfs: use nofs allocations for running delayed items (v2, fixed merge splat)

Zygo reported the following lockdep splat while testing the balance
patches

======================================================
WARNING: possible circular locking dependency detected
5.6.0-c6f0579d496a+ #53 Not tainted
------------------------------------------------------
kswapd0/1133 is trying to acquire lock:
ffff888092f622c0 (&delayed_node->mutex){+.+.}, at: __btrfs_release_delayed_node+0x7c/0x5b0

but task is already holding lock:
ffffffff8fc5f860 (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x5/0x30

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 (fs_reclaim){+.+.}:
       fs_reclaim_acquire.part.91+0x29/0x30
       fs_reclaim_acquire+0x19/0x20
       kmem_cache_alloc_trace+0x32/0x740
       add_block_entry+0x45/0x260
       btrfs_ref_tree_mod+0x6e2/0x8b0
       btrfs_alloc_tree_block+0x789/0x880
       alloc_tree_block_no_bg_flush+0xc6/0xf0
       __btrfs_cow_block+0x270/0x940
       btrfs_cow_block+0x1ba/0x3a0
       btrfs_search_slot+0x999/0x1030
       btrfs_insert_empty_items+0x81/0xe0
       btrfs_insert_delayed_items+0x128/0x7d0
       __btrfs_run_delayed_items+0xf4/0x2a0
       btrfs_run_delayed_items+0x13/0x20
       btrfs_commit_transaction+0x5cc/0x1390
       insert_balance_item.isra.39+0x6b2/0x6e0
       btrfs_balance+0x72d/0x18d0
       btrfs_ioctl_balance+0x3de/0x4c0
       btrfs_ioctl+0x30ab/0x44a0
       ksys_ioctl+0xa1/0xe0
       __x64_sys_ioctl+0x43/0x50
       do_syscall_64+0x77/0x2c0
       entry_SYSCALL_64_after_hwframe+0x49/0xbe

-> #0 (&delayed_node->mutex){+.+.}:
       __lock_acquire+0x197e/0x2550
       lock_acquire+0x103/0x220
       __mutex_lock+0x13d/0xce0
       mutex_lock_nested+0x1b/0x20
       __btrfs_release_delayed_node+0x7c/0x5b0
       btrfs_remove_delayed_node+0x49/0x50
       btrfs_evict_inode+0x6fc/0x900
       evict+0x19a/0x2c0
       dispose_list+0xa0/0xe0
       prune_icache_sb+0xbd/0xf0
       super_cache_scan+0x1b5/0x250
       do_shrink_slab+0x1f6/0x530
       shrink_slab+0x32e/0x410
       shrink_node+0x2a5/0xba0
       balance_pgdat+0x4bd/0x8a0
       kswapd+0x35a/0x800
       kthread+0x1e9/0x210
       ret_from_fork+0x3a/0x50

other info that might help us debug this:

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(fs_reclaim);
                               lock(&delayed_node->mutex);
                               lock(fs_reclaim);
  lock(&delayed_node->mutex);

 *** DEADLOCK ***

3 locks held by kswapd0/1133:
 #0: ffffffff8fc5f860 (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x5/0x30
 #1: ffffffff8fc380d8 (shrinker_rwsem){++++}, at: shrink_slab+0x1e8/0x410
 #2: ffff8881e0e6c0e8 (&type->s_umount_key#42){++++}, at: trylock_super+0x1b/0x70

stack backtrace:
CPU: 2 PID: 1133 Comm: kswapd0 Not tainted 5.6.0-c6f0579d496a+ #53
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
Call Trace:
 dump_stack+0xc1/0x11a
 print_circular_bug.isra.38.cold.57+0x145/0x14a
 check_noncircular+0x2a9/0x2f0
 ? print_circular_bug.isra.38+0x130/0x130
 ? stack_trace_consume_entry+0x90/0x90
 ? save_trace+0x3cc/0x420
 __lock_acquire+0x197e/0x2550
 ? btrfs_inode_clear_file_extent_range+0x9b/0xb0
 ? register_lock_class+0x960/0x960
 lock_acquire+0x103/0x220
 ? __btrfs_release_delayed_node+0x7c/0x5b0
 __mutex_lock+0x13d/0xce0
 ? __btrfs_release_delayed_node+0x7c/0x5b0
 ? __asan_loadN+0xf/0x20
 ? pvclock_clocksource_read+0xeb/0x190
 ? __btrfs_release_delayed_node+0x7c/0x5b0
 ? mutex_lock_io_nested+0xc20/0xc20
 ? __kasan_check_read+0x11/0x20
 ? check_chain_key+0x1e6/0x2e0
 mutex_lock_nested+0x1b/0x20
 ? mutex_lock_nested+0x1b/0x20
 __btrfs_release_delayed_node+0x7c/0x5b0
 btrfs_remove_delayed_node+0x49/0x50
 btrfs_evict_inode+0x6fc/0x900
 ? btrfs_setattr+0x840/0x840
 ? do_raw_spin_unlock+0xa8/0x140
 evict+0x19a/0x2c0
 dispose_list+0xa0/0xe0
 prune_icache_sb+0xbd/0xf0
 ? invalidate_inodes+0x310/0x310
 super_cache_scan+0x1b5/0x250
 do_shrink_slab+0x1f6/0x530
 shrink_slab+0x32e/0x410
 ? do_shrink_slab+0x530/0x530
 ? do_shrink_slab+0x530/0x530
 ? __kasan_check_read+0x11/0x20
 ? mem_cgroup_protected+0x13d/0x260
 shrink_node+0x2a5/0xba0
 balance_pgdat+0x4bd/0x8a0
 ? mem_cgroup_shrink_node+0x490/0x490
 ? _raw_spin_unlock_irq+0x27/0x40
 ? finish_task_switch+0xce/0x390
 ? rcu_read_lock_bh_held+0xb0/0xb0
 kswapd+0x35a/0x800
 ? _raw_spin_unlock_irqrestore+0x4c/0x60
 ? balance_pgdat+0x8a0/0x8a0
 ? finish_wait+0x110/0x110
 ? __kasan_check_read+0x11/0x20
 ? __kthread_parkme+0xc6/0xe0
 ? balance_pgdat+0x8a0/0x8a0
 kthread+0x1e9/0x210
 ? kthread_create_worker_on_cpu+0xc0/0xc0
 ret_from_fork+0x3a/0x50

This is because we hold that delayed node's mutex while doing tree
operations.  Fix this by just wrapping the searches in nofs.

CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
(cherry picked from commit 351cbf6e4410e7ece05e35d0a07320538f2418b4)
(cherry picked from commit a6f2bc3e4bf7a734b5a2a9dd1f0b6cc0d242e4de)

5 years agozygo: config: enable page poisoning
Zygo Blaxell [Tue, 7 Apr 2020 03:29:26 +0000 (23:29 -0400)]
zygo: config: enable page poisoning

5 years agozygo: build-kernel: support "NO_MODULES=1" to build bzimage without modules
Zygo Blaxell [Sat, 14 Mar 2020 04:19:46 +0000 (00:19 -0400)]
zygo: build-kernel: support "NO_MODULES=1" to build bzimage without modules

zygo: build-kernel: support "extraversion" tag to rebuild bzimage without modules

(cherry picked from commit a4c24c9f81885f70a4cc442a96fd03a5c15daef2)
(cherry picked from commit 3a51996a107433b54e38dcfd6a3da25e4d707d6e)

zygo: fixup build-kernel

(cherry picked from commit bf16da36b115594ff8ab9db8e4417b00ee6ece41)
(cherry picked from commit 464bb2f50fa6ee316eff09d4d76e984a12003f29)

zygo: build-kernel: support no-modules builds instead of extraversion tag?

(cherry picked from commit b7d9c895fe6923c387248c812ac4ebfc8619280b)
(cherry picked from commit 793ee1c053f6854e4dcef9bb5c1c6446ef9c742c)
(cherry picked from commit 5d10b7a3fc529dcdcf0e23a6bb87944c3cd813a6)
(cherry picked from commit 29c86e1afb4005835bf9ffc8a54cd4a0ba964927)

5 years agozygo: config: build enough in to run btrfs tests on a VM without modules
Zygo Blaxell [Sat, 14 Mar 2020 22:36:23 +0000 (18:36 -0400)]
zygo: config: build enough in to run btrfs tests on a VM without modules

(cherry picked from commit d282979a89e18d2a54903d4f6d4c22be69af18e7)
(cherry picked from commit 6d39bfb5352db50f662882cd6b19d95fe10b2c97)
(cherry picked from commit dee62815f09edff7aab25efd017001fa44728d35)

5 years agozygo: btrfs: test BTRFS_LOGICAL_INO_ARGS_SEARCH_COMMIT_ROOT
Zygo Blaxell [Sun, 5 Apr 2020 21:04:46 +0000 (17:04 -0400)]
zygo: btrfs: test BTRFS_LOGICAL_INO_ARGS_SEARCH_COMMIT_ROOT

(cherry picked from commit 089c927a4b8edde89903667daef4ff7edf5f5428)
(cherry picked from commit 02a74429e93bedfe70218a8f164f227ef14ef7ad)

5 years agobtrfs: introduce BTRFS_LOGICAL_INO_ARGS_SEARCH_COMMIT_ROOT to trade latency for accuracy
Zygo Blaxell [Sun, 22 Mar 2020 20:57:42 +0000 (16:57 -0400)]
btrfs: introduce BTRFS_LOGICAL_INO_ARGS_SEARCH_COMMIT_ROOT to trade latency for accuracy

The LOGICAL_INO ioctl is a simple wrapper around an internal btrfs kernel
function which iterates over references to a data extent.  This function
can operate in two modes:

#1 joins the current transaction to get up to date information
on uncommitted references.  LOGICAL_INO can run for a long
time--seconds to minutes on deduped filesystems--and all that
time gets added to the latency of transaction commits. This slows
down other threads writing to the filesystem, as well as reducing
concurrency in LOGICAL_INO itself.

#2 doesn't join a transaction, and just searches commit roots
instead.  This loses access to backref data from uncommitted
references, but doesn't add latency to all other users of the
filesystem while it runs.

Userspace has no mechanism to prevent concurrent changes on the
filesystem, so userspace must tolerate out-of-date backref information
e.g.  looping removing extent refs until LOGICAL_INO gives no more
reachable references.  With a switch from the #1 mode to the #2 mode,
userspace must ensure a commit occurs between loops, either by calling
fssync itself, or by finding something else to do between loop iterations
until a commit occurs naturally.

This is a change in behavior, so we don't do it by default.  Add a new
flag SEARCH_COMMIT_ROOT for LOGICAL_INO_V2 so that users can request
the faster, lower-latency version.

Signed-off-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
(cherry picked from commit f35030f24f09ae428a3c069151ee8aaac5d5d041)
(cherry picked from commit 32c0c08a2c7cb65131c2679add5a302eb24c01f2)
(cherry picked from commit d753020b163b1b2ccba6fc39009a11e2c13633e4)

5 years agozygo: OK maybe 96 kernel jobs at once is a bit excessive
Zygo Blaxell [Thu, 2 Apr 2020 15:04:32 +0000 (11:04 -0400)]
zygo: OK maybe 96 kernel jobs at once is a bit excessive

5 years agobtrfs: relocation: Check cancel request after each extent found
Qu Wenruo [Mon, 17 Feb 2020 06:16:54 +0000 (14:16 +0800)]
btrfs: relocation: Check cancel request after each extent found

When relocating data block groups with tons of small extents, or large
metadata block groups, there can be over 200,000 extents.

We will iterate all extents of such block group in relocate_block_group(),
where iteration itself can be kinda time-consuming.

So when user want to cancel the balance, the extent iteration loop can
be another target.

This patch will add the cancelling check in the extent iteration loop of
relocate_block_group() to make balance cancelling faster.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
(cherry picked from commit f31ea0888caeee75bb8a8514c9f8f28b521b55df)

5 years agobtrfs: relocation: Check cancel request after each data page read
Qu Wenruo [Mon, 17 Feb 2020 06:16:53 +0000 (14:16 +0800)]
btrfs: relocation: Check cancel request after each data page read

When relocating a data extents with large large data extents, we spend
most of our time in relocate_file_extent_cluster() at stage "moving data
extents":

 1)               |  btrfs_relocate_block_group [btrfs]() {
 1)               |    relocate_file_extent_cluster [btrfs]() {
 1) $ 6586769 us  |    }
 1) + 18.260 us   |    relocate_file_extent_cluster [btrfs]();
 1) + 15.770 us   |    relocate_file_extent_cluster [btrfs]();
 1) $ 8916340 us  |  }
 1)               |  btrfs_relocate_block_group [btrfs]() {
 1)               |    relocate_file_extent_cluster [btrfs]() {
 1) $ 11611586 us |    }
 1) + 16.930 us   |    relocate_file_extent_cluster [btrfs]();
 1) + 15.870 us   |    relocate_file_extent_cluster [btrfs]();
 1) $ 14986130 us |  }

To make data relocation cancelling quicker, add extra balance cancelling
check after each page read in relocate_file_extent_cluster().

Cleanup and error handling uses the same mechanism as if the whole
process finished

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
(cherry picked from commit 7f913c7cfec62b91ceb4383c4dfb481092ae9d20)

5 years agobtrfs: relocation: add error injection points for cancelling balance
Qu Wenruo [Mon, 17 Feb 2020 06:16:52 +0000 (14:16 +0800)]
btrfs: relocation: add error injection points for cancelling balance

Introduce a new error injection point, should_cancel_balance().

It's just a wrapper of atomic_read(&fs_info->balance_cancel_req), but
allows us to override the return value.

Currently there are only one locations using this function:

- btrfs_balance()
  It checks cancel before each block group.

There are other locations checking fs_info->balance_cancel_req, but they
are not used as an indicator to exit, so there is no need to use the
wrapper.

But there will be more locations coming, and some locations can cause
kernel panic if not handled properly.  So introduce this error injection
to provide better test interface.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
(cherry picked from commit 726a342120eba8197b3bc5e01af1bd2dbf80f77f)

5 years agoRevert "btrfs: relocation: Check cancel request after each data page read"
Zygo Blaxell [Mon, 30 Mar 2020 16:42:24 +0000 (12:42 -0400)]
Revert "btrfs: relocation: Check cancel request after each data page read"

This reverts commit 34bb874930d1d6fecec7927101041884fe695aa7.

5 years agoRevert "btrfs: relocation: Check cancel request after each extent found"
Zygo Blaxell [Mon, 30 Mar 2020 16:42:11 +0000 (12:42 -0400)]
Revert "btrfs: relocation: Check cancel request after each extent found"

This reverts commit 400faee53af2f9f1d0d4a0e0f92ea3d08fe53d59.

5 years agoRevert "btrfs: relocation: Work around dead relocation stage loop"
Zygo Blaxell [Mon, 30 Mar 2020 16:41:56 +0000 (12:41 -0400)]
Revert "btrfs: relocation: Work around dead relocation stage loop"

This reverts commit 51e1987039e8138615d7da1e52aa53a3fb332612.

5 years agoRevert "btrfs: relocation: Introduce error injection points for cancelling balance"
Zygo Blaxell [Mon, 30 Mar 2020 16:41:42 +0000 (12:41 -0400)]
Revert "btrfs: relocation: Introduce error injection points for cancelling balance"

This reverts commit 6d2751364569f885ca795e37882690a78ac9f149.

5 years agozygo: no nested fakeroot
Zygo Blaxell [Tue, 24 Mar 2020 04:19:09 +0000 (00:19 -0400)]
zygo: no nested fakeroot

23:54:06 O:   LD [M]  virt/lib/irqbypass.ko
23:54:07 E:  fakeroot -u debian/rules binary
23:54:07 E: fakeroot: FAKEROOTKEY set to 108604309
23:54:07 E: fakeroot: nested operation not yet supported
23:54:07 E: dpkg-buildpackage: error: fakeroot -u debian/rules binary subprocess returned exit status 1
23:54:08 E: make[1]: *** [scripts/package/Makefile:80: bindeb-pkg] Error 1
23:54:08 E: make: *** [Makefile:1365: bindeb-pkg] Error 2
23:54:08 I: Finished with exitcode 2
(cherry picked from commit 0ad8aded2bdba7c063ce7c9cbebdc4d11290155a)

5 years agozygo: switch from kernel-package to bindeb-pkg
Zygo Blaxell [Mon, 23 Mar 2020 18:15:56 +0000 (14:15 -0400)]
zygo: switch from kernel-package to bindeb-pkg

(cherry picked from commit 9f0897f2bd10304eaece6c011725c3481595e4e0)

5 years agobtrfs: do not READA in build_backref_tree
Josef Bacik [Fri, 6 Mar 2020 20:08:40 +0000 (15:08 -0500)]
btrfs: do not READA in build_backref_tree

Here we are just searching down to the bytenr we're building the backref
tree for, and all of it's paths to the roots.  These bytenrs are not
guaranteed to be anywhere near each other, so the READA just generates
extra latency.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
(cherry picked from commit 74e27481b54c60b1beeae82542d5438908e3df20)

5 years agoRevert "zygo: btrfs: EXPERIMENTAL patch from josef to combat long commit intervals"
Zygo Blaxell [Thu, 13 Feb 2020 21:46:53 +0000 (16:46 -0500)]
Revert "zygo: btrfs: EXPERIMENTAL patch from josef to combat long commit intervals"

This reverts commit d92f2ca3480e7be87e990a21252f11bdac1399d0.

5 years agozygo: btrfs: EXPERIMENTAL patch from josef to combat long commit intervals
Zygo Blaxell [Wed, 12 Feb 2020 16:32:42 +0000 (11:32 -0500)]
zygo: btrfs: EXPERIMENTAL patch from josef to combat long commit intervals

(cherry picked from commit add02ef51bebdea81e4244491de248944e68eb32)

  Conflicts:
  fs/btrfs/extent-tree.c

5 years agoRevert "btrfs: assert tree mod log lock in __tree_mod_log_insert"
Zygo Blaxell [Mon, 10 Feb 2020 02:22:45 +0000 (21:22 -0500)]
Revert "btrfs: assert tree mod log lock in __tree_mod_log_insert"

This reverts commit f53b1341ca990e2c771cce92e65e949f0b7d909c.

5 years agozygo: WARN_ON_ONCE(!rwsem_is_locked(&sb->s_umount));
Zygo Blaxell [Mon, 10 Feb 2020 01:16:07 +0000 (20:16 -0500)]
zygo: WARN_ON_ONCE(!rwsem_is_locked(&sb->s_umount));

5 years agoRevert "zygo: silence WARN_ON(!rwsem_is_locked(&sb->s_umount));"
Zygo Blaxell [Mon, 10 Feb 2020 01:09:14 +0000 (20:09 -0500)]
Revert "zygo: silence WARN_ON(!rwsem_is_locked(&sb->s_umount));"

This reverts commit 961e8b0c95f57ce37fb04945621116bee049a31f.

5 years agozygo: build-kernel: lightweight build-and-deploy script inspired by Raspberry Pi...
Zygo Blaxell [Mon, 27 Jan 2020 17:24:46 +0000 (12:24 -0500)]
zygo: build-kernel: lightweight build-and-deploy script inspired by Raspberry Pi and occasionally useful for bisection testing

(cherry picked from commit b43dc869686c996118ee7a879698013648f5dc22)

5 years agozygo: btrfs: Revert "btrfs: don't report readahead errors and don't update statistics...
Zygo Blaxell [Fri, 10 Jan 2020 17:02:07 +0000 (12:02 -0500)]
zygo: btrfs: Revert "btrfs: don't report readahead errors and don't update statistics" because we don't want read errors to be silently ignored.

This reverts commit 0cc068e6ee59c1fffbfa977d8bf868b7551d80ac.

(cherry picked from commit ddda8c5edea780edd997e2736d40198f8dbb39bf)
(cherry picked from commit dc2c37f049f4b272b620349ed2f7d719cd28993f)

5 years agobtrfs: get fs_info from eb in tree_mod_log_eb_copy
David Sterba [Wed, 20 Mar 2019 13:22:04 +0000 (14:22 +0100)]
btrfs: get fs_info from eb in tree_mod_log_eb_copy

We can read fs_info from extent buffer and can drop it from the
parameters.

Signed-off-by: David Sterba <dsterba@suse.com>
(cherry picked from commit ed874f0db89724d0af1b4793fb518f640f333b0b)

5 years agobtrfs: assert tree mod log lock in __tree_mod_log_insert
David Sterba [Wed, 27 Mar 2019 15:19:55 +0000 (16:19 +0100)]
btrfs: assert tree mod log lock in __tree_mod_log_insert

The tree is going to be modified so it must be the exclusive lock.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
(cherry picked from commit 73e82fe4099bbf3e06a351430bd4ed9703212dda)

5 years agozygo: remove obsolete (and a little bit dangerous) cherry-pick scripts
Zygo Blaxell [Thu, 30 Jan 2020 17:54:34 +0000 (12:54 -0500)]
zygo: remove obsolete (and a little bit dangerous) cherry-pick scripts

(cherry picked from commit 3ec46e9b8830f60617d195b04c0df35890820dd5)

5 years agoBtrfs: don't iterate mod seq list when putting a tree mod seq
Filipe Manana [Wed, 22 Jan 2020 12:23:54 +0000 (12:23 +0000)]
Btrfs: don't iterate mod seq list when putting a tree mod seq

Each new element added to the mod seq list is always appended to the list,
and each one gets a sequence number coming from a counter which gets
incremented everytime a new element is added to the list (or a new node
is added to the tree mod log rbtree). Therefore the element with the
lowest sequence number is always the first element in the list.

So just remove the list iteration at btrfs_put_tree_mod_seq() that
computes the minimum sequence number in the list and replace it with
a check for the first element's sequence number.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
(cherry picked from commit 5630d5632266052619739f1b7e582954bc62cab4)

5 years agoBtrfs: fix race between adding and putting tree mod seq elements and nodes
Filipe Manana [Wed, 22 Jan 2020 12:23:20 +0000 (12:23 +0000)]
Btrfs: fix race between adding and putting tree mod seq elements and nodes

There is a race between adding and removing elements to the tree mod log
list and rbtree that can lead to use-after-free problems.

Consider the following example that explains how/why the problems happens:

1) Task A has mod log element with sequence number 200. It currently is
   the only element in the mod log list;

2) Task A calls btrfs_put_tree_mod_seq() because it no longer needs to
   access the tree mod log. When it enters the function, it initializes
   'min_seq' to (u64)-1. Then it acquires the lock 'tree_mod_seq_lock'
   before checking if there are other elements in the mod seq list.
   Since the list it empty, 'min_seq' remains set to (u64)-1. Then it
   unlocks the lock 'tree_mod_seq_lock';

3) Before task A acquires the lock 'tree_mod_log_lock', task B adds
   itself to the mod seq list through btrfs_get_tree_mod_seq() and gets a
   sequence number of 201;

4) Some other task, name it task C, modifies a btree and because there
   elements in the mod seq list, it adds a tree mod elem to the tree
   mod log rbtree. That node added to the mod log rbtree is assigned
   a sequence number of 202;

5) Task B, which is doing fiemap and resolving indirect back references,
   calls btrfs get_old_root(), with 'time_seq' == 201, which in turn
   calls tree_mod_log_search() - the search returns the mod log node
   from the rbtree with sequence number 202, created by task C;

6) Task A now acquires the lock 'tree_mod_log_lock', starts iterating
   the mod log rbtree and finds the node with sequence number 202. Since
   202 is less than the previously computed 'min_seq', (u64)-1, it
   removes the node and frees it;

7) Task B still has a pointer to the node with sequence number 202, and
   it dereferences the pointer itself and through the call to
   __tree_mod_log_rewind(), resulting in a use-after-free problem._

This issue can be triggered sporadically with the test case generic/561
from fstests, and it happens more frequently with a higher number of
duperemove processes. When it happens to me, it either freezes the vm or
it produces a trace like the following before crashing:

  [ 1245.321140] general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
  [ 1245.321200] CPU: 1 PID: 26997 Comm: pool Not tainted 5.5.0-rc6-btrfs-next-52 #1
  [ 1245.321235] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
  [ 1245.321287] RIP: 0010:rb_next+0x16/0x50
  [ 1245.321307] Code: ....
  [ 1245.321372] RSP: 0018:ffffa151c4d039b0 EFLAGS: 00010202
  [ 1245.321388] RAX: 6b6b6b6b6b6b6b6b RBX: ffff8ae221363c80 RCX: 6b6b6b6b6b6b6b6b
  [ 1245.321409] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff8ae221363c80
  [ 1245.321439] RBP: ffff8ae20fcc4688 R08: 0000000000000002 R09: 0000000000000000
  [ 1245.321475] R10: ffff8ae20b120910 R11: 00000000243f8bb1 R12: 0000000000000038
  [ 1245.321506] R13: ffff8ae221363c80 R14: 000000000000075f R15: ffff8ae223f762b8
  [ 1245.321539] FS:  00007fdee1ec7700(0000) GS:ffff8ae236c80000(0000) knlGS:0000000000000000
  [ 1245.321591] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [ 1245.321614] CR2: 00007fded4030c48 CR3: 000000021da16003 CR4: 00000000003606e0
  [ 1245.321642] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  [ 1245.321668] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  [ 1245.321706] Call Trace:
  [ 1245.321798]  __tree_mod_log_rewind+0xbf/0x280 [btrfs]
  [ 1245.321841]  btrfs_search_old_slot+0x105/0xd00 [btrfs]
  [ 1245.321877]  resolve_indirect_refs+0x1eb/0xc60 [btrfs]
  [ 1245.321912]  find_parent_nodes+0x3dc/0x11b0 [btrfs]
  [ 1245.321947]  btrfs_check_shared+0x115/0x1c0 [btrfs]
  [ 1245.321980]  ? extent_fiemap+0x59d/0x6d0 [btrfs]
  [ 1245.322029]  extent_fiemap+0x59d/0x6d0 [btrfs]
  [ 1245.322066]  do_vfs_ioctl+0x45a/0x750
  [ 1245.322081]  ksys_ioctl+0x70/0x80
  [ 1245.322092]  ? trace_hardirqs_off_thunk+0x1a/0x1c
  [ 1245.322113]  __x64_sys_ioctl+0x16/0x20
  [ 1245.322126]  do_syscall_64+0x5c/0x280
  [ 1245.322139]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
  [ 1245.322155] RIP: 0033:0x7fdee3942dd7
  [ 1245.322177] Code: ....
  [ 1245.322258] RSP: 002b:00007fdee1ec6c88 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
  [ 1245.322294] RAX: ffffffffffffffda RBX: 00007fded40210d8 RCX: 00007fdee3942dd7
  [ 1245.322314] RDX: 00007fded40210d8 RSI: 00000000c020660b RDI: 0000000000000004
  [ 1245.322337] RBP: 0000562aa89e7510 R08: 0000000000000000 R09: 00007fdee1ec6d44
  [ 1245.322369] R10: 0000000000000073 R11: 0000000000000246 R12: 00007fdee1ec6d48
  [ 1245.322390] R13: 00007fdee1ec6d40 R14: 00007fded40210d0 R15: 00007fdee1ec6d50
  [ 1245.322423] Modules linked in: ....
  [ 1245.323443] ---[ end trace 01de1e9ec5dff3cd ]---

Fix this by ensuring that btrfs_put_tree_mod_seq() computes the minimum
sequence number and iterates the rbtree while holding the lock
'tree_mod_log_lock' in write mode. Also get rid of the 'tree_mod_seq_lock'
lock, since it is now redundant.

Fixes: bd989ba359f2ac ("Btrfs: add tree modification log functions")
Fixes: 097b8a7c9e48e2 ("Btrfs: join tree mod log code with the code holding back delayed refs")
CC: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
(cherry picked from commit 54a0a23e1bd278705cfe15d05c89a57a52c70b6a)

5 years agozygo: add 'config-kernel' script to run 'make oldconfig' with KCONFIG_NOTIMESTAMP
Zygo Blaxell [Thu, 28 Nov 2019 03:39:16 +0000 (22:39 -0500)]
zygo: add 'config-kernel' script to run 'make oldconfig' with KCONFIG_NOTIMESTAMP

(cherry picked from commit 4cc38d103bc42929ef42f3df5a344b27ab3fda45)
(cherry picked from commit 0cfb266e54ab44cbaa090c5b303eedc676debf54)

5 years agoBtrfs: make deduplication with range including the last block work
Filipe Manana [Mon, 16 Dec 2019 18:26:56 +0000 (18:26 +0000)]
Btrfs: make deduplication with range including the last block work

Since btrfs was migrated to use the generic VFS helpers for clone and
deduplication, it stopped allowing for the last block of a file to be
deduplicated when the source file size is not sector size aligned (when
eof is somewhere in the middle of the last block). There are two reasons
for that:

1) The generic code always rounds down, to a multiple of the block size,
   the range's length for deduplications. This means we end up never
   deduplicating the last block when the eof is not block size aligned,
   even for the safe case where the destination range's end offset matches
   the destination file's size. That rounding down operation is done at
   generic_remap_check_len();

2) Because of that, the btrfs specific code does not expect anymore any
   non-aligned range length's for deduplication and therefore does not
   work if such nona-aligned length is given.

This patch addresses that second part, and it depends on a patch that
fixes generic_remap_check_len(), in the VFS, which was submitted ealier
and has the following subject:

  "fs: allow deduplication of eof block into the end of the destination file"

These two patches address reports from users that started seeing lower
deduplication rates due to the last block never being deduplicated when
the file size is not aligned to the filesystem's block size.

Link: https://lore.kernel.org/linux-btrfs/2019-1576167349.500456@svIo.N5dq.dFFD/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
(cherry picked from commit f14e449df755c55056aea004d696d0ac7b317812)
(cherry picked from commit 53c03ddf9695e4e466088442601642e056305bf6)

5 years agoBtrfs: remove no longer needed range length checks for deduplication
Filipe Manana [Wed, 12 Dec 2018 18:05:59 +0000 (18:05 +0000)]
Btrfs: remove no longer needed range length checks for deduplication

Comparing the content of the pages in the range to deduplicate is now
done in generic_remap_checks called by the generic helper
generic_remap_file_range_prep(), which takes care of ensuring we do not
compare/deduplicate undefined data beyond a file's EOF (range from EOF
to the next block boundary). So remove these checks which are now
redundant.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
(cherry picked from commit 57a50e2506df3f603580f8f30247caa7ac902369)
(cherry picked from commit 77a3354696e708a8dddff61821b2b3f7fa752c8d)

5 years agofs: allow deduplication of eof block into the end of the destination file
Filipe Manana [Mon, 16 Dec 2019 18:26:55 +0000 (18:26 +0000)]
fs: allow deduplication of eof block into the end of the destination file

We always round down, to a multiple of the filesystem's block size, the
length to deduplicate at generic_remap_check_len().  However this is only
needed if an attempt to deduplicate the last block into the middle of the
destination file is requested, since that leads into a corruption if the
length of the source file is not block size aligned.  When an attempt to
deduplicate the last block into the end of the destination file is
requested, we should allow it because it is safe to do it - there's no
stale data exposure and we are prepared to compare the data ranges for
a length not aligned to the block (or page) size - in fact we even do
the data compare before adjusting the deduplication length.

After btrfs was updated to use the generic helpers from VFS (by commit
34a28e3d77535e ("Btrfs: use generic_remap_file_range_prep() for cloning
and deduplication")) we started to have user reports of deduplication
not reflinking the last block anymore, and whence users getting lower
deduplication scores.  The main use case is deduplication of entire
files that have a size not aligned to the block size of the filesystem.

We already allow cloning the last block to the end (and beyond) of the
destination file, so allow for deduplication as well.

Link: https://lore.kernel.org/linux-btrfs/2019-1576167349.500456@svIo.N5dq.dFFD/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
(cherry picked from commit fca96845e4fd07998b60787f4da0a5b3588fe70e)
(cherry picked from commit 14051eeefda96b01a19b62fe43cab91eca68ded8)

5 years agoRevert "fs: allow deduplication of eof block into the end of the destination file"
Zygo Blaxell [Sat, 28 Dec 2019 21:23:24 +0000 (16:23 -0500)]
Revert "fs: allow deduplication of eof block into the end of the destination file"

This reverts commit 5cd011479936a8f9095e39bfececaf3f34bd52df.

(cherry picked from commit cf39e72396f5cb90d14644f044595caf7df46150)

5 years agoRevert "Btrfs: make deduplication with range including the last block work"
Zygo Blaxell [Sat, 28 Dec 2019 21:23:23 +0000 (16:23 -0500)]
Revert "Btrfs: make deduplication with range including the last block work"

This reverts commit 22372ec49714bc36a869517ffc57a7be9e3a00ae.

(cherry picked from commit 50595c4ea8f660f9408c8c875a43aced0bd750b8)

5 years agoBtrfs: make deduplication with range including the last block work
Filipe Manana [Mon, 16 Dec 2019 18:26:56 +0000 (18:26 +0000)]
Btrfs: make deduplication with range including the last block work

Since btrfs was migrated to use the generic VFS helpers for clone and
deduplication, it stopped allowing for the last block of a file to be
deduplicated when the source file size is not sector size aligned (when
eof is somewhere in the middle of the last block). There are two reasons
for that:

1) The generic code always rounds down, to a multiple of the block size,
   the range's length for deduplications. This means we end up never
   deduplicating the last block when the eof is not block size aligned,
   even for the safe case where the destination range's end offset matches
   the destination file's size. That rounding down operation is done at
   generic_remap_check_len();

2) Because of that, the btrfs specific code does not expect anymore any
   non-aligned range length's for deduplication and therefore does not
   work if such nona-aligned length is given.

This patch addresses that second part, and it depends on a patch that
fixes generic_remap_check_len(), in the VFS, which was submitted ealier
and has the following subject:

  "fs: allow deduplication of eof block into the end of the destination file"

These two patches address reports from users that started seeing lower
deduplication rates due to the last block never being deduplicated when
the file size is not aligned to the filesystem's block size.

Link: https://lore.kernel.org/linux-btrfs/2019-1576167349.500456@svIo.N5dq.dFFD/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
(cherry picked from commit f14e449df755c55056aea004d696d0ac7b317812)

5 years agofs: allow deduplication of eof block into the end of the destination file
Filipe Manana [Mon, 16 Dec 2019 18:26:55 +0000 (18:26 +0000)]
fs: allow deduplication of eof block into the end of the destination file

We always round down, to a multiple of the filesystem's block size, the
length to deduplicate at generic_remap_check_len().  However this is only
needed if an attempt to deduplicate the last block into the middle of the
destination file is requested, since that leads into a corruption if the
length of the source file is not block size aligned.  When an attempt to
deduplicate the last block into the end of the destination file is
requested, we should allow it because it is safe to do it - there's no
stale data exposure and we are prepared to compare the data ranges for
a length not aligned to the block (or page) size - in fact we even do
the data compare before adjusting the deduplication length.

After btrfs was updated to use the generic helpers from VFS (by commit
34a28e3d77535e ("Btrfs: use generic_remap_file_range_prep() for cloning
and deduplication")) we started to have user reports of deduplication
not reflinking the last block anymore, and whence users getting lower
deduplication scores.  The main use case is deduplication of entire
files that have a size not aligned to the block size of the filesystem.

We already allow cloning the last block to the end (and beyond) of the
destination file, so allow for deduplication as well.

Link: https://lore.kernel.org/linux-btrfs/2019-1576167349.500456@svIo.N5dq.dFFD/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
5 years agobtrfs: relocation: Introduce error injection points for cancelling balance
Qu Wenruo [Tue, 3 Dec 2019 06:42:51 +0000 (14:42 +0800)]
btrfs: relocation: Introduce error injection points for cancelling balance

Introduce a new error injection point, should_cancel_balance().

It's just a wrapper of atomic_read(&fs_info->balance_cancel_req), but
allows us to override the return value.

Currently there are only one locations using this function:
- btrfs_balance()
  It checks cancel before each block group.

There are other locations checking fs_info->balance_cancel_req, but they
are not used as an indicator to exit, so there is no need to use the
wrapper.

But there will be more locations coming, and some locations can cause
kernel panic if not handled properly.

So introduce this error injection to provide better test interface.

Signed-off-by: Qu Wenruo <wqu@suse.com>
(cherry picked from commit 6d2751364569f885ca795e37882690a78ac9f149)
(cherry picked from commit 2b39bbf4fbde34813d13de1f10da7fcd28bd252e)
(cherry picked from commit a8a7fd9a26ef3fc044f8e18287e5bbe9bb149164)
(cherry picked from commit be37be4252bf8815a115a823d869498746179a2b)

5 years agobtrfs: relocation: Work around dead relocation stage loop
Qu Wenruo [Tue, 3 Dec 2019 06:42:54 +0000 (14:42 +0800)]
btrfs: relocation: Work around dead relocation stage loop

There are some reports of dead relocation stage loop, where dmesg is
flooded by "Found X extents".

The root cause of it is still uncertain, but we can work around such bug
by checking cancelling request so user can at least cancel such dead
loop.

Signed-off-by: Qu Wenruo <wqu@suse.com>
(cherry picked from commit 51e1987039e8138615d7da1e52aa53a3fb332612)
(cherry picked from commit 3fef15d04226a183051b29a9d956eb6a3a0f6c9d)
(cherry picked from commit 22c00c2f89ee2732c4cdbf2d9eb43259d502cc63)
(cherry picked from commit 75a7464fded0d61d94d0746a1a09aa02d0878a68)

5 years agobtrfs: relocation: Check cancel request after each extent found
Qu Wenruo [Tue, 3 Dec 2019 06:42:53 +0000 (14:42 +0800)]
btrfs: relocation: Check cancel request after each extent found

When relocating data block groups with tons of small extents, or
large metadata block groups, there can be over 200,000 extents.

We will iterate all extents of such block group in relocate_block_group(),
where iteration itself can be kinda time-consuming.

So when user want to cancel the balance, the extent iteration loop can
be another target.

This patch will add the cancelling check in the extent iteration loop of
relocate_block_group() to make balance cancelling faster.

Signed-off-by: Qu Wenruo <wqu@suse.com>
(cherry picked from commit 400faee53af2f9f1d0d4a0e0f92ea3d08fe53d59)
(cherry picked from commit 3e0b705ea6ce42941083c89b56273ae160cf9ce6)
(cherry picked from commit 67f04025608f154b002c94d477946011c3c16ccb)
(cherry picked from commit fd66a3b01ebe13af88dc59f669d444230386042c)

5 years agobtrfs: relocation: Check cancel request after each data page read
Qu Wenruo [Tue, 3 Dec 2019 06:42:52 +0000 (14:42 +0800)]
btrfs: relocation: Check cancel request after each data page read

When relocating a data extents with large large data extents, we spend
most of our time in relocate_file_extent_cluster() at stage "moving data
extents":
 1)               |  btrfs_relocate_block_group [btrfs]() {
 1)               |    relocate_file_extent_cluster [btrfs]() {
 1) $ 6586769 us  |    }
 1) + 18.260 us   |    relocate_file_extent_cluster [btrfs]();
 1) + 15.770 us   |    relocate_file_extent_cluster [btrfs]();
 1) $ 8916340 us  |  }
 1)               |  btrfs_relocate_block_group [btrfs]() {
 1)               |    relocate_file_extent_cluster [btrfs]() {
 1) $ 11611586 us |    }
 1) + 16.930 us   |    relocate_file_extent_cluster [btrfs]();
 1) + 15.870 us   |    relocate_file_extent_cluster [btrfs]();
 1) $ 14986130 us |  }

So to make data relocation cancelling quicker, here add extra balance
cancelling check after each page read in relocate_file_extent_cluster().

Signed-off-by: Qu Wenruo <wqu@suse.com>
(cherry picked from commit 34bb874930d1d6fecec7927101041884fe695aa7)
(cherry picked from commit d98b3fc6f2b4d733b4abcbb79dfd47daacee1b4f)
(cherry picked from commit 9420d0783eea8bbb0c6cfe629cea6df48bb714a3)
(cherry picked from commit bf29c04e417398c8ef60bf57cecbc18f8620bdd3)

5 years agobtrfs: relocation: Output current relocation stage at btrfs_relocate_block_group()
Qu Wenruo [Fri, 29 Nov 2019 04:40:59 +0000 (12:40 +0800)]
btrfs: relocation: Output current relocation stage at btrfs_relocate_block_group()

There are several reports of hanging relocation, populating the dmesg
with things like:
  BTRFS info (device dm-5): found 1 extents

The investigation is still on going, but will never hurt to output a
little more info.

This patch will also output the current relocation stage, making that
output something like:

  BTRFS info (device dm-5): balance: start -d -m -s
  BTRFS info (device dm-5): relocating block group 30408704 flags metadata|dup
  BTRFS info (device dm-5): found 2 extents, stage: move data extents
  BTRFS info (device dm-5): relocating block group 22020096 flags system|dup
  BTRFS info (device dm-5): found 1 extents, stage: move data extents
  BTRFS info (device dm-5): relocating block group 13631488 flags data
  BTRFS info (device dm-5): found 1 extents, stage: move data extents
  BTRFS info (device dm-5): found 1 extents, stage: update data pointers
  BTRFS info (device dm-5): balance: ended with status: 0

This patch will not increase the number of lines, but with extra info
for us to debug the reported problem.
(Although it's very likely the bug is sticking at "update data pointers"
stage, even without the patch)

Signed-off-by: Qu Wenruo <wqu@suse.com>
(cherry picked from commit b04bb3a0ee5916885cb989f904d0168ac181e649)
(cherry picked from commit e798a685ada7aecb72c54a3b94a08853aa2be8cb)
(cherry picked from commit 34374b860dfd86ead6cf12062b83ba431abaddd5)
(cherry picked from commit eb8b21df208d8f2f69123a341e371b2ec4897182)

5 years agoBtrfs: fix removal logic of the tree mod log that leads to use-after-free issues
Filipe Manana [Fri, 6 Dec 2019 11:13:31 +0000 (11:13 +0000)]
Btrfs: fix removal logic of the tree mod log that leads to use-after-free issues

When a tree mod log user no longer needs to use the tree it calls
btrfs_put_tree_mod_seq() to remove itself from the list of users and
delete all no longer used elements of the tree's red black tree, which
should be all elements with a sequence number less then our equals to
the caller's sequence number. However the logic is broken because it
can delete and free elements from the red black tree that have a
sequence number greater then the caller's sequence number:

1) At a point in time we have sequence numbers 1, 2, 3 and 4 in the
   tree mod log;

2) The task which got assigned the sequence number 1 calls
   btrfs_put_tree_mod_seq();

3) Sequence number 1 is deleted from the list of sequence numbers;

4) The current minimum sequence number is computed to be the sequence
   number 2;

5) A task using sequence number 2 is at tree_mod_log_rewind() and gets
   a pointer to one of its elements from the red black tree through
   a call to tree_mod_log_search();

6) The task with sequence number 1 iterates the red black tree of tree
   modification elements and deletes (and frees) all elements with a
   sequence number less then or equals to 2 (the computed minimum sequence
   number) - it ends up only leaving elements with sequence numbers of 3
   and 4;

7) The task with sequence number 2 now uses the pointer to its element,
   already freed by the other task, at __tree_mod_log_rewind(), resulting
   in a use-after-free issue. When CONFIG_DEBUG_PAGEALLOC=y it produces
   a trace like the following:

  [16804.546854] general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
  [16804.547451] CPU: 0 PID: 28257 Comm: pool Tainted: G        W         5.4.0-rc8-btrfs-next-51 #1
  [16804.548059] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
  [16804.548666] RIP: 0010:rb_next+0x16/0x50
  (...)
  [16804.550581] RSP: 0018:ffffb948418ef9b0 EFLAGS: 00010202
  [16804.551227] RAX: 6b6b6b6b6b6b6b6b RBX: ffff90e0247f6600 RCX: 6b6b6b6b6b6b6b6b
  [16804.551873] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff90e0247f6600
  [16804.552504] RBP: ffff90dffe0d4688 R08: 0000000000000001 R09: 0000000000000000
  [16804.553136] R10: ffff90dffa4a0040 R11: 0000000000000000 R12: 000000000000002e
  [16804.553768] R13: ffff90e0247f6600 R14: 0000000000001663 R15: ffff90dff77862b8
  [16804.554399] FS:  00007f4b197ae700(0000) GS:ffff90e036a00000(0000) knlGS:0000000000000000
  [16804.555039] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [16804.555683] CR2: 00007f4b10022000 CR3: 00000002060e2004 CR4: 00000000003606f0
  [16804.556336] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  [16804.556968] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  [16804.557583] Call Trace:
  [16804.558207]  __tree_mod_log_rewind+0xbf/0x280 [btrfs]
  [16804.558835]  btrfs_search_old_slot+0x105/0xd00 [btrfs]
  [16804.559468]  resolve_indirect_refs+0x1eb/0xc70 [btrfs]
  [16804.560087]  ? free_extent_buffer.part.19+0x5a/0xc0 [btrfs]
  [16804.560700]  find_parent_nodes+0x388/0x1120 [btrfs]
  [16804.561310]  btrfs_check_shared+0x115/0x1c0 [btrfs]
  [16804.561916]  ? extent_fiemap+0x59d/0x6d0 [btrfs]
  [16804.562518]  extent_fiemap+0x59d/0x6d0 [btrfs]
  [16804.563112]  ? __might_fault+0x11/0x90
  [16804.563706]  do_vfs_ioctl+0x45a/0x700
  [16804.564299]  ksys_ioctl+0x70/0x80
  [16804.564885]  ? trace_hardirqs_off_thunk+0x1a/0x20
  [16804.565461]  __x64_sys_ioctl+0x16/0x20
  [16804.566020]  do_syscall_64+0x5c/0x250
  [16804.566580]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
  [16804.567153] RIP: 0033:0x7f4b1ba2add7
  (...)
  [16804.568907] RSP: 002b:00007f4b197adc88 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
  [16804.569513] RAX: ffffffffffffffda RBX: 00007f4b100210d8 RCX: 00007f4b1ba2add7
  [16804.570133] RDX: 00007f4b100210d8 RSI: 00000000c020660b RDI: 0000000000000003
  [16804.570726] RBP: 000055de05a6cfe0 R08: 0000000000000000 R09: 00007f4b197add44
  [16804.571314] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f4b197add48
  [16804.571905] R13: 00007f4b197add40 R14: 00007f4b100210d0 R15: 00007f4b197add50
  (...)
  [16804.575623] ---[ end trace 87317359aad4ba50 ]---

Fix this by making btrfs_put_tree_mod_seq() skip deletion of elements that
have a sequence number equals to the computed minimum sequence number, and
not just elements with a sequence number greater then that minimum.

Fixes: bd989ba359f2ac ("Btrfs: add tree modification log functions")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
(cherry picked from commit 58630a8c28664b8e826b2a096412107ca77d4c90)
(cherry picked from commit 91560e03a665935e48219d0f3bc4a705344cf341)
(cherry picked from commit 4fe79a48b56c7c41aa5ffb62436b4698dedb663a)
(cherry picked from commit 3fd891532caac101c8446865603d566715496854)

5 years agoBtrfs: fix use-after-free when using the tree modification log
Filipe Manana [Mon, 12 Aug 2019 18:14:29 +0000 (19:14 +0100)]
Btrfs: fix use-after-free when using the tree modification log

At ctree.c:get_old_root(), we are accessing a root's header owner field
after we have freed the respective extent buffer. This results in an
use-after-free that can lead to crashes, and when CONFIG_DEBUG_PAGEALLOC
is set, results in a stack trace like the following:

  [ 3876.799331] stack segment: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
  [ 3876.799363] CPU: 0 PID: 15436 Comm: pool Not tainted 5.3.0-rc3-btrfs-next-54 #1
  [ 3876.799385] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
  [ 3876.799433] RIP: 0010:btrfs_search_old_slot+0x652/0xd80 [btrfs]
  (...)
  [ 3876.799502] RSP: 0018:ffff9f08c1a2f9f0 EFLAGS: 00010286
  [ 3876.799518] RAX: ffff8dd300000000 RBX: ffff8dd85a7a9348 RCX: 000000038da26000
  [ 3876.799538] RDX: 0000000000000000 RSI: ffffe522ce368980 RDI: 0000000000000246
  [ 3876.799559] RBP: dae1922adadad000 R08: 0000000008020000 R09: ffffe522c0000000
  [ 3876.799579] R10: ffff8dd57fd788c8 R11: 000000007511b030 R12: ffff8dd781ddc000
  [ 3876.799599] R13: ffff8dd9e6240578 R14: ffff8dd6896f7a88 R15: ffff8dd688cf90b8
  [ 3876.799620] FS:  00007f23ddd97700(0000) GS:ffff8dda20200000(0000) knlGS:0000000000000000
  [ 3876.799643] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [ 3876.799660] CR2: 00007f23d4024000 CR3: 0000000710bb0005 CR4: 00000000003606f0
  [ 3876.799682] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  [ 3876.799703] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  [ 3876.799723] Call Trace:
  [ 3876.799735]  ? do_raw_spin_unlock+0x49/0xc0
  [ 3876.799749]  ? _raw_spin_unlock+0x24/0x30
  [ 3876.799779]  resolve_indirect_refs+0x1eb/0xc80 [btrfs]
  [ 3876.799810]  find_parent_nodes+0x38d/0x1180 [btrfs]
  [ 3876.799841]  btrfs_check_shared+0x11a/0x1d0 [btrfs]
  [ 3876.799870]  ? extent_fiemap+0x598/0x6e0 [btrfs]
  [ 3876.799895]  extent_fiemap+0x598/0x6e0 [btrfs]
  [ 3876.799913]  do_vfs_ioctl+0x45a/0x700
  [ 3876.799926]  ksys_ioctl+0x70/0x80
  [ 3876.799938]  ? trace_hardirqs_off_thunk+0x1a/0x20
  [ 3876.799953]  __x64_sys_ioctl+0x16/0x20
  [ 3876.799965]  do_syscall_64+0x62/0x220
  [ 3876.799977]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
  [ 3876.799993] RIP: 0033:0x7f23e0013dd7
  (...)
  [ 3876.800056] RSP: 002b:00007f23ddd96ca8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
  [ 3876.800078] RAX: ffffffffffffffda RBX: 00007f23d80210f8 RCX: 00007f23e0013dd7
  [ 3876.800099] RDX: 00007f23d80210f8 RSI: 00000000c020660b RDI: 0000000000000003
  [ 3876.800626] RBP: 000055fa2a2a2440 R08: 0000000000000000 R09: 00007f23ddd96d7c
  [ 3876.801143] R10: 00007f23d8022000 R11: 0000000000000246 R12: 00007f23ddd96d80
  [ 3876.801662] R13: 00007f23ddd96d78 R14: 00007f23d80210f0 R15: 00007f23ddd96d80
  (...)
  [ 3876.805107] ---[ end trace e53161e179ef04f9 ]---

Fix that by saving the root's header owner field into a local variable
before freeing the root's extent buffer, and then use that local variable
when needed.

Fixes: 30b0463a9394d9 ("Btrfs: fix accessing the root pointer in tree mod log functions")
CC: stable@vger.kernel.org # 3.10+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
(cherry picked from commit efad8a853ad2057f96664328a0d327a05ce39c76)

5 years agoRevert "Revert "Btrfs: do not start a transaction at iterate_extent_inodes()""
Zygo Blaxell [Sun, 24 Nov 2019 15:54:50 +0000 (10:54 -0500)]
Revert "Revert "Btrfs: do not start a transaction at iterate_extent_inodes()""

This reverts commit cde3267d3dabbc281eed98a57776c21c28185f3e.

5 years agoRevert "Btrfs: do not start a transaction at iterate_extent_inodes()"
Zygo Blaxell [Wed, 20 Nov 2019 22:03:06 +0000 (17:03 -0500)]
Revert "Btrfs: do not start a transaction at iterate_extent_inodes()"
due to suspected crashing interactions with balance.

This reverts commit b658422ed8f4524e9bde401c10280980f0c148fe.

6 years agoRevert "btrfs: EXPERIMENT: wrap entire LOGICAL_TO_INO in transaction"
Zygo Blaxell [Sat, 6 Jul 2019 01:27:26 +0000 (21:27 -0400)]
Revert "btrfs: EXPERIMENT: wrap entire LOGICAL_TO_INO in transaction"

This reverts commit cd3eb5a73a436f5da80d7af52311513085f2e394.

6 years agobtrfs: EXPERIMENT: wrap entire LOGICAL_TO_INO in transaction
Zygo Blaxell [Fri, 5 Jul 2019 05:17:06 +0000 (01:17 -0400)]
btrfs: EXPERIMENT: wrap entire LOGICAL_TO_INO in transaction

6 years agozygo: trace "unable to find logical 0 length 4096"
Zygo Blaxell [Fri, 5 Jul 2019 04:57:44 +0000 (00:57 -0400)]
zygo: trace "unable to find logical 0 length 4096"

6 years agoMerge remote-tracking branch 'stable/linux-5.0.y' into zygo-5.0.x-zb64
Zygo Blaxell [Tue, 4 Jun 2019 16:17:29 +0000 (12:17 -0400)]
Merge remote-tracking branch 'stable/linux-5.0.y' into zygo-5.0.x-zb64

6 years agoLinux 5.0.21 stable/linux-5.0.y
Greg Kroah-Hartman [Tue, 4 Jun 2019 06:01:31 +0000 (08:01 +0200)]
Linux 5.0.21

6 years agotipc: fix modprobe tipc failed after switch order of device registration
Junwei Hu [Mon, 20 May 2019 06:43:59 +0000 (14:43 +0800)]
tipc: fix modprobe tipc failed after switch order of device registration

commit 526f5b851a96566803ee4bee60d0a34df56c77f8 upstream.

Error message printed:
modprobe: ERROR: could not insert 'tipc': Address family not
supported by protocol.
when modprobe tipc after the following patch: switch order of
device registration, commit 7e27e8d6130c
("tipc: switch order of device registration to fix a crash")

Because sock_create_kern(net, AF_TIPC, ...) called by
tipc_topsrv_create_listener() in the initialization process
of tipc_init_net(), so tipc_socket_init() must be execute before that.
Meanwhile, tipc_net_id need to be initialized when sock_create()
called, and tipc_socket_init() is no need to be called for each namespace.

I add a variable tipc_topsrv_net_ops, and split the
register_pernet_subsys() of tipc into two parts, and split
tipc_socket_init() with initialization of pernet params.

By the way, I fixed resources rollback error when tipc_bcast_init()
failed in tipc_init_net().

Fixes: 7e27e8d6130c ("tipc: switch order of device registration to fix a crash")
Signed-off-by: Junwei Hu <hujunwei4@huawei.com>
Reported-by: Wang Wang <wangwang2@huawei.com>
Reported-by: syzbot+1e8114b61079bfe9cbc5@syzkaller.appspotmail.com
Reviewed-by: Kang Zhou <zhoukang7@huawei.com>
Reviewed-by: Suanming Mou <mousuanming@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agoRevert "tipc: fix modprobe tipc failed after switch order of device registration"
David S. Miller [Fri, 17 May 2019 19:15:05 +0000 (12:15 -0700)]
Revert "tipc: fix modprobe tipc failed after switch order of device registration"

commit 5593530e56943182ebb6d81eca8a3be6db6dbba4 upstream.

This reverts commit 532b0f7ece4cb2ffd24dc723ddf55242d1188e5e.

More revisions coming up.

Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agoxen/pciback: Don't disable PCI_COMMAND on PCI device reset.
Konrad Rzeszutek Wilk [Wed, 13 Feb 2019 23:21:31 +0000 (18:21 -0500)]
xen/pciback: Don't disable PCI_COMMAND on PCI device reset.

commit 7681f31ec9cdacab4fd10570be924f2cef6669ba upstream.

There is no need for this at all. Worst it means that if
the guest tries to write to BARs it could lead (on certain
platforms) to PCI SERR errors.

Please note that with af6fc858a35b90e89ea7a7ee58e66628c55c776b
"xen-pciback: limit guest control of command register"
a guest is still allowed to enable those control bits (safely), but
is not allowed to disable them and that therefore a well behaved
frontend which enables things before using them will still
function correctly.

This is done via an write to the configuration register 0x4 which
triggers on the backend side:
command_write
  \- pci_enable_device
     \- pci_enable_device_flags
        \- do_pci_enable_device
           \- pcibios_enable_device
              \-pci_enable_resourcess
                [which enables the PCI_COMMAND_MEMORY|PCI_COMMAND_IO]

However guests (and drivers) which don't do this could cause
problems, including the security issues which XSA-120 sought
to address.

Reported-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Prarit Bhargava <prarit@redhat.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Cc: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agocrypto: vmx - ghash: do nosimd fallback manually
Daniel Axtens [Thu, 16 May 2019 15:40:02 +0000 (01:40 +1000)]
crypto: vmx - ghash: do nosimd fallback manually

commit 357d065a44cdd77ed5ff35155a989f2a763e96ef upstream.

VMX ghash was using a fallback that did not support interleaving simd
and nosimd operations, leading to failures in the extended test suite.

If I understood correctly, Eric's suggestion was to use the same
data format that the generic code uses, allowing us to call into it
with the same contexts. I wasn't able to get that to work - I think
there's a very different key structure and data layout being used.

So instead steal the arm64 approach and perform the fallback
operations directly if required.

Fixes: cc333cd68dfa ("crypto: vmx - Adding GHASH routines for VMX module")
Cc: stable@vger.kernel.org # v4.1+
Reported-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Daniel Axtens <dja@axtens.net>
Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Tested-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agonet: correct zerocopy refcnt with udp MSG_MORE
Willem de Bruijn [Thu, 30 May 2019 22:01:21 +0000 (18:01 -0400)]
net: correct zerocopy refcnt with udp MSG_MORE

[ Upstream commit 100f6d8e09905c59be45b6316f8f369c0be1b2d8 ]

TCP zerocopy takes a uarg reference for every skb, plus one for the
tcp_sendmsg_locked datapath temporarily, to avoid reaching refcnt zero
as it builds, sends and frees skbs inside its inner loop.

UDP and RAW zerocopy do not send inside the inner loop so do not need
the extra sock_zerocopy_get + sock_zerocopy_put pair. Commit
52900d22288ed ("udp: elide zerocopy operation in hot path") introduced
extra_uref to pass the initial reference taken in sock_zerocopy_alloc
to the first generated skb.

But, sock_zerocopy_realloc takes this extra reference at the start of
every call. With MSG_MORE, no new skb may be generated to attach the
extra_uref to, so refcnt is incorrectly 2 with only one skb.

Do not take the extra ref if uarg && !tcp, which implies MSG_MORE.
Update extra_uref accordingly.

This conditional assignment triggers a false positive may be used
uninitialized warning, so have to initialize extra_uref at define.

Changes v1->v2: fix typo in Fixes SHA1

Fixes: 52900d22288e7 ("udp: elide zerocopy operation in hot path")
Reported-by: syzbot <syzkaller@googlegroups.com>
Diagnosed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agocxgb4: Revert "cxgb4: Remove SGE_HOST_PAGE_SIZE dependency on page size"
Vishal Kulkarni [Thu, 23 May 2019 02:37:21 +0000 (08:07 +0530)]
cxgb4: Revert "cxgb4: Remove SGE_HOST_PAGE_SIZE dependency on page size"

[ Upstream commit ab0610efabb4c4f419a531455708caf1dd29357e ]

This reverts commit 2391b0030e241386d710df10e53e2cfc3c5d4fc1 which has
introduced regression. Now SGE's BAR2 Doorbell/GTS Page Size is
interpreted correctly in the firmware itself by using actual host
page size. Hence previous commit needs to be reverted.

Signed-off-by: Vishal Kulkarni <vishal@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agonet/tls: don't ignore netdev notifications if no TLS features
Jakub Kicinski [Wed, 22 May 2019 02:02:02 +0000 (19:02 -0700)]
net/tls: don't ignore netdev notifications if no TLS features

[ Upstream commit c3f4a6c39cf269a40d45f813c05fa830318ad875 ]

On device surprise removal path (the notifier) we can't
bail just because the features are disabled.  They may
have been enabled during the lifetime of the device.
This bug leads to leaking netdev references and
use-after-frees if there are active connections while
device features are cleared.

Fixes: e8f69799810c ("net/tls: Add generic NIC offload infrastructure")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agonet/tls: fix state removal with feature flags off
Jakub Kicinski [Wed, 22 May 2019 02:02:01 +0000 (19:02 -0700)]
net/tls: fix state removal with feature flags off

[ Upstream commit 3686637e507b48525fcea6fb91e1988bdbc14530 ]

TLS offload drivers shouldn't (and currently don't) block
the TLS offload feature changes based on whether there are
active offloaded connections or not.

This seems to be a good idea, because we want the admin to
be able to disable the TLS offload at any time, and there
is no clean way of disabling it for active connections
(TX side is quite problematic).  So if features are cleared
existing connections will stay offloaded until they close,
and new connections will not attempt offload to a given
device.

However, the offload state removal handling is currently
broken if feature flags get cleared while there are
active TLS offloads.

RX side will completely bail from cleanup, even on normal
remove path, leaving device state dangling, potentially
causing issues when the 5-tuple is reused.  It will also
fail to release the netdev reference.

Remove the RX-side warning message, in next release cycle
it should be printed when features are disabled, rather
than when connection dies, but for that we need a more
efficient method of finding connection of a given netdev
(a'la BPF offload code).

Fixes: 4799ac81e52a ("tls: Add rx inline crypto offload")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agobnxt_en: Reduce memory usage when running in kdump kernel.
Michael Chan [Wed, 22 May 2019 23:12:56 +0000 (19:12 -0400)]
bnxt_en: Reduce memory usage when running in kdump kernel.

[ Upstream commit d629522e1d66561f38e5c8d4f52bb6d254ec0707 ]

Skip RDMA context memory allocations, reduce to 1 ring, and disable
TPA when running in the kdump kernel.  Without this patch, the driver
fails to initialize with memory allocation errors when running in a
typical kdump kernel.

Fixes: cf6daed098d1 ("bnxt_en: Increase context memory allocations on 57500 chips for RDMA.")
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agobnxt_en: Fix possible BUG() condition when calling pci_disable_msix().
Michael Chan [Wed, 22 May 2019 23:12:55 +0000 (19:12 -0400)]
bnxt_en: Fix possible BUG() condition when calling pci_disable_msix().

[ Upstream commit 1b3f0b75c39f534278a895c117282014e9d0ae1f ]

When making configuration changes, the driver calls bnxt_close_nic()
and then bnxt_open_nic() for the changes to take effect.  A parameter
irq_re_init is passed to the call sequence to indicate if IRQ
should be re-initialized.  This irq_re_init parameter needs to
be included in the bnxt_reserve_rings() call.  bnxt_reserve_rings()
can only call pci_disable_msix() if the irq_re_init parameter is
true, otherwise it may hit BUG() because some IRQs may not have been
freed yet.

Fixes: 41e8d7983752 ("bnxt_en: Modify the ring reservation functions for 57500 series chips.")
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agobnxt_en: Fix aggregation buffer leak under OOM condition.
Michael Chan [Wed, 22 May 2019 23:12:54 +0000 (19:12 -0400)]
bnxt_en: Fix aggregation buffer leak under OOM condition.

[ Upstream commit 296d5b54163964b7ae536b8b57dfbd21d4e868e1 ]

For every RX packet, the driver replenishes all buffers used for that
packet and puts them back into the RX ring and RX aggregation ring.
In one code path where the RX packet has one RX buffer and one or more
aggregation buffers, we missed recycling the aggregation buffer(s) if
we are unable to allocate a new SKB buffer.  This leads to the
aggregation ring slowly running out of buffers over time.  Fix it
by properly recycling the aggregation buffers.

Fixes: c0c050c58d84 ("bnxt_en: New Broadcom ethernet driver.")
Reported-by: Rakesh Hemnani <rhemnani@fb.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agonet: stmmac: dma channel control register need to be init first
Weifeng Voon [Tue, 21 May 2019 05:38:38 +0000 (13:38 +0800)]
net: stmmac: dma channel control register need to be init first

stmmac_init_chan() needs to be called before stmmac_init_rx_chan() and
stmmac_init_tx_chan(). This is because if PBLx8 is to be used,
"DMA_CH(#i)_Control.PBLx8" needs to be set before programming
"DMA_CH(#i)_TX_Control.TxPBL" and "DMA_CH(#i)_RX_Control.RxPBL".

Fixes: 47f2a9ce527a ("net: stmmac: dma channel init prepared for multiple queues")
Reviewed-by: Zhang, Baoli <baoli.zhang@intel.com>
Signed-off-by: Ong Boon Leong <boon.leong.ong@intel.com>
Signed-off-by: Weifeng Voon <weifeng.voon@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agonet: stmmac: fix ethtool flow control not able to get/set
Tan, Tee Min [Tue, 21 May 2019 04:55:42 +0000 (12:55 +0800)]
net: stmmac: fix ethtool flow control not able to get/set

Currently ethtool was not able to get/set the flow control due to a
missing "!". It will always return -EOPNOTSUPP even the device is
flow control supported.

This patch fixes the condition check for ethtool flow control get/set
function for ETHTOOL_LINK_MODE_Asym_Pause_BIT.

Fixes: 3c1bcc8614db (“net: ethernet: Convert phydev advertize and supported from u32 to link mode”)
Signed-off-by: Tan, Tee Min <tee.min.tan@intel.com>
Reviewed-by: Ong Boon Leong <boon.leong.ong@intel.com>
Signed-off-by: Voon, Weifeng <weifeng.voon@intel.com@intel.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agonet/mlx5e: Disable rxhash when CQE compress is enabled
Saeed Mahameed [Thu, 23 May 2019 19:55:10 +0000 (12:55 -0700)]
net/mlx5e: Disable rxhash when CQE compress is enabled

[ Upstream commit c0194e2d0ef0e5ce5e21a35640d23a706827ae28 ]

When CQE compression is enabled (Multi-host systems), compressed CQEs
might arrive to the driver rx, compressed CQEs don't have a valid hash
offload and the driver already reports a hash value of 0 and invalid hash
type on the skb for compressed CQEs, but this is not good enough.

On a congested PCIe, where CQE compression will kick in aggressively,
gro will deliver lots of out of order packets due to the invalid hash
and this might cause a serious performance drop.

The only valid solution, is to disable rxhash offload at all when CQE
compression is favorable (Multi-host systems).

Fixes: 7219ab34f184 ("net/mlx5e: CQE compression")
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agonet/mlx5: Allocate root ns memory using kzalloc to match kfree
Parav Pandit [Fri, 10 May 2019 15:40:08 +0000 (10:40 -0500)]
net/mlx5: Allocate root ns memory using kzalloc to match kfree

[ Upstream commit 25fa506b70cadb580c1e9cbd836d6417276d4bcd ]

root ns is yet another fs core node which is freed using kfree() by
tree_put_node().
Rest of the other fs core objects are also allocated using kmalloc
variants.

However, root ns memory is allocated using kvzalloc().
Hence allocate root ns memory using kzalloc().

Fixes: 2530236303d9e ("net/mlx5_core: Flow steering tree initialization")
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agotipc: Avoid copying bytes beyond the supplied data
Chris Packham [Mon, 20 May 2019 03:45:36 +0000 (15:45 +1200)]
tipc: Avoid copying bytes beyond the supplied data

TLV_SET is called with a data pointer and a len parameter that tells us
how many bytes are pointed to by data. When invoking memcpy() we need
to careful to only copy len bytes.

Previously we would copy TLV_LENGTH(len) bytes which would copy an extra
4 bytes past the end of the data pointer which newer GCC versions
complain about.

 In file included from test.c:17:
 In function 'TLV_SET',
     inlined from 'test' at test.c:186:5:
 /usr/include/linux/tipc_config.h:317:3:
 warning: 'memcpy' forming offset [33, 36] is out of the bounds [0, 32]
 of object 'bearer_name' with type 'char[32]' [-Warray-bounds]
     memcpy(TLV_DATA(tlv_ptr), data, tlv_len);
     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 test.c: In function 'test':
 test.c::161:10: note:
 'bearer_name' declared here
     char bearer_name[TIPC_MAX_BEARER_NAME];
          ^~~~~~~~~~~

We still want to ensure any padding bytes at the end are initialised, do
this with a explicit memset() rather than copy bytes past the end of
data. Apply the same logic to TCM_SET.

Signed-off-by: Chris Packham <chris.packham@alliedtelesis.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agonet/mlx5: Avoid double free in fs init error unwinding path
Parav Pandit [Fri, 10 May 2019 15:26:23 +0000 (10:26 -0500)]
net/mlx5: Avoid double free in fs init error unwinding path

[ Upstream commit 9414277a5df3669c67e818708c0f881597e0118e ]

In below code flow, for ingress acl table root ns memory leads
to double free.

mlx5_init_fs
  init_ingress_acls_root_ns()
    init_ingress_acl_root_ns
       kfree(steering->esw_ingress_root_ns);
       /* steering->esw_ingress_root_ns is not marked NULL */
  mlx5_cleanup_fs
    cleanup_ingress_acls_root_ns
       steering->esw_ingress_root_ns non NULL check passes.
       kfree(steering->esw_ingress_root_ns);
       /* double free */

Similar issue exist for other tables.

Hence zero out the pointers to not process the table again.

Fixes: 9b93ab981e3bf ("net/mlx5: Separate ingress/egress namespaces for each vport")
Fixes: 40c3eebb49e51 ("net/mlx5: Add support in RDMA RX steering")
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agousbnet: fix kernel crash after disconnect
Kloetzke Jan [Tue, 21 May 2019 13:18:40 +0000 (13:18 +0000)]
usbnet: fix kernel crash after disconnect

[ Upstream commit ad70411a978d1e6e97b1e341a7bde9a79af0c93d ]

When disconnecting cdc_ncm the kernel sporadically crashes shortly
after the disconnect:

  [   57.868812] Unable to handle kernel NULL pointer dereference at virtual address 00000000
  ...
  [   58.006653] PC is at 0x0
  [   58.009202] LR is at call_timer_fn+0xec/0x1b4
  [   58.013567] pc : [<0000000000000000>] lr : [<ffffff80080f5130>] pstate: 00000145
  [   58.020976] sp : ffffff8008003da0
  [   58.024295] x29: ffffff8008003da0 x28: 0000000000000001
  [   58.029618] x27: 000000000000000a x26: 0000000000000100
  [   58.034941] x25: 0000000000000000 x24: ffffff8008003e68
  [   58.040263] x23: 0000000000000000 x22: 0000000000000000
  [   58.045587] x21: 0000000000000000 x20: ffffffc68fac1808
  [   58.050910] x19: 0000000000000100 x18: 0000000000000000
  [   58.056232] x17: 0000007f885aff8c x16: 0000007f883a9f10
  [   58.061556] x15: 0000000000000001 x14: 000000000000006e
  [   58.066878] x13: 0000000000000000 x12: 00000000000000ba
  [   58.072201] x11: ffffffc69ff1db30 x10: 0000000000000020
  [   58.077524] x9 : 8000100008001000 x8 : 0000000000000001
  [   58.082847] x7 : 0000000000000800 x6 : ffffff8008003e70
  [   58.088169] x5 : ffffffc69ff17a28 x4 : 00000000ffff138b
  [   58.093492] x3 : 0000000000000000 x2 : 0000000000000000
  [   58.098814] x1 : 0000000000000000 x0 : 0000000000000000
  ...
  [   58.205800] [<          (null)>]           (null)
  [   58.210521] [<ffffff80080f5298>] expire_timers+0xa0/0x14c
  [   58.215937] [<ffffff80080f542c>] run_timer_softirq+0xe8/0x128
  [   58.221702] [<ffffff8008081120>] __do_softirq+0x298/0x348
  [   58.227118] [<ffffff80080a6304>] irq_exit+0x74/0xbc
  [   58.232009] [<ffffff80080e17dc>] __handle_domain_irq+0x78/0xac
  [   58.237857] [<ffffff8008080cf4>] gic_handle_irq+0x80/0xac
  ...

The crash happens roughly 125..130ms after the disconnect. This
correlates with the 'delay' timer that is started on certain USB tx/rx
errors in the URB completion handler.

The problem is a race of usbnet_stop() with usbnet_start_xmit(). In
usbnet_stop() we call usbnet_terminate_urbs() to cancel all URBs in
flight. This only makes sense if no new URBs are submitted
concurrently, though. But the usbnet_start_xmit() can run at the same
time on another CPU which almost unconditionally submits an URB. The
error callback of the new URB will then schedule the timer after it was
already stopped.

The fix adds a check if the tx queue is stopped after the tx list lock
has been taken. This should reliably prevent the submission of new URBs
while usbnet_terminate_urbs() does its job. The same thing is done on
the rx side even though it might be safe due to other flags that are
checked there.

Signed-off-by: Jan Klötzke <Jan.Kloetzke@preh.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agor8169: fix MAC address being lost in PCI D3
Heiner Kallweit [Wed, 29 May 2019 05:44:01 +0000 (07:44 +0200)]
r8169: fix MAC address being lost in PCI D3

[ Upstream commit 59715171fbd0172a579576f46821031800a63bc5 ]

(At least) RTL8168e forgets its MAC address in PCI D3. To fix this set
the MAC address when resuming. For resuming from runtime-suspend we
had this in place already, for resuming from S3/S5 it was missing.

The commit referenced as being fixed isn't wrong, it's just the first
one where the patch applies cleanly.

Fixes: 0f07bd850d36 ("r8169: use dev_get_drvdata where possible")
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reported-by: Albert Astals Cid <aacid@kde.org>
Tested-by: Albert Astals Cid <aacid@kde.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agonet: stmmac: fix reset gpio free missing
Jisheng Zhang [Wed, 22 May 2019 10:05:09 +0000 (10:05 +0000)]
net: stmmac: fix reset gpio free missing

[ Upstream commit 49ce881c0d4c4a7a35358d9dccd5f26d0e56fc61 ]

Commit 984203ceff27 ("net: stmmac: mdio: remove reset gpio free")
removed the reset gpio free, when the driver is unbinded or rmmod,
we miss the gpio free.

This patch uses managed API to request the reset gpio, so that the
gpio could be freed properly.

Fixes: 984203ceff27 ("net: stmmac: mdio: remove reset gpio free")
Signed-off-by: Jisheng Zhang <Jisheng.Zhang@synaptics.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agonet: sched: don't use tc_action->order during action dump
Vlad Buslov [Thu, 23 May 2019 06:32:31 +0000 (09:32 +0300)]
net: sched: don't use tc_action->order during action dump

[ Upstream commit 4097e9d250fb17958c1d9b94538386edd3f20144 ]

Function tcf_action_dump() relies on tc_action->order field when starting
nested nla to send action data to userspace. This approach breaks in
several cases:

- When multiple filters point to same shared action, tc_action->order field
  is overwritten each time it is attached to filter. This causes filter
  dump to output action with incorrect attribute for all filters that have
  the action in different position (different order) from the last set
  tc_action->order value.

- When action data is displayed using tc action API (RTM_GETACTION), action
  order is overwritten by tca_action_gd() according to its position in
  resulting array of nl attributes, which will break filter dump for all
  filters attached to that shared action that expect it to have different
  order value.

Don't rely on tc_action->order when dumping actions. Set nla according to
action position in resulting array of actions instead.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agonet: phy: marvell10g: report if the PHY fails to boot firmware
Russell King [Tue, 28 May 2019 09:34:42 +0000 (10:34 +0100)]
net: phy: marvell10g: report if the PHY fails to boot firmware

[ Upstream commit 3d3ced2ec5d71b99d72ae6910fbdf890bc2eccf0 ]

Some boards do not have the PHY firmware programmed in the 3310's flash,
which leads to the PHY not working as expected.  Warn the user when the
PHY fails to boot the firmware and refuse to initialise.

Fixes: 20b2af32ff3f ("net: phy: add Marvell Alaska X 88X3310 10Gigabit PHY support")
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Tested-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agonet: mvpp2: fix bad MVPP2_TXQ_SCHED_TOKEN_CNTR_REG queue value
Antoine Tenart [Wed, 29 May 2019 13:59:48 +0000 (15:59 +0200)]
net: mvpp2: fix bad MVPP2_TXQ_SCHED_TOKEN_CNTR_REG queue value

[ Upstream commit 21808437214637952b61beaba6034d97880fbeb3 ]

MVPP2_TXQ_SCHED_TOKEN_CNTR_REG() expects the logical queue id but
the current code is passing the global tx queue offset, so it ends
up writing to unknown registers (between 0x8280 and 0x82fc, which
seemed to be unused by the hardware). This fixes the issue by using
the logical queue id instead.

Fixes: 3f518509dedc ("ethernet: Add new driver for Marvell Armada 375 network unit")
Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agonet: mvneta: Fix err code path of probe
Jisheng Zhang [Mon, 27 May 2019 11:04:17 +0000 (11:04 +0000)]
net: mvneta: Fix err code path of probe

[ Upstream commit d484e06e25ebb937d841dac02ac1fe76ec7d4ddd ]

Fix below issues in err code path of probe:
1. we don't need to unregister_netdev() because the netdev isn't
registered.
2. when register_netdev() fails, we also need to destroy bm pool for
HWBM case.

Fixes: dc35a10f68d3 ("net: mvneta: bm: add support for hardware buffer management")
Signed-off-by: Jisheng Zhang <Jisheng.Zhang@synaptics.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agonet-gro: fix use-after-free read in napi_gro_frags()
Eric Dumazet [Wed, 29 May 2019 22:36:10 +0000 (15:36 -0700)]
net-gro: fix use-after-free read in napi_gro_frags()

[ Upstream commit a4270d6795b0580287453ea55974d948393e66ef ]

If a network driver provides to napi_gro_frags() an
skb with a page fragment of exactly 14 bytes, the call
to gro_pull_from_frag0() will 'consume' the fragment
by calling skb_frag_unref(skb, 0), and the page might
be freed and reused.

Reading eth->h_proto at the end of napi_frags_skb() might
read mangled data, or crash under specific debugging features.

BUG: KASAN: use-after-free in napi_frags_skb net/core/dev.c:5833 [inline]
BUG: KASAN: use-after-free in napi_gro_frags+0xc6f/0xd10 net/core/dev.c:5841
Read of size 2 at addr ffff88809366840c by task syz-executor599/8957

CPU: 1 PID: 8957 Comm: syz-executor599 Not tainted 5.2.0-rc1+ #32
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x172/0x1f0 lib/dump_stack.c:113
 print_address_description.cold+0x7c/0x20d mm/kasan/report.c:188
 __kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317
 kasan_report+0x12/0x20 mm/kasan/common.c:614
 __asan_report_load_n_noabort+0xf/0x20 mm/kasan/generic_report.c:142
 napi_frags_skb net/core/dev.c:5833 [inline]
 napi_gro_frags+0xc6f/0xd10 net/core/dev.c:5841
 tun_get_user+0x2f3c/0x3ff0 drivers/net/tun.c:1991
 tun_chr_write_iter+0xbd/0x156 drivers/net/tun.c:2037
 call_write_iter include/linux/fs.h:1872 [inline]
 do_iter_readv_writev+0x5f8/0x8f0 fs/read_write.c:693
 do_iter_write fs/read_write.c:970 [inline]
 do_iter_write+0x184/0x610 fs/read_write.c:951
 vfs_writev+0x1b3/0x2f0 fs/read_write.c:1015
 do_writev+0x15b/0x330 fs/read_write.c:1058

Fixes: a50e233c50db ("net-gro: restore frag0 optimization")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agonet: fec: fix the clk mismatch in failed_reset path
Andy Duan [Thu, 23 May 2019 01:55:28 +0000 (01:55 +0000)]
net: fec: fix the clk mismatch in failed_reset path

[ Upstream commit ce8d24f9a5965a58c588f9342689702a1024433c ]

Fix the clk mismatch in the error path "failed_reset" because
below error path will disable clk_ahb and clk_ipg directly, it
should use pm_runtime_put_noidle() instead of pm_runtime_put()
to avoid to call runtime resume callback.

Reported-by: Baruch Siach <baruch@tkos.co.il>
Signed-off-by: Fugang Duan <fugang.duan@nxp.com>
Tested-by: Baruch Siach <baruch@tkos.co.il>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agonet: dsa: mv88e6xxx: fix handling of upper half of STATS_TYPE_PORT
Rasmus Villemoes [Wed, 29 May 2019 07:02:11 +0000 (07:02 +0000)]
net: dsa: mv88e6xxx: fix handling of upper half of STATS_TYPE_PORT

[ Upstream commit 84b3fd1fc9592d431e23b077e692fa4e3fd0f086 ]

Currently, the upper half of a 4-byte STATS_TYPE_PORT statistic ends
up in bits 47:32 of the return value, instead of bits 31:16 as they
should.

Fixes: 6e46e2d821bb ("net: dsa: mv88e6xxx: Fix u64 statistics")
Signed-off-by: Rasmus Villemoes <rasmus.villemoes@prevas.dk>
Reviewed-by: Vivien Didelot <vivien.didelot@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agomlxsw: spectrum_acl: Avoid warning after identical rules insertion
Jiri Pirko [Wed, 29 May 2019 07:59:44 +0000 (10:59 +0300)]
mlxsw: spectrum_acl: Avoid warning after identical rules insertion

[ Upstream commit ef74422020aa8c224b00a927e3e47faac4d8fae3 ]

When identical rules are inserted, the latter one goes to C-TCAM. For
that, a second eRP with the same mask is created. These 2 eRPs by the
nature cannot be merged and also one cannot be parent of another.
Teach mlxsw_sp_acl_erp_delta_fill() about this possibility and handle it
gracefully.

Reported-by: Alex Kushnarov <alexanderk@mellanox.com>
Fixes: c22291f7cf45 ("mlxsw: spectrum: acl: Implement delta for ERP")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agollc: fix skb leak in llc_build_and_send_ui_pkt()
Eric Dumazet [Tue, 28 May 2019 00:35:52 +0000 (17:35 -0700)]
llc: fix skb leak in llc_build_and_send_ui_pkt()

[ Upstream commit 8fb44d60d4142cd2a440620cd291d346e23c131e ]

If llc_mac_hdr_init() returns an error, we must drop the skb
since no llc_build_and_send_ui_pkt() caller will take care of this.

BUG: memory leak
unreferenced object 0xffff8881202b6800 (size 2048):
  comm "syz-executor907", pid 7074, jiffies 4294943781 (age 8.590s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    1a 00 07 40 00 00 00 00 00 00 00 00 00 00 00 00  ...@............
  backtrace:
    [<00000000e25b5abe>] kmemleak_alloc_recursive include/linux/kmemleak.h:55 [inline]
    [<00000000e25b5abe>] slab_post_alloc_hook mm/slab.h:439 [inline]
    [<00000000e25b5abe>] slab_alloc mm/slab.c:3326 [inline]
    [<00000000e25b5abe>] __do_kmalloc mm/slab.c:3658 [inline]
    [<00000000e25b5abe>] __kmalloc+0x161/0x2c0 mm/slab.c:3669
    [<00000000a1ae188a>] kmalloc include/linux/slab.h:552 [inline]
    [<00000000a1ae188a>] sk_prot_alloc+0xd6/0x170 net/core/sock.c:1608
    [<00000000ded25bbe>] sk_alloc+0x35/0x2f0 net/core/sock.c:1662
    [<000000002ecae075>] llc_sk_alloc+0x35/0x170 net/llc/llc_conn.c:950
    [<00000000551f7c47>] llc_ui_create+0x7b/0x140 net/llc/af_llc.c:173
    [<0000000029027f0e>] __sock_create+0x164/0x250 net/socket.c:1430
    [<000000008bdec225>] sock_create net/socket.c:1481 [inline]
    [<000000008bdec225>] __sys_socket+0x69/0x110 net/socket.c:1523
    [<00000000b6439228>] __do_sys_socket net/socket.c:1532 [inline]
    [<00000000b6439228>] __se_sys_socket net/socket.c:1530 [inline]
    [<00000000b6439228>] __x64_sys_socket+0x1e/0x30 net/socket.c:1530
    [<00000000cec820c1>] do_syscall_64+0x76/0x1a0 arch/x86/entry/common.c:301
    [<000000000c32554f>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

BUG: memory leak
unreferenced object 0xffff88811d750d00 (size 224):
  comm "syz-executor907", pid 7074, jiffies 4294943781 (age 8.600s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 f0 0c 24 81 88 ff ff 00 68 2b 20 81 88 ff ff  ...$.....h+ ....
  backtrace:
    [<0000000053026172>] kmemleak_alloc_recursive include/linux/kmemleak.h:55 [inline]
    [<0000000053026172>] slab_post_alloc_hook mm/slab.h:439 [inline]
    [<0000000053026172>] slab_alloc_node mm/slab.c:3269 [inline]
    [<0000000053026172>] kmem_cache_alloc_node+0x153/0x2a0 mm/slab.c:3579
    [<00000000fa8f3c30>] __alloc_skb+0x6e/0x210 net/core/skbuff.c:198
    [<00000000d96fdafb>] alloc_skb include/linux/skbuff.h:1058 [inline]
    [<00000000d96fdafb>] alloc_skb_with_frags+0x5f/0x250 net/core/skbuff.c:5327
    [<000000000a34a2e7>] sock_alloc_send_pskb+0x269/0x2a0 net/core/sock.c:2225
    [<00000000ee39999b>] sock_alloc_send_skb+0x32/0x40 net/core/sock.c:2242
    [<00000000e034d810>] llc_ui_sendmsg+0x10a/0x540 net/llc/af_llc.c:933
    [<00000000c0bc8445>] sock_sendmsg_nosec net/socket.c:652 [inline]
    [<00000000c0bc8445>] sock_sendmsg+0x54/0x70 net/socket.c:671
    [<000000003b687167>] __sys_sendto+0x148/0x1f0 net/socket.c:1964
    [<00000000922d78d9>] __do_sys_sendto net/socket.c:1976 [inline]
    [<00000000922d78d9>] __se_sys_sendto net/socket.c:1972 [inline]
    [<00000000922d78d9>] __x64_sys_sendto+0x2a/0x30 net/socket.c:1972
    [<00000000cec820c1>] do_syscall_64+0x76/0x1a0 arch/x86/entry/common.c:301
    [<000000000c32554f>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agoipv6: Fix redirect with VRF
David Ahern [Wed, 22 May 2019 22:12:18 +0000 (15:12 -0700)]
ipv6: Fix redirect with VRF

[ Upstream commit 31680ac265802397937d75461a2809a067b9fb93 ]

IPv6 redirect is broken for VRF. __ip6_route_redirect walks the FIB
entries looking for an exact match on ifindex. With VRF the flowi6_oif
is updated by l3mdev_update_flow to the l3mdev index and the
FLOWI_FLAG_SKIP_NH_OIF set in the flags to tell the lookup to skip the
device match. For redirects the device match is requires so use that
flag to know when the oif needs to be reset to the skb device index.

Fixes: ca254490c8df ("net: Add VRF support to IPv6 stack")
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agoipv6: Consider sk_bound_dev_if when binding a raw socket to an address
Mike Manning [Mon, 20 May 2019 18:57:17 +0000 (19:57 +0100)]
ipv6: Consider sk_bound_dev_if when binding a raw socket to an address

[ Upstream commit 72f7cfab6f93a8ea825fab8ccfb016d064269f7f ]

IPv6 does not consider if the socket is bound to a device when binding
to an address. The result is that a socket can be bound to eth0 and
then bound to the address of eth1. If the device is a VRF, the result
is that a socket can only be bound to an address in the default VRF.

Resolve by considering the device if sk_bound_dev_if is set.

Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Tested-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agoipv4/igmp: fix build error if !CONFIG_IP_MULTICAST
Eric Dumazet [Thu, 23 May 2019 01:35:16 +0000 (18:35 -0700)]
ipv4/igmp: fix build error if !CONFIG_IP_MULTICAST

[ Upstream commit 903869bd10e6719b9df6718e785be7ec725df59f ]

ip_sf_list_clear_all() needs to be defined even if !CONFIG_IP_MULTICAST

Fixes: 3580d04aa674 ("ipv4/igmp: fix another memory leak in igmpv3_del_delrec()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agoipv4/igmp: fix another memory leak in igmpv3_del_delrec()
Eric Dumazet [Wed, 22 May 2019 23:51:22 +0000 (16:51 -0700)]
ipv4/igmp: fix another memory leak in igmpv3_del_delrec()

[ Upstream commit 3580d04aa674383c42de7b635d28e52a1e5bc72c ]

syzbot reported memory leaks [1] that I have back tracked to
a missing cleanup from igmpv3_del_delrec() when
(im->sfmode != MCAST_INCLUDE)

Add ip_sf_list_clear_all() and kfree_pmc() helpers to explicitely
handle the cleanups before freeing.

[1]

BUG: memory leak
unreferenced object 0xffff888123e32b00 (size 64):
  comm "softirq", pid 0, jiffies 4294942968 (age 8.010s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 e0 00 00 01 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<000000006105011b>] kmemleak_alloc_recursive include/linux/kmemleak.h:55 [inline]
    [<000000006105011b>] slab_post_alloc_hook mm/slab.h:439 [inline]
    [<000000006105011b>] slab_alloc mm/slab.c:3326 [inline]
    [<000000006105011b>] kmem_cache_alloc_trace+0x13d/0x280 mm/slab.c:3553
    [<000000004bba8073>] kmalloc include/linux/slab.h:547 [inline]
    [<000000004bba8073>] kzalloc include/linux/slab.h:742 [inline]
    [<000000004bba8073>] ip_mc_add1_src net/ipv4/igmp.c:1961 [inline]
    [<000000004bba8073>] ip_mc_add_src+0x36b/0x400 net/ipv4/igmp.c:2085
    [<00000000a46a65a0>] ip_mc_msfilter+0x22d/0x310 net/ipv4/igmp.c:2475
    [<000000005956ca89>] do_ip_setsockopt.isra.0+0x1795/0x1930 net/ipv4/ip_sockglue.c:957
    [<00000000848e2d2f>] ip_setsockopt+0x3b/0xb0 net/ipv4/ip_sockglue.c:1246
    [<00000000b9db185c>] udp_setsockopt+0x4e/0x90 net/ipv4/udp.c:2616
    [<000000003028e438>] sock_common_setsockopt+0x38/0x50 net/core/sock.c:3130
    [<0000000015b65589>] __sys_setsockopt+0x98/0x120 net/socket.c:2078
    [<00000000ac198ef0>] __do_sys_setsockopt net/socket.c:2089 [inline]
    [<00000000ac198ef0>] __se_sys_setsockopt net/socket.c:2086 [inline]
    [<00000000ac198ef0>] __x64_sys_setsockopt+0x26/0x30 net/socket.c:2086
    [<000000000a770437>] do_syscall_64+0x76/0x1a0 arch/x86/entry/common.c:301
    [<00000000d3adb93b>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: 9c8bb163ae78 ("igmp, mld: Fix memory leak in igmpv3/mld_del_delrec()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Hangbin Liu <liuhangbin@gmail.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agoinet: switch IP ID generator to siphash
Eric Dumazet [Wed, 27 Mar 2019 19:40:33 +0000 (12:40 -0700)]
inet: switch IP ID generator to siphash

[ Upstream commit df453700e8d81b1bdafdf684365ee2b9431fb702 ]

According to Amit Klein and Benny Pinkas, IP ID generation is too weak
and might be used by attackers.

Even with recent net_hash_mix() fix (netns: provide pure entropy for net_hash_mix())
having 64bit key and Jenkins hash is risky.

It is time to switch to siphash and its 128bit keys.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Amit Klein <aksecurity@gmail.com>
Reported-by: Benny Pinkas <benny@pinkas.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agocxgb4: offload VLAN flows regardless of VLAN ethtype
Raju Rangoju [Thu, 23 May 2019 15:11:44 +0000 (20:41 +0530)]
cxgb4: offload VLAN flows regardless of VLAN ethtype

[ Upstream commit b5730061d1056abf317caea823b94d6e12b5b4f6 ]

VLAN flows never get offloaded unless ivlan_vld is set in filter spec.
It's not compulsory for vlan_ethtype to be set.

So, always enable ivlan_vld bit for offloading VLAN flows regardless of
vlan_ethtype is set or not.

Fixes: ad9af3e09c (cxgb4: add tc flower match support for vlan)
Signed-off-by: Raju Rangoju <rajur@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agobonding/802.3ad: fix slave link initialization transition states
Jarod Wilson [Fri, 24 May 2019 13:49:28 +0000 (09:49 -0400)]
bonding/802.3ad: fix slave link initialization transition states

[ Upstream commit 334031219a84b9994594015aab85ed7754c80176 ]

Once in a while, with just the right timing, 802.3ad slaves will fail to
properly initialize, winding up in a weird state, with a partner system
mac address of 00:00:00:00:00:00. This started happening after a fix to
properly track link_failure_count tracking, where an 802.3ad slave that
reported itself as link up in the miimon code, but wasn't able to get a
valid speed/duplex, started getting set to BOND_LINK_FAIL instead of
BOND_LINK_DOWN. That was the proper thing to do for the general "my link
went down" case, but has created a link initialization race that can put
the interface in this odd state.

The simple fix is to instead set the slave link to BOND_LINK_DOWN again,
if the link has never been up (last_link_up == 0), so the link state
doesn't bounce from BOND_LINK_DOWN to BOND_LINK_FAIL -- it hasn't failed
in this case, it simply hasn't been up yet, and this prevents the
unnecessary state change from DOWN to FAIL and getting stuck in an init
failure w/o a partner mac.

Fixes: ea53abfab960 ("bonding/802.3ad: fix link_failure_count tracking")
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: "David S. Miller" <davem@davemloft.net>
CC: netdev@vger.kernel.org
Tested-by: Heesoon Kim <Heesoon.Kim@stratus.com>
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agoMerge remote-tracking branch 'stable/linux-5.0.y' into zygo-5.0.x-zb64
Zygo Blaxell [Fri, 31 May 2019 18:25:31 +0000 (14:25 -0400)]
Merge remote-tracking branch 'stable/linux-5.0.y' into zygo-5.0.x-zb64

6 years agoLinux 5.0.20
Greg Kroah-Hartman [Fri, 31 May 2019 13:45:24 +0000 (06:45 -0700)]
Linux 5.0.20

6 years agoNFS: Fix a double unlock from nfs_match,get_client
Benjamin Coddington [Thu, 9 May 2019 11:25:21 +0000 (07:25 -0400)]
NFS: Fix a double unlock from nfs_match,get_client

[ Upstream commit c260121a97a3e4df6536edbc2f26e166eff370ce ]

Now that nfs_match_client drops the nfs_client_lock, we should be
careful
to always return it in the same condition: locked.

Fixes: 950a578c6128 ("NFS: make nfs_match_client killable")
Reported-by: syzbot+228a82b263b5da91883d@syzkaller.appspotmail.com
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
6 years agodrm/sun4i: dsi: Enforce boundaries on the start delay
Maxime Ripard [Mon, 11 Feb 2019 14:41:24 +0000 (15:41 +0100)]
drm/sun4i: dsi: Enforce boundaries on the start delay

[ Upstream commit efa31801203ac2f5c6a82a28cb991c7163ee0f1d ]

The Allwinner BSP makes sure that we don't end up with a null start delay
or with a delay larger than vtotal.

The former condition is likely to happen now with the reworked start delay,
so make sure we enforce the same boundaries.

Signed-off-by: Maxime Ripard <maxime.ripard@bootlin.com>
Reviewed-by: Paul Kocialkowski <paul.kocialkowski@bootlin.com>
Link: https://patchwork.freedesktop.org/patch/msgid/c9889cf5f7a3d101ef380905900b45a182596f56.1549896081.git-series.maxime.ripard@bootlin.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
6 years agoice: Put __ICE_PREPARED_FOR_RESET check in ice_prepare_for_reset
Brett Creeley [Wed, 13 Feb 2019 18:51:14 +0000 (10:51 -0800)]
ice: Put __ICE_PREPARED_FOR_RESET check in ice_prepare_for_reset

[ Upstream commit 5abac9d7e1bb9a373673811154774d4c89a7f85e ]

Currently we check if the __ICE_PREPARED_FOR_RESET bit is set prior to
calling ice_prepare_for_reset in ice_reset_subtask(), but we aren't
checking that bit in ice_do_reset() before calling
ice_prepare_for_reset(). This is not consistent and can cause issues if
ice_prepare_for_reset() is called prior to ice_do_reset(). Fix this by
checking if the __ICE_PREPARED_FOR_RESET bit is set internal to
ice_prepare_for_reset().

Signed-off-by: Brett Creeley <brett.creeley@intel.com>
Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
6 years agovfio-ccw: Prevent quiesce function going into an infinite loop
Farhan Ali [Tue, 16 Apr 2019 21:23:14 +0000 (17:23 -0400)]
vfio-ccw: Prevent quiesce function going into an infinite loop

[ Upstream commit d1ffa760d22aa1d8190478e5ef555c59a771db27 ]

The quiesce function calls cio_cancel_halt_clear() and if we
get an -EBUSY we go into a loop where we:
- wait for any interrupts
- flush all I/O in the workqueue
- retry cio_cancel_halt_clear

During the period where we are waiting for interrupts or
flushing all I/O, the channel subsystem could have completed
a halt/clear action and turned off the corresponding activity
control bits in the subchannel status word. This means the next
time we call cio_cancel_halt_clear(), we will again start by
calling cancel subchannel and so we can be stuck between calling
cancel and halt forever.

Rather than calling cio_cancel_halt_clear() immediately after
waiting, let's try to disable the subchannel. If we succeed in
disabling the subchannel then we know nothing else can happen
with the device.

Suggested-by: Eric Farman <farman@linux.ibm.com>
Signed-off-by: Farhan Ali <alifm@linux.ibm.com>
Message-Id: <4d5a4b98ab1b41ac6131b5c36de18b76c5d66898.1555449329.git.alifm@linux.ibm.com>
Reviewed-by: Eric Farman <farman@linux.ibm.com>
Acked-by: Halil Pasic <pasic@linux.ibm.com>
Signed-off-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
6 years agodrm/sun4i: dsi: Change the start delay calculation
Maxime Ripard [Mon, 11 Feb 2019 14:41:23 +0000 (15:41 +0100)]
drm/sun4i: dsi: Change the start delay calculation

[ Upstream commit da676c6aa6413d59ab0a80c97bbc273025e640b2 ]

The current calculation for the video start delay in the current DSI driver
is that it is the total vertical size, minus the front porch and sync length,
plus 1. This equals to the active vertical size plus the back porch plus 1.

That 1 is coming in the Allwinner BSP from an variable that is set to 1.
However, if we look at the Allwinner BSP more closely, and especially in
the "legacy" code for the display (in drivers/video/sunxi/legacy/), we can
see that this variable is actually computed from the porches and the sync
minus 10, clamped between 8 and 100.

This fixes the start delay symptom we've seen on some panels (vblank
timeouts with vertical white stripes at the bottom of the panel).

Reviewed-by: Paul Kocialkowski <paul.kocialkowski@bootlin.com>
Signed-off-by: Maxime Ripard <maxime.ripard@bootlin.com>
Link: https://patchwork.freedesktop.org/patch/msgid/6e5f72e68f47ca0223877464bf12f0c3f3978de8.1549896081.git-series.maxime.ripard@bootlin.com
Signed-off-by: Sasha Levin <sashal@kernel.org>