git.hungrycats.org Git

btrfs: add stripes filter

BTRFS: support NFSv2 export (bnc#929871).

suse-commit: e41cbadff994083175fdd876777b52bafff209e4
(cherry picked from commit 51e974c9ec17e282640119dd71608051e74d12ac)

Btrfs: update fix for read corruption of compressed and shared extents

My previous fix in commit 005efedf2c7d ("Btrfs: fix read corruption of
compressed and shared extents") was effective only if the compressed
extents cover a file range with a length that is not a multiple of 16
pages. That's because the detection of when we reached a different range
of the file that shares the same compressed extent as the previously
processed range was done at extent_io.c:__do_contiguous_readpages(),
which covers subranges with a length up to 16 pages, because
extent_readpages() groups the pages in clusters no larger than 16 pages.
So fix this by tracking the start of the previously processed file
range's extent map at extent_readpages().

The following test case for fstests reproduces the issue:

  seq=`basename $0`
  seqres=$RESULT_DIR/$seq
  echo "QA output created by $seq"
  tmp=/tmp/$$
  status=1 # failure is the default!
  trap "_cleanup; exit \$status" 0 1 2 3 15

  _cleanup()
  {
      rm -f $tmp.*
  }

  # get standard environment, filters and checks
  . ./common/rc
  . ./common/filter

  # real QA test starts here
  _need_to_be_root
  _supported_fs btrfs
  _supported_os Linux
  _require_scratch
  _require_cloner

  rm -f $seqres.full

  test_clone_and_read_compressed_extent()
  {
      local mount_opts=$1

      _scratch_mkfs >>$seqres.full 2>&1
      _scratch_mount $mount_opts

      # Create our test file with a single extent of 64Kb that is going to
      # be compressed no matter which compression algo is used (zlib/lzo).
      $XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 64K" \
          $SCRATCH_MNT/foo | _filter_xfs_io

      # Now clone the compressed extent into an adjacent file offset.
      $CLONER_PROG -s 0 -d $((64 * 1024)) -l $((64 * 1024)) \
          $SCRATCH_MNT/foo $SCRATCH_MNT/foo

      echo "File digest before unmount:"
      md5sum $SCRATCH_MNT/foo | _filter_scratch

      # Remount the fs or clear the page cache to trigger the bug in
      # btrfs. Because the extent has an uncompressed length that is a
      # multiple of 16 pages, all the pages belonging to the second range
      # of the file (64K to 128K), which points to the same extent as the
      # first range (0K to 64K), had their contents full of zeroes instead
      # of the byte 0xaa. This was a bug exclusively in the read path of
      # compressed extents, the correct data was stored on disk, btrfs
      # just failed to fill in the pages correctly.
      _scratch_remount

      echo "File digest after remount:"
      # Must match the digest we got before.
      md5sum $SCRATCH_MNT/foo | _filter_scratch
  }

  echo -e "\nTesting with zlib compression..."
  test_clone_and_read_compressed_extent "-o compress=zlib"

  _scratch_unmount

  echo -e "\nTesting with lzo compression..."
  test_clone_and_read_compressed_extent "-o compress=lzo"

  status=0
  exit

Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: Timofey Titovets <nefelim4ag@gmail.com>
(cherry picked from commit 90f232390d5ca1f6160587f91b0ceef6acf40de5)

make pending_ordered not suck

(cherry picked from commit e15dc76f828511963b35745ddbd9b58723026669)

Btrfs: cleanup: remove unnecessary check before btrfs_free_path is called

We need not check path before btrfs_free_path() is called because
path is checked in btrfs_free_path().

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit 527afb4493c2892ce89fb74648e72a30b68ba120)

btrfs: Add raid56 support for updating
num_tolerated_disk_barrier_failures in btrfs_balance

Code for updating fs_info->num_tolerated_disk_barrier_failures in
btrfs_balance() lacks raid56 support.

Reason:
Above code was wroten in 2012-08-01, together with
btrfs_calc_num_tolerated_disk_barrier_failures()'s first version.

Then, btrfs_calc_num_tolerated_disk_barrier_failures() got updated
later to support raid56, but code in btrfs_balance() was not
updated together.

Fix:
Merge above similar code to a common function:
btrfs_get_num_tolerated_disk_barrier_failures()
and make it support both case.

It can fix this bug with a bonus of cleanup, and make these code
never in above no-sync state from now on.

Suggested-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit 943c6e9925d90dc80207322b5799d95fb90ffec0)

btrfs: Cleanup for btrfs_calc_num_tolerated_disk_barrier_failures

1: Use ARRAY_SIZE(types) to replace a static-value variant:
   int num_types = 4;

2: Use 'continue' on condition to reduce one level tab
   if (!XXX) {
       code;
       ...
   }
   ->
   if (XXX)
       continue;
   code;
   ...

3: Put setting 'num_tolerated_disk_barrier_failures = 2' to
   (num_tolerated_disk_barrier_failures > 2) condition to make
   make logic neat.
   if (num_tolerated_disk_barrier_failures > 0 && XXX)
       num_tolerated_disk_barrier_failures = 0;
   else if (num_tolerated_disk_barrier_failures > 1) {
       if (XXX)
           num_tolerated_disk_barrier_failures = 1;
       else if (XXX)
           num_tolerated_disk_barrier_failures = 2;
   ->
   if (num_tolerated_disk_barrier_failures > 0 && XXX)
       num_tolerated_disk_barrier_failures = 0;
   if (num_tolerated_disk_barrier_failures > 1 && XXX)
       num_tolerated_disk_barrier_failures = ;
   if (num_tolerated_disk_barrier_failures > 2 && XXX)
       num_tolerated_disk_barrier_failures = 2;

4: Remove comment of:
   num_mirrors - 1: if RAID1 or RAID10 is configured and more
   than 2 mirrors are used.
   which is not fit with code.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit 2c4580454fffbf184fdb9292aa19ab1ffc224add)

btrfs: Remove noused chunk_tree and chunk_objectid from scrub_enumerate_chunks and scrub_chunk

These variables are not used from introduced version, remove them.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit 8c204c9657c32ec5a259ebf852a767afe7efdafa)

Btrfs: fix read corruption of compressed and shared extents

If a file has a range pointing to a compressed extent, followed by
another range that points to the same compressed extent and a read
operation attempts to read both ranges (either completely or part of
them), the pages that correspond to the second range are incorrectly
filled with zeroes.

Consider the following example:

  File layout
  [0 - 8K]                      [8K - 24K]
      |                             |
      |                             |
   points to extent X,         points to extent X,
   offset 4K, length of 8K     offset 0, length 16K

  [extent X, compressed length = 4K uncompressed length = 16K]

If a readpages() call spans the 2 ranges, a single bio to read the extent
is submitted - extent_io.c:submit_extent_page() would only create a new
bio to cover the second range pointing to the extent if the extent it
points to had a different logical address than the extent associated with
the first range. This has a consequence of the compressed read end io
handler (compression.c:end_compressed_bio_read()) finish once the extent
is decompressed into the pages covering the first range, leaving the
remaining pages (belonging to the second range) filled with zeroes (done
by compression.c:btrfs_clear_biovec_end()).

So fix this by submitting the current bio whenever we find a range
pointing to a compressed extent that was preceded by a range with a
different extent map. This is the simplest solution for this corner
case. Making the end io callback populate both ranges (or more, if we
have multiple pointing to the same extent) is a much more complex
solution since each bio is tightly coupled with a single extent map and
the extent maps associated to the ranges pointing to the shared extent
can have different offsets and lengths.

The following test case for fstests triggers the issue:

  seq=`basename $0`
  seqres=$RESULT_DIR/$seq
  echo "QA output created by $seq"
  tmp=/tmp/$$
  status=1 # failure is the default!
  trap "_cleanup; exit \$status" 0 1 2 3 15

  _cleanup()
  {
      rm -f $tmp.*
  }

  # get standard environment, filters and checks
  . ./common/rc
  . ./common/filter

  # real QA test starts here
  _need_to_be_root
  _supported_fs btrfs
  _supported_os Linux
  _require_scratch
  _require_cloner

  rm -f $seqres.full

  test_clone_and_read_compressed_extent()
  {
      local mount_opts=$1

      _scratch_mkfs >>$seqres.full 2>&1
      _scratch_mount $mount_opts

      # Create a test file with a single extent that is compressed (the
      # data we write into it is highly compressible no matter which
      # compression algorithm is used, zlib or lzo).
      $XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 4K"        \
                      -c "pwrite -S 0xbb 4K 8K"        \
                      -c "pwrite -S 0xcc 12K 4K"       \
                      $SCRATCH_MNT/foo | _filter_xfs_io

      # Now clone our extent into an adjacent offset.
      $CLONER_PROG -s $((4 * 1024)) -d $((16 * 1024)) -l $((8 * 1024)) \
          $SCRATCH_MNT/foo $SCRATCH_MNT/foo

      # Same as before but for this file we clone the extent into a lower
      # file offset.
      $XFS_IO_PROG -f -c "pwrite -S 0xaa 8K 4K"         \
                      -c "pwrite -S 0xbb 12K 8K"        \
                      -c "pwrite -S 0xcc 20K 4K"        \
                      $SCRATCH_MNT/bar | _filter_xfs_io

      $CLONER_PROG -s $((12 * 1024)) -d 0 -l $((8 * 1024)) \
          $SCRATCH_MNT/bar $SCRATCH_MNT/bar

      echo "File digests before unmounting filesystem:"
      md5sum $SCRATCH_MNT/foo | _filter_scratch
      md5sum $SCRATCH_MNT/bar | _filter_scratch

      # Evicting the inode or clearing the page cache before reading
      # again the file would also trigger the bug - reads were returning
      # all bytes in the range corresponding to the second reference to
      # the extent with a value of 0, but the correct data was persisted
      # (it was a bug exclusively in the read path). The issue happened
      # only if the same readpages() call targeted pages belonging to the
      # first and second ranges that point to the same compressed extent.
      _scratch_remount

      echo "File digests after mounting filesystem again:"
      # Must match the same digests we got before.
      md5sum $SCRATCH_MNT/foo | _filter_scratch
      md5sum $SCRATCH_MNT/bar | _filter_scratch
  }

  echo -e "\nTesting with zlib compression..."
  test_clone_and_read_compressed_extent "-o compress=zlib"

  _scratch_unmount

  echo -e "\nTesting with lzo compression..."
  test_clone_and_read_compressed_extent "-o compress=lzo"

  status=0
  exit

Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo<quwenruo@cn.fujitsu.com>
(cherry picked from commit 1c44d0493183fc27cdf030b1d437442d1f751bec)

Btrfs: fix stale directory entries after fsync log replay
(bsc#942925).

suse-commit: 8152b37d7e8d49cf32f532d70e9be8a87e3fb92d
(cherry picked from commit 4de4f38569bdc4f8c0edb1c2fca08b63052fd7a2)

Btrfs: make fsync fast again

Filipe removed the main reason I made fsync fast, which was to move the waiting
on ordered IO while we're logging extents. This patch preserves his need to
catch IO errors and my need for fsync performance to not suck donkey balls.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
(cherry picked from commit 67921b028b4a52ae65e6260228d311fa39659e99)

Btrfs: don't ignore log btree writeback errors

If an error happens during writeback of log btree extents, make sure the
error is returned to the caller (fsync), so that it takes proper action
(commit current transaction) instead of writing a superblock that points
to log btrees with all or some nodes that weren't durably persisted.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit 5ab5e44a36164f0366a98b47289c868d8fbcb256)

Btrfs: don't attach unnecessary extents to transaction on fsync

We don't need to attach ordered extents that have completed to the current
transaction. Doing so only makes us hold memory for longer than necessary
and delaying the iput of the inode until the transaction is committed (for
each created ordered extent we do an igrab and then schedule an asynchronous
iput when the ordered extent's reference count drops to 0), preventing the
inode from being evictable until the transaction commits.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit 7558c8bc17481c1f856e009af8503ab40fec348a)

Btrfs: deal with error on secondary log properly

If we have an fsync at the same time in two seperate subvolumes we could end up
with the tree log pointing at invalid blocks. We need to notice if our writeout
failed in anyway, if it did then we need to do a full transaction commit and
return an error on the subvolume that gave us the io error. Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
(cherry picked from commit d3e532cc080acaf30bac093998e4db4901cf690b)

btrfs: Remove unused arguments in tree-log.c

Following arguments are not used in tree-log.c:
insert_one_name(): path, type
wait_log_commit(): trans
wait_for_writer(): trans

This patch remove them.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit 60d53eb3107c8e8960e8d7c22aa4e69aac7a8fe6)

Revert "Btrfs: don't ignore log btree writeback errors"

This reverts commit e89abcfabf775603cb2eae86a79748313f6e5f73.

Btrfs: fix file read corruption after extent cloning and fsync

If we partially clone one extent of a file into a lower offset of the
file, fsync the file, power fail and then mount the fs to trigger log
replay, we can get multiple checksum items in the csum tree that overlap
each other and result in checksum lookup failures later. Those failures
can make file data read requests assume a checksum value of 0, but they
will not return an error (-EIO for example) to userspace exactly because
the expected checksum value 0 is a special value that makes the read bio
endio callback return success and set all the bytes of the corresponding
page with the value 0x01 (at fs/btrfs/inode.c:__readpage_endio_check()).
From a userspace perspective this is equivalent to file corruption
because we are not returning what was written to the file.

Details about how this can happen, and why, are included inline in the
following reproducer test case for fstests and the comment added to
tree-log.c.

  seq=`basename $0`
  seqres=$RESULT_DIR/$seq
  echo "QA output created by $seq"
  tmp=/tmp/$$
  status=1 # failure is the default!
  trap "_cleanup; exit \$status" 0 1 2 3 15

  _cleanup()
  {
      _cleanup_flakey
      rm -f $tmp.*
  }

  # get standard environment, filters and checks
  . ./common/rc
  . ./common/filter
  . ./common/dmflakey

  # real QA test starts here
  _need_to_be_root
  _supported_fs btrfs
  _supported_os Linux
  _require_scratch
  _require_dm_flakey
  _require_cloner
  _require_metadata_journaling $SCRATCH_DEV

  rm -f $seqres.full

  _scratch_mkfs >>$seqres.full 2>&1
  _init_flakey
  _mount_flakey

  # Create our test file with a single 100K extent starting at file
  # offset 800K. We fsync the file here to make the fsync log tree gets
  # a single csum item that covers the whole 100K extent, which causes
  # the second fsync, done after the cloning operation below, to not
  # leave in the log tree two csum items covering two sub-ranges
  # ([0, 20K[ and [20K, 100K[)) of our extent.
  $XFS_IO_PROG -f -c "pwrite -S 0xaa 800K 100K"  \
                  -c "fsync"                     \
                   $SCRATCH_MNT/foo | _filter_xfs_io

  # Now clone part of our extent into file offset 400K. This adds a file
  # extent item to our inode's metadata that points to the 100K extent
  # we created before, using a data offset of 20K and a data length of
  # 20K, so that it refers to the sub-range [20K, 40K[ of our original
  # extent.
  $CLONER_PROG -s $((800 * 1024 + 20 * 1024)) -d $((400 * 1024)) \
      -l $((20 * 1024)) $SCRATCH_MNT/foo $SCRATCH_MNT/foo

  # Now fsync our file to make sure the extent cloning is durably
  # persisted. This fsync will not add a second csum item to the log
  # tree containing the checksums for the blocks in the sub-range
  # [20K, 40K[ of our extent, because there was already a csum item in
  # the log tree covering the whole extent, added by the first fsync
  # we did before.
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo

  echo "File digest before power failure:"
  md5sum $SCRATCH_MNT/foo | _filter_scratch

  # Silently drop all writes and ummount to simulate a crash/power
  # failure.
  _load_flakey_table $FLAKEY_DROP_WRITES
  _unmount_flakey

  # Allow writes again, mount to trigger log replay and validate file
  # contents.
  # The fsync log replay first processes the file extent item
  # corresponding to the file offset 400K (the one which refers to the
  # [20K, 40K[ sub-range of our 100K extent) and then processes the file
  # extent item for file offset 800K. It used to happen that when
  # processing the later, it erroneously left in the csum tree 2 csum
  # items that overlapped each other, 1 for the sub-range [20K, 40K[ and
  # 1 for the whole range of our extent. This introduced a problem where
  # subsequent lookups for the checksums of blocks within the range
  # [40K, 100K[ of our extent would not find anything because lookups in
  # the csum tree ended up looking only at the smaller csum item, the
  # one covering the subrange [20K, 40K[. This made read requests assume
  # an expected checksum with a value of 0 for those blocks, which caused
  # checksum verification failure when the read operations finished.
  # However those checksum failure did not result in read requests
  # returning an error to user space (like -EIO for e.g.) because the
  # expected checksum value had the special value 0, and in that case
  # btrfs set all bytes of the corresponding pages with the value 0x01
  # and produce the following warning in dmesg/syslog:
  #
  #  "BTRFS warning (device dm-0): csum failed ino 257 off 917504 csum\
  #   1322675045 expected csum 0"
  #
  _load_flakey_table $FLAKEY_ALLOW_WRITES
  _mount_flakey

  echo "File digest after log replay:"
  # Must match the same digest he had after cloning the extent and
  # before the power failure happened.
  md5sum $SCRATCH_MNT/foo | _filter_scratch

  _unmount_flakey

  status=0
  exit

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
(cherry picked from commit b776fbaea280916fcdaa69bd9aef959aa3691e19)

zygo: fs/btrfs: make noise for EIO: run find fs/btrfs -type f -print0 | xargs -0 perl -i.bak -pe 's/\bEIO\b/EIEIO/g unless /EIEIO/;'

zygo: fs/btrfs: make noise for EIO: the inline function

(cherry picked from commit 7fcb906872148dc2b293d328e8afb902d1843d47)

zygo: fs/btrfs: make noise for EIO: the inline function in both necessary header files

(cherry picked from commit df25a98ca7c6659e55498721e3d7104687008c5b)

zygo: fs/btrfs: make noise for EIO: the one exception to the GSR rule

(cherry picked from commit 4b5daa5700309ceed3c27d071de962b1394a7208)

zygo: fs/btrfs: make noise for EIO: generalized exception markup

Btrfs: fix list transaction->pending_ordered corruption

commit d3efe08400317888f559bbedf0e42cd31575d0ef upstream.

When we call btrfs_commit_transaction(), we splice the list "ordered"
of our transaction handle into the transaction's "pending_ordered"
list, but we don't re-initialize the "ordered" list of our transaction
handle, this means it still points to the same elements it used to
before the splice. Then we check if the current transaction's state is
>= TRANS_STATE_COMMIT_START and if it is we end up calling
btrfs_end_transaction() which simply splices again the "ordered" list
of our handle into the transaction's "pending_ordered" list, leaving
multiple pointers to the same ordered extents which results in list
corruption when we are iterating, removing and freeing ordered extents
at btrfs_wait_pending_ordered(), resulting in access to dangling
pointers / use-after-free issues.
Similarly, btrfs_end_transaction() can end up in some cases calling
btrfs_commit_transaction(), and both did a list splice of the transaction
handle's "ordered" list into the transaction's "pending_ordered" without
re-initializing the handle's "ordered" list, resulting in exactly the
same problem.

This produces the following warning on a kernel with linked list
debugging enabled:

[109749.265416] ------------[ cut here ]------------
[109749.266410] WARNING: CPU: 7 PID: 324 at lib/list_debug.c:59 __list_del_entry+0x5a/0x98()
[109749.267969] list_del corruption. prev->next should be ffff8800ba087e20, but was fffffff8c1f7c35d
(...)
[109749.287505] Call Trace:
[109749.288135]  [<ffffffff8145f077>] dump_stack+0x4f/0x7b
[109749.298080]  [<ffffffff81095de5>] ? console_unlock+0x356/0x3a2
[109749.331605]  [<ffffffff8104b3b0>] warn_slowpath_common+0xa1/0xbb
[109749.334849]  [<ffffffff81260642>] ? __list_del_entry+0x5a/0x98
[109749.337093]  [<ffffffff8104b410>] warn_slowpath_fmt+0x46/0x48
[109749.337847]  [<ffffffff81260642>] __list_del_entry+0x5a/0x98
[109749.338678]  [<ffffffffa053e8bf>] btrfs_wait_pending_ordered+0x46/0xdb [btrfs]
[109749.340145]  [<ffffffffa058a65f>] ? __btrfs_run_delayed_items+0x149/0x163 [btrfs]
[109749.348313]  [<ffffffffa054077d>] btrfs_commit_transaction+0x36b/0xa10 [btrfs]
[109749.349745]  [<ffffffff81087310>] ? trace_hardirqs_on+0xd/0xf
[109749.350819]  [<ffffffffa055370d>] btrfs_sync_file+0x36f/0x3fc [btrfs]
[109749.351976]  [<ffffffff8118ec98>] vfs_fsync_range+0x8f/0x9e
[109749.360341]  [<ffffffff8118ecc3>] vfs_fsync+0x1c/0x1e
[109749.368828]  [<ffffffff8118ee1d>] do_fsync+0x34/0x4e
[109749.369790]  [<ffffffff8118f045>] SyS_fsync+0x10/0x14
[109749.370925]  [<ffffffff81465197>] system_call_fastpath+0x12/0x6f
[109749.382274] ---[ end trace 48e0d07f7c03d95a ]---

On a non-debug kernel this leads to invalid memory accesses, causing a
crash. Fix this by using list_splice_init() instead of list_splice() in
btrfs_commit_transaction() and btrfs_end_transaction().

Fixes: 50d9aa99bd35 ("Btrfs: make sure logged extents complete in the current transaction V3"
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 98f7bfe6a25a7b822efc7fcc72b7b494fefd438f)

Btrfs: fix race between caching kthread and returning inode to inode cache

commit ae9d8f17118551bedd797406a6768b87c2146234 upstream.

While the inode cache caching kthread is calling btrfs_unpin_free_ino(),
we could have a concurrent call to btrfs_return_ino() that adds a new
entry to the root's free space cache of pinned inodes. This concurrent
call does not acquire the fs_info->commit_root_sem before adding a new
entry if the caching state is BTRFS_CACHE_FINISHED, which is a problem
because the caching kthread calls btrfs_unpin_free_ino() after setting
the caching state to BTRFS_CACHE_FINISHED and therefore races with
the task calling btrfs_return_ino(), which is adding a new entry, while
the former (caching kthread) is navigating the cache's rbtree, removing
and freeing nodes from the cache's rbtree without acquiring the spinlock
that protects the rbtree.

This race resulted in memory corruption due to double free of struct
btrfs_free_space objects because both tasks can end up doing freeing the
same objects. Note that adding a new entry can result in merging it with
other entries in the cache, in which case those entries are freed.
This is particularly important as btrfs_free_space structures are also
used for the block group free space caches.

This memory corruption can be detected by a debugging kernel, which
reports it with the following trace:

[132408.501148] slab error in verify_redzone_free(): cache `btrfs_free_space': double free detected
[132408.505075] CPU: 15 PID: 12248 Comm: btrfs-ino-cache Tainted: G        W       4.1.0-rc5-btrfs-next-10+ #1
[132408.505075] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
[132408.505075]  ffff880023e7d320 ffff880163d73cd8 ffffffff8145eec7 ffffffff81095dce
[132408.505075]  ffff880009735d40 ffff880163d73ce8 ffffffff81154e1e ffff880163d73d68
[132408.505075]  ffffffff81155733 ffffffffa054a95a ffff8801b6099f00 ffffffffa0505b5f
[132408.505075] Call Trace:
[132408.505075]  [<ffffffff8145eec7>] dump_stack+0x4f/0x7b
[132408.505075]  [<ffffffff81095dce>] ? console_unlock+0x356/0x3a2
[132408.505075]  [<ffffffff81154e1e>] __slab_error.isra.28+0x25/0x36
[132408.505075]  [<ffffffff81155733>] __cache_free+0xe2/0x4b6
[132408.505075]  [<ffffffffa054a95a>] ? __btrfs_add_free_space+0x2f0/0x343 [btrfs]
[132408.505075]  [<ffffffffa0505b5f>] ? btrfs_unpin_free_ino+0x8e/0x99 [btrfs]
[132408.505075]  [<ffffffff810f3b30>] ? time_hardirqs_off+0x15/0x28
[132408.505075]  [<ffffffff81084d42>] ? trace_hardirqs_off+0xd/0xf
[132408.505075]  [<ffffffff811563a1>] ? kfree+0xb6/0x14e
[132408.505075]  [<ffffffff811563d0>] kfree+0xe5/0x14e
[132408.505075]  [<ffffffffa0505b5f>] btrfs_unpin_free_ino+0x8e/0x99 [btrfs]
[132408.505075]  [<ffffffffa0505e08>] caching_kthread+0x29e/0x2d9 [btrfs]
[132408.505075]  [<ffffffffa0505b6a>] ? btrfs_unpin_free_ino+0x99/0x99 [btrfs]
[132408.505075]  [<ffffffff8106698f>] kthread+0xef/0xf7
[132408.505075]  [<ffffffff810f3b08>] ? time_hardirqs_on+0x15/0x28
[132408.505075]  [<ffffffff810668a0>] ? __kthread_parkme+0xad/0xad
[132408.505075]  [<ffffffff814653d2>] ret_from_fork+0x42/0x70
[132408.505075]  [<ffffffff810668a0>] ? __kthread_parkme+0xad/0xad
[132408.505075] ffff880023e7d320: redzone 1:0x9f911029d74e35b, redzone 2:0x9f911029d74e35b.
[132409.501654] slab: double free detected in cache 'btrfs_free_space', objp ffff880023e7d320
[132409.503355] ------------[ cut here ]------------
[132409.504241] kernel BUG at mm/slab.c:2571!

Therefore fix this by having btrfs_unpin_free_ino() acquire the lock
that protects the rbtree while doing the searches and removing entries.

Fixes: 1c70d8fb4dfa ("Btrfs: fix inode caching vs tree log")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 9547e86be4af61fb535d1487d6a0a607bb67f392)

Btrfs: use kmem_cache_free when freeing entry in inode cache

commit c3f4a1685bb87e59c886ee68f7967eae07d4dffa upstream.

The free space entries are allocated using kmem_cache_zalloc(),
through __btrfs_add_free_space(), therefore we should use
kmem_cache_free() and not kfree() to avoid any confusion and
any potential problem. Looking at the kfree() definition at
mm/slab.c it has the following comment:

  /*
   * (...)
   *
   * Don't free memory not originally allocated by kmalloc()
   * or you will run into trouble.
   */

So better be safe and use kmem_cache_free().

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 6f953ad80fba78eed807fe61d96c4dde50708698)

Btrfs: don't invalidate root dentry when subvolume deletion fails

commit 64ad6c488975d7516230cf7849190a991fd615ae upstream.

Since commit bafc9b754f75 ("vfs: More precise tests in d_invalidate"),
mounted subvolumes can be deleted because d_invalidate() won't fail.
However, we run into problems when we attempt to delete the default
subvolume while it is mounted as the root filesystem:

# btrfs subvol list /
ID 257 gen 306 top level 5 path rootvol
ID 267 gen 334 top level 5 path snap1
# btrfs subvol get-default /
ID 267 gen 334 top level 5 path snap1
# btrfs inspect-internal rootid /
267
# mount -o subvol=/ /dev/vda1 /mnt
# btrfs subvol del /mnt/snap1
Delete subvolume (no-commit): '/mnt/snap1'
ERROR: cannot delete '/mnt/snap1' - Operation not permitted
# findmnt /
findmnt: can't read /proc/mounts: No such file or directory
# ls /proc
#

Markus reported that this same scenario simply led to a kernel oops.

This happens because in btrfs_ioctl_snap_destroy(), we call
d_invalidate() before we check may_destroy_subvol(), which means that we
detach the submounts and drop the dentry before erroring out. Instead,
we should only invalidate the dentry once the deletion has succeeded.
Additionally, the shrink_dcache_sb() isn't necessary; d_invalidate()
will prune the dcache for the deleted subvolume.

Fixes: bafc9b754f75 ("vfs: More precise tests in d_invalidate")
Reported-by: Markus Schauler <mschauler@gmail.com>
Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 54b1fb57886f3b2a7be247937efa26d6e305aaa4)

Btrfs: teach backref walking about backrefs with underflowed offset values

When cloning/deduplicating file extents (through the clone and extent_same
ioctls) we can get data back references with offset values that are a
result of an unsigned integer arithmetic underflow, that is, values that
are much larger then they could be otherwise.

This is not a problem when decrementing or dropping the back references
(happens when we overwrite the extents or punch a hole for example, through
__btrfs_drop_extents()), since we compute the same too large offset value,
but it is a problem for the backref walking code, used by an incremental
send and the ioctls that are used by the btrfs tool "inspect-internal"
commands, as it makes it miss the corresponding file extent items because
the search key is set for an extent item that starts at an offset matching
the exceptionally large offset value of the data back reference. For an
incremental send this causes the send ioctl to fail with -EIO.

So teach the backref walking code to deal with these cases by setting the
search key's offset to 0 if the backref's offset value is larger than
LLONG_MAX (the largest possible file offset). This makes sure the backref
walking code finds the corresponding file extent items at the expense of
scanning more items and leafs in the btree.

Fixing the clone/dedup ioctls to not produce such underflowed results would
require major changes breaking backward compatibility, updating user space
tools, etc.

Simple reproducer case for fstests:

  seq=`basename $0`
  seqres=$RESULT_DIR/$seq
  echo "QA output created by $seq"

  tmp=/tmp/$$
  status=1 # failure is the default!
  trap "_cleanup; exit \$status" 0 1 2 3 15

  _cleanup()
  {
      rm -fr $send_files_dir
      rm -f $tmp.*
  }

  # get standard environment, filters and checks
  . ./common/rc
  . ./common/filter

  # real QA test starts here
  _supported_fs btrfs
  _supported_os Linux
  _require_scratch
  _require_cloner
  _need_to_be_root

  send_files_dir=$TEST_DIR/btrfs-test-$seq

  rm -f $seqres.full
  rm -fr $send_files_dir
  mkdir $send_files_dir

  _scratch_mkfs >>$seqres.full 2>&1
  _scratch_mount

  # Create our test file with a single extent of 64K starting at file
  # offset 128K.
  $XFS_IO_PROG -f -c "pwrite -S 0xaa 128K 64K" $SCRATCH_MNT/foo \
      | _filter_xfs_io

  _run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT \
      $SCRATCH_MNT/mysnap1

  # Now clone parts of the original extent into lower offsets of the file.
  #
  # The first clone operation adds a file extent item to file offset 0
  # that points to our initial extent with a data offset of 16K. The
  # corresponding data back reference in the extent tree has an offset of
  # 18446744073709535232, which is the result of file_offset - data_offset
  # = 0 - 16K.
  #
  # The second clone operation adds a file extent item to file offset 16K
  # that points to our initial extent with a data offset of 48K. The
  # corresponding data back reference in the extent tree has an offset of
  # 18446744073709518848, which is the result of file_offset - data_offset
  # = 16K - 48K.
  #
  # Those large back reference offsets (result of unsigned arithmetic
  # underflow) confused the back reference walking code (used by an
  # incremental send and the multiple inspect-internal ioctls) and made it
  # miss the back references, which for the case of an incremental send it
  # made it fail with -EIO and print a message like the following to
  # dmesg:
  #
  # "BTRFS error (device sdc): did not find backref in send_root. \
  #  inode=257, offset=0, disk_byte=12845056 found extent=12845056"
  #
  $CLONER_PROG -s $(((128 + 16) * 1024)) -d 0 -l $((16 * 1024)) \
      $SCRATCH_MNT/foo $SCRATCH_MNT/foo
  $CLONER_PROG -s $(((128 + 48) * 1024)) -d $((16 * 1024)) \
      -l $((16 * 1024)) $SCRATCH_MNT/foo $SCRATCH_MNT/foo

  _run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT \
      $SCRATCH_MNT/mysnap2

  _run_btrfs_util_prog send $SCRATCH_MNT/mysnap1 -f $send_files_dir/1.snap
  _run_btrfs_util_prog send -p $SCRATCH_MNT/mysnap1 $SCRATCH_MNT/mysnap2 \
      -f $send_files_dir/2.snap

  echo "File digest in the original filesystem:"
  md5sum $SCRATCH_MNT/mysnap2/foo | _filter_scratch

  # Now recreate the filesystem by receiving both send streams and verify
  # we get the same file contents that the original filesystem had.
  _scratch_unmount
  _scratch_mkfs >>$seqres.full 2>&1
  _scratch_mount

  _run_btrfs_util_prog receive $SCRATCH_MNT -f $send_files_dir/1.snap
  _run_btrfs_util_prog receive $SCRATCH_MNT -f $send_files_dir/2.snap

  echo "File digest in the new filesystem:"
  md5sum $SCRATCH_MNT/mysnap2/foo | _filter_scratch

  status=0
  exit

The test's expected golden output is:

  wrote 65536/65536 bytes at offset 131072
  XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
  File digest in the original filesystem:
  6c6079335cff141b8a31233ead04cbff  SCRATCH_MNT/mysnap2/foo
  File digest in the new filesystem:
  6c6079335cff141b8a31233ead04cbff  SCRATCH_MNT/mysnap2/foo

But it failed with:

    (...)
    @@ -1,7 +1,5 @@
     QA output created by 097
     wrote 65536/65536 bytes at offset 131072
     XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
    -File digest in the original filesystem:
    -6c6079335cff141b8a31233ead04cbff  SCRATCH_MNT/mysnap2/foo
    -File digest in the new filesystem:
    -6c6079335cff141b8a31233ead04cbff  SCRATCH_MNT/mysnap2/foo
    ...

  $ cat /home/fdmanana/git/hub/xfstests/results//btrfs/097.full
  (...)
  ERROR: send ioctl failed with -5: Input/output error

Signed-off-by: Filipe Manana <fdmanana@suse.com>
(cherry picked from commit 877043f3b0f3260f9b258f21acdf450067e7d3af)

Revert "Btrfs: fix list transaction->pending_ordered corruption"

This reverts commit ebcbf46d4e8ea5f30b9341e5c02f3e2a62234ae0.

Revert "Btrfs: fix memory corruption on failure to submit bio for direct IO"

This reverts commit 64572b42fd4da2c0463b509371eced787f8d5c74.

Revert "Btrfs: fix wrong check for btrfs_force_chunk_alloc()"

This reverts commit 11bc5538713f03d22d1d93dff4ba233bd1b79f57.

Revert "Btrfs: fix warning of bytes_may_use"

This reverts commit b73743a7880ab626c84a164c77b4ca128aae410b.

Revert "Btrfs: fix hang when failing to submit bio of directIO"

This reverts commit af56227eef4e2c37f6a87118fba2a5903c1b4770.

Revert "Btrfs: fix crash on close_ctree() if cleaner starts new transaction"

This reverts commit 3fc4349f4fd64ce4c96a335e62442b2891f21fb6.

Revert "Btrfs: use kmem_cache_free when freeing entry in inode cache"

This reverts commit 5d7d4f8aa0e9924aafbf076a1d358bdfbd7a7df0.

Revert "Btrfs: fix list transaction->pending_ordered corruption"

This reverts commit ad0ea7ba9787e0637ce2ecf44dd5dfffaee4ff4d.

Revert "Btrfs: fix stale directory entries after fsync log replay"

This reverts commit 9813e8077c31f31f21d76d4f29377e9e27ba2f53.

Revert "Btrfs: add missing inode item update in fallocate()"

This reverts commit e2bd43b59b269f8dd15358bf9a3c6f5954bf0cbf.

Revert "Btrfs: fix race between caching kthread and returning inode to inode cache"

This reverts commit 0e89186e9d5171622bba781c01e6b234b3ba4a81.

zygo: unGPL pci_ignore_hotplug

Revert "Revert "Btrfs: fix list transaction->pending_ordered corruption""

This reverts commit ee84f696e37b04843e3740c8c77d342f0f398f4c.

Merge tag 'v4.0.9' into zygo-4.0.9-zb64

This is the 4.0.9 stable release

# gpg: Signature made Tue Jul 21 13:11:02 2015 EDT using RSA key ID 6092693E
# gpg: Good signature from "Greg Kroah-Hartman (Linux kernel stable release signing key) <greg@kroah.com>"
# gpg: WARNING: This key is not certified with a trusted signature!
# gpg: There is no indication that the signature belongs to the owner.
# Primary key fingerprint: 647F 2865 4894 E3BD 4571 99BE 38DB BDC8 6092 693E

Linux 4.0.9

Input: pixcir_i2c_ts - fix receive error

commit 469d7d22cea146e40efe8c330e5164b4d8f13934 upstream.

The i2c_master_recv() uses readsize to receive data from i2c but compares
to size of rdbuf which is always 27. This would cause problem when the
max_fingers is not 5. Change the comparison value to readsize instead.

Fixes: 36874c7e219 ("Input: pixcir_i2c_ts - support up to 5 fingers and
hardware tracking IDs:)

Signed-off-by: Frodo Lai <frodo_lai@bcmcom.com>
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

of/pci: Fix pci_address_to_pio() conversion of CPU address to I/O port

commit 5dbb4c6167229c8d4f528e8ec26699a7305000a3 upstream.

41f8bba7f555 ("of/pci: Add pci_register_io_range() and
pci_pio_to_address()") added support for systems with several I/O ranges
described by OF bindings. It modified pci_address_to_pio() look up the
io_range for a given CPU physical address, but the conversion was wrong.

Fix the conversion of address to I/O port.

[bhelgaas: changelog]
Fixes: 41f8bba7f555 ("of/pci: Add pci_register_io_range() and pci_pio_to_address()")
Signed-off-by: Zhichang Yuan <yuanzhichang@hisilicon.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Liviu Dudau <Liviu.Dudau@arm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

PCI: pciehp: Wait for hotplug command completion where necessary

commit a5dd4b4b0570b3bf880d563969b245dfbd170c1e upstream.

The commit referenced below deferred waiting for command completion until
the start of the next command, allowing hardware to do the latching
asynchronously.  Unfortunately, being ready to accept a new command is the
only indication we have that the previous command is completed.  In cases
where we need that state change to be enabled, we must still wait for
completion.  For instance, pciehp_reset_slot() attempts to disable anything
that might generate a surprise hotplug on slots that support presence
detection.  If we don't wait for those settings to latch before the
secondary bus reset, we negate any value in attempting to prevent the
spurious hotplug.

Create a base function with optional wait and helper functions so that
pcie_write_cmd() turns back into the "safe" interface which waits before
and after issuing a command and add pcie_write_cmd_nowait(), which
eliminates the trailing wait for asynchronous completion.  The following
functions are returned to their previous behavior:

  pciehp_power_on_slot
  pciehp_power_off_slot
  pcie_disable_notification
  pciehp_reset_slot

The rationale is that pciehp_power_on_slot() enables the link and therefore
relies on completion of power-on.  pciehp_power_off_slot() and
pcie_disable_notification() need a wait because data structures may be
freed after these calls and continued signaling from the device would be
unexpected.  And, of course, pciehp_reset_slot() needs to wait for the
scenario outlined above.

Fixes: 3461a068661c ("PCI: pciehp: Wait for hotplug command completion lazily")
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

PCI: Add pci_bus_addr_t

commit 3a9ad0b4fdcd57f775d3615004c8c64c021a9e7d upstream.

David Ahern reported that d63e2e1f3df9 ("sparc/PCI: Clip bridge windows
to fit in upstream windows") fails to boot on sparc/T5-8:

  pci 0000:06:00.0: reg 0x184: can't handle BAR above 4GB (bus address 0x110204000)

The problem is that sparc64 assumed that dma_addr_t only needed to hold DMA
addresses, i.e., bus addresses returned via the DMA API (dma_map_single(),
etc.), while the PCI core assumed dma_addr_t could hold *any* bus address,
including raw BAR values.  On sparc64, all DMA addresses fit in 32 bits, so
dma_addr_t is a 32-bit type.  However, BAR values can be 64 bits wide, so
they don't fit in a dma_addr_t.  d63e2e1f3df9 added new checking that
tripped over this mismatch.

Add pci_bus_addr_t, which is wide enough to hold any PCI bus address,
including both raw BAR values and DMA addresses.  This will be 64 bits
on 64-bit platforms and on platforms with a 64-bit dma_addr_t.  Then
dma_addr_t only needs to be wide enough to hold addresses from the DMA API.

[bhelgaas: changelog, bugzilla, Kconfig to ensure pci_bus_addr_t is at
least as wide as dma_addr_t, documentation]
Fixes: d63e2e1f3df9 ("sparc/PCI: Clip bridge windows to fit in upstream windows")
Fixes: 23b13bc76f35 ("PCI: Fail safely if we can't handle BARs larger than 4GB")
Link: http://lkml.kernel.org/r/CAE9FiQU1gJY1LYrxs+ma5LCTEEe4xmtjRG0aXJ9K_Tsu+m9Wuw@mail.gmail.com
Link: http://lkml.kernel.org/r/1427857069-6789-1-git-send-email-yinghai@kernel.org
Link: https://bugzilla.kernel.org/show_bug.cgi?id=96231
Reported-by: David Ahern <david.ahern@oracle.com>
Tested-by: David Ahern <david.ahern@oracle.com>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

PCI: Propagate the "ignore hotplug" setting to parent

commit 0824965140fff1bf640a987dc790d1594a8e0699 upstream.

Refine the mechanism introduced by commit f244d8b623da ("ACPIPHP / radeon /
nouveau: Fix VGA switcheroo problem related to hotplug") to propagate the
ignore_hotplug setting of the device to its parent bridge in case hotplug
notifications related to the graphics adapter switching are given for the
bridge rather than for the device itself (they need to be ignored in both
cases).

Link: https://bugzilla.kernel.org/show_bug.cgi?id=61891
Link: https://bugs.freedesktop.org/show_bug.cgi?id=88927
Fixes: b440bde74f04 ("PCI: Add pci_ignore_hotplug() to ignore hotplug events for a device")
Reported-and-tested-by: tiagdtd-lava <tiagdtd-lava@yahoo.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

mtd: dc21285: use raw spinlock functions for nw_gpio_lock

commit e5babdf928e5d0c432a8d4b99f20421ce14d1ab6 upstream.

Since commit bd31b85960a7 (which is in 3.2-rc1) nw_gpio_lock is a raw spinlock
that needs usage of the corresponding raw functions.

This fixes:

  drivers/mtd/maps/dc21285.c: In function 'nw_en_write':
  drivers/mtd/maps/dc21285.c:41:340: warning: passing argument 1 of 'spinlock_check' from incompatible pointer type
    spin_lock_irqsave(&nw_gpio_lock, flags);

  In file included from include/linux/seqlock.h:35:0,
                   from include/linux/time.h:5,
                   from include/linux/stat.h:18,
                   from include/linux/module.h:10,
                   from drivers/mtd/maps/dc21285.c:8:
  include/linux/spinlock.h:299:102: note: expected 'struct spinlock_t *' but argument is of type 'struct raw_spinlock_t *'
   static inline raw_spinlock_t *spinlock_check(spinlock_t *lock)
                                                                                                        ^
  drivers/mtd/maps/dc21285.c:43:25: warning: passing argument 1 of 'spin_unlock_irqrestore' from incompatible pointer type
    spin_unlock_irqrestore(&nw_gpio_lock, flags);
                           ^
  In file included from include/linux/seqlock.h:35:0,
                   from include/linux/time.h:5,
                   from include/linux/stat.h:18,
                   from include/linux/module.h:10,
                   from drivers/mtd/maps/dc21285.c:8:
  include/linux/spinlock.h:370:91: note: expected 'struct spinlock_t *' but argument is of type 'struct raw_spinlock_t *'
   static inline void spin_unlock_irqrestore(spinlock_t *lock, unsigned long flags)

Fixes: bd31b85960a7 ("locking, ARM: Annotate low level hw locks as raw")
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: Brian Norris <computersforpeace@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

mtd: fix: avoid race condition when accessing mtd->usecount

commit 073db4a51ee43ccb827f54a4261c0583b028d5ab upstream.

On A MIPS 32-cores machine a BUG_ON was triggered because some acesses to
mtd->usecount were done without taking mtd_table_mutex.
kernel: Call Trace:
kernel: [<ffffffff80401818>] __put_mtd_device+0x20/0x50
kernel: [<ffffffff804086f4>] blktrans_release+0x8c/0xd8
kernel: [<ffffffff802577e0>] __blkdev_put+0x1a8/0x200
kernel: [<ffffffff802579a4>] blkdev_close+0x1c/0x30
kernel: [<ffffffff8022006c>] __fput+0xac/0x250
kernel: [<ffffffff80171208>] task_work_run+0xd8/0x120
kernel: [<ffffffff8012c23c>] work_notifysig+0x10/0x18
kernel:
kernel:
Code: 2442ffff ac8202d8 000217fe <00020336> dc820128 10400003
00000000 0040f809 00000000
kernel: ---[ end trace 080fbb4579b47a73 ]---

Fixed by taking the mutex in blktrans_open and blktrans_release.

Note that this locking is already suggested in
include/linux/mtd/blktrans.h:

struct mtd_blktrans_ops {
...
/* Called with mtd_table_mutex held; no race with add/remove */
int (*open)(struct mtd_blktrans_dev *dev);
void (*release)(struct mtd_blktrans_dev *dev);
...
};

But we weren't following it.

Originally reported by (and patched by) Zhang and Giuseppe,
independently. Improved and rewritten.

Reported-by: Zhang Xingcai <zhangxingcai@huawei.com>
Reported-by: Giuseppe Cantavenera <giuseppe.cantavenera.ext@nokia.com>
Tested-by: Giuseppe Cantavenera <giuseppe.cantavenera.ext@nokia.com>
Acked-by: Alexander Sverdlin <alexander.sverdlin@nokia.com>
Signed-off-by: Brian Norris <computersforpeace@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

leds / PM: fix hibernation on arm when gpio-led used with CPU led trigger

commit 084609bf727981c7a2e6e69aefe0052c9d793300 upstream.

Setting a dev_pm_ops suspend/resume pair of callbacks but not a set of
hibernation callbacks means those pm functions will not be
called upon hibernation - that leads to system crash on ARM during
freezing if gpio-led is used in combination with CPU led trigger.
It may happen after freeze_noirq stage (GPIO is suspended)
and before syscore_suspend stage (CPU led trigger is suspended)
- usually when disable_nonboot_cpus() is called.

Log:
  PM: noirq freeze of devices complete after 1.425 msecs
  Disabling non-boot CPUs ...
    ^ system may crash or stuck here with message (TI AM572x)

  WARNING: CPU: 0 PID: 3100 at drivers/bus/omap_l3_noc.c:148 l3_interrupt_handler+0x22c/0x370()
  44000000.ocp:L3 Custom Error: MASTER MPU TARGET L4_PER1_P3 (Idle): Data Access in Supervisor mode during Functional access

  CPU1: shutdown
    ^ or here

Fix this by using SIMPLE_DEV_PM_OPS, which appropriately
assigns the suspend and hibernation callbacks and move
led_suspend/led_resume under CONFIG_PM_SLEEP to avoid
build warnings.

Fixes: 73e1ab41a80d (leds: Convert led class driver from legacy pm ops to dev_pm_ops)
Signed-off-by: Grygorii Strashko <Grygorii.Strashko@linaro.org>
Acked-by: Jacek Anaszewski <j.anaszewski@samsung.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

video: mxsfb: Make sure axi clock is enabled when accessing registers

commit 2fa3b4c4a78a5db3502ab9e32630ea660ff923d0 upstream.

The LCDIF engines embedded in i.MX6sl and i.MX6sx SoCs need the axi clock
as the engine's system clock.  The clock should be enabled when accessing
LCDIF registers, otherwise the kernel would hang up.  We should also keep
the clock enabled when the engine is being active to scan out frames from
memory.  This patch makes sure the axi clock is enabled when accessing
registers so that the kernel hang up issue can be fixed.

Reported-by: Peter Chen <peter.chen@freescale.com>
Tested-by: Peter Chen <peter.chen@freescale.com>
Signed-off-by: Liu Ying <Ying.Liu@freescale.com>
Signed-off-by: Tomi Valkeinen <tomi.valkeinen@ti.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

genirq: devres: Fix testing return value of request_any_context_irq()

commit 63781394c540dd9e666a6b21d70b64dd52bce76e upstream.

request_any_context_irq() returns a negative value on failure.
It returns either IRQC_IS_HARDIRQ or IRQC_IS_NESTED on success.
So fix testing return value of request_any_context_irq().

Also fixup the return value of devm_request_any_context_irq() to make it
consistent with request_any_context_irq().

Fixes: 0668d3065128 ("genirq: Add devm_request_any_context_irq()")
Signed-off-by: Axel Lin <axel.lin@ingics.com>
Reviewed-by: Stephen Boyd <sboyd@codeaurora.org>
Link: http://lkml.kernel.org/r/1431334978.17783.4.camel@ingics.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

IB/srp: Fix reconnection failure handling

commit a44074f14ba1ea0747ea737026eb929b81993dc3 upstream.

Although it is possible to let SRP I/O continue if a reconnect
results in a reduction of the number of channels, the current
code does not handle this scenario correctly. Instead of making
the reconnect code more complex, consider this as a reconnection
failure.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Sagi Grimberg <sagig@mellanox.com>
Cc: Sebastian Parschauer <sebastian.riemer@profitbricks.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

IB/srp: Fix connection state tracking

commit c014c8cd31b161e12deb81c0f7f477811bd1eddc upstream.

Reception of a DREQ message only causes the state of a single
channel to change. Hence move the 'connected' member variable
from the target to the channel data structure. This patch
avoids that following false positive warning can be reported
by srp_destroy_qp():

WARNING: at drivers/infiniband/ulp/srp/ib_srp.c:617 srp_destroy_qp+0xa6/0x120 [ib_srp]()
Call Trace:
[<ffffffff8106e10f>] warn_slowpath_common+0x7f/0xc0
[<ffffffff8106e16a>] warn_slowpath_null+0x1a/0x20
[<ffffffffa0440226>] srp_destroy_qp+0xa6/0x120 [ib_srp]
[<ffffffffa0440322>] srp_free_ch_ib+0x82/0x1e0 [ib_srp]
[<ffffffffa044408b>] srp_create_target+0x7ab/0x998 [ib_srp]
[<ffffffff81346f60>] dev_attr_store+0x20/0x30
[<ffffffff811dd90f>] sysfs_write_file+0xef/0x170
[<ffffffff8116d248>] vfs_write+0xc8/0x190
[<ffffffff8116d411>] sys_write+0x51/0x90

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Sagi Grimberg <sagig@mellanox.com>
Cc: Sebastian Parschauer <sebastian.riemer@profitbricks.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

IB/srp: Fix a connection setup race

commit 8de9fe3a1d4ac8c3e4953fa4b7d81f863f5196ad upstream.

Avoid that receiving a DREQ while RDMA channels are being
established causes target->qp_in_error to be reset.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Sagi Grimberg <sagig@mellanox.com>
Cc: Sebastian Parschauer <sebastian.riemer@profitbricks.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

IB/srp: Remove an extraneous scsi_host_put() from an error path

commit fb49c8bbaae70b14fea2b4590a90a21539f88526 upstream.

Fix a scsi_get_host() / scsi_host_put() imbalance in the error
path of srp_create_target(). See also patch "IB/srp: Avoid that
I/O hangs due to a cable pull during LUN scanning" (commit ID
34aa654ecb8e).

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Cc: Sebastian Parschauer <sebastian.riemer@profitbricks.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

scsi_transport_srp: Fix a race condition

commit 535fb906225fb7436cb658144d0c0cea14a26f3e upstream.

Avoid that srp_terminate_io() can get invoked while srp_queuecommand()
is in progress. This patch avoids that an I/O timeout can trigger the
following kernel warning:

WARNING: at drivers/infiniband/ulp/srp/ib_srp.c:1447 srp_terminate_io+0xef/0x100 [ib_srp]()
Call Trace:
[<ffffffff814c65a2>] dump_stack+0x4e/0x68
[<ffffffff81051f71>] warn_slowpath_common+0x81/0xa0
[<ffffffff8105204a>] warn_slowpath_null+0x1a/0x20
[<ffffffffa075f51f>] srp_terminate_io+0xef/0x100 [ib_srp]
[<ffffffffa07495da>] __rport_fail_io_fast+0xba/0xc0 [scsi_transport_srp]
[<ffffffffa0749a90>] rport_fast_io_fail_timedout+0xe0/0xf0 [scsi_transport_srp]
[<ffffffff8106e09b>] process_one_work+0x1db/0x780
[<ffffffff8106e75b>] worker_thread+0x11b/0x450
[<ffffffff81073c64>] kthread+0xe4/0x100
[<ffffffff814cf26c>] ret_from_fork+0x7c/0xb0

See also patch "scsi_transport_srp: Add transport layer error
handling" (commit ID 29c17324803c).

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: James Bottomley <JBottomley@Odin.com>
Cc: Sagi Grimberg <sagig@mellanox.com>
Cc: Sebastian Parschauer <sebastian.riemer@profitbricks.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

scsi_transport_srp: Introduce srp_wait_for_queuecommand()

commit be34c62ddf39d1931780b07a6f4241393e4ba2ee upstream.

Introduce the helper function srp_wait_for_queuecommand().
Move the definition of scsi_request_fn_active(). Add a comment
above srp_wait_for_queuecommand() that support for scsi-mq needs
to be added.

This patch does not change any functionality. A second call to
srp_wait_for_queuecommand() will be introduced in the next patch.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: James Bottomley <JBottomley@Odin.com>
Cc: Sagi Grimberg <sagig@mellanox.com>
Cc: Sebastian Parschauer <sebastian.riemer@profitbricks.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

spi: pl022: Specify 'num-cs' property as required in devicetree binding

commit ea6055c46eda1e19e02209814955e13f334bbe1b upstream.

Since commit 39a6ac11df65 ("spi/pl022: Devicetree support w/o platform data")
the 'num-cs' parameter cannot be passed through platform data when probing
with devicetree. Instead, it's a required devicetree property.

Fix the binding documentation so the property is properly specified.

Fixes: 39a6ac11df65 ("spi/pl022: Devicetree support w/o platform data")
Signed-off-by: Ezequiel Garcia <ezequiel@vanguardiasur.com.ar>
Signed-off-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

spi: orion: Fix maximum baud rates for Armada 370/XP

commit ce2f6ea1cbd41d78224f703af980a6ceeb0eb56a upstream.

The commit df59fa7f4bca "spi: orion: support armada extended baud
rates" was too optimistic for the maximum baud rate that the Armada
SoCs can support. According to the hardware datasheet the maximum
frequency supported by the Armada 370 SoC is tclk/4. But for the
Armada XP, Armada 38x and Armada 39x SoCs the limitation is 50MHz and
for the Armada 375 it is tclk/15.

Currently the armada-370-spi compatible is only used by the Armada 370
and the Armada XP device tree. On Armada 370, tclk cannot be higher
than 200MHz. In order to be able to handle both SoCs, we can take the
minimum of 50MHz and tclk/4.

A proper solution is adding a compatible string for each SoC, but it
can't be done as a fix for compatibility reason (we can't modify
device tree that have been already released) and it will be part of a
separate patch.

Fixes: df59fa7f4bca (spi: orion: support armada extended baud rates)
Reported-by: Kostya Porotchkin <kostap@marvell.com>
Signed-off-by: Gregory CLEMENT <gregory.clement@free-electrons.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

spi: fix race freeing dummy_tx/rx before it is unmapped

commit 8e76ef88f607174082023f50b87fe12dcdbe5db5 upstream.

Fix a race (with some kernel configurations) where a queued
master->pump_messages runs and frees dummy_tx/rx before
spi_unmap_msg is running (or is finished).

This results in the following messages:
  BUG: Bad page state in process
  page:db7ba030 count:0 mapcount:0 mapping:  (null) index:0x0
  flags: 0x200(arch_1)
  page dumped because: PAGE_FLAGS_CHECK_AT_PREP flag set
  ...

Reported-by: Noralf Trønnes <noralf@tronnes.org>
Suggested-by: Noralf Trønnes <noralf@tronnes.org>
Tested-by: Noralf Trønnes <noralf@tronnes.org>
Signed-off-by: Martin Sperl <kernel@martin.sperl.org>
Signed-off-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

livepatch: add module locking around kallsyms calls

commit 9a1bd63cdae4b623494c4ebaf723a91c35ec49fb upstream.

The list of loaded modules is walked through in
module_kallsyms_on_each_symbol (called by kallsyms_on_each_symbol). The
module_mutex lock should be acquired to prevent potential corruptions
in the list.

This was uncovered with new lockdep asserts in module code introduced by
the commit 0be964be0d45 ("module: Sanitize RCU usage and locking") in
recent next- trees.

Signed-off-by: Miroslav Benes <mbenes@suse.cz>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

regulator: core: fix constraints output buffer

commit a7068e3932eee8268c4ce4e080a338ee7b8a27bf upstream.

The buffer for condtraints debug isn't big enough to hold the output
in all cases. So fix this issue by increasing the buffer.

Signed-off-by: Stefan Wahren <stefan.wahren@i2se.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

regulator: max77686: fix gpio_enabled shift wrapping bug

commit c53403a37cf083ce85da720f18918f73580d0064 upstream.

The code should handle more than 32 bits here because "id"
can be a value up to MAX77686_REGULATORS (currently 34).

Convert the gpio_enabled type to DECLARE_BITMAP and use
test_bit/set_bit.

Fixes: 3307e9025d29 ("regulator: max77686: Add GPIO control")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Joe Perches <joe@perches.com>
Reviewed-by: Krzysztof Kozlowski <k.kozlowski@samsung.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

regmap: Fix possible shift overflow in regmap_field_init()

commit 921cc29473a0d7c109105c1876ddb432f4a4be7d upstream.

The way the mask is generated in regmap_field_init() is wrong.
Indeed, a field initialized with msb = 31 and lsb = 0 provokes a shift
overflow while calculating the mask field.

On some 32 bits architectures, such as x86, the generated mask is 0,
instead of the expected 0xffffffff.

This patch uses GENMASK() to fix the problem, as this macro is already safe
regarding shift overflow.

Signed-off-by: Maxime Coquelin <maxime.coquelin@st.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

regmap: Fix regmap_bulk_read in BE mode

commit 15b8d2c41fe5839582029f65c5f7004db451cc2b upstream.

In big endian mode regmap_bulk_read gives incorrect data
for byte reads.

This is because memcpy of a single byte from an address
after full word read gives different results when
endianness differs. ie. we get little-end in LE and big-end in BE.

Signed-off-by: Arun Chandran <achandran@mvista.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

iser-target: release stale iser connections

commit 2f1b6b7d9a815f341b18dfd26a363f37d4d3c96a upstream.

When receiving a new iser connect request we serialize
the pending requests by adding the newly created iser connection
to the np accept list and let the login thread process the connect
request one by one (np_accept_wait).

In case we received a disconnect request before the iser_conn
has begun processing (still linked in np_accept_list) we should
detach it from the list and clean it up and not have the login
thread process a stale connection. We do it only when the connection
state is not already terminating (initiator driven disconnect) as
this might lead us to access np_accept_mutex after the np was released
in live shutdown scenarios.

Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Jenny Falkovich <jennyf@mellanox.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

mm, thp: respect MPOL_PREFERRED policy with non-local node

commit 0867a57c4f80a566dda1bac975b42fcd857cb489 upstream.

Since commit 077fcf116c8c ("mm/thp: allocate transparent hugepages on
local node"), we handle THP allocations on page fault in a special way -
for non-interleave memory policies, the allocation is only attempted on
the node local to the current CPU, if the policy's nodemask allows the
node.

This is motivated by the assumption that THP benefits cannot offset the
cost of remote accesses, so it's better to fallback to base pages on the
local node (which might still be available, while huge pages are not due
to fragmentation) than to allocate huge pages on a remote node.

The nodemask check prevents us from violating e.g.  MPOL_BIND policies
where the local node is not among the allowed nodes.  However, the
current implementation can still give surprising results for the
MPOL_PREFERRED policy when the preferred node is different than the
current CPU's local node.

In such case we should honor the preferred node and not use the local
node, which is what this patch does.  If hugepage allocation on the
preferred node fails, we fall back to base pages and don't try other
nodes, with the same motivation as is done for the local node hugepage
allocations.  The patch also moves the MPOL_INTERLEAVE check around to
simplify the hugepage specific test.

The difference can be demonstrated using in-tree transhuge-stress test
on the following 2-node machine where half memory on one node was
occupied to show the difference.

> numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 24 25 26 27 28 29 30 31 32 33 34 35
node 0 size: 7878 MB
node 0 free: 3623 MB
node 1 cpus: 12 13 14 15 16 17 18 19 20 21 22 23 36 37 38 39 40 41 42 43 44 45 46 47
node 1 size: 8045 MB
node 1 free: 7818 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10

Before the patch:
> numactl -p0 -C0 ./transhuge-stress
transhuge-stress: 2.197 s/loop, 0.276 ms/page,   7249.168 MiB/s 7962 succeed,    0 failed, 1786 different pages

> numactl -p0 -C12 ./transhuge-stress
transhuge-stress: 2.962 s/loop, 0.372 ms/page,   5376.172 MiB/s 7962 succeed,    0 failed, 3873 different pages

Number of successful THP allocations corresponds to free memory on node 0 in
the first case and node 1 in the second case, i.e. -p parameter is ignored and
cpu binding "wins".

After the patch:
> numactl -p0 -C0 ./transhuge-stress
transhuge-stress: 2.183 s/loop, 0.274 ms/page,   7295.516 MiB/s 7962 succeed,    0 failed, 1760 different pages

> numactl -p0 -C12 ./transhuge-stress
transhuge-stress: 2.878 s/loop, 0.361 ms/page,   5533.638 MiB/s 7962 succeed,    0 failed, 1750 different pages

> numactl -p1 -C0 ./transhuge-stress
transhuge-stress: 4.628 s/loop, 0.581 ms/page,   3440.893 MiB/s 7962 succeed,    0 failed, 3918 different pages

The -p parameter is respected regardless of cpu binding.

> numactl -C0 ./transhuge-stress
transhuge-stress: 2.202 s/loop, 0.277 ms/page,   7230.003 MiB/s 7962 succeed,    0 failed, 1750 different pages

> numactl -C12 ./transhuge-stress
transhuge-stress: 3.020 s/loop, 0.379 ms/page,   5273.324 MiB/s 7962 succeed,    0 failed, 3916 different pages

Without -p parameter, hugepage restriction to CPU-local node works as before.

Fixes: 077fcf116c8c ("mm/thp: allocate transparent hugepages on local node")
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

mm: kmemleak_alloc_percpu() should follow the gfp from per_alloc()

commit 8a8c35fadfaf55629a37ef1a8ead1b8fb32581d2 upstream.

Beginning at commit d52d3997f843 ("ipv6: Create percpu rt6_info"), the
following INFO splat is logged:

  ===============================
  [ INFO: suspicious RCU usage. ]
  4.1.0-rc7-next-20150612 #1 Not tainted
  -------------------------------
  kernel/sched/core.c:7318 Illegal context switch in RCU-bh read-side critical section!
  other info that might help us debug this:
  rcu_scheduler_active = 1, debug_locks = 0
   3 locks held by systemd/1:
   #0:  (rtnl_mutex){+.+.+.}, at: [<ffffffff815f0c8f>] rtnetlink_rcv+0x1f/0x40
   #1:  (rcu_read_lock_bh){......}, at: [<ffffffff816a34e2>] ipv6_add_addr+0x62/0x540
   #2:  (addrconf_hash_lock){+...+.}, at: [<ffffffff816a3604>] ipv6_add_addr+0x184/0x540
  stack backtrace:
  CPU: 0 PID: 1 Comm: systemd Not tainted 4.1.0-rc7-next-20150612 #1
  Hardware name: TOSHIBA TECRA A50-A/TECRA A50-A, BIOS Version 4.20   04/17/2014
  Call Trace:
    dump_stack+0x4c/0x6e
    lockdep_rcu_suspicious+0xe7/0x120
    ___might_sleep+0x1d5/0x1f0
    __might_sleep+0x4d/0x90
    kmem_cache_alloc+0x47/0x250
    create_object+0x39/0x2e0
    kmemleak_alloc_percpu+0x61/0xe0
    pcpu_alloc+0x370/0x630

Additional backtrace lines are truncated.  In addition, the above splat
is followed by several "BUG: sleeping function called from invalid
context at mm/slub.c:1268" outputs.  As suggested by Martin KaFai Lau,
these are the clue to the fix.  Routine kmemleak_alloc_percpu() always
uses GFP_KERNEL for its allocations, whereas it should follow the gfp
from its callers.

Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Larry Finger <Larry.Finger@lwfinger.net>
Cc: Martin KaFai Lau <kafai@fb.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

mm: kmemleak: allow safe memory scanning during kmemleak disabling

commit c5f3b1a51a591c18c8b33983908e7fdda6ae417e upstream.

The kmemleak scanning thread can run for minutes.  Callbacks like
kmemleak_free() are allowed during this time, the race being taken care
of by the object->lock spinlock.  Such lock also prevents a memory block
from being freed or unmapped while it is being scanned by blocking the
kmemleak_free() -> ...  -> __delete_object() function until the lock is
released in scan_object().

When a kmemleak error occurs (e.g.  it fails to allocate its metadata),
kmemleak_enabled is set and __delete_object() is no longer called on
freed objects.  If kmemleak_scan is running at the same time,
kmemleak_free() no longer waits for the object scanning to complete,
allowing the corresponding memory block to be freed or unmapped (in the
case of vfree()).  This leads to kmemleak_scan potentially triggering a
page fault.

This patch separates the kmemleak_free() enabling/disabling from the
overall kmemleak_enabled nob so that we can defer the disabling of the
object freeing tracking until the scanning thread completed.  The
kmemleak_free_part() is deliberately ignored by this patch since this is
only called during boot before the scanning thread started.

Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Reported-by: Vignesh Radhakrishnan <vigneshr@codeaurora.org>
Tested-by: Vignesh Radhakrishnan <vigneshr@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

arm64: vdso: work-around broken ELF toolchains in Makefile

commit 6f1a6ae87c0c60d7c462ef8fd071f291aa7a9abb upstream.

When building the kernel with a bare-metal (ELF) toolchain, the -shared
option may not be passed down to collect2, resulting in silent corruption
of the vDSO image (in particular, the DYNAMIC section is omitted).

The effect of this corruption is that the dynamic linker fails to find
the vDSO symbols and libc is instead used for the syscalls that we
intended to optimise (e.g. gettimeofday). Functionally, there is no
issue as the sigreturn trampoline is still intact and located by the
kernel.

This patch fixes the problem by explicitly passing -shared to the linker
when building the vDSO.

Reported-by: Szabolcs Nagy <Szabolcs.Nagy@arm.com>
Reported-by: James Greenlaigh <james.greenhalgh@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

arm64: mm: Fix freeing of the wrong memmap entries with !SPARSEMEM_VMEMMAP

commit b9bcc919931611498e856eae9bf66337330d04cc upstream.

The memmap freeing code in free_unused_memmap() computes the end of
each memblock by adding the memblock size onto the base. However,
if SPARSEMEM is enabled then the value (start) used for the base
may already have been rounded downwards to work out which memmap
entries to free after the previous memblock.

This may cause memmap entries that are in use to get freed.

In general, you're not likely to hit this problem unless there
are at least 2 memblocks and one of them is not aligned to a
sparsemem section boundary. Note that carve-outs can increase
the number of memblocks by splitting the regions listed in the
device tree.

This problem doesn't occur with SPARSEMEM_VMEMMAP, because the
vmemmap code deals with freeing the unused regions of the memmap
instead of requiring the arch code to do it.

This patch gets the memblock base out of the memblock directly when
computing the block end address to ensure the correct value is used.

Signed-off-by: Dave Martin <Dave.Martin@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

arm64: entry: fix context tracking for el0_sp_pc

commit 46b0567c851cf85d6ba6f23eef385ec9111d09bc upstream.

Commit 6c81fe7925cc4c42 ("arm64: enable context tracking") did not
update el0_sp_pc to use ct_user_exit, but this appears to have been
unintentional. In commit 6ab6463aeb5fbc75 ("arm64: adjust el0_sync so
that a function can be called") we made x0 available, and in the return
to userspace we call ct_user_enter in the kernel_exit macro.

Due to this, we currently don't correctly inform RCU of the user->kernel
transition, and may erroneously account for time spent in the kernel as
if we were in an extended quiescent state when CONFIG_CONTEXT_TRACKING
is enabled.

As we do record the kernel->user transition, a userspace application
making accesses from an unaligned stack pointer can demonstrate the
imbalance, provoking the following warning:

------------[ cut here ]------------
WARNING: CPU: 2 PID: 3660 at kernel/context_tracking.c:75 context_tracking_enter+0xd8/0xe4()
Modules linked in:
CPU: 2 PID: 3660 Comm: a.out Not tainted 4.1.0-rc7+ #8
Hardware name: ARM Juno development board (r0) (DT)
Call trace:
[<ffffffc000089914>] dump_backtrace+0x0/0x124
[<ffffffc000089a48>] show_stack+0x10/0x1c
[<ffffffc0005b3cbc>] dump_stack+0x84/0xc8
[<ffffffc0000b3214>] warn_slowpath_common+0x98/0xd0
[<ffffffc0000b330c>] warn_slowpath_null+0x14/0x20
[<ffffffc00013ada4>] context_tracking_enter+0xd4/0xe4
[<ffffffc0005b534c>] preempt_schedule_irq+0xd4/0x114
[<ffffffc00008561c>] el1_preempt+0x4/0x28
[<ffffffc0001b8040>] exit_files+0x38/0x4c
[<ffffffc0000b5b94>] do_exit+0x430/0x978
[<ffffffc0000b614c>] do_group_exit+0x40/0xd4
[<ffffffc0000c0208>] get_signal+0x23c/0x4f4
[<ffffffc0000890b4>] do_signal+0x1ac/0x518
[<ffffffc000089650>] do_notify_resume+0x5c/0x68
---[ end trace 963c192600337066 ]---

This patch adds the missing ct_user_exit to the el0_sp_pc entry path,
correcting the context tracking for this case.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Fixes: 6c81fe7925cc ("arm64: enable context tracking")
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

arm64: Do not attempt to use init_mm in reset_context()

commit 565630d503ef24e44c252bed55571b3a0d68455f upstream.

After secondary CPU boot or hotplug, the active_mm of the idle thread is
&init_mm. The init_mm.pgd (swapper_pg_dir) is only meant for TTBR1_EL1
and must not be set in TTBR0_EL1. Since when active_mm == &init_mm the
TTBR0_EL1 is already set to the reserved value, there is no need to
perform any context reset.

Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

mei: txe: reduce suspend/resume time

commit fe292283c23329218e384bffc6cb4bfa3fd92277 upstream.

HW has to be in known state before the initialisation
sequence is started. The polling step for settling aliveness
was set to 200ms while in practise this can be done in up to 30msecs.

Signed-off-by: Tomas Winkler <tomas.winkler@intel.com>
Signed-off-by: Barak Yoresh <barak.yoresh@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

mei: me: wait for power gating exit confirmation

commit 3dc196eae1db548f05e53e5875ff87b8ff79f249 upstream.

Fix the hbm power gating state machine so it will wait till it receives
confirmation interrupt for the PG_ISOLATION_EXIT message.

In process of the suspend flow the devices first have to exit from the
power gating state (runtime pm resume).
If we do not handle the confirmation interrupt after sending
PG_ISOLATION_EXIT message, we may receive it already after the suspend
flow has changed the device state and interrupt will be interpreted as a
spurious event, consequently link reset will be invoked which will
prevent the device from completing the suspend flow

kernel: [6603] mei_reset:136: mei_me 0000:00:16.0: powering down: end of reset
kernel: [476] mei_me_irq_thread_handler:643: mei_me 0000:00:16.0: function called after ISR to handle the interrupt processing.
kernel: mei_me 0000:00:16.0: FW not ready: resetting

Cc: Gabriele Mazzotta <gabriele.mzt@gmail.com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=86241
Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=770397
Tested-by: Gabriele Mazzotta <gabriele.mzt@gmail.com>
Signed-off-by: Alexander Usyskin <alexander.usyskin@intel.com>
Signed-off-by: Tomas Winkler <tomas.winkler@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

ARC: add compiler barrier to LLSC based cmpxchg

commit d57f727264f1425a94689bafc7e99e502cb135b5 upstream.

When auditing cmpxchg call sites, Chuck noted that gcc was optimizing
away some of the desired LDs.

| do {
| new = old = *ipi_data_ptr;
| new |= 1U << msg;
| } while (cmpxchg(ipi_data_ptr, old, new) != old);

was generating to below

| 8015cef8: ld         r2,[r4,0]  <-- First LD
| 8015cefc: bset       r1,r2,r1
|
| 8015cf00: llock      r3,[r4]  <-- atomic op
| 8015cf04: brne       r3,r2,8015cf10
| 8015cf08: scond      r1,[r4]
| 8015cf0c: bnz        8015cf00
|
| 8015cf10: brne       r3,r2,8015cf00  <-- Branch doesn't go to orig LD

Although this was fixed by adding a ACCESS_ONCE in this call site, it
seems safer (for now at least) to add compiler barrier to LLSC based
cmpxchg

Reported-by: Chuck Jordan <cjordan@synopsys.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

ARC: add smp barriers around atomics per Documentation/atomic_ops.txt

commit 2576c28e3f623ed401db7e6197241865328620ef upstream.

- arch_spin_lock/unlock were lacking the ACQUIRE/RELEASE barriers
   Since ARCv2 only provides load/load, store/store and all/all, we need
   the full barrier

- LLOCK/SCOND based atomics, bitops, cmpxchg, which return modified
   values were lacking the explicit smp barriers.

- Non LLOCK/SCOND varaints don't need the explicit barriers since that
   is implicity provided by the spin locks used to implement the
   critical section (the spin lock barriers in turn are also fixed in
   this commit as explained above

Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

tools selftests: Fix 'clean' target with make 3.81

commit 60df4642a83546fa6ea8286f5094ce8c0906c3ec upstream.

Make 3.81 doesn't have the 'undefine' command. Using undefine
to clear LDFLAGS fails when make version 3.81 is used. Fix it
to use override to clear LDFLAGS.

Tested-by: Shuah Khan <shuahkh@osg.samsung.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Link: http://lkml.kernel.org/r/20150514151225.GH23588@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Shuah Khan <shuahkh@osg.samsung.com>
Cc: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

iio: accel: kxcjk-1013: add the "KXCJ9000" ACPI id

commit 61e2c70da9cfc79e8485eafa0f98b5919b04bbe1 upstream.

This id has been seen in the DSDT of the Teclast X98 Air 3G tablet based
on Intel Bay Trail.

Signed-off-by: Antonio Ospite <ao2@ao2.it>
Cc: Bastien Nocera <hadess@hadess.net>
Reviewed-by: Daniel Baluta <daniel.baluta@intel.com>
Signed-off-by: Jonathan Cameron <jic23@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

ACPI / PM: Add missing pm_generic_complete() invocation

commit 3d56402d3fa8d10749eeb36293dd1992bd5ad0c3 upstream.

Add missing invocation of pm_generic_complete() to
acpi_subsys_complete() to allow ->complete callbacks provided
by the drivers of devices using the ACPI PM domain to be executed
during system resume.

Fixes: f25c0ae2b4c4 (ACPI / PM: Avoid resuming devices in ACPI PM domain during system suspend)
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

ACPI / init: Switch over platform to the ACPI mode later

commit b064a8fa77dfead647564c46ac8fc5b13bd1ab73 upstream.

Commit 73f7d1ca3263 "ACPI / init: Run acpi_early_init() before
timekeeping_init()" moved the ACPI subsystem initialization,
including the ACPI mode enabling, to an earlier point in the
initialization sequence, to allow the timekeeping subsystem
use ACPI early. Unfortunately, that resulted in boot regressions
on some systems and the early ACPI initialization was moved toward
its original position in the kernel initialization code by commit
c4e1acbb35e4 "ACPI / init: Invoke early ACPI initialization later".

However, that turns out to be insufficient, as boot is still broken
on the Tyan S8812 mainboard.

To fix that issue, split the ACPI early initialization code into
two pieces so the majority of it still located in acpi_early_init()
and the part switching over the platform into the ACPI mode goes into
a new function, acpi_subsystem_init(), executed at the original early
ACPI initialization spot.

That fixes the Tyan S8812 boot problem, but still allows ACPI
tables to be loaded earlier which is useful to the EFI code in
efi_enter_virtual_mode().

Link: https://bugzilla.kernel.org/show_bug.cgi?id=97141
Fixes: 73f7d1ca3263 "ACPI / init: Run acpi_early_init() before timekeeping_init()"
Reported-and-tested-by: Marius Tolzmann <tolzmann@molgen.mpg.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Toshi Kani <toshi.kani@hp.com>
Reviewed-by: Hanjun Guo <hanjun.guo@linaro.org>
Reviewed-by: Lee, Chun-Yi <jlee@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

ALSA: hda - Fix the dock headphone output on Fujitsu Lifebook E780

commit 4df3fd1700abbb53bd874143dfd1f9ac9e7cbf4b upstream.

Fujitsu Lifebook E780 sets the sequence number 0x0f to only only of
the two headphones, thus the driver tries to assign another as the
line-out, and this results in the inconsistent mapping between the
created jack ctl and the actual I/O. Due to this, PulseAudio doesn't
handle it properly and gets the silent output.

The fix is to ignore the non-HP sequencer checks.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=99681
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

ALSA: hda - Add headset support to Acer Aspire V5

commit 7819717b11346b8a5420b223b46600e394049c66 upstream.

Acer Aspire V5 with ALC282 codec needs the similar quirk like Dell
laptops to support the headset mic. The headset mic pin is 0x19 and
it's not exposed by BIOS, thus we need to fix the pincfg as well.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=96201
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

ALSA: hda - set proper caps for newer AMD hda audio in KB/KV

commit 650474fb737c3e0ea0f6ab8e43c2cd161080ce5c upstream.

Fixes audio problems on newer asics.

Noticed by: Kelly Anderson <kelly@xilka.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

ALSA: hda - Fix Dock Headphone on Thinkpad X250 seen as a Line Out

commit ec56af67a10a0d82b79027878a81fce08d002d50 upstream.

Thinkpad X250, when attached to a dock, has two headphone outs but
no line out. Make sure we don't try to turn this into one headphone
and one line out (since that disables the headphone amp on the dock).

Alsa-info at http://www.alsa-project.org/db/?f=36f8764e1d782397928feec715d0ef90dfddd4c1

Signed-off-by: David Henningsson <david.henningsson@canonical.com>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

ALSA: pcm: Fix pcm_class sysfs output

commit 60b93030b44a8c2cd015cebe5624fd7552ec67ec upstream.

The pcm_class sysfs of each PCM substream gives only "none" since the
recent code change to embed the struct device. Fix the code to point
directly to the embedded device object properly.

Fixes: ef46c7af93f9 ('ALSA: pcm: Embed struct device')
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

Disable write buffering on Toshiba ToPIC95

commit 2fb22a8042fe96b4220843f79241c116d90922c4 upstream.

Disable write buffering on the Toshiba ToPIC95 if it is enabled by
somebody (it is not supposed to be a power-on default according to
the datasheet). On the ToPIC95, practically no 32-bit Cardbus card
will work under heavy load without locking up the whole system if
this is left enabled. I tried about a dozen. It does not affect
16-bit cards. This is similar to the O2 bugs in early controller
revisions it seems.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=55961
Signed-off-by: Ryan C. Underwood <nemesis@icequake.net>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

ipr: Increase default adapter init stage change timeout

commit 45c44b5ff9caa743ed9c2bfd44307c536c9caf1e upstream.

Increase the default init stage change timeout from 15 seconds to 30 seconds.
This resolves issues we have seen with some adapters not transitioning
to the first init stage within 15 seconds, which results in adapter
initialization failures.

Signed-off-by: Brian King <brking@linux.vnet.ibm.com>
Signed-off-by: James Bottomley <JBottomley@Odin.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

rcu: Correctly handle non-empty Tiny RCU callback list with none ready

commit 6e91f8cb138625be96070b778d9ba71ce520ea7e upstream.

If, at the time __rcu_process_callbacks() is invoked,  there are callbacks
in Tiny RCU's callback list, but none of them are ready to be invoked,
the current list-management code will knit the non-ready callbacks out
of the list.  This can result in hangs and possibly worse.  This commit
therefore inserts a check for there being no callbacks that can be
invoked immediately.

This bug is unlikely to occur -- you have to get a new callback between
the time rcu_sched_qs() or rcu_bh_qs() was called, but before we get to
__rcu_process_callbacks().  It was detected by the addition of RCU-bh
testing to rcutorture, which in turn was instigated by Iftekhar Ahmed's
mutation testing.  Although this bug was made much more likely by
915e8a4fe45e (rcu: Remove fastpath from __rcu_process_callbacks()), this
did not cause the bug, but rather made it much more probable.   That
said, it takes more than 40 hours of rcutorture testing, on average,
for this bug to appear, so this fix cannot be considered an emergency.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

gpio: crystalcove: set IRQCHIP_SKIP_SET_WAKE for the irqchip

commit 61e749d7e1627d375156553ea0ae83c4f6bb5a9b upstream.

The CrystalCove GPIO irqchip doesn't have irq_set_wake callback defined
so we should set IRQCHIP_SKIP_SET_WAKE for it or it would cause an irq
desc's wake_depth unbalanced warning during system resume phase from the
gpio_keys driver, which is the driver for the power button of the ASUS
T100 laptop.

Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

sysfs: Create mountpoints with sysfs_create_mount_point

commit f9bb48825a6b5d02f4cabcc78967c75db903dcdc upstream.

This allows for better documentation in the code and
it allows for a simpler and fully correct version of
fs_fully_visible to be written.

The mount points converted and their filesystems are:
/sys/hypervisor/s390/       s390_hypfs
/sys/kernel/config/         configfs
/sys/kernel/debug/          debugfs
/sys/firmware/efi/efivars/  efivarfs
/sys/fs/fuse/connections/   fusectl
/sys/fs/pstore/             pstore
/sys/kernel/tracing/        tracefs
/sys/fs/cgroup/             cgroup
/sys/kernel/security/       securityfs
/sys/fs/selinux/            selinuxfs
/sys/fs/smackfs/            smackfs

Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

mnt: Modify fs_fully_visible to deal with locked ro nodev and atime

commit 8c6cf9cc829fcd0b179b59f7fe288941d0e31108 upstream.

Ignore an existing mount if the locked readonly, nodev or atime
attributes are less permissive than the desired attributes
of the new mount.

On success ensure the new mount locks all of the same readonly, nodev and
atime attributes as the old mount.

The nosuid and noexec attributes are not checked here as this change
is destined for stable and enforcing those attributes causes a
regression in lxc and libvirt-lxc where those applications will not
start and there are no known executables on sysfs or proc and no known
way to create exectuables without code modifications

Fixes: e51db73532955 ("userns: Better restrictions on when proc and sysfs can be mounted")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

mnt: Refactor the logic for mounting sysfs and proc in a user namespace

commit 1b852bceb0d111e510d1a15826ecc4a19358d512 upstream.

Fresh mounts of proc and sysfs are a very special case that works very
much like a bind mount. Unfortunately the current structure can not
preserve the MNT_LOCK... mount flags. Therefore refactor the logic
into a form that can be modified to preserve those lock bits.

Add a new filesystem flag FS_USERNS_VISIBLE that requires some mount
of the filesystem be fully visible in the current mount namespace,
before the filesystem may be mounted.

Move the logic for calling fs_fully_visible from proc and sysfs into
fs/namespace.c where it has greater access to mount namespace state.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

mnt: Update fs_fully_visible to test for permanently empty directories

commit 7236c85e1be51a9e25ba0f6e087a66ca89605a49 upstream.

fs_fully_visible attempts to make fresh mounts of proc and sysfs give
the mounter no more access to proc and sysfs than if they could have
by creating a bind mount.  One aspect of proc and sysfs that makes
this particularly tricky is that there are other filesystems that
typically mount on top of proc and sysfs.  As those filesystems are
mounted on empty directories in practice it is safe to ignore them.
However testing to ensure filesystems are mounted on empty directories
has not been something the in kernel data structures have supported so
the current test for an empty directory which checks to see
if nlink <= 2 is a bit lacking.

proc and sysfs have recently been modified to use the new empty_dir
infrastructure to create all of their dedicated mount points.  Instead
of testing for S_ISDIR(inode->i_mode) && i_nlink <= 2 to see if a
directory is empty, test for is_empty_dir_inode(inode).  That small
change guaranteess mounts found on proc and sysfs really are safe to
ignore, because the directories are not only empty but nothing can
ever be added to them.  This guarantees there is nothing to worry
about when mounting proc and sysfs.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

sysfs: Add support for permanently empty directories to serve as mount points.

commit 87d2846fcf88113fae2341da1ca9a71f0d916f2c upstream.

Add two functions sysfs_create_mount_point and
sysfs_remove_mount_point that hang a permanently empty directory off
of a kobject or remove a permanently emptpy directory hanging from a
kobject. Export these new functions so modular filesystems can use
them.

Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

kernfs: Add support for always empty directories.

commit ea015218f2f7ace2dad9cedd21ed95bdba2886d7 upstream.

Add a new function kernfs_create_empty_dir that can be used to create
directory that can not be modified.

Update the code to use make_empty_dir_inode when reporting a
permanently empty directory to the vfs.

Update the code to not allow adding to permanently empty directories.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

proc: Allow creating permanently empty directories that serve as mount points

commit eb6d38d5427b3ad42f5268da0f1dd31bb0af1264 upstream.

Add a new function proc_create_mount_point that when used to creates a
directory that can not be added to.

Add a new function is_empty_pde to test if a function is a mount
point.

Update the code to use make_empty_dir_inode when reporting
a permanently empty directory to the vfs.

Update the code to not allow adding to permanently empty directories.

Update /proc/openprom and /proc/fs/nfsd to be permanently empty directories.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

sysctl: Allow creating permanently empty directories that serve as mountpoints.

commit f9bd6733d3f11e24f3949becf277507d422ee1eb upstream.

Add a magic sysctl table sysctl_mount_point that when used to
create a directory forces that directory to be permanently empty.

Update the code to use make_empty_dir_inode when accessing permanently
empty directories.

Update the code to not allow adding to permanently empty directories.

Update /proc/sys/fs/binfmt_misc to be a permanently empty directory.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

fs: Add helper functions for permanently empty directories.

commit fbabfd0f4ee2e8847bf56edf481249ad1bb8c44d upstream.

To ensure it is safe to mount proc and sysfs I need to check if
filesystems that are mounted on top of them are mounted on truly empty
directories. Given that some directories can gain entries over time,
knowing that a directory is empty right now is insufficient.

Therefore add supporting infrastructure for permantently empty
directories that proc and sysfs can use when they create mount points
for filesystems and fs_fully_visible can use to test for permanently
empty directories to ensure that nothing will be gained by mounting a
fresh copy of proc or sysfs.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

zygo: turn down the preemption a little, get a bonus i180 driver. Leave ATOMIC_SLEEP debug enabled for now.

Revert "Btrfs: fix list transaction->pending_ordered corruption"

This reverts commit d3efe08400317888f559bbedf0e42cd31575d0ef.

Btrfs: fix race between caching kthread and returning inode to inode cache

While the inode cache caching kthread is calling btrfs_unpin_free_ino(),
we could have a concurrent call to btrfs_return_ino() that adds a new
entry to the root's free space cache of pinned inodes. This concurrent
call does not acquire the fs_info->commit_root_sem before adding a new
entry if the caching state is BTRFS_CACHE_FINISHED, which is a problem
because the caching kthread calls btrfs_unpin_free_ino() after setting
the caching state to BTRFS_CACHE_FINISHED and therefore races with
the task calling btrfs_return_ino(), which is adding a new entry, while
the former (caching kthread) is navigating the cache's rbtree, removing
and freeing nodes from the cache's rbtree without acquiring the spinlock
that protects the rbtree.

This race resulted in memory corruption due to double free of struct
btrfs_free_space objects because both tasks can end up doing freeing the
same objects. Note that adding a new entry can result in merging it with
other entries in the cache, in which case those entries are freed.
This is particularly important as btrfs_free_space structures are also
used for the block group free space caches.

This memory corruption can be detected by a debugging kernel, which
reports it with the following trace:

[132408.501148] slab error in verify_redzone_free(): cache `btrfs_free_space': double free detected
[132408.505075] CPU: 15 PID: 12248 Comm: btrfs-ino-cache Tainted: G        W       4.1.0-rc5-btrfs-next-10+ #1
[132408.505075] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
[132408.505075]  ffff880023e7d320 ffff880163d73cd8 ffffffff8145eec7 ffffffff81095dce
[132408.505075]  ffff880009735d40 ffff880163d73ce8 ffffffff81154e1e ffff880163d73d68
[132408.505075]  ffffffff81155733 ffffffffa054a95a ffff8801b6099f00 ffffffffa0505b5f
[132408.505075] Call Trace:
[132408.505075]  [<ffffffff8145eec7>] dump_stack+0x4f/0x7b
[132408.505075]  [<ffffffff81095dce>] ? console_unlock+0x356/0x3a2
[132408.505075]  [<ffffffff81154e1e>] __slab_error.isra.28+0x25/0x36
[132408.505075]  [<ffffffff81155733>] __cache_free+0xe2/0x4b6
[132408.505075]  [<ffffffffa054a95a>] ? __btrfs_add_free_space+0x2f0/0x343 [btrfs]
[132408.505075]  [<ffffffffa0505b5f>] ? btrfs_unpin_free_ino+0x8e/0x99 [btrfs]
[132408.505075]  [<ffffffff810f3b30>] ? time_hardirqs_off+0x15/0x28
[132408.505075]  [<ffffffff81084d42>] ? trace_hardirqs_off+0xd/0xf
[132408.505075]  [<ffffffff811563a1>] ? kfree+0xb6/0x14e
[132408.505075]  [<ffffffff811563d0>] kfree+0xe5/0x14e
[132408.505075]  [<ffffffffa0505b5f>] btrfs_unpin_free_ino+0x8e/0x99 [btrfs]
[132408.505075]  [<ffffffffa0505e08>] caching_kthread+0x29e/0x2d9 [btrfs]
[132408.505075]  [<ffffffffa0505b6a>] ? btrfs_unpin_free_ino+0x99/0x99 [btrfs]
[132408.505075]  [<ffffffff8106698f>] kthread+0xef/0xf7
[132408.505075]  [<ffffffff810f3b08>] ? time_hardirqs_on+0x15/0x28
[132408.505075]  [<ffffffff810668a0>] ? __kthread_parkme+0xad/0xad
[132408.505075]  [<ffffffff814653d2>] ret_from_fork+0x42/0x70
[132408.505075]  [<ffffffff810668a0>] ? __kthread_parkme+0xad/0xad
[132408.505075] ffff880023e7d320: redzone 1:0x9f911029d74e35b, redzone 2:0x9f911029d74e35b.
[132409.501654] slab: double free detected in cache 'btrfs_free_space', objp ffff880023e7d320
[132409.503355] ------------[ cut here ]------------
[132409.504241] kernel BUG at mm/slab.c:2571!

Therefore fix this by having btrfs_unpin_free_ino() acquire the lock
that protects the rbtree while doing the searches and removing entries.

Fixes: 1c70d8fb4dfa ("Btrfs: fix inode caching vs tree log")
Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit ae9d8f17118551bedd797406a6768b87c2146234)