Since mixed block groups accounting isn't byte-accurate and f_bree is an
unsigned integer, it could overflow. Avoid this.
Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com> Suggested-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
(cherry picked from commit 41b34accb265e3a20211a7a8ef3625678f1c6ec7)
Josef Bacik [Tue, 12 Apr 2016 16:54:40 +0000 (12:54 -0400)]
Btrfs: remove BUG_ON()'s in btrfs_map_block
btrfs_map_block can go horribly wrong in the face of fs corruption, lets agree
to not be assholes and panic at any possible chance things are all fucked up.
Signed-off-by: Josef Bacik <jbacik@fb.com>
[ removed type casts ] Signed-off-by: David Sterba <dsterba@suse.com>
(cherry picked from commit e042d1ec4417981dfe9331e47b76f17929bc2ffe)
David Sterba [Wed, 27 Apr 2016 00:55:15 +0000 (02:55 +0200)]
btrfs: preallocate compression workspaces
Preallocate one workspace for each compression type so we can guarantee
forward progress in the worst case. A failure cannot be a hard error as
we might not use compression at all on the filesystem. If we can't
allocate the workspaces later when need them, it might actually
deadlock, but in such situation the system has effectively not enough
memory to operate properly.
Vincent Stehlé [Tue, 10 May 2016 12:56:20 +0000 (14:56 +0200)]
Btrfs: fix fspath error deallocation
Make sure to deallocate fspath with vfree() in case of error in
init_ipath().
fspath is allocated with vmalloc() in init_data_container() since
commit 425d17a290c0 ("Btrfs: use larger limit for translation of logical to
inode").
Signed-off-by: Vincent Stehlé <vincent.stehle@intel.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
(cherry picked from commit 72928f2476d08c79f132b4f44a17c9a011dd98e3)
Filipe Manana [Wed, 6 Apr 2016 16:11:56 +0000 (17:11 +0100)]
Btrfs: fix for incorrect directory entries after fsync log replay
If we move a directory to a new parent and later log that parent and don't
explicitly log the old parent, when we replay the log we can end up with
entries for the moved directory in both the old and new parent directories.
Besides being ilegal to have directories with multiple hard links in linux,
it also resulted in the leaving the inode item with a link count of 1.
A similar issue also happens if we move a regular file - after the log tree
is replayed the file has a link in both the old and new parent directories,
when it should be only at the new directory.
Sample reproducer:
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt
$ mkdir /mnt/x
$ mkdir /mnt/y
$ touch /mnt/x/foo
$ mkdir /mnt/y/z
$ sync
$ ln /mnt/x/foo /mnt/x/bar
$ mv /mnt/y/z /mnt/x/z
< power fail >
$ mount /dev/sdc /mnt
$ ls -1Ri /mnt
/mnt:
257 x
258 y
/mnt/x:
259 bar
259 foo
260 z
/mnt/x/z:
/mnt/y:
260 z
/mnt/y/z:
$ umount /dev/sdc
$ btrfs check /dev/sdc
Checking filesystem on /dev/sdc
UUID: a67e2c4a-a4b4-4fdc-b015-9d9af1e344be
checking extents
checking free space cache
checking fs roots
root 5 inode 260 errors 2000, link count wrong
unresolved ref dir 257 index 4 namelen 1 name z filetype 2 errors 0
unresolved ref dir 258 index 2 namelen 1 name z filetype 2 errors 0
(...)
Attempting to remove the directory becomes impossible:
$ mount /dev/sdc /mnt
$ rmdir /mnt/y/z
$ ls -lh /mnt/y
ls: cannot access /mnt/y/z: No such file or directory
total 0
d????????? ? ? ? ? ? z
$ rmdir /mnt/x/z
rmdir: failed to remove ‘/mnt/x/z’: Stale file handle
$ ls -lh /mnt/x
ls: cannot access /mnt/x/z: Stale file handle
total 0
-rw-r--r-- 2 root root 0 Apr 6 18:06 bar
-rw-r--r-- 2 root root 0 Apr 6 18:06 foo
d????????? ? ? ? ? ? z
So make sure that on rename we set the last_unlink_trans value for our
inode, even if it's a directory, to the value of the current transaction's
ID and that if the new parent directory is logged that we fallback to a
transaction commit.
A test case for fstests is being submitted as well.
Filipe Manana [Mon, 25 Apr 2016 03:45:02 +0000 (04:45 +0100)]
Btrfs: fix empty symlink after creating symlink and fsync parent dir
If we create a symlink, fsync its parent directory, crash/power fail and
mount the filesystem, we end up with an empty symlink, which not only is
useless it's also not allowed in linux (the man page symlink(2) is well
explicit about that). So we just need to make sure to fully log an inode
if it's a symlink, to ensure its inline extent gets logged, ensuring the
same behaviour as ext3, ext4, xfs, reiserfs, f2fs, nilfs2, etc.
Dan Fuhry [Thu, 17 Mar 2016 14:23:38 +0000 (15:23 +0100)]
btrfs: add support for RENAME_EXCHANGE and RENAME_WHITEOUT
Two new flags, RENAME_EXCHANGE and RENAME_WHITEOUT, provide for new
behavior in the renameat2() syscall. This behavior is primarily used by
overlayfs. This patch adds support for these flags to btrfs, enabling it to
be used as a fully functional upper layer for overlayfs.
RENAME_EXCHANGE support was written by Davide Italiano originally
submitted on 2 April 2015.
Signed-off-by: Davide Italiano <dccitaliano@gmail.com> Signed-off-by: Dan Fuhry <dfuhry@datto.com>
[ remove unlikely ] Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com>
(cherry picked from commit cdd1fedf8261cd7a73c0596298902ff4f0f04492)
Filipe Manana [Thu, 5 May 2016 09:26:26 +0000 (10:26 +0100)]
Btrfs: fix number of transaction units for renames with whiteout
When we do a rename with the whiteout flag, we need to create the whiteout
inode, which in the worst case requires 5 transaction units (1 inode item,
1 inode ref, 2 dir items and 1 xattr if selinux is enabled). So bump the
number of transaction units from 11 to 16 if the whiteout flag is set.
Filipe Manana [Mon, 9 May 2016 12:15:27 +0000 (13:15 +0100)]
Btrfs: fix race between fsync and direct IO writes for prealloc extents
When we do a direct IO write against a preallocated extent (fallocate)
that does not go beyond the i_size of the inode, we do the write operation
without holding the inode's i_mutex (an optimization that landed in
commit 38851cc19adb ("Btrfs: implement unlocked dio write")). This allows
for a very tiny time window where a race can happen with a concurrent
fsync using the fast code path, as the direct IO write path creates first
a new extent map (no longer flagged as a prealloc extent) and then it
creates the ordered extent, while the fast fsync path first collects
ordered extents and then it collects extent maps. This allows for the
possibility of the fast fsync path to collect the new extent map without
collecting the new ordered extent, and therefore logging an extent item
based on the extent map without waiting for the ordered extent to be
created and complete. This can result in a situation where after a log
replay we end up with an extent not marked anymore as prealloc but it was
only partially written (or not written at all), exposing random, stale or
garbage data corresponding to the unwritten pages and without any
checksums in the csum tree covering the extent's range.
This is an extension of what was done in commit de0ee0edb21f ("Btrfs: fix
race between fsync and lockless direct IO writes").
So fix this by creating first the ordered extent and then the extent
map, so that this way if the fast fsync patch collects the new extent
map it also collects the corresponding ordered extent.
Scott Talbert [Mon, 9 May 2016 13:14:28 +0000 (09:14 -0400)]
btrfs: fix memory leak during RAID 5/6 device replacement
A 'struct bio' is allocated in scrub_missing_raid56_pages(), but it was never
freed anywhere.
Signed-off-by: Scott Talbert <scott.talbert@hgst.com> Signed-off-by: David Sterba <dsterba@suse.com>
(cherry picked from commit 4673272f43ae790ab9ec04e38a7542f82bb8f020)
and if the offset is beyond EOF, we'll get 'path' pointed to the place
of potentail insertion, and ret == 1.
This may confuse applications using ioctl(FIEL_IOC_FIEMAP).
Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
(cherry picked from commit 2d324f59f343967a03eeb2690f0ff178304d0687)
Filipe Manana [Fri, 20 May 2016 00:57:20 +0000 (01:57 +0100)]
Btrfs: fix race between readahead and device replace/removal
The list of devices is protected by the device_list_mutex and the device
replace code, in its finishing phase correctly takes that mutex before
removing the source device from that list. However the readahead code was
iterating that list without acquiring the respective mutex leading to
crashes later on due to invalid memory accesses:
So fix this by taking the device_list_mutex in the readahead code. We
can't use here the lighter approach of using a rcu_read_lock() and
rcu_read_unlock() pair together with a list_for_each_entry_rcu() call
because we end up doing calls to sleeping functions (kzalloc()) in the
respective code path.
Filipe Manana [Fri, 20 May 2016 03:34:23 +0000 (04:34 +0100)]
Btrfs: fix race between device replace and block group removal
When it's finishing, the device replace code iterates all extent maps
representing block group and for each one that has a stripe that refers
to the source device, it replaces its device with the target device.
However when it replaces the source device with the target device it,
the target device still has an ID of 0ULL (BTRFS_DEV_REPLACE_DEVID),
only after its ID is changed to match the one from the source device.
This leads to races with the chunk removal code that can temporarly see
a device with an ID of 0ULL and then attempt to use that ID to remove
items from the device tree and fail, causing a transaction abort:
btrfs_dev_replace_update_device_in_mapping_tree()
--> iterates fs_info->mapping_tree and
replaces the device in every extent
map's map->stripes[] with
dev_replace->tgtdev, which still has
an id of 0ULL (BTRFS_DEV_REPLACE_DEVID)
iterates over all stripes from
the extent map
--> calls btrfs_free_dev_extent()
passing it the target device
that still has an ID of 0ULL
--> btrfs_free_dev_extent() fails
--> aborts current transaction
finishes setting up the target device,
namely it sets tgtdev->devid to the value
of srcdev->devid (which is necessarily > 0)
frees the srcdev
unlocks fs_info->chunk_mutex
So fix this by taking the device list mutex while processing the stripes
for the chunk's extent map. This is similar to the race between device
replace and block group creation that was fixed by commit 50460e37186a
("Btrfs: fix race when finishing dev replace leading to transaction abort").
Filipe Manana [Sat, 14 May 2016 15:32:35 +0000 (16:32 +0100)]
Btrfs: fix unprotected assignment of the left cursor for device replace
We were assigning new values to fields of the device replace object
without holding the respective lock after processing each device extent.
This is important for the left cursor field which can be accessed by a
concurrent task running __btrfs_map_block (which, correctly, takes the
device replace lock).
So change these fields while holding the device replace lock.
Filipe Manana [Wed, 18 May 2016 19:29:44 +0000 (20:29 +0100)]
Btrfs: fix race between device replace and chunk allocation
While iterating and copying extents from the source device, the device
replace code keeps adjusting a left cursor that is used to make sure that
once we finish processing a device extent, any future writes to extents
from the corresponding block group will get into both the source and
target devices. This left cursor is also used for resuming the device
replace operation at mount time.
However using this left cursor to decide whether writes go into both
devices or only the source device is not enough to guarantee we don't
miss copying extents into the target device. There are two cases where
the current approach fails. The first one is related to when there are
holes in the device and they get allocated for new block groups while
the device replace operation is iterating the device extents (more on
this explained below). The second one is that when that loop over the
device extents finishes, we start dellaloc, wait for all ordered extents
and then commit the current transaction, we might have got new block
groups allocated that are now using a device extent that has an offset
greater then or equals to the value of the left cursor, in which case
writes to extents belonging to these new block groups will get issued
only to the source device.
For the first case where the current approach of using a left cursor
fails, consider the source device currently has the following layout:
[ extent bg A ] [ hole, unallocated space ] [extent bg B ]
3Gb 4Gb 5Gb
While we are iterating the device extents from the source device using
the commit root of the device tree, the following happens:
CPU 1 CPU 2
<we are at transaction N>
scrub_enumerate_chunks()
--> searches the device tree for
extents belonging to the source
device using the device tree's
commit root
--> 1st iteration finds extent belonging to
block group A
--> sets block group A to RO mode
(btrfs_inc_block_group_ro)
--> sets cursor left to found_key.offset
which is 3Gb
--> scrub_chunk() starts
copies all allocated extents from
block group's A stripe at source
device into target device
btrfs_alloc_chunk()
--> allocates device extent
in the range [4Gb, 5Gb[
from the source device for
a new block group C
extent allocated from block
group C for a direct IO,
buffered write or btree node/leaf
extent is written to, perhaps
in response to a writepages()
call from the VM or directly
through direct IO
the write is made only against
the source device and not against
the target device because the
extent's offset is in the interval
[4Gb, 5Gb[ which is larger then
the value of cursor_left (3Gb)
--> scrub_chunks() finishes
--> updates left cursor from 3Gb to
4Gb
--> btrfs_dec_block_group_ro() sets
block group A back to RW mode
<we are still at transaction N>
--> 2nd iteration finds extent belonging to
block group B - it did not find the new
extent in the range [4Gb, 5Gb[ for block
group C because we are using the device
tree's commit root or even because the
block group's items are not all yet
inserted in the respective btrees, that is,
the block group is still attached to some
transaction handle's new_bgs list and
btrfs_create_pending_block_groups() was
not called yet against that transaction
handle, so the device extent items were
not yet inserted into the devices tree
<we are still at transaction N>
--> so we end not copying anything from the newly
allocated device extent from the source device
to the target device
So fix this by making __btrfs_map_block() always redirect writes to the
target device as well, independently of the left cursor's value. With
this change the left cursor is now used only for the purpose of tracking
progress and allow a mount operation to resume a device replace.
Filipe Manana [Fri, 27 May 2016 16:42:05 +0000 (17:42 +0100)]
Btrfs: fix race between device replace and discard
While we are finishing a device replace operation, we can make a discard
operation (fs mounted with -o discard) do an invalid memory access like
the one reported by the following trace:
This happens because when we call btrfs_map_block() from
btrfs_discard_extent() to get a btrfs_bio structure, the device replace
operation has not finished yet, but before we use the device of one of the
stripes from the returned btrfs_bio structure, the device object is freed.
btrfs_commit_transaction()
btrfs_finish_extent_commit()
btrfs_discard_extent()
btrfs_map_block()
--> returns a struct btrfs_bio
with a stripe that has a
device field pointing to
source device of the replace
operation (the device that
is being replaced)
btrfs_dev_replace_update_device_in_mapping_tree()
--> iterates the mapping tree and for each
extent map that has a stripe pointing to
the source device, it updates the stripe
to point to the target device instead
btrfs_rm_dev_replace_blocked()
--> waits for fs_info->bio_counter to go down to 0
btrfs_rm_dev_replace_remove_srcdev()
--> removes source device from the list of devices
btrfs_rm_dev_replace_free_srcdev()
--> frees the source device
--> iterates over all stripes
of the returned struct
btrfs_bio
--> for each stripe it
dereferences its device
pointer
--> it ends up finding a
pointer to the device
used as the source
device for the replace
operation and that was
already freed
So fix this by surrounding the call to btrfs_map_block(), and the code
that uses the returned struct btrfs_bio, with calls to
btrfs_bio_counter_inc_blocked() and btrfs_bio_counter_dec(), so that
the finishing phase of the device replace operation blocks until the
the bio counter decreases to zero before it frees the source device.
This is the same approach we do at btrfs_map_bio() for example.
Filipe Manana [Fri, 27 May 2016 21:21:27 +0000 (22:21 +0100)]
Btrfs: fix race between device replace and read repair
While we are finishing a device replace operation we can have a concurrent
task trying to do a read repair operation, in which case it will call
btrfs_map_block() to get a struct btrfs_bio which can have a stripe that
points to the source device of the device replace operation. This allows
for the read repair task to dereference the stripe's device pointer after
the device replace operation has freed the source device, resulting in
an invalid memory access. This is similar to the problem solved by my
previous patch in the same series and named "Btrfs: fix race between
device replace and discard".
So fix this by surrounding the call to btrfs_map_block() and the code
that uses the returned struct btrfs_bio with calls to
btrfs_bio_counter_inc_blocked() and btrfs_bio_counter_dec(), giving the
proper serialization with the finishing phase of the device replace
operation.
Chris Mason [Sat, 19 Sep 2015 18:28:25 +0000 (11:28 -0700)]
Btrfs: deal with duplciates during extent_map insertion in btrfs_get_extent
When dealing with inline extents, btrfs_get_extent will incorrectly try
to insert a duplicate extent_map. The dup hits -EEXIST from
add_extent_map, but then we try to merge with the existing one and end
up trying to insert a zero length extent_map.
This actually works most of the time, except when there are extent maps
past the end of the inline extent. rocksdb will trigger this sometimes
because it preallocates an extent and then truncates down.
Jeff Mahoney [Wed, 16 Sep 2015 13:34:53 +0000 (15:34 +0200)]
btrfs: advertise which crc32c implementation is being used at module load
Since several architectures support hardware-accelerated crc32c
calculation, it would be nice to confirm that btrfs is actually using it.
We can see an elevated use count for the module, but it doesn't actually
show who the users are. This patch simply prints the name of the driver
after successfully initializing the shash.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
[ added a helper and used in module load-time message ] Signed-off-by: David Sterba <dsterba@suse.com>
(cherry picked from commit 5f9e1059d9347191b271bf7d13bd83db57594d2a)
Component mirror_num of struct btrfsic_block is defined
as unsigned int. Use %u as format specifier.
Signed-off-by: Heinrich Schuchardt <xypron.glpk@gmx.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
(cherry picked from commit 16ff4b454f1b56e8d89a9075feed0dd6ac510c3d)
Josef Bacik [Fri, 27 May 2016 17:03:04 +0000 (13:03 -0400)]
Btrfs: don't BUG_ON() in btrfs_orphan_add
This is just a screwup for developers, so change it to an ASSERT() so developers
notice when things go wrong and deal with the error appropriately if ASSERT()
isn't enabled. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Signed-off-by: David Sterba <dsterba@suse.com>
(cherry picked from commit 3b6571c180da85e43550c608e954ab7b2a31d954)
Josef Bacik [Fri, 25 Mar 2016 17:26:00 +0000 (13:26 -0400)]
Btrfs: don't do nocow check unless we have to
Before we write into prealloc/nocow space we have to make sure that there are no
references to the extents we are writing into, which means checking the extent
tree and csum tree in the case of nocow. So we don't want to do the nocow dance
unless we can't reserve data space, since it's a serious drag on performance.
With the following sequence
we get the worst case scenario where we have to fall back on to doing the check
anyway.
Without this patch
lat (usec): min=5, max=111598, avg=27.65, stdev=124.51
write: io=10240MB, bw=126876KB/s, iops=31718, runt= 82646msec
With this patch
lat (usec): min=3, max=91210, avg=14.09, stdev=110.62
write: io=10240MB, bw=212753KB/s, iops=53188, runt= 49286msec
We get twice the throughput, half of the runtime, and half of the average
latency. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
[ PAGE_CACHE_ removal related fixups ] Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit c6887cd11149d7325328749f06719071e6c725c6)
Wei Yongjun [Fri, 17 Jun 2016 17:20:40 +0000 (17:20 +0000)]
Btrfs: fix error return code in btrfs_init_test_fs()
Fix to return a negative error code from the kern_mount() error handling
case instead of 0(ret is set to 0 by register_filesystem), as done
elsewhere in this function.
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit 04e1b65af2085d4102b2b5d2fd1e050f8ee63092)
Liu Bo [Sat, 18 Jun 2016 02:16:21 +0000 (19:16 -0700)]
Btrfs: fix error handling in map_private_extent_buffer
map_private_extent_buffer() can return -EINVAL in two different cases,
1. when the requested contents span two pages if nodesize is larger
than pagesize,
2. when it detects something insane.
The 2nd one used to be only a WARN_ON(1), and we decided to return a error
to callers, but we didn't fix up all its callers, which will be
addressed by this patch.
Without this, btrfs may end up with 'general protection', ie.
reading invalid memory.
Reported-by: Vegard Nossum <vegard.nossum@oracle.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit 415b35a55b57a701afe7391d32a6bb0193b7d3da)
Wang Xiaoguang [Wed, 22 Jun 2016 01:57:01 +0000 (09:57 +0800)]
btrfs: fix disk_i_size update bug when fallocate() fails
When doing truncate operation, btrfs_setsize() will first call
truncate_setsize() to set new inode->i_size, but if later
btrfs_truncate() fails, btrfs_setsize() will call
"i_size_write(inode, BTRFS_I(inode)->disk_i_size)" to reset the
inmemory inode size, now bug occurs. It's because for truncate
case btrfs_ordered_update_i_size() directly uses inode->i_size
to update BTRFS_I(inode)->disk_i_size, indeed we should use the
"offset" argument to update disk_i_size. Here is the call graph:
==>btrfs_truncate()
====>btrfs_truncate_inode_items()
======>btrfs_ordered_update_i_size(inode, last_size, NULL);
Here btrfs_ordered_update_i_size()'s offset argument is last_size.
And below test case can reveal this bug:
dd if=/dev/zero of=fs.img bs=$((1024*1024)) count=100
dev=$(losetup --show -f fs.img)
mkdir -p /mnt/mntpoint
mkfs.btrfs -f $dev
mount $dev /mnt/mntpoint
cd /mnt/mntpoint
truncate --size 0 testfile
ls -l testfile
du -sh testfile
exit
In this case, truncate operation will fail for enospc reason and
"du -sh testfile" returns value greater than 0, but testfile's
size is 0, we need to reflect correct inode->i_size.
Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit c0d2f6104e8ab2eb75e58e72494ad4b69c5227f8)
Josef Bacik [Wed, 20 Jul 2016 23:48:45 +0000 (16:48 -0700)]
Btrfs: avoid deadlocks during reservations in btrfs_truncate_block
The new enospc code makes it possible to deadlock if we don't use
FLUSH_LIMIT during reservations inside a transaction. This enforces
the correct flush type to avoid both deadlocks and assertions
Chris Mason [Tue, 19 Jul 2016 12:52:36 +0000 (05:52 -0700)]
Btrfs: fix delalloc accounting after copy_from_user faults
Commit 56244ef151c3cd11 was almost but not quite enough to fix the
reservation math after btrfs_copy_from_user returned partial copies.
Some users are still seeing warnings in btrfs_destroy_inode, and with a
long enough test run I'm able to trigger them as well.
This patch fixes the accounting math again, bringing it much closer to
the way it was before the sectorsize conversion Chandan did. The
problem is accounting for the offset into the page/sector when we do a
partial copy. This one just uses the dirty_sectors variable which
should already be updated properly.
Liu Bo [Thu, 23 Jun 2016 01:31:27 +0000 (18:31 -0700)]
Btrfs: check inconsistence between chunk and block group
With btrfs-corrupt-block, one can drop one chunk item and mounting
will end up with a panic in btrfs_full_stripe_len().
This doesn't not remove the BUG_ON, but instead checks it a bit
earlier when we find the block group item.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
(cherry picked from commit 6fb37b756acce6d6e045f79c3764206033f617b4)
Liu Bo [Thu, 23 Jun 2016 23:32:45 +0000 (16:32 -0700)]
Btrfs: error out if generic_bin_search get invalid arguments
With btrfs-corrupt-block, one can set btree node/leaf's field, if
we assign a negative value to node/leaf, we can get various hangs,
eg. if extent_root's nritems is -2ULL, then we get stuck in
btrfs_read_block_groups() because it has a while loop and
btrfs_search_slot() on extent_root will always return the first
child.
This lets us know what's happening and returns a EINVAL to callers
instead of returning the first item.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
(cherry picked from commit 5e24e9af01abcb151173bb133f1a72b94239c670)
It turns out that we free fs root twice, btrfs_init_fs_root() calls
free_anon_bdev(root->anon_dev) and later then btrfs_get_fs_root() cals
free_fs_root which does another free_anon_bdev() and it ends up with the
above warning.
Instead of reset root->anon_dev to 0 after free_anon_bdev(), we can let
btrfs_init_fs_root() return directly since its callers have already done
the free job by calling free_fs_root().
Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: David Sterba <dsterba@suse.com>
(cherry picked from commit 876d2cf141b42f9be70a383502a324b64b23f33a)
Salah Triki [Sun, 3 Jul 2016 04:40:10 +0000 (05:40 +0100)]
btrfs: Replace -ENOENT by -ERANGE in btrfs_get_acl()
size contains the value returned by posix_acl_from_xattr(), which
returns -ERANGE, -ENODATA, zero, or an integer greater than zero. So
replace -ENOENT by -ERANGE.
Signed-off-by: Salah Triki <salah.triki@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
(cherry picked from commit a60617d0ae51c22f652b6fcf4ca56558c76e5aad)
Nikolay Borisov [Thu, 23 Jun 2016 18:17:08 +0000 (21:17 +0300)]
btrfs: Fix slab accounting flags
BTRFS is using a variety of slab caches to satisfy internal needs.
Those slab caches are always allocated with the SLAB_RECLAIM_ACCOUNT,
meaning allocations from the caches are going to be accounted as
SReclaimable. At the same time btrfs is not registering any shrinkers
whatsoever, thus preventing memory from the slabs to be shrunk. This
means those caches are not in fact reclaimable.
To fix this remove the SLAB_RECLAIM_ACCOUNT on all caches apart from the
inode cache, since this one is being freed by the generic VFS super_block
shrinker. Also set the transaction related caches as SLAB_TEMPORARY,
to better document the lifetime of the objects (it just translates
to SLAB_RECLAIM_ACCOUNT).
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
(cherry picked from commit fba4b697710eb2a4bee456b9d39e9239c66f8bee)
btrfs/073 invokes scrub ioctl in a tight loop. In subpage-blocksize
scenario this results in a lot of "scrub: size assumption sectorsize !=
PAGE_SIZE " messages being printed on the console. To reduce the number
of such messages this commit uses btrfs_err_rl() instead of
btrfs_err().
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: David Sterba <dsterba@suse.com>
(cherry picked from commit 751bebbe0ad261b3ac5d9e75eaec4d5fe9e23160)
Wang Xiaoguang [Mon, 11 Jul 2016 11:30:04 +0000 (19:30 +0800)]
btrfs: fix free space calculation in dump_space_info()
In btrfs, btrfs_space_info's bytes_may_use is treated as fs used
space, as what we do in reserve_metadata_bytes() or
btrfs_alloc_data_chunk_ondemand(), so in dump_space_info(), when
calculating free space, we should also subtract btrfs_space_info's
bytes_may_use.
Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
(cherry picked from commit 39581a3a1af7e03af41dde62a6485a934deb109b)
Liu Bo [Tue, 12 Jul 2016 01:52:57 +0000 (18:52 -0700)]
Btrfs: change BUG_ON()'s to ASSERT()'s in backref_cache_cleanup()
Since it is just an in-memory building of the backrefs of several
btree blocks, nothing is fatal other than memory leaks, so this
changes BUG_ON()'s to ASSERT()'s.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
(cherry picked from commit f49070957ffed84feb7944550f7edd53672b5201)
Liu Bo [Mon, 11 Jul 2016 17:39:07 +0000 (10:39 -0700)]
Btrfs: fix eb memory leak due to readpage failure
eb->io_pages is set in read_extent_buffer_pages().
In case of readpage failure, for pages that have been added to bio,
it calls bio_endio and later readpage_io_failed_hook() does the work.
When this eb's page (couldn't be the 1st page) fails to add itself to bio
due to failure in merge_bio(), it cannot decrease eb->io_pages via bio_endio,
and ends up with a memory leak eventually.
This lets __do_readpage propagate errors to callers and adds the
'atomic_dec(&eb->io_pages)'.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
(cherry picked from commit baf863b9c29617cc9eaf24e039f58846e700db48)
Liu Bo [Tue, 12 Jul 2016 18:24:21 +0000 (11:24 -0700)]
Btrfs: fix unexpected balance crash due to BUG_ON
Mounting a btrfs can resume previous balance operations asynchronously.
An user got a crash when one drive has some corrupt sectors.
Since balance can cancel itself in case of any error, we can gracefully
return errors to upper layers and let balance do the cancel job.
Reported-by: sash <master.b.at.raven@chefmail.de> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
(cherry picked from commit 5a488b9d2c25d98cdd11d09de311bfc83ba09fbd)
Liu Bo [Tue, 12 Jul 2016 17:29:37 +0000 (10:29 -0700)]
Btrfs: fix panic in balance due to EIO
During build_backref_tree(), if we fail to read a btree node,
we can eventually run into BUG_ON(cache->nr_nodes) that we put
in backref_cache_cleanup(), meaning we have at least one
memory leak.
This frees the backref_node that we's allocated at the very
beginning of build_backref_tree().
Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
(cherry picked from commit 0fd8c3dae14fb64947842472940b807ca0781da9)
Nikolay Borisov [Wed, 13 Jul 2016 13:19:14 +0000 (16:19 +0300)]
btrfs: Ratelimit "no csum found" info message
Recently during a crash it became apparent that this particular message
can be printed so many times that it causes the softlockup detector to
trigger. Fix it by ratelimiting it.
Signed-off-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: David Sterba <dsterba@suse.com>
(cherry picked from commit aee133afcdf77641be66c5c9f32975feca8e6bd0)
Nikolay Borisov [Wed, 13 Jul 2016 13:19:15 +0000 (16:19 +0300)]
btrfs: Add ratelimit to btrfs printing
This patch adds ratelimiting to all messages which are not using the _rl
version of the various printing APIs in btrfs. This is designed to be
used as a safety net, since a flood messages might cause the softlockup
detector to trigger. To reduce interference between different classes of
messages use a separate ratelimit state for every class of message.
Signed-off-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: David Sterba <dsterba@suse.com>
(cherry picked from commit 35f4e5e6f198a11c11606a742b309aa80271c66d)
Jeff Mahoney [Wed, 15 Jun 2016 14:25:38 +0000 (10:25 -0400)]
btrfs: introduce BTRFS_MAX_ITEM_SIZE
We use BTRFS_LEAF_DATA_SIZE - sizeof(struct btrfs_item) in
several places. This introduces a BTRFS_MAX_ITEM_SIZE macro to do the
same.
Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
(cherry picked from commit 14a1e067b45614d6236e3c82b36f62caef44ac62)
Filipe Manana [Tue, 14 Jun 2016 13:18:27 +0000 (14:18 +0100)]
Btrfs: add missing check for writeback errors on fsync
When we start an fsync we start ordered extents for all delalloc ranges.
However before attempting to log the inode, we only wait for those ordered
extents if we are not doing a full sync (bit BTRFS_INODE_NEEDS_FULL_SYNC
is set in the inode's flags). This means that if an ordered extent
completes with an IO error before we check if we can skip logging the
inode, we will not catch and report the IO error to user space. This is
because on an IO error, when the ordered extent completes we do not
update the inode, so if the inode was not previously updated by the
current transaction we end up not logging it through calls to fsync and
therefore not check its mapping flags for the presence of IO errors.
Fix this by checking for errors in the flags of the inode's mapping when
we notice we can skip logging the inode.
This caused sporadic failures in the test generic/331 (which explicitly
tests for IO errors during an fsync call).
Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
(cherry picked from commit 0596a9048bf2aca2a74b312493f39e4d5ac3b653)
Filipe Manana [Mon, 6 Jun 2016 10:51:25 +0000 (11:51 +0100)]
Btrfs: be more precise on errors when getting an inode from disk
When we attempt to read an inode from disk, we end up always returning an
-ESTALE error to the caller regardless of the actual failure reason, which
can be an out of memory problem (when allocating a path), some error found
when reading from the fs/subvolume btree (like a genuine IO error) or the
inode does not exists. So lets start returning the real error code to the
callers so that they don't treat all -ESTALE errors as meaning that the
inode does not exists (such as during orphan cleanup). This will also be
needed for a subsequent patch in the same series dealing with a special
fsync case.
Filipe Manana [Mon, 6 Jun 2016 15:11:13 +0000 (16:11 +0100)]
Btrfs: improve performance on fsync against new inode after rename/unlink
With commit 56f23fdbb600 ("Btrfs: fix file/data loss caused by fsync after
rename and new inode") we got simple fix for a functional issue when the
following sequence of actions is done:
at transaction N
create file A at directory D
at transaction N + M (where M >= 1)
move/rename existing file A from directory D to directory E
create a new file named A at directory D
fsync the new file
power fail
The solution was to simply detect such scenario and fallback to a full
transaction commit when we detect it. However this turned out to had a
significant impact on throughput (and a bit on latency too) for benchmarks
using the dbench tool, which simulates real workloads from smbd (Samba)
servers. For example on a test vm (with a debug kernel):
Unpatched:
Throughput 19.1572 MB/sec 32 clients 32 procs max_latency=1005.229 ms
Patched:
Throughput 23.7015 MB/sec 32 clients 32 procs max_latency=809.206 ms
The patched results (this patch is applied) are similar to the results of
a kernel with the commit 56f23fdbb600 ("Btrfs: fix file/data loss caused
by fsync after rename and new inode") reverted.
This change avoids the fallback to a transaction commit and instead makes
sure all the names of the conflicting inode (the one that had a name in a
past transaction that matches the name of the new file in the same parent
directory) are logged so that at log replay time we don't lose neither the
new file nor the old file, and the old file gets the name it was renamed
to.
This also ends up avoiding a full transaction commit for a similar case
that involves an unlink instead of a rename of the old file:
at transaction N
create file A at directory D
at transaction N + M (where M >= 1)
remove file A
create a new file named A at directory D
fsync the new file
power fail
Paolo Valente [Wed, 27 Jul 2016 05:22:05 +0000 (07:22 +0200)]
block: add missing group association in bio-cloning functions
When a bio is cloned, the newly created bio must be associated with
the same blkcg as the original bio (if BLK_CGROUP is enabled). If
this operation is not performed, then the new bio is not associated
with any group, and the group of the current task is returned when
the group of the bio is requested.
Depending on the cloning frequency, this may cause a large
percentage of the bios belonging to a given group to be treated
as if belonging to other groups (in most cases as if belonging to
the root group). The expected group isolation may thereby be broken.
This commit adds the missing association in bio-cloning functions.
Fixes: da2f0f74cf7d ("Btrfs: add support for blkio controllers") Cc: stable@vger.kernel.org # v4.3+ Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Reviewed-by: Nikolay Borisov <kernel@kyup.com> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit 20bd723ec6a3261df5e02250cd3a1fbb09a343f2)
If we hit this error when mounted with errors=continue or
errors=remount-ro:
EXT4-fs error (device loop0): ext4_mb_mark_diskspace_used:2940: comm ext4.exe: Allocating blocks 5090-6081 which overlap fs metadata
then ext4_mb_new_blocks() will call ext4_mb_release_context() and try to
continue. However, ext4_mb_release_context() is the wrong thing to call
here since we are still actually using the allocation context.
Instead, just error out. We could retry the allocation, but there is a
possibility of getting stuck in an infinite loop instead, so this seems
safer.
[ Fixed up so we don't return EAGAIN to userspace. --tytso ]
Fixes: 8556e8f3b6 ("ext4: Don't allow new groups to be added during block allocation") Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
If we encounter a filesystem error during orphan cleanup, we should stop.
Otherwise, we may end up in an infinite loop where the same inode is
processed again and again.
EXT4-fs (loop0): warning: checktime reached, running e2fsck is recommended
EXT4-fs error (device loop0): ext4_mb_generate_buddy:758: group 2, block bitmap and bg descriptor inconsistent: 6117 vs 0 free clusters
Aborting journal on device loop0-8.
EXT4-fs (loop0): Remounting filesystem read-only
EXT4-fs error (device loop0) in ext4_free_blocks:4895: Journal has aborted
EXT4-fs error (device loop0) in ext4_do_update_inode:4893: Journal has aborted
EXT4-fs error (device loop0) in ext4_do_update_inode:4893: Journal has aborted
EXT4-fs error (device loop0) in ext4_ext_remove_space:3068: IO failure
EXT4-fs error (device loop0) in ext4_ext_truncate:4667: Journal has aborted
EXT4-fs error (device loop0) in ext4_orphan_del:2927: Journal has aborted
EXT4-fs error (device loop0) in ext4_do_update_inode:4893: Journal has aborted
EXT4-fs (loop0): Inode 16 (00000000618192a0): orphan list check failed!
[...]
EXT4-fs (loop0): Inode 16 (0000000061819748): orphan list check failed!
[...]
EXT4-fs (loop0): Inode 16 (0000000061819bf0): orphan list check failed!
[...]
See-also: c9eb13a9105 ("ext4: fix hang when processing corrupted orphaned inode list") Cc: Jan Kara <jack@suse.cz> Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
If s_reserved_gdt_blocks is extremely large, it's possible for
ext4_init_block_bitmap(), which is called when ext4 sets up an
uninitialized block bitmap, to corrupt random kernel memory. Add the
same checks which e2fsck has --- it must never be larger than
blocksize / sizeof(__u32) --- and then add a backup check in
ext4_init_block_bitmap() in case the superblock gets modified after
the file system is mounted.
If ext4_fill_super() fails early, it's possible for ext4_evict_inode()
to call ext4_should_journal_data() before superblock options and flags
are fully set up. In that case, the iput() on the journal inode can
end up causing a BUG().
Work around this problem by reordering the tests so we only call
ext4_should_journal_data() after we know it's not the journal inode.
Fixes: 2d859db3e4 ("ext4: fix data corruption in inodes with journalled data") Fixes: 2b405bfa84 ("ext4: fix data=journal fast mount/umount hang") Cc: Jan Kara <jack@suse.cz> Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Commit 06bd3c36a733 (ext4: fix data exposure after a crash) uncovered a
deadlock in ext4_writepages() which was previously much harder to hit.
After this commit xfstest generic/130 reproduces the deadlock on small
filesystems.
The problem happens when ext4_do_update_inode() sets LARGE_FILE feature
and marks current inode handle as synchronous. That subsequently results
in ext4_journal_stop() called from ext4_writepages() to block waiting for
transaction commit while still holding page locks, reference to io_end,
and some prepared bio in mpd structure each of which can possibly block
transaction commit from completing and thus results in deadlock.
Fix the problem by releasing page locks, io_end reference, and
submitting prepared bio before calling ext4_journal_stop().
[ Changed to defer the call to ext4_journal_stop() only if the handle
is synchronous. --tytso ]
Reported-and-tested-by: Eryu Guan <eguan@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
An extent with lblock = 4294967295 and len = 1 will pass the
ext4_valid_extent() test:
ext4_lblk_t last = lblock + len - 1;
if (len == 0 || lblock > last)
return 0;
since last = 4294967295 + 1 - 1 = 4294967295. This would later trigger
the BUG_ON(es->es_lblk + es->es_len < es->es_lblk) in ext4_es_end().
We can simplify it by removing the - 1 altogether and changing the test
to use lblock + len <= lblock, since now if len = 0, then lblock + 0 ==
lblock and it fails, and if len > 0 then lblock + len > lblock in order
to pass (i.e. it doesn't overflow).
Fixes: 5946d0893 ("ext4: check for overlapping extents in ext4_valid_extent_entries()") Fixes: 2f974865f ("ext4: check for zero length extent explicitly") Cc: Eryu Guan <guaneryu@gmail.com> Signed-off-by: Phil Turnbull <phil.turnbull@oracle.com> Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
As suggested by the serial port infrastructure documentation, the IRQ is
requested in ->startup(). However, it is never freed in the ->shutdown()
hook.
With simple systems that open the serial port once for all and always
have at least one process that keep the serial port opened, there was no
problem. But with a more complicated system (*cough* systemd *cough*),
the serial port is opened/closed many times, which at some point no
processes having the serial port open at all. Due to this ->startup()
gets called again, tries to request_irq() again, which fails.
Fixes: 30530791a7a0 ("serial: mvebu-uart: initial support for Armada-3700 serial port") Cc: Ofer Heifetz <oferh@marvell.com> Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When there is more data to be processed, the current test in
scatterwalk_done may prevent us from calling pagedone even when
we should.
In particular, if we're on an SG entry spanning multiple pages
where the last page is not a full page, we will incorrectly skip
calling pagedone on the second last page.
This patch fixes this by adding a separate test for whether we've
reached the end of a page.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Using a v4.7-rc7 kernel on a HP ProLiant triggered following messages
pcc-cpufreq: (v1.10.00) driver loaded with frequency limits: 1200 MHz, 2800 MHz
cpufreq: ondemand governor failed, too long transition latency of HW, fallback to performance governor
The last line was shown for each CPU in the system.
Testing v4.5 (where commit 790d849b was integrated) triggered
similar messages. Same behaviour on a 2nd HP Proliant system.
So commit 790d849bf (cpufreq: pcc-cpufreq: update default value of
cpuinfo_transition_latency) causes the system to use performance
governor which, I guess, was not the intention of the patch.
Enabling debug output in pcc-cpufreq provides following verbose output:
pcc-cpufreq: (v1.10.00) driver loaded with frequency limits: 1200 MHz, 2800 MHz
pcc_get_offset: for CPU 0: pcc_cpu_data input_offset: 0x44, pcc_cpu_data output_offset: 0x48
init: policy->max is 2800000, policy->min is 1200000
get: get_freq for CPU 0
get: SUCCESS: (virtual) output_offset for cpu 0 is 0xffffc9000d7c0048, contains a value of: 0xff06. Speed is: 168000 MHz
cpufreq: ondemand governor failed, too long transition latency of HW, fallback to performance governor
target: CPU 0 should go to target freq: 2800000 (virtual) input_offset is 0xffffc9000d7c0044
target: was SUCCESSFUL for cpu 0
I am asking to revert 790d849bf to re-enable usage of ondemand
governor with pcc-cpufreq.
Fixes: 790d849bf (cpufreq: pcc-cpufreq: update default value of cpuinfo_transition_latency) Signed-off-by: Andreas Herrmann <aherrmann@suse.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This reverts commit f9054c70d28b ("mm, mempool: only set __GFP_NOMEMALLOC
if there are free elements").
There has been a report about OOM killer invoked when swapping out to a
dm-crypt device. The primary reason seems to be that the swapout out IO
managed to completely deplete memory reserves. Ondrej was able to
bisect and explained the issue by pointing to f9054c70d28b ("mm,
mempool: only set __GFP_NOMEMALLOC if there are free elements").
The reason is that the swapout path is not throttled properly because
the md-raid layer needs to allocate from the generic_make_request path
which means it allocates from the PF_MEMALLOC context. dm layer uses
mempool_alloc in order to guarantee a forward progress which used to
inhibit access to memory reserves when using page allocator. This has
changed by f9054c70d28b ("mm, mempool: only set __GFP_NOMEMALLOC if
there are free elements") which has dropped the __GFP_NOMEMALLOC
protection when the memory pool is depleted.
If we are running out of memory and the only way forward to free memory
is to perform swapout we just keep consuming memory reserves rather than
throttling the mempool allocations and allowing the pending IO to
complete up to a moment when the memory is depleted completely and there
is no way forward but invoking the OOM killer. This is less than
optimal.
The original intention of f9054c70d28b was to help with the OOM
situations where the oom victim depends on mempool allocation to make a
forward progress. David has mentioned the following backtrace:
We do not know more about why the mempool is depleted without being
replenished in time, though. In any case the dm layer shouldn't depend
on any allocations outside of the dedicated pools so a forward progress
should be guaranteed. If this is not the case then the dm should be
fixed rather than papering over the problem and postponing it to later
by accessing more memory reserves.
mempools are a mechanism to maintain dedicated memory reserves to
guaratee forward progress. Allowing them an unbounded access to the
page allocator memory reserves is going against the whole purpose of
this mechanism.
Bisected by Ondrej Kozina.
[akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/20160721145309.GR26379@dhcp22.suse.cz Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Ondrej Kozina <okozina@redhat.com> Reviewed-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: NeilBrown <neilb@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Mikulas Patocka <mpatocka@redhat.com> Cc: Ondrej Kozina <okozina@redhat.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
fuse_flush() calls write_inode_now() that triggers writeback, but actual
writeback will happen later, on fuse_sync_writes(). If an error happens,
fuse_writepage_end() will set error bit in mapping->flags. So, we have to
check mapping->flags after fuse_sync_writes().
In kernel bug 150021, a kernel panic was reported when restoring a
hibernate image. Only a picture of the oops was reported, so I can't
paste the whole thing here. But here are the most interesting parts:
The RIP is on the same page as RBP, so it apparently started executing
on the stack.
The bug was bisected to commit ef0f3ed5a4ac (x86/asm/power: Create
stack frames in hibernate_asm_64.S), which in retrospect seems quite
dangerous, since that code saves and restores the stack pointer from a
global variable ('saved_context').
There are a lot of moving parts in the hibernate save and restore paths,
so I don't know exactly what caused the panic. Presumably, a FRAME_END
was executed without the corresponding FRAME_BEGIN, or vice versa. That
would corrupt the return address on the stack and would be consistent
with the details of the above panic.
[ rjw: One major problem is that by the time the FRAME_BEGIN in
restore_registers() is executed, the stack pointer value may not
be valid any more. Namely, the stack area pointed to by it
previously may have been overwritten by some image memory contents
and that page frame may now be used for whatever different purpose
it had been allocated for before hibernation. In that case, the
FRAME_BEGIN will corrupt that memory. ]
Instead of doing the frame pointer save/restore around the bounds of the
affected functions, just do it around the call to swsusp_save().
That has the same effect of ensuring that if swsusp_save() sleeps, the
frame pointers will be correct. It's also a much more obviously safe
way to do it than the original patch. And objtool still doesn't report
any warnings.
Fixes: ef0f3ed5a4ac (x86/asm/power: Create stack frames in hibernate_asm_64.S) Link: https://bugzilla.kernel.org/show_bug.cgi?id=150021 Reported-by: Andre Reinke <andre.reinke@mailbox.org> Tested-by: Andre Reinke <andre.reinke@mailbox.org> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Acked-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Usually, after we have found the proper microcode blob for the current
machine, we stash it away for later use with save_microcode_in_initrd().
However, with builtin microcode which doesn't come from the initrd, we
don't call that function because CONFIG_BLK_DEV_INITRD=n and even if
set, we don't have a valid initrd.
In order to fix this, let's make save_microcode_in_initrd() an
fs_initcall which runs before rootfs_initcall() as this was the time it
was called previously through:
Radix trees may be used not only for storing page cache pages, so
unconditionally accounting radix tree nodes to the current memory cgroup
is bad: if a radix tree node is used for storing data shared among
different cgroups we risk pinning dead memory cgroups forever.
So let's only account radix tree nodes if it was explicitly requested by
passing __GFP_ACCOUNT to INIT_RADIX_TREE. Currently, we only want to
account page cache entries, so mark mapping->page_tree so.
Fixes: 58e698af4c63 ("radix-tree: account radix_tree_node to memory cgroup") Link: http://lkml.kernel.org/r/1470057188-7864-1-git-send-email-vdavydov@virtuozzo.com Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Commit 53dad6d3a8e5 ("ipc: fix race with LSMs") updated ipc_rcu_putref()
to receive rcu freeing function but used generic ipc_rcu_free() instead
of msg_rcu_free() which does security cleaning.
Running LTP msgsnd06 with kmemleak gives the following:
This problem can occur in the following situation:
open()
- pread()
- .seq_start()
- iter = kmalloc() // succeeds
- seqf->private = iter
- .seq_stop()
- kfree(seqf->private)
- pread()
- .seq_start()
- iter = kmalloc() // fails
- .seq_stop()
- class_dev_iter_exit(seqf->private) // boom! old pointer
As the comment in disk_seqf_stop() says, stop is called even if start
failed, so we need to reinitialise the private pointer to NULL when seq
iteration stops.
An alternative would be to set the private pointer to NULL when the
kmalloc() in disk_seqf_start() fails.
x86_64 needs to use compat_sys_keyctl for 32-bit userspace rather than
calling sys_keyctl(). The latter will work in a lot of cases, thereby
hiding the issue.
Reported-by: Stephan Mueller <smueller@chronox.de> Tested-by: Stephan Mueller <smueller@chronox.de> Signed-off-by: David Howells <dhowells@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: keyrings@vger.kernel.org Cc: linux-security-module@vger.kernel.org Link: http://lkml.kernel.org/r/146961615805.14395.5581949237156769439.stgit@warthog.procyon.org.uk Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Since commit 73f576c04b94 ("mm: memcontrol: fix cgroup creation failure
after many small jobs") swap entries do not pin memcg->css.refcnt
directly. Instead, they pin memcg->id.ref. So we should adjust the
reference counters accordingly when moving swap charges between cgroups.
Fixes: 73f576c04b941 ("mm: memcontrol: fix cgroup creation failure after many small jobs") Link: http://lkml.kernel.org/r/9ce297c64954a42dc90b543bc76106c4a94f07e8.1470219853.git.vdavydov@virtuozzo.com Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
An offline memory cgroup might have anonymous memory or shmem left
charged to it and no swap. Since only swap entries pin the id of an
offline cgroup, such a cgroup will have no id and so an attempt to
swapout its anon/shmem will not store memory cgroup info in the swap
cgroup map. As a result, memcg->swap or memcg->memsw will never get
uncharged from it and any of its ascendants.
Fix this by always charging swapout to the first ancestor cgroup that
hasn't released its id yet.
[hannes@cmpxchg.org: add comment to mem_cgroup_swapout]
[vdavydov@virtuozzo.com: use WARN_ON_ONCE() in mem_cgroup_id_get_online()] Link: http://lkml.kernel.org/r/20160803123445.GJ13263@esperanza Fixes: 73f576c04b941 ("mm: memcontrol: fix cgroup creation failure after many small jobs") Link: http://lkml.kernel.org/r/5336daa5c9a32e776067773d9da655d2dc126491.1470219853.git.vdavydov@virtuozzo.com Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Classic BPF JIT was never ported completely to work on little endian
powerpc. However, it can be enabled and will crash the system when used.
As such, disable use of BPF JIT on ppc64le.
Fixes: 7c105b63bd98 ("powerpc: Add CONFIG_CPU_LITTLE_ENDIAN kernel config option.") Reported-by: Thadeu Lima de Souza Cascardo <cascardo@redhat.com> Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Acked-by: Thadeu Lima de Souza Cascardo <cascardo@redhat.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The PE primary bus cannot be got from its child devices when having
full hotplug in error recovery. The PE primary bus is cached, which
is done in commit <05ba75f84864> ("powerpc/eeh: Fix stale cached primary
bus"). In eeh_reset_device(), the flag (EEH_PE_PRI_BUS) is cleared
before the PCI hot remove. eeh_pe_bus_get() then returns NULL as the
PE primary bus in pnv_eeh_reset() and it crashes the kernel eventually.
This fixes the issue by clearing the flag (EEH_PE_PRI_BUS) before the
PCI hot add. With it, the PowerNV EEH reset backend (pnv_eeh_reset())
can get valid PE primary bus through eeh_pe_bus_get().
Fixes: 67086e32b564 ("powerpc/eeh: powerpc/eeh: Support error recovery for VF PE") Reported-by: Pridhiviraj Paidipeddi <ppaiddipe@in.ibm.com> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Presently, a corrupted or malicious UDF filesystem containing a very large
number (or cycle) of Logical Volume Integrity Descriptor extent
indirections may trigger a stack overflow and kernel panic in
udf_load_logicalvolint() on mount.
Replace the unnecessary recursion in udf_load_logicalvolint() with
simple iteration. Set an arbitrary limit of 1000 indirections (which would
have almost certainly overflowed the stack without this fix), and treat
such cases as if there were no LVID.
Signed-off-by: Alden Tondettar <alden.tondettar@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
edfe63ec97ed ("x86/mtrr: Fix Xorg crashes in Qemu sessions")
PAT is now set to disabled state when MTRRs are disabled.
Thus, reactivating the __pa(high_memory) check in
phys_mem_access_prot_allowed().
When CONFIG_DEBUG_VIRTUAL is set, __pa() calls __phys_addr(),
which in turn calls slow_virt_to_phys() for 'high_memory'.
Because 'high_memory' is set to (the max direct mapped virt
addr + 1), it is not a valid virtual address. Hence,
slow_virt_to_phys() returns 0 and hit the BUG_ON. Using
__pa_nodebug() instead of __pa() will fix this BUG_ON.
However, this code block, originally written for Pentiums and
earlier, is no longer adequate since a 32-bit Xen guest has
MTRRs disabled and supports ZONE_HIGHMEM. In this setup,
this code sets UC attribute for accessing RAM in high memory
range.
Delete this code block as it has been unused for a long time.
Reported-by: kernel test robot <ying.huang@linux.intel.com> Reviewed-by: Borislav Petkov <bp@suse.de> Signed-off-by: Toshi Kani <toshi.kani@hpe.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: xen-devel@lists.xenproject.org Link: http://lkml.kernel.org/r/1460403360-25441-1-git-send-email-toshi.kani@hpe.com Link: https://lkml.org/lkml/2016/4/1/608 Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Xen supports PAT without MTRRs for its guests. In order to
enable WC attribute, it was necessary for xen_start_kernel()
to call pat_init_cache_modes() to update PAT table before
starting guest kernel.
Now that the kernel initializes PAT table to the BIOS handoff
state when MTRR is disabled, this Xen-specific PAT init code
is no longer necessary. Delete it from xen_start_kernel().
Also change __init_cache_modes() to a static function since
PAT table should not be tweaked by other modules.
Signed-off-by: Toshi Kani <toshi.kani@hpe.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Juergen Gross <jgross@suse.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Borislav Petkov <bp@suse.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luis R. Rodriguez <mcgrof@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Toshi Kani <toshi.kani@hp.com> Cc: elliott@hpe.com Cc: paul.gortmaker@windriver.com Cc: xen-devel@lists.xenproject.org Link: http://lkml.kernel.org/r/1458769323-24491-7-git-send-email-toshi.kani@hpe.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
get_mtrr_state() calls pat_init() on BSP even if MTRR is disabled.
This results in calling pat_init() on BSP only since APs do not call
pat_init() when MTRR is disabled. This inconsistency between BSP
and APs leads to undefined behavior.
Make BSP's calling condition to pat_init() consistent with AP's,
mtrr_ap_init() and mtrr_aps_init().
Signed-off-by: Toshi Kani <toshi.kani@hpe.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Borislav Petkov <bp@suse.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Juergen Gross <jgross@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luis R. Rodriguez <mcgrof@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Toshi Kani <toshi.kani@hp.com> Cc: elliott@hpe.com Cc: konrad.wilk@oracle.com Cc: paul.gortmaker@windriver.com Cc: xen-devel@lists.xenproject.org Link: http://lkml.kernel.org/r/1458769323-24491-6-git-send-email-toshi.kani@hpe.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
A Xorg failure on qemu32 was reported as a regression [1] caused by
commit 9cd25aac1f44 ("x86/mm/pat: Emulate PAT when it is disabled").
This patch fixes the Xorg crash.
Negative effects of this regression were the following two failures [2]
in Xorg on QEMU with QEMU CPU model "qemu32" (-cpu qemu32), which were
triggered by the fact that its virtual CPU does not support MTRRs.
#1. copy_process() failed in the check in reserve_pfn_range()
A WC map request was tracked as WC in memtype, which set a PTE as
UC (pgprot) per __cachemode2pte_tbl[]. This led to this error in
reserve_pfn_range() called from track_pfn_copy(), which obtained
a pgprot from a PTE. It converts pgprot to page_cache_mode, which
does not necessarily result in the original page_cache_mode since
__cachemode2pte_tbl[] redirects multiple types to UC.
#2. error path in copy_process() then hit WARN_ON_ONCE in
untrack_pfn().
These negative effects are caused by two separate bugs, but they
can be addressed in separate patches. Fixing the pat_init() issue
described below addresses the root cause, and avoids Xorg to hit
these cases.
When the CPU does not support MTRRs, MTRR does not call pat_init(),
which leaves PAT enabled without initializing PAT. This pat_init()
issue is a long-standing issue, but manifested as issue #1 (and then
hit issue #2) with the above-mentioned commit because the memtype
now tracks cache attribute with 'page_cache_mode'.
This pat_init() issue existed before the commit, but we used pgprot
in memtype. Hence, we did not have issue #1 before. But WC request
resulted in WT in effect because WC pgrot is actually WT when PAT
is not initialized. This is not how it was designed to work. When
PAT is set to disable properly, WC is converted to UC. The use of
WT can result in a system crash if the target range does not support
WT. Fortunately, nobody ran into such issue before.
To fix this pat_init() issue, PAT code has been enhanced to provide
pat_disable() interface. Call this interface when MTRRs are disabled.
By setting PAT to disable properly, PAT bypasses the memtype check,
and avoids issue #1.
9cd25aac1f44 ("x86/mm/pat: Emulate PAT when it is disabled")
... PAT needs to provide an interface that prevents the OS from
initializing the PAT MSR.
PAT MSR initialization must be done on all CPUs using the specific
sequence of operations defined in the Intel SDM. This requires MTRRs
to be enabled since pat_init() is called as part of MTRR init
from mtrr_rendezvous_handler().
Make pat_disable() as the interface that prevents the OS from
initializing the PAT MSR. MTRR will call this interface when it
cannot provide the SDM-defined sequence to initialize PAT.
This also assures that pat_disable() called from pat_bsp_init()
will set the PAT table properly when CPU does not support PAT.
Signed-off-by: Toshi Kani <toshi.kani@hpe.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Borislav Petkov <bp@suse.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Juergen Gross <jgross@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luis R. Rodriguez <mcgrof@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Robert Elliott <elliott@hpe.com> Cc: Toshi Kani <toshi.kani@hp.com> Cc: konrad.wilk@oracle.com Cc: paul.gortmaker@windriver.com Cc: xen-devel@lists.xenproject.org Link: http://lkml.kernel.org/r/1458769323-24491-3-git-send-email-toshi.kani@hpe.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9cd25aac1f44 ("x86/mm/pat: Emulate PAT when it is disabled")'
... PAT needs to support a case that PAT MSR is initialized with a
non-default value.
When pat_init() is called and PAT is disabled, it initializes the
PAT table with the BIOS default value. Xen, however, sets PAT MSR
with a non-default value to enable WC. This causes inconsistency
between the PAT table and PAT MSR when PAT is set to disable on Xen.
Change pat_init() to handle the PAT disable cases properly. Add
init_cache_modes() to handle two cases when PAT is set to disable.
1. CPU supports PAT: Set PAT table to be consistent with PAT MSR.
2. CPU does not support PAT: Set PAT table to be consistent with
PWT and PCD bits in a PTE.
Note, __init_cache_modes(), renamed from pat_init_cache_modes(),
will be changed to a static function in a later patch.
Signed-off-by: Toshi Kani <toshi.kani@hpe.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Borislav Petkov <bp@suse.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Juergen Gross <jgross@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luis R. Rodriguez <mcgrof@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Toshi Kani <toshi.kani@hp.com> Cc: elliott@hpe.com Cc: konrad.wilk@oracle.com Cc: paul.gortmaker@windriver.com Cc: xen-devel@lists.xenproject.org Link: http://lkml.kernel.org/r/1458769323-24491-2-git-send-email-toshi.kani@hpe.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Don't allow RNDADDTOENTCNT or RNDADDENTROPY to accept a negative
entropy value. It doesn't make any sense to subtract from the entropy
counter, and it can trigger a warning:
// autogenerated by syzkaller (http://github.com/google/syzkaller)
int main() {
int fd = open("/dev/random", O_RDWR);
int val = -5000;
ioctl(fd, RNDADDTOENTCNT, &val);
return 0;
}
It's harmless in that (a) only root can trigger it, and (b) after
complaining the code never does let the entropy count go negative, but
it's better to simply not allow this userspace from passing in a
negative entropy value altogether.
Use regulator_list_voltage_linear_range in rpm_smps_ldo_ops_fixed is
wrong because it is used for fixed regulator without any linear range.
The rpm_smps_ldo_ops_fixed is used for pm8941_lnldo which has fixed_uV
set and n_voltages = 1. In this case, regulator_list_voltage() can return
rdev->desc->fixed_uV without .list_voltage implementation.
Fixes: 3bfbb4d1a480 ("regulator: qcom_smd: add list_voltage callback") Signed-off-by: Axel Lin <axel.lin@ingics.com> Signed-off-by: Mark Brown <broonie@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
There are several computatations of the sc in the
ud receive routine.
Besides the code duplication, all are wrong when the
sc is greater than 15. In that case the code incorrectly
or's a 1 into the computed sc instead of 1 shifted left
by 4.
Fix precomputed sc5 by using an already implemented routine
hdr2sc() and deleting flawed duplicated code.
Cc: Stable <stable@vger.kernel.org> # 4.6+ Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
MIPS64 needs to use compat_sys_keyctl for 32-bit userspace rather than
calling sys_keyctl. The latter will work in a lot of cases, thereby hiding
the issue.
There are also bunch of AML methods that that the BIOS can use to access
these fields. Most of the systems in question AML methods accessing the
SMBI OpRegion are never used.
Now, because of this SMBI OpRegion many systems fail to load the SMBus
driver with an error looking like one below:
ACPI Warning: SystemIO range 0x0000000000003040-0x000000000000305F
conflicts with OpRegion 0x0000000000003040-0x000000000000304F
(\_SB.PCI0.SBUS.SMBI) (20160108/utaddress-255)
ACPI: If an ACPI driver is available for this device, you should use
it instead of the native driver
The reason is that this SMBI OpRegion conflicts with the PCI BAR used by
the SMBus driver.
It turns out that we can install a custom SystemIO address space handler
for the SMBus device to intercept all accesses through that OpRegion. This
allows us to share the PCI BAR with the AML code if it for some reason is
using it. We do not expect that this OpRegion handler will ever be called
but if it is we print a warning and prevent all access from the SMBus
driver itself.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=110041 Reported-by: Andy Lutomirski <luto@kernel.org> Reported-by: Pali Rohár <pali.rohar@gmail.com> Suggested-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Jean Delvare <jdelvare@suse.de> Reviewed-by: Benjamin Tissoires <benjamin.tissoires@redhat.com> Tested-by: Pali Rohár <pali.rohar@gmail.com> Tested-by: Jean Delvare <jdelvare@suse.de> Signed-off-by: Wolfram Sang <wsa@the-dreams.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
tcp_select_initial_window() intends to advertise a window
scaling for the maximum possible window size. To do so,
it considers the maximum of net.ipv4.tcp_rmem[2] and
net.core.rmem_max as the only possible upper-bounds.
However, users with CAP_NET_ADMIN can use SO_RCVBUFFORCE
to set the socket's receive buffer size to values
larger than net.ipv4.tcp_rmem[2] and net.core.rmem_max.
Thus, SO_RCVBUFFORCE is effectively ignored by
tcp_select_initial_window().
To fix this, consider the maximum of net.ipv4.tcp_rmem[2],
net.core.rmem_max and socket's initial buffer space.
Fixes: b0573dea1fb3 ("[NET]: Introduce SO_{SND,RCV}BUFFORCE socket options") Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Suggested-by: Neal Cardwell <ncardwell@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
macsec_decrypt() is not called when validation is disabled and so
macsec_skb_cb(skb)->rx_sa is not set; but it is used later in
macsec_post_decrypt(), ensure that it's always initialized.
Fixes: c09440f7dcb3 ("macsec: introduce IEEE 802.1AE driver") Signed-off-by: Beniamino Galvani <bgalvani@redhat.com> Acked-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>