Filipe Manana [Sat, 13 Jun 2015 05:52:57 +0000 (06:52 +0100)]
Btrfs: fix race between caching kthread and returning inode to inode cache
While the inode cache caching kthread is calling btrfs_unpin_free_ino(),
we could have a concurrent call to btrfs_return_ino() that adds a new
entry to the root's free space cache of pinned inodes. This concurrent
call does not acquire the fs_info->commit_root_sem before adding a new
entry if the caching state is BTRFS_CACHE_FINISHED, which is a problem
because the caching kthread calls btrfs_unpin_free_ino() after setting
the caching state to BTRFS_CACHE_FINISHED and therefore races with
the task calling btrfs_return_ino(), which is adding a new entry, while
the former (caching kthread) is navigating the cache's rbtree, removing
and freeing nodes from the cache's rbtree without acquiring the spinlock
that protects the rbtree.
This race resulted in memory corruption due to double free of struct
btrfs_free_space objects because both tasks can end up doing freeing the
same objects. Note that adding a new entry can result in merging it with
other entries in the cache, in which case those entries are freed.
This is particularly important as btrfs_free_space structures are also
used for the block group free space caches.
This memory corruption can be detected by a debugging kernel, which
reports it with the following trace:
Therefore fix this by having btrfs_unpin_free_ino() acquire the lock
that protects the rbtree while doing the searches and removing entries.
Fixes: 1c70d8fb4dfa ("Btrfs: fix inode caching vs tree log") Cc: stable@vger.kernel.org Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit ae9d8f17118551bedd797406a6768b87c2146234)
Filipe Manana [Wed, 15 Jul 2015 22:26:43 +0000 (23:26 +0100)]
Btrfs: fix stale directory entries after fsync log replay
We have another case where after an fsync log replay we get an inode with
a wrong link count (smaller than it should be) and a number of directory
entries greater than its link count. This happens when we add a new link
hard link to our inode A and then we fsync some other inode B that has
the side effect of logging the parent directory inode too. In this case
at log replay time we add the new hard link to our inode (the item with
key BTRFS_INODE_REF_KEY) when processing the parent directory but we
never adjust the link count of our inode A. As a result we get stale dir
entries for our inode A that can never be deleted and therefore it makes
it impossible to remove the parent directory (as its i_size can never
decrease back to 0).
A simple reproducer for fstests that triggers this issue:
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
_cleanup_flakey
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
. ./common/dmflakey
# real QA test starts here
_need_to_be_root
_supported_fs generic
_supported_os Linux
_require_scratch
_require_dm_flakey
_require_metadata_journaling $SCRATCH_DEV
# Create our test directory and files.
mkdir $SCRATCH_MNT/testdir
touch $SCRATCH_MNT/testdir/foo
touch $SCRATCH_MNT/testdir/bar
# Make sure everything done so far is durably persisted.
sync
# Create one hard link for file foo and another one for file bar. After
# that fsync only the file bar.
ln $SCRATCH_MNT/testdir/bar $SCRATCH_MNT/testdir/bar_link
ln $SCRATCH_MNT/testdir/foo $SCRATCH_MNT/testdir/foo_link
$XFS_IO_PROG -c "fsync" $SCRATCH_MNT/testdir/bar
# Silently drop all writes on scratch device to simulate power failure.
_load_flakey_table $FLAKEY_DROP_WRITES
_unmount_flakey
# Allow writes again and mount the fs to trigger log/journal replay.
_load_flakey_table $FLAKEY_ALLOW_WRITES
_mount_flakey
# Now verify both our files have a link count of 2.
echo "Link count for file foo: $(stat --format=%h $SCRATCH_MNT/testdir/foo)"
echo "Link count for file bar: $(stat --format=%h $SCRATCH_MNT/testdir/bar)"
# We should be able to remove all the links of our files in testdir, and
# after that the parent directory should become empty and therefore
# possible to remove it.
rm -f $SCRATCH_MNT/testdir/*
rmdir $SCRATCH_MNT/testdir
_unmount_flakey
# The fstests framework will call fsck against our filesystem which will verify
# that all metadata is in a consistent state.
status=0
exit
The test fails with:
-Link count for file foo: 2
+Link count for file foo: 1
Link count for file bar: 2
+rm: cannot remove '/home/fdmanana/btrfs-tests/scratch_1/testdir/foo_link': Stale file handle
+rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/testdir': Directory not empty
(...)
_check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent
And fsck's output:
(...)
checking fs roots
root 5 inode 258 errors 2001, no inode item, link count wrong
unresolved ref dir 257 index 5 namelen 8 name foo_link filetype 1 errors 4, no inode ref
Checking filesystem on /dev/sdc
(...)
So fix this by marking inodes for link count fixup at log replay time
whenever a directory entry is replayed if the entry was created in the
transaction where the fsync was made and if it points to a non-directory
inode.
This isn't a new problem/regression, the issue exists for a long time,
possibly since the log tree feature was added (2008).
Filipe Manana [Tue, 14 Jul 2015 15:09:39 +0000 (16:09 +0100)]
Btrfs: fix file corruption after cloning inline extents
Using the clone ioctl (or extent_same ioctl, which calls the same extent
cloning function as well) we end up allowing copy an inline extent from
the source file into a non-zero offset of the destination file. This is
something not expected and that the btrfs code is not prepared to deal
with - all inline extents must be at a file offset equals to 0.
For example, the following excerpt of a test case for fstests triggers
a crash/BUG_ON() on a write operation after an inline extent is cloned
into a non-zero offset:
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount
# Create our test files. File foo has the same 2K of data at offset 4K
# as file bar has at its offset 0.
$XFS_IO_PROG -f -s -c "pwrite -S 0xaa 0 4K" \
-c "pwrite -S 0xbb 4k 2K" \
-c "pwrite -S 0xcc 8K 4K" \
$SCRATCH_MNT/foo | _filter_xfs_io
# File bar consists of a single inline extent (2K size).
$XFS_IO_PROG -f -s -c "pwrite -S 0xbb 0 2K" \
$SCRATCH_MNT/bar | _filter_xfs_io
# Now call the clone ioctl to clone the extent of file bar into file
# foo at its offset 4K. This made file foo have an inline extent at
# offset 4K, something which the btrfs code can not deal with in future
# IO operations because all inline extents are supposed to start at an
# offset of 0, resulting in all sorts of chaos.
# So here we validate that clone ioctl returns an EOPNOTSUPP, which is
# what it returns for other cases dealing with inlined extents.
$CLONER_PROG -s 0 -d $((4 * 1024)) -l $((2 * 1024)) \
$SCRATCH_MNT/bar $SCRATCH_MNT/foo
# Because of the inline extent at offset 4K, the following write made
# the kernel crash with a BUG_ON().
$XFS_IO_PROG -c "pwrite -S 0xdd 6K 2K" $SCRATCH_MNT/foo | _filter_xfs_io
status=0
exit
The stack trace of the BUG_ON() triggered by the last write is:
Fix this by returning the error EOPNOTSUPP if an attempt to copy an
inline extent into a non-zero offset happens, just like what is done for
other scenarios that would require copying/splitting inline extents,
which were introduced by the following commits:
00fdf13a2e9f ("Btrfs: fix a crash of clone with inline extents's split") 3f9e3df8da3c ("btrfs: replace error code from btrfs_drop_extents")
# gpg: Signature made Fri Jul 10 12:46:22 2015 EDT using RSA key ID 6092693E
# gpg: Good signature from "Greg Kroah-Hartman (Linux kernel stable release signing key) <greg@kroah.com>"
# gpg: WARNING: This key is not certified with a trusted signature!
# gpg: There is no indication that the signature belongs to the owner.
# Primary key fingerprint: 647F 2865 4894 E3BD 4571 99BE 38DB BDC8 6092 693E
Commit e4502c63f56aeca88 (ufs: deal with nfsd/iget races) made ufs
create inodes with I_NEW flag set. However ufs_mkdir() never cleared
this flag. Thus if someone ever tried to lookup the directory by inode
number, he would deadlock waiting for I_NEW to be cleared. Luckily this
mostly happens only if the filesystem is exported over NFS since
otherwise we have the inode attached to dentry and don't look it up by
inode number. In rare cases dentry can get freed without inode being
freed and then we'd hit the deadlock even without NFS export.
Fix the problem by clearing I_NEW before instantiating new directory
inode.
Fixes: e4502c63f56aeca887ced37f24e0def1ef11cec8 Reported-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Commit e4502c63f56aeca88 (ufs: deal with nfsd/iget races) introduced
unlock_new_inode() call into ufs_add_nondir(). However that function
gets called also from ufs_link() which hands it already initialized
inode and thus unlock_new_inode() complains. The problem is harmless but
annoying.
Fix the problem by opencoding necessary stuff in ufs_link()
Fixes: e4502c63f56aeca887ced37f24e0def1ef11cec8 Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Limit the mounts fs_fully_visible considers to locked mounts.
Unlocked can always be unmounted so considering them adds hassle
but no security benefit.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The warning message in prepend_path is unclear and outdated. It was
added as a warning that the mechanism for generating names of pseudo
files had been removed from prepend_path and d_dname should be used
instead. Unfortunately the warning reads like a general warning,
making it unclear what to do with it.
Remove the warning. The transition it was added to warn about is long
over, and I added code several years ago which in rare cases causes
the warning to fire on legitimate code, and the warning is now firing
and scaring people for no good reason.
Reported-by: Ivan Delalande <colona@arista.com> Reported-by: Omar Sandoval <osandov@osandov.com> Fixes: f48cfddc6729e ("vfs: In d_path don't call d_dname on a mount point") Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Commit 0244756edc4b98c ("ufs: sb mutex merge + mutex_destroy") generated
deadlocks in read/write mode on mkdir.
This patch partially reverts it keeping fixes by Andrew Morton and
mutex_destroy()
[AV: fixed a missing bit in ufs_remount()]
Signed-off-by: Fabian Frederick <fabf@skynet.be> Reported-by: Ian Campbell <ian.campbell@citrix.com> Suggested-by: Jan Kara <jack@suse.cz> Cc: Ian Campbell <ian.campbell@citrix.com> Cc: Evgeniy Dushistov <dushistov@mail.ru> Cc: Alexey Khoroshilov <khoroshilov@ispras.ru> Cc: Roger Pau Monne <roger.pau@citrix.com> Cc: Ian Jackson <Ian.Jackson@eu.citrix.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This reverts commit 9ef7db7f38d0 ("ufs: fix deadlocks introduced by sb
mutex merge") That patch tried to solve commit 0244756edc4b98c ("ufs: sb
mutex merge + mutex_destroy") which is itself partially reverted due to
multiple deadlocks.
Signed-off-by: Fabian Frederick <fabf@skynet.be> Suggested-by: Jan Kara <jack@suse.cz> Cc: Ian Campbell <ian.campbell@citrix.com> Cc: Evgeniy Dushistov <dushistov@mail.ru> Cc: Alexey Khoroshilov <khoroshilov@ispras.ru> Cc: Roger Pau Monne <roger.pau@citrix.com> Cc: Ian Jackson <Ian.Jackson@eu.citrix.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
file_remove_suid() could mistakenly set S_NOSEC inode bit when root was
modifying the file. As a result following writes to the file by ordinary
user would avoid clearing suid or sgid bits.
Fix the bug by checking actual mode bits before setting S_NOSEC.
Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Writes were a bit racy, but hard to turn into a bug at the same time.
(Particularly because modern Linux doesn't use this feature anymore.)
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
[Actually the next patch makes it much, much easier to trigger the race
so I'm including this one for stable@ as well. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Eric noticed problems with vhost-scsi and virtio-ccw: vhost-scsi
complained about overwriting values in the config space, which
was triggered by a broken implementation of virtio-ccw's config
get/set routines. It was probably sheer luck that we did not hit
this before.
When writing a value to the config space, the WRITE_CONF ccw will
always write from the beginning of the config space up to and
including the value to be set. If the config space up to the value
has not yet been retrieved from the device, however, we'll end up
overwriting values. Keep track of the known config space and update
if needed to avoid this.
Moreover, READ_CONF will only read the number of bytes it has been
instructed to retrieve, so we must not copy more than that to the
buffer, or we might overwrite trailing values.
Reported-by: Eric Farman <farman@linux.vnet.ibm.com> Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com> Reviewed-by: Eric Farman <farman@linux.vnet.ibm.com> Tested-by: Eric Farman <farman@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The REGSET_VX_LOW ELF notes should contain the lower 64 bit halfes of the
first sixteen 128 bit vector registers. Unfortunately currently we copy
the upper halfes.
Fix this and correctly copy the lower halfes.
Fixes: a62bc0739253 ("s390/kdump: add support for vector extension") Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Commit ea5f49692575 ("KVM: s390: only one external call may be pending
at a time") introduced a bug on machines that don't have SIGP
interpretation facility installed.
The injection of an external call will now always fail with -EBUSY
(if none is already pending).
This leads to the following symptoms:
- An external call will be injected but with the wrong "src cpu id",
as this id will not be remembered.
- The target vcpu will not be woken up, therefore the guest will hang if
it cannot deal with unexpected failures of the SIGP EXTERNAL CALL
instruction.
- If an external call is already pending, -EBUSY will not be reported.
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com> Reviewed-by: Jens Freimann <jfrei@linux.vnet.ibm.com> Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
KVM guest kernels for trap & emulate run in user mode, with a modified
set of kernel memory segments. However the fixmap address is still in
the normal KSeg3 region at 0xfffe0000 regardless, causing problems when
cache alias handling makes use of them when handling copy on write.
Therefore define FIXADDR_TOP as 0x7ffe0000 in the guest kernel mapped
region when CONFIG_KVM_GUEST is defined.
The Foxconn K8M890-8237A has two PCI host bridges, and we can't assign
resources correctly without the information from _CRS that tells us which
address ranges are claimed by which bridge. In the bugs mentioned below,
we incorrectly assign a sound card address (this example is from 1033299):
We enable _CRS on all systems from 2008 and later. On older systems, we
ignore _CRS and assume the whole physical address space (excluding RAM and
other devices) is available for PCI devices, but on systems that support
physical address spaces larger than 4GB, it's doubtful that the area above
4GB is really available for PCI.
After d56dbf5bab8c ("PCI: Allocate 64-bit BARs above 4G when possible"), we
try to use that space above 4GB *first*, so we're more likely to put a
device there.
On Juan's Toshiba Satellite Pro U200, BIOS left the graphics, sound, 1394,
and card reader devices unassigned (but only after Windows had been
booted). Only the sound device had a 64-bit BAR, so it was the only device
placed above 4GB, and hence the only device that didn't work.
Keep _CRS enabled even on pre-2008 systems if they support physical address
space larger than 4GB.
Fixes: d56dbf5bab8c ("PCI: Allocate 64-bit BARs above 4G when possible") Reported-and-tested-by: Juan Dayer <jdayer@outlook.com> Reported-and-tested-by: Alan Horsfield <alan@hazelgarth.co.uk> Link: https://bugzilla.kernel.org/show_bug.cgi?id=99221 Link: https://bugzilla.opensuse.org/show_bug.cgi?id=907092 Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When we take a PMU exception or a software event we call
perf_read_regs(). This overloads regs->result with a boolean that
describes if we should use the sampled instruction address register
(SIAR) or the regs.
If the exception is in kernel, we start with the kernel regs and
backtrace through the kernel stack. At this point we switch to the
userspace regs and backtrace the user stack with perf_callchain_user().
Unfortunately these regs have not got the perf_read_regs() treatment,
so regs->result could be anything. If it is non zero,
perf_instruction_pointer() decides to use the SIAR, and we get issues
like this:
On some archs, the local clockevent device stops in deep cpuidle states.
The broadcast framework is used to wakeup cpus in these idle states, in
which either an external clockevent device is used to send wakeup ipis
or the hrtimer broadcast framework kicks in in the absence of such a
device. One cpu is nominated as the broadcast cpu and this cpu sends
wakeup ipis to sleeping cpus at the appropriate time. This is the
implementation in the oneshot mode of broadcast.
In periodic mode of broadcast however, the presence of such cpuidle
states results in the cpuidle driver calling tick_broadcast_enable()
which shuts down the local clockevent devices of all the cpus and
appoints the tick broadcast device as the clockevent device for each of
them. This works on those archs where the tick broadcast device is a
real clockevent device. But on archs which depend on the hrtimer mode
of broadcast, the tick broadcast device hapens to be a pseudo device.
The consequence is that the local clockevent devices of all cpus are
shutdown and the kernel hangs at boot time in periodic mode.
Let us thus not register the cpuidle states which have
CPUIDLE_FLAG_TIMER_STOP flag set, on archs which depend on the hrtimer
mode of broadcast in periodic mode. This patch takes care of doing this
on powerpc. The cpus would not have entered into such deep cpuidle
states in periodic mode on powerpc anyway. So there is no loss here.
Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The current Armada XP suspend to RAM implementation, as added in
commit 27432825ae19f ("ARM: mvebu: Armada XP GP specific
suspend/resume code") does not handle big-endian configurations
properly: the small bit of assembly code putting the DRAM in
self-refresh and toggling the GPIOs to turn off power forgets to
convert the values to little-endian.
This commit fixes that by making sure the two values we will write to
the DRAM controller register and GPIO register are already in
little-endian before entering the critical assembly code.
Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com> Fixes: 27432825ae19f ("ARM: mvebu: Armada XP GP specific suspend/resume code") Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Commit 7232398abc6a ("ARM: tegra: Convert PMC to a driver") changed tegra_resume()
location storing from late to early and, as a result, broke suspend on Tegra20.
PMC scratch register 41 is used by tegra LP1 resume code for retrieving stored
physical memory address of common resume function and in the same time used by
tegra20_cpu_shutdown() (shared by Tegra20 cpuidle driver and platform SMP code),
which is storing CPU1 "resettable" status. It implies strict order of scratch
register usage, otherwise resume function address is lost on Tegra20 after
disabling non-boot CPU's on suspend. Fix it by storing "resettable" status in
IRAM instead of PMC scratch register.
According to the PSCI specification and the SMC/HVC calling
convention, PSCI function_ids that are not implemented must
return NOT_SUPPORTED as return value.
Current KVM implementation takes an unhandled PSCI function_id
as an error and injects an undefined instruction into the guest
if PSCI implementation is called with a function_id that is not
handled by the resident PSCI version (ie it is not implemented),
which is not the behaviour expected by a guest when calling a
PSCI function_id that is not implemented.
This patch fixes this issue by returning NOT_SUPPORTED whenever
the kvm PSCI call is executed for a function_id that is not
implemented by the PSCI kvm layer.
Cc: Christoffer Dall <christoffer.dall@linaro.org> Acked-by: Sudeep Holla <sudeep.holla@arm.com> Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
On VM entry, we disable access to the VFP registers in order to
perform a lazy save/restore of these registers.
On VM exit, we restore access, test if we did enable them before,
and save/restore the guest/host registers if necessary. In this
sequence, the FPEXC register is always accessed, irrespective
of the trapping configuration.
If the guest didn't touch the VFP registers, then the HCPTR access
has now enabled such access, but we're missing a barrier to ensure
architectural execution of the new HCPTR configuration. If the HCPTR
access has been delayed/reordered, the subsequent access to FPEXC
will cause a trap, which we aren't prepared to handle at all.
The same condition exists when trapping to enable VFP for the guest.
The fix is to introduce a barrier after enabling VFP access. In the
vmexit case, it can be relaxed to only takes place if the guest hasn't
accessed its view of the VFP registers, making the access to FPEXC safe.
The set_hcptr macro is modified to deal with both vmenter/vmexit and
vmtrap operations, and now takes an optional label that is branched to
when the guest hasn't touched the VFP registers.
Before calling into the filesystem, vfs_setxattr calls
security_inode_setxattr, which ends up calling selinux_inode_setxattr in
our case. That returns -EOPNOTSUPP whenever SBLABEL_MNT is not set.
SBLABEL_MNT was supposed to be set by sb_finish_set_opts, which sets it
only if selinux_is_sblabel_mnt returns true.
The selinux_is_sblabel_mnt logic was broken by eadcabc697e9 "SELinux: do
all flags twiddling in one place", which didn't take into the account
the SECURITY_FS_USE_NATIVE behavior that had been introduced for nfs
with eb9ae686507b "SELinux: Add new labeling type native labels".
This caused setxattr's of security labels over NFSv4.2 to fail.
Cc: Eric Paris <eparis@redhat.com> Cc: David Quigley <dpquigl@davequigley.com> Reported-by: Richard Chan <rc556677@outlook.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Acked-by: Stephen Smalley <sds@tycho.nsa.gov>
[PM: added the stable dependency] Signed-off-by: Paul Moore <pmoore@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Commit 007bea098b86 (intel_pstate: Add setting voltage value for
baytrail P states.) introduced byt_set_pstate() with the assumption that
it would always be run by the CPU whose MSR is to be written by it. It
turns out, however, that is not always the case in practice, so modify
byt_set_pstate() to enforce the MSR write done by it to always happen on
the right CPU.
Fixes: 007bea098b86 (intel_pstate: Add setting voltage value for baytrail P states.) Signed-off-by: Joe Konno <joe.konno@intel.com> Acked-by: Kristen Carlson Accardi <kristen@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When dma mapping (dma_map_sg) fails in sdhci_pre_dma_transfer, -EINVAL
is returned. There are 3 callers of sdhci_pre_dma_transfer:
* sdhci_pre_req and sdhci_adma_table_pre: handle negative return
* sdhci_prepare_data: handles 0 (error) and "else" (good) only
So teach sdhci_prepare_data to understand negative return values from
sdhci_pre_dma_transfer and disable DMA in that case, as well as for
zero.
It was introduced in 348487cb28e66b032bae1b38424d81bf5b444408 (mmc:
sdhci: use pipeline mmc requests to improve performance). The commit
seems to be suspicious also by assigning host->sg_count both in
sdhci_pre_dma_transfer and sdhci_adma_table_pre.
Commit 83a60ed8f0b5 ("iommu/arm-smmu: fix ARM_SMMU_FEAT_TRANS_OPS
condition") accidentally negated the ID0_ATOSNS predicate in the ATOS
feature check, causing the driver to attempt ATOS requests on SMMUv2
hardware without the ATOS feature implemented.
This patch restores the predicate to the correct value.
The conversion to be16_add_cpu() is incorrect in case cryptlen is
negative due to premature (i.e. before addition / subtraction)
implicit conversion of cryptlen (int -> u16) leading to sign loss.
Fixes: 1d11911a8c57 ("crypto: talitos - fix warning: 'alg' may be used uninitialized in this function") Signed-off-by: Horia Geanta <horia.geanta@freescale.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
ffs_closed can race with configfs_rmdir which will call config_item_release, so
add an extra check to avoid calling the unregister_gadget_item with an null
gadget item.
Signed-off-by: Rui Miguel Silva <rui.silva@linaro.org> Signed-off-by: Felipe Balbi <balbi@ti.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
when copying to iter the size can be different then the iov count,
the check for full iov is wrong and make any read on request which
is not the exactly size of iov to return -EFAULT.
So, just check the success of the copy.
Signed-off-by: Rui Miguel Silva <rui.silva@linaro.org> Signed-off-by: Felipe Balbi <balbi@ti.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The Ethernet controller found in the Armada 370, 380 and 385 SoCs don't
support TCP/IP checksumming with frame sizes larger than 1600 bytes.
This patch fixes the issue by disabling the features NETIF_F_IP_CSUM and
NETIF_F_TSO for the Armada 370 and compatibles SoCs when the MTU is set
to a value greater than 1600 bytes.
Signed-off-by: Simon Guinot <simon.guinot@sequanux.org> Fixes: c5aff18204da ("net: mvneta: driver for Marvell Armada 370/XP network unit") Cc: <stable@vger.kernel.org> # v3.8+ Acked-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This patch updates the Ethernet DT nodes for Armada XP SoCs with the
compatible string "marvell,armada-xp-neta".
Signed-off-by: Simon Guinot <simon.guinot@sequanux.org> Fixes: 77916519cba3 ("arm: mvebu: Armada XP MV78230 has only three Ethernet interfaces") Cc: <stable@vger.kernel.org> # v3.8+ Acked-by: Gregory CLEMENT <gregory.clement@free-electrons.com> Reviewed-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The mvneta driver supports the Ethernet IP found in the Armada 370, XP,
380 and 385 SoCs. Since at least one more hardware feature is available
for the Armada XP SoCs then a way to identify them is needed.
This patch introduces a new compatible string "marvell,armada-xp-neta".
Signed-off-by: Simon Guinot <simon.guinot@sequanux.org> Fixes: c5aff18204da ("net: mvneta: driver for Marvell Armada 370/XP network unit") Cc: <stable@vger.kernel.org> # v3.8+ Acked-by: Gregory CLEMENT <gregory.clement@free-electrons.com> Acked-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When allocating Rx related buffers, alloc_pages is called using an order
number that is decreased until successful. A system under stress can
experience failures during this allocation process resulting in a warning
being issued. This message can be of concern to end users even though the
failure is not fatal. Since the failure is not fatal and can occur
multiple times, the driver should include the __GFP_NOWARN flag to
suppress the warning message from being issued.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
There is NULL pointer dereference possible during statistics update if the route
used for OOTB responce is removed at unfortunate time. If the route exists when
we receive OOTB packet and we finally jump into sctp_packet_transmit() to send
ABORT, but in the meantime route is removed under our feet, we take "no_route"
path and try to update stats with IP_INC_STATS(sock_net(asoc->base.sk), ...).
But sctp_ootb_pkt_new() used to prepare responce packet doesn't call
sctp_transport_set_owner() and therefore there is no asoc associated with this
packet. Probably temporary asoc just for OOTB responces is overkill, so just
introduce a check like in all other places in sctp_packet_transmit(), where
"asoc" is dereferenced.
To reproduce this, one needs to
0. ensure that sctp module is loaded (otherwise ABORT is not generated)
1. remove default route on the machine
2. while true; do
ip route del [interface-specific route]
ip route add [interface-specific route]
done
3. send enough OOTB packets (i.e. HB REQs) from another host to trigger ABORT
responce
As bnx2x_init_ptp() is only called if bp->flags contains PTP_SUPPORTED,
we also need to guard bnx2x_stop_ptp() with same condition, otherwise
ptp_task workqueue is not initialized and kernel barfs on
cancel_work_sync()
Fixes: eeed018cbfa30 ("bnx2x: Add timestamping and PTP hardware clock support") Reported-by: Michel Lespinasse <walken@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Michal Kalderon <Michal.Kalderon@qlogic.com> Cc: Ariel Elior <Ariel.Elior@qlogic.com> Cc: Yuval Mintz <Yuval.Mintz@qlogic.com> Cc: David Decotigny <decot@google.com> Acked-by: Sony Chacko <sony.chacko@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When limiting phy link speed using "max-speed" to 100mbps or less on a
giga bit phy, phy never completes auto negotiation and phy state
machine is held in PHY_AN. Fixing this issue by comparing the giga
bit advertise though phydev->supported doesn't have it but phy has
BMSR_ESTATEN set. So that auto negotiation is restarted as old and
new advertise are different and link comes up fine.
Signed-off-by: Mugunthan V N <mugunthanvnm@ti.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When in HA mode, the driver exposes an IB (RoCE) device instance with only
one port. Under SRIOV, the existing implementation doesn't go well with
the PF RoCE driver's role of Special QPs Para-Virtualization, etc.
As such, disable HA for the mlx4 PF RoCE device in SRIOV mode.
Fixes: a57500903093 ('IB/mlx4: Add port aggregation support') Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The check_csum() function relied on hwtstamp_rx_filter to know if rxvlan
offload is disabled. This is wrong since rxvlan offload can be switched
on/off regardless of hwtstamp_rx_filter.
Also moved check_csum to query CQE information to identify VLAN packets
and removed the check of IP packets, since it has been validated before.
Fixes: f8c6455bb04b ('net/mlx4_en: Extend checksum offloading by CHECKSUM COMPLETE') Signed-off-by: Ido Shamay <idos@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Indication of a single completed packet, marked by txbbs_skipped
being bigger then zero, in not enough in order to wake up a
stopped TX queue. The completed packet may contain a single TXBB,
while next packet to be sent (after the wake up) may have multiple
TXBBs (LSO/TSO packets for example), causing overflow in queue followed
by WQE corruption and TX queue timeout.
Instead, wake the stopped queue only when there's enough room for the
worst case (maximum sized WQE) packet that we should need to handle after
the queue is opened again.
Also created an helper routine - mlx4_en_is_tx_ring_full, which checks
if the current TX ring is full or not. It provides better code readability
and removes code duplication.
Signed-off-by: Ido Shamay <idos@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
TX ring QP wasn't released at mlx4_en_destroy_tx_ring. Instead, the code
used the deprecated base_tx_qpn field. Move TX QP release to
mlx4_en_destroy_tx_ring and remove the base_tx_qpn field.
Fixes: ddae0349fdb7 ('net/mlx4: Change QP allocation scheme') Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
ICMP messages can trigger ICMP and local errors. In this case
serr->port is 0 and starting from Linux 4.0 we do not return
the original target address to the error queue readers.
Add function to define which errors provide addr_offset.
With this fix my ping command is not silent anymore.
Fixes: c247f0534cc5 ("ip: fix error queue empty skb handling") Signed-off-by: Julian Anastasov <ja@ssi.bg> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
tcp_fastopen_reset_cipher really cannot be called from interrupt
context. It allocates the tcp_fastopen_context with GFP_KERNEL and
calls crypto_alloc_cipher, which allocates all kind of stuff with
GFP_KERNEL.
Thus, we might sleep when the key-generation is triggered by an
incoming TFO cookie-request which would then happen in interrupt-
context, as shown by enabling CONFIG_DEBUG_ATOMIC_SLEEP:
This patch moves the call to tcp_fastopen_init_key_once to the places
where a listener socket creates its TFO-state, which always happens in
user-context (either from the setsockopt, or implicitly during the
listen()-call)
Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Fixes: 222e83d2e0ae ("tcp: switch tcp_fastopen key generation to net_get_random_once") Signed-off-by: Christoph Paasch <cpaasch@apple.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The lockless lookups can return entry that is unlinked.
Sometimes they get reference before last neigh_cleanup_and_release,
sometimes they do not need reference. Later, any
modification attempts may result in the following problems:
1. entry is not destroyed immediately because neigh_update
can start the timer for dead entry, eg. on change to NUD_REACHABLE
state. As result, entry lives for some time but is invisible
and out of control.
2. __neigh_event_send can run in parallel with neigh_destroy
while refcnt=0 but if timer is started and expired refcnt can
reach 0 for second time leading to second neigh_destroy and
possible crash.
Thanks to Eric Dumazet and Ying Xue for their work and analyze
on the __neigh_event_send change.
Fixes: 767e97e1e0db ("neigh: RCU conversion of struct neighbour") Fixes: a263b3093641 ("ipv4: Make neigh lookups directly in output packet path.") Fixes: 6fd6ce2056de ("ipv6: Do not depend on rt->n in ip6_finish_output2().") Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Ying Xue <ying.xue@windriver.com> Signed-off-by: Julian Anastasov <ja@ssi.bg> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
PACKET_FANOUT_LB computes f->rr_cur such that it is modulo
f->num_members. It returns the old value unconditionally, but
f->num_members may have changed since the last store. Ensure
that the return value is always < num.
When modifying the logic, simplify it further by replacing the loop
with an unconditional atomic increment.
Fixes: dc99f600698d ("packet: Add fanout support.") Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
We need to tell compiler it must not read f->num_members multiple
times. Otherwise testing if num is not zero is flaky, and we could
attempt an invalid divide by 0 in fanout_demux_cpu()
Note bug was present in packet_rcv_fanout_hash() and
packet_rcv_fanout_lb() but final 3.1 had a simple location
after commit 95ec3eb417115fb ("packet: Add 'cpu' fanout policy.")
Fixes: dc99f600698dc ("packet: Add fanout support.") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
After the ->set() spinlocks were removed br_stp_set_bridge_priority
was left running without any protection when used via sysfs. It can
race with port add/del and could result in use-after-free cases and
corrupted lists. Tested by running port add/del in a loop with stp
enabled while setting priority in a loop, crashes are easily
reproducible.
The spinlocks around sysfs ->set() were removed in commit: 14f98f258f19 ("bridge: range check STP parameters")
There's also a race condition in the netlink priority support that is
fixed by this change, but it was introduced recently and the fixes tag
covers it, just in case it's needed the commit is: af615762e972 ("bridge: add ageing_time, stp_state, priority over netlink")
Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org> Fixes: 14f98f258f19 ("bridge: range check STP parameters") Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
->auto_asconf_splist is per namespace and mangled by functions like
sctp_setsockopt_auto_asconf() which doesn't guarantee any serialization.
Also, the call to inet_sk_copy_descendant() was backuping
->auto_asconf_list through the copy but was not honoring
->do_auto_asconf, which could lead to list corruption if it was
different between both sockets.
This commit thus fixes the list handling by using ->addr_wq_lock
spinlock to protect the list. A special handling is done upon socket
creation and destruction for that. Error handlig on sctp_init_sock()
will never return an error after having initialized asconf, so
sctp_destroy_sock() can be called without addrq_wq_lock. The lock now
will be take on sctp_close_sock(), before locking the socket, so we
don't do it in inverse order compared to sctp_addr_wq_timeout_handler().
Instead of taking the lock on sctp_sock_migrate() for copying and
restoring the list values, it's preferred to avoid rewritting it by
implementing sctp_copy_descendant().
Issue was found with a test application that kept flipping sysctl
default_auto_asconf on and off, but one could trigger it by issuing
simultaneous setsockopt() calls on multiple sockets or by
creating/destroying sockets fast enough. This is only triggerable
locally.
Fixes: 9f7d653b67ae ("sctp: Add Auto-ASCONF support (core).") Reported-by: Ji Jianwen <jiji@redhat.com> Suggested-by: Neil Horman <nhorman@tuxdriver.com> Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
We saw excessive direct memory compaction triggered by skb_page_frag_refill.
This causes performance issues and add latency. Commit 5640f7685831e0
introduces the order-3 allocation. According to the changelog, the order-3
allocation isn't a must-have but to improve performance. But direct memory
compaction has high overhead. The benefit of order-3 allocation can't
compensate the overhead of direct memory compaction.
This patch makes the order-3 page allocation atomic. If there is no memory
pressure and memory isn't fragmented, the alloction will still success, so we
don't sacrifice the order-3 benefit here. If the atomic allocation fails,
direct memory compaction will not be triggered, skb_page_frag_refill will
fallback to order-0 immediately, hence the direct memory compaction overhead is
avoided. In the allocation failure case, kswapd is waken up and doing
compaction, so chances are allocation could success next time.
alloc_skb_with_frags is the same.
The mellanox driver does similar thing, if this is accepted, we must fix
the driver too.
V3: fix the same issue in alloc_skb_with_frags as pointed out by Eric
V2: make the changelog clearer
Cc: Eric Dumazet <edumazet@google.com> Cc: Chris Mason <clm@fb.com> Cc: Debabrata Banerjee <dbavatar@gmail.com> Signed-off-by: Shaohua Li <shli@fb.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When programming the start of a periodic output, the code wrongly places
the seconds value into the "low" register and the nanoseconds into the
"high" register. Even though this is backwards, it slipped through my
testing, because the re-arming code in the interrupt service routine is
correct, and the signal does appear starting with the second edge.
This patch fixes the issue by programming the registers correctly.
Signed-off-by: Richard Cochran <richardcochran@gmail.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Since the addition of sysfs multicast router support if one set
multicast_router to "2" more than once, then the port would be added to
the hlist every time and could end up linking to itself and thus causing an
endless loop for rlist walkers.
So to reproduce just do:
echo 2 > multicast_router; echo 2 > multicast_router;
in a bridge port and let some igmp traffic flow, for me it hangs up
in br_multicast_flood().
Fix this by adding a check in br_multicast_add_router() if the port is
already linked.
The reason this didn't happen before the addition of multicast_router
sysfs entries is because there's a !hlist_unhashed check that prevents
it.
Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org> Fixes: 0909e11758bd ("bridge: Add multicast_router sysfs entries") Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Since it is possible for vnet_event_napi to end up doing
vnet_control_pkt_engine -> ... -> vnet_send_attr ->
vnet_port_alloc_tx_ring -> ldc_alloc_exp_dring -> kzalloc()
(i.e., in softirq context), kzalloc() should be called with
GFP_ATOMIC from ldc_alloc_exp_dring.
If hardware doesn't support DecodeAssist - a feature that provides
more information about the intercept in the VMCB, KVM decodes the
instruction and then updates the next_rip vmcb control field.
However, NRIP support itself depends on cpuid Fn8000_000A_EDX[NRIPS].
Since skip_emulated_instruction() doesn't verify nrip support
before accepting control.next_rip as valid, avoid writing this
field if support isn't present.
Signed-off-by: Bandan Das <bsd@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
A huge amount of NIC drivers use the DMA API, however if
compiled under 32-bit an very important part of the DMA API can
be ommitted leading to the drivers not working at all
(especially if used with 'swiotlb=force iommu=soft').
As Prashant Sreedharan explains it: "the driver [tg3] uses
DEFINE_DMA_UNMAP_ADDR(), dma_unmap_addr_set() to keep a copy of
the dma "mapping" and dma_unmap_addr() to get the "mapping"
value. On most of the platforms this is a no-op, but ... with
"iommu=soft and swiotlb=force" this house keeping is required,
... otherwise we pass 0 while calling pci_unmap_/pci_dma_sync_
instead of the DMA address."
As such enable this even when using 32-bit kernels.
Reported-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Prashant Sreedharan <prashant@broadcom.com> Cc: Borislav Petkov <bp@alien8.de> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michael Chan <mchan@broadcom.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: boris.ostrovsky@oracle.com Cc: cascardo@linux.vnet.ibm.com Cc: david.vrabel@citrix.com Cc: sanjeevb@broadcom.com Cc: siva.kallam@broadcom.com Cc: vyasevich@gmail.com Cc: xen-devel@lists.xensource.com Link: http://lkml.kernel.org/r/20150417190448.GA9462@l.oracle.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Filipe Manana [Fri, 3 Jul 2015 19:30:34 +0000 (20:30 +0100)]
Btrfs: fix list transaction->pending_ordered corruption
When we call btrfs_commit_transaction(), we splice the list "ordered"
of our transaction handle into the transaction's "pending_ordered"
list, but we don't re-initialize the "ordered" list of our transaction
handle, this means it still points to the same elements it used to
before the splice. Then we check if the current transaction's state is
>= TRANS_STATE_COMMIT_START and if it is we end up calling
btrfs_end_transaction() which simply splices again the "ordered" list
of our handle into the transaction's "pending_ordered" list, leaving
multiple pointers to the same ordered extents which results in list
corruption when we are iterating, removing and freeing ordered extents
at btrfs_wait_pending_ordered(), resulting in access to dangling
pointers / use-after-free issues.
Similarly, btrfs_end_transaction() can end up in some cases calling
btrfs_commit_transaction(), and both did a list splice of the transaction
handle's "ordered" list into the transaction's "pending_ordered" without
re-initializing the handle's "ordered" list, resulting in exactly the
same problem.
This produces the following warning on a kernel with linked list
debugging enabled:
On a non-debug kernel this leads to invalid memory accesses, causing a
crash. Fix this by using list_splice_init() instead of list_splice() in
btrfs_commit_transaction() and btrfs_end_transaction().
Cc: stable@vger.kernel.org Fixes: 50d9aa99bd35 ("Btrfs: make sure logged extents complete in the current transaction V3" Signed-off-by: Filipe Manana <fdmanana@suse.com>
(cherry picked from commit eab82377945d1d039982f568cc6583bb71ceb3f3)
Filipe Manana [Fri, 12 Jun 2015 08:35:35 +0000 (09:35 +0100)]
Btrfs: use kmem_cache_free when freeing entry in inode cache
The free space entries are allocated using kmem_cache_zalloc(),
through __btrfs_add_free_space(), therefore we should use
kmem_cache_free() and not kfree() to avoid any confusion and
any potential problem. Looking at the kfree() definition at
mm/slab.c it has the following comment:
/*
* (...)
*
* Don't free memory not originally allocated by kmalloc()
* or you will run into trouble.
*/
So better be safe and use kmem_cache_free().
Cc: stable@vger.kernel.org Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.cz>
(cherry picked from commit af6bf76d1de143a38c919572462899e9f1fc477f)
For logical reasons such as the phase of the moon, this happened more
often with "-o inode_cache" than without any mount options.
After some debugging it turned out to be simple to understand what was
happening:
1) close_ctree() is called;
2) It then stops the transaction kthread, which commits the current
transaction;
3) It asks the cleaner kthread to stop, which is currently running
btrfs_delete_unused_bgs();
4) btrfs_delete_unused_bgs() finds an unused block group, starts a new
transaction, deletes the block group, which implies COWing some
tree nodes and leafs and dirtying their respective pages, and then
finally it ends the transaction it started, without committing it;
5) The cleaner kthread stops;
6) close_ctree() releases (from memory) the block group objects, which
produces the warning in the trace pasted above;
7) Then it invalidates all pages of the btree inode, by calling
invalidate_inode_pages2(), which waits for any pages under writeback,
and releases any non-dirty pages;
8) All work queues are destroyed (waiting first for their current tasks
to finish execution);
9) A final iput() is called against the btree inode;
10) This iput triggers a writeback of the btree inode because it still
has dirty pages;
11) This starts the whole chain of callbacks for the btree inode until
it eventually reaches btrfs_wq_submit_bio() where it leads to a
NULL pointer dereference because the work queues were already
destroyed.
Fix this by making the cleaner commit any transaction that it started
after the transaction kthread was stopped.
Filipe Manana [Wed, 17 Jun 2015 09:16:23 +0000 (10:16 +0100)]
Btrfs: fix fsync data loss after append write
If we do an append write to a file (which increases its inode's i_size)
that does not have the flag BTRFS_INODE_NEEDS_FULL_SYNC set in its inode,
and the previous transaction added a new hard link to the file, which sets
the flag BTRFS_INODE_COPY_EVERYTHING in the file's inode, and then fsync
the file, the inode's new i_size isn't logged. This has the consequence
that after the fsync log is replayed, the file size remains what it was
before the append write operation, which means users/applications will
not be able to read the data that was successsfully fsync'ed before.
This happens because neither the inode item nor the delayed inode get
their i_size updated when the append write is made - doing so would
require starting a transaction in the buffered write path, something that
we do not do intentionally for performance reasons.
Fix this by making sure that when the flag BTRFS_INODE_COPY_EVERYTHING is
set the inode is logged with its current i_size (log the in-memory inode
into the log tree).
This issue is not a recent regression and is easy to reproduce with the
following test case for fstests:
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
here=`pwd`
tmp=/tmp/$$
status=1 # failure is the default!
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
. ./common/dmflakey
# real QA test starts here
_supported_fs generic
_supported_os Linux
_need_to_be_root
_require_scratch
_require_dm_flakey
_require_metadata_journaling $SCRATCH_DEV
_crash_and_mount()
{
# Simulate a crash/power loss.
_load_flakey_table $FLAKEY_DROP_WRITES
_unmount_flakey
# Allow writes again and mount. This makes the fs replay its fsync log.
_load_flakey_table $FLAKEY_ALLOW_WRITES
_mount_flakey
}
# Create the test file with some initial data and then fsync it.
# The fsync here is only needed to trigger the issue in btrfs, as it causes the
# the flag BTRFS_INODE_NEEDS_FULL_SYNC to be removed from the btrfs inode.
$XFS_IO_PROG -f -c "pwrite -S 0xaa 0 32k" \
-c "fsync" \
$SCRATCH_MNT/foo | _filter_xfs_io
sync
# Add a hard link to our file.
# On btrfs this sets the flag BTRFS_INODE_COPY_EVERYTHING on the btrfs inode,
# which is a necessary condition to trigger the issue.
ln $SCRATCH_MNT/foo $SCRATCH_MNT/bar
# Sync the filesystem to force a commit of the current btrfs transaction, this
# is a necessary condition to trigger the bug on btrfs.
sync
# Now append more data to our file, increasing its size, and fsync the file.
# In btrfs because the inode flag BTRFS_INODE_COPY_EVERYTHING was set and the
# write path did not update the inode item in the btree nor the delayed inode
# item (in memory struture) in the current transaction (created by the fsync
# handler), the fsync did not record the inode's new i_size in the fsync
# log/journal. This made the data unavailable after the fsync log/journal is
# replayed.
$XFS_IO_PROG -c "pwrite -S 0xbb 32K 32K" \
-c "fsync" \
$SCRATCH_MNT/foo | _filter_xfs_io
echo "File content after fsync and before crash:"
od -t x1 $SCRATCH_MNT/foo
_crash_and_mount
echo "File content after crash and log replay:"
od -t x1 $SCRATCH_MNT/foo
status=0
exit
The expected file output before and after the crash/power failure expects the
appended data to be available, which is:
0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
* 0100000 bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
* 0200000
Cc: stable@vger.kernel.org Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
(cherry picked from commit 94e2d24a55e76e9e1c80ae0767a8a2fcf4cc8c80)
Liu Bo [Wed, 17 Jun 2015 08:59:57 +0000 (16:59 +0800)]
Btrfs: fix hang when failing to submit bio of directIO
The hang is uncoverd by generic/019.
btrfs_endio_direct_write() skips the "finish_ordered_fn" part when it hits
an error, thus those added ordered extents will never get processed, which
block processes that waiting for them via btrfs_start_ordered_extent().
This fixes the above, and meanwhile finish_ordered_fn will do the space
accounting work.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Tested-by: Filipe Manana <fdmanana@suse.com>
(cherry picked from commit 70466207b172c10c72f22bf9e233b898611f497c)
Liu Bo [Wed, 17 Jun 2015 08:59:58 +0000 (16:59 +0800)]
Btrfs: fix warning of bytes_may_use
While running generic/019, dmesg got several warnings from
btrfs_free_reserved_data_space().
Test generic/019 produces some disk failures so sumbit dio will get errors,
in which case, btrfs_direct_IO() goes to the error handling and free
bytes_may_use, but the problem is that bytes_may_use has been free'd
during get_block().
This adds a runtime flag to show if we've gone through get_block(), if so,
don't do the cleanup work.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Tested-by: Filipe Manana <fdmanana@suse.com>
(cherry picked from commit f1cfda4899915f7a09556de874e1a54bdaf1687b)
Filipe Manana [Sat, 20 Jun 2015 17:20:09 +0000 (18:20 +0100)]
Btrfs: fix shrinking truncate when the no_holes feature is enabled
If the no_holes feature is enabled, we attempt to shrink a file to a size
that ends up in the middle of a hole and we don't have any file extent
items in the fs/subvol tree that go beyond the new file size (or any
ordered extents that will insert such file extent items), we end up not
updating the inode's disk_i_size, we only update the inode's i_size.
This means that after unmounting and mounting the filesystem, or after
the inode is evicted and reloaded, its i_size ends up being incorrect
(an inode's i_size is set to the disk_i_size field when an inode is
loaded). This happens when btrfs_truncate_inode_items() doesn't find
any file extent items to drop - in this case it never makes a call to
btrfs_ordered_update_i_size() in order to update the inode's disk_i_size.
Example reproducer:
$ mkfs.btrfs -O no-holes -f /dev/sdd
$ mount /dev/sdd /mnt
# Create our test file with some data and durably persist it.
$ xfs_io -f -c "pwrite -S 0xaa 0 128K" /mnt/foo
$ sync
# Append some data to the file, increasing its size, and leave a hole
# between the old size and the start offset if the following write. So
# our file gets a hole in the range [128Kb, 256Kb[.
$ xfs_io -c "truncate 160K" /mnt/foo
# We expect to see our file with a size of 160Kb, with the first 128Kb
# of data all having the value 0xaa and the remaining 32Kb of data all
# having the value 0x00.
$ od -t x1 /mnt/foo 0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
* 0400000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
* 0500000
# Now cleanly unmount and mount again the filesystem.
$ umount /mnt
$ mount /dev/sdd /mnt
# We expect to get the same result as before, a file with a size of
# 160Kb, with the first 128Kb of data all having the value 0xaa and the
# remaining 32Kb of data all having the value 0x00.
$ od -t x1 /mnt/foo 0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
* 0400000
In the example above the file size/data do not match what they were before
the remount.
Fix this by always calling btrfs_ordered_update_i_size() with a size
matching the size the file was truncated to if btrfs_truncate_inode_items()
is not called for a log tree and no file extent items were dropped. This
ensures the same behaviour as when the no_holes feature is not enabled.
Shilong Wang [Sun, 12 Apr 2015 06:35:20 +0000 (14:35 +0800)]
Btrfs: fix wrong check for btrfs_force_chunk_alloc()
btrfs_force_chunk_alloc() return 1 for allocation chunk successfully.
This problem exists since commit c87f08ca4.
With this patch, we might fix some enospc problems for balances.
Signed-off-by: Wang Shilong <wangshilong1991@gmail.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Tested-by: Filipe Manana <fdmanana@suse.com>
(cherry picked from commit 9ac2b7cb4755cb3311bb7d1ccf0eb51d0e006fba)
Filipe Manana [Mon, 29 Jun 2015 13:32:22 +0000 (14:32 +0100)]
Btrfs: fix memory corruption on failure to submit bio for direct IO
If we fail to submit a bio for a direct IO request, we were grabbing the
corresponding ordered extent and decrementing its reference count twice,
once for our lookup reference and once for the ordered tree reference.
This was a problem because it caused the ordered extent to be freed
without removing it from the ordered tree and any lists it might be
attached to, leaving dangling pointers to the ordered extent around.
Example trace with CONFIG_DEBUG_PAGEALLOC=y:
For read requests we weren't doing any cleanup either (none of the work
done by btrfs_endio_direct_read()), so a failure submitting a bio for a
read request would leave a range in the inode's io_tree locked forever,
blocking any future operations (both reads and writes) against that range.
So fix this by making sure we do the same cleanup that we do for the case
where the bio submission succeeds.
Filipe Manana [Fri, 3 Jul 2015 19:30:34 +0000 (20:30 +0100)]
Btrfs: fix list transaction->pending_ordered corruption
When we call btrfs_commit_transaction(), we splice the list "ordered"
of our transaction handle into the transaction's "pending_ordered"
list, but we don't reinitialize the "ordered" list of our transaction
handle, this means it still points to the same elements it used to
before the splice. Then we check if the current transaction's state
is >= TRANS_STATE_COMMIT_START and if it is we end up calling
btrfs_end_transaction() which simply splices again the "ordered" list
of our handle into the transaction's "pending_ordered" list, leaving
multiple pointers to the same ordered extents which results in list
corruption when we are iterating, removing and freeing ordered extents
at btrfs_wait_pending_ordered(), resulting in access to dangling
pointers / use-after-free issues.
This produces the following warning on a kernel with linked list
debugging enabled:
On a non-debug kernel this leads to invalid memory accesses, causing a
crash. Fix this by using list_splice_init() instead of list_splice() in
btrfs_commit_transaction().
Cc: stable@vger.kernel.org Fixes: 50d9aa99bd35 ("Btrfs: make sure logged extents complete in the current transaction V3" Signed-off-by: Filipe Manana <fdmanana@suse.com>
(cherry picked from commit c56d45d8d1d01d82b336fd67c6cff10d0ea097ee)
Filipe Manana [Fri, 3 Jul 2015 10:36:49 +0000 (11:36 +0100)]
Btrfs: fix memory leak in the extent_same ioctl
We were allocating memory with memdup_user() but we were never releasing
that memory. This affected pretty much every call to the ioctl, whether
it deduplicated extents or not.
This issue was reported on IRC by Julian Taylor and on the mailing list
by Marcel Ritter, credit goes to them for finding the issue.
Reported-by: Julian Taylor <jtaylor.debian@googlemail.com> Reported-by: Marcel Ritter <ritter.marcel@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: Filipe Manana <fdmanana@suse.com>
Mark Fasheh [Tue, 30 Jun 2015 21:42:08 +0000 (14:42 -0700)]
btrfs: don't update mtime/ctime on deduped inodes
One issue users have reported is that dedupe changes mtime on files,
resulting in tools like rsync thinking that their contents have changed when
in fact the data is exactly the same. We also skip the ctime update as no
user-visible metadata changes here and we want dedupe to be transparent to
the user.
Clone still wants time changes, so we special case this in the code.
Zygo Blaxell [Mon, 29 Jun 2015 21:15:22 +0000 (17:15 -0400)]
Merge tag 'v4.0.7' into zygo-4.0.7-zb64
This is the 4.0.7 stable release
# gpg: Signature made Mon Jun 29 15:29:37 2015 EDT using RSA key ID 6092693E
# gpg: Good signature from "Greg Kroah-Hartman (Linux kernel stable release signing key) <greg@kroah.com>"
# gpg: WARNING: This key is not certified with a trusted signature!
# gpg: There is no indication that the signature belongs to the owner.
# Primary key fingerprint: 647F 2865 4894 E3BD 4571 99BE 38DB BDC8 6092 693E
Patches 7cba160ad "powernv/cpuidle: Redesign idle states management"
and 77b54e9f2 "powernv/powerpc: Add winkle support for offline cpus"
use non-volatile condition registers (cr2, cr3 and cr4) early in the system
reset interrupt handler (system_reset_pSeries()) before it has been determined
if state loss has occurred. If state loss has not occurred, control returns via
the power7_wakeup_noloss() path which does not restore those condition
registers, leaving them corrupted.
Fix this by restoring the condition registers in the power7_wakeup_noloss()
case.
This is apparent when running a KVM guest on hardware that does not
support winkle or sleep and the guest makes use of secondary threads. In
practice this means Power7 machines, though some early unreleased Power8
machines may also be susceptible.
The secondary CPUs are taken off line before the guest is started and
they call pnv_smp_cpu_kill_self(). This checks support for sleep
states (in this case there is no support) and power7_nap() is called.
When the CPU is woken, power7_nap() returns and because the CPU is
still off line, the main while loop executes again. The sleep states
support test is executed again, but because the tested values cannot
have changed, the compiler has optimized the test away and instead we
rely on the result of the first test, which has been left in cr3
and/or cr4. With the result overwritten, the wrong branch is taken and
power7_winkle() is called on a CPU that does not support it, leading
to it stalling.
Fixes: 7cba160ad789 ("powernv/cpuidle: Redesign idle states management") Fixes: 77b54e9f213f ("powernv/powerpc: Add winkle support for offline cpus")
[mpe: Massage change log a bit more] Signed-off-by: Sam Bobroff <sam.bobroff@au1.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Cc: Greg Kurz <gkurz@linux.vnet.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This patch fixes a timing issue that causes a GPU hang when the system
comes out of power saving.
During pm_resume, We are submitting batchbuffers before enabling
Interrupts this is causing us to miss the context switch interrupt,
and in consequence intel_execlists_handle_ctx_events is not triggered.
This patch is based on a patch from Deepak S <deepak.s@intel.com>
from another platform.
The above patch added a call to init_context() to fix an issue introduced
by a previous patch. But, it then opened up a small timing window for the
batches being added by the init_context (basically setting up the context)
to complete before the interrupts have been turned on, thus hanging the
GPU.
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=89600 Cc: stable@vger.kernel.org # 4.0+ Signed-off-by: Peter Antoine <peter.antoine@intel.com> Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
[Jani: fixed typo in subject, massaged the comments a bit] Signed-off-by: Jani Nikula <jani.nikula@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When stacking request-based DM on blk_mq device, request cloning and
remapping are done in a single call to target's clone_and_map_rq().
The clone is allocated and valid only if clone_and_map_rq() returns
DM_MAPIO_REMAPPED.
The "IS_ERR(clone)" check in map_request() does not cover all the
!DM_MAPIO_REMAPPED cases that are possible (E.g. if underlying devices
are not ready or unavailable, clone_and_map_rq() may return
DM_MAPIO_REQUEUE without ever having established an ERR_PTR). Fix this
by explicitly checking for a return that is not DM_MAPIO_REMAPPED in
map_request().
Without this fix, DM core may call setup_clone() for a NULL clone
and oops like this:
On x86-64, __copy_instruction() always returns 0 (error) if the
instruction uses %rip-relative addressing. This is because
kernel_insn_init() is called the second time for 'insn' instance
in such cases and sets all its fields to 0.
Because of this, trying to place a kprobe on such instruction
will fail, register_kprobe() will return -EINVAL.
On Exynos4412 boards (Trats2, Odroid U3) after enabling L2 cache in 56b60b8bce4a ("ARM: 8265/1: dts: exynos4: Add nodes for L2 cache
controller") the second suspend to RAM failed. First suspend worked fine
but the next one hang just after powering down of secondary CPUs (system
consumed energy as it would be running but was not responsive).
The issue was caused by enabling delayed reset assertion for CPU0 just
after issuing power down of cores. This was introduced for Exynos4 in 13cfa6c4f7fa ("ARM: EXYNOS: Fix CPU idle clock down after CPU off").
The whole behavior is not well documented but after checking with vendor
code this should be done like this (on Exynos4):
1. Enable delayed reset assertion when system is running (for all CPUs).
2. Disable delayed reset assertion before suspending the system.
This can be done after powering off secondary CPUs.
3. Re-enable the delayed reset assertion when system is resumed.
Fixes: 13cfa6c4f7fa ("ARM: EXYNOS: Fix CPU idle clock down after CPU off") Signed-off-by: Krzysztof Kozlowski <k.kozlowski@samsung.com> Tested-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Tested-by: Chanwoo Choi <cw00.choi@samsung.com> Signed-off-by: Kukjin Kim <kgene@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
It seems Broadcom released two devices with conflicting device id. There
are for sure 14e4:4321 PCI devices with BCM4321 (N-PHY) chipset, they
can be found in routers, e.g. Netgear WNR834Bv2. However, according to
Broadcom public sources 0x4321 is also used for 5 GHz BCM4306 (G-PHY).
It's unsure if they meant PCI device id, or "virtual" id (from SPROM).
To distinguish these devices lets check PHY type (G vs. N).
Signed-off-by: Rafał Miłecki <zajec5@gmail.com> Cc: <stable@vger.kernel.org> # 3.16+ Signed-off-by: Kalle Valo <kvalo@codeaurora.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
BugLink: https://bugs.launchpad.net/bugs/1427680
This device requires new firmware files
AthrBT_0x11020100.dfu and ramps_0x11020100_40.dfu added to
/lib/firmware/ar3k/ that are not included in linux-firmware yet.
BugLink: https://bugs.launchpad.net/bugs/1462614
This device requires new firmware files
AthrBT_0x11020100.dfu and ramps_0x11020100_40.dfu added to
/lib/firmware/ar3k/ that are not included in linux-firmware yet.
This tells userspace that it's safe to use the RADEON_VA_UNMAP operation
of the DRM_RADEON_GEM_VA ioctl.
Signed-off-by: Michel Dänzer <michel.daenzer@amd.com> Signed-off-by: Christian König <christian.koenig@amd.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Fixes: 0aedb1626566 ("drm/i915: Don't skip request retirement if the active list is empty") Acked-by: Ville Syrjälä <ville.syrjala@linux.intel.com> Signed-off-by: Jani Nikula <jani.nikula@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
With the introduction of multiple views of an obj in the same vm, each
vma was taught to cache its copy of the pages (so that different views
could have different page arrangements). However, this missed decoupling
those vma->ggtt_view.pages when the vma released its reference on the
obj->pages. As we don't always free the vma, this leads to a possible
scenario (e.g. execbuffer interrupted by the shrinker) where the vma
points to a stale obj->pages, and explodes.
drm/i915: Infrastructure for supporting different GGTT views per object
Tvrtko says, if someone else will be confused how this can happen, key
is the reservation execbuffer path. That puts the VMA on the exec_list
which prevents i915_vma_unbind and i915_gem_vma_destroy from fully
destroying the VMA. So the VMA is left existing as an empty object in
the list - unbound and disassociated with the backing store. Kind of a
cached memory object. And then re-using it needs to clear the cached
pages pointer which is fixed above.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1227892 Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: Michel Thierry <michel.thierry@intel.com> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
[Jani: Added Tvrtko's explanation to commit message.] Signed-off-by: Jani Nikula <jani.nikula@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Turns out 1366x768 does not in fact work on this hardware.
Signed-off-by: Adam Jackson <ajax@redhat.com> Signed-off-by: Dave Airlie <airlied@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
iser connection termination process happens in 2 stages:
- isert_wait_conn:
- resumes rdma disconnect
- wait for session commands
- wait for flush completions (post a marked wr to signal we are done)
- wait for logout completion
- queue work for connection cleanup (depends on disconnected/timewait
events)
- isert_free_conn
- last reference put on the connection
In case we are terminating during IOs, we might be posting send/recv
requests after we posted the last work request which might lead
to a use-after-free condition in isert_handle_wc.
After we posted the last wr in isert_wait_conn we are guaranteed that
no successful completions will follow (meaning no new work request posts
may happen) but other flush errors might still come. So before we
put the last reference on the connection, we repeat the process of
posting a marked work request (isert_wait4flush) in order to make sure all
pending completions were flushed.
Since commit "2426bd456a6 target: Report correct response ..."
we might get a command with data_size that does not fit to
the number of allocated data sg elements. Given that we rely on
cmd t_data_nents which might be different than the data_size,
we sometimes receive local length error completion. The correct
approach would be to take the command data_size into account when
constructing the ib sg_list.
Worse yet, reading the error message (the filter again) it says that
there was no error, when there clearly was. The issue is that the
code that checks the input does not check for balanced ops. That is,
having an op between a closed parenthesis and the next token.
This would only cause a warning, and fail out before doing any real
harm, but it should still not caues a warning, and the error reported
should work:
On a HP Envy TouchSmart laptop, there are 2 speakers (main speaker
and subwoofer speaker), 1 headphone and 2 DACs, without this fixup,
the headphone will be assigned to a DAC and the 2 speakers will be
assigned to another DAC, this assignment makes the surround-2.1
channels invalid.
To fix it, here using a DAC/pin preference map to bind the main
speaker to 1 DAC and the subwoofer speaker will be assigned to another
DAC.
The PLL impose a certain input range to work correctly, but it appears that
this input range does not apply on the input clock (or parent clock) but
on the input clock after it has passed the PLL divisor.
Fix the implementation accordingly.
Signed-off-by: Boris Brezillon <boris.brezillon@free-electrons.com> Reported-by: Jonas Andersson <jonas@microbit.se> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>