Filipe Manana [Mon, 14 Sep 2015 08:09:31 +0000 (09:09 +0100)]
Btrfs: fix read corruption of compressed and shared extents
If a file has a range pointing to a compressed extent, followed by
another range that points to the same compressed extent and a read
operation attempts to read both ranges (either completely or part of
them), the pages that correspond to the second range are incorrectly
filled with zeroes.
Consider the following example:
File layout
   [0 - 8K]                    [8K - 24K]
       |                            |
       |                            |
 points to extent X,          points to extent X,
 offset 4K, length of 8K      offset 0, length 16K
If a readpages() call spans the 2 ranges, a single bio to read the extent
is submitted - extent_io.c:submit_extent_page() would only create a new
bio to cover the second range pointing to the extent if the extent it
points to had a different logical address than the extent associated with
the first range. The consequence is that the compressed read end-io
handler (compression.c:end_compressed_bio_read()) finishes once the extent
is decompressed into the pages covering the first range, leaving the
remaining pages (belonging to the second range) filled with zeroes (done
by compression.c:btrfs_clear_biovec_end()).
So fix this by submitting the current bio whenever we find a range
pointing to a compressed extent that was preceded by a range with a
different extent map. This is the simplest solution for this corner
case. Making the end io callback populate both ranges (or more, if we
have multiple pointing to the same extent) is a much more complex
solution since each bio is tightly coupled with a single extent map and
the extent maps associated to the ranges pointing to the shared extent
can have different offsets and lengths.
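The shape of the fix, as a hedged sketch (simplified; prev_em_start and
force_bio_submit are from the patch, the surrounding logic lives in the
readpage helpers in extent_io.c):

	/* hedged sketch, not the verbatim patch: remember where the previous
	 * range's extent map started, and force submitting the bio when a
	 * compressed extent follows a different extent map */
	if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags) &&
	    prev_em_start && *prev_em_start != (u64)-1 &&
	    *prev_em_start != em->orig_start)
		force_bio_submit = true;	/* flush the bio built for the previous range */

	if (prev_em_start)
		*prev_em_start = em->orig_start;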
The following test case for fstests triggers the issue:
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
	rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
# real QA test starts here
_need_to_be_root
_supported_fs btrfs
_supported_os Linux
_require_scratch
_require_cloner
rm -f $seqres.full
test_clone_and_read_compressed_extent()
{
	local mount_opts=$1

	_scratch_mkfs >>$seqres.full 2>&1
	_scratch_mount $mount_opts

	# Create a test file with a single extent that is compressed (the
	# data we write into it is highly compressible no matter which
	# compression algorithm is used, zlib or lzo).
	$XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 4K" \
			-c "pwrite -S 0xbb 4K 8K" \
			-c "pwrite -S 0xcc 12K 4K" \
			$SCRATCH_MNT/foo | _filter_xfs_io

	# Now clone our extent into an adjacent offset.
	$CLONER_PROG -s $((4 * 1024)) -d $((16 * 1024)) -l $((8 * 1024)) \
		$SCRATCH_MNT/foo $SCRATCH_MNT/foo

	# Same as before but for this file we clone the extent into a lower
	# file offset.
	$XFS_IO_PROG -f -c "pwrite -S 0xaa 8K 4K" \
			-c "pwrite -S 0xbb 12K 8K" \
			-c "pwrite -S 0xcc 20K 4K" \
			$SCRATCH_MNT/bar | _filter_xfs_io

	$CLONER_PROG -s $((12 * 1024)) -d $((0 * 1024)) -l $((8 * 1024)) \
		$SCRATCH_MNT/bar $SCRATCH_MNT/bar

	echo "File digests before unmounting filesystem:"
	md5sum $SCRATCH_MNT/foo | _filter_scratch
	md5sum $SCRATCH_MNT/bar | _filter_scratch

	# Evicting the inode or clearing the page cache before reading
	# again the file would also trigger the bug - reads were returning
	# all bytes in the range corresponding to the second reference to
	# the extent with a value of 0, but the correct data was persisted
	# (it was a bug exclusively in the read path). The issue happened
	# only if the same readpages() call targeted pages belonging to the
	# first and second ranges that point to the same compressed extent.
	_scratch_remount

	echo "File digests after mounting filesystem again:"
	# Must match the same digests we got before.
	md5sum $SCRATCH_MNT/foo | _filter_scratch
	md5sum $SCRATCH_MNT/bar | _filter_scratch
}
echo -e "\nTesting with zlib compression..."
test_clone_and_read_compressed_extent "-o compress=zlib"
_scratch_unmount
echo -e "\nTesting with lzo compression..."
test_clone_and_read_compressed_extent "-o compress=lzo"
status=0
exit
Cc: stable@vger.kernel.org Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
(cherry picked from commit 1c44d0493183fc27cdf030b1d437442d1f751bec)
# gpg: Signature made Sun Sep 13 12:12:11 2015 EDT using RSA key ID 6092693E
# gpg: Good signature from "Greg Kroah-Hartman (Linux kernel stable release signing key) <greg@kroah.com>"
# gpg: WARNING: This key is not certified with a trusted signature!
# gpg: There is no indication that the signature belongs to the owner.
# Primary key fingerprint: 647F 2865 4894 E3BD 4571 99BE 38DB BDC8 6092 693E
The Sourcery CodeBench Lite 2014.05 toolchain (gcc 4.8.3, binutils
2.24.51) has a GCC which implements -fuse-ld, and it doesn't include
the gold linker, but it lacks an ld.bfd executable in its
installation. This means that passing -fuse-ld=bfd fails with:
Arguably this is a deficiency in the toolchain, but I suspect it's
commonly used enough that it's worth accommodating: just use
cc-ldoption (to cause a link attempt) instead of cc-option to test
whether we can use -fuse-ld. So -fuse-ld=bfd won't be used with this
toolchain, but the build will rightly succeed, just as it does for
toolchains which don't implement -fuse-ld (and don't use gold as the
default linker).
Note: this will change the failure mode for a corner case I was trying
to handle in d2b30cd4b722, where the toolchain defaults to the gold
linker and the BFD linker is not found in PATH, from:
Commit b253149b843f ("sched/idle/x86: Restore mwait_idle() to fix boot
hangs, to improve power savings and to improve performance") restores
mwait_idle(), but the trace_cpu_idle related calls are missing. This
causes powertop on my old desktop powered by Intel Core2 E6550 to
report zero wakeups and zero events.
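A hedged sketch of the missing instrumentation (simplified; the other x86
idle routines bracket the idle period with the same tracepoints):

	static void mwait_idle(void)
	{
		trace_cpu_idle_rcuidle(1, smp_processor_id());
		/* the monitor/mwait sequence restored by b253149b843f goes here */
		trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, smp_processor_id());
	}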
In the recent x2apic cleanup I got two things really wrong:
1) The safety check in __disable_x2apic which allows the function to
be called unconditionally is backwards. The check is there to
prevent access to the apic MSR in case that the machine has no
apic. Though right now it returns if the machine has an apic and
therefore the disabling of x2apic is never invoked.
2) x2apic_disable() sets x2apic_mode to 0 after registering the local
apic. That's wrong, because register_lapic_address() checks x2apic
mode and therefore takes the wrong code path.
This results in boot failures on machines with x2apic preenabled by
the BIOS and can also lead to a fatal MSR access on machines without
an apic.
The solutions are simple:
1) Correct the sanity check for apic availability
2) Clear x2apic_mode _before_ calling register_lapic_address()
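Hedged sketches of the two corrections (4.1-era helper names, heavily
simplified):

	static void __x2apic_disable(void)
	{
		if (!cpu_has_apic)	/* 1) bail only when the machine has NO apic */
			return;

		/* ... clear X2APIC_ENABLE in MSR_IA32_APICBASE ... */
	}

	static void x2apic_disable(void)
	{
		/* ... */
		x2apic_mode = 0;			/* 2) clear the mode first ... */
		register_lapic_address(mp_lapic_addr);	/* ... so this takes the xAPIC path */
	}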
Fixes: 659006bf3ae3 'x86/x2apic: Split enable and setup function' Reported-and-tested-by: Javier Monteagudo <javiermon@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://bugzilla.redhat.com/show_bug.cgi?id=1224764 Cc: Laura Abbott <labbott@redhat.com> Cc: Jiang Liu <jiang.liu@linux.intel.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: Tony Luck <tony.luck@intel.com> Cc: Borislav Petkov <bp@alien8.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Since commit feb44f1f7a4ac299d1ab1c3606860e70b9b89d69 (x86/xen:
Provide a "Xen PV" APIC driver to support >255 VCPUs) Xen guests need
a full APIC driver and thus should depend on X86_LOCAL_APIC.
This fixes an i386 build failure with !SMP && !CONFIG_X86_UP_APIC by
disabling Xen support in this configuration.
Users needing Xen support in a non-SMP i386 kernel will need to enable
CONFIG_X86_UP_APIC.
Signed-off-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Commit d795ef9aa831 ("arm64: perf: don't warn about missing
interrupt-affinity property for PPIs") added a check for PPIs so that
we avoid parsing the interrupt-affinity property for these naturally
affine interrupts.
Unfortunately, this check can trigger an early (successful) return and
we will not assign the value of cpu_pmu->plat_device. This patch fixes
the issue.
Signed-off-by: Shannon Zhao <shannon.zhao@linaro.org> Signed-off-by: Will Deacon <will.deacon@arm.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When injecting a fault into a misbehaving 32bit guest, it seems
rather idiotic to also inject a 64bit fault that is only going
to corrupt the guest state. This leads to a situation where we
perform an illegal exception return at EL2 causing the host
to crash instead of killing the guest.
Just fix the stupid bug that has been there from day 1.
Reported-by: Russell King <rmk+kernel@arm.linux.org.uk> Tested-by: Russell King <rmk+kernel@arm.linux.org.uk> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
We added changes in fnic driver patch 1.6.0.16 to acquire
io_req_lock in fnic_queuecommand() before issuing I/O so that io completion
is serialized. But when releasing the lock we check for the I/O flag and
this could be modified if an I/O abort occurs before I/O completion. In this
case we won't release the lock, which causes a deadlock in some scenarios.
Using a local variable to check the I/O lock status resolves the problem.
Fixes: 41df7b02db82cf6c14f094757bac3830d10a827f Signed-off-by: Hiral Shah <hishah@cisco.com> Signed-off-by: Sesidhar Baddela <sebaddel@cisco.com> Signed-off-by: Anil Chintalapati <achintal@cisco.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: James Bottomley <JBottomley@Odin.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The Crucial M500 is known to have issues with queued TRIM commands, the
factory recertified SSDs use a different model number naming convention
which causes them to get ignored by the blacklist.
The new naming convention boils down to: s/Crucial_/FC/
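A hedged sketch of the resulting blacklist addition in libata-core.c (the
horkage flags mirror the existing Crucial_CT*M500* entry):

	{ "FCCT*M500*",		NULL,	ATA_HORKAGE_NO_NCQ_TRIM |
					ATA_HORKAGE_ZERO_AFTER_TRIM, },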
The CAN FD data bittiming constants are provided via netlink only when there
are valid CAN FD constants available in priv->data_bittiming_const.
Due to the indirection of pointer assignments in the peak_usb driver the
priv->data_bittiming_const never becomes NULL - not even for non-FD adapters.
The data_bittiming_const points to zero'ed data which leads to this result
when running 'ip -details link show can0':
35: can0: <NOARP,ECHO> mtu 16 qdisc noop state DOWN mode DEFAULT group default qlen 10
    link/can  promiscuity 0
    can state STOPPED restart-ms 0
	  pcan_usb: tseg1 1..16 tseg2 1..8 sjw 1..4 brp 1..64 brp-inc 1
	  : dtseg1 0..0 dtseg2 0..0 dsjw 1..0 dbrp 0..0 dbrp-inc 0  <== BROKEN!
	  clock 8000000
This patch changes the struct peak_usb_adapter::bittiming_const and struct
peak_usb_adapter::data_bittiming_const to pointers to fix the assignment
problems.
Reported-by: Oliver Hartkopp <socketcan@hartkopp.net> Tested-by: Oliver Hartkopp <socketcan@hartkopp.net> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The routines in scsi_rpm.c assume that if a runtime-PM callback is
invoked for a SCSI device, it can only mean that the device's driver
has asked the block layer to handle the runtime power management (by
calling blk_pm_runtime_init(), which among other things sets q->dev).
However, this assumption turns out to be wrong for things like the ses
driver. Normally ses devices are not allowed to do runtime PM, but
userspace can override this setting. If this happens, the kernel gets
a NULL pointer dereference when blk_post_runtime_resume() tries to use
the uninitialized q->dev pointer.
This patch fixes the problem by calling the block layer's runtime-PM
routines only if the device's driver really does have a runtime-PM
callback routine. Since ses doesn't define any such callbacks, the
crash won't occur.
This fixes Bugzilla #101371.
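A hedged sketch of the suspend side of the fix (simplified): the blk_*
runtime-PM calls are made only when the driver supplies a callback, i.e.
when the queue was set up via blk_pm_runtime_init().

	static int sdev_runtime_suspend(struct device *dev)
	{
		const struct dev_pm_ops *pm = dev->driver ? dev->driver->pm : NULL;
		struct scsi_device *sdev = to_scsi_device(dev);
		int err = 0;

		if (pm && pm->runtime_suspend) {	/* skipped entirely for drivers like ses */
			err = blk_pre_runtime_suspend(sdev->request_queue);
			if (err)
				return err;
			err = pm->runtime_suspend(dev);
			blk_post_runtime_suspend(sdev->request_queue, err);
		}
		return err;
	}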
Signed-off-by: Alan Stern <stern@rowland.harvard.edu> Reported-by: Stanisław Pitucha <viraptor@gmail.com> Reported-by: Ilan Cohen <ilanco@gmail.com> Tested-by: Ilan Cohen <ilanco@gmail.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: James Bottomley <JBottomley@Odin.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This helper is required for irq chips which do not implement an
irq_set_type callback and need to call down the irq domain hierarchy
for the actual trigger type change.
This helper is required to fix further wreckage caused by the
conversion of TI OMAP to hierarchical irq domains and is therefore
tagged for stable.
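The helper, roughly as added (hedged sketch):

	int irq_chip_set_type_parent(struct irq_data *data, unsigned int type)
	{
		data = data->parent_data;
		if (data->chip->irq_set_type)
			return data->chip->irq_set_type(data, type);

		return -ENOSYS;
	}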
irq_chip_retrigger_hierarchy() returns -ENOSYS if it was not able to
find at least one .irq_retrigger() callback implemented in the IRQ
domain hierarchy.
That's wrong, because check_irq_resend() expects a 0 return value from
the callback in case that the hardware assisted resend was not
possible. If the return value is non zero the core code assumes
hardware resend success and the software resend is not invoked.
This results in lost interrupts on platforms where none of the parent
irq chips in the hierarchy implements the retrigger callback.
This is observable on TI OMAP, where the hierarchy is:
ARM GIC <- OMAP wakeupgen <- TI Crossbar
Return 0 instead so the software resend mechanism gets invoked.
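A hedged sketch of the corrected function (simplified):

	int irq_chip_retrigger_hierarchy(struct irq_data *data)
	{
		for (data = data->parent_data; data; data = data->parent_data)
			if (data->chip && data->chip->irq_retrigger)
				return data->chip->irq_retrigger(data);

		/* was -ENOSYS; 0 tells check_irq_resend() to fall back to the
		 * software resend */
		return 0;
	}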
The conversion of the wakeupgen irqchip to hierarchical irq domains
failed to provide a mechanism to properly set the trigger type of an
interrupt.
The wakeupgen irq chip itself has no mechanism and therefore no
irq_set_type() callback. The code before the conversion relayed the
trigger configuration directly to the underlying GIC.
Restore the correct behaviour by setting the wakeupgen irq_set_type
callback to irq_chip_set_type_parent(). This propagates the
set_trigger() call to the underlying GIC irqchip.
The TI crossbar irqchip doesn't provide any facility to configure the
wakeup sources, but the conversion to hierarchical irqdomains set the
irq_set_wake callback to irq_chip_set_wake_parent. The parent chip
(OMAP wakeupgen) has no irq_set_wake function either so the call will
fail with -ENOSYS. As a result the irq_set_wake() call in the resume
path will trigger an 'Unbalanced wake disable' warning.
Before the conversion the GIC irqchip was the top level irqchip and
correctly flagged with IRQCHIP_SKIP_SET_WAKE.
Restore the correct behaviour by removing the irq_set_wake callback
from the crossbar irqchip and setting the IRQCHIP_SKIP_SET_WAKE flag,
which lets the irq_set_irq_wake() call from the driver succeed.
The ARM GIC requires that all interrupts which are not used as a
wakeup source have to be masked during suspend.
The conversion of the crossbar irqchip to hierarchical irq domains
failed to mark the crossbar irqchip with the IRQCHIP_MASK_ON_SUSPEND
flag and therefore broke the suspend requirement of the GIC.
Before the conversion the flags were visible because the GIC was the
top level irqchip. After the conversion the crossbar irqchip is the
top level irq chip whose flags are evaluated in suspend_device_irq().
As the flag is not set the masking of the non-wakeup irqs is not
invoked which breaks suspend.
Add the IRQCHIP_MASK_ON_SUSPEND flag to the crossbar irqchip, so the
GIC interrupts get masked properly.
The conversion of the crossbar irqchip to hierarchical irq domains
failed to provide a mechanism to properly set the trigger type of an
interrupt.
The crossbar irq chip itself has no mechanism and therefore no
irq_set_type() callback. The code before the conversion relayed the
trigger configuration directly to the underlying GIC.
Restore the correct behaviour by setting the crossbar irq_set_type
callback to irq_chip_set_type_parent(). This propagates the
set_trigger() call to the underlying GIC irqchip.
Some uses of those functions were providing uninitialized values to those
functions. Notably, when reading 0 bytes from an empty file on a 9P
filesystem, the return code of read() was not 0.
Every time we use the logical context with execlists it becomes dirty (as
the hardware will write the new register values afterwards, as well as
the GPU state that will be used). We need to flag the context as
dirty every time, since after a swap-out/swap-in cycle the dirty flag will
be cleared, and a further swap-out cycle will then lose the most recent
GPU state.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch> Signed-off-by: Jani Nikula <jani.nikula@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Meelis and Helge reported that 3a9ad0b4fdcd ("PCI: Add pci_bus_addr_t")
caused HPMCs on A500 and hangs on rp5470.
PA-RISC does not set ARCH_DMA_ADDR_T_64BIT, even for 64-bit kernels, so
prior to 3a9ad0b4fdcd, we always used 32-bit PCI addresses. After
3a9ad0b4fdcd, we do use 64-bit PCI addresses in 64-bit kernels, and
apparently there's some PA-RISC problem related to them.
Make sure all non-READ SCSI commands get targ_xfer_tag initialized
to 0xffffffff, not just WRITEs.
Double-free of a TUR cmd object occurs under the following scenario:
1. TUR received (targ_xfer_tag is uninitialized and left at 0)
2. TUR status sent
3. First unsolicited NOPIN is sent to initiator (gets targ_xfer_tag of 0)
4. NOPOUT for NOPIN (with TTT=0) arrives
- its ExpStatSN acks TUR status, TUR is queued for removal
- LIO tries to find NOPIN with TTT=0, but finds the same TUR instead,
TUR is queued for removal for the 2nd time
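A hedged sketch of the fix (simplified; the exact data-direction test may
differ in the driver):

	/* give every command that carries no data-in transfer the reserved
	 * tag, so a TTT of 0 can never collide with a NOPIN lookup */
	if (cmd->data_direction != DMA_FROM_DEVICE)
		cmd->targ_xfer_tag = 0xFFFFFFFF;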
At the last iteration of the loop, j may equal zero and thus
tp_list[j - 1] causes an invalid read.
Change the logic of the loop so that j - 1 is always >= 0.
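As a hedged illustration of the pattern (use() and n are hypothetical, not
the driver's code):

	/* before: j reaches 0 on the last pass, so tp_list[j - 1] underflows */
	for (j = n; j >= 0; j--)
		use(tp_list[j - 1]);

	/* after: stop while j is still >= 1, so j - 1 is always a valid index */
	for (j = n; j > 0; j--)
		use(tp_list[j - 1]);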
After a for-loop was replaced by list_for_each_entry, see
Commit bbbc7e8502c9 ("ALSA: hda - Allocate hda_pcm objects dynamically"),
Commit 751e2216899c ("ALSA: hda: fix possible null dereference"),
a possible NULL pointer dereference has been introduced; this patch adds
the NULL check on pcm->pcm, while leaving a potentially superfluous
check on pcm itself untouched.
The widget power-saving code tries to turn up/down the power of each
widget in the I/O paths that are modified at each jack plug/unplug.
The recent report revealed that the power activation leaves some
widgets unpowered after plugging. This is because
snd_hda_activate_path() turns on path->active flag at the end of the
function while the path power management is done before that. Then
it's regarded as if nothing is active, and the driver turns off the
power.
The fix is simply to set the flag at the beginning of the function,
before trying to power up.
The is_active_nid_for_any() function in the generic parser is supposed
to check all connections from/to the given widget, but the current
code checks only the first input connection (index = 0).
This patch corrects the code to check all inputs by passing -1 to the
index argument.
On shutdown/reboot of CX20722, first shut down all EAPDs, then
shut down the afg node to D3.
Failure to do so can lead to spurious noises from the internal speaker
directly after reboot (and before the codec is reinitialized again, i.e.
in BIOS setup or GRUB menus).
The fix for deadlock in PM in commit [1ee23fe07ee8: ALSA: usb-audio:
Fix deadlocks at resuming] introduced a new check of in_pm flag.
However, the brainless patch author evaluated it in a wrong way
(logical AND instead of logical OR), thus usb_autopm_get_interface()
is wrongly called at probing, leading to unbalance of runtime PM
refcount.
This patch fixes it by correcting the logic.
Reported-by: Hans Yang <hansy@nvidia.com> Fixes: 1ee23fe07ee8 ('ALSA: usb-audio: Fix deadlocks at resuming') Signed-off-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This should help to fix the following issue in Docker:
https://github.com/opencontainers/runc/issues/133
In some conditions, a Docker container needs to be started twice in
order to work.
As implemented, ACS-4 sense reporting for ATA devices bypasses error
diagnosis and handling in libata degrading EH behavior significantly.
Revert the related changes for now.
As implemented, ACS-4 sense reporting for ATA devices bypasses error
diagnosis and handling in libata degrading EH behavior significantly.
Revert the related changes for now.
ATA_ID_COMMAND_SET_3/4 constants are not reverted as they're used by
later changes.
As implemented, ACS-4 sense reporting for ATA devices bypasses error
diagnosis and handling in libata degrading EH behavior significantly.
Revert the related changes for now.
Commit 000851119e80 changed sha256/512 update functions to
pass more data to nx_build_sg_list(), which ends up overflowing the
sg list and usually makes the update functions fail
for data larger than max_sg_len * NX_PAGE_SIZE.
This happens because:
- both "total" and "to_process" are updated, which leads to
"to_process" getting overflowed for some data lengths
For example:
In first iteration "total" is 50, and let's assume "to_process"
is 30 due to sg limits. At the end of first iteration "total" is
set to 20. At start of 2nd iteration "to_process" overflows on:
to_process = total - to_process;
- "in_sg" is not reset to nx_ctx->in_sg after each iteration
- nx_build_sg_list() is hitting overflow because the amount of data
passed to it would require more than sgmax elements
- as consequence of previous item, data stored in overflowed sg list
may no longer be aligned to SHA*_BLOCK_SIZE
This patch changes sha256/512 update functions so that "to_process"
respects sg limits and never tries to pass more data to
nx_build_sg_list() to avoid overflows. "to_process" is calculated
as the minimum of "total" and the sg limits at the start of every iteration.
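A hedged sketch of the corrected update loop (names simplified, not the
driver's exact code):

	do {
		/* clamp to the sg limits on every pass instead of carrying
		 * over a value derived from the previous iteration */
		u64 to_process = min_t(u64, total, max_sg_len * NX_PAGE_SIZE);

		to_process &= ~(SHA256_BLOCK_SIZE - 1);	/* keep block alignment */
		in_sg = nx_ctx->in_sg;			/* reset the sg list each pass */

		/* build the sg list and run the hardware op on to_process bytes */

		total -= to_process;
		data += to_process;
	} while (total >= SHA256_BLOCK_SIZE);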
Fixes: 000851119e80 ("crypto: nx - Fix SHA concurrence issue and sg
limit bounds") Signed-off-by: Jan Stancek <jstancek@redhat.com> Cc: Leonidas Da Silva Barbosa <leosilva@linux.vnet.ibm.com> Cc: Marcelo Henrique Cerri <mhcerri@linux.vnet.ibm.com> Cc: Fionnuala Gunter <fin@linux.vnet.ibm.com> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Commit bcdb247c6b6a ("sd: Limit transfer length") clamped the maximum
size of an I/O request to the MAXIMUM TRANSFER LENGTH field in the BLOCK
LIMITS VPD. This had the unfortunate effect of also limiting the maximum
size of non-filesystem requests sent to the device through sg/bsg.
Avoid using blk_queue_max_hw_sectors() and set the max_sectors queue
limit directly.
Also update the comment in blk_limits_max_hw_sectors() to clarify that
max_hw_sectors defines the limit for the I/O controller only.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Reported-by: Brian King <brking@linux.vnet.ibm.com> Tested-by: Brian King <brking@linux.vnet.ibm.com> Signed-off-by: James Bottomley <JBottomley@Odin.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In case of hw iscsi offload, a host can have N active
connections. There can be I/Os running on some connections, which
makes host->host_busy always TRUE. Now if logout from a connection
is tried then the code gets into an infinite loop as host->host_busy
is always TRUE.
iscsi_conn_teardown(....)
{
	.........
	/*
	 * Block until all in-progress commands for this connection
	 * time out or fail.
	 */
	for (;;) {
		spin_lock_irqsave(session->host->host_lock, flags);
		if (!atomic_read(&session->host->host_busy)) { /* OK for ERL == 0 */
			spin_unlock_irqrestore(session->host->host_lock, flags);
			break;
		}
		spin_unlock_irqrestore(session->host->host_lock, flags);
		msleep_interruptible(500);
		iscsi_conn_printk(KERN_INFO, conn, "iscsi conn_destroy(): "
				  "host_busy %d host_failed %d\n",
				  atomic_read(&session->host->host_busy),
				  session->host->host_failed);
		................
		...............
	}
}
This is not an issue with software-iscsi/iser as each cxn is a separate
host.
Fix:
Acquire eh_mutex in iscsi_conn_teardown() before setting
session->state = ISCSI_STATE_TERMINATE.
Signed-off-by: John Soni Jose <sony.john@avagotech.com> Reviewed-by: Mike Christie <michaelc@cs.wisc.edu> Reviewed-by: Chris Leech <cleech@redhat.com> Signed-off-by: James Bottomley <JBottomley@Odin.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Commit 4c21b8fd8f14 ("MIPS: seccomp: Handle indirect system calls (o32)")
fixed indirect system calls on O32 but it also introduced a bug for MIPS64
where it erroneously modified the v0 (syscall) register with the assumption
that the syscall offset hasn't been taken into consideration. This breaks
seccomp on MIPS64 n64 and n32 ABIs. We fix this by replacing the addition
with a move instruction.
When inserting a new register into a block, the present bit map size is
increased using krealloc. krealloc does not clear the additionally
allocated memory, leaving it filled with random values. The result is
that some registers are considered cached even though this is not the case.
Fix the problem by clearing the additionally allocated memory. Also, if
the bitmap size does not increase, do not reallocate the bitmap at all
to reduce overhead.
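The pitfall is generic; a minimal userspace sketch of the same bug and fix
(realloc() behaves like krealloc() here, and grow_bitmap() is hypothetical):

	#include <stdlib.h>
	#include <string.h>

	/* Grow a bitmap the way the broken code did: realloc() (like
	 * krealloc()) leaves the newly added bytes uninitialized. */
	unsigned char *grow_bitmap(unsigned char *map, size_t old_bytes,
				   size_t new_bytes)
	{
		unsigned char *tmp;

		if (new_bytes <= old_bytes)
			return map;	/* fix #2: don't reallocate if the size didn't grow */

		tmp = realloc(map, new_bytes);
		if (!tmp)
			return NULL;

		/* fix #1: clear only the added tail, old contents stay valid */
		memset(tmp + old_bytes, 0, new_bytes - old_bytes);
		return tmp;
	}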
Fixes: 3f4ff561bc88 ("regmap: rbtree: Make cache_present bitmap per node") Signed-off-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Mark Brown <broonie@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This reverts commits 9a036b93a344 ("x86/signal/64: Remove 'fs' and 'gs'
from sigcontext") and c6f2062935c8 ("x86/signal/64: Fix SS handling for
signals delivered to 64-bit programs").
They were cleanups, but they break dosemu by changing the signal return
behavior (and removing 'fs' and 'gs' from the sigcontext struct - while
not actually changing any behavior - causes build problems).
The PM runtime core by default assumes a chip is suspended when runtime
PM is enabled. Currently the arizona driver enables runtime PM when the
chip is fully active and then disables the DCVDD regulator at the end of
arizona_dev_init. This however has several problems: firstly, if we
reach the end of arizona_dev_init, we did not properly follow all the
procedures for shutting down the chip, and most notably we never marked
the chip as cache only, so any writes occurring between then and the next
PM runtime resume will be lost. Secondly, if we are already resumed when
we reach the end of dev_init, then at best we get unbalanced regulator
enable/disables; at worst we lose DCVDD whilst we still need it.
Additionally, since the commit 4f0216409f7c ("mfd: arizona: Add better
support for system suspend"), the PM runtime operations may
disable/enable the IRQ, so the IRQs must now be enabled before we call
any PM operations.
This patch adds a call to pm_runtime_set_active to inform the PM core
that the device is starting up active and moves the PM enabling to
around the IRQ initialisation to avoid any PM callbacks happening until
the IRQs are initialised.
Signed-off-by: Charles Keepax <ckeepax@opensource.wolfsonmicro.com> Signed-off-by: Lee Jones <lee.jones@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
We must invalidate the L1 cache before enabling coherency, otherwise
secondary CPUs can inject invalid cache lines into the coherent CPU
cluster, which could then be migrated to other CPUs. This fixes a
recent regression with SoCFPGA randomly failing to boot.
Fixes: 02b4e2756e01 ("ARM: v7 setup function should invalidate L1 cache") Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk> Cc: Alexander Kochetkov <al.kochet@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
All ARMv5 and older CPUs invalidate their caches in the early assembly
setup function, prior to enabling the MMU. This is because the L1
cache should not contain any data relevant to the execution of the
kernel at this point; all data should have been flushed out to memory.
This requirement should also be true for ARMv6 and ARMv7 CPUs - indeed,
these typically do not search their caches when caching is disabled (as
it needs to be when the MMU is disabled) so this change should be safe.
ARMv7 allows there to be CPUs which search their caches while caching is
disabled, and it's permitted that the cache is uninitialised at boot;
for these, the architecture reference manual requires that an
implementation specific code sequence is used immediately after reset
to ensure that the cache is placed into a sane state. Such
functionality is definitely outside the remit of the Linux kernel, and
must be done by the SoC's firmware before _any_ CPU gets to the Linux
kernel.
Changing the data cache clean+invalidate to a mere invalidate allows us
to get rid of a lot of platform specific hacks around this issue for
their secondary CPU bringup paths - some of which were buggy.
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Tested-by: Florian Fainelli <f.fainelli@gmail.com> Tested-by: Heiko Stuebner <heiko@sntech.de> Tested-by: Dinh Nguyen <dinguyen@opensource.altera.com> Acked-by: Sebastian Hesselbarth <sebastian.hesselbarth@gmail.com> Tested-by: Sebastian Hesselbarth <sebastian.hesselbarth@gmail.com> Acked-by: Shawn Guo <shawn.guo@linaro.org> Tested-by: Thierry Reding <treding@nvidia.com> Acked-by: Thierry Reding <treding@nvidia.com> Tested-by: Geert Uytterhoeven <geert+renesas@glider.be> Tested-by: Michal Simek <michal.simek@xilinx.com> Tested-by: Wei Xu <xuwei5@hisilicon.com> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk> Cc: Alexander Kochetkov <al.kochet@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When using a toolchain with gold as the default linker, the VDSO build
fails:
  VDSO    arch/arm/vdso/vdso.so.raw
  HOSTCC  arch/arm/vdso/vdsomunge
  MUNGE   arch/arm/vdso/vdso.so.dbg
  OBJCOPY arch/arm/vdso/vdso.so
BFD: arch/arm/vdso/vdso.so: Not enough room for program headers, try
linking with -N
For whatever reason, ld.gold is omitting an exidx program header that
ld.bfd emits, and even when I work around that, I don't get a working
VDSO.
For now, instead of supporting gold (which will fail to link the
kernel anyway since it does not implement --pic-veneer), direct the
compiler to use the traditional bfd linker. This is accomplished by
using -fuse-ld, which is implemented in GCC 4.8 and later.
Note: one limitation of this is that if the toolchain is configured
to use gold by default, and the bfd linker is not in $PATH, the VDSO
build will fail:
This will happen if CROSS_COMPILE begins with a path such as
/opt/bin/arm-linux-gnu- but /opt/bin is not in $PATH. This is
considered an acceptable corner-case limitation and is easily worked
around.
Additional note: we use cc-option instead of cc-ldoption so that
-fuse-ld=bfd is placed in the command line if the compiler recognizes
the option. Using cc-ldoption results in an attempt to link, which
fails in the situation just described, causing -fuse-ld=bfd to be
omitted and gold to be used for the VDSO link, which is what we're
trying to prevent.
Reported-by: Stefan Agner <stefan@agner.ch> Signed-off-by: Nathan Lynch <nathan_lynch@mentor.com> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk> Cc: Alexander Kochetkov <al.kochet@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit <ed8509edddeb> ("ARM: dts: omap5: add minimal l4 bus
layout with control module support") moved the pbias_regulator dt node
from being a child node of ocp to being a child node of
omap5_padconf_global. After this the device for pbias_regulator is
no longer created.
Fix it by adding "simple-bus" compatible property to
omap5_padconf_global dt node.
Fixes: ed8509edddeb ("ARM: dts: omap5: add minimal l4 bus
layout with control module support")
Signed-off-by: Kishon Vijay Abraham I <kishon@ti.com> Signed-off-by: Tony Lindgren <tony@atomide.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit <7415b0b4c645> ("ARM: dts: omap4: add minimal l4 bus layout
with control module support") moved the pbias_regulator dt node
from being a child node of ocp to being a child node of
omap4_padconf_global. After this the device for pbias_regulator
is no longer created.
Fix it by adding "simple-bus" compatible property to
omap4_padconf_global dt node.
Fixes: 7415b0b4c645 ("ARM: dts: omap4: add minimal l4 bus layout
with control module support")
Signed-off-by: Kishon Vijay Abraham I <kishon@ti.com> Signed-off-by: Tony Lindgren <tony@atomide.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit <d919501feffa> ("ARM: dts: dra7: add minimal l4 bus
layout with control module support") moved the pbias_regulator dt node
from being a child node of ocp to being a child node of
scm_conf. After this the device for pbias_regulator is
no longer created.
Fix it by adding "simple-bus" compatible property to
scm_conf dt node.
Fixes: d919501feffa ("ARM: dts: dra7: add minimal l4 bus
layout with control module support")
Suggested-by: Tero Kristo <t-kristo@ti.com> Signed-off-by: Kishon Vijay Abraham I <kishon@ti.com> Tested-by: Grygorii Strashko <grygorii.strashko@ti.com> Signed-off-by: Tony Lindgren <tony@atomide.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit <72b10ac00eb1> ("ARM: dts: omap24xx: add minimal l4 bus
layout with control module support") moved the pbias_regulator dt node
from being a child node of ocp to being a child node of
scm_conf. After this the device for pbias_regulator is
no longer created.
Fix it by adding "simple-bus" compatible property to
scm_conf dt node.
Fixes: 72b10ac00eb1 ("ARM: dts: omap24xx: add minimal l4 bus
layout with control module support")
Signed-off-by: Kishon Vijay Abraham I <kishon@ti.com> Signed-off-by: Tony Lindgren <tony@atomide.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The PCIe interrupts are also routed through the GPC. This has been
missed from the conversion to stacked IRQ domains as the PCIe
controller uses an explicit interrupt map and thus doesn't inherit
the SoC global interrupt parent.
Since fc_fcp_cleanup_cmd() can sleep, this function must not
be called while holding a spinlock. This patch avoids that
fc_fcp_cleanup_each_cmd() triggers the following bug:
Due to patch "libfc: Do not invoke the response handler after
fc_exch_done()" (commit ID 7030fd62) the lport_recv() call
in fc_exch_recv_req() is passed a dangling pointer. Avoid this
by moving the fc_frame_free() call from fc_invoke_resp() to its
callers. This patch fixes the following crash:
This addresses two issues that cause problems with viewperf maya-03 in
situations with memory pressure.
The first issue causes attempts to unreserve buffers if batched
reservation fails due to, for example, a signal pending. While previously
the ttm_eu api was resistant against this type of error, it is no longer
and the lockdep code will complain about attempting to unreserve buffers
that are not reserved. The issue is resolved by avoiding the call to
ttm_eu_backoff_reservation in the buffer reserve error path.
The second issue is that the binding_mutex may be held when user-space
fence objects are created and hence during memory reclaims. This may cause
recursive attempts to grab the binding mutex. The issue is resolved by not
holding the binding mutex across fence creation and submission.
Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com> Reviewed-by: Sinclair Yeh <syeh@vmware.com> Signed-off-by: Dave Airlie <airlied@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
During unbinding the driver was dereferencing a pointer to memory
already freed by power_supply_unregister().
The driver was freeing its internal description of the battery through
pointers stored in the power_supply structure. However, because the core owns the
power supply instance, after calling power_supply_unregister() this
memory is freed and the driver cannot access these members.
Fix this by storing the pointer to internal description of battery in a
local variable before calling power_supply_unregister(), so the pointer
remains valid.
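A hedged sketch of the pattern (driver-specific names are hypothetical):

	struct my_battery *batt = power_supply_get_drvdata(psy);	/* save first */

	power_supply_unregister(psy);	/* the core frees psy (and psy->desc) here */

	kfree(batt->model_name);	/* still safe: we no longer touch psy */
	kfree(batt);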
Signed-off-by: Krzysztof Kozlowski <k.kozlowski@samsung.com> Reported-by: H.J. Lu <hjl.tools@gmail.com> Fixes: 297d716f6260 ("power_supply: Change ownership from driver to core") Reviewed-by: Dmitry Torokhov <dmitry.torokhov@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In kbuild it is allowed to define objects in files named "Makefile"
and "Kbuild".
Currently localmodconfig reads objects only from "Makefile"s and misses
modules like nouveau.
Link: http://lkml.kernel.org/r/1437948415-16290-1-git-send-email-richard@nod.at Reported-and-tested-by: Leonidas Spyropoulos <artafinde@gmail.com> Signed-off-by: Richard Weinberger <richard@nod.at> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The commit ccc9d90a9a8b5c4ad7e9708ec41f75ff9e98d61d "xenbus_client:
Extend interface to support multi-page ring" removed the call to
free_xenballooned_pages() in xenbus_unmap_ring_vfree_hvm(), leaking a
page for every shared ring.
Only backends running in HVM domains were affected.
Signed-off-by: Julien Grall <julien.grall@citrix.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Signed-off-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
It turns out that a PV domU also requires the "Xen PV" APIC
driver. Otherwise, the flat driver is used and we get stuck in busy
loops that never exit, such as in this stack trace:
(gdb) target remote localhost:9999
Remote debugging using localhost:9999
__xapic_wait_icr_idle () at ./arch/x86/include/asm/ipi.h:56
56 while (native_apic_mem_read(APIC_ICR) & APIC_ICR_BUSY)
(gdb) bt
#0 __xapic_wait_icr_idle () at ./arch/x86/include/asm/ipi.h:56
#1 __default_send_IPI_shortcut (shortcut=<optimized out>,
dest=<optimized out>, vector=<optimized out>) at
./arch/x86/include/asm/ipi.h:75
#2 apic_send_IPI_self (vector=246) at arch/x86/kernel/apic/probe_64.c:54
#3 0xffffffff81011336 in arch_irq_work_raise () at
arch/x86/kernel/irq_work.c:47
#4 0xffffffff8114990c in irq_work_queue (work=0xffff88000fc0e400) at
kernel/irq_work.c:100
#5 0xffffffff8110c29d in wake_up_klogd () at kernel/printk/printk.c:2633
#6 0xffffffff8110ca60 in vprintk_emit (facility=0, level=<optimized
out>, dict=0x0 <irq_stack_union>, dictlen=<optimized out>,
fmt=<optimized out>, args=<optimized out>)
at kernel/printk/printk.c:1778
#7 0xffffffff816010c8 in printk (fmt=<optimized out>) at
kernel/printk/printk.c:1868
#8 0xffffffffc00013ea in ?? ()
#9 0x0000000000000000 in ?? ()
Mailing-list-thread: https://lkml.org/lkml/2015/8/4/755 Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In commit 33511b157bbcebaef853cc1811992b664a2e5862 ("rtlwifi: add support to
send beacon frame"), the mechanism for sending beacons was established. That
patch works correctly for rtl8192cu, but there is a possibility of getting
the following warnings in the PCI drivers:
The warning is followed by a NULL pointer dereference as follows:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000006
IP: [<ffffffffc073998e>] rtl_get_tcb_desc+0x5e/0x760 [rtlwifi]
This problem was reported at http://thread.gmane.org/gmane.linux.kernel.wireless.general/138645,
but no solution was found at that time.
The problem was also reported at https://bugzilla.kernel.org/show_bug.cgi?id=9744
and this solution was developed and tested there.
The USB driver works with a NULL final argument in the adapter_tx() callback;
however, the PCI drivers need a struct rtl_tcb_desc in that position.
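A hedged sketch of the fix in the beacon path (simplified): pass a zeroed
on-stack descriptor instead of NULL.

	struct rtl_tcb_desc tcb_desc;

	memset(&tcb_desc, 0, sizeof(struct rtl_tcb_desc));
	rtlpriv->intf_ops->adapter_tx(hw, NULL, pskb, &tcb_desc);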
Fixes: 33511b157bbc ("rtlwifi: add support to send beacon frame.") Signed-off-by: Luis Felipe Dominguez Vega <lfdominguez@nauta.cu> Signed-off-by: Larry Finger <Larry.Finger@lwfinger.net> Signed-off-by: Kalle Valo <kvalo@codeaurora.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The driver code allows for the disabling of MSI interrupts; however the
module_param line was missed and the option fails to show with modinfo.
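The missing line has the usual shape (illustrative only; the mod_params
variable name belongs to the specific driver):

	module_param_named(msi, rtl_mod_params.msi_support, bool, 0444);
	MODULE_PARM_DESC(msi, "Set to 1 to use MSI interrupts mode (default 1)\n");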
Signed-off-by: Larry Finger <Larry.Finger@lwfinger.net> Signed-off-by: Kalle Valo <kvalo@codeaurora.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When the card is not owned by the PCIe bus, we need to
acquire ownership first. This flow is implemented in
iwl_pcie_prepare_card_hw. Because of a hardware bug, we
need to disable link power management before we can
request ownership; otherwise the other user of the device
won't get notified that we are requesting the device, which
will prevent us from acquiring ownership.
Same holds for the down flow where we need to make sure
that any other potential user is notified that the driver
is going down.
If rb->aux_refcount is decremented to zero before rb->refcount,
__rb_free_aux() may be called twice resulting in a double free of
rb->aux_pages. Fix this by adding a check to __rb_free_aux().
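A hedged sketch of the guard (simplified):

	static void __rb_free_aux(struct ring_buffer *rb)
	{
		int pg;

		if (rb->aux_nr_pages) {		/* the added check: free only once */
			for (pg = 0; pg < rb->aux_nr_pages; pg++)
				rb_free_aux_page(rb, pg);

			kfree(rb->aux_pages);
			rb->aux_nr_pages = 0;	/* a second call is now a no-op */
		}
	}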
Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: 57ffc5ca679f ("perf: Fix AUX buffer refcounting") Link: http://lkml.kernel.org/r/1437953468.12842.17.camel@decadent.org.uk Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
A recent fix to the shadow timestamp inadvertently broke the running time
accounting.
We must not update the running timestamp if we fail to schedule the
event, the event will not have ran. This can (and did) result in
negative total runtime because the stopped timestamp was before the
running timestamp (we 'started' but never stopped the event -- because
it never really started we didn't have to stop it either).
Reported-and-Tested-by: Vince Weaver <vincent.weaver@maine.edu> Fixes: 72f669c0086f ("perf: Update shadow timestamp before add event") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Shaohua Li <shli@fb.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Vince reported that the fasync signal stuff doesn't work properly for
inherited events. So fix that.
Installing fasync allocates memory and sets filp->f_flags |= FASYNC,
which upon the demise of the file descriptor ensures the allocation is
freed and state is updated.
Now for perf, we can have the events stick around for a while after the
original FD is dead because of references from child events. So we
cannot copy the fasync pointer around. We can however consistently use
the parent's fasync, as that will be updated.
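The fix boils down to a small helper, roughly (hedged sketch):

	static struct fasync_struct **perf_event_fasync(struct perf_event *event)
	{
		/* inherited child events never have their own fasync state;
		 * signal through the parent, whose state is kept up to date */
		if (event->parent)
			event = event->parent;

		return &event->fasync;
	}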
Fixes commit eae79b4f3e82 ("rsi: fix memory leak in rsi_load_ta_instructions()")
which stopped the driver from functioning.
Firmware data has been allocated using vmalloc(), resulting in memory
that cannot be used for DMA. Hence the firmware was first copied to a
buffer allocated with kmalloc() in the original code. This patch reverts
the commit and only calls "kfree()" to release the buffer after sending
the data. This fixes the memory leak without breaking the driver.
Add a comment to the kmemdup() calls to explain why this is done, and abort
if memory allocation fails.
Tested on a Topic Miami-Florida board which contains the rsi SDIO chip.
Also added the same kfree() call to the USB glue driver. This was not
tested on actual hardware though, as I only have the SDIO version.
Fixes: eae79b4f3e82 ("rsi: fix memory leak in rsi_load_ta_instructions()") Signed-off-by: Mike Looijmans <mike.looijmans@topic.nl> Signed-off-by: Kalle Valo <kvalo@codeaurora.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The BUG_ON() in purge_persistent_gnt() will be triggered when the previous
purge work hasn't finished.
There is a work_pending() check before this BUG_ON, but it doesn't account
for the case where the work is still currently running.
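A hedged sketch of the change (simplified): work_busy() also covers a
currently-running work item, where work_pending() does not.

	if (work_busy(&blkif->persistent_purge_work)) {
		pr_alert_ratelimited(DRV_PFX "previous purge work still busy, cannot purge list\n");
		return;
	}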
Acked-by: Roger Pau Monné <roger.pau@citrix.com> Signed-off-by: Bob Liu <bob.liu@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
We should consider info->feature_persistent when adding an indirect page to
the list info->indirect_pages, else the BUG_ON() in blkif_free() would be
triggered.
When we are using persistent grants the indirect_pages list
should always be empty because blkfront has pre-allocated enough
persistent pages to fill all requests on the ring.
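A hedged sketch of the guard in the completion path (simplified; here
indirect_page is the page just recovered from a completed request):

	if (!info->feature_persistent) {
		/* only recycle the page when persistent grants are not in
		 * use; otherwise info->indirect_pages must stay empty and
		 * the BUG_ON() in blkif_free() would fire */
		list_add(&indirect_page->lru, &info->indirect_pages);
	}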
Acked-by: Roger Pau Monné <roger.pau@citrix.com> Signed-off-by: Bob Liu <bob.liu@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Clocks 0 to 31 are on CKENA, and not CKENB. The clock register names
were inadvertently swapped. As a consequence, all clock operations were
happening on CKENB, because all but 2 clocks are on CKENA.
As the clocks were activated by the bootloader in the former tests, it
escaped the testing that the wrong clock gate was manipulated. The error
was revealed by changing the pxa3xx-nand driver to a module, where upon
unloading, the wrong clock was disabled in CKENB.
Fixes: 9bbb8a338fb2 ("clk: pxa: add pxa3xx clock driver") Signed-off-by: Robert Jarzmik <robert.jarzmik@free.fr> Signed-off-by: Stephen Boyd <sboyd@codeaurora.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Hugetlbfs pages will get a refcount in get_any_page() or
madvise_hwpoison() when soft offlining through madvise. The refcount,
which is held by the soft offline path, should be released if we fail to
isolate hugetlbfs pages.
Fix it by reducing the refcount for both isolation success and failure.
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
After trying to drain pages from pagevec/pageset, we try to get reference
count of the page again; however, the reference count of the page is not
reduced if the page is still not on the LRU list.
Fix it by adding the put_page() to drop the page reference which is from
__get_any_page().
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
!spin_is_locked() and spin_unlock_wait() are both only control barriers.
The code needs an acquire barrier, otherwise the cpu might perform read
operations before the lock test.
As no primitive exists inside <include/spinlock.h> and since it seems
no one wants another primitive, the code creates a local primitive within
ipc/sem.c.
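The local primitive, roughly as added to ipc/sem.c (hedged sketch):

	/*
	 * Pairs with the store-release implied by the spin_unlock() on the
	 * other side: upgrade the control dependency to a read barrier.
	 */
	#define ipc_smp_acquire__after_spin_is_unlocked()	smp_rmb()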
With regards to -stable:
The change of sem_wait_array() is a bugfix, the change to sem_lock() is a
nop (just a preprocessor redefinition to improve the readability). The
bugfix is necessary for all kernels that use sem_wait_array() (i.e.:
starting from 3.10).
The current semaphore code allows a potential use after free: in
exit_sem we may free the task's sem_undo_list while there is still
another task looping through the same semaphore set and cleaning the
sem_undo list at freeary function (the task called IPC_RMID for the same
semaphore set).
For example, with a test program [1] running which keeps forking a lot
of processes (which then do a semop call with SEM_UNDO flag), and with
the parent right after removing the semaphore set with IPC_RMID, and a
kernel built with CONFIG_SLAB, CONFIG_SLAB_DEBUG and
CONFIG_DEBUG_SPINLOCK, you can easily see something like the following
in the kernel log:
I wasn't able to trigger any badness on a recent kernel without the
proper config debugs enabled, however I have softlockup reports on some
kernel versions, in the semaphore code, which are similar as above (the
scenario is seen on some servers running IBM DB2 which uses semaphore
syscalls).
The patch here fixes the race against freeary, by acquiring or waiting
on the sem_undo_list lock as necessary (exit_sem can race with freeary,
while freeary sets un->semid to -1 and removes the same sem_undo from
list_proc or when it removes the last sem_undo).
After the patch I'm unable to reproduce the problem using the test case
[1].
/* Reconstructed for readability; the original posting elided the includes,
 * the NSET/NSEM constants and the thread() helper, so minimal assumed
 * versions are filled in here (a semop with SEM_UNDO, as the changelog
 * describes). */
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/wait.h>

#define NSET 5			/* assumed value */
#define NSEM 5			/* assumed value */

static int sid[NSET];

static void thread(void)	/* assumed minimal implementation */
{
	struct sembuf op = { .sem_num = 0, .sem_op = 1, .sem_flg = SEM_UNDO };

	semop(sid[rand() % NSET], &op, 1);
	exit(EXIT_SUCCESS);
}

void create_set()
{
	int i, j;
	pid_t p;
	union {
		int val;
		struct semid_ds *buf;
		unsigned short int *array;
		struct seminfo *__buf;
	} un;

	/* Create and initialize semaphore set */
	for (i = 0; i < NSET; i++) {
		sid[i] = semget(IPC_PRIVATE , NSEM, 0644 | IPC_CREAT);
		if (sid[i] < 0) {
			perror("semget");
			exit(EXIT_FAILURE);
		}
	}
	un.val = 0;
	for (i = 0; i < NSET; i++) {
		for (j = 0; j < NSEM; j++) {
			if (semctl(sid[i], j, SETVAL, un) < 0)
				perror("semctl");
		}
	}

	/* Launch threads that operate on semaphore set */
	for (i = 0; i < NSEM * NSET * NSET; i++) {
		p = fork();
		if (p < 0)
			perror("fork");
		if (p == 0)
			thread();
	}

	/* Free semaphore set */
	for (i = 0; i < NSET; i++) {
		if (semctl(sid[i], NSEM, IPC_RMID))
			perror("IPC_RMID");
	}

	/* Wait for forked processes to exit */
	while (wait(NULL)) {
		if (errno == ECHILD)
			break;
	};
}

int main(int argc, char **argv)
{
	pid_t p;

	srand(time(NULL));

	while (1) {
		p = fork();
		if (p < 0) {
			perror("fork");
			exit(EXIT_FAILURE);
		}
		if (p == 0) {
			create_set();
			goto end;
		}

		/* Wait for forked processes to exit */
		while (wait(NULL)) {
			if (errno == ECHILD)
				break;
		};
	}
end:
	return 0;
}
[akpm@linux-foundation.org: use normal comment layout] Signed-off-by: Herton R. Krzesinski <herton@redhat.com> Acked-by: Manfred Spraul <manfred@colorfullife.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Rafael Aquini <aquini@redhat.com> CC: Aristeu Rozanski <aris@redhat.com> Cc: David Jeffery <djeffery@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Josef Bacik [Tue, 25 Aug 2015 16:54:39 +0000 (12:54 -0400)]
Btrfs: make fsync fast again
Filipe removed the main reason I made fsync fast, which was to move the waiting
on ordered IO while we're logging extents. This patch preserves his need to
catch IO errors and my need for fsync performance to not suck donkey balls.
Thanks,
Filipe Manana [Thu, 13 Nov 2014 16:59:53 +0000 (16:59 +0000)]
Btrfs: don't ignore log btree writeback errors
If an error happens during writeback of log btree extents, make sure the
error is returned to the caller (fsync), so that it takes proper action
(commit current transaction) instead of writing a superblock that points
to log btrees with all or some nodes that weren't durably persisted.
Filipe Manana [Fri, 17 Apr 2015 16:08:37 +0000 (17:08 +0100)]
Btrfs: don't attach unnecessary extents to transaction on fsync
We don't need to attach ordered extents that have completed to the current
transaction. Doing so only makes us hold memory for longer than necessary
and delaying the iput of the inode until the transaction is committed (for
each created ordered extent we do an igrab and then schedule an asynchronous
iput when the ordered extent's reference count drops to 0), preventing the
inode from being evictable until the transaction commits.
Josef Bacik [Tue, 25 Aug 2015 17:01:40 +0000 (13:01 -0400)]
Btrfs: deal with error on secondary log properly
If we have an fsync at the same time in two separate subvolumes we could end up
with the tree log pointing at invalid blocks. We need to notice if our writeout
failed in any way; if it did then we need to do a full transaction commit and
return an error on the subvolume that gave us the io error. Thanks,
Filipe Manana [Wed, 19 Aug 2015 10:09:40 +0000 (11:09 +0100)]
Btrfs: fix file read corruption after extent cloning and fsync
If we partially clone one extent of a file into a lower offset of the
file, fsync the file, power fail and then mount the fs to trigger log
replay, we can get multiple checksum items in the csum tree that overlap
each other and result in checksum lookup failures later. Those failures
can make file data read requests assume a checksum value of 0, but they
will not return an error (-EIO for example) to userspace exactly because
the expected checksum value 0 is a special value that makes the read bio
endio callback return success and set all the bytes of the corresponding
page with the value 0x01 (at fs/btrfs/inode.c:__readpage_endio_check()).
From a userspace perspective this is equivalent to file corruption
because we are not returning what was written to the file.
Details about how this can happen, and why, are included inline in the
following reproducer test case for fstests and the comment added to
tree-log.c.
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
	_cleanup_flakey
	rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
. ./common/dmflakey
# real QA test starts here
_need_to_be_root
_supported_fs btrfs
_supported_os Linux
_require_scratch
_require_dm_flakey
_require_cloner
_require_metadata_journaling $SCRATCH_DEV
rm -f $seqres.full
_scratch_mkfs >>$seqres.full 2>&1
_init_flakey
_mount_flakey
# Create our test file with a single 100K extent starting at file
# offset 800K. We fsync the file here to make the fsync log tree gets
# a single csum item that covers the whole 100K extent, which causes
# the second fsync, done after the cloning operation below, to not
# leave in the log tree two csum items covering two sub-ranges
# ([0, 20K[ and [20K, 100K[) of our extent.
$XFS_IO_PROG -f -c "pwrite -S 0xaa 800K 100K" \
-c "fsync" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Now clone part of our extent into file offset 400K. This adds a file
# extent item to our inode's metadata that points to the 100K extent
# we created before, using a data offset of 20K and a data length of
# 20K, so that it refers to the sub-range [20K, 40K[ of our original
# extent.
$CLONER_PROG -s $((800 * 1024 + 20 * 1024)) -d $((400 * 1024)) \
	-l $((20 * 1024)) $SCRATCH_MNT/foo $SCRATCH_MNT/foo
# Now fsync our file to make sure the extent cloning is durably
# persisted. This fsync will not add a second csum item to the log
# tree containing the checksums for the blocks in the sub-range
# [20K, 40K[ of our extent, because there was already a csum item in
# the log tree covering the whole extent, added by the first fsync
# we did before.
$XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo
echo "File digest before power failure:"
md5sum $SCRATCH_MNT/foo | _filter_scratch
# Silently drop all writes and unmount to simulate a crash/power
# failure.
_load_flakey_table $FLAKEY_DROP_WRITES
_unmount_flakey
# Allow writes again, mount to trigger log replay and validate file
# contents.
# The fsync log replay first processes the file extent item
# corresponding to the file offset 400K (the one which refers to the
# [20K, 40K[ sub-range of our 100K extent) and then processes the file
# extent item for file offset 800K. It used to happen that when
# processing the later, it erroneously left in the csum tree 2 csum
# items that overlapped each other, 1 for the sub-range [20K, 40K[ and
# 1 for the whole range of our extent. This introduced a problem where
# subsequent lookups for the checksums of blocks within the range
# [40K, 100K[ of our extent would not find anything because lookups in
# the csum tree ended up looking only at the smaller csum item, the
# one covering the subrange [20K, 40K[. This made read requests assume
# an expected checksum with a value of 0 for those blocks, which caused
# checksum verification failure when the read operations finished.
# However those checksum failures did not result in read requests
# returning an error to user space (like -EIO for e.g.) because the
# expected checksum value had the special value 0, and in that case
# btrfs set all bytes of the corresponding pages with the value 0x01
# and produce the following warning in dmesg/syslog:
#
# "BTRFS warning (device dm-0): csum failed ino 257 off 917504 csum\
# 1322675045 expected csum 0"
#
_load_flakey_table $FLAKEY_ALLOW_WRITES
_mount_flakey
echo "File digest after log replay:"
# Must match the same digest we had after cloning the extent and
# before the power failure happened.
md5sum $SCRATCH_MNT/foo | _filter_scratch
_unmount_flakey
status=0
exit
Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
(cherry picked from commit b776fbaea280916fcdaa69bd9aef959aa3691e19)
Zygo Blaxell [Tue, 18 Aug 2015 00:19:19 +0000 (20:19 -0400)]
Merge tag 'v4.1.6' into zygo-4.1.6-zb64
This is the 4.1.6 stable release
# gpg: Signature made Sun Aug 16 23:52:57 2015 EDT using RSA key ID 6092693E
# gpg: Good signature from "Greg Kroah-Hartman (Linux kernel stable release signing key) <greg@kroah.com>"
# gpg: WARNING: This key is not certified with a trusted signature!
# gpg: There is no indication that the signature belongs to the owner.
# Primary key fingerprint: 647F 2865 4894 E3BD 4571 99BE 38DB BDC8 6092 693E