Mark Kettenis [Fri, 5 Jul 2002 09:26:57 +0000 (02:26 -0700)]
[PATCH] Fix note sections in ELF core dumps
Edition 4.1 of the System V Application Binary Interface says that
"The first namesz bytes in name contains a null-terminated
representation of the entry's owner or originator". This implies that
the terminating null is included in namesz, which is corroborated by
the example that follows the description. However, this is not what
the Linux kernel does when it writes its notes into an ELF core dump.
The attached patch fixes this.
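A minimal user-space sketch of the size arithmetic at issue, not the kernel's code. The helper names are hypothetical, and the 4-byte padding of the name field is assumed here as the usual convention for note entries.

```c
#include <stddef.h>
#include <string.h>

/* Round len up to a multiple of 4 (assumed note field alignment). */
size_t note_align4(size_t len)
{
    return (len + 3u) & ~(size_t)3u;
}

/* namesz as the ABI text reads: the owner string plus its terminating null. */
size_t note_namesz(const char *owner)
{
    return strlen(owner) + 1;
}

/* Bytes the name field occupies in the file once padded. */
size_t note_name_padded(const char *owner)
{
    return note_align4(note_namesz(owner));
}
```

For the owner string "CORE", namesz is 5 rather than 4, and the padded name field takes 8 bytes.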
In usb_bluetooth_probe, the transfer buffers for the write pool urbs
are allocated with size 0, because bluetooth->bulk_out_buffer_size isn't set
until after the loop.
USB: removed file ops from usb device structure
Moved the file ops and minor number stuff out of the usb structure.
Now usb_register_dev() and usb_deregister_dev() must be called if
you want to use the USB major number.
Paul Menage [Fri, 5 Jul 2002 04:11:23 +0000 (21:11 -0700)]
[PATCH] Shift BKL into ->statfs()
This patch removes BKL protection from the invocation of the
super_operations ->statfs() method, and shifts it into the filesystems
where necessary. Any out-of-tree filesystems may need to take the BKL in
their statfs() methods if they were relying on it for synchronisation.
All ->statfs() implementations have been modified to take the BKL,
except for those that don't reference any external mutable data or that
already have their own locking.
Additionally, capifs is changed to use simple_statfs rather than its
own home-grown version.
The BKL change has been flagged at the end of
Documentation/filesystems/porting, along with the recent change to
->permission BKL usage.
Alexander Viro [Thu, 4 Jul 2002 15:54:08 +0000 (08:54 -0700)]
[PATCH] ->i_dev switched to dev_t
* ->i_dev followed the example of ->s_dev - it's dev_t now. All
remaining uses of ->i_dev either outright want dev_t (stat()) or couldn't
care less (printing major:minor in /proc/<pid>/maps, etc.)
Alexander Viro [Thu, 4 Jul 2002 15:54:03 +0000 (08:54 -0700)]
[PATCH] assorted kdev_t cleanups in filesystems
* JFS uses its ->logdev only twice - one of the places assigns
it to_kdev_t(le32_to_cpu(...)), another uses kdev_t_to_nr() of it.
Switched to u32 - it's just a place where we store device number we'd got
from superblock.
* several reiserfs_fs.h function prototypes removed - functions
in question don't exist anymore.
* smbfs doesn't support device nodes; ->f_rdev removed.
Alexander Viro [Thu, 4 Jul 2002 15:53:44 +0000 (08:53 -0700)]
[PATCH] raid kdev_t cleanups - part 2
* a bunch of callers of partition_name() are now calling
bdev_partition_name().
* the last users of raid1 and multipath ->dev are gone; so are
the fields in question.
Alexander Viro [Thu, 4 Jul 2002 15:53:39 +0000 (08:53 -0700)]
[PATCH] raid ->diskop() splitup
* ->diskop() split into individual methods; prototypes cleaned
up. In particular, handling of hot_add_disk() gets mdk_rdev_t * of
the component we are adding as an argument instead of playing the games
with major/minor. Code cleaned up.
Alexander Viro [Thu, 4 Jul 2002 15:53:33 +0000 (08:53 -0700)]
[PATCH] raid kdev_t cleanups (part 1)
* ->error_handler() switched to struct block_device *.
* md_sync_acct() switched to struct block_device *.
* raid5 struct disk_info ->dev is gone - we use ->bdev everywhere.
* a bunch of kdev_same() calls are removed from drivers/md/*.c where we
have the corresponding struct block_device * and can simply compare them.
Alexander Viro [Thu, 4 Jul 2002 15:53:28 +0000 (08:53 -0700)]
[PATCH] kdev_t crapectomy
* since the last caller of is_read_only() is gone, the function
itself is removed.
* destroy_buffers() is not used anymore; gone.
* fsync_dev() is gone; the only user is (broken) lvm.c and first
step in fixing lvm.c will consist of propagating struct block_device *
anyway; at that point we'll just use fsync_bdev() in there.
* prototype of bio_ioctl() removed - function doesn't exist
anymore.
Alexander Viro [Thu, 4 Jul 2002 15:53:22 +0000 (08:53 -0700)]
[PATCH] cdrom.c cleanups
* Bunch of functions in cdrom.c used to get kdev_t and use it
only to do cdrom_find_device(dev), even though their callers already
had struct cdrom_device_info * in question. Switched to passing
said pointer directly.
* useless exports removed; stuff not used outside of cdrom.c
made static.
Alexander Viro [Thu, 4 Jul 2002 15:53:17 +0000 (08:53 -0700)]
[PATCH] (md.c) block device size cleanups
* calc_dev_sboffset() and calc_dev_size() in md.c are getting
mdk_rdev_t instead of kdev_t. Callers updated.
* calls of blkdev_size_in_bytes() in md.c replaced with use
of rdev->bdev->bd_inode->i_size.
Alexander Viro [Thu, 4 Jul 2002 15:53:12 +0000 (08:53 -0700)]
[PATCH] devpts cleanup
* devpts "upcalls" eliminated.
* instead of playing games with revalidation we simply use
ramfs-style tree and kill dentries upon devpts_pty_kill(). That
lets us get rid of a lot of code in fs/devpts/*.c.
* devpts_fs.h cleaned up.
* devpts/root.c and devpts/devpts_i.h removed.
* array of pointers to devpts inodes killed; with ramfs-style tree
it's not needed anymore.
* devpts/inode.c cleaned up.
* devpts_pty_new() used to get mk_kdev() only to convert it to
dev_t (hardly a surprise, since it's mknod() in disguise). Now it gets
dev_t as an argument.
Andrew Morton [Thu, 4 Jul 2002 15:32:06 +0000 (08:32 -0700)]
[PATCH] reduce lock contention in try_to_free_buffers()
The blockdev mapping's private_lock is fairly contended. The buffer
LRU cache fixed a lot of that, but under page replacement load,
try_to_free_buffers is still showing up.
Moving the freeing of buffer_heads outside the lock reduces contention
in there by 30%.
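The general shape of that change, as a user-space sketch rather than the kernel code: detach the objects while holding the lock, then free them after dropping it. A pthread mutex stands in for private_lock, and struct buf, struct buf_list and both functions are hypothetical names.

```c
#include <pthread.h>
#include <stdlib.h>

struct buf {
    struct buf *next;
};

struct buf_list {
    pthread_mutex_t lock;   /* stand-in for the mapping's private_lock */
    struct buf *head;
};

/* Allocate a buffer and link it in. Returns 0, or -1 if malloc fails. */
int add_buffer(struct buf_list *list)
{
    struct buf *b = malloc(sizeof(*b));

    if (!b)
        return -1;
    pthread_mutex_lock(&list->lock);
    b->next = list->head;
    list->head = b;
    pthread_mutex_unlock(&list->lock);
    return 0;
}

/* Detach the whole chain under the lock, free it outside. Returns count freed. */
int drain_buffers(struct buf_list *list)
{
    struct buf *to_free;
    int freed = 0;

    pthread_mutex_lock(&list->lock);
    to_free = list->head;   /* only the pointer swap is done under the lock */
    list->head = NULL;
    pthread_mutex_unlock(&list->lock);

    while (to_free) {
        struct buf *next = to_free->next;

        free(to_free);      /* the allocator is called with the lock dropped */
        to_free = next;
        freed++;
    }
    return freed;
}
```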
Andrew Morton [Thu, 4 Jul 2002 15:32:00 +0000 (08:32 -0700)]
[PATCH] debug: check page refcount in __free_pages_ok()
Add a BUG() check to __free_pages_ok() - to catch someone freeing a
page which has a non-zero refcount. Actually, this check is mainly to
catch someone (ie: shrink_cache()) incrementing a page's refcount
shortly after it has been freed.
Also clean up __free_pages_ok() a bit and convert lots of BUGs to BUG_ON.
Andrew Morton [Thu, 4 Jul 2002 15:31:50 +0000 (08:31 -0700)]
[PATCH] JBD commit callback capability
This is a patch which Stephen has applied to ext3's 2.4 repository.
Originally written by Andreas, generalised somewhat by Stephen.
Add jbd callback mechanism, requested for InterMezzo. We allow the jbd's
client to request notification when a given handle's IO finally commits to
disk, so that clients can manage their own writeback state asynchronously.
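A rough sketch of what such a callback mechanism could look like, outside the kernel. None of these names come from jbd; struct txn, struct commit_cb and the functions are hypothetical, and the real interface attaches callbacks to a handle rather than to a bare transaction.

```c
#include <stddef.h>

typedef void (*commit_cb_t)(void *data, int error);

struct commit_cb {
    struct commit_cb *next;
    commit_cb_t func;
    void *data;
};

struct txn {
    struct commit_cb *callbacks;   /* run once the commit reaches disk */
};

/* Register a caller-owned callback node on the transaction. */
void txn_add_callback(struct txn *t, struct commit_cb *cb,
                      commit_cb_t func, void *data)
{
    cb->func = func;
    cb->data = data;
    cb->next = t->callbacks;
    t->callbacks = cb;
}

/* Called when the commit has completed: run and clear all callbacks. */
int txn_commit_done(struct txn *t, int error)
{
    struct commit_cb *cb = t->callbacks;
    int ran = 0;

    t->callbacks = NULL;
    while (cb) {
        struct commit_cb *next = cb->next;

        cb->func(cb->data, error);
        cb = next;
        ran++;
    }
    return ran;
}

/* Example client callback: record the outcome in an int. */
void mark_written(void *data, int error)
{
    *(int *)data = error ? -1 : 1;
}
```

The client keeps its own writeback state (here a single int) and only updates it when the callback fires.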
Andrew Morton [Thu, 4 Jul 2002 15:31:45 +0000 (08:31 -0700)]
[PATCH] ext3 truncate fix
Forward-port of a fix which Stephen has applied to ext3's 2.4 CVS tree.
Fix for a rare problem seen under stress in data=journal mode: if we
have to restart a truncate transaction while traversing the inode's
direct blocks, we need to deal with bh==NULL in ext3_clear_blocks.
Andrew Morton [Thu, 4 Jul 2002 15:31:40 +0000 (08:31 -0700)]
[PATCH] combine generic_writepages() and mpage_writepages()
generic_writepages and mpage_writepages are basically identical,
except one calls ->writepage() and the other calls mpage_writepage().
This duplication is irritating.
The patch folds generic_writepages() into mpage_writepages(). It does
this rather kludgily: if the get_block argument to mpage_writepages()
is NULL then use ->writepage().
Can't think of a better way, really - we could go for a fully-blown
write_actor_t thing, but that would be overly elaborate and would not
allow mpage_writepage() to be inlined inside mpage_writepages(), which
is rather desirable.
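The NULL test described above, reduced to a toy that runs in user space. Pages are plain integers and every name here is hypothetical; only the dispatch on a NULL get_block argument mirrors the description.

```c
#include <stddef.h>

typedef int (*get_block_t)(int page_index);
typedef int (*writepage_t)(int page_index);

int mapped_calls;    /* pages sent down the get_block path */
int generic_calls;   /* pages sent through ->writepage() */

int count_get_block(int page_index)
{
    (void)page_index;
    mapped_calls++;
    return 0;
}

int count_writepage(int page_index)
{
    (void)page_index;
    generic_calls++;
    return 0;
}

/* Stand-in for the multi-page path, which needs get_block to map the page. */
static int mpage_path(int page_index, get_block_t get_block)
{
    return get_block(page_index);
}

/* One loop for both cases: a NULL get_block selects ->writepage(). */
int writepages_sketch(const int *pages, int npages,
                      get_block_t get_block, writepage_t writepage)
{
    int i, ret = 0;

    for (i = 0; i < npages && ret == 0; i++) {
        if (get_block)
            ret = mpage_path(pages[i], get_block);
        else
            ret = writepage(pages[i]);
    }
    return ret;
}
```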
Andrew Morton [Thu, 4 Jul 2002 15:31:30 +0000 (08:31 -0700)]
[PATCH] suppress more allocation failure warnings
The `page allocation failure' warning in __alloc_pages() is being a
pain. But I'm persisting with it...
The patch renames PF_RADIX_TREE to PF_NOWARN, and uses it in a few
places where allocation failures are known to happen. These code
paths are well-tested now and suppressing the warning is OK.
Andrew Morton [Thu, 4 Jul 2002 15:31:25 +0000 (08:31 -0700)]
[PATCH] always update page->flags atomically
move_from_swap_cache() and move_to_swap_cache() are playing with
page->flags nonatomically. The page is on the LRU at the time and
another CPU could be altering page->flags concurrently.
The patch converts those functions to use atomic operations.
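To illustrate the difference, here is a user-space sketch using C11 atomics as a stand-in for the kernel's atomic bit operations. The flag names and struct page_sketch are hypothetical. A plain "flags = flags & ~mask" is a separate load and store, so an update made by another CPU in between is lost; the atomic read-modify-write below cannot lose it.

```c
#include <stdatomic.h>

#define PGS_LOCKED     (1ul << 0)   /* hypothetical flag bits */
#define PGS_REFERENCED (1ul << 1)
#define PGS_DIRTY      (1ul << 2)

struct page_sketch {
    atomic_ulong flags;
};

/* Clear the bits in mask as one indivisible read-modify-write. */
void clear_flags_atomic(struct page_sketch *p, unsigned long mask)
{
    atomic_fetch_and(&p->flags, ~mask);
}

/* Set the bits in mask as one indivisible read-modify-write. */
void set_flags_atomic(struct page_sketch *p, unsigned long mask)
{
    atomic_fetch_or(&p->flags, mask);
}

unsigned long read_flags(struct page_sketch *p)
{
    return atomic_load(&p->flags);
}
```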
It also rationalises the number of bits which are cleared. It's not
really clear to me what page flags we really want to set to a known
state in there.
It had no right to go clearing PG_arch_1. I'm now clearing PG_arch_1
inside rmqueue() which is still a bit presumptuous.
btw: shmem uses PAGE_CACHE_SIZE and swapper_space uses PAGE_SIZE. I've
been carefully maintaining the distinction, but it looks like shmem
will break if we ever do make these values different.
Also, __add_to_page_cache() was performing a non-atomic RMW against
page->flags, under the assumption that it was a newly allocated page
which no other CPU would look at. Not true - this function is used for
moving anon pages into swapcache. Those anon pages are on the LRU -
other CPUs can be performing operations against page->flags while
__add_to_swap_cache is stomping on them. This had me running around in
circles for two days.
So let's move the initialisation of the page state into rmqueue(),
where the page really is new (could do it in page_cache_alloc,
perhaps).
The SetPageLocked() in __add_to_page_cache() is also rather curious.
Seems OK for both pagecache and swapcache so I covered that with a
comment.
2.4 has the same problem. Basically, add_to_swap_cache() can stomp on
another CPU's manipulation of page->flags. After a quick review of the
code there, it is barely conceivable that a concurrent refill_inactive()
could get its PG_referenced and PG_active bits scribbled on. Rather
unlikely because swap_out() will probably see PageActive() and bale
out.
Also, mark_dirty_kiobuf() could have its PG_dirty bit accidentally
cleared (but try_to_swap_out() sets it again later).
But there may be other code paths. Really, I think this needs fixing
in 2.4 - it's horrid.
Andrew Morton [Thu, 4 Jul 2002 15:31:20 +0000 (08:31 -0700)]
[PATCH] Use __GFP_HIGH in mpage_writepages()
In mpage_writepage(), use __GFP_HIGH when allocating the BIO: writeback
is a memory reclaim function and is entitled to dip into the page
reserves to get its IO underway.
Andrew Morton [Thu, 4 Jul 2002 15:31:15 +0000 (08:31 -0700)]
[PATCH] resurrect __GFP_HIGH
This patch reinstates __GFP_HIGH functionality.
__GFP_HIGH means "able to dip into the emergency pools". However,
somewhere along the line this got broken. __GFP_HIGH ceased to do
anything. Instead, !__GFP_WAIT is used to tell the page allocator to
try harder.
__GFP_HIGH makes sense. The concepts of "unable to sleep" and "should
try harder" are quite separate, and overloading !__GFP_WAIT to mean
"should access emergency pools" seems wrong.
This patch fixes a problem in mempool_alloc(). mempool_alloc() tries
the first allocation with __GFP_WAIT cleared. If that fails, it tries
again with __GFP_WAIT enabled (if the caller can support __GFP_WAIT).
So it is currently performing an atomic allocation first, even though
the caller said that they're prepared to go in and call the page
stealer.
I thought this was a mempool bug, but Ingo said:
> no, it's not GFP_ATOMIC. The important difference is __GFP_HIGH, which
> triggers the intrusive highprio allocation mode. Otherwise gfp_nowait is
> just a nonblocking allocation of the same type as the original gfp_mask.
> ...
> what i've added is a bit more subtle allocation method, with both
> performance and balancing-correctness in mind:
>
> 1. allocate via gfp_mask, but nonblocking
> 2. if failure => try to get from the pool if the pool is 'full enough'.
> 3. if failure => allocate with gfp_mask [which might block]
>
> there is performance data that this method improves bounce-IO performance
> significantly, because even under VM pressure (when gfp_mask would block)
> we can still use up to 50% of the memory pool without blocking (and
> without endangering deadlock-free allocation). Ie. the memory pool is also
> a fast 'frontside cache' of memory elements.
Ingo was assuming that __GFP_HIGH was still functional. It isn't, and the
mempool design wants it.
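The three steps Ingo lists can be sketched in user space as follows. This is not mempool.c: struct pool_sketch, SK_WAIT and the functions are hypothetical, the "at least half full" threshold is an assumption taken from the "up to 50%" remark, and the real code sleeps waiting for the pool rather than simply retrying the allocator.

```c
#include <stddef.h>

#define SK_WAIT 0x1u   /* hypothetical flag: caller is allowed to block */

struct pool_sketch {
    void *elems[8];
    int curr_nr;       /* elements currently held in reserve */
    int min_nr;        /* size of the reserve when full */
};

typedef void *(*alloc_fn)(unsigned int gfp_mask);

/* Example backend: refuses non-blocking requests, simulating VM pressure. */
void *pressured_alloc(unsigned int gfp_mask)
{
    static char slab[16];

    return (gfp_mask & SK_WAIT) ? slab : NULL;
}

void *pool_alloc_sketch(struct pool_sketch *pool, alloc_fn alloc,
                        unsigned int gfp_mask)
{
    void *elem;

    /* 1. Same mask as the caller's, minus the permission to block. */
    elem = alloc(gfp_mask & ~SK_WAIT);
    if (elem)
        return elem;

    /* 2. Take from the reserve while it is at least half full (assumed). */
    if (pool->curr_nr > 0 && pool->curr_nr * 2 >= pool->min_nr)
        return pool->elems[--pool->curr_nr];

    /* 3. Fall back to the caller's own mask, which may block. */
    if (gfp_mask & SK_WAIT)
        return alloc(gfp_mask);
    return NULL;
}
```

Step 1 only behaves as intended if the non-blocking attempt does not also dip into the emergency reserves, which is why the design wants a working __GFP_HIGH that is separate from !__GFP_WAIT.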
Andrew Morton [Thu, 4 Jul 2002 15:30:54 +0000 (08:30 -0700)]
[PATCH] set TASK_RUNNING in cond_resched()
do_select() does set_current_state(TASK_INTERRUPTIBLE) then calls
__pollwait() which calls __get_free_page() and the cond_resched() which
I added to the pagecache reclaim code never returns.
The patch makes cond_resched() more useful by setting current->state to
TASK_RUNNING before scheduling.
Andrew Morton [Thu, 4 Jul 2002 15:30:44 +0000 (08:30 -0700)]
[PATCH] shmem fixes
A shmem cleanup/bugfix patch from Hugh Dickins.
- Minor: in try_to_unuse(), only wait on writeout if we actually
started new writeout. Otherwise, there is no need because a
wait_on_page_writeback() has already been executed against this page.
And it's locked, so no new writeback can start.
- Minor: in shmem_unuse_inode(): remove all the
wait_on_page_writeback() logic. We already did that in
try_to_unuse(), and the page is locked so no new writeback can start.
- Less minor: add a missing page_cache_release() to
shmem_get_page_locked() in the uncommon case where the page was found
to be under writeout.
Andrew Morton [Thu, 4 Jul 2002 15:30:39 +0000 (08:30 -0700)]
[PATCH] remove swap_get_block()
Patch from Christoph Hellwig removes swap_get_block().
I was sort-of hanging onto this function because it is a standard
get_block function, and maybe perhaps it could be used to make swap use
the regular filesystem I/O functions. We don't want to do that, so
kill it.
Andrew Morton [Thu, 4 Jul 2002 15:30:34 +0000 (08:30 -0700)]
[PATCH] pdflush cleanup
Writeback/pdflush cleanup patch from Steven Augart.
* Exposes nr_pdflush_threads as /proc/sys/vm/nr_pdflush_threads, read-only.
(I like this - I expect that management of the pdflush thread pool
will be important for many-spindle machines, and this is a neat way
of getting at the info).
* Adds minimum and maximum checking to the five writable pdflush
and fs-writeback parameters.
* Minor indentation fix in sysctl.c
* mm/pdflush.c now includes linux/writeback.h, which prototypes
pdflush_operation. This is so that the compiler can
automatically check that the prototype matches the definition.
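The minimum and maximum checking amounts to a range test before the store. A tiny sketch, with a hypothetical helper name; rejecting the write outright (rather than clamping) is an assumption about the intended behaviour.

```c
/* Store value into *param only if min <= value <= max.
 * Returns 0 on success, -1 if the value is out of range (assumed policy). */
int set_bounded(int *param, int value, int min, int max)
{
    if (value < min || value > max)
        return -1;
    *param = value;
    return 0;
}
```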
Andrew Morton [Thu, 4 Jul 2002 15:30:20 +0000 (08:30 -0700)]
[PATCH] Remove ext3's buffer_head cache
Removes ext3's open-coded inode and allocation bitmap LRUs.
This patch includes a cleanup to ext3_new_block(). The local variables
`bh', `bh2', `i', `j', `k' and `tmp' have been renamed to something
more palatable.
Andrew Morton [Thu, 4 Jul 2002 15:30:10 +0000 (08:30 -0700)]
[PATCH] per-cpu buffer_head cache
ext2 and ext3 implement a custom LRU cache of buffer_heads - the eight
most-recently-used inode bitmap buffers and the eight MRU block bitmap
buffers.
I don't like them, for a number of reasons:
- The code is duplicated between filesystems
- The functionality is unavailable to other filesystems
- The LRU only applies to bitmap buffers. And not, say, indirects.
- The LRUs are subtly dependent upon lock_super() for protection:
without lock_super protection a bitmap could be evicted and freed
while in use.
And removing this dependence on lock_super() gets us one step on
the way toward getting that semaphore out of the ext2 block allocator -
it causes significant contention under some loads and should be a
spinlock.
- The LRUs pin 64 kbytes per mounted filesystem.
Now, we could just delete those LRUs and rely on the VM to manage the
memory. But that would introduce significant lock contention in
__find_get_block - the blockdev mapping's private_lock and page_lock
are heavily used.
So this patch introduces a transparent per-CPU bh lru which is hidden
inside __find_get_block(), __getblk() and __bread(). It is designed to
shorten code paths and to reduce lock contention. It uses a seven-slot
LRU. It achieves a 99% hit rate in `dbench 64'. It provides benefit
to all filesystems.
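A sketch of a small move-to-front LRU of the kind described, in plain C. It is not the code from fs/buffer.c: struct bh_sketch, struct bh_lru and both functions are hypothetical, and reference counting is left out. One such array per CPU would need no lock, since only its own CPU touches it.

```c
#include <stddef.h>

#define LRU_SLOTS 7

struct bh_sketch {
    unsigned long block;
};

struct bh_lru {
    struct bh_sketch *slot[LRU_SLOTS];   /* slot[0] is most recently used */
};

/* Put bh at the front. If it was already cached, close the gap it leaves;
 * otherwise the entry in the last slot falls off the end. */
void lru_install(struct bh_lru *lru, struct bh_sketch *bh)
{
    int i, found = LRU_SLOTS - 1;

    for (i = 0; i < LRU_SLOTS; i++) {
        if (lru->slot[i] == bh) {
            found = i;
            break;
        }
    }
    for (i = found; i > 0; i--)
        lru->slot[i] = lru->slot[i - 1];
    lru->slot[0] = bh;
}

/* Return the cached buffer for block and promote it, or NULL on a miss. */
struct bh_sketch *lru_lookup(struct bh_lru *lru, unsigned long block)
{
    int i;

    for (i = 0; i < LRU_SLOTS; i++) {
        struct bh_sketch *bh = lru->slot[i];

        if (bh && bh->block == block) {
            lru_install(lru, bh);
            return bh;
        }
    }
    return NULL;   /* caller falls back to the slow, locked lookup */
}
```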
The next patches remove the open-coded LRUs from ext2 and ext3.
Taken together, these patches are a code cleanup (300-400 lines gone),
and they reduce lock contention. Anton tested these patches on the
32-way and demonstrated a throughput improvement of up to 15% on
RAM-only dbench runs. See http://samba.org/~anton/linux/2.5.24/dbench/
Most of this benefit is from avoiding find_get_page() on the blockdev
mapping. Because the generic LRU copes with indirect blocks as well as
bitmaps.
Andrew Morton [Thu, 4 Jul 2002 15:30:04 +0000 (08:30 -0700)]
[PATCH] Fix 3c59x driver for some 3c566B's
Fix from Rahul Karnik and Donald Becker - some new 3c566B mini-PCI NICs
refuse to power up the transceiver unless we tickle an undocumented bit
in an undocumented register. They worked this out by before-and-after
diffing of the register contents when it was set up by the Windows
driver.
o Add a + to $(MAKEBOOT), so that make knows that it's a recursive make
invocation.
o For files which are generated like .map -> .c -> .o,
add an explicit dependency for .c -> .o.
Otherwise, make sees the .c as an intermediate object and removes it,
causing an unnecessary recompilation at next invocation.
Matthew Dharm [Thu, 4 Jul 2002 11:21:24 +0000 (04:21 -0700)]
[PATCH] usb-storage: remove timer
This removes the timer usage in usb-storage. This cleans up quite a bit
of the state machine and eliminates quite a few potential races.
Initialization commands and other non-data-path mechanisms use the USB core
timeout mechanism. Anything in the data path uses the SCSI mid-layer
mechanism.
James Bottomley [Thu, 4 Jul 2002 05:40:19 +0000 (22:40 -0700)]
[PATCH] fix SCSI driverfs for IDE panic on boot.
This panic was reported to lkml by Anton Altaparmakov. The code added to
partitions/check.c to add partitions to driverfs requires preparation by the
calling entity. There's a NULL pointer check to see if the calling entity
actually did the preparation, but IDE forgets to clear the area it kmalloc's
for struct genhd so the pointer contains junk.
The fix is just to clear the struct genhd before IDE uses it.
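In user-space terms the fix looks like the sketch below. struct disk_sketch and its driverfs_dev field are hypothetical stand-ins for struct genhd and the pointer being checked; malloc() plays the role of kmalloc().

```c
#include <stdlib.h>
#include <string.h>

struct disk_sketch {
    void *driverfs_dev;   /* hypothetical: set only if the caller prepared it */
    int minors;
};

/* Allocate and zero the structure, so an unprepared pointer reads as NULL. */
struct disk_sketch *alloc_disk_sketch(void)
{
    struct disk_sketch *d = malloc(sizeof(*d));

    if (d)
        memset(d, 0, sizeof(*d));   /* malloc() leaves the contents indeterminate */
    return d;
}

/* The NULL check is only meaningful once the structure has been cleared. */
int disk_is_prepared(const struct disk_sketch *d)
{
    return d->driverfs_dev != NULL;
}
```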
Removed usb-uhci-hcd.o from the list of UHCI drivers.
This allowed the logic to be cleaned up.
Removed CONFIG_EXPERIMENTAL dependency, as it's no longer needed.
NTFS: 2.0.14 - Run list merging code cleanup, minor locking changes, typo fixes.
- Change fs/ntfs/super.c::ntfs_statfs() to not rely on BKL by moving
the locking out of super.c::get_nr_free_mft_records() and taking and
dropping the mftbmp_lock rw_semaphore in ntfs_statfs() itself.
- Bring attribute run list merging code (fs/ntfs/attrib.c) in sync with
current userspace ntfs library code. This means that if a merge
fails the original run lists are always left unmodified instead of
being silently corrupted.
- Misc typo fixes.
Matthew Wilcox [Tue, 2 Jul 2002 04:58:30 +0000 (21:58 -0700)]
[PATCH] rewrite find_vma_prev
For PA-RISC, we need find_vma_prev to return `prev', even if vma is NULL.
Our stack is at the top of memory, growing upwards, so when we page fault
we need to see prev. For added bonus points, the code becomes simpler,
less indented, shorter and (for me, anyway) easier to understand. The
code is well-tested, even on x86. For PA and ia64 this code is called in
the page fault handler path so it is exercised frequently.
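The contract described here, shown on a sorted singly linked list instead of the kernel's tree. struct vma_sketch and find_vma_prev_sketch() are hypothetical; the point is only that *pprev is filled in even when no area lies at or above addr.

```c
#include <stddef.h>

struct vma_sketch {
    unsigned long start, end;    /* covers [start, end) */
    struct vma_sketch *next;     /* list is sorted by address */
};

/* Return the first area whose end is above addr, or NULL if there is none.
 * *pprev always receives the area just before that point (or NULL). */
struct vma_sketch *find_vma_prev_sketch(struct vma_sketch *head,
                                        unsigned long addr,
                                        struct vma_sketch **pprev)
{
    struct vma_sketch *prev = NULL, *vma = head;

    while (vma && vma->end <= addr) {
        prev = vma;
        vma = vma->next;
    }
    *pprev = prev;   /* valid even when vma is NULL */
    return vma;
}
```

With a stack that grows upwards from the last area, a fault above it returns NULL but still hands back that last area through pprev.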
Matthew Wilcox [Mon, 1 Jul 2002 08:36:58 +0000 (04:36 -0400)]
[PATCH] softscsi patch
Doug Gilbert and James Bottomley hassled me all through KernelSummit &
OLS to explain about softirqs, tasklets and bottom halves. In the end,
it was easier to write the code myself. Thanks to James for pointing
out that the pointer handling in my original code was completely broken
and helping me debug.
I've booted this patch on a 4-way system at OSDL with two Adaptec SCSI
cards. I haven't tried stressing it (not quite sure which discs I can
use ;-), and I don't understand the locking in the scsi subsystem at all.
The main effect of applying this patch is that scsi_softirq() [was
scsi_tasklet_func, and before that scsi_bottom_half_handler()] can now be
run on multiple CPUs at the same time. We _seem_ to do enough locking
elsewhere in the SCSI stack that this is safe. But someone who really
understands the SCSI stack should audit this.
This work shows up a hole in the current softirq API -- there's no support
for unregistering a softirq (close_softirq or similar). We should do
this in scsi_exit -- make sure no softirqs are running while we unload.
This probably isn't a problem in practice, but it'd be nice to fix it.