the attached patch (against BK-curr) fixes a bug in the new PID allocator
that can cause incorrect hashing of the PID structure, which in turn causes
infinite loops in find_pid() [and potentially other problems].
Andrew Morton [Thu, 19 Sep 2002 15:37:21 +0000 (08:37 -0700)]
[PATCH] reduced locking in release_pages()
From Marcus Alanen <maalanen@ra.abo.fi>
Don't retake the zone lock after spilling a batch of pages into the
buddy.
Instead, just clear local variable `zone' to indicate that no lock is
held.
This is actually a common case - whenever release_pages() is called
with exactly 16 pages (truncate, page reclaim..) Marcus' patch will
save a lock and an unlock.
Also, remove some lock-avoidance heuristics in
pagevec_deactivate_inactive(): the caller has already made these
checks, and the chance of the check here actually doing anything useful
is negligible.
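A minimal userspace sketch of the locking pattern described above, not the
kernel code itself. The names zone_sim, zone_lock, zone_unlock and
release_pages_sim are hypothetical, the lock is only a counter, and all
pages are assumed to belong to one zone. The batch size of 16 is taken from
the text.

```c
#include <stddef.h>

#define BATCH 16  /* batch size mentioned in the commit text */

/* Hypothetical stand-in for a zone: just counts lock acquisitions. */
struct zone_sim {
    int locked;
    unsigned long lock_count;
};

static void zone_lock(struct zone_sim *z)   { z->locked = 1; z->lock_count++; }
static void zone_unlock(struct zone_sim *z) { z->locked = 0; }

/*
 * Release nr pages that all belong to zone z and return how many times
 * the lock was taken. After a full batch is spilled, the local `zone' is
 * cleared to mean "no lock held"; the lock is retaken only if another
 * page actually shows up.
 */
static unsigned long release_pages_sim(struct zone_sim *z, size_t nr)
{
    struct zone_sim *zone = NULL;   /* NULL means: no lock held */
    size_t batched = 0;

    for (size_t i = 0; i < nr; i++) {
        if (zone == NULL) {
            zone = z;
            zone_lock(zone);
        }
        if (++batched == BATCH) {
            zone_unlock(zone);      /* drop the lock to spill the batch */
            batched = 0;
            zone = NULL;            /* do not retake the lock here */
        }
    }
    if (zone != NULL)
        zone_unlock(zone);
    return z->lock_count;
}
```

With exactly 16 pages the sketch takes the lock once; retaking it
unconditionally after the spill would have cost a second lock/unlock pair.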
Andrew Morton [Thu, 19 Sep 2002 15:37:08 +0000 (08:37 -0700)]
[PATCH] hugetlbpages cleanup
From Christoph Hellwig, acked by Rohit.
- fix config.in description: we know we're on i386 and we also know
that a feature can only be enabled if the hw supports it, the code
alone is not enough
- the sysctl is VM-related, so move it from /proc/sys/kernel to
/proc/sys/vm
Andrew Morton [Thu, 19 Sep 2002 15:36:56 +0000 (08:36 -0700)]
[PATCH] fix mmap(MAP_LOCKED)
From Hubertus Franke.
The MAP_LOCKED flag to mmap() currently does nothing. Hubertus' patch
fixes it so that the relevant mapping is locked into memory, if the
caller has CAP_IPC_LOCK.
Andrew Morton [Thu, 19 Sep 2002 15:36:47 +0000 (08:36 -0700)]
[PATCH] readv/writev bounds checking fixes
- writev currently returns -EFAULT if _any_ of the segments has an
invalid address. We should only return -EFAULT if the first segment
has a bad address.
If some of the first segments have valid addresses we need to write
them and return a partial result.
- The current code only checks if the sum-of-lengths is negative. If
individual segments have a negative length but the result is positive
we miss that.
So rework the code to detect this, and to be immune to odd wrapping
situations.
As a bonus, we save one pass across the iovec.
- ditto for readv.
The check for "does any segment have a negative length" has already
been performed in do_readv_writev(), but it's basically free here, and
we need to do it for generic_file_read/write anyway.
This all means that the iov_length() function is unsafe because of
wrap/overflow issues. It should only be used after the
generic_file_read/write or do_readv_writev() checking has been
performed. Its callers have been reviewed and they are OK.
The code now passes LTP testing and has been QA'd by Janet's team.
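The single-pass check described above can be sketched in userspace as
follows. This is an illustration under stated assumptions, not the kernel
implementation: check_iovec, addr_ok_fn, non_null_ok and MAX_RW_LEN are
hypothetical names, the address check is a stand-in for the kernel's
user-pointer validation, and returning -EINVAL for a negative or wrapping
length is an assumption.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Largest length that is still non-negative when viewed as ssize_t. */
#define MAX_RW_LEN ((size_t)-1 >> 1)

/* Hypothetical address check standing in for the kernel's validation. */
typedef int (*addr_ok_fn)(const void *base, size_t len);

/* Toy checker for the example: only a NULL base counts as a bad address. */
static int non_null_ok(const void *base, size_t len)
{
    (void)len;
    return base != NULL;
}

/*
 * Validate an iovec in one pass. Returns the number of bytes that may be
 * transferred, or a negative errno. On a bad address after the first
 * segment, *nr_segs is trimmed so the valid prefix can still be written.
 */
static ssize_t check_iovec(const struct iovec *iov, size_t *nr_segs,
                           addr_ok_fn addr_ok)
{
    size_t count = 0;

    for (size_t seg = 0; seg < *nr_segs; seg++) {
        const struct iovec *iv = &iov[seg];

        /* Reject a "negative" segment length or a wrapping running sum. */
        if (iv->iov_len > MAX_RW_LEN)
            return -EINVAL;
        count += iv->iov_len;
        if (count > MAX_RW_LEN)
            return -EINVAL;

        if (addr_ok(iv->iov_base, iv->iov_len))
            continue;
        if (seg == 0)
            return -EFAULT;        /* only a bad first segment is fatal */
        *nr_segs = seg;            /* keep the valid prefix */
        count -= iv->iov_len;      /* this segment is not usable */
        break;
    }
    return (ssize_t)count;
}
```

Because each length is bounded before it is added, the running sum cannot
wrap past zero unnoticed, which is the property a plain sum-of-lengths
check lacks.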
Andrew Morton [Thu, 19 Sep 2002 15:36:43 +0000 (08:36 -0700)]
[PATCH] writev speedup
A patch from Hirokazu Takahashi to speed up the new sped-up writev
code.
Instead of running ->prepare_write/->commit_write for each individual
segment, we walk the segments between prepare and commit. So
potentially much larger amounts of data are passed to commit_write(),
and prepare_write() is called much less often.
Added bonus: the segment walk happens inside the kmap_atomic(), so we
run kmap_atomic() once per page, not once per segment.
We've demonstrated a speedup of over 3x. This is writing 1024-segment
iovecs where the individual segments have an average length of 24
bytes, which is a favourable case for this patch.
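To illustrate why the per-page walk calls prepare/commit so much less
often, here is a small userspace sketch that only counts calls. It does not
model the real write path; count_prepare_commit, call_counts and
PAGE_SIZE_SIM are hypothetical names, and a 4096-byte page is assumed.

```c
#include <stddef.h>

#define PAGE_SIZE_SIM 4096u   /* assumed page size */

struct call_counts {
    unsigned long per_segment;  /* one prepare/commit per segment chunk */
    unsigned long per_page;     /* one prepare/commit per page touched */
};

/*
 * Walk an iovec-like list of segment lengths starting at file position
 * pos and count how many prepare/commit pairs each strategy would need.
 */
static struct call_counts count_prepare_commit(const size_t *seg_len,
                                               size_t nr_segs, size_t pos)
{
    struct call_counts c = {0, 0};
    size_t seg = 0, seg_off = 0;

    while (seg < nr_segs) {
        size_t room = PAGE_SIZE_SIM - (pos % PAGE_SIZE_SIM);
        int touched = 0;

        /* Copy as many segments (or parts of them) as fit in this page. */
        while (seg < nr_segs && room > 0) {
            size_t left = seg_len[seg] - seg_off;
            size_t n = left < room ? left : room;

            if (n > 0) {
                c.per_segment++;
                touched = 1;
            }
            seg_off += n;
            pos += n;
            room -= n;
            if (seg_off == seg_len[seg]) {
                seg++;
                seg_off = 0;
            }
        }
        if (touched)
            c.per_page++;
    }
    return c;
}
```

For 1024 segments of 24 bytes each starting at offset 0, the data covers
six 4096-byte pages, so the per-page strategy needs 6 pairs of calls where
the per-segment strategy needs over a thousand. The call counts alone do
not predict the measured speedup.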
Andrew Morton [Thu, 19 Sep 2002 15:36:39 +0000 (08:36 -0700)]
[PATCH] swapout fix
Silly bug which was halving swapout bandwidth: we've taken a copy of
page->mapping into a local convenience variable, but forgot to update
that local after adding the page to swapcache.
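The class of bug described here, a cached local going stale, can be shown
with a tiny sketch. All names (page_sim, mapping_sim, swapper_space_sim,
add_to_swap_sim, pick_mapping) are hypothetical and the code is not taken
from the patch.

```c
#include <stddef.h>

struct mapping_sim { const char *name; };
struct page_sim    { struct mapping_sim *mapping; };

static struct mapping_sim swapper_space_sim = { "swap" };

/* Adding the page to the swap cache changes page->mapping. */
static void add_to_swap_sim(struct page_sim *page)
{
    page->mapping = &swapper_space_sim;
}

/*
 * Return the mapping the caller goes on to use. With refresh == 0 the
 * local copy is left stale after add_to_swap_sim(), which is the bug;
 * with refresh != 0 the local is reloaded, which is the fix.
 */
static struct mapping_sim *pick_mapping(struct page_sim *page, int refresh)
{
    struct mapping_sim *mapping = page->mapping;  /* convenience copy */

    if (mapping == NULL) {
        add_to_swap_sim(page);
        if (refresh)
            mapping = page->mapping;              /* update the local */
    }
    return mapping;
}
```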
Andrew Morton [Thu, 19 Sep 2002 15:36:29 +0000 (08:36 -0700)]
[PATCH] remove statm_pgd_range
Bill Irwin's patch to avoid having to walk pagetables while generating
/proc/<pid>/statm output.
It can significantly overstate the size of various mappings because it
assumes that all VMAs are fully populated.
But spending 100% of one of my four CPUs running top(1) is a bug.
Bill says this fixes a bug, too. The `SIZE' parameter is supposed to
display the amount of memory which the process would consume if it
faulted everything in. But "before it only showed instantiated
3rd-level pagetables, so if something within a 4MB aligned range hadn't
been faulted in it would slip past the old one".
Andrew Morton [Thu, 19 Sep 2002 15:36:22 +0000 (08:36 -0700)]
[PATCH] _alloc_pages cleanup
Patch from Martin Bligh. It should only affect machines using
discontigmem.
"This patch was originally from Andrea's tree (from SGI??), and has
been tweaked since by both Christoph (who cleaned up all the code),
and myself (who just hit it until it worked).
It removes _alloc_pages, and adds all nodes to the zonelists
directly, which also changes the fallback zone order to something more
sensible ... instead of: "foreach (node) { foreach (zone) }" we now
do something more like "foreach (zone_type) { foreach (node) }"
Christoph has a more recent version that's fancier and does a couple
more cleanups, but it seems to have a bug in it that I can't track
down easily, so I propose we do the simple thing for now, and take the
rest of the cleanups when it works ... it seems to build nicely on
top of this separately to me.
Tested on 16-way NUMA-Q with discontigmem + NUMA support."
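The change in fallback order quoted above can be sketched as two loop
nests. This is only an illustration: the names zone_ref, build_node_major
and build_zone_major are hypothetical, two nodes and three zone types are
assumed, and walking zone types from highest to lowest is an assumption
about the fallback direction.

```c
#include <stddef.h>

#define NR_NODES_SIM 2   /* assumed node count for the example */

enum zone_type_sim {
    ZONE_DMA_SIM,
    ZONE_NORMAL_SIM,
    ZONE_HIGHMEM_SIM,
    NR_ZONE_TYPES_SIM
};

struct zone_ref {
    int node;
    enum zone_type_sim type;
};

/* Old order: foreach (node) { foreach (zone) } */
static size_t build_node_major(struct zone_ref *out)
{
    size_t n = 0;

    for (int node = 0; node < NR_NODES_SIM; node++)
        for (int t = NR_ZONE_TYPES_SIM - 1; t >= 0; t--)
            out[n++] = (struct zone_ref){ node, (enum zone_type_sim)t };
    return n;
}

/* New order: foreach (zone_type) { foreach (node) } */
static size_t build_zone_major(struct zone_ref *out)
{
    size_t n = 0;

    for (int t = NR_ZONE_TYPES_SIM - 1; t >= 0; t--)
        for (int node = 0; node < NR_NODES_SIM; node++)
            out[n++] = (struct zone_ref){ node, (enum zone_type_sim)t };
    return n;
}
```

In the node-major list the second entry is a lower zone on the same node;
in the zone-major list it is the same zone type on the next node, so lower
zones are only tried after every node's higher zone.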
Andrew Morton [Thu, 19 Sep 2002 15:35:54 +0000 (08:35 -0700)]
[PATCH] free_area_init cleanup
Patch from Martin Bligh. It should only affect machines using
discontigmem.
"This patch cleans up free_area_init stuff, and undefines mem_map and
max_mapnr for discontigmem, where they were horrible kludges anyway
... We just use the lmem_maps instead, which makes much more sense.
It also kills pgdat->node_start_mapnr, which is tarred with the same
brush.
It breaks free_area_init_core into a couple of sections, pulls the
allocation of the lmem_map back into the next higher function, and
passes more things via the pgdat. But that's not very interesting,
the objective was to kill mem_map for discontigmem, which seems to
attract bugs like flypaper. This brings any misuses to obvious
compile-time errors rather than weird oopses, which I can't help but
feel is a good thing.
It does break other discontigmem architectures, but in a very obvious
way (they won't compile) and it's easy to fix. I think that's a small
price to pay ... ;-) At some point soon I will follow up with a patch
to remove free_area_init_node for the contig mem case, or at the very
least rename it to something more sensible, like __free_area_init.
Christoph has grander plans to kill mem_map more extensively in
addition to the attached, but I've heard nobody disagree that it
should die for the discontigmem case at least.
Oh, and I renamed mem_map in drivers/pcmcia/sa1100 to pc_mem_map
because my tiny little brain (and cscope) find it confusing like that.
Tested on 16-way NUMA-Q with discontigmem + NUMA support and on a
standard PC (well, boots and appears functional). On top of
2.5.33-mm4"
Andrew Morton [Thu, 19 Sep 2002 15:35:46 +0000 (08:35 -0700)]
[PATCH] clean up argument passing in writeback paths
The writeback code paths which walk the superblocks and inodes are
getting an increasing number of arguments passed to them.
The patch wraps those args into the new `struct writeback_control',
and uses that instead. There is no functional change.
The new writeback_control structure is passed down through the
writeback paths in the place where the old `nr_to_write' pointer used
to be.
writeback_control will be used to pass new information up and down the
writeback paths. Such as whether the writeback should be non-blocking,
and whether queue congestion was encountered.
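A rough sketch of the idea of bundling writeback arguments into one
structure. The structure and function names (wb_control_sim,
writeback_inodes_sim) and the field layout are hypothetical; only
nr_to_write comes from the text, and the two flags model the future uses
the message mentions, not anything this patch adds.

```c
/* Hypothetical stand-in for the control structure passed down the paths. */
struct wb_control_sim {
    long nr_to_write;            /* pages still to write (was a bare pointer) */
    int nonblocking;             /* passed down: caller must not block */
    int encountered_congestion;  /* passed up: queue congestion was seen */
};

/*
 * Write up to wbc->nr_to_write of dirty_pages and return how many were
 * written. A non-blocking caller stops at a congested queue and learns
 * about it through the same structure.
 */
static long writeback_inodes_sim(struct wb_control_sim *wbc,
                                 long dirty_pages, int queue_congested)
{
    long written = 0;

    while (dirty_pages > 0 && wbc->nr_to_write > 0) {
        if (queue_congested && wbc->nonblocking) {
            wbc->encountered_congestion = 1;
            break;
        }
        dirty_pages--;
        wbc->nr_to_write--;
        written++;
    }
    return written;
}
```

Adding another input or output later then means adding a field, not
changing the signature of every function along the path.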
Jeff Garzik [Thu, 19 Sep 2002 17:28:05 +0000 (13:28 -0400)]
Update eepro100 net driver's mdio_{read,write} functions
to take 'struct net_device *' not 'long' as their first
argument. This makes eepro100 compatible with the standard
MII ethtool API, preparing it for that support.
No functional changes should occur with this patch; if anything
changes at all it is a bug. (and testing shows no changes...)
David S. Miller [Thu, 19 Sep 2002 15:07:35 +0000 (08:07 -0700)]
[PATCH] block device oopses on shutdown in 2.5.x
The partition code registers a generic device for disks
which have a dev->driver non-NULL but whose dev->driver->remove
points into outer space. So when reboot happens --> OOPS
in drivers/base/power.c:device_shutdown()
Ok, amusingly in my case dev->driver == &scsi_done(), hehe :-)
Two cases of uninitialized memory spotted, here is the patch.
Jeff Garzik [Thu, 19 Sep 2002 13:22:47 +0000 (09:22 -0400)]
more fixes for sundance net driver:
* default to PIO (fixes bugs in some chips), but add CONFIG_xxx option
for MMIO
* proper support for variable MTU sizes
* add missing unregister_netdev in an error path
(with a kudos to Jason Lunz for merging most of this)
Remove bogus timer optimization - even if the timer isn't pending,
it might be actively running on another CPU, so we still need to
do the synchronous wait.
Scott Murray [Thu, 19 Sep 2002 10:29:16 +0000 (03:29 -0700)]
[PATCH] Small pcihpfs dnotify fix
I've been working on a userspace daemon to go with my CompactPCI driver,
and yesterday I discovered an oversight in pci_hp_change_slot_info - it
doesn't call dnotify_parent, so dnotify based clients basically don't
work against pcihpfs. The following patch (against 2.5 BK) reworks
things to just update the mtime (since we're modifying the file after
all), and then call dnotify_parent.
Jeff Garzik [Thu, 19 Sep 2002 09:05:29 +0000 (05:05 -0400)]
sundance net driver fixes, and a few cleanups too:
- Remove unused/constant members from struct pci_id_info
(which then allows removal of 'drv_flags' from private struct)
- If no phy is found, fail to load that board
- Always start phy id scan at id 1 to avoid problems (Donald Becker)
- Autodetect where mii_preamble_required is needed,
default to not needed. (Donald Becker)
This is the latest version of the generic pidhash patch. The biggest
change is the removal of separately allocated pid structures: they are
now part of the task structure and the first task that uses a PID will
provide the pid structure. Task refcounting is used to avoid the
freeing of the task structure before every member of a process group or
session has exited.
This approach has a number of advantages besides the performance gains.
Besides simplifying the whole hashing code significantly, attach_pid()
is now fundamentally atomic and can be called during create_process()
without worrying about task-list side-effects. It does not have to
re-search the pidhash to find out about raced PID-adding either, and
attach_pid() cannot fail due to OOM. detach_pid() can do a simple
put_task_struct() instead of the kmem_cache_free().
The only minimal downside is the potential pending task structures after
session leaders or group leaders have exited - but the number of orphan
sessions and process groups is usually very low - and even if it's
higher, this can be regarded as a slow execution of the final
deallocation of the session leader, not some additional burden.
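The refcounting scheme described above can be sketched in userspace like
this. It is a simplified model, not the patch: task_sim, pid_sim,
attach_pid_sim, detach_pid_sim and put_task_sim are hypothetical names,
there is no hashing or locking, and "freeing" is only a flag.

```c
#include <stddef.h>

struct task_sim;

/* The pid structure lives inside the task that first used the PID. */
struct pid_sim {
    int nr;
    int users;
    struct task_sim *owner;
};

struct task_sim {
    int usage;            /* reference count on the task structure */
    int freed;            /* set when the last reference is dropped */
    struct pid_sim pid;   /* embedded, so attaching never allocates */
};

static void put_task_sim(struct task_sim *t)
{
    if (--t->usage == 0)
        t->freed = 1;
}

/* Attach one user to the pid provided by owner; cannot fail due to OOM. */
static struct pid_sim *attach_pid_sim(struct task_sim *owner, int nr)
{
    struct pid_sim *pid = &owner->pid;

    if (pid->users == 0) {
        pid->nr = nr;
        pid->owner = owner;
    }
    pid->users++;
    owner->usage++;       /* pin the task that provides the storage */
    return pid;
}

/* Detach is a plain reference drop instead of a cache free. */
static void detach_pid_sim(struct pid_sim *pid)
{
    pid->users--;
    put_task_sim(pid->owner);
}
```

If a group leader exits while a member is still attached, its task
structure stays pending until the member detaches, which matches the
downside discussed in the message.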