Andrew Morton [Sat, 8 Jan 2005 06:03:01 +0000 (22:03 -0800)]
[PATCH] readpage-vs-invalidate fix
A while ago we merged a patch which tried to solve a problem wherein a
concurrent read() and invalidate_inode_pages() would cause the read() to
return -EIO because invalidate cleared PageUptodate() at the wrong time.
That patch made invalidate_complete_page() test for (page_count(page) != 2) and
bail out if the count indicates the page is still in use.
Problem is, the page may be in the per-cpu LRU front-ends over in
lru_cache_add. This elevates the refcount pending spillage of the page onto
the LRU for real. That causes a false positive in invalidate_complete_page(),
causing the page to not get invalidated. This screws up the logic in my new
O_DIRECT-vs-buffered coherency fix.
So let's solve the invalidate-vs-read race in a different manner. Over on the
read() side, add an explicit check to see if the page was invalidated. If so,
just drop it on the floor and redo the read from scratch.
Note that only do_generic_mapping_read() needs treatment. filemap_nopage(),
filemap_getpage() and read_cache_page() are already doing the
oh-it-was-invalidated-so-try-again thing.
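A userspace sketch of the read-side retry described above (the struct and helper names here are illustrative assumptions, not the kernel code): after looking a page up, check whether invalidate detached it from its mapping, and if so redo the lookup from scratch rather than returning -EIO.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified model: a page that invalidate has detached has mapping == NULL. */
struct page { void *mapping; int uptodate; };

/* Pretend read path: if the page was invalidated under us, drop it on the
 * floor and retry; only a genuinely failed read returns -EIO (-5). */
static int read_page(struct page *(*find)(void))
{
    for (;;) {
        struct page *p = find();
        if (p->mapping == NULL)      /* invalidated under us: start over */
            continue;
        if (p->uptodate)
            return 0;                /* good data */
        return -5;                   /* a real read error */
    }
}

/* Demo lookup: the first call races with invalidate, the retry succeeds. */
static struct page invalidated = { NULL, 0 };
static struct page valid = { (void *)1, 1 };
static int calls;
static struct page *find_demo(void)
{
    return calls++ == 0 ? &invalidated : &valid;
}
```

The point is only the control flow: an invalidated page triggers a retry, not an error return.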
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
H. Peter Anvin [Sat, 8 Jan 2005 06:02:27 +0000 (22:02 -0800)]
[PATCH] raid6: altivec support
This patch adds Altivec support for RAID-6, if appropriately configured on
the ppc or ppc64 architectures. Note that it changes the compile flags for
ppc64 in order to handle -maltivec correctly; this change was vetted on the
ppc64 mailing list and OK'd by paulus.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
David Brownell [Sat, 8 Jan 2005 06:02:09 +0000 (22:02 -0800)]
[PATCH] fbdev: rivafb should recognize NF2/IGP
I got tired of not seeing the boot time penguin on my Shuttle SN41G2, and
not having a decently large text display when I bypass X11. XFree86 says
it's "Chipset GeForce4 MX Integrated GPU", and the kernel driver has hooks
for this chip ID although it doesn't have a #define to match.
Signed-off-by: David Brownell <dbrownell@users.sourceforge.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch is for the ia64 kernel, and defines CONFIG_HOLES_IN_ZONE in
arch/ia64/Kconfig. IA64 has memory holes smaller than its MAX_ORDER and
its virtual memmap allows holes in a zone's memmap.
This patch makes vmemmap aligned with IA64_GRANULE_SIZE in
arch/ia64/mm/init.c.
This also means that "buddy" is the head of a run of contiguous free pages
of length (1 << page_order(buddy)).
bad_range() is called from inner loop of __free_pages_bulk().
In many archs, bad_range() is only a sanity check, it will always return 0.
But if a zone's memmap has a hole, it sometimes returns 1.
An architecture with memory holes in a zone has to define CONFIG_HOLES_IN_ZONE.
When CONFIG_HOLES_IN_ZONE is defined, pfn_valid() is called to check
whether a buddy page is valid.
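A toy model of that check, assuming the usual buddy arithmetic and a made-up pfn_valid() with a hole in it (the names mirror the changelog; the bodies are illustrative, not the kernel implementation):

```c
#include <assert.h>

/* The buddy of a free block of given order is found by flipping the
 * order bit of its page frame number. */
static unsigned long buddy_pfn(unsigned long pfn, unsigned int order)
{
    return pfn ^ (1UL << order);
}

/* Pretend zone with a memory hole covering pfns 0x100-0x1ff. */
static int pfn_valid(unsigned long pfn)
{
    return !(pfn >= 0x100 && pfn < 0x200);
}

/* With CONFIG_HOLES_IN_ZONE, the allocator must validate the buddy's pfn
 * before touching its struct page, since the hole has no memmap. */
static int buddy_ok(unsigned long pfn, unsigned int order)
{
    return pfn_valid(buddy_pfn(pfn, order));
}
```

A block whose buddy falls inside the hole must not be merged.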
[PATCH] no buddy bitmap patch revist: intro and includes
The following patches remove the bitmaps from the buddy allocator. This
is beneficial to memory hotplug work, because it removes a data
structure which must match the host's physical memory layout.
This is one step to manage physical memory in nonlinear / discontiguous way
and will reduce some amounts of codes to implement memory-hot-plug.
This patch removes bitmaps from zone->free_area[] in include/linux/mmzone.h,
and adds some comments on page->private field in include/linux/mm.h.
Non-atomic ops for changing the PG_private bit are added in include/page-flags.h.
zone->lock is always acquired when PG_private of "a free page" is changed.
[PATCH] vm: for -mm only: remove remap_page_range() completely
All in-tree references to remap_page_range() have been removed by prior
patches in the series. This patch, intended to be applied after some waiting
period for people to adjust to the API change, notice __deprecated, etc., does
the final removal of remap_page_range() as a function symbol declared within
kernel headers and/or implemented in kernel sources.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Ingo Molnar [Sat, 8 Jan 2005 05:59:57 +0000 (21:59 -0800)]
[PATCH] remove the BKL by turning it into a semaphore
This is the current remove-BKL patch. I test-booted it on x86 and x64, trying
every conceivable combination of SMP, PREEMPT and PREEMPT_BKL. All other
architectures should compile as well. (most of the testing was done with the
zaphod patch undone but it applies cleanly on vanilla -mm3 as well and should
work fine.)
This is the debugging-enabled variant of the patch which has two main
debugging features:
- debug potentially illegal smp_processor_id() use. Has caught a number
of real bugs - e.g. look at the printk.c fix in the patch.
- make it possible to enable/disable the BKL via a .config. If this
goes upstream we don't want this of course, but for now it gives
people a chance to find out whether any particular problem was caused
by this patch.
This patch has one important fix over the previous BKL patch: on PREEMPT
kernels if we preempted BKL-using code then the code still auto-dropped the
BKL by mistake. This caused a number of breakages for testers, which
went away once this bug was fixed.
Also the debugging mechanism has been improved a lot relative to the previous
BKL patch.
Would be nice to test-drive this in -mm. There will likely be some more
smp_processor_id() false positives but they are 1) harmless 2) easy to fix up.
We may well find more real smp_processor_id()-related breakages too.
The most noteworthy fact is that no BKL-using code was found yet that relied
on smp_processor_id(), which is promising from a compatibility POV.
Hugh Dickins [Sat, 8 Jan 2005 05:59:38 +0000 (21:59 -0800)]
[PATCH] vmtrunc: restart_addr in truncate_count
Despite its restart_pgoff pretensions, unmap_mapping_range_vma was fatally
unable to distinguish a vma to be restarted from the case where that vma
has been freed, and its vm_area_struct reused for the top part of a
!new_below split of an isomorphic vma yet to be scanned.
The obvious answer is to note restart_vma in the struct address_space, and
cancel it when that vma is freed; but I'm reluctant to enlarge every struct
inode just for this. Another answer is to flag valid restart in the
vm_area_struct; but vm_flags is protected by down_write of mmap_sem, which
we cannot take within down_write of i_sem. If we're going to need yet
another field, better to record the restart_addr itself: restart_vma only
recorded the last restart, but a busy tree could well use more.
Actually, we don't need another field: we can neatly (though naughtily)
keep restart_addr in vm_truncate_count, provided mapping->truncate_count
leaps over those values which look like a page-aligned address. Zero
remains good for forcing a scan (though now interpreted as restart_addr 0),
and it turns out no change is needed to any of the vm_truncate_count
settings in dup_mmap, vma_link, vma_adjust, move_one_page.
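A minimal sketch of the encoding trick, under the assumption that truncate_count simply skips any value that is page-aligned (the helper names are invented for illustration):

```c
#include <assert.h>

#define PAGE_SIZE 4096UL

/* mapping->truncate_count only ever takes values that are NOT page-aligned,
 * so a page-aligned value found in vma->vm_truncate_count can be read back
 * unambiguously as a restart address (with 0 meaning restart_addr 0). */
static unsigned long bump_truncate_count(unsigned long c)
{
    c++;
    if ((c & (PAGE_SIZE - 1)) == 0)  /* would look like an address: skip it */
        c++;
    return c;
}

static int is_restart_addr(unsigned long v)
{
    return (v & (PAGE_SIZE - 1)) == 0;  /* page-aligned, including zero */
}
```

The two value spaces never overlap, so no extra field is needed.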
Hugh Dickins [Sat, 8 Jan 2005 05:59:23 +0000 (21:59 -0800)]
[PATCH] vmtrunc: bug if page_mapped
If unmap_mapping_range (and mapping->truncate_count) are doing their jobs
right, truncate_complete_page should never find the page mapped: add BUG_ON
for our immediate testing, but this patch should probably not go to mainline -
a mapped page here is not a catastrophe.
Hugh Dickins [Sat, 8 Jan 2005 05:59:06 +0000 (21:59 -0800)]
[PATCH] vmtrunc: vm_truncate_count race caution
Fix some unlikely races in respect of vm_truncate_count.
Firstly, it's supposed to be guarded by i_mmap_lock, but some places copy a
vma structure by *new_vma = *old_vma: if the compiler implements that with a
bytewise copy, new_vma->vm_truncate_count could be munged, and new_vma later
appear up-to-date when it's not; so set it properly once under lock.
vma_link set vm_truncate_count to mapping->truncate_count when adding an empty
vma: if new vmas are being added profusely while vmtruncate is in progress,
this lets them be skipped without scanning.
vma_adjust has a vm_truncate_count problem much like the one it had with
anon_vma under mprotect merge: when merging, be careful not to leave a vma
marked as up-to-date when it might not be, lest unmap_mapping_range be in
progress - set vm_truncate_count to 0 when in doubt. Similarly when mremap
moves ptes from one vma to another.
vma to another.
Cut a little code from __anon_vma_merge: now vma_adjust sets "importer" in the
remove_next case (to get its vm_truncate_count right), its anon_vma is already
linked by the time __anon_vma_merge is called.
vmtruncate (or more generally, unmap_mapping_range) has been observed
responsible for very high latencies: the lockbreak work in unmap_vmas is good
for munmap or exit_mmap, but no use while mapping->i_mmap_lock is held, to
keep our place in the prio_tree (or list) of a file's vmas.
Extend the zap_details block with i_mmap_lock pointer, so unmap_vmas can
detect if that needs lockbreak, and break_addr so it can notify where it left
off.
Add unmap_mapping_range_vma, used from both prio_tree and nonlinear list
handlers. This is what now calls zap_page_range (above unmap_vmas), but
handles the lockbreak and restart issues: letting unmap_mapping_range_tree or
unmap_mapping_range_list know when they need to start over because the lock
was dropped.
When restarting, of course there's a danger of never making progress. Add
vm_truncate_count field to vm_area_struct, update that to mapping->
truncate_count once fully scanned, skip up-to-date vmas without a scan (and
without dropping i_mmap_lock).
Further danger of never making progress if a vma is very large: when breaking
out, save restart_vma and restart_addr (and restart_pgoff to confirm, in case
vma gets reused), to help continue where we left off.
Hugh Dickins [Sat, 8 Jan 2005 05:58:36 +0000 (21:58 -0800)]
[PATCH] vmtrunc: unmap_mapping_range_tree
Move unmap_mapping_range's nonlinear vma handling out to its own inline,
parallel to the prio_tree handling; unmap_mapping_range_list is a better name
for the nonlinear list, rename the other unmap_mapping_range_tree.
Hugh Dickins [Sat, 8 Jan 2005 05:58:19 +0000 (21:58 -0800)]
[PATCH] vmtrunc: restore unmap_vmas zap_bytes
The low-latency unmap_vmas patch silently moved the zap_bytes test after the
TLB finish and lockbreak and regather: why? That not only makes zap_bytes
redundant (might as well use ZAP_BLOCK_SIZE), it makes the unmap_vmas level
redundant too - it's all about saving TLB flushes when unmapping a series of
small vmas.
Move zap_bytes test back before the lockbreak, and delete the curious comment
that a small zap block size doesn't matter: it's true that need_flush prevents a TLB
flush when no page has been unmapped, but unmapping pages in small blocks
involves many more TLB flushes than in large blocks.
Hugh Dickins [Sat, 8 Jan 2005 05:58:02 +0000 (21:58 -0800)]
[PATCH] vmtrunc: truncate_count not atomic
Why is mapping->truncate_count atomic? It's incremented inside i_mmap_lock
(and i_sem), and the reads don't need it to be atomic.
And why smp_rmb() before call to ->nopage? The compiler cannot reorder the
initial assignment of sequence after the call to ->nopage, and no cpu (yet!)
can read from the future, which is all that matters there.
And delete totally bogus reset of truncate_count from blkmtd add_device.
truncate_count is all about detecting i_size changes: i_size does not change
there; and if it did, the count should be incremented not reset.
Nick Piggin [Sat, 8 Jan 2005 05:54:31 +0000 (21:54 -0800)]
[PATCH] debug sched domains before attach
Change the sched-domain debug routine to be called on a per-CPU basis, and
executed before the domain is actually attached to the CPU. Previously, all
CPUs would have their new domains attached, and then the debug routine would
loop over all of them.
This has two advantages: First, there are no longer any theoretical races: we
are running the debug routine on a domain that isn't yet active, and should
have no racing access from another CPU. Second, if there is a problem with a
domain, the validator will have a better chance to catch the error and print a
diagnostic _before_ the domain is attached, which may take down the system.
Also, change reporting of detected error conditions to KERN_ERR instead of
KERN_DEBUG, so they have a better chance of being seen in a hang on boot
situation.
The patch also does an unrelated (and harmless) cleanup in migration_thread().
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Ingo Molnar [Sat, 8 Jan 2005 05:54:15 +0000 (21:54 -0800)]
[PATCH] Fix smp_processor_id() warning in numa_node_id()
The patch below fixes smp_processor_id() warnings that are triggered by
numa_node_id().
All uses of numa_node_id() in mm/mempolicy.c seem to use it as a 'hint'
only, not as a correctness number. Once a node is established, it's used
in a preemption-safe way. So the simple fix is to disable the checking for
numa_node_id(). But additional review would be more than welcome, because
this patch turns off the preemption-checking of numa_node_id() permanently.
Tested on amd64.
Ingo Molnar [Sat, 8 Jan 2005 05:53:57 +0000 (21:53 -0800)]
[PATCH] oprofile smp_processor_id() fixes
Clean up a few suspicious-looking uses of smp_processor_id() in preemptible
code.
The current_cpu_data use is unclean but most likely safe. I haven't seen any
outright bugs. Since oprofile does not seem to be ready for different-type
CPUs (do we even care?), the patch below documents this property by using
boot_cpu_data.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Ingo Molnar [Sat, 8 Jan 2005 05:53:41 +0000 (21:53 -0800)]
[PATCH] idle thread preemption fix
The early bootup stage is pretty fragile because the idle thread is not yet
functioning as such and so we need preemption disabled. Whether the bootup
fails or not seems to depend on timing details so e.g. the presence of
SCHED_SMT makes it go away.
Disabling preemption explicitly has another advantage: the atomicity check
in schedule() will catch early-bootup schedule() calls from now on.
The patch also fixes another preempt-bkl buglet: interrupt-driven
forced-preemption didn't go through preempt_schedule() so it resulted in
auto-dropping of the BKL. Now we go through preempt_schedule() which
properly deals with the BKL.
Ingo Molnar [Sat, 8 Jan 2005 05:53:24 +0000 (21:53 -0800)]
[PATCH] sched: fix scheduling latencies for !PREEMPT kernels
This patch adds a handful of cond_resched() points to a number of key,
scheduling-latency related non-inlined functions.
This reduces preemption latency for !PREEMPT kernels. These are scheduling
points complementary to PREEMPT_VOLUNTARY scheduling points (might_sleep()
places) - i.e. these are all points where an explicit cond_resched() had
to be added.
Ingo Molnar [Sat, 8 Jan 2005 05:52:50 +0000 (21:52 -0800)]
[PATCH] sched: fix scheduling latencies in mtrr.c
Fix scheduling latencies in the MTRR-setting codepath. Also, fix bad bug:
MTRR's _must_ be set with interrupts disabled!
From: Bernard Blackham <bernard@blackham.com.au>
The patch sched-fix-scheduling-latencies-in-mttr in recent -mm kernels has
the bad side-effect of re-enabling interrupts even if they were disabled.
This caused bugs in Software Suspend 2 which reenabled MTRRs whilst
interrupts were already disabled.
Attached is a replacement patch which uses spin_lock_irqsave instead of
spin_lock_irq.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Ingo Molnar [Sat, 8 Jan 2005 05:52:32 +0000 (21:52 -0800)]
[PATCH] fix keventd execution dependency
We don't want to execute off keventd since it might hold a semaphore our
callers hold too. This can happen when kthread_create() is called from
within keventd. This happened due to the IRQ threading patches but it
could happen with other code too.
Ingo Molnar [Sat, 8 Jan 2005 05:51:41 +0000 (21:51 -0800)]
[PATCH] sched: mm: fix scheduling latencies in unmap_vmas()
The attached patch fixes long latencies in unmap_vmas(). We had lockbreak
code in that function already but it did not take delayed effects of
TLB-gather into account.
Ingo Molnar [Sat, 8 Jan 2005 05:51:22 +0000 (21:51 -0800)]
[PATCH] sched: net: fix scheduling latencies in __release_sock
The attached patch fixes long scheduling latencies caused by backlog
triggered by __release_sock(). That code only executes in process context,
and we've made the backlog queue private already at this point so it is
safe to do a cond_resched_softirq().
Ingo Molnar [Sat, 8 Jan 2005 05:51:03 +0000 (21:51 -0800)]
[PATCH] sched: net: fix scheduling latencies in netstat
The attached patch fixes long scheduling latencies caused by access to the
/proc/net/tcp file. The seqfile functions keep softirqs disabled for a
very long time (I've seen reports of 20+ msecs, if there are enough sockets
in the system). With the attached patch it's below 100 usecs.
The cond_resched_softirq() relies on the implicit knowledge that this code
executes in process context and runs with softirqs disabled.
Potentially enabling softirqs means that the socket list might change
between buckets - but this is not an issue since seqfiles have a 4K
iteration granularity anyway and /proc/net/tcp is often (much) larger than
that.
Ingo Molnar [Sat, 8 Jan 2005 05:50:46 +0000 (21:50 -0800)]
[PATCH] sched: vfs: fix scheduling latencies in prune_dcache() and select_parent()
The attached patch fixes long scheduling latencies in select_parent() and
prune_dcache(). The prune_dcache() lock-break is easy, but for
select_parent() the only viable solution i found was to break out if
there's a resched necessary - the reordering is not necessary and the
dcache scanning/shrinking will later on do it anyway.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Ingo Molnar [Sat, 8 Jan 2005 05:50:09 +0000 (21:50 -0800)]
[PATCH] sched: ext3: fix scheduling latencies in ext3
The attached patch fixes long scheduling latencies in the ext3 code, and it
also cleans up the existing lock-break functionality to use the new
primitives.
This patch has been in the -VP patchset for quite some time.
Ingo Molnar [Sat, 8 Jan 2005 05:49:52 +0000 (21:49 -0800)]
[PATCH] sched: add cond_resched_softirq()
This patch adds cond_resched_softirq(), which can be used by _process context_
softirqs-disabled codepaths to preempt if necessary. The function will
enable softirqs before scheduling. (Later patches will use this
primitive.)
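A userspace model of what such a primitive might look like, with the kernel primitives stubbed out (the stubs, flags and counters are assumptions for illustration; the real implementation differs):

```c
#include <assert.h>

/* Stand-ins for kernel state: the caller runs in process context with
 * softirqs disabled, and a reschedule may be pending. */
static int softirqs_disabled = 1;
static int resched_pending;
static int schedule_calls;

static void local_bh_enable(void)  { softirqs_disabled = 0; }
static void local_bh_disable(void) { softirqs_disabled = 1; }
static void __cond_resched(void)   { schedule_calls++; resched_pending = 0; }

/* Enable softirqs before scheduling, re-disable afterwards, and report
 * whether a reschedule actually happened. */
static int cond_resched_softirq(void)
{
    if (resched_pending) {
        local_bh_enable();          /* must not schedule with softirqs off */
        __cond_resched();
        local_bh_disable();
        return 1;
    }
    return 0;
}
```

The caller gets back exactly the softirqs-disabled state it started with.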
Ingo Molnar [Sat, 8 Jan 2005 05:49:19 +0000 (21:49 -0800)]
[PATCH] preempt cleanup
This is another generic fallout from the voluntary-preempt patchset: a
cleanup of the cond_resched() infrastructure, in preparation of the latency
reduction patches. The changes:
- uninline cond_resched() - this makes the footprint smaller,
especially once the number of cond_resched() points increase.
- add a 'was rescheduled' return value to cond_resched. This makes it
symmetric to cond_resched_lock() and later latency reduction patches
rely on the ability to tell whether there was any preemption.
- make cond_resched() more robust by using the same mechanism as
preempt_kernel(): by using PREEMPT_ACTIVE. This preserves the task's
state - e.g. if the task is in TASK_ZOMBIE but gets preempted via
cond_resched() just prior to scheduling off then this approach preserves
TASK_ZOMBIE.
- the patch also adds need_lockbreak() which critical sections can use
to detect lock-break requests.
Ingo Molnar [Sat, 8 Jan 2005 05:49:02 +0000 (21:49 -0800)]
[PATCH] improve preemption on SMP
SMP locking latencies are one of the last architectural problems that cause
millisec-category scheduling delays. CONFIG_PREEMPT tries to solve some of
the SMP issues but there are still lots of problems remaining: spinlocks
nested at multiple levels, spinning with irqs turned off, and non-nested
spinning with preemption turned off permanently.
The nesting problem goes like this: if a piece of kernel code (e.g. the MM
or ext3's journalling code) does the following:

	spin_lock(&spinlock_1);
	spin_lock(&spinlock_2);

then even with CONFIG_PREEMPT enabled, current kernels may spin on
spinlock_2 indefinitely. A number of critical sections break their long
paths by using cond_resched_lock(), but this does not break the path on
SMP, because need_resched() *of the other CPU* is not set, so
cond_resched_lock() doesn't notice that a reschedule is due.
To solve this problem I've introduced a new spinlock field,
lock->break_lock, which signals towards the holding CPU that a
spinlock-break is requested by another CPU. This field is only set if a
CPU is spinning in a spinlock function [at any locking depth], so the
default overhead is zero. I've extended cond_resched_lock() to check for
this flag - in this case we can also save a reschedule. I've added the
lock_need_resched(lock) and need_lockbreak(lock) methods to check for the
need to break out of a critical section.
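A toy model of the break_lock handshake, with an invented two-field lock type standing in for spinlock_t (illustrative only; the real locking is done with atomic architecture primitives):

```c
#include <assert.h>

/* Simplified lock: "locked" state plus the new break_lock request flag. */
struct slock { int locked; int break_lock; };

/* A contending CPU spinning on a held lock signals the holder ... */
static void spin_contend(struct slock *l)
{
    if (l->locked)
        l->break_lock = 1;
}

static int need_lockbreak(struct slock *l)
{
    return l->break_lock;
}

/* ... and the holder polls the flag from within its critical section,
 * briefly dropping and reacquiring the lock when a break is requested. */
static int cond_resched_lock(struct slock *l)
{
    if (need_lockbreak(l)) {
        l->locked = 0;              /* let the spinner in */
        l->break_lock = 0;
        l->locked = 1;              /* reacquire and carry on */
        return 1;
    }
    return 0;                       /* nobody waiting: zero overhead */
}
```

When no CPU is spinning, break_lock stays clear and the holder pays nothing.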
Another latency problem was that the stock kernel, even with CONFIG_PREEMPT
enabled, didn't have any spin-nicely preemption logic for the following,
commonly used SMP locking primitives: read_lock(), spin_lock_irqsave(),
spin_lock_irq(), spin_lock_bh(), read_lock_irqsave(), read_lock_irq(),
read_lock_bh(), write_lock_irqsave(), write_lock_irq(), write_lock_bh().
Only spin_lock() and write_lock() [the two simplest cases] were covered.
In addition to the preemption latency problems, the _irq() variants in the
above list didn't do any IRQ-enabling while spinning - possibly resulting in
excessive irqs-off sections of code!
preempt-smp.patch fixes all these latency problems by spinning irq-nicely
(if possible) and by requesting lock-breaks if needed. Two
architecture-level changes were necessary for this: the addition of the
break_lock field to spinlock_t and rwlock_t, and the addition of the
_raw_read_trylock() function.
Testing done by Mark H Johnson and myself indicate SMP latencies comparable
to the UP kernel - while they were basically indefinitely high without this
patch.
I successfully test-compiled and test-booted this patch ontop of BK-curr
using the following .config combinations: SMP && PREEMPT, !SMP && PREEMPT,
SMP && !PREEMPT and !SMP && !PREEMPT on x86, !SMP && !PREEMPT and SMP &&
PREEMPT on x64. I also test-booted x86 with the generic_read_trylock
function to check that it works fine. Essentially the same patch has been
in testing as part of the voluntary-preempt patches for some time already.
NOTE to architecture maintainers: generic_raw_read_trylock() is a crude
version that should be replaced with the proper arch-optimized version
ASAP.
From: Hugh Dickins <hugh@veritas.com>
The i386 and x86_64 _raw_read_trylocks in preempt-smp.patch are too
successful: atomic_read() returns a signed integer.
Nathan Lynch [Sat, 8 Jan 2005 05:48:27 +0000 (21:48 -0800)]
[PATCH] introduce idle_task_exit
Heiko Carstens figured out that offlining a cpu can leak mm_structs because
the dying cpu's idle task fails to switch to init_mm and mmdrop its
active_mm before the cpu is down. This patch introduces idle_task_exit,
which allows the idle task to do this as Ingo suggested.
I will follow this up with a patch for ppc64 which calls idle_task_exit
from cpu_die.
This patch removes two outdated/misleading comments from the CPU scheduler.
1) The first comment removed is simply incorrect. The function it
comments on is not used for what the comments says it is anymore.
2) The second comment is a leftover from when the "if" block it comments
on contained a goto. It does not any more, and the comment doesn't make
sense.
There isn't really a reason to add different comments, though someone might
feel differently in the case of the second one. I'll leave adding a
comment to anybody who wants to - more important to just get rid of them
now.
Con Kolivas [Sat, 8 Jan 2005 05:46:46 +0000 (21:46 -0800)]
[PATCH] sched: remove_interactive_credit
Special casing tasks by interactive credit was helpful for preventing fully
cpu bound tasks from easily rising to interactive status.
However it did not select out tasks that had periods of being fully cpu
bound and then sleeping while waiting on pipes, signals etc. This led to a
more disproportionate share of cpu time.
Backing this out will no longer special case only fully cpu bound tasks,
and prevents the variable behaviour that occurs at startup before tasks
declare themselves interactive or not, and speeds up application startup
slightly under certain circumstances. It does cost in interactivity
slightly as load rises but it is worth it for the fairness gains.
Signed-off-by: Con Kolivas <kernel@kolivas.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Con Kolivas [Sat, 8 Jan 2005 05:46:30 +0000 (21:46 -0800)]
[PATCH] sched: requeue_granularity
Change the granularity code to requeue tasks at their best priority instead
of changing priority while they're running. This keeps tasks at their top
interactive level during their whole timeslice.
Signed-off-by: Con Kolivas <kernel@kolivas.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Con Kolivas [Sat, 8 Jan 2005 05:45:57 +0000 (21:45 -0800)]
[PATCH] sched: adjust_timeslice_granularity
The minimum timeslice was decreased from 10ms to 5ms. In the process, the
timeslice granularity was leading to much more rapid round robinning of
interactive tasks at cache-thrashing levels.
Restore minimum granularity to 10ms.
Signed-off-by: Con Kolivas <kernel@kolivas.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Con Kolivas [Sat, 8 Jan 2005 05:45:40 +0000 (21:45 -0800)]
[PATCH] sched: alter_kthread_prio
Timeslice proportion has been increased substantially for -niced tasks. As
a result of this kernel threads have much larger timeslices than they
previously had.
Change kernel threads' nice value to -5 to bring their timeslice back in
line with previous behaviour. This means kernel threads will be less
likely to cause large latencies under periods of system stress for normal
nice 0 tasks.
Signed-off-by: Con Kolivas <kernel@kolivas.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Matthew Dobson [Sat, 8 Jan 2005 05:44:51 +0000 (21:44 -0800)]
[PATCH] sched: active_load_balance() fixlet
There is a small problem with the active_load_balance() patch that Darren
sent out last week. As soon as we discover a potential 'target_cpu' from
'cpu_group' to try to push tasks to, we cease considering other CPUs in
that group as potential 'target_cpu's. We break out of the
for_each_cpu_mask() loop and try to push tasks to that CPU. The problem is
that there may well be other idle cpus in that group that we should also
try to push tasks to. Here is a patch to fix that small problem. The
solution is to simply move the code that tries to push the tasks into the
for_each_cpu_mask() loop and do away with the whole 'target_cpu' thing
entirely. Compiled & booted on a 16-way x440.
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Andrew Theurer [Sat, 8 Jan 2005 05:44:34 +0000 (21:44 -0800)]
[PATCH] sched: newidle fix
Allow idle_balance to search an increasingly larger span of cpus to find a
cpu. Minor change: NODE_SD_INIT gets the SD_BALANCE_NEWIDLE flag. This is
critical for x86_64, where there is only one cpu per node. In the current
code, idle_balance for Opteron -never- works.
Signed-off-by: <habanero@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Andrew Theurer [Sat, 8 Jan 2005 05:44:22 +0000 (21:44 -0800)]
[PATCH] sched: can_migrate exception for idle cpus
Fix can_migrate to allow aggressive steal for idle cpus. This -was- in
mainline, but I believe sched_domains kind of blasted it outta there. IMO,
it's a no-brainer for an idle cpu (with all that cache going to waste) to
be allowed to steal a task. The one enhancement I have made was to make
sure the whole cpu was idle.
Signed-off-by: <habanero@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Andrew Theurer [Sat, 8 Jan 2005 05:44:05 +0000 (21:44 -0800)]
[PATCH] sched: more aggressive wake_idle()
This patch addresses some problems with wake_idle(). Currently wake_idle()
will wake a task on an alternate cpu if:
1) task->cpu is not idle
2) an idle cpu can be found
However the span of cpus to look for is very limited (only the task->cpu's
sibling). The scheduler should find the closest idle cpu, starting with
the lowest level domain, then going to higher level domains if allowed
(domain has flag SD_WAKE_IDLE). This patch does this.
This and the other two patches (also to be submitted) combined have
provided as much as a 5% improvement on that "online transaction DB workload"
and 2% on the industry-standard J2EE workload.
I asked Martin Bligh to test these for regression, and he did not find any.
I would like to submit for inclusion to -mm and barring any problems
eventually to mainline.
Signed-off-by: <habanero@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Neil Brown [Sat, 8 Jan 2005 05:43:47 +0000 (21:43 -0800)]
[PATCH] nfsd4_setclientid_confirm locking fix
Avoid unlock-without-lock problem on error path in nfsd4_setclientid_confirm
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Greg Banks [Sat, 8 Jan 2005 05:43:31 +0000 (21:43 -0800)]
[PATCH] oprofile: fix ia64 callgraph bug with old gcc
With Keith Owens <kaos@sgi.com>
This patch from Keith Owens fixes a bug in the ia64 port of oprofile when
built without the kdb patch and with a pre-3.4 gcc.
If you build a standard kernel with gcc < 3.4 then
ia64_spinlock_contention_pre3_4 is defined. But a standard kernel does not
have ia64_spinlock_contention_pre3_4_end, that label is only added by the
kdb patch. To get the backtrace profiling with gcc < 3.4, the _end label
needs to be added as part of the kernprof patch, then I will remove it from
kdb.
Signed-off-by: Keith Owens <kaos@sgi.com>
Signed-off-by: Greg Banks <gnb@melbourne.sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Thayne Harbaugh [Sat, 8 Jan 2005 05:43:14 +0000 (21:43 -0800)]
[PATCH] initramfs: unprivileged image creation
This patch makes several tweaks so that an initramfs image can be
completely created by an unprivileged user. It should maintain
compatibility with previous initramfs early userspace cpio/image creation
and it updates documentation.
There are a few very important tweaks:
CONFIG_INITRAMFS_SOURCE is now either a single cpio archive that is
directly used or a list of directories and files for building a cpio
archive for the initramfs image. Making the cpio archive listable in
CONFIG_INITRAMFS_SOURCE makes the cpio step more official and automated so
that it doesn't have to be copied by hand to usr/initramfs_data.cpio (I
think this was broken anyway and would be overwritten). The alternative
list of directories *and* files means that files can be installed in a "root"
directory and device-special files can be listed in a file list.
CONFIG_ROOT_UID and CONFIG_ROOT_GID are now available for doing simple
user/group ID translation. That means that user ID 500, group ID 500 can
create all the files in the "root" directory, but that they can all be
owned by user ID 0, group ID 0 in the cpio image.
Various documentation updates to pull it all together.
Removal of old cruft that was unused/misleading.
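As a sketch of the file-list form of CONFIG_INITRAMFS_SOURCE (the paths and the busybox entry are hypothetical, not from the patch), a description file might look like:

```
# gen_init_cpio description file: <type> <name> ... <mode> <uid> <gid>
dir  /dev 0755 0 0
nod  /dev/console 0600 0 0 c 5 1
dir  /bin 0755 0 0
file /bin/busybox initramfs/bin/busybox 0755 0 0
```

With CONFIG_ROOT_UID=500 and CONFIG_ROOT_GID=500, files owned by 500:500 in the staging tree would come out owned by 0:0 in the cpio image.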
Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
telldir() is broken on large ext3 dir_index'd directories because
getdents() gives d_off==0 for the first entry
Here's a patch which fixes the problem, but note the following warning
from the readdir man page:
According to POSIX, the dirent structure contains a field char d_name[]
of unspecified size, with at most NAME_MAX characters preceding the
terminating null character. Use of other fields will harm the
portability of your programs.
Also, as always, telldir() and seekdir() are truly awful interfaces
because they implicitly assume that (a) a directory is a linear data
structure, and (b) that the position in a directory can be expressed
in a cookie which has only 31 bits on 32-bit systems.
So there will be hash collisions that will cause programs that assume
that seekdir(dirent->d_off) will always return the next directory
entry to sometimes lose directory entries in the
not-as-unlikely-as-we-would-wish case of a 31-bit hash collision.
Really, any program which is using telldir/seekdir should be
rewritten to not use these interfaces if at all possible. So with
these caveats....
What we need to do is wire '.' and '..' to have hash values of (0,0) and
(2,0), respectively, without ignoring other existing dirents with colliding
hashes. (In those cases the programs will break, but they are statistically
rare, and there's not much we can do in those cases anyway.)
According to "initramfs buffer format -- third draft"
(http://lwn.net/2002/0117/a/initramfs-buffer-format.php3), the cpio
"TRAILER!!!" entry (cpio end-of-archive) is optional, but is not ignored.
The kernel handling does not follow this spec. If you add null padding
after an uncompressed cpio without TRAILER!!! the kernel complains "no
cpio magic". In a gzipped archive one gets "junk in gzipped archive"
without the TRAILER!!!
This patch changes the state transitions so the kernel will follow the spec.
Tested: padded uncompressed, padded compressed, unpadded compressed (error)
and trailing junk in compressed (error)
===
I have a boot loader that knows how to load files, determine their size,
and advance to the next 4-byte boundary and reports the total size of the
files loaded. It doesn't understand about converting this number to some
ASCII representation.
With this patch I can embed the contents of a file padded with NULs,
without knowing the exact size of the file, using the following files:
1) file containing cpio header & file name, padded to 4 bytes
2) contents of file
3) pad file of zeros, sized at least as large as that specified
for the file.
hpa points out that you should be careful with the headers, use unique
inode numbers and/or add a cpio header with just TRAILER!!! to reset the
inode hash table to avoid unwanted hard links. I just put this sequence as
the last files loaded.
Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Thayne Harbaugh [Sat, 8 Jan 2005 05:42:25 +0000 (21:42 -0800)]
[PATCH] gen_init_cpio symlink, pipe and socket support
This patch makes gen_init_cpio more complete by adding symlink, pipe and
socket support. It updates scripts/gen_initramfs_list.sh to support the
new types. The patch applies to the recent mm series that already have the
updated gen_init_cpio and gen_initramfs_list.sh.
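The new entry types follow the existing description-file syntax; a sketch (names and targets are illustrative):

```
slink /init /bin/busybox 0777 0 0
pipe  /dev/initctl 0600 0 0
sock  /dev/log 0666 0 0
```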
From: William Lee Irwin III <wli@holomorphy.com>
The rest of gen_init_cpio.c seems to cast the result of strlen() to handle
this situation, so this patch follows suit while killing off
size_t-related printk() warnings.
Signed-off-by: William Irwin <wli@holomorphy.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Stas Sergeev [Sat, 8 Jan 2005 05:41:19 +0000 (21:41 -0800)]
[PATCH] fix cdrom autoclose
The attached patch fixes the CD-ROM autoclose. It is broken in recent
kernels for CD-ROMs that do not properly report that the tray is opened.
Now on such drives the kernel will do one close attempt and check for
the disc again. This is how it used to work in the past.
Signed-off-by: Stas Sergeev <stsp@aknet.ru> Acked-by: Alexander Kern <alex.kern@gmx.de> Acked-by: Jens Axboe <axboe@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Nathan Lynch [Sat, 8 Jan 2005 05:41:01 +0000 (21:41 -0800)]
[PATCH] prohibit slash in proc directory entry names
proc_create() needs to check that the name of an entry to be created does
not contain a '/' character.
To test, I hacked the ibmveth driver to try to call request_irq with a
bogus "foo/bar" devname. The creation of the /proc/irq/1234/xxx entry
silently fails, as intended. Perhaps the irq code should be made to check
for the failure.
Olof Johansson [Sat, 8 Jan 2005 05:40:27 +0000 (21:40 -0800)]
[PATCH] ppc64: IOMMU cleanups: Main cleanup patch
Earlier cleanup efforts of the ppc64 IOMMU code have mostly been targeted
at simplifying the allocation schemes and modularising things for the
various platforms. The IOMMU init functions are still a mess. This is an
attempt to clean them up and make them somewhat easier to follow.
The new rules are:
1. iommu_init_early_<arch> is called before any PCI/VIO init is done
2. The pcibios fixup routines will call the iommu_{bus,dev}_setup functions
appropriately as devices are added.
TCE space allocation has changed somewhat:
* On LPARs, nothing is really different. ibm,dma-window properties are still
used to determine table sizes.
* On pSeries SMP-mode (non-LPAR), the full TCE space per PHB is split up
in 256MB chunks, each handed out to one child bus/slot as needed. This
makes the current max 7 child buses per PHB, something we're currently
below on all machine models I'm aware of.
* Exception to the above: Pre-POWER4 machines with Python PHBs have a full
GB of DMA space allocated at the PHB level, since there are no EADS-level
tables on such systems.
* PowerMac and Maple still work like before: all buses/slots share one table.
* VIO works like before, ibm,my-dma-window is used like before.
* iSeries has not been touched much at all, besides the changed unit of
the it_size variable in struct iommu_table.
Other things changed:
* PowerMac and Maple PCI/IOMMU inits have been changed a bit to conform
to the new init structure
* pci_dma_direct.c has been renamed pci_direct_iommu.c to match
pci_iommu.c (see separate patch)
* Likewise, a couple of the pci direct init functions have been renamed.
Signed-off-by: Olof Johansson <olof@austin.ibm.com> Acked-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch renames pci_dma_direct.c to pci_direct_iommu.c to comply to the
naming convention of the other iommu files.
This is part of the iommu cleanup, but broken out as a separate patch since
for mainline, a BK rename is more appropriate. Still, we need a patch to
apply for non-BK-based trees (-mm)
Signed-off-by: Olof Johansson <olof@austin.ibm.com> Acked-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Paul Mackerras [Sat, 8 Jan 2005 05:38:49 +0000 (21:38 -0800)]
[PATCH] ppc64: use newer RTAS call when available
This patch is from Nathan Fontenot <nfont@austin.ibm.com> originally.
The PPC64 EEH code needs a small update to start using the
ibm,read-slot-reset-state2 rtas call if available. The currently used
ibm,read-slot-reset-state call will be going away on future machines.
This patch attempts to use the newer rtas call if available and falls
back to the older version otherwise. This will maintain EEH slot
checking capabilities on all current and future firmware levels.
Signed-off-by: Nathan Fontenot <nfont@austin.ibm.com> Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
David Gibson [Sat, 8 Jan 2005 05:38:32 +0000 (21:38 -0800)]
[PATCH] ppc64: add performance monitor register information to processor.h
Most special purpose registers on the ppc64 have both the SPR number, and
the various fields within the register defined in asm-ppc64/processor.h.
So far that's not true for the performance counter control registers, MMCR0
and MMCRA. They have the SPR numbers defined, but the internal fields are
defined in the oprofile code and (just a few) in traps.c where they're
actually used.
This patch moves all the MMCR0 and MMCRA definitions, plus the MSR
performance monitor bit, MSR_PMM, into processor.h.
Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Rik van Riel [Sat, 8 Jan 2005 05:37:59 +0000 (21:37 -0800)]
[PATCH] vmscan: count writeback pages in nr_scanned
OOM kills have been observed with 70% of the pages in lowmem being in the
writeback state. If we count those pages in sc->nr_scanned, the VM should
throttle and wait for IO completion, instead of OOM killing.
(akpm: this is how the code was designed to work - we broke it six months
ago).
Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Hirokazu Takata [Sat, 8 Jan 2005 05:37:43 +0000 (21:37 -0800)]
[PATCH] m32r: build fix
This patch is required to fix compile errors for m32r.
This was originally given by the following patch:
[PATCH] move irq_enter and irq_exit to common code
http://www.ussg.iu.edu/hypermail/linux/kernel/0411.1/1738.html
I think it was maybe accidentally dropped only for the m32r arch due to a
patching conflict with the other patches or something like that.
I'm being at least sometimes deferred to for hugetlb maintenance.
I also originally wrote the fs methods, and generally get stuck
working on it on a regular basis. So here is a MAINTAINERS entry
reflecting that.
Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Dave Jiang [Sat, 8 Jan 2005 00:04:18 +0000 (00:04 +0000)]
[ARM PATCH] 2363/1: IQ80332 platform port
Patch from Dave Jiang
Signed-off-by: Dave Jiang
This is the IQ80332 platform port, based on the IOP33x CPU. The IQ80332 is a PCI-Express CRB based on the IOP332 processor. Otherwise its functionality is fairly similar to the IQ80331. Signed-off-by: Russell King
Dave Jiang [Fri, 7 Jan 2005 23:36:26 +0000 (23:36 +0000)]
[ARM PATCH] 2362/1: cleanup of PCI defines for IOP321 platforms
Patch from Dave Jiang
Signed-off-by: Dave Jiang
Major cleanup of the IOP321 PCI defines to make them more coherent. Unified some per-platform groups into common processor-specific defines. Removed some magic numbers. Signed-off-by: Russell King