The irq vector offset should spread the irq's out evenly, which
implies that it should vary between 0-7, not any further (the
higher bits are done by updating current_vector by 8).
This also means that we don't have any overflow condition.
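
A minimal sketch of the scheme described above, not the patch itself. The
constants and the toy_ names are assumptions for illustration; the real
vector range is architecture-specific.

```c
#include <assert.h>

/* Hypothetical vector range, chosen only for this sketch. */
#define TOY_FIRST_DEVICE_VECTOR 0x31
#define TOY_FIRST_SYSTEM_VECTOR 0xef

static int toy_current_vector = TOY_FIRST_DEVICE_VECTOR;
static int toy_offset;

/* Hand out the next vector: the high bits advance in steps of 8, and
 * on wrap-around the low-bits offset is bumped, staying within 0-7. */
static int toy_assign_vector(void)
{
	toy_current_vector += 8;
	if (toy_current_vector > TOY_FIRST_SYSTEM_VECTOR) {
		toy_offset = (toy_offset + 1) & 7;
		toy_current_vector = TOY_FIRST_DEVICE_VECTOR + toy_offset;
	}
	assert(toy_offset >= 0 && toy_offset <= 7);
	return toy_current_vector;
}
```

Because the offset is masked with 7, it cannot grow past the low three bits,
which is why no separate overflow check is needed.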
This version fixes:
- missing rtnl_lock()/rtnl_unlock() bug on unregister_hdlc_device
- N2, C101: interrupt handler now works under high IRQ load from other
devices (with previous versions, the IRQ processing for the card could
sometimes stop after reaching "work limit")
This is production-tested on devices I have access to (N2, C101, PC300,
PCI200SYN).
This finally kills off blk_queue_empty(). This is similar to the patch I
recently sent to fix the SCSI logic as well. A lot of drivers are doing
this in our core, mainly because that is the way they always did it:
	start_queue:
		if (blk_queue_empty(q))
			return;
		rq = elv_next_request(q);
		if (!rq)
			return;
Patch simply removes the blk_queue_empty() check, and adds a check for
!rq return from elv_next_request() if the driver didn't already do that.
Additionally, the AS io scheduler can return NULL from
elv_next_request() if it thinks this is best. This way we are also
prepared for that to work well.
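
A rough userspace sketch of the pattern after the change, with toy stand-ins
for the block layer. The toy_ types and functions are hypothetical and only
mimic the role of elv_next_request(); they are not kernel definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for the block layer types. */
struct toy_request { int sector; };

struct toy_queue {
	struct toy_request *pending;  /* array of queued requests */
	size_t head, count;           /* next index and total queued */
	int hold_back;                /* scheduler chooses to return NULL */
};

/* Plays the role of elv_next_request(): may return NULL even when
 * requests are queued, if the scheduler decides to hold them back. */
static struct toy_request *toy_next_request(struct toy_queue *q)
{
	if (q->hold_back || q->head == q->count)
		return NULL;
	return &q->pending[q->head++];
}

/* Driver request function: no separate "is the queue empty?" test,
 * only the NULL check on the returned request.  Returns the number of
 * requests handled. */
static int toy_start_queue(struct toy_queue *q)
{
	struct toy_request *rq;
	int handled = 0;

	while ((rq = toy_next_request(q)) != NULL) {
		assert(rq->sector >= 0);
		handled++;
	}
	return handled;
}
```

The NULL check covers both the empty queue and a scheduler that holds
requests back, so the driver does not need to tell the two cases apart.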
The problem with setiathome is that it displays something every now and
then - so it gets a backboost from X, and hovers at a relatively high
priority.
Make it compile again and various cleanups and a few bug fixes. Only
changes x86-64 specific files.
Most of it is S3 suspend changes from Pavel and comment spelling fixes
from Steven Cole.
- Remove now obsolete check_cpu function
- Fix sys_ioctl prototype
- Small optimization - use SYSCALL for 32bit signal handling.
- Fix S3 suspend handling and split into individual files like i386 (Pavel)
- Merge from i386 (pci fixes etc.)
- Set correct paging attributes for IOMMU aperture
- Fix disable apic option
The ipc multiplexer syscall on x86 currently returns EINVAL for a
non-existing sub-opcode. This is logical but is a problem with the
introduction of new operations (like semtimedop). Now EINVAL can mean
"no such operation" and "invalid parameter". To avoid such problems in
future, could you apply the attached patch?
[PATCH] missing FB_VISUAL_PSEUDOCOLOR in fb_prepare_logo()
This fixes the mighty penguin logo not appearing on visual workstation
framebuffer. The trouble is missing 'case FB_VISUAL_PSEUDOCOLOR:' in
fb_prepare_logo() function.
Roland McGrath [Fri, 4 Apr 2003 12:12:17 +0000 (04:12 -0800)]
[PATCH] linux-2.5.66-signal-cleanup.patch
Here is the cleanup patch I promised back in February. Sorry it took a
while.
The effects should be purely cosmetic in 2.5.66. However, the new
interface for the proper way to send thread-specific or process-global
signals from inside the kernel is needed for correct implementation of
some fixes to timer stuff that Ulrich told me about.
This cleans up some obsolete comments and macros in kernel/signal.c,
restores send_sig_info to its original behavior, and adds a global entry
point send_group_sig_info. I checked all the uses of send_sig and
send_sig_info and changed a few to send_group_sig_info.
I think it would be cleanest if the whole mess of *_sig* entry points were
reduced to two or three, but I did the change that minimized the number of
callers I had to fix up.
There should be no discernible difference, since the 2.5.66 send_sig_info
function did group semantics for those signals by number already. The only
exception to that is pdeath_signal, which I guess can be any signal number
but I deemed ought to be process-wide.
I did not change any of the calls using SIGKILL, though that does have
process-wide semantics. There is no need to change it since SIGKILL always
kills the whole group, though the code path for send_sig(SIGKILL,...) calls
in multithreaded processes will be different now.
John Levon [Fri, 4 Apr 2003 12:12:09 +0000 (04:12 -0800)]
[PATCH] bk - fix oprofile for pm driver register
OK, so I screwed up - didn't notice the late_initcall() that was
introduced, which was obviously bogus. This one should build OK for the
module case. I've tested insmod/rmmod alongside a mounted sysfs.
I think the built-in case is OK: oprofile/ is after kernel/ in the link
order. I tested that too.
Andrew Morton [Fri, 4 Apr 2003 12:12:01 +0000 (04:12 -0800)]
[PATCH] acpi compile fix
ACPI is performing a spin_lock() on a `void *'. That's OK when spin_lock is
implemented via an inline function. But when it is implemented via macros
it causes compile-time breakage.
So cast it to the right type. It really should be fixed not to use opaque
handles, though.
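
To illustrate why the macro form breaks, here is a small standalone sketch.
The toy_ lock type, macros and handle are assumptions made up for this
example; they are not the ACPI or spinlock definitions.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { int held; } toy_spinlock_t;

/* Macro-style lock: expands to a member access, so the argument must
 * already have the right pointer type. */
#define toy_spin_lock(l)   ((l)->held = 1)
#define toy_spin_unlock(l) ((l)->held = 0)

/* An opaque handle, as an OS-abstraction layer might pass around. */
typedef void *toy_handle;

/* Returns the lock state seen inside the critical section. */
static int toy_locked_increment(toy_handle handle, int *counter)
{
	int was_held;

	assert(handle != NULL);
	/* toy_spin_lock(handle) would not compile here: a `void *' has
	 * no members.  The cast restores the real type. */
	toy_spin_lock((toy_spinlock_t *)handle);
	was_held = ((toy_spinlock_t *)handle)->held;
	++*counter;
	toy_spin_unlock((toy_spinlock_t *)handle);
	return was_held;
}
```

With an inline function taking a toy_spinlock_t pointer, the `void *' would
convert implicitly, which is why the problem only shows up with macros.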
This fixes some performance problems. Some vendors implement firmware
updates over IPMI, and this speeds up that process quite a bit.
* Improve the "send - wait for response - send - wait for response -
etc" performance when using high-res timers. Before, an ~10ms delay
would be added to each message, because it didn't restart the timer
if nothing was happening when a new message was started.
David S. Miller [Thu, 3 Apr 2003 05:34:27 +0000 (21:34 -0800)]
[SPARC64]: Fix trap stack allocations so gcc-3.x builds work.
1) Use PTREGS_OFF consistently
2) Define it to allocate STACKFRAME_SZ instead of REGWIN_SZ
3) Kill off REGWIN_SZ, replace with sizeof(struct reg_window).
Alex Williamson [Thu, 3 Apr 2003 04:39:55 +0000 (20:39 -0800)]
[PATCH] ia64: remove platform_pci_dma_addres
This removes platform_pci_dma_address. Since the scatterlist
in 2.5 has a dma_address, seems like we can expect a certain usage
of it. SGI folks may want to verify this doesn't break their DMA
engines.
Andrew Morton [Thu, 3 Apr 2003 00:29:43 +0000 (16:29 -0800)]
[PATCH] ext3 journal commit I/O error fix
From: Hua Zhong <hzhong@cisco.com>
The current ext3 totally ignores I/O errors that happen during
journal_force_commit, causing user space to falsely believe the commit
succeeded when it actually did not.
This patch checks for I/O errors during journal_commit_transaction and aborts
the journal when there is an I/O error.
Originally I thought about reporting the error without aborting the
journal, but it probably needs a new flag. Aborting the journal seems to be
the easy way to signal "hey sth is wrong..".
Andrew Morton [Thu, 3 Apr 2003 00:29:28 +0000 (16:29 -0800)]
[PATCH] ext3_commit_write speedup
For an appending write, ext3_commit_write() will call the expensive
ext3_mark_inode_dirty() twice. Once in generic_commit_write()'s extension of
i_size and once in ext3_commit_write() itself where i_disksize is updated.
But by updating i_disksize _before_ calling generic_commit_write() these can
be piggybacked.
The patch takes the overhead of a write() from 1.96 microseconds down to
1.63.
Andrew Morton [Thu, 3 Apr 2003 00:29:21 +0000 (16:29 -0800)]
[PATCH] ext3_mark_inode_dirty() speedup
ext3_mark_inode_dirty() (and several other callers) use the
ext3_reserve_inode_write() and ext3_mark_iloc_dirty() pair for journalling an
inode's backing block.
Because ext3_reserve_inode_write() gets journalling access to the block there
is no need for ext3_mark_iloc_dirty() to do it as well.
This change reduces the overhead of a write() from 2.7 microseconds to 1.95
on a 2.7G P4.
Andrew Morton [Thu, 3 Apr 2003 00:29:13 +0000 (16:29 -0800)]
[PATCH] Fix jbd assert failure on IO error.
From: Stephen Tweedie <sct@redhat.com>
The buffer_uptodate flag gets cleared on IO failure, and this can panic jbd
when it tries to write such a buffer. Relax the panic to be just a warning.
Andrew Morton [Thu, 3 Apr 2003 00:29:06 +0000 (16:29 -0800)]
[PATCH] Add less-severe assert-failure form for ext3.
From: Stephen Tweedie <sct@redhat.com>
Add a new form of assert failure in ext3 which allows us to flag events which
are *usually* bugs, but which can be legally triggered in the presence of IO
failures. Don't panic the kernel on such errors unless we've defined
#JBD_PARANOID_IOFAIL, which will normally be set only for testing purposes.
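
One possible shape for such a softer assertion, as a userspace sketch. The
names toy_expect, toy_write_buffer and TOY_PARANOID_IOFAIL are hypothetical
and are not the macros added by the patch.

```c
#include <assert.h>
#include <stdio.h>

static int toy_warnings;  /* number of soft failures seen so far */

/* Returns nonzero if the expectation held.  With TOY_PARANOID_IOFAIL
 * defined, a failed expectation is fatal; otherwise it is only logged. */
static int toy_expect(int cond, const char *why)
{
	if (cond)
		return 1;
#ifdef TOY_PARANOID_IOFAIL
	assert(0 && "expectation failed");
#endif
	fprintf(stderr, "warning: %s\n", why);
	toy_warnings++;
	return 0;
}

/* A caller that backs out with an error instead of panicking. */
static int toy_write_buffer(int uptodate)
{
	if (!toy_expect(uptodate, "buffer not uptodate, possible IO failure"))
		return -5;  /* stands in for -EIO */
	return 0;
}
```

The caller gets a return value to act on, so the same check serves as a
hard assert in test builds and as a recoverable error otherwise.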
Andrew Morton [Thu, 3 Apr 2003 00:28:58 +0000 (16:28 -0800)]
[PATCH] remove dparent_lock
The big SMP machines are seeing quite some contention in dnotify_parent()
(via vfs_write). This function is hammering the global dparent_lock.
However we don't actually need a global dparent_lock for pinning down
dentry->d_parent. We can use dentry->d_lock for this. That is already being
held across d_move.
This patch speeds up SDET on the 16-way by 5% and wipes dnotify_parent() off
the profiles.
It also uninlines dnotify_parent().
It also uses spin_lock(), which is faster than read_lock().
I'm not sure that we need to take both the source and target dentry's d_lock
in d_move.
The patch also does lots of s/__inline__/inline/ in dcache.h
Andrew Morton [Thu, 3 Apr 2003 00:28:49 +0000 (16:28 -0800)]
[PATCH] real_lookup race fix
From: Maneesh Soni <maneesh@in.ibm.com>
Here is a patch to use seqlock for real_lookup race with d_lookup as suggested
by Linus. The race condition can result in a duplicate dentry when d_lookup
fails due to a concurrent d_move in some unrelated directory.
Apart from real_lookup, lookup_hash()->cached_lookup() can also fail for
the same reason. So, for that I am doing the d_lookup again.
Now we have __d_lookup (called from do_lookup() during pathwalk) and
d_lookup which uses seqlock to protect against the rename race.
dcachebench numbers (lower is better) don't have much difference on a 4-way
PIII xeon SMP box.
base-2565
Average usec/iteration 19059.4
Standard Deviation 503.07
base-2565 + seq_lock
Average usec/iteration 18843.2
Standard Deviation 450.57
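
The retry idea can be sketched in userspace roughly as follows. This is a
simplified sequence counter built on C11 atomics, not the kernel's seqlock,
and every toy_ name is an assumption for illustration. A failed lookup is
repeated only if a rename may have run while it was in progress.

```c
#include <assert.h>
#include <stdatomic.h>

/* Toy sequence counter: even means stable, odd means a writer (here,
 * a rename) is in progress. */
static atomic_uint toy_rename_seq;

static unsigned toy_read_begin(void)
{
	return atomic_load(&toy_rename_seq);
}

/* Retry if a writer was active at the start or has run since then. */
static int toy_read_retry(unsigned start)
{
	return (start & 1u) || atomic_load(&toy_rename_seq) != start;
}

static void toy_write_begin(void) { atomic_fetch_add(&toy_rename_seq, 1); }
static void toy_write_end(void)   { atomic_fetch_add(&toy_rename_seq, 1); }

/* Hypothetical lookup callback; returns nonzero when the entry is found. */
typedef int (*toy_lookup_fn)(void *arg);

/* Returns the lookup result; *tries reports how many passes were needed. */
static int toy_lookup(toy_lookup_fn fn, void *arg, int *tries)
{
	unsigned seq;
	int found;

	*tries = 0;
	do {
		seq = toy_read_begin();
		found = fn(arg);
		++*tries;
		if (found)
			break;
	} while (toy_read_retry(seq));
	return found;
}

/* Demo callback: the first call misses while simulating a concurrent
 * rename; later calls succeed. */
static int toy_racy_lookup(void *arg)
{
	int *calls = arg;

	if ((*calls)++ == 0) {
		toy_write_begin();
		toy_write_end();
		return 0;
	}
	return 1;
}
```

A successful lookup returns at once; only a miss pays for the second
counter read, which fits the small difference in the numbers above.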
Andrew Morton [Thu, 3 Apr 2003 00:28:27 +0000 (16:28 -0800)]
[PATCH] Fix devfs' partition handling
From: Andre Landwehr <andre.landwehr@gmx.net>
With / on an IDE hard disk, the disk's partitions do not appear in
devfs, only the disc device. This is due to rescan_partitions
being called twice and deleting but not re-creating the entries
during the second call.
Andrew Morton [Thu, 3 Apr 2003 00:28:18 +0000 (16:28 -0800)]
[PATCH] add vt console scrollback ioctl
From: Samuel Thibault <Samuel.Thibault@ens-lyon.fr>
There is no way for a braille device driven by brltty (userland root-owned
daemon) to scrollback the virtual console, the only way is to use the pc
keyboard. A very simple TIOCLINUX ioctl meets this need (tested).
Also add a command for bringing the last console to the top, as keyboard.c's
lastcons() does when pressing alt - down arrow.
Andrew Morton [Thu, 3 Apr 2003 00:28:11 +0000 (16:28 -0800)]
[PATCH] sync dirty pages in fadvise(FADV_DONTNEED)
This changes the fadvise(FADV_DONTNEED) operation to start async writeout of
any dirty pages in the file.
The thinking is that if the application doesn't want to use those pages in
the future, we may as well get IO underway against them so they can be freed
up on the next call to fadvise().
The POSIX spec does not go into any detail as to whether this is the right or
wrong behaviour.
This provides a nice way for applications which are writing streaming data
(the main users of fadvise) to keep the amount of dirty pagecache under
control without having to resort to system-wide VM tuning.
It also provides an "async fsync()". If the application passes in a length
of zero, fadvise will start async writeout of the pages, but will not
invalidate any of the file's pagecache.
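
A hedged sketch of how a streaming writer might use this from userspace,
assuming the call is reached through posix_fadvise(). The helper names and
the scratch file path are made up for the example, and error handling is
kept minimal.

```c
#define _XOPEN_SOURCE 600
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Write one chunk at `offset', then tell the kernel the range will not
 * be needed again.  Returns 0 on success or an errno value on failure. */
static int stream_chunk(int fd, const void *buf, size_t len, off_t offset)
{
	ssize_t n;

	assert(fd >= 0);
	n = pwrite(fd, buf, len, offset);
	if (n < 0)
		return errno;
	if ((size_t)n != len)
		return EIO;  /* short write; a real caller would retry */
	/* posix_fadvise() returns the error number directly, not via errno. */
	return posix_fadvise(fd, offset, (off_t)len, POSIX_FADV_DONTNEED);
}

/* Demo: stream three chunks into a scratch file at `path'. */
static int demo_stream(const char *path)
{
	static const char chunk[4096] = "streaming data";
	int fd = open(path, O_CREAT | O_TRUNC | O_WRONLY, 0600);
	int rc = 0;

	if (fd < 0)
		return errno;
	for (int i = 0; i < 3 && rc == 0; i++)
		rc = stream_chunk(fd, chunk, sizeof chunk,
				  (off_t)i * (off_t)sizeof chunk);
	close(fd);
	unlink(path);
	return rc;
}
```

Calling the advice after each chunk is what keeps the amount of dirty
pagecache bounded; the chunk size here is arbitrary.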
Andrew Morton [Thu, 3 Apr 2003 00:27:56 +0000 (16:27 -0800)]
[PATCH] aic7xxx timer deletion fix
From: Zwane Mwaikambo <zwane@linuxpower.ca>
ahc_linux_free_device() needs to use del_timer_sync(). slab corruption has
been observed due to the timer handler running after the containing object
was freed.
Andrew Morton [Thu, 3 Apr 2003 00:27:40 +0000 (16:27 -0800)]
[PATCH] struct stat - support larger dev_t
From: Andries.Brouwer@cwi.nl
Below a patch that changes struct stat for a number of
architectures. Maintainers, please watch carefully.
Struct stat is used to transfer information from kernel
to user space on a stat() system call.
It has fields st_dev, st_rdev.
The size of these fields is in principle unrelated to
the size of a dev_t in user space or the size of a
dev_t or kdev_t in kernel space.
It is just the "capacity" of the channel.
The actual amount of useful information is the minimum
of the four sizes (kernel dev_t, kernel kdev_t,
user dev_t, width of stat st_dev, st_rdev fields).
The goal of this patch is to make sure that the stat() and stat64()
system calls transmit at least 32 and 64 bits, respectively.
This is achieved by using the padding that was present already.
We fail when no padding was present, or when the padding is on
the wrong side (after the field, while the machine is big-endian).
alpha:   stat:   uses unsigned int, 32 bits
arm:     stat:   uses unsigned short - bad.
                 The padding is on one side, which means that this can
                 be made into unsigned long only on little endian systems.
                 FIXED - unless __ARMEB__.
         stat64: used unsigned short - FIXED, now unsigned long long.
cris:    stat:   used unsigned short - FIXED, now unsigned long
         stat64: used unsigned short - FIXED, now unsigned long long.
i386:    stat:   used unsigned short - FIXED, now unsigned long
         stat64: used unsigned short - FIXED, now unsigned long long.
ia64:    stat:   uses unsigned long, 64 bits
m68k:    stat:   used unsigned short - bad, but this cannot be fixed
                 since m68k is big-endian, and the available padding is on
                 the wrong side. NOT FIXED.
         stat64: used unsigned short - FIXED, now unsigned long long.
mips:    stat:   uses dev_t which is unsigned int, 32 bits
         stat64: used unsigned long, 32 bits. NOT FIXED.
                 (There is padding on one side, so this can be fixed if __MIPSEL__.)
mips64:  stat:   uses dev_t which is unsigned int, 32 bits
parisc:  stat:   uses dev_t, 32 bits
         stat64: uses unsigned long long, 64 bits
ppc:     stat:   uses dev_t which is unsigned int, 32 bits
         stat64: unsigned long long, 64 bits
ppc64:   stat:   uses dev_t which is unsigned long, 64 bits
         stat64: uses unsigned long, 64 bits
sparc:   stat:   uses unsigned short, no padding. NOT FIXED.
         stat64: used unsigned short - FIXED, now unsigned long long.
sparc64: stat:   uses dev_t which is unsigned int, 32 bits
         stat64: used unsigned short - FIXED, now unsigned long long.
s390:    stat:   used unsigned short, big-endian, padding on the wrong side,
                 NOT FIXED.
         stat64: used unsigned short - FIXED, now unsigned long long.
s390x:   stat:   uses unsigned long, 64 bits
sh:      stat:   used unsigned short, but padding maybe on wrong side.
                 NOT FIXED.
         stat64: used unsigned short - FIXED, now unsigned long long.
v850:    stat:   used __kernel_dev_t.
                 BUG: NEVER use __kernel types in a user space interface.
                 Replaced the types. FIXED - now unsigned int - 32 bits.
         stat64: FIXED - now unsigned long long - 64 bits.
x86_64:  stat:   uses unsigned long, 64 bits
So, on most architectures we achieve the aim of 32 bits for stat,
64 bits for stat64. On all architectures we achieve at least
16 bits for stat, 32 bits for stat64.
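
The "padding on the wrong side" problem can be shown with a small
standalone sketch. The two toy structs are assumptions (a 16-bit field
followed by 16 bits of padding, versus one 32-bit field) and do not match
any real architecture's struct stat.

```c
#include <assert.h>
#include <string.h>

/* Old layout: a 16-bit device number followed by 16 bits of padding. */
struct toy_old_stat { unsigned short st_dev; unsigned short pad; };

/* New layout: the padding is absorbed into a 32-bit field. */
struct toy_new_stat { unsigned int st_dev; };

static int toy_host_is_little_endian(void)
{
	unsigned int one = 1;
	unsigned char first;

	memcpy(&first, &one, 1);
	return first == 1;
}

/* Fill the new layout, then read it back through the old one, as an
 * old binary would.  Returns nonzero if the old reader still sees the
 * low 16 bits of the device number. */
static int toy_old_reader_ok(unsigned int dev)
{
	struct toy_new_stat n = { dev };
	struct toy_old_stat o;

	assert(sizeof o == sizeof n);  /* assumes 16-bit short, 32-bit int */
	memcpy(&o, &n, sizeof o);
	return o.st_dev == (unsigned short)(dev & 0xffffu);
}
```

With the padding after the field, the old reader keeps working only on a
little-endian host; on a big-endian host it sees the high half instead.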
Andrew Morton [Thu, 3 Apr 2003 00:27:33 +0000 (16:27 -0800)]
[PATCH] tmpfs 6/6: percentile sizing of tmpfs
From: CaT <cat@zip.com.au>
What this patch does is allow you to specify the max amount of memory tmpfs
can use as a percentage of available real ram. This (in my eyes) is useful
so that you do not have to remember to change the setting if you want
something other than 50% and some of your ram goes.
Hugh redid the arithmetic to not overflow at 4GB; the particular order of
lines helps RH's gcc-2.96-110 not to get confused in the do_div. 2.5 can use
totalram_pages. Update mount options in tmpfs Doc.
There's an argument that the percentage should be of ram+swap, that's what
Christoph originally intended. But we set the default at 50% of ram only, so
I believe it's more consistent to follow that precedent.
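
A sketch of the percentage arithmetic, assuming 4 KiB pages. The function
name, TOY_PAGE_SHIFT and the 100% cap are assumptions for this example; the
point is only that the intermediate product is held in 64 bits.

```c
#include <assert.h>

#define TOY_PAGE_SHIFT 12  /* assumed 4 KiB pages */

/* Convert "N%" of RAM into a byte limit.  The intermediate product is
 * kept in 64 bits so that it cannot wrap at 4 GiB. */
static unsigned long long toy_percent_to_bytes(unsigned int percent,
					       unsigned long totalram_pages)
{
	unsigned long long size = percent;

	assert(percent <= 100);  /* cap assumed by this sketch only */
	size <<= TOY_PAGE_SHIFT;
	size *= totalram_pages;
	size /= 100;
	return size;
}
```

For example, 50% of 1048576 pages (4 GiB of RAM at 4 KiB per page) comes out
as 2147483648 bytes.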
Andrew Morton [Thu, 3 Apr 2003 00:27:19 +0000 (16:27 -0800)]
[PATCH] tmpfs 4/6: use mark_page_accessed
From: Hugh Dickins <hugh@veritas.com>
tmpfs pages should be surfing the LRUs in the company of their filemap
friends: I was expecting the rules to change, but they've been stable so
long, let's sprinkle mark_page_accessed in the equivalent places here; but
(don't ask me why) SetPageReferenced in shmem_file_write. Ooh, and
shmem_populate was missing a flush_page_to_ram.
Andrew Morton [Thu, 3 Apr 2003 00:26:51 +0000 (16:26 -0800)]
[PATCH] file limit checking simplification
From: Hugh Dickins <hugh@veritas.com>
When handling rlimit != RLIM_INFINITY, generic_write_checks tests file
position against 0xFFFFFFFFULL, and casts it to a u32. This code is
carried forward from 2.4.4, and the 2.4-ac tree contains an apparently
obvious fix to one part of it (should set count to 0 not to a negative).
But when you think it through, it all turns out to be bogus.
On a 32-bit architecture: limit is a 32-bit unsigned long, we've
already handled *pos < 0 and *pos >= limit, so *pos here has no way
of being > 0xFFFFFFFFULL, and thus casting it to u32 won't truncate it.
And on a 64-bit architecture: limit is a 64-bit unsigned long, but this
code is disallowing file position beyond the 32 bits; or if there's some
userspace compatibility issue, with limit having to fit into 32 bits,
the 32-bit architecture argument applies and they're still irrelevant.
So just remove the 0xFFFFFFFFULL test; and in place of the u32, cast to
typeof(limit) so it's right even if rlimits get wider. And there's no
way we'd want to send SIGXFSZ below the limit: remove send_sig comment.
There's a similarly suspicious u32 cast a little further down, when
checking MAX_NON_LFS. Given its definition, that does no harm on any
arch: but it's better changed to unsigned long, the type of MAX_NON_LFS.
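
The simplified limit check can be sketched as follows. This is a userspace
approximation with a hypothetical function name, using unsigned long where
the description above uses typeof(limit), and it leaves out the signal
delivery.

```c
#include <assert.h>
#include <errno.h>

/* Clamp a write of *count bytes at position pos against a file size
 * limit.  Returns 0 (possibly with *count shortened) or -EFBIG when
 * the write starts at or beyond the limit. */
static int toy_check_limit(long long pos, unsigned long *count,
			   unsigned long limit)
{
	assert(pos >= 0);  /* negative positions are rejected earlier */
	if ((unsigned long long)pos >= limit)
		return -EFBIG;
	/* pos < limit here, so casting it to the limit's type cannot
	 * truncate, whatever the width of unsigned long. */
	if (*count > limit - (unsigned long)pos)
		*count = limit - (unsigned long)pos;
	return 0;
}
```

Because the position is already known to be below the limit, no separate
0xFFFFFFFF test is needed on either 32-bit or 64-bit.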
Andrew Morton [Thu, 3 Apr 2003 00:26:43 +0000 (16:26 -0800)]
[PATCH] bio kmapping changes
RAID5 is calling copy_data() under sh->lock. But copy_data() does kmap(),
which can sleep.
The best fix is to use kmap_atomic() in there. It is faster than kmap() and
does not block.
The patch removes the unused bio_kmap() and replaces __bio_kmap() with
__bio_kmap_atomic(). I think it's best to withdraw the sleeping-and-slow
bio_kmap() from the kernel API before someone else tries to use it.
Also, I notice that bio_kmap_irq() was using local_save_flags(). This is a
bug - local_save_flags() does not disable interrupts. Converted that to
local_irq_save(). These names are terribly chosen.
Andrew Morton [Thu, 3 Apr 2003 00:26:28 +0000 (16:26 -0800)]
[PATCH] monotonic clock source for hangcheck timer
From: john stultz <johnstul@us.ibm.com>
This patch, written with the advice of Joel Becker, addresses a problem with
the hangcheck-timer.
The basic problem is that the hangcheck-timer code (required for Oracle)
needs an accurate hard clock which can be used to detect OS stalls (due to
udelay() or pci bus hangs) that would cause system time to skew (it's sort of
a sanity check that ensures the system's notion of time is accurate).
However, currently they are using get_cycles() to fetch the cpu's TSC
register, thus this does not work on systems w/o a synced TSC.
As suggested by Andi Kleen (see thread here:
http://www.uwsg.iu.edu/hypermail/linux/kernel/0302.0/1234.html ) I've worked
with Joel and others to implement the monotonic_clock() interface. Some of
the major considerations made when writing this patch were
o Needs to be able to return accurate time in the absence of multiple timer
interrupts
o Needs to be abstracted out from the hardware
o Avoids impacting gettimeofday() performance
This interface returns an unsigned long long representing the number of
nanoseconds that have passed since time_init().
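
As a userspace analogy only, the same kind of stall check can be written
against clock_gettime(CLOCK_MONOTONIC). This is a swapped-in technique and
not the in-kernel monotonic_clock() interface; the toy_ helpers are
hypothetical.

```c
#define _XOPEN_SOURCE 600
#include <assert.h>
#include <time.h>

/* Nanoseconds from a monotonic source.  Unlike the kernel interface
 * described above, the starting point here is unspecified. */
static unsigned long long toy_monotonic_ns(void)
{
	struct timespec ts;
	int rc = clock_gettime(CLOCK_MONOTONIC, &ts);

	assert(rc == 0);
	(void)rc;
	return (unsigned long long)ts.tv_sec * 1000000000ULL
	       + (unsigned long long)ts.tv_nsec;
}

/* Hangcheck-style test: did more time pass than the caller allowed? */
static int toy_stalled(unsigned long long start_ns,
		       unsigned long long now_ns,
		       unsigned long long max_ns)
{
	return now_ns - start_ns > max_ns;
}
```

A periodic timer would record toy_monotonic_ns() on each tick and flag a
stall when the gap between two ticks exceeds the expected interval plus
some margin.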
Andrew Morton [Thu, 3 Apr 2003 00:26:20 +0000 (16:26 -0800)]
[PATCH] handle bad inodes in put_inode
From: "J. Bruce Fields" <bfields@fieldses.org>
If the NFS daemon is presented with a filehandle for a file that has
been deleted, it does an iget() in fs/exportfs/expfs.c:export_iget() and
gets a bad inode back. When it subsequently iput()s the inode, the
result is:
Mar 27 12:53:40 snoopy kernel: EXT2-fs error (device ide0(3,3)): ext2_free_blocks: Freeing blocks not in datazone - block = 1802201963, count = 27499
Mar 27 12:53:40 snoopy kernel: Remounting filesystem read-only
The same can happen if ext2_get_inode() returns an error - ext2_read_inode()
will return an uninitialised inode and ext2_put_inode() is not allowed to go
looking inside the bad inode.
Andrew Morton [Thu, 3 Apr 2003 00:26:13 +0000 (16:26 -0800)]
[PATCH] tmpfs blk_congestion_wait fix
From: Hugh Dickins <hugh@veritas.com>
The blk_congestion_waits in shmem_getpage are appropriate when the error is
-ENOMEM, but not when the error is -EEXIST. So add that test in the first
instance, but omit it all in the second instance.