Martin Dalecki [Tue, 28 May 2002 09:25:46 +0000 (02:25 -0700)]
[PATCH] 2.5.18 QUEUE_EMPTY and the unpleasant friends.
- Eliminate all usages of the obscure QUEUE_EMPTY macro.
- Eliminate all unnecessary checks for RQ_INACTIVE, this can't happen during
the time we run the request strategy routine of a single major number block
device. Perhaps the still remaining usage in scsi and i2o_block.c should be
killed as well, since the upper ll_rw_blk layer shouldn't pass inactive
requests down.
Those are all places where we have deeply buried and hidden major number
indexed arrays. Let's deal with them slowly...
Martin Dalecki [Tue, 28 May 2002 09:25:34 +0000 (02:25 -0700)]
[PATCH] airo
Since apparently nobody else has cared thus far, and since I'm using
this driver, well here it comes:
- Adjust the airo wireless LAN card driver for the fact that modules
don't export symbols by default any longer.
- Make some stuff which obviously should be static there static as well.
(Plenty of code in Linux actually deserves a review for this
far too common bug...)
Robert Love [Tue, 28 May 2002 09:22:13 +0000 (02:22 -0700)]
[PATCH] preempt-safe net/ code
This fixes three locations in net/ where per-CPU data could bite us
under preemption. This is the result of an audit I did and should
constitute all of the unsafe code in net/.
In net/core/skbuff.c I did not have to introduce any code - just
rearrange the grabbing of smp_processor_id() to be in the interrupt off
region. Pretty clean fixes.
Note in the future we can use put_cpu() and get_cpu() to grab the CPU#
safely. I will send a patch to Marcelo so we can have a 2.4 version
(which doesn't do the preempt stuff), too...
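A rough illustration of the get_cpu()/put_cpu() pattern mentioned above,
written as a userspace model rather than kernel code. Only the names
get_cpu() and put_cpu() come from the note; the helper bodies, the
counters and count_packet() are stand-ins invented for this sketch.

```c
/* Userspace model only: the bodies below are stand-ins, not kernel code. */
#define NR_CPUS 4

static int preempt_count;                 /* models the preemption counter */
static int current_cpu;                   /* models the CPU we are running on */
static unsigned long pkt_stats[NR_CPUS];  /* hypothetical per-CPU counters */

/* Disable "preemption" first, then read the CPU number. */
static int get_cpu(void)
{
    preempt_count++;
    return current_cpu;
}

/* Re-enable "preemption" once the per-CPU data is no longer touched. */
static void put_cpu(void)
{
    preempt_count--;
}

/* The CPU number stays valid for the whole get_cpu()/put_cpu() window. */
static void count_packet(void)
{
    int cpu = get_cpu();

    pkt_stats[cpu]++;
    put_cpu();
}
```

The point is the ordering: the CPU number is fetched only after
preemption is off, so it cannot go stale before the per-CPU access.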
Robert Love [Tue, 28 May 2002 09:21:53 +0000 (02:21 -0700)]
[PATCH] set_cpus_allowed optimization
This adds an optimization to set_cpus_allowed: if the task is not
running, there is no sense in kicking the migration_threads into action,
we just need to update task->cpu. This was suggested by Mike Kravetz.
Besides being an optimization, this would prevent any future race
between set_cpus_allowed and the migration_threads.
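The decision described above can be sketched in userspace C. This is a
simplified model, not the patch: struct task_model, its fields and the
return codes are assumptions made for illustration.

```c
/* Simplified model of a task; the fields are illustrative only. */
struct task_model {
    unsigned long cpus_allowed;  /* bitmask of permitted CPUs */
    int cpu;                     /* CPU the task is assigned to */
    int on_runqueue;             /* non-zero while the task is runnable */
};

/* Index of the lowest set bit; the caller guarantees mask != 0. */
static int first_allowed_cpu(unsigned long mask)
{
    int cpu = 0;

    while (!(mask & 1UL)) {
        mask >>= 1;
        cpu++;
    }
    return cpu;
}

/*
 * Returns -1 for an empty mask, 0 if no migration thread is needed,
 * and 1 if the task is runnable on a now-forbidden CPU and would have
 * to be handed to the migration thread.
 */
static int set_cpus_allowed_model(struct task_model *p, unsigned long new_mask)
{
    if (new_mask == 0)
        return -1;
    p->cpus_allowed = new_mask;
    if (new_mask & (1UL << p->cpu))
        return 0;                       /* already on an allowed CPU */
    if (!p->on_runqueue) {
        p->cpu = first_allowed_cpu(new_mask);
        return 0;                       /* not running: just update the field */
    }
    return 1;
}
```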
Robert Love [Tue, 28 May 2002 09:21:39 +0000 (02:21 -0700)]
[PATCH] documentation for the new scheduler
This adds documentation about the O(1) scheduler to Documentation/. The
new scheduler is complicated and providing future scheduler hackers some
background seems a Good Thing to me.
Specifically:
- add Documentation/sched-coding.txt: an overview of the functions,
magic numbers, and variables in the scheduler as well as (most
importantly) a review of the locking semantics.
- add Documentation/sched-design.txt: an edited version of Ingo's
initial email to lkml about his scheduler. Goes over the design,
implementation, and goals of the scheduler. I tried to edit it where
needed to bring it in line with the scheduler as it is today.
- modify kernel/sched.c: update your copyright and add a change entry
for the new scheduler.
Robert Love [Tue, 28 May 2002 09:21:25 +0000 (02:21 -0700)]
[PATCH] trivial: no "error" on preempt_count notice
The attached trivial patch simply changes the printk debug statement in
do_exit when preempt_count!=0 to say "note" instead of "error" and log
at KERN_INFO in lieu of KERN_ERR.
I want to keep the message around a bit, but people get too paranoid
when things like nfsd legitimately exit with a preempt_count=1.
Andrew Morton [Mon, 27 May 2002 12:15:03 +0000 (05:15 -0700)]
[PATCH] avoid sys_sync livelocks
This makes sure that sys_sync() will terminate. It counts up the
number of dirty pages in the machine and will refuse to write out more
than 1.25 times this number of pages. This function is called twice
on the sys_sync() path, so the kernel will actually write 2.5x the number
of initially-dirty pages before giving up.
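The arithmetic above, as a small sketch. The function names are made up
here, and it assumes both passes are budgeted from the same initial
dirty-page count.

```c
/* Per-pass cap: 1.25 times the number of dirty pages. */
static unsigned long writeback_budget(unsigned long nr_dirty)
{
    return nr_dirty + nr_dirty / 4;
}

/* Two passes on the sys_sync() path give 2.5 times overall. */
static unsigned long sync_path_budget(unsigned long nr_dirty)
{
    return 2 * writeback_budget(nr_dirty);
}
```

With 1000 dirty pages each pass stops after 1250 pages, so the whole
path writes at most 2500.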
Andrew Morton [Mon, 27 May 2002 12:13:58 +0000 (05:13 -0700)]
[PATCH] generic_file_write() cleanup
Fixes all the goto spaghetti in generic_file_write() and turns it into
something which humans can understand.
Andi tells me that gcc3 does a decent job of relocating blocks out of
line anyway. This patch gives the compiler a helping hand with
appropriate use of likely() and unlikely().
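For readers who have not met the hints: a minimal sketch of likely() and
unlikely() built on the GCC/Clang __builtin_expect() builtin. The macro
bodies follow the usual kernel style; copy_chunk() is a hypothetical
function used only to show where the hint goes.

```c
#include <stddef.h>

/* Branch-prediction hints; these need GCC or Clang. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Copies len bytes; the error path is marked as the cold one. */
static long copy_chunk(char *dst, const char *src, size_t len)
{
    size_t i;

    if (unlikely(dst == NULL || src == NULL))
        return -1;
    for (i = 0; i < len; i++)
        dst[i] = src[i];
    return (long)len;
}
```

The hints do not change behaviour, they only tell the compiler which
side of the branch to keep on the straight-line path.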
Andrew Morton [Mon, 27 May 2002 12:13:29 +0000 (05:13 -0700)]
[PATCH] dirsync
An implementation of directory-synchronous mounts.
I sent this out some months ago and it didn't generate a lot of
interest. Later we had one of the usual cheery exchanges with Wietse
Venema (postfix development) and he agreed that directory synchronous
mounts were something that he could use, and that there was benefit in
implementing them in Linux. If you choose to apply this I'll push the
2.4 patch.
Patch against e2fsprogs-1.26:
http://www.zip.com.au/~akpm/linux/dirsync/e2fsprogs-1.26.patch
Patch against util-linux-2.11n:
http://www.zip.com.au/~akpm/linux/dirsync/util-linux-2.11n.patch
The kernel patch includes implementations for ext2 and ext3. It's
pretty simple.
- When dirsync is in operation against a directory, the following operations
are synchronous within that directory: create, link, unlink, symlink,
mkdir, rmdir, mknod, rename (synchronous if either the source or dest
directory is dirsync).
- dirsync is a subset of sync. So `mount -o sync' or `chattr +S'
give you everything which `mount -o dirsync' or `chattr +D' gives,
plus synchronous file writes.
- ext2's inode.i_attr_flags is unused, and is removed.
- mount /dev/foo /mnt/bar -o dirsync works as expected.
- An ext2 or ext3 directory tree can be set dirsync with `chattr +D -R'.
- dirsync is maintained as new directories are created under
a `chattr +D' directory. Like `chattr +S'.
- Other filesystems can trivially be taught about dirsync. It's just
a matter of replacing `IS_SYNC(inode)' with `IS_DIRSYNC(inode)' in
the directory update functions. IS_SYNC will still be honoured when
IS_DIRSYNC is used.
- Non-directory files do not have their dirsync flag propagated. So
an S_ISREG file which is created inside a dirsync directory will not
have its dirsync bit set. chattr needs to do this as well.
- There was a bit of version skew between e2fsprogs' idea of the
inode flags and the kernel's. That is sorted out here.
- `lsattr' shows the dirsync flag as "D". The letter "D" was
previously being used for Compressed_Dirty_File. I changed
Compressed_Dirty_File to use "Z". Is that OK?
The mount(2) manpage needs to be taught about MS_DIRSYNC.
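The subset relationship between sync and dirsync can be modelled like
this. It is a sketch under assumptions: the flag values, struct
inode_model and the helper names are invented here and are not the
kernel's definitions of IS_SYNC and IS_DIRSYNC.

```c
#include <stdbool.h>

#define S_SYNC_MODEL    0x1u  /* hypothetical bit values */
#define S_DIRSYNC_MODEL 0x2u

struct inode_model {
    unsigned int flags;
};

/* Full sync: file writes and directory updates are synchronous. */
static bool is_sync(const struct inode_model *inode)
{
    return (inode->flags & S_SYNC_MODEL) != 0;
}

/* dirsync is a subset of sync, so either flag satisfies the test. */
static bool is_dirsync(const struct inode_model *inode)
{
    return (inode->flags & (S_SYNC_MODEL | S_DIRSYNC_MODEL)) != 0;
}

/* rename is synchronous if either directory involved is dirsync. */
static bool rename_is_synchronous(const struct inode_model *old_dir,
                                  const struct inode_model *new_dir)
{
    return is_dirsync(old_dir) || is_dirsync(new_dir);
}
```

This is why swapping the sync test for the dirsync test in a directory
update function keeps the old behaviour for sync mounts.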
Andrew Morton [Mon, 27 May 2002 12:12:50 +0000 (05:12 -0700)]
[PATCH] direct-to-BIO writeback
Multipage BIO writeout from the pagecache.
It's pretty much the same as multipage reads. It falls back to buffers
if things get complex.
The write case is a little more complex because it handles pages which
have buffers and pages which do not. If the page didn't have buffers
this code does not add them.
Andrew Morton [Mon, 27 May 2002 12:12:36 +0000 (05:12 -0700)]
[PATCH] direct-to-BIO readahead
Implements BIO-based multipage reads into the pagecache, and turns this
on for ext2.
CPU load for `cat large_file > /dev/null' is reduced by approximately
15%. Similar reductions for tiobench with a single thread. (Earlier
claims of 25% were exaggerated - they were measured with slab debug
enabled. But 15% isn't bad for a load which is dominated by copy_*_user
costs).
With 2, 4 and 8 tiobench threads, throughput is increased as well, which was
unexpected. It's due to request queue weirdness. (Generally the
request queueing is doing bad things under certain workloads - that's a
separate issue.)
BIOs of up to 64 kbytes are assembled and submitted for readahead and
for single-page reads. So the work involved in reading 32 pages has gone
from:
These pages never have buffers attached. Buffers will be attached
later if the application writes to these pages (file overwrite).
The first version of this code (in the "delayed allocation" patches)
tries to handle everything - bios which start mid-page, bios which end
mid-page and pages which are covered by multiple bios. It is very
complex code and in fact appears to be incorrect: out-of-order BIO
completion could cause a page to come unlocked at the wrong time.
This implementation is much simpler: if things get complex, it just
falls back to the buffer-based block_read_full_page(), which isn't
going away, and which understands all that complexity. There's no
point in doing this in two places.
This code will bypass the buffer layer for
- fully-mapped pages which are on-disk contiguous.
- fully unmapped pages (holes)
- partially unmapped pages, where the unmappedness is at the end of
the page (end-of-file).
and everything else falls back to buffers.
This means that with blocksize == PAGE_CACHE_SIZE, 100% of pages are
handed direct to BIO. With a heavy 10-minute dbench run on 4k
PAGE_CACHE_SIZE and 1k blocks, 95% of pages were handed direct to BIO.
Almost all of the other 5% were passed to block_read_full_page()
because they were already partially uptodate from an earlier sub-page
write(). This ratio will fall if PAGE_CACHE_SIZE/blocksize is greater
than four. But if that's the case, CPU efficiency is far from the main
concern - there are significant seek and bandwidth problems just at 4
blocks per page.
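The three bypass cases listed earlier can be expressed as one check over
the blocks of a page. This is a userspace sketch, not the fs/mpage.c
logic: struct blk is invented here, "contiguous" is taken to mean
consecutive block numbers, and the partially-uptodate fallback mentioned
above is not modelled.

```c
#include <stdbool.h>
#include <stddef.h>

/* One filesystem block within a page; an illustrative layout. */
struct blk {
    bool mapped;          /* has a disk mapping */
    unsigned long block;  /* on-disk block number, valid if mapped */
};

/*
 * True if the page can skip the buffer layer: a contiguous run of
 * mapped blocks (possibly empty) followed only by unmapped blocks.
 */
static bool page_goes_direct_to_bio(const struct blk *b, size_t n)
{
    size_t i = 0;

    /* The leading mapped run must be contiguous on disk. */
    while (i < n && b[i].mapped) {
        if (i > 0 && b[i].block != b[i - 1].block + 1)
            return false;
        i++;
    }
    /* After the first unmapped block, nothing may be mapped again. */
    for (; i < n; i++) {
        if (b[i].mapped)
            return false;
    }
    return true;
}
```

A page with a hole in the middle, or with blocks scattered on disk,
fails the check and would take the buffer-based path.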
This code will stress out the block layer somewhat - RAID0 doesn't like
multipage BIOs, and there are probably others. RAID0 seems to struggle
along - readahead fails but read falls back to single-page reads, which
succeed. Such problems may be worked around by setting MPAGE_BIO_MAX_SIZE
to PAGE_CACHE_SIZE in fs/mpage.c.
It is trivial to enable multipage reads for many other filesystems. We
can do that after completion of external testing of ext2.
Andrew Morton [Mon, 27 May 2002 12:12:22 +0000 (05:12 -0700)]
[PATCH] relax nr_to_write requirements
Relax the requirements on the writeback_mapping a_op.
This function is passed the number of pages which it should write. The
current fs-writeback.c code will get confused if the address_space
writes back more pages than it was asked to.
With this change the address_space may write more pages than required
if that is convenient. Extent-based filesystems may wish to do this.
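One way to picture the relaxed contract, as a sketch only. It assumes
the budget is a signed counter that the caller tests with "<= 0" rather
than "== 0"; write_extents() and budget_exhausted() are hypothetical
names, not functions from fs-writeback.c.

```c
/*
 * Hypothetical extent-based writer: rounds the request up to whole
 * extents, so it may write more pages than it was asked for.
 */
static long write_extents(long dirty, long extent, long *nr_to_write)
{
    long want = *nr_to_write > 0 ? *nr_to_write : 0;
    long written = ((want + extent - 1) / extent) * extent;

    if (written > dirty)
        written = dirty;
    *nr_to_write -= written;  /* may legitimately go negative */
    return written;
}

/* A caller that tolerates overshoot treats any value <= 0 as done. */
static int budget_exhausted(long nr_to_write)
{
    return nr_to_write <= 0;
}
```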
Andrew Morton [Mon, 27 May 2002 12:12:08 +0000 (05:12 -0700)]
[PATCH] mark swapout pages PageWriteback()
Pages which are under writeout to swap are locked, and not
PageWriteback(). So page allocators do not throttle against them in
shrink_caches().
This causes enormous list scans and general coma under really heavy
swapout loads.
One fix would be to teach shrink_cache() to wait on PG_locked for swap
pages. The other approach is to set both PG_locked and PG_writeback
for swap pages so they can be handled in the same manner as file-backed
pages in shrink_cache().
Andrew Morton [Mon, 27 May 2002 12:11:55 +0000 (05:11 -0700)]
[PATCH] fix loop driver for large BIOs
Fix bug in the loop driver.
When presented with a multipage BIO, loop is overindexing the first
page in the BIO rather than advancing to the second page. It scribbles
on the backing file and/or on kernel memory.
This happens with multipage BIO-based pagecache I/O and presumably with
O_DIRECT also.
The fix is much-needed with the multipage-BIO patches - using that code
on loop-backed filesystems has rather messy results.
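The class of bug is easy to show in miniature. Below is a userspace
sketch of the correct walk, stepping to the next segment instead of
indexing past the end of the first page. struct seg and copy_segments()
are invented for illustration; the real bio vector layout differs.

```c
#include <stddef.h>
#include <string.h>

/* One segment of a multipage request; an illustrative layout. */
struct seg {
    char *page;     /* start of this segment's page */
    size_t offset;  /* offset of the data within the page */
    size_t len;     /* number of bytes in this segment */
};

/*
 * Copies every segment in turn. Each segment is addressed through its
 * own page pointer, never as an offset from the first page.
 */
static size_t copy_segments(char *dst, const struct seg *v, size_t nsegs)
{
    size_t done = 0;
    size_t i;

    for (i = 0; i < nsegs; i++) {
        memcpy(dst + done, v[i].page + v[i].offset, v[i].len);
        done += v[i].len;
    }
    return done;
}
```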
Andrew Morton [Mon, 27 May 2002 12:11:29 +0000 (05:11 -0700)]
[PATCH] block_truncate_page fix
Fix bug in block_truncate_page().
When buffers are attached to an uptodate page, they are marked as
being uptodate, to preserve buffer/page state coherency. Dirtiness
is handled in the same way.
But block_truncate_page() assumes that a buffer which is unmapped and
uptodate is over a hole. That's not the case, and the net effect is
that block_truncate_page() is failing to zero the block outside the
truncation point.
This only happens if the page has a disk mapping but has no attached
buffers on entry to block_truncate_page(). That's never the case in
current kernels, so the problem does not exhibit (it _does_ exhibit
with direct-to-BIO bypass-the-buffers I/O).
There are actually three possible states of buffer mappedness:
- Buffer has a disk mapping (buffer_mapped(bh) == true)
- buffer is over a hole (buffer_mapped(bh) == false)
- don't know. Need to run get_block() (buffer_mapped(bh) == false)
This ambiguity could be resolved by adding another buffer state bit
(BH_mapping_state_known?) but given that we already elide the get_block
calls for the common case (buffer outside i_size) it is unlikely that
the complexity is worthwhile.
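To make the ambiguity concrete, here is a userspace model of the three
states. It includes the extra "state known" bit purely to show what it
would buy; as noted above, that bit is not being added. struct bh_model
and the helper names are assumptions for this sketch.

```c
#include <stdbool.h>

enum mapping_state { MAPPING_MAPPED, MAPPING_HOLE, MAPPING_UNKNOWN };

struct bh_model {
    bool mapped;       /* models buffer_mapped(bh) */
    bool state_known;  /* models the floated, unimplemented extra bit */
};

/* Without state_known, HOLE and UNKNOWN both look like !mapped. */
static enum mapping_state classify_mapping(const struct bh_model *bh)
{
    if (bh->mapped)
        return MAPPING_MAPPED;
    return bh->state_known ? MAPPING_HOLE : MAPPING_UNKNOWN;
}

/* Only the "don't know" state requires asking the filesystem. */
static bool needs_get_block(const struct bh_model *bh)
{
    return classify_mapping(bh) == MAPPING_UNKNOWN;
}
```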
Frank Davis [Mon, 27 May 2002 05:02:30 +0000 (22:02 -0700)]
[PATCH] net/ipv4/ipconfig.c minor fix
Hello all,
The following patch fixes two compile warnings 'defined but not used'.
Since the label and int are only used for IPCONFIG_DYNAMIC, appropriate
fixes were made to remove the warnings.
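The general shape of such a fix, as a standalone sketch: the variable
and the label live under the same #ifdef as their only users, so nothing
is left "defined but not used" when the option is off. All names here
(IPCONFIG_DYNAMIC_MODEL, probe(), auto_config_model()) are made up and
do not come from ipconfig.c.

```c
/* Comment this out to build the variant without retries. */
#define IPCONFIG_DYNAMIC_MODEL 1

static int attempts_made;

/* Stand-in for a configuration attempt; succeeds on the second call. */
static int probe(void)
{
    return ++attempts_made >= 2;
}

static int auto_config_model(void)
{
#ifdef IPCONFIG_DYNAMIC_MODEL
    int retries = 3;  /* only declared when the retry path exists */
retry:
#endif
    if (probe())
        return 0;
#ifdef IPCONFIG_DYNAMIC_MODEL
    if (--retries > 0)
        goto retry;
#endif
    return -1;
}
```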
Rusty Russell [Mon, 27 May 2002 05:01:37 +0000 (22:01 -0700)]
[PATCH] jiffies.h includes asm/param.h
Tim Schmielau <tim@physik3.uni-rostock.de>: provide HZ from jiffies.h:
Most files that include <jiffies.h> also need HZ defined, which is
quite reasonable. So don't require them to include <asm/param.h>
themselves.
Rusty Russell [Mon, 27 May 2002 05:01:11 +0000 (22:01 -0700)]
[PATCH] ppc chrp/start.c warnings removal
Rusty Russell <rusty@rustcorp.com.au>: Finally squish those chrp_start.c warnings:
They finally irritated me enough to patch. 2.5, should apply against 2.4.
Rusty Russell [Mon, 27 May 2002 05:00:46 +0000 (22:00 -0700)]
[PATCH] ppc spinlock warning removal
Rusty Russell <rusty@rustcorp.com.au>: 2.5.17 Warning removal for ppc:
test_and_set_bit now expects an "unsigned long", so we want
&spinlock->lock rather than &spinlock (even though they are
equivalent).
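Why the two expressions are equivalent, as a small userspace sketch: C
guarantees that a pointer to a struct, suitably converted, points to its
first member, so the addresses match and only the pointer type differs.
spinlock_model_t and the non-atomic test_and_set_bit_model() are
stand-ins for illustration, not the ppc definitions.

```c
/* Stand-in lock type whose first (and only) member is the lock word. */
typedef struct {
    volatile unsigned long lock;
} spinlock_model_t;

/* Non-atomic model: sets bit nr and returns its previous value. */
static int test_and_set_bit_model(int nr, volatile unsigned long *addr)
{
    unsigned long mask = 1UL << nr;
    unsigned long old = *addr;

    *addr = old | mask;
    return (old & mask) != 0;
}

/* Passing &l->lock gives the callee the pointer type it expects. */
static int trylock_model(spinlock_model_t *l)
{
    return !test_and_set_bit_model(0, &l->lock);
}
```

Passing the struct pointer itself would hand over the same address with
the wrong type, which is what the compiler warns about.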
Rusty Russell [Mon, 27 May 2002 04:59:52 +0000 (21:59 -0700)]
[PATCH] semctl SUSv2 compliance
Christopher Yeoh <cyeoh@samba.org>: (Made -p1 compliant by rusty) SUSv2 semctl compliance:
The semctl call with SETVAL currently does not set sempid (at the
moment sempid is only set during a successful semop call). An
explanation from Geoff Clare of the Open Group regarding why sempid
should be set during the semctl call:
"The spec isn't very clear, but there is a statement on the semget()
page which I think justifies the assumption made by the test. It says
that upon creation, the data structure associated with each semaphore
in the set is not initialised, and that the semctl() function with
SETVAL or SETALL can be used to initialise each semaphore.
Therefore semctl() with SETVAL has to set sempid to *something*, and
since sempid contains the "process ID of the last operation", setting
it to anything other than the pid of the calling process would mean
that sempid contained misleading information. It could be argued that
setting it to zero would not be misleading, but zero cannot be the
process ID of a process, and so is not a valid value for sempid anyway."
The following patch changes semctl so when called with SETVAL
sempid is set to the pid of the calling process.
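Not the patch itself: a userspace check of the behaviour being
described, using the standard System V calls semget() and semctl(). It
assumes a Linux/glibc build, where the caller must define union semun;
sempid_after_setval() is a name made up for this sketch.

```c
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <unistd.h>

/* On Linux/glibc the application has to define this union itself. */
union semun {
    int val;
    struct semid_ds *buf;
    unsigned short *array;
};

/*
 * Creates a private one-semaphore set, initialises it with SETVAL and
 * reads sempid back with GETPID. Returns -1 if any step fails.
 */
static pid_t sempid_after_setval(void)
{
    int id = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
    union semun arg;
    pid_t pid = -1;

    if (id < 0)
        return -1;
    arg.val = 1;
    if (semctl(id, 0, SETVAL, arg) == 0)
        pid = (pid_t)semctl(id, 0, GETPID);
    semctl(id, 0, IPC_RMID);  /* always clean up the set */
    return pid;
}
```

On a kernel with the described behaviour, the value read back should
match getpid() of the caller.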
Martin Dalecki [Mon, 27 May 2002 04:20:34 +0000 (21:20 -0700)]
[PATCH] 2.5.18 IDE 71
- Rewritten Artop host chip driver by Vojtech Pavlik. His log entries are:
Cleanup whitespace.
Remove superfluous chip entries in chip table. Remove global variables to
allow more than one controller. Remove other forgotten stuff.
This is a new driver for the Artop (Acard) controllers. It's completely
untested, as I have never seen the hardware. However, I suspect it is much
less broken than the previous one ...
UDMA33 controller cannot detect 80-wire cable.
- Separate ioctl handling out from ide.c. It's big enough.
- Move atapi_read and atapi_write to the new atapi module. Fix the declaration
of those functions. The data buffer did have the void * type!
- Separate module handling code out from actual transfer handling code in to a
new module called main.c. Slowly we are at the stage where the code indeed
has to be organized logically and not just "sporadically" as was the case
before.
- Apply patch by Adam Richter for the ide-scsi.c attach method implementation.
This particular driver is still broken due to generic SCSI layer issues.
- Apply true modularization patch for qd65xx.c by Samuel Thibault. Here
are his notes about it:
Then, patch-modularize-2.[45] is a proposal for modularizing qd65xx.o. As a
single module, one can choose to insmod it before being able to do some
hdparm -p /dev/hd[a-d]. But one can't remove it while tuned, since selectproc
may be needed.
I am sorry I wasn't able to test it under 2.5 series, lacking a functioning
kernel for my test computer, but it seemed to work perfectly under 2.4
series, and patches are almost the same.
- Move PCI device id's to where they belong. Patch by Vojtech Pavlik.
- Don't use BH_Lock in ide-tape.c - somehow this driver scares me sometimes.
- Remove SCO compatibility trick in rio_fw_ioctl: riocontrol used to return
non-negative error codes; now it returns negative ones, so there is no need to
negate them in rio_fw_ioctl. Thanks to Rogier Wolff for pointing this out.
FORCE is the de-facto standard name for a prerequisite to force
recompilation, so instead of using a mix of 'dummy','FORCE' and
'FORCE_RECOMPILE' use 'FORCE' everywhere.
Also, move figuring out the path relative to the top level dir
into Rules.make, instead of calling an external script.
kbuild: Simplify rule for just building one subdir
It's possible to say "make <subdir>", to descend into that subdir
and recursively build things there. This patch provides this
facility generally without the arch Makefiles needing to duplicate
it for arch/$(ARCH)/somedir.
We now have the information which objects are being built
modular / built-in in Rules.make, so use this information instead
of passing flags to the sub makes.
If a Makefile defines neither O_TARGET nor L_TARGET, let's assume a
default of 'built-in.o'. The goal of this is, of course, to eventually
get rid of O_TARGET completely.
Pavel Machek [Fri, 24 May 2002 04:55:13 +0000 (21:55 -0700)]
[PATCH] swsusp fixes
This kills unnecessary include from ide-disk.c, kills #ifdef from
reiserfs/journal.c, makes suspend_device local as it should be,
abstains from suspending devices two times in a row (typo), and makes
sure we do not run_task_queue() while we hold spinlock.
Pavel Machek [Fri, 24 May 2002 04:44:55 +0000 (21:44 -0700)]
[PATCH] swsusp: making myself maintainer
I asked Gabor if he'd like me to maintain swsusp, and he liked that
idea [<quote>Would you please take over maintaining? I offered this in
the list a while ago anyway.</quote>].