Linus Torvalds [Tue, 18 Jun 2002 11:23:01 +0000 (04:23 -0700)]
Merge penguin.transmeta.com:/home/penguin/torvalds/repositories/kernel/md-merge
into penguin.transmeta.com:/home/penguin/torvalds/repositories/kernel/linux
Neil Brown [Tue, 18 Jun 2002 11:21:49 +0000 (04:21 -0700)]
[PATCH] md 22 of 22 - Generalise md sync threads
Previously each raid personality (well, 1 and 5) started its
own thread to do resync, but md.c had a single common thread to do
reconstruction. Apart from being untidy, this means that you cannot
have two arrays reconstructing at the same time, though you can have
two arrays resyncing at the same time.
This patch changes the personalities so they don't start the resync,
but just leave a flag to say that it is needed.
The common thread (mdrecoveryd) now just monitors things and starts a
separate per-array thread whenever resync or recovery (or both) is
needed.
When the recovery finishes, mdrecoveryd will be woken up to re-lock
the device and activate the spares or whatever.
raid1 needs to know when resync/recovery starts and ends so it can
allocate and release resources.
It allocates when a resync request for stripe 0 is received.
Previously it deallocated for resync in its own thread, and
deallocated for recovery when the spare is made active or inactive
(depending on success).
As raid1 doesn't own a thread anymore this needed to change. So to
match the "alloc on 0", md_do_resync now calls sync_request one
last time, asking to sync one block past the end. This is a signal to
release any resources.
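A minimal standalone sketch of that end-of-resync convention -- allocate on the first request, release on the one-past-the-end request (the function and resource names here are invented for illustration, not the raid1 code):

    #include <stdio.h>
    #include <stdlib.h>

    #define NR_STRIPES 1024            /* illustrative array size, in stripes */

    static void *resync_buffers;       /* stands in for raid1's per-resync resources */

    /* Called once per stripe by the per-array sync thread. */
    static int resync_request(unsigned long stripe)
    {
        if (stripe == 0)                    /* first request: allocate */
            resync_buffers = malloc(4096);

        if (stripe >= NR_STRIPES) {         /* one past the end: release */
            free(resync_buffers);
            resync_buffers = NULL;
            return 0;
        }

        /* ... sync this stripe here ... */
        return 1;
    }

    int main(void)
    {
        unsigned long s;

        for (s = 0; s < NR_STRIPES; s++)
            resync_request(s);
        resync_request(NR_STRIPES);        /* the extra "past the end" call */
        printf("resync done, resources released: %s\n",
               resync_buffers == NULL ? "yes" : "no");
        return 0;
    }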
Neil Brown [Tue, 18 Jun 2002 11:17:56 +0000 (04:17 -0700)]
[PATCH] md 21 of 22 - Improve handling of MD super blocks
1/ don't free the rdev->sb on an error -- it might be
accessed again later. Just wait for the device to be
exported.
2/ Change md_update_sb to __md_update_sb and have it
clear the sb_dirty flag.
New md_update_sb locks the device and calls __md_update_sb
if sb_dirty. This avoids any possible races around
updating the superblock.
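A minimal sketch of that locked-wrapper pattern, using a pthread mutex in place of the kernel's locking (all names here are illustrative):

    #include <pthread.h>
    #include <stdio.h>

    struct mddev {
        pthread_mutex_t reconfig_mutex;
        int sb_dirty;
    };

    /* Caller must hold the lock; writes the superblock and clears the flag. */
    static void __update_sb(struct mddev *mddev)
    {
        /* ... write the superblock out to all member disks here ... */
        mddev->sb_dirty = 0;
    }

    /* Public entry point: take the lock, and only write if anything changed. */
    static void update_sb(struct mddev *mddev)
    {
        pthread_mutex_lock(&mddev->reconfig_mutex);
        if (mddev->sb_dirty)
            __update_sb(mddev);
        pthread_mutex_unlock(&mddev->reconfig_mutex);
    }

    int main(void)
    {
        struct mddev m = { PTHREAD_MUTEX_INITIALIZER, 1 };

        update_sb(&m);          /* writes, because sb_dirty was set */
        update_sb(&m);          /* no-op, flag already clear */
        printf("sb_dirty = %d\n", m.sb_dirty);
        return 0;
    }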
Neil Brown [Tue, 18 Jun 2002 11:17:48 +0000 (04:17 -0700)]
[PATCH] md 20 of 22 - Provide SMP safe locking for all_mddevs list.
Provide SMP safe locking for all_mddevs list.
The all_mddevs_lock is added to protect all_mddevs and mddev_map.
ITERATE_MDDEV is moved to md.c (it isn't needed elsewhere) and enhanced
to take the lock appropriately and always have a refcount on the object
that is given to the body of the loop.
mddev_find is changed so that the structure is allocated outside a lock,
but test-and-set is done inside the lock.
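The allocate-outside-the-lock, test-and-set-inside shape of mddev_find, sketched as standalone C with a pthread mutex and a small table standing in for mddev_map (illustrative only, not the md.c code):

    #include <pthread.h>
    #include <stdlib.h>

    #define MAX_MINORS 16

    struct mddev { int minor; int refcount; };

    static struct mddev *mddev_map[MAX_MINORS];
    static pthread_mutex_t all_mddevs_lock = PTHREAD_MUTEX_INITIALIZER;

    static struct mddev *mddev_find(int minor)
    {
        struct mddev *new = NULL;

        for (;;) {
            pthread_mutex_lock(&all_mddevs_lock);
            if (mddev_map[minor]) {
                /* it already existed, or someone else won the race */
                struct mddev *mddev = mddev_map[minor];
                mddev->refcount++;
                pthread_mutex_unlock(&all_mddevs_lock);
                free(new);
                return mddev;
            }
            if (new) {
                /* our preallocated structure becomes the real one */
                mddev_map[minor] = new;
                pthread_mutex_unlock(&all_mddevs_lock);
                return new;
            }
            pthread_mutex_unlock(&all_mddevs_lock);

            /* allocate with no locks held, then retry the test-and-set */
            new = calloc(1, sizeof(*new));
            if (!new)
                return NULL;
            new->minor = minor;
            new->refcount = 1;
        }
    }

    int main(void)
    {
        struct mddev *a = mddev_find(3);
        struct mddev *b = mddev_find(3);
        return !(a == b && a->refcount == 2);
    }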
Neil Brown [Tue, 18 Jun 2002 11:17:40 +0000 (04:17 -0700)]
[PATCH] md 19 of 22 - Improve serialisation of md syncing
If two md arrays which share real devices (i.e. they each own a partition
on some device) need to sync/reconstruct at the same time, it is much
more efficient to have one wait while the other completes.
The current code uses interruptible_sleep_on, which isn't SMP safe (without the BKL).
This patch re-does this code to make it safer. Even if two start simultaneously,
one will reliably get priority, and the other won't wait forever.
Neil Brown [Tue, 18 Jun 2002 11:17:26 +0000 (04:17 -0700)]
[PATCH] md 17 of 22 - Strengthen the locking of mddev.
Strengthen the locking of mddev.
mddev is only ever locked in md.c, so we move {,un}lock_mddev
out of the header and into md.c, and rename to mddev_{,un}lock
for consistency with mddev_{get,put,find}.
When building arrays (typically at boot time) we now lock, and unlock
as it is the "right" thing to do. The lock should never fail.
When generating /proc/mdstat, we lock each array before inspecting it.
In md_ioctl, we lock the mddev early and unlock at the end, rather than
locking in two different places.
In md_open we make sure we can get a lock before completing the open. This
ensures that we sync with do_md_stop properly.
In md_do_recovery, we lock each mddev before checking its status.
md_do_recovery must unlock while recovery happens, and a do_md_stop at this
point will deadlock when md_do_recovery tries to regain the lock. This will be
fixed in a later patch.
Neil Brown [Tue, 18 Jun 2002 11:17:12 +0000 (04:17 -0700)]
[PATCH] md 15 of 22 - Get rid of kdev_to_mddev
Only two users of kdev_to_mddev remain, md_release and
md_queue_proc.
For md_release we can store the mddev in the md_inode
at md_open time so we can find it easily.
For md_queue_proc, we use mddev_find because we only have the
device number to work with. Hopefully the ->queue function
will get more arguments one day...
Neil Brown [Tue, 18 Jun 2002 11:17:05 +0000 (04:17 -0700)]
[PATCH] md 14 of 22 - Second step to tidying mddev refcounts and locking
This patch gets md_open to use mddev_find instead of kdev_to_mddev, thus
creating the mddev if necessary.
This guarantees that md_release will be able to find an mddev to
mddev_put.
Now that we are certain of getting the refcount right at open/close time,
we don't need the "countdev" stuff. If START_ARRAY happens to start an
array other than the one that is currently opened, it won't confuse
things at all.
Neil Brown [Tue, 18 Jun 2002 11:16:59 +0000 (04:16 -0700)]
[PATCH] md 13 of 22 - First step to tidying mddev refcounting and locking.
First step to tidying mddev refcounting and locking.
This patch introduces
mddev_get which incs the refcount on an mddev
mddev_put which decs it and, if it becomes unused, frees it
mddev_find which finds or allocates an mddev for a given minor
This is mostly the old alloc_mddev
free_mddev no longer actually frees it. It just disconnects all drives
so that mddev_put will do the free.
Now the test for "does an mddev exist" is not "mddev != NULL"
but involves checking if the mddev has disks or a superblock
attached.
This makes the semantics of do_md_stop a bit cleaner. Previously
if do_md_stop succeeded for a real stop (not a read-only stop) then
you didn't have to unlock the mddev, otherwise you did. Now
you always unlock the mddev after do_md_stop.
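A standalone sketch of the get/put/free-on-last-reference rule described above; the "has disks or a superblock attached" test is reduced to a single counter, and all names are illustrative:

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    struct mddev {
        int refcount;
        int nr_disks;          /* stands in for "has disks or a superblock" */
    };

    static pthread_mutex_t all_mddevs_lock = PTHREAD_MUTEX_INITIALIZER;

    static struct mddev *mddev_get(struct mddev *mddev)
    {
        pthread_mutex_lock(&all_mddevs_lock);
        mddev->refcount++;
        pthread_mutex_unlock(&all_mddevs_lock);
        return mddev;
    }

    /* Drop a reference; free only when nothing uses it and it is "empty". */
    static void mddev_put(struct mddev *mddev)
    {
        pthread_mutex_lock(&all_mddevs_lock);
        if (--mddev->refcount == 0 && mddev->nr_disks == 0) {
            pthread_mutex_unlock(&all_mddevs_lock);
            free(mddev);
            return;
        }
        pthread_mutex_unlock(&all_mddevs_lock);
    }

    int main(void)
    {
        struct mddev *m = calloc(1, sizeof(*m));

        m->refcount = 1;
        mddev_get(m);           /* e.g. an open() */
        mddev_put(m);           /* matching release() */
        printf("still alive, refcount=%d\n", m->refcount);
        mddev_put(m);           /* last reference and no disks: freed */
        return 0;
    }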
Neil Brown [Tue, 18 Jun 2002 11:16:49 +0000 (04:16 -0700)]
[PATCH] md 12 of 22 - Remove "data" from dev_mapping and tidy up
The mapping from minor number to mddev structure allows for a
'data' that is never used. This patch removes that and explicitly
inlines some inline functions that become trivial.
mddev_map also becomes completely local to md.c
Neil Brown [Tue, 18 Jun 2002 11:16:28 +0000 (04:16 -0700)]
[PATCH] md 10 of 22 - Remove nb_dev from mddev_s
The nb_dev field is not needed.
Most uses are the test if it is zero or not, and they can be replaced
by tests on the emptiness of the disks list.
Other uses are for iterating through devices in numerical order and
it makes the code clearer (IMO) to unroll the devices into an array first
(which has to be done at some stage anyway) and then walk that array.
Neil Brown [Tue, 18 Jun 2002 11:16:14 +0000 (04:16 -0700)]
[PATCH] md 8 of 22 - Discard md_make_request in favour of per-personality make_request functions.
As we now have per-device queues, we don't need a common make_request
function that dispatches, we can dispatch directly.
Each *_make_request function is changed to take a request_queue_t
from which it extracts the mddev that it needs, and to deduce the
"rw" flag directly from the bio.
Neil Brown [Tue, 18 Jun 2002 11:15:33 +0000 (04:15 -0700)]
[PATCH] md 2 of 22 - Make device plugging work for md/raid5
We embed a request_queue_t in the mddev structure and so
have a separate one for each mddev.
This is used for plugging (in raid5).
Given this embedded request_queue_t, md_make_request no longer
needs to map from device number to mddev, but can map from
the queue to the mddev instead.
Neil Brown [Tue, 18 Jun 2002 10:38:57 +0000 (03:38 -0700)]
[PATCH] Umem 2 of 2 - Make device plugging work for umem
We embed a request_queue_t in the card structure and so have a separate
one for each card. This is used for plugging.
Given this embedded request_queue_t, mm_make_request no longer needs to
map from device number to card, but can map from the queue to the card
instead.
Oliver Neukum [Tue, 18 Jun 2002 06:44:22 +0000 (23:44 -0700)]
[PATCH] make kaweth use the sk_buff directly on tx
This change set against 2.5 will make kaweth put its private header
into the sk_buff directly if possible, or else allocate a temporary sk_buff.
It saves memory and usually a copy.
David Brownell [Tue, 18 Jun 2002 06:43:07 +0000 (23:43 -0700)]
[PATCH] ohci misc fixes
This patch applies on top of the other two (for init problems):
- Uses time to balance interrupt load, not number of transfers.
One 8-byte lowspeed transfer costs as much as ten same-size
at full speed ... previous code could overcommit branches.
- Shrinks the code a smidgeon, mostly in the submit path.
- Updates comments, removes some magic numbers, etc.
- Adds some debug dump routines for EDs and TDs, which can
be rather helpful when debugging!
- Lays groundwork for a "shadow" <linux/list.h> TD queue
(but doesn't enlarge the TD or ED on 32-bit CPUs)
I'm not sure anyone would have run into that time/balance
issue, though some folk have talked about hooking up lots
of lowspeed devices and that would have made trouble.
Matthew Dharm [Tue, 18 Jun 2002 06:35:32 +0000 (23:35 -0700)]
[PATCH] USB storage: change atomic_t to bitfield, consolidate #defines
This patch changes from using an atomic_t with two states to using a
bitfield to determine if a device is attached. It also moves some common
#defines into a common header file.
Seems someone else was faster at fixing the hardsects problem in the xpram
driver. We continued with my new version of the xpram driver. Arnd
Bergmann found some bugs and added support for driverfs.
1) Replace is_read_only with bdev_read_only. The last user of is_read_only
is gone...
2) Remove alloc & free of the label array in dasd_genhd. This is needed for
the label array extension but this is a patch of its own.
3) Maintain the old behaviour of /proc/dasd/devices. It is possible again
to use "add <devno>" instead of "add device <devno>" or "add range=<devno>".
1) Add __s390__ to the list of architectures that use unsigned int as
type for autofs_wqt_t. __s390__ is defined for both 31-bit and 64-bit
linux for s/390. Both architectures are fine with unsigned int since
sizeof(unsigned int) == sizeof(unsigned long) for 31 bit s/390.
2) Remove early initialization call ccwcache_init(). It doesn't exist
anymore.
3) Remove special case for irq_stat. We moved the irq_stat structure out
of the lowcore.
4) Replace acquire_console_sem with down_trylock & return to avoid an
endless trap loop if console_unblank is called from interrupt context
and the console semaphore is taken.
Some recent changes in the s390 architecture files:
1) Makefile fixes.
2) Add missing include statements.
3) Convert all parameters in the 31 bit emulation wrapper of sys_futex.
4) Remove semicolons after 'fi' in Config.in
5) Fix scheduler defines in system.h
6) Simplifications in qdio.c
Andi Kleen [Tue, 18 Jun 2002 04:10:20 +0000 (21:10 -0700)]
[PATCH] Fix incorrect inline assembly in RAID-5
Pure luck that this ever worked at all. The optimized assembly for XOR
in RAID-5 did clobber registers, but declared them as read-only.
I'm pretty sure that at least the 4 disk and possibly the 5 disk cases
did corrupt callee saved registers. The others probably got away because
they were always used in their own functions (and only clobbered caller saved
registers) and were only called via pointers, preventing inlining.
Some of the replacements are a bit complicated because the functions
exceed gcc's 10 asm argument limit when each input/output register needs
two arguments. Works around that by saving/restoring some of the registers
manually.
I wasn't able to test it in real-life because I don't have a RAID
setup and the RAID code didn't compile since several 2.5 releases.
I wrote some test programs that did test the XOR and they showed
no regression.
Also aligns the XMM save area to 16 bytes to save a few cycles.
Stephen Rothwell [Tue, 18 Jun 2002 04:00:37 +0000 (21:00 -0700)]
[PATCH] make file leases work as they should
This patch fixes the following problems in the file lease:
when there are multiple shared leases on a file, all the
lease holders get notified when someone opens the
file for writing (used to be only the first).
when a nonblocking open breaks a lease, it will time out
as it should (used to never time out).
This should make the leases code more usable (hopefully).
Stephen Rothwell [Tue, 18 Jun 2002 03:55:28 +0000 (20:55 -0700)]
[PATCH] remove getname32
arch/ppc64/kernel/sys_ppc32.c has a getname32 function. The only
difference between it and getname() is that it calls do_getname32()
instead of do_getname() (see fs/namei.c). The difference between
do_getname and do_getname32 is that the former checks to make sure that
the pointer it is passed is less than TASK_SIZE and restricts the length
copied to the lesser of PATH_MAX and (TASK_SIZE - pointer).
do_getname32 uses PAGE_SIZE instead of PATH_MAX.
Anton Blanchard says it is OK to remove getname32.
arch/ia64/ia32/sys_ia32.c defined a getname32(), but nothing used it.
Ingo Molnar [Tue, 18 Jun 2002 19:25:17 +0000 (21:25 +0200)]
sched_yield() is misbehaving.
the current implementation does the following to 'give up' the CPU:
- it decreases its priority by 1 until it reaches the lowest level
- it queues the task to the end of the priority queue
this scheme works fine in most cases, but if sched_yield()-active tasks
are mixed with CPU-using processes then it's quite likely that the
CPU-using process is in the expired array. In that case the yield()-ing
process only requeues itself in the active array - a true context-switch
to the expired process will only occur once the timeslice of the
yield()-ing process has expired: in ~150 msecs. This leads to the
yield()-ing and CPU-using processes using up roughly the same amount of
CPU time, which is arguably deficient.
I've fixed this problem by extending sched_yield() the following way:
+ * There are three levels of how a yielding task will give up
+ * the current CPU:
+ *
+ * #1 - it decreases its priority by one. This priority loss is
+ * temporary, it's recovered once the current timeslice
+ * expires.
+ *
+ * #2 - once it has reached the lowest priority level,
+ * it will give up timeslices one by one. (We do not
+ * want to give them up all at once, it's gradual,
+ * to protect the casual yield()er.)
+ *
+ * #3 - once all timeslices are gone we put the process into
+ * the expired array.
+ *
+ * (special rule: RT tasks do not lose any priority, they just
+ * roundrobin on their current priority level.)
+ */
Paul Menage [Tue, 18 Jun 2002 03:46:11 +0000 (20:46 -0700)]
[PATCH] Push BKL into ->permission() calls
This patch (against 2.5.22) removes the BKL from around the call
to i_op->permission() in fs/namei.c, and pushes the BKL into those
filesystems that have permission() methods that require it.
Kai Mäkisara [Tue, 18 Jun 2002 03:44:18 +0000 (20:44 -0700)]
[PATCH] 2.5.22 SCSI tape buffering changes
This contains the following changes to the SCSI tape driver:
- one buffer is used for each tape (no buffer pool)
- buffers allocated when needed and freed when device closed
- common code from read and write moved to a function
- default maximum number of scatter/gather segments increased to 64
- tape status set to "no tape" after successful unload
Andi Kleen [Tue, 18 Jun 2002 03:27:43 +0000 (20:27 -0700)]
[PATCH] poll/select fast path
This patch streamlines poll and select by adding fast paths for a
small number of descriptors passed. The majority of polls/selects
seem to be of this nature. The main saving comes from not allocating
two pages for wait queue and table, but from using stack allocation
(up to 256 bytes) when only a few descriptors are needed. This makes
it as fast again as 2.0 and even a bit faster, because the wait queue
page allocation is avoided too (except when the drivers overflow it).
select also skips a lot faster over big holes and avoids the separate
pass of determining the max. number of descriptors in the bitmap.
A typical Linux system saves a considerable amount of unswappable memory
with this patch, because it usually has 10+ daemons hanging around in poll or
select, each with two pages allocated for data and wait queue.
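A minimal userspace sketch of the same stack-buffer-first idea; the threshold and helper name are made up, and the real kernel code works on its internal poll tables rather than on pollfd arrays like this:

    #include <poll.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define FAST_POLL_FDS 8                       /* illustrative threshold */

    static int poll_small_fast(struct pollfd *ufds, unsigned int nfds, int timeout)
    {
        struct pollfd stack_fds[FAST_POLL_FDS];   /* fast path: no allocation */
        struct pollfd *fds = stack_fds;
        int ret;

        if (nfds > FAST_POLL_FDS) {               /* slow path: fall back to the heap */
            fds = malloc(nfds * sizeof(*fds));
            if (!fds)
                return -1;
        }
        memcpy(fds, ufds, nfds * sizeof(*fds));

        ret = poll(fds, nfds, timeout);

        memcpy(ufds, fds, nfds * sizeof(*fds));   /* hand the revents back */
        if (fds != stack_fds)
            free(fds);
        return ret;
    }

    int main(void)
    {
        int p[2];
        struct pollfd pfd;

        if (pipe(p))
            return 1;
        write(p[1], "x", 1);
        pfd.fd = p[0];
        pfd.events = POLLIN;
        return poll_small_fast(&pfd, 1, 0) == 1 ? 0 : 1;
    }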
Andi Kleen [Tue, 18 Jun 2002 03:27:16 +0000 (20:27 -0700)]
[PATCH] x86-64 merge
x86_64 core updates.
- Make it compile again (switch_to macros etc., add dummy suspend.h)
- reenable strength reduce optimization
- Fix ramdisk (patch from Mikael Pettersson)
- Some merges from i386
- Reimplement lazy iobitmap allocation. I reimplemented it based
on bcrl's idea.
- Fix IPC 32bit emulation to actually work and move into own file
- New fixed mtrr.c from DaveJ ported from 2.4 and reenable it.
- Move tlbstate into PDA.
- Add some changes that got lost during the last merge.
- new memset that seems to actually work.
- Align signal handler stack frames to 16 bytes.
- Some more minor bugfixes.
Andrew Morton [Tue, 18 Jun 2002 03:21:34 +0000 (20:21 -0700)]
[PATCH] msync(bad address) should return -ENOMEM
Heaven knows why, but that's what the Open Group says, and returning
-EFAULT causes 2.5 to fail one of the Linux Test Project tests.
[ENOMEM]
The addresses in the range starting at addr and continuing
for len bytes are outside the range allowed for the address
space of a process or specify one or more pages that are not
mapped.
Andrew Morton [Tue, 18 Jun 2002 03:21:21 +0000 (20:21 -0700)]
[PATCH] Reduce the radix tree nodes to 64 slots
Reduce the radix tree nodes from 128 slots to 64.
- The main reason for this is that on 64-bit/4k page machines, the
slab allocator has decided that radix tree nodes will require an
order-1 allocation. Shrinking the nodes to 64 slots pulls that back
to an order-0 allocation.
- On x86 we get fifteen 64-slot nodes per page rather than seven
128-slot nodes, for a modest memory saving.
- Halving the node size will approximately halve the memory use in
the worrisome really-large, really-sparse file case.
Of course, the downside is longer tree walks. Each level of the tree
covers six bits of pagecache index rather than seven. As ever, I am
guided by Anton's profiling on the 12- and 32-way PPC boxes.
radix_tree_lookup() is currently down in the noise floor.
Now, there is one special case: one file which is really big and which
is accessed in a random manner and which is accessed very heavily: the
blockdev mapping. We _are_ showing some locking cost in
__find_get_block (used to be __get_hash_table) and in its call to
find_get_page(). I have a bunch of patches which introduce a generic
per-cpu buffer LRU, and which remove ext2's private bitmap buffer LRUs.
I expect these patches to wipe the blockdev mapping lookup lock contention
off the map, but I'm awaiting test results from Anton before deciding
whether those patches are worth submitting.
Andrew Morton [Tue, 18 Jun 2002 03:20:53 +0000 (20:20 -0700)]
[PATCH] allow GFP_NOFS allocators to perform swapcache writeout
One weakness which was introduced when the buffer LRU went away was
that GFP_NOFS allocations became equivalent to GFP_NOIO, because all
writeback goes via writepage/writepages, which requires entry into the
filesystem.
However now that swapout no longer calls bmap(), we can honour
GFP_NOFS's intent for swapcache pages. So if the allocation request
specifies __GFP_IO and !__GFP_FS, we can wait on swapcache pages and we
can perform swapcache writeout.
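The resulting decision is essentially a flag test of this shape (the flag values below are invented for the sketch; only the __GFP_IO/__GFP_FS relationship matters):

    #include <stdio.h>

    /* Illustrative flag values and combinations -- not the kernel's definitions. */
    #define __GFP_IO   0x40
    #define __GFP_FS   0x80
    #define GFP_NOFS   (__GFP_IO)
    #define GFP_KERNEL (__GFP_IO | __GFP_FS)

    /* Swapcache writeout never re-enters a filesystem, so __GFP_IO alone is enough. */
    static int may_write_swapcache(unsigned int gfp_mask)
    {
        return (gfp_mask & __GFP_IO) != 0;
    }

    /* File-backed writeback calls into the filesystem, so it also needs __GFP_FS. */
    static int may_write_filecache(unsigned int gfp_mask)
    {
        return (gfp_mask & __GFP_FS) != 0;
    }

    int main(void)
    {
        printf("GFP_NOFS:   swapcache=%d filecache=%d\n",
               may_write_swapcache(GFP_NOFS), may_write_filecache(GFP_NOFS));
        printf("GFP_KERNEL: swapcache=%d filecache=%d\n",
               may_write_swapcache(GFP_KERNEL), may_write_filecache(GFP_KERNEL));
        return 0;
    }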
Andrew Morton [Tue, 18 Jun 2002 03:20:24 +0000 (20:20 -0700)]
[PATCH] take bio.h out of highmem.h
highmem.h includes bio.h, so just about every compilation unit in the
kernel gets to process bio.h.
The patch moves the BIO-related functions out of highmem.h and into
bio-related headers. The nested include is removed and all files which
need to include bio.h now do so.
Andrew Morton [Tue, 18 Jun 2002 03:19:54 +0000 (20:19 -0700)]
[PATCH] ext3: clean up journal_try_to_free_buffers()
Clean up ext3's journal_try_to_free_buffers(). Now that the
releasepage() a_op is non-blocking and need not perform I/O, this
function becomes much simpler.
Andrew Morton [Tue, 18 Jun 2002 03:19:26 +0000 (20:19 -0700)]
[PATCH] fix loop driver for large BIOs
Fix the loop driver for loop-on-blockdev setups.
When presented with a multipage BIO, loop_make_request overindexes the
first page and corrupts kernel memory. Fix it to walk the individual
pages.
BTW, I suspect the IV handling in loop may be incorrect for multipage
BIOs. Should we not be recalculating the IV for each page in the BIOs,
or incrementing the offset by the size of the preceding pages, or such?
Andrew Morton [Tue, 18 Jun 2002 03:19:13 +0000 (20:19 -0700)]
[PATCH] direct-to-BIO I/O for swapcache pages
This patch changes the swap I/O handling. The objectives are:
- Remove swap special-casing
- Stop using buffer_heads -> direct-to-BIO
- Make S_ISREG swapfiles more robust.
I've spent quite some time with swap. The first patches converted swap to
use block_read/write_full_page(). These were discarded because they are
still using buffer_heads, and a reasonable amount of otherwise unnecessary
infrastructure had to be added to the swap code just to make it look like a
regular fs. So this code just has a custom direct-to-BIO path for swap,
which seems to be the most comfortable approach.
A significant thing here is the introduction of "swap extents". A swap
extent is a simple data structure which maps a range of swap pages onto a
range of disk sectors. It is simply:
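(A sketch of the shape; the exact field names and types in the 2.5 source may differ.)

    struct swap_extent {
        struct list_head list;          /* extents chained per swap area, in page order */
        unsigned long start_page;       /* first page offset within the swap area */
        unsigned long nr_pages;         /* length of the extent, in PAGE_SIZE pages */
        unsigned long start_block;      /* first on-disk block, in PAGE_SIZE units */
    };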
At swapon time (for an S_ISREG swapfile), each block in the file is bmapped()
and the block numbers are parsed to generate the device's swap extent list.
This extent list is quite compact - a 512 megabyte swapfile generates about
130 nodes in the list. That's about 4 kbytes of storage. The conversion
from filesystem blocksize blocks into PAGE_SIZE blocks is performed at swapon
time.
At swapon time (for an S_ISBLK swapfile), we install a single swap extent
which describes the entire device.
The advantages of the swap extents are:
1: We never have to run bmap() (ie: read from disk) at swapout time. So
S_ISREG swapfiles are now just as robust as S_ISBLK swapfiles.
2: All the differences between S_ISBLK swapfiles and S_ISREG swapfiles are
handled at swapon time. During normal operation, we just don't care.
Both types of swapfiles are handled the same way.
3: The extent lists always operate in PAGE_SIZE units. So the problems of
going from fs blocksize to PAGE_SIZE are handled at swapon time and normal
operating code doesn't need to care.
4: Because we don't have to fiddle with different blocksizes, we can go
direct-to-BIO for swap_readpage() and swap_writepage(). This introduces
the kernel-wide invariant "anonymous pages never have buffers attached",
which cleans some things up nicely. All those block_flushpage() calls in
the swap code simply go away.
5: The kernel no longer has to allocate both buffer_heads and BIOs to
perform swapout. Just a BIO.
6: It permits us to perform swapcache writeout and throttling for
GFP_NOFS allocations (a later patch).
(Well, there is one sort of anon page which can have buffers: the pages which
are cast adrift in truncate_complete_page() because do_invalidatepage()
failed. But these pages are never added to swapcache, and nobody except the
VM LRU has to deal with them).
The swapfile parser in setup_swap_extents() will attempt to extract the
largest possible number of PAGE_SIZE-sized and PAGE_SIZE-aligned chunks of
disk from the S_ISREG swapfile. Any stray blocks (due to file
discontiguities) are simply discarded - we never swap to those.
If an S_ISREG swapfile is found to have any unmapped blocks (file holes) then
the swapon attempt will fail.
The extent list can be quite large (hundreds of nodes for a gigabyte S_ISREG
swapfile). It needs to be consulted once for each page within
swap_readpage() and swap_writepage(). Hence there is a risk that we could
blow significant amounts of CPU walking that list. However I have
implemented a "where we found the last block" cache, which is used as the
starting point for the next search. Empirical testing indicates that this is
wildly effective - the average length of the list walk in map_swap_page() is
0.3 iterations per page, with a 130-element list.
It _could_ be that some workloads do start suffering long walks in that code,
and perhaps a tree would be needed there. But I doubt that, and if this is
happening then it means that we're seeking all over the disk for swap I/O,
and the list walk is the least of our problems.
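A standalone sketch of that "where we found the last block" cache over the extent list; an array is used here instead of the kernel's linked list, purely for illustration:

    #include <stdio.h>

    struct extent { unsigned long start_page, nr_pages, start_block; };

    /* A toy extent table: three runs of pages mapped to discontiguous blocks. */
    static const struct extent extents[] = {
        { 0,   100, 5000 },
        { 100, 50,  9000 },
        { 150, 200, 2000 },
    };
    #define NR_EXTENTS (sizeof(extents) / sizeof(extents[0]))

    static unsigned int last_hit;      /* "where we found the last block" cache */

    static unsigned long map_page_to_block(unsigned long page)
    {
        unsigned int i = last_hit;
        unsigned int walked;

        for (walked = 0; walked < NR_EXTENTS; walked++) {
            const struct extent *se = &extents[i];

            if (page >= se->start_page && page < se->start_page + se->nr_pages) {
                last_hit = i;          /* sequential I/O hits this extent next time */
                return se->start_block + (page - se->start_page);
            }
            i = (i + 1) % NR_EXTENTS;
        }
        return 0;       /* not mapped (a hole; swapon would have rejected it) */
    }

    int main(void)
    {
        /* Sequential pages hit the cached extent almost every time. */
        printf("page 10  -> block %lu\n", map_page_to_block(10));
        printf("page 11  -> block %lu\n", map_page_to_block(11));
        printf("page 160 -> block %lu\n", map_page_to_block(160));
        return 0;
    }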
rw_swap_page_nolock() now takes a page*, not a kernel virtual address. It
has been renamed to rw_swap_page_sync() and it takes care of locking and
unlocking the page itself. Which is all a much better interface.
Support for type 0 swap has been removed. Current versions of mkswap(8) seem
to never produce v0 swap unless you explicitly ask for it, so I doubt if this
will affect anyone. If you _do_ have a type 0 swapfile, swapon will fail and
the message
version 0 swap is no longer supported. Use mkswap -v1 /dev/sdb3
is printed. We can remove that code for real later on. Really, all that
swapfile header parsing should be pushed out to userspace.
This code always uses single-page BIOs for swapin and swapout. I have an
additional patch which converts swap to use mpage_writepages(), so we swap
out in 16-page BIOs. It works fine, but I don't intend to submit that.
There just doesn't seem to be any significant advantage to it.
I can't see anything in sys_swapon()/sys_swapoff() which needs the
lock_kernel() calls, so I deleted them.
If you ftruncate an S_ISREG swapfile to a shorter size while it is in use,
subsequent swapout will destroy the filesystem. It was always thus, but it
is much, much easier to do now. Not really a kernel problem, but swapon(8)
should not be allowing the kernel to use swapfiles which are modifiable by
unprivileged users.
Andrew Morton [Tue, 18 Jun 2002 03:18:58 +0000 (20:18 -0700)]
[PATCH] leave swapcache pages unlocked during writeout
Convert swap pages so that they are PageWriteback and !PageLocked while
under writeout, like all other block-backed pages. (Network
filesystems aren't doing this yet - their pages are still locked while
under writeout)
Andrew Morton [Tue, 18 Jun 2002 03:18:44 +0000 (20:18 -0700)]
[PATCH] mark_buffer_dirty_inode() speedup
buffer_insert_list() is showing up on Anton's graphs. It'll be via
ext2's mark_buffer_dirty_inode() against indirect blocks. If the
buffer is already on an inode queue, we know that it is on the correct
inode's queue so we don't need to re-add it.
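A minimal sketch of that short-circuit, with a toy list implementation standing in for the kernel's (names are illustrative):

    #include <stdio.h>

    struct list_head { struct list_head *next, *prev; };

    #define LIST_HEAD_INIT(name) { &(name), &(name) }

    static int list_empty(const struct list_head *h) { return h->next == h; }

    static void list_add(struct list_head *new, struct list_head *head)
    {
        new->next = head->next;
        new->prev = head;
        head->next->prev = new;
        head->next = new;
    }

    struct buffer_head { struct list_head b_inode_buffers; };

    static struct list_head inode_queue = LIST_HEAD_INIT(inode_queue);

    static void mark_buffer_dirty_inode(struct buffer_head *bh)
    {
        /* If it's already on an inode queue it can only be this inode's,
         * so skip the re-insertion (and the locking it would need) entirely. */
        if (!list_empty(&bh->b_inode_buffers))
            return;
        list_add(&bh->b_inode_buffers, &inode_queue);
    }

    int main(void)
    {
        struct buffer_head bh = { LIST_HEAD_INIT(bh.b_inode_buffers) };

        mark_buffer_dirty_inode(&bh);   /* queued */
        mark_buffer_dirty_inode(&bh);   /* second call is a no-op */
        printf("queued: %d\n", !list_empty(&bh.b_inode_buffers));
        return 0;
    }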
Andrew Morton [Tue, 18 Jun 2002 03:18:30 +0000 (20:18 -0700)]
[PATCH] go back to 256 requests per queue
The request queue was increased from 256 slots to 512 in 2.5.20. The
throughput of `dbench 128' on Randy's 384 megabyte machine fell 40%.
We do need to understand why that happened, and what we can learn from
it. But in the meanwhile I'd suggest that we go back to 256 slots so
that this known problem doesn't impact people's evaluation and tuning
of 2.5 performance.
Andrew Morton [Tue, 18 Jun 2002 03:18:02 +0000 (20:18 -0700)]
[PATCH] grab_cache_page_nowait deadlock fix
- If grab_cache_page_nowait() is to be called while holding a lock on
a different page, it must perform memory allocations with GFP_NOFS.
Otherwise it could come back onto the locked page (if it's dirty) and
deadlock.
Also tidy this function up a bit - the checks in there were overly
paranoid.
- In a few places, look to see if we can avoid a buslocked cycle
and dirtying of a cacheline.
Andrew Morton [Tue, 18 Jun 2002 03:17:34 +0000 (20:17 -0700)]
[PATCH] ext3 corruption fix
Stephen and Neil Brown recently worked this out. It's a
rare situation which only affects data=journal mode.
Fix problem in data=journal mode where writeback could be left pending on a
journaled, deleted disk block. If that block then gets reallocated, we can
end up with an alias in which the old data can be written back to disk over
the new. Thanks to Neil Brown for spotting this and coming up with the
initial fix.
These are described in Documentation/filesystems/proc.txt. They are
basically the traditional knobs which we've always had...
We are accreting a ton of obsolete sysctl numbers under /proc/sys/vm/.
I didn't recycle these - just mark them unused and remove the obsolete
documentation.