Andrew Morton [Fri, 22 Nov 2002 03:32:45 +0000 (19:32 -0800)]
[PATCH] no-buffer-head ext2 option
Implements a new set of block address_space_operations which will never
attach buffer_heads to file pagecache. These can be turned on for ext2
with the `nobh' mount option.
During write-intensive testing on a 7G machine, total buffer_head
storage remained below 0.3 megabytes. And those buffer_heads are
against ZONE_NORMAL pagecache and will be reclaimed by ZONE_NORMAL
memory pressure.
This work is, of course, aimed specifically at the huge highmem machines.
Possibly it obsoletes the buffer_heads_over_limit stuff (which doesn't
work terribly well), but that code is simple, and will provide relief
for other filesystems.
It should be noted that the nobh_prepare_write() function and the
PageMappedToDisk() infrastructure are what is needed to solve the
problem of user data corruption when the filesystem which backs a
sparse MAP_SHARED mapping runs out of space. We can use this code in
filemap_nopage() to ensure that all mapped pages have space allocated
on-disk. Deliver SIGBUS on ENOSPC.
This will require a new address_space op, I expect.
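As a rough illustration, a filesystem might wire the nobh helpers into its
address_space_operations along the lines below. Only nobh_prepare_write() is
named above; the companion nobh_commit_write() and the exact ext2 entry
points shown here are assumptions for illustration, not the literal patch.

    static int ext2_nobh_prepare_write(struct file *file, struct page *page,
                                       unsigned from, unsigned to)
    {
            /* assumed signature: hand the fs's get_block to the generic helper */
            return nobh_prepare_write(page, from, to, ext2_get_block);
    }

    struct address_space_operations ext2_nobh_aops = {
            .readpage       = ext2_readpage,
            .writepage      = ext2_writepage,
            .sync_page      = block_sync_page,
            .prepare_write  = ext2_nobh_prepare_write,
            .commit_write   = nobh_commit_write,    /* assumed companion helper */
            .bmap           = ext2_bmap,
    };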
Andrew Morton [Fri, 22 Nov 2002 03:32:34 +0000 (19:32 -0800)]
[PATCH] handle zones which are full of unreclaimable pages
This patch is a general solution to the situation where a zone is full
of pinned pages.
This can come about if:
a) Someone has allocated all of ZONE_DMA for IO buffers
b) Some application is mlocking some memory and a zone ends up full
of mlocked pages (can happen on a 1G ia32 system)
c) All of ZONE_HIGHMEM is pinned in hugetlb pages (can happen on 1G
machines)
We'll currently burn 10% of CPU in kswapd when this happens, although
it is quite hard to trigger.
The algorithm is:
- If page reclaim has scanned 2 * the total number of pages in the
zone and there have been no pages freed in that zone then mark the
zone as "all unreclaimable".
- When a zone is "all unreclaimable" page reclaim almost ignores it.
We will perform a "light" scan at DEF_PRIORITY (typically 1/4096'th of
the zone, or 64 pages) and then forget about the zone.
- When a batch of pages are freed into the zone, clear its "all
unreclaimable" state and start full scanning again. The assumption
being that some state change has come about which will make reclaim
successful again.
So if a "light scan" actually frees some pages, the zone will revert to
normal state immediately.
So we're effectively putting the zone into "low power" mode, and lightly
polling it to see if something has changed.
The code works OK, but is quite hard to test - I mainly tested it by
pinning all highmem in hugetlb pages.
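A minimal sketch of that state machine follows; the field and helper names
(all_unreclaimable, pages_scanned, shrink_zone_pages) are assumptions made
for illustration, not copied from the patch.

    static void scan_zone_sketch(struct zone *zone, int priority)
    {
            /* Parked zones only get the light poll at DEF_PRIORITY. */
            if (zone->all_unreclaimable && priority != DEF_PRIORITY)
                    return;

            shrink_zone_pages(zone, priority);      /* hypothetical helper */

            /*
             * Scanned twice the zone's pages without freeing anything?
             * Put the zone into "low power" mode.
             */
            if (zone->pages_scanned > zone->present_pages * 2)
                    zone->all_unreclaimable = 1;
    }

    /* Called when a batch of pages is freed back into the zone. */
    static void zone_woke_up_sketch(struct zone *zone)
    {
            zone->all_unreclaimable = 0;
            zone->pages_scanned = 0;
    }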
Andrew Morton [Fri, 22 Nov 2002 03:32:24 +0000 (19:32 -0800)]
[PATCH] strengthen the `incremental min' logic in the page allocator
Strengthen the `incremental min' logic in the page allocator.
Currently it is allowing the allocation to succeed if the zone has
free_pages >= pages_high.
This was to avoid a lockup corner case in which all the zones were at
pages_high so reclaim wasn't doing anything, but the incremental min
refused to take pages from those zones anyway.
But we want the incremental min zone protection to work. So:
- Only allow the allocator to dip below the incremental min if it
cannot run direct reclaim.
- Change the page reclaim code so that on the direct reclaim path,
the caller can free pages beyond ->pages_high. So if the incremental
min test fails, the caller will go and free some more memory.
Eventually, the caller will have freed enough memory for the
incremental min test to pass against one of the zones.
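Sketched out, the allocator-side check might look like this; can_reclaim
stands in for "the caller is allowed to run direct reclaim" (e.g. __GFP_WAIT),
and the helper names are assumptions rather than the real allocator code.

    struct page *alloc_pages_sketch(struct zone **zones, unsigned int order,
                                    int can_reclaim)
    {
            unsigned long min = 1UL << order;
            int i;

            for (i = 0; zones[i] != NULL; i++) {
                    struct zone *z = zones[i];

                    /* Each zone we pass raises the bar for the next one. */
                    min += z->pages_low;

                    /*
                     * Respect the incremental min, unless the caller cannot
                     * reclaim for itself - then pages_high is good enough.
                     */
                    if (z->free_pages >= min ||
                        (!can_reclaim && z->free_pages >= z->pages_high))
                            return rmqueue_sketch(z, order);
            }
            return NULL;    /* go off and run direct reclaim */
    }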
Andrew Morton [Fri, 22 Nov 2002 03:32:14 +0000 (19:32 -0800)]
[PATCH] Remove mapping->vm_writeback
The vm_writeback address_space operation was designed to provide the VM
with a "clustered writeout" capability. It allowed the filesystem to
perform more intelligent writearound decisions when the VM was trying
to clean a particular page.
I can't say I ever saw any real benefit from this - not much writeout
actually happens on that path; in fact quite a lot of work has gone into
minimising it.
The default ->vm_writeback a_op which I provided wrote back the pages
in ->dirty_pages order. But there is one scenario in which this causes
problems - writing a single 4G file with mem=4G. We end up with all of
ZONE_NORMAL full of dirty pages, but all writeback effort is against
highmem pages. (Because there is about 1.5G of dirty memory total).
Net effect: the machine stalls ZONE_NORMAL allocation attempts until
the ->dirty_pages writeback advances onto ZONE_NORMAL pages.
This can be fixed most sweetly with additional radix-tree
infrastructure which will be quite complex. Later.
So this patch dumps it all, and goes back to using writepage
against individual pages as they come off the LRU.
Andrew Morton [Fri, 22 Nov 2002 03:32:03 +0000 (19:32 -0800)]
[PATCH] Fix busy-wait with writeback to large queues
blk_congestion_wait() is a utility function which various callers use
to throttle themselves to the rate at which the IO system can retire
writes.
The current implementation refuses to wait if no queues are "congested"
(>75% of requests are in flight).
That doesn't work if the queue is so huge that it can hold more than
40% (dirty_ratio) of memory. The queue simply cannot enter congestion
because the VM refuses to allow more than 40% of memory to be dirtied.
(This spin could happen with a lot of normal-sized queues too)
So this patch simply changes blk_congestion_wait() to throttle even if
there are no congested queues. It will cause the caller to sleep until
someone puts back a write request against any queue. (Nobody uses
blk_congestion_wait for read congestion).
The patch adds new state to backing_dev_info->state: a couple of flags
which indicate whether there are _any_ reads or writes in flight
against that queue. This was added to prevent blk_congestion_wait()
from taking a nap when there are no writes at all in flight.
But the "are there any reads" info could be used to defer background
writeout from pdflush, to reduce read-vs-write competition. We'll see.
This matters because the large request queues have made a fundamental change:
blocking in get_request_wait() has been the main form of VM throttling
for years, but with large queues it doesn't work any more - all
throttling happens in blk_congestion_wait().
Also, change io_schedule_timeout() to propagate the schedule_timeout()
return value. I was using that in some debug code, but it should have
been like that from day one.
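A sketch of the new behaviour; the waitqueue and the writes-in-flight test
are named hypothetically here, not taken from the patch.

    void blk_congestion_wait_sketch(int rw, long timeout)
    {
            DECLARE_WAITQUEUE(wait, current);

            /* No writes in flight at all: nothing will ever wake us. */
            if (!bdi_writes_in_flight())            /* hypothetical helper */
                    return;

            add_wait_queue(&congestion_wq, &wait);
            set_current_state(TASK_UNINTERRUPTIBLE);
            io_schedule_timeout(timeout);   /* woken when a write request is put back */
            remove_wait_queue(&congestion_wq, &wait);
    }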
Andrew Morton [Fri, 22 Nov 2002 03:31:43 +0000 (19:31 -0800)]
[PATCH] ext2/ext3 Orlov directory accounting fix
Patch from Stephen Tweedie
"In looking at the fix for the ext3 Orlov double-accounting bug, I
noticed a change to the sb->s_dir_count accounting, restoring a
missing s_dir_count++ when we allocate a new directory.
However, I can't find anywhere in the code where we decrement this
again on directory deletion, neither in ext2 nor in ext3, in 2.4 nor
in 2.5."
Andrew Morton [Fri, 22 Nov 2002 03:31:33 +0000 (19:31 -0800)]
[PATCH] remove a warning from __block_write_full_page()
There is a warning in there to detect when block_write_full_page()
attaches buffers to a blockdev page. This is a bad thing because that
page's blocks may then overlap blocks from a different address_space.
So I disallowed it.
But the message can be triggered when an application is mmapping a
blockdev MAP_SHARED. Apparently INND likes to do this.
Andrew Morton [Fri, 22 Nov 2002 03:31:03 +0000 (19:31 -0800)]
[PATCH] Expanded bad page handling
The page allocator has traditionally just gone BUG when it sees a page
in a bad state. This is usually due to hardware errors, sometimes
software errors.
I'm proposing that we not go BUG() any more, but print lots (and lots)
of diagnostic info and try to continue.
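Something along these lines - the exact fields reported and the flags
scrubbed are illustrative assumptions, not the patch itself.

    static void bad_page_sketch(const char *function, struct page *page)
    {
            printk(KERN_EMERG "Bad page state at %s\n", function);
            printk(KERN_EMERG "flags:0x%08lx mapping:%p count:%d\n",
                   page->flags, page->mapping, page_count(page));
            dump_stack();

            /* Scrub the suspicious state and try to carry on. */
            page->flags &= ~((1 << PG_lru) | (1 << PG_private) |
                             (1 << PG_locked) | (1 << PG_active) |
                             (1 << PG_dirty) | (1 << PG_writeback));
            set_page_count(page, 0);
            page->mapping = NULL;
    }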
Andrew Morton [Fri, 22 Nov 2002 03:30:41 +0000 (19:30 -0800)]
[PATCH] Add SMP barrier to ipc's grow_ary()
From Dipankar Sarma.
Before setting the ids->entries to the new array, there must be a wmb()
to make sure that the memcpyed contents of the new array are visible
before the new array becomes visible.
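Roughly, with structure names taken from the description rather than the
exact ipc source:

    static int grow_ary_sketch(struct ipc_ids *ids, int newsize, int oldsize)
    {
            struct ipc_id *new;

            new = ipc_alloc(sizeof(struct ipc_id) * newsize);
            if (new == NULL)
                    return oldsize;
            memcpy(new, ids->entries, sizeof(struct ipc_id) * oldsize);

            /*
             * Order the copy before publishing the pointer: a reader that
             * sees the new array must also see its memcpy'ed contents.
             */
            wmb();
            ids->entries = new;
            return newsize;
    }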
Andrew Morton [Fri, 22 Nov 2002 03:30:31 +0000 (19:30 -0800)]
[PATCH] radix-tree reinitialisation fix
This patch fixes a problem which was discovered by Vladimir Saveliev
<vs@namesys.com>
Radix trees have a `height' field, which defines how far the pages are
from the root of the tree. It starts out at zero and increases as the
tree's depth grows.
But it is never decreased. It cannot be decreased without a full tree
traversal.
Because radix_tree_delete() does not decrease `height', we end up
returning inodes to their filesystem's inode slab cache with a non-zero
height.
And when that inode is reused from slab for a new file, it still has a
non-zero height. So we're breaking the slab rules by not putting
objects back in a fully reinitialised state.
So the new file starts out life with whatever height the previous owner
of the inode had. Which is space- and speed-inefficient.
The most efficient place to fix this would be in destroy_inode(). But
that only fixes the problem for inodes - there are other users of radix
trees.
So fix it in radix_tree_delete(): if the tree was emptied, reset
`height' to zero.
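The fix amounts to a couple of lines at the tail of radix_tree_delete();
the root's field names assumed below are rnode and height.

    /* At the end of radix_tree_delete(): */
    if (root->rnode == NULL)
            root->height = 0;       /* tree is empty again, so say so */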
Neil Brown [Fri, 22 Nov 2002 03:20:11 +0000 (19:20 -0800)]
[PATCH] kNFSd - 2 of 2 - Change NFSv4 reply encoding to cope with multiple pages.
This allows NFSv4 responses to cover more than one page. There are
still limits though. There can be at most one 'data' response - that
is, READ, READLINK, or READDIR. For these responses, the interesting
data goes in a separate page or, for READ, a list of pages.
All responses before the 'data' response must fit in one page, and all
responses after it must also fit in one (separate) page.
Neil Brown [Fri, 22 Nov 2002 03:20:01 +0000 (19:20 -0800)]
[PATCH] kNFSd - 1 of 2 - Change NFSv4 xdr decoding to cope with separate pages.
Now that nfsd uses a list of pages for requests instead of
one large buffer, NFSv4 needs to know about this.
The most interesting part of this is that it is possible
that a section of a request, like a path name, could span
two pages, so we need to be able to kmalloc a little bit
of space to copy it into, and make sure it gets
freed later.
Trond Myklebust [Fri, 22 Nov 2002 03:08:38 +0000 (19:08 -0800)]
[PATCH] Split buffer overflow checking out of struct nfs4_compound
Here is a pre-patch in the attempt to get rid of 'struct
nfs4_compound', and the associated horrible union in 'struct
nfs4_op'.
It splits out the fields that are meant to do buffer overflow checking
and iovec adjusting on the XDR received/sent data. It moves support
for that into the dedicated structure 'xdr_stream', and the associated
functions 'xdr_reserve_space()', 'xdr_inline_decode()'.
The patch also expands out all the macros ENCODE_HEAD, ENCODE_TAIL,
ADJUST_ARGS and DECODE_HEAD, as well as most of the DECODE_TAILs.
This changes the return type of the verify and setpolicy functions from
void to int. While doing this, I've changed the values for minimum and
maximum supported frequency to be per CPU, as UltraSPARC needs this.
Stelian Pop [Fri, 22 Nov 2002 03:03:44 +0000 (19:03 -0800)]
[PATCH] meye driver update
The most important changes are:
- allocate buffers on open(), not module load;
- correct some failed allocation paths;
- use wait_event;
- C99 struct initializers;
Stelian Pop [Fri, 22 Nov 2002 03:03:32 +0000 (19:03 -0800)]
[PATCH] sonypi driver update
The most important changes are:
* add suspend/resume support to the sonypi driver (not
based on driverfs however) (Florian Lohoff);
* add "Zoom" and "Thumbphrase" buttons (Francois Gurin);
* add camera and lid events for C1XE (Kunihiko IMAI);
* add a mask parameter letting the user choose what kind
of events he wants;
* use ACPI ec_read/ec_write when available in order to
play nice when latest ACPI is enabled;
* several source cleanups.
Today we return EINVAL for fcntl with a lock with negative length.
POSIX-2001 says that the lock covers start .. start+len-1 if len >= 0
and start+len .. start-1 if len < 0.
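For illustration, the POSIX-2001 mapping from (l_start, l_len) to a byte
range can be written as follows; variable names are illustrative, and the
l_len == 0 case (which conventionally means "to end of file") is added for
completeness.

    if (l_len > 0) {
            start = l_start;
            end   = l_start + l_len - 1;
    } else if (l_len < 0) {
            start = l_start + l_len;        /* l_len is negative */
            end   = l_start - 1;
    } else {
            start = l_start;
            end   = OFFSET_MAX;             /* l_len == 0: lock to EOF */
    }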
The i_dev field is deleted and the few uses are replaced by i_sb->s_dev.
There is a single side effect: a stat on a socket now sees a nonzero
st_dev. There is nothing against that - FreeBSD has a nonzero value as
well - but there is at least one utility (fuser) that will need an
update.
Rik van Riel [Thu, 21 Nov 2002 06:47:05 +0000 (22:47 -0800)]
[PATCH] advansys.c buffer overflow
The Stanford checker found an error in advansys.c: the driver
is accessing element 6 of an array[6]. Since this is the only
place where this field is accessed it should be safe to simply
remove this line.
Neil Brown [Thu, 21 Nov 2002 06:45:18 +0000 (22:45 -0800)]
[PATCH] Only set dest addr in NFS/udp reply, not NFS/tcp.
We don't need to send an empty message to
set up the remote address when sending a tcp reply, so
we don't. Also, as the data is empty, we don't
need to set_fs.
Neil Brown [Thu, 21 Nov 2002 06:44:51 +0000 (22:44 -0800)]
[PATCH] Fix bug in svc_udp_recvfrom
Hirokazu Takahashi <taka@valinux.co.jp> noticed that
svc_udp_recvfrom would set some fields in rqstp->rq_arg
wrongly if the request was shorter than one page.
This patch makes the code in udp_recvfrom the same
as the (correct) code in tcp_recvfrom.
Neil Brown [Thu, 21 Nov 2002 06:44:40 +0000 (22:44 -0800)]
[PATCH] Fix err in size calculation for readdir response.
If the 'data' component of a readdir response is
exactly one page (the max allowed) then we currently
only send 0 bytes of it, instead of PAGE_SIZE bytes.
Neil Brown [Thu, 21 Nov 2002 06:44:30 +0000 (22:44 -0800)]
[PATCH] NFSv3 to extract large symlinks from paginated requests.
Now that requests are broken into non-contiguous pages,
an NFSv3 symlink request could be larger than a page and
so non-contiguous.
This patch copies the symlink into a new page (while checking
for nul bytes) so nfsd_symlink will definitely get a
contiguous link.
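A rough sketch of that copy - the helper name and the iovec-walking
details are assumptions, not the nfsd code itself.

    static char *copy_symlink_sketch(struct iovec *vec, int nvec, unsigned int len)
    {
            char *page;
            unsigned int done = 0;
            int i;

            if (len >= PAGE_SIZE)
                    return NULL;
            page = (char *)__get_free_page(GFP_KERNEL);
            if (page == NULL)
                    return NULL;

            /* Gather the possibly page-spanning name into one page... */
            for (i = 0; i < nvec && done < len; i++) {
                    unsigned int n = vec[i].iov_len;
                    if (n > len - done)
                            n = len - done;
                    memcpy(page + done, vec[i].iov_base, n);
                    done += n;
            }
            page[len] = '\0';

            /* ...and reject embedded nul bytes. */
            if (memchr(page, '\0', len) != NULL) {
                    free_page((unsigned long)page);
                    return NULL;
            }
            return page;            /* caller must free_page() this */
    }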
Andrew Morton [Thu, 21 Nov 2002 06:37:01 +0000 (22:37 -0800)]
[PATCH] detect uninitialised per-cpu storage
So poor old Dave spent days hunting down memory corruption because the
`kstat' per-cpu storage is not initialised (it needs to be, it's a workaround
for ancient gcc's).
The same problem had me hunting for a day too.
This patch, based on an initial version from Rusty, will
parse System.map at final link and will fail the build if
any per-cpu symbols are found to be not in the percpu section.
Nicolas Mailhot [Thu, 21 Nov 2002 06:25:51 +0000 (22:25 -0800)]
[PATCH] Via KT400 agp support
This adds the KT400 pci ID and lists it as using Via generic setup
routines. This patch has been tested with all GL xscreensavers I could
find, and been reviewed by Dave Jones (full patch history at
http://bugzilla.kernel.org/show_bug.cgi?id=14).
Dave Jones [Thu, 21 Nov 2002 04:24:53 +0000 (20:24 -0800)]
[PATCH] A new Athlon 'bug'.
Very recent Athlons (Model 8 stepping 1 and above) (XPs/MPs and mobiles)
have an interesting problem. Certain bits in the CLK_CTL register need
to be programmed differently to those in earlier models. The problem arises
when people plug these new CPUs into boards running BIOSes that are unaware
of this fact.
The fix is to reprogram CLK_CTL to 0x200xxxxx instead of 0x600xxxxx as it was
in previous models. The AMD folks have found that this improves stability.
The patch below does this reprogramming if an affected model/bios is
detected.
I'm interested if someone with an affected model could run some
benchmarks before and after to also see if this affects performance.
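For reference, the fixup would look something like this; the MSR number
and the exact model/stepping test are assumptions based on the description
above, not the literal patch.

    #define MSR_K7_CLK_CTL  0xC001001B      /* assumed MSR number */

    static void clk_ctl_fixup_sketch(struct cpuinfo_x86 *c)
    {
            u32 l, h;

            /* Only model 8 stepping 1 and later are affected. */
            if (c->x86_model < 8 || (c->x86_model == 8 && c->x86_mask < 1))
                    return;

            rdmsr(MSR_K7_CLK_CTL, l, h);
            if ((l & 0xfff00000) != 0x20000000) {
                    printk(KERN_INFO "CPU: CLK_CTL 0x%08x -> 0x%08x\n",
                           l, (l & 0x000fffff) | 0x20000000);
                    wrmsr(MSR_K7_CLK_CTL, (l & 0x000fffff) | 0x20000000, h);
            }
    }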
Ivan Kokshaysky [Thu, 21 Nov 2002 00:59:50 +0000 (16:59 -0800)]
[PATCH] compile fixes
- iovec stuff in linux/uio.h is needed for CONFIG_OSF4_COMPAT;
- pcibios_{read,write}_config_xx has gone - replaced with
respective pci_bus_xx functions.
Russell King [Thu, 21 Nov 2002 00:04:45 +0000 (00:04 +0000)]
[ARM] Fix ARM module support
This cset allows ARM modules to work again. The solution was
suggested by Andi Kleen.
We shrink the available user space size by 16MB, thereby opening up
a window in virtual memory space between user space and the kernel
direct mapped RAM. We place modules into this space, and, since the
kernel image is always at the bottom of kernel direct mapped RAM, we
can be assured that any 24-bit PC relocations (which have a range
of +/- 32MB) will always be able to reach the kernel.
Patrick Mochel [Wed, 20 Nov 2002 15:08:40 +0000 (09:08 -0600)]
partitions: use the name in disk->kobj.name, instead of disk->disk_name.
Some names (for some reason) have a '/' in them, making them no good for directory
names. disk->kobj.name has already been transformed to turn those into '!', so this
makes sure we use that transformed name when setting the partitions' names.
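The transformation itself is trivial; as a sketch (the helper name is
made up):

    static void sanitize_name_sketch(char *name)
    {
            char *s;

            /* '/' cannot appear in a directory name, so turn it into '!'. */
            for (s = name; *s; s++)
                    if (*s == '/')
                            *s = '!';
    }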
Patrick Mochel [Wed, 20 Nov 2002 14:08:19 +0000 (08:08 -0600)]
driver model: update and clean bus and driver support.
This is a multi-pronged attack aimed at exploiting the kobject infrastructure more.
- Remove bus_driver_list, in favor of the list in bus_subsys.
- Remove bus_for_each_* and driver_for_each_dev(). They're not being used by
anyone, have questionable locking semantics, and really don't provide that
much use, as the function returns once the callback fails, with no indication
of where it failed. Forget them, at least for now.
- Make sure that we return success from bus_match() if device matches, but
doesn't have a probe method.
- Remove extraneous get_{device,driver}s from bus routines that are serialized
by the bus's rwsem. bus_{add,remove}_{device,driver} all take the rwsem, so there
is no way we can get a non-existent object when in those functions.
- Use the rwsem in the struct subsystem the bus has embedded in it, and kill the
separate one in struct bus_type.
- Move the bulk of driver_register() into bus_add_driver(), which holds the bus's
rwsem for the entire duration. This will prevent the driver from being unloaded while
it's being registered, and two drivers with the same name from getting registered
at the same time.
- Ditto for driver_unregister() and bus_remove_driver().
- Add a driver_release() method for the buses' driver subsystems. (Explained later.)
- Use only the refcounts in the buses' kobjects, and kill the one in struct bus_type.
- Kill struct bus_type::present and struct device_driver::present. These didn't
work out the way we intended them to. The idea was to not let a user obtain a
reference to the object if it was in the process of being unregistered.
All the code paths should be fixed now such that their registration is protected with
a semaphore, so no partially initialized objects can be removed, and enough
infrastructure is moved to the kobject model so that once the object is publicly
visible, it should be usable by other sources.
- Add a bus_sem to serialize bus registration and unregistration.
- Add struct device_driver::unload_sem to prevent unloading of drivers
with a positive reference count.
The driver model has always had a bug that would allow a driver with a
positive reference count to be unloaded. It would decrement the reference
count and return, letting the module be unloaded, without accounting for
the other users of the object. This has been discussed many times, though
never resolved cleanly. This should fix the problem in the simplest manner.
struct device_driver gets unload_sem, which is initialized to _locked_. When
the reference count for the driver reaches 0, the semaphore is unlocked.
driver_unregister() blocks on acquiring this lock before it exits. In the
normal case that driver_unregister() drops the last reference to the driver,
the lock will be acquired immediately, and the module will unload.
In the case that someone else is using the driver object, driver_unregister()
will not be able to acquire the lock, since the refcount has not reached 0,
and the lock has not been released.
This means that rmmod(8) will block while drivers' sysfs files are open.
There are no sysfs files for drivers yet, but note this when they do have
some.
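A sketch of the unload_sem handshake; the refcount shown here is a
hypothetical stand-in for the kobject reference count, and the helper
names are illustrative.

    void driver_register_sketch(struct device_driver *drv)
    {
            init_MUTEX_LOCKED(&drv->unload_sem);    /* born locked */
            /* ... add the driver to its bus, bind devices ... */
    }

    void put_driver_sketch(struct device_driver *drv)
    {
            /* Last reference gone: allow driver_unregister() to finish. */
            if (atomic_dec_and_test(&drv->refcount_sketch))
                    up(&drv->unload_sem);
    }

    void driver_unregister_sketch(struct device_driver *drv)
    {
            /* ... unbind devices, remove from the bus, drop our reference ... */
            put_driver_sketch(drv);

            /* Block until every other holder has dropped theirs. */
            down(&drv->unload_sem);
    }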
Patrick Mochel [Wed, 20 Nov 2002 13:37:09 +0000 (07:37 -0600)]
sysfs: various updates.
- Don't do extra dget() when creating symlink.
This is a long-standing bug with a simple and obvious fix. We were doing
an extra dget() on the dentry after d_instantiate(). It only gets decremented
once on removal, so the dentry was never really going away, and the directory
wasn't, either.
- Use simple_unlink() instead of sysfs_unlink().
- Use simple_rmdir() instead of our own, unrolled, version.
- Remove MODULE_LICENSE(), since sysfs is always built into the kernel.