Linus Torvalds [Sat, 21 Dec 2002 11:17:47 +0000 (03:17 -0800)]
More mtrr/if.c fixes
- printk is not an acceptable substitute for errors
- fix indentation of mtrr_close()
- fix duplicate mtrr "release" fn pointer initializer
Attached is a patch that passes the correct information back to
userland for the number of attachments to a shared memory segment. I
could have made a few more changes to the way nattach is set for
regular cases, but I want to limit the change at this point.
Andrew Morton [Sat, 21 Dec 2002 09:08:17 +0000 (01:08 -0800)]
[PATCH] hugetlb bugfixes
From Rohit Seth
1) Bug fixes (mainly in the unsuccessful attempts of hugepages).
i) not modifying the value of key for unsuccessful key
allocation
ii) Correct usage of mmap_sem in free_hugepages
iii) Proper unlocking of key->lock for partial hugepage
allocations
2) Include the IPC_LOCK for permission to use hugepages via the
syscall interface. This brings the syscall interface into line with
the hugetlbfs interface.
It also permits users who are in the superuser group to
access hugetlb resources. This is so that database servers can run
without elevated permissions.
3) Increment the key_counts during forks to correctly identify the
number of processes referencing a key.
Andrew Morton [Sat, 21 Dec 2002 09:08:12 +0000 (01:08 -0800)]
[PATCH] ext3: fix buffer dirtying
This is a forward-port from 2.4. One of Stephen's recent fixes. I
managed to merge up only half of it. Here is the rest. It should fix
the assertion failure reported by Robert Macaulay
<robert_macaulay@dell.com>
"There was a race window in buffer refiling where we could temporarily
expose the journal's internal BH_JBDDirect flag as BH_Dirty, which is
visible to the rest of the VFS. That doesn't affect the journaling,
because we hold journal_head locks while the buffer is in this
transient state, but bdflush can see the buffer and write it out
unexpectedly, causing ext3 to find the buffer in an unexpected state
later."
"The fix simply keeps the dirty bits clear during the internal buffer
processing, restoring the state to the private BH_JBDDirect once
refiling is complete."
Andrew Morton [Sat, 21 Dec 2002 09:08:07 +0000 (01:08 -0800)]
[PATCH] ext3 use-after-free bugfix
If ext3_add_nondir() fails it will do an iput() of the inode. But we
continue to run ext3_mark_inode_dirty() against the potentially-freed
inode. This oopses when slab poisoning is enabled.
Fix it so that we only run ext3_mark_inode_dirty() if the inode was
successfully instantiated.
Andrew Morton [Sat, 21 Dec 2002 09:07:52 +0000 (01:07 -0800)]
[PATCH] ext3: smarter block allocation startup
When an ext3 (or ext2) file is first created the filesystem has to
choose the initial starting block for its data allocations. In the
usual (new-file) case, that initial goal block is the zeroeth block of
a particular blockgroup.
This is the worst possible choice. Because it _guarantees_ that this
file's blocks will be pessimally intermingled with the blocks of
another file which is growing within the same blockgroup.
We've always had this problem with files in the same directory. With
the introduction of the Orlov allocator we now have the problem with
files in different directories. And it got noticed. This is the cause
of the post-Orlov 50% slowdown in dbench throughput on ext3 on
write-through caching SCSI on SMP. And 25% in ext2.
It doesn't happen on uniprocessor because a single CPU will not exhibit
sufficient concurrency in allocation against two or more files.
It will happen on uniprocessor if the files are growing slowly.
It has always happened if the files are in the same directory.
ext2 has the same problem but it is significantly less damaging there
because of ext2's eight-block per-inode preallocation window.
The patch largely solves this problem by not always starting the
allocation goal at the zeroeth block of the blockgroup. We instead
chop the blockgroup into sixteen starting points and select one of those
based on the lower four bits of the calling process's PID.
The PID was chosen as the index because this will help to ensure that
related files have the same starting goal. If one process is slowly
writing two files in the same directory, we still lose.
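The starting-point selection described above can be sketched roughly as
follows. This is a userspace illustration only, not the patch itself:
the helper name goal_start_block() and its parameters are made up for
the example, and it assumes a blockgroup holds at least sixteen blocks.

```c
#include <assert.h>
#include <sys/types.h>

#define GOAL_SLOTS 16UL /* sixteen starting points per blockgroup */

/*
 * Hypothetical helper: return the first block to try for a new file.
 *   group_first_block: first block of the chosen blockgroup
 *   blocks_per_group:  size of a blockgroup, in blocks
 *   pid:               PID of the process creating the file
 */
static unsigned long goal_start_block(unsigned long group_first_block,
                                      unsigned long blocks_per_group,
                                      pid_t pid)
{
    unsigned long slot;

    assert(blocks_per_group >= GOAL_SLOTS);

    /* The lower four bits of the PID select one of the sixteen slots. */
    slot = (unsigned long)pid & (GOAL_SLOTS - 1);

    /* Each slot starts one sixteenth of a blockgroup after the previous one. */
    return group_first_block + slot * (blocks_per_group / GOAL_SLOTS);
}
```

Two processes whose PIDs differ in the lower four bits get different
starting goals within the same blockgroup, while every file created by
one process starts from the same goal.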
Using the PID in the heuristic is a bit weird. As an alternative I
tried using the file's directory's i_ino. That fixed the dbench
problem OK but caused a 15% slowdown in the fast-growth `untar a kernel
tree' workload. Because this approach will cause files which are in
different directories to spread out more. Suppressing that behaviour
when the files are all being created by the same process is a
reasonable heuristic.
I changed dbench to never unlink its files, and used e2fsck to
determine how many fragmented files were present after a `dbench 32'
run. With this patch and the next couple, ext2's fragmentation went
from 22% to 13% and ext3's from 25% to 10.4%.
Andrew Morton [Sat, 21 Dec 2002 09:07:46 +0000 (01:07 -0800)]
[PATCH] ext2/3: better starting group for S_ISREG files
ext2 places non-directory objects into the same blockgroup as their
directory, as long as that directory has free inodes. It does this
even if there are no free blocks in that blockgroup (!).
This means that if there are lots of files being created at a common
point in the tree, they _all_ have the same starting blockgroup. For
each file we do a big search forwards for the first block and the
allocations end up getting intermingled.
So this patch will avoid placing new inodes in block groups which have
no free blocks.
So far so good. But this means that if a lot of new files are being
created under a directory (or multiple directories) which are in the
same blockgroup, all the new inodes will overflow into the same
blockgroup. No improvement at all.
So the patch arranges for the new inode locations to be "spread out"
across different blockgroups if they are not going to be placed in
their directory's block group. This is done by adding parent->i_ino
into the starting point for the quadratic hash. i_ino was chosen so
that files which are in the same directory will tend to all land in the
same new blockgroup.
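A rough userspace sketch of that group selection is below. It is not
the kernel code: struct toy_group, pick_group() and the linear fallback
at the end are assumptions made for the example. Only the idea of
seeding a quadratic probe with the parent's inode number, and of
skipping groups with no free blocks, comes from the description above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-blockgroup summary. */
struct toy_group {
    unsigned free_inodes;
    unsigned free_blocks;
};

/* A group is only acceptable if it has both a free inode and free blocks. */
static int group_usable(const struct toy_group *g)
{
    return g->free_inodes > 0 && g->free_blocks > 0;
}

/* Returns the chosen group number, or -1 if no group is usable. */
static int pick_group(const struct toy_group *groups, unsigned ngroups,
                      unsigned parent_group, unsigned long parent_ino)
{
    unsigned group, i;

    assert(groups != NULL);
    if (ngroups == 0)
        return -1;

    /* First choice: the directory's own blockgroup. */
    parent_group %= ngroups;
    if (group_usable(&groups[parent_group]))
        return (int)parent_group;

    /*
     * Quadratic probe, seeded with the parent's inode number so that
     * files from the same directory tend to land in the same group.
     */
    group = (unsigned)((parent_group + parent_ino) % ngroups);
    for (i = 1; i < ngroups; i <<= 1) {
        group = (group + i) % ngroups;
        if (group_usable(&groups[group]))
            return (int)group;
    }

    /* Assumed fallback: plain linear scan over every group. */
    for (i = 0; i < ngroups; i++) {
        group = (group + 1) % ngroups;
        if (group_usable(&groups[group]))
            return (int)group;
    }
    return -1;
}
```

Because the seed depends only on the parent directory, repeated calls
for the same directory probe the groups in the same order.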
Andrew Morton [Sat, 21 Dec 2002 09:07:33 +0000 (01:07 -0800)]
[PATCH] Give kswapd writeback higher priority than pdflush
The `low latency page reclaim' design works by preventing page
allocators from blocking on request queues (and by preventing them from
blocking against writeback of individual pages, but that is immaterial
here).
This has a problem under some situations. pdflush (or a write(2)
caller) could be saturating the queue with highmem pages. This
prevents anyone from writing back ZONE_NORMAL pages. We end up doing
enormous amounts of scanning.
A test case is to mmap(MAP_SHARED) almost all of a 4G machine's memory,
then kill the mmapping applications. The machine instantly goes from
0% of memory dirty to 95% or more. pdflush kicks in and starts writing
the least-recently-dirtied pages, which are all highmem. The queue is
congested so nobody will write back ZONE_NORMAL pages. kswapd chews
50% of the CPU scanning past dirty ZONE_NORMAL pages and page reclaim
efficiency (pages_reclaimed/pages_scanned) falls to 2%.
So this patch changes the policy for kswapd. kswapd may use all of a
request queue, and is prepared to block on request queues.
What will now happen in the above scenario is:
1: The page allocator scans some pages, fails to reclaim enough
memory and takes a nap in blk_congestion_wait().
2: kswapd() will scan the ZONE_NORMAL LRU and will start writing
back pages. (These pages will be rotated to the tail of the
inactive list at IO-completion interrupt time).
This writeback will saturate the queue with ZONE_NORMAL pages.
Conveniently, pdflush will avoid the congested queues. So we end up
writing the correct pages.
In this test, kswapd CPU utilisation falls from 50% to 2%, page reclaim
efficiency rises from 2% to 40% and things are generally a lot happier.
The downside is that kswapd may now do a lot less page reclaim,
increasing page allocation latency, causing more direct reclaim,
increasing lock contention in the VM, etc. But I have not been able to
demonstrate that in testing.
The other problem is that there is only one kswapd, and there are lots
of disks. That is a generic problem - without being able to co-opt
user processes we don't have enough threads to keep lots of disks saturated.
One fix for this would be to add an additional "really congested"
threshold in the request queues, so kswapd can still perform
nonblocking writeout. This gives kswapd priority over pdflush while
allowing kswapd to feed many disk queues. I doubt if this will be
called for.
Andrew Morton [Sat, 21 Dec 2002 09:07:12 +0000 (01:07 -0800)]
[PATCH] more informative slab poisoning
slab poisons objects with 0x5a both when they are constructed and when
they are freed. So it is not possible to tell whether a deref of
0x5a5a5a5a was a use-before-initialisation bug or a use-after-free bug.
The patch changes it so that
1) A deref of 0x5a5a5a5a means use-of-uninitialised-memory
2) A deref of 0x6b6b6b6b means use-of-freed-memory.
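As an illustration of why two patterns help, here is a minimal
userspace sketch. The toy_* helpers are hypothetical and are not the
slab allocator's interface; only the two byte values come from the
description above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define POISON_UNINIT 0x5a /* written when an object is constructed */
#define POISON_FREED  0x6b /* written when an object is freed */

enum toy_bad_use { TOY_OK, TOY_USE_UNINITIALISED, TOY_USE_AFTER_FREE };

static void toy_poison_on_alloc(void *obj, size_t size)
{
    assert(obj != NULL);
    memset(obj, POISON_UNINIT, size);
}

static void toy_poison_on_free(void *obj, size_t size)
{
    assert(obj != NULL);
    memset(obj, POISON_FREED, size);
}

/* Classify a 32-bit value that was about to be dereferenced. */
static enum toy_bad_use toy_classify(uint32_t word)
{
    if (word == 0x5a5a5a5aU)
        return TOY_USE_UNINITIALISED;
    if (word == 0x6b6b6b6bU)
        return TOY_USE_AFTER_FREE;
    return TOY_OK;
}
```

With a single shared pattern the two failure cases in toy_classify()
would be indistinguishable.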
Andrew Morton [Sat, 21 Dec 2002 09:07:06 +0000 (01:07 -0800)]
[PATCH] fix use-after-free bug in move_vma()
move_vma() calls do_munmap() and then uses the memory at *new_vma.
But when starting X11 it just happens that the memory which do_munmap
unmapped had the same start address as the range at *new_vma. So new_vma
is freed by do_munmap().
This was never noticed before because (vm_flags & VM_LOCKED) evaluates
false when vm_flags is 0x5a5a5a5a. But I just changed that to 0x6b6b6b6b
and boom - we call make_pages_present() with start == end == 0x6b6b6b6b and
it goes BUG.
So I think the right fix here is for move_vma() to not inspect the values
of any vma's after it has called do_munmap().
The patch does that, for `new_vma'.
The local variable `vma' is also being used after the call to do_munmap(),
and this may also be a bug. Proving that this is not so, and adding a
comment to explain why is hereby added to Hugh's todo list ;)
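The shape of the fix for `new_vma' can be sketched in userspace as
below. Everything here is a stand-in: struct toy_vma, toy_unmap() and
toy_move_tail() are hypothetical, and TOY_VM_LOCKED is assumed to have
the same value as VM_LOCKED (0x00002000). The point is only that the
fields are copied into locals before the call that may free the
structure.

```c
#include <assert.h>
#include <string.h>

#define TOY_VM_LOCKED 0x00002000UL /* assumed to match VM_LOCKED */

struct toy_vma {
    unsigned long vm_start;
    unsigned long vm_end;
    unsigned long vm_flags;
};

/* Stand-in for do_munmap() freeing the vma: fill it with the free poison. */
static void toy_unmap(struct toy_vma *vma)
{
    memset(vma, 0x6b, sizeof(*vma));
}

/* Returns the length that would need make_pages_present(), or 0. */
static unsigned long toy_move_tail(struct toy_vma *new_vma)
{
    unsigned long start, end, flags;

    assert(new_vma != NULL);

    /* Copy everything we need before the structure can go away. */
    start = new_vma->vm_start;
    end = new_vma->vm_end;
    flags = new_vma->vm_flags;

    toy_unmap(new_vma); /* new_vma must not be inspected after this */

    if (flags & TOY_VM_LOCKED)
        return end - start;
    return 0;
}
```

Under the same assumption about the flag value, 0x5a5a5a5a has the
VM_LOCKED bit clear and 0x6b6b6b6b has it set, which is consistent with
the bug only showing up after the poison value changed.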
Andrew Morton [Sat, 21 Dec 2002 09:07:00 +0000 (01:07 -0800)]
[PATCH] fix a page dirtying race in vmscan.c
There's a small window in which another CPU could dirty the page after
we've cleaned it, and before we've moved it to mapping->dirty_pages().
The end result is a dirty page on mapping->locked_pages, which is
wrong.
So take mapping->page_lock before clearing the dirty bit.
Andrew Morton [Sat, 21 Dec 2002 09:06:54 +0000 (01:06 -0800)]
[PATCH] sync_fs deadlock fix
Running a `mount -o remount' against ext3 deadlocks if there is heavy
write activity. It's a sort of AB/BA deadlock caused by calling
log_wait_commit() under lock_super(). The caller holds lock_super()
and is waiting for a commit, but the commit cannot complete because
lock_super() is also used in the block allocator.
The way we fixed this in the past is to drop the superblock lock inside
ext3. The way this patch fixes it is to arrange for lock_super() to
not be held around the ->sync_fs() call.
Also: sync_filesystems is on the sys_sync() path and is racy wrt
unmount. Check sb->s_root after taking sb->s_umount.
Linus Torvalds [Sat, 21 Dec 2002 08:02:05 +0000 (00:02 -0800)]
Sysenter cleanups (originals by Brian Gerst, updated and expanded by me):
- set up kernel stack pointer for sysenter at each context switch.
- disable sysenter while in vm86 mode.
- clean up mtrr number defines and SEP feature testing
Ivan Kokshaysky [Sat, 21 Dec 2002 05:24:28 +0000 (21:24 -0800)]
[PATCH] PCI: setup-xx fixes
Don't disable PCI devices before changing the BARs, as discussed
recently. Disabling PCI_COMMAND_MASTER bit is an obvious bug.
Further, pdev_enable_device() is a leftover from very old (2.0, I guess)
alpha PCI code. It's used in pci_assign_unassigned_resources() to
enable *every* PCI device in the system. So, if we have two graphic
cards on the same bus, both with legacy VGA IO... oops.
Actually, only alpha relied on that due to the lack of
pcibios_enable_device (which has been already fixed).
Manfred Spraul [Sat, 21 Dec 2002 04:38:39 +0000 (20:38 -0800)]
[PATCH] new attempt at sys_poll allocation (was: Re: Poll patches..)
This replaces the dynamically allocated two-level array in sys_poll with
a dynamically allocated linked list. The current implementation causes
at least two alloc/free calls, even if only one or two descriptors are
polled. This reduces that to one alloc/free, and the .text segment is
around 220 bytes shorter. The microbenchmark that polls one pipe fd is
around 30% faster. [1140 cycles instead of 1604 cycles, Celeron mobile
1.13 GHz]
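For illustration, a linked list of chunks with a single allocation in
the small case might look like the userspace sketch below. The names
(struct poll_chunk, poll_chunks_alloc(), CHUNK_MAX_FDS) and the
per-chunk capacity of 64 are assumptions for the example, not taken
from the patch.

```c
#include <assert.h>
#include <poll.h>
#include <stdlib.h>

#define CHUNK_MAX_FDS 64 /* hypothetical per-chunk capacity */

struct poll_chunk {
    struct poll_chunk *next;
    size_t len;              /* number of entries in this chunk */
    struct pollfd entries[]; /* descriptors live in the same allocation */
};

static void poll_chunks_free(struct poll_chunk *head)
{
    while (head) {
        struct poll_chunk *next = head->next;
        free(head);
        head = next;
    }
}

/* Allocate room for nfds descriptors; the caller fills in the entries. */
static struct poll_chunk *poll_chunks_alloc(size_t nfds)
{
    struct poll_chunk *head = NULL, **tail = &head;

    while (nfds > 0) {
        size_t n = nfds < CHUNK_MAX_FDS ? nfds : CHUNK_MAX_FDS;
        struct poll_chunk *c = malloc(sizeof(*c) + n * sizeof(struct pollfd));

        assert(n > 0 && n <= CHUNK_MAX_FDS);
        if (!c) {
            poll_chunks_free(head);
            return NULL;
        }
        c->next = NULL;
        c->len = n;
        *tail = c;
        tail = &c->next;
        nfds -= n;
    }
    return head;
}

/* Number of allocations backing the list. */
static size_t poll_chunks_count(const struct poll_chunk *head)
{
    size_t count = 0;

    for (; head; head = head->next)
        count++;
    return count;
}
```

Polling one or two descriptors costs a single allocation here, and
larger sets simply add more chunks to the list.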
Chuck Lever [Fri, 20 Dec 2002 14:27:42 +0000 (06:27 -0800)]
[PATCH] cleanup: simplify req_offset function in NFS client
Description:
everywhere the NFS client uses the req_offset() function today, it adds
req->wb_offset to the result. this patch simply makes "+req->wb_offset"
a part of the req_offset() function.
Test status:
Passes all Connectathon '02 tests with v2, v3, UDP and TCP. Passes
NFS torture tests on an x86 UP highmem system.
Chuck Lever [Fri, 20 Dec 2002 14:27:31 +0000 (06:27 -0800)]
[PATCH] use kmap_atomic instead of kmap in NFS client
Description:
andrew morton suggested there are places in the NFS client that could
make use of kmap_atomic instead of vanilla kmap in order to improve
scalability on 8-way and higher SMP systems.
Test status:
Passes all Connectathon '02 tests with v2 and v3, UDP and TCP; passes
NFS torture tests on a UP HIGHMEM x86 system.
Miles Bader [Fri, 20 Dec 2002 14:18:55 +0000 (06:18 -0800)]
[PATCH] Reduce redundancy in v850 linker scripts
This moves most of the duplicated text in the various v850 platform-
specific linker scripts (each of which was previously completely
standalone) into cpp macros in vmlinux.lds.S, which are then used by the
platform linker scripts as appropriate. This should make the scripts a
lot easier to maintain.
Miles Bader [Fri, 20 Dec 2002 14:18:29 +0000 (06:18 -0800)]
[PATCH] Add some v850 elf constants
These are used for the new in-kernel module loader (actually not all the
relocation types are used right now, but are included for completeness).
Only the EM_CYGNUS_V850 macro, which is in a global namespace, is added
to <linux/elf.h>; the relocation types, which are private to the v850,
are added to <asm-v850/elf.h>. [Perhaps some other archs can do a
similar split, to reduce the bloat in <linux/elf.h>]
Nathan Scott [Fri, 20 Dec 2002 22:05:45 +0000 (23:05 +0100)]
[XFS] Fix up setting up of sector size for the superblock buffer after the
very first read on mount. Make some of the surrounding code dealing
with buffers consistent.
Trond Myklebust [Fri, 20 Dec 2002 13:43:41 +0000 (05:43 -0800)]
[PATCH] Support for NFSv4 READ + WRITE attribute cache consistency
Retrieve the post-operation attribute changes for NFSv4 READ and
WRITE operations. Unlike for NFSv2 and NFSv3, we do not retrieve the
full set of file attributes. The main reason for this is that
interpreting attributes is a much heavier task on NFSv4 (requiring, for
instance, translation of file owner names into uids ...). Hence:
For a READ request, we retrieve only the 'change attribute' (for cache
consistency checking) and the atime.
For a WRITE request, we retrieve the 'change attribute' and the file size.
In addition, we retrieve the value of the change attribute prior to the
write operation, in order to be able to do weak cache consistency checking.
Trond Myklebust [Fri, 20 Dec 2002 13:42:20 +0000 (05:42 -0800)]
[PATCH] Clean up NFSv4 READ xdr path
This creates a clean XDR path for the NFSv4 read requests instead of
routing through encode_compound()/decode_compound(). This eliminates
the intermediate step of setting up a struct nfs4_compound before
proceeding to XDR encoding, and removes the large 'switch()' statements
from the codepath altogether.
Andi Kleen [Fri, 20 Dec 2002 13:37:00 +0000 (05:37 -0800)]
[PATCH] x86-64 merge
This patch depends on the i386 MTRR driver cleanup I sent earlier.
- Support non executable mappings for x86-64. data/heap are non executable
by default now.
- Beginnings of software suspend from Pavel (not working yet)
- Support generic compat functions and remove some shared code
in the 32bit emulation (Stephen Rothwell)
- Support hugetlbfs
- Some makefile updates
- Make sure all 32bit emulation functions return long, not int.
This fixes some problems with ERESTARTNOSYS et al. leaking to userspace.
- Add new system calls.
- Fix long standing fs/gs context switch bugs (thanks to Karsten Keil
for helping to fix that mess). Also make sure the gs selector is
set to 0 after an exec.
- Simplify TLS switching
- Paranoid CPUID check at bootup
- Reorder scatterlist to be more space efficient (Jes Soerensen)
- Enlarge 32bit address space to full 4GB.
- Beginnings of 32bit SYSCALL support (not completely working yet
and the vsyscall page is still missing)
- Various merges from i386
- New module loader
- Support threaded core dump (XMM saving for 32bit programs doesn't
work, but it appears to be broken on i386 too)
- Fix bug in signal stack rounding
- Remove DRM 32bit emulation.
- Use MTRR driver from i386
- Use bootflag.c from i386
- Various other fixes and cleanups.
Andi Kleen [Fri, 20 Dec 2002 13:35:10 +0000 (05:35 -0800)]
[PATCH] Some i386 cleanups - MTRR, bootflag
This does:
- fix one warning in bootflag.c
- change a few longs to int and int to long in the MTRR driver
to make it 64bit clean (should be a NOP for 32bit i386, but is needed
for x86-64)
- Convert the MTRR /proc interface to seq_file and remove the broken
compute_ascii() hack. This fixes some broken code, e.g. the old
mtrr_write was completely broken because the loop checking for
commands started with a "continue".
- Remove duplicated mtrr type strings.
Doug Ledford [Fri, 20 Dec 2002 16:00:06 +0000 (11:00 -0500)]
Change all uses of device->request_queue (was struct, now pointer)
Update scsi_scan so that we don't pass around a scsi_device struct for
scanning. Instead, we pass around a request_queue during
scanning and create and destroy device structs as needed. This
allows us to have a 1:1 correlation between scsi_alloc_sdev()
and scsi_free_sdev() calls, which we didn't have before.