This implements one of the last missing POSIX threading details - exec()
semantics. Previous kernels had code that tried to handle it, but that
code had a number of disadvantages:
- it only worked if the exec()-ing thread was the thread group leader,
creating an asymmetry. This does not work if the thread group leader
has exited already.
- it was racy: it sent a SIGKILL to every thread in the group but did not
wait for them to actually process the SIGKILL. It did a yield() but
that is not enough. All 'other' threads have to finish processing
before we can continue with the exec().
This adds the same logic, but extended with the following enhancements:
- works from non-leader threads just as much as the thread group leader.
- waits for all other threads to exit before continuing with the exec().
- reuses the PID of the group.
It would perhaps be a more generic approach to add a new syscall,
sys_ungroup() - which would do largely what de_thread() does in this
patch.
But it's not really needed now - posix_spawn() is currently implemented
via starting a non-CLONE_THREAD helper thread that does a sys_exec().
There's no API currently that needs a direct exec() from a thread - but
it could be created (such as pthread_exec_np()). It would have the
advantage of not having to go through a helper thread, but the
difference is minimal.
This fixes one more exit-time resource accounting issue - and it's also
a speedup and a thread-tree (to-be thread-aware pstree) visual
improvement.
In the current code we reparent detached threads to the init thread.
This worked but was not very nice in ps output: threads showed up as
being related to init. There was also a resource-accounting issue, upon
exit they update their parent's (ie. init's) rusage fields -
effectively losing these statistics. Eg. 'time' under-reports CPU
usage if the threaded app is Ctrl-C-ed prematurely.
The solution is to reparent threads to the group leader - this is now
very easy since we have p->group_leader cached and it's also valid all
the time. It's also somewhat faster for applications that use
CLONE_THREAD but do not use the CLONE_DETACHED feature.
This fixes three resource accounting related bugs introduced by detached
threads:
- the 'child CPU usage' fields were updated in wait4 until now - this was
slightly buggy for a number of reasons, eg. if the exit_code writeout
faults then it's possible to trigger this code multiple times.
- those threads that do not go through wait4 were not properly accounted.
- sched_exit() was incorrectly assuming that current == parent. In the
detached case p->parent is the real parent.
With this patch applied things like 'time' work again for new-style
threaded apps.
This fixes a clone-flags bug noticed by Roland McGrath. The current
CLONE_DETACHED & CLONE_THREAD forcing code did things in the wrong
order, which makes it possible to force an oops the following way:
main () { syscall(120, 0x00400000); }
Instead of changing the order of CLONE_SIGHAND and CLONE_THREAD flag
forcing (which would fix the bug), the proper approach is to fail with
-EINVAL if invalid combinations of clone flags are detected. This
change does not affect existing applications.
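The fail-with-EINVAL approach can be sketched in userspace as follows. This
is only an illustration: the function name, the SK_-prefixed constants and
the exact set of rejected combinations are assumptions, not the patch's
code. The flag values mirror the 2.5-era definitions, 0x00400000 being the
CLONE_DETACHED value passed in the testcase above.

```c
#include <errno.h>

/* Illustrative copies of the clone flag values; the SK_ names are hypothetical. */
#define SK_CLONE_SIGHAND  0x00000800UL
#define SK_CLONE_THREAD   0x00010000UL
#define SK_CLONE_DETACHED 0x00400000UL

/*
 * Reject inconsistent flag combinations up front instead of silently
 * forcing the missing flags on. Returns 0 if acceptable, -EINVAL if not.
 * Assumed rules: CLONE_THREAD needs CLONE_SIGHAND, and CLONE_DETACHED
 * needs CLONE_THREAD.
 */
int sk_check_clone_flags(unsigned long flags)
{
	if ((flags & SK_CLONE_THREAD) && !(flags & SK_CLONE_SIGHAND))
		return -EINVAL;
	if ((flags & SK_CLONE_DETACHED) && !(flags & SK_CLONE_THREAD))
		return -EINVAL;
	return 0;
}
```

With a check of this shape the testcase's flag word is refused outright,
so the order in which flags would otherwise be forced no longer matters.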
The attached patch (against BK-curr) fixes a sys_wait4() bug noticed by
Ulrich Drepper. The kernel would not block properly if there are eligible
children delayed due to the new delayed thread-group-leader logic. The
solution is to introduce a new type of 'eligible child' - and skip
over delayed children but set the wait4 flag nevertheless.
The libpthreads testcase that failed because of this bug now works fine.
I fixed up the 'remove thread group inferiors from the tasklist' patch. I
think I managed to find a reasonably good construct to iterate over all
threads:
the only caveat with this is that the construct suggests a single-loop -
while it's two loops internally - and 'break' will not work. I added a
comment to sched.h that warns about this, but perhaps it would help more
to have naming that suggests two loops:
but this looks a bit too long. I don't know. We might as well use it all
unrolled and no helper macros - although with the above construct it's
pretty straightforward to iterate over all threads in the system.
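The 'break' caveat can be demonstrated with a stand-alone sketch. The macro
and data structures below are hypothetical and are not the helpers from the
patch (which are not reproduced here); they only show how a construct that
reads like one loop but expands to two nested loops behaves when the body
uses 'break'.

```c
#include <stddef.h>

struct sk_group {
	const int *tids;	/* thread ids in this group */
	size_t nr;		/* number of entries in tids */
};

/* Hypothetical helper: looks like one loop, expands to two nested ones. */
#define SK_FOR_EACH_THREAD(groups, nr_groups, g, t)		\
	for ((g) = 0; (g) < (nr_groups); (g)++)			\
		for ((t) = 0; (t) < (groups)[(g)].nr; (t)++)

/* Counts loop-body executions when the body tries to 'break' at stop_tid. */
size_t sk_visits_with_break(const struct sk_group *groups,
			    size_t nr_groups, int stop_tid)
{
	size_t g, t, visited = 0;

	SK_FOR_EACH_THREAD(groups, nr_groups, g, t) {
		visited++;
		if (groups[g].tids[t] == stop_tid)
			break;	/* leaves the inner loop only; the outer loop goes on */
	}
	return visited;
}

/* Two groups {1, 2} and {3, 4}; breaking at tid 1 still visits 3 and 4. */
size_t sk_demo_visits(void)
{
	static const int a[] = { 1, 2 }, b[] = { 3, 4 };
	const struct sk_group groups[] = { { a, 2 }, { b, 2 } };

	return sk_visits_with_break(groups, 2, 1);
}
```

A caller that really needs to stop early has to use a flag, a goto or a
return rather than 'break'.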
Petr Vandrovec [Sat, 14 Sep 2002 02:33:50 +0000 (19:33 -0700)]
[PATCH] 2.5.34-bk fcntl lockup
This fixes an endless loop without a schedule() call, which happens as soon as smbd
invokes fcntl64(7, F_SETLK64, ...). fcntl_setlk64 gets cmd F_SETLK64,
not the F_SETLK that is tested in the loop.
Maybe return value from posix_lock_file should be changed to -EINPROGRESS
or -EJUKEBOX instead of testing passed cmd in callers, but this oneliner
works too. If you prefer changing posix_lock_file return value to clearly
distinguish between -EAGAIN and lock request queued, I'll do that.
On 13 Sep 2002, Paul Larson wrote:
>
> The nightly LTP test against the 2.5 kernel bk tree last night turned up
> some test failures we don't normally see. These failures did not show
> up in the run from the previous night.
[...]
> I found what was breaking this, looks like it was this change from your
> shared thread signals patch:
> - if (sig < 1 || sig > _NSIG ||
> - (act && (sig == SIGKILL || sig == SIGSTOP)))
> + if (sig < 1 || sig > _NSIG || (act && sig_kernel_only(sig)))
This fixes this bug and a number of others in the same class - the
signal behavior bitmasks should never be consulted before making sure
that the signal is in the word range.
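A minimal userspace sketch of the "range check before bitmask" rule follows.
The sk_ names and the mask layout are assumptions for illustration and are
not the kernel's sig_kernel_only() definition; the point is only that the
bitmask is consulted after the signal number is known to fit in the word.

```c
#include <signal.h>
#include <stdbool.h>

/* Hypothetical stand-in: signals whose behaviour only the kernel may set. */
#define SK_KERNEL_ONLY_MASK \
	((1UL << (SIGKILL - 1)) | (1UL << (SIGSTOP - 1)))
#define SK_BITS_PER_WORD ((int)(sizeof(unsigned long) * 8))

bool sk_sig_kernel_only(int sig)
{
	/* Range check first: shifting by a negative or too-large count is undefined. */
	if (sig < 1 || sig > SK_BITS_PER_WORD)
		return false;
	return (SK_KERNEL_ONLY_MASK >> (sig - 1)) & 1UL;
}
```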
[LLC] remove all tmr ev structs & fix psnap and p8022 wrt ui sending
. No need for the timer_running member on llc_timer,
we only need it in one place, and timer_pending is
equivalent. One more procom OS generalisation killed.
. Move the skb->protocol assignment in llc_build_and_send_pkt
routines and llc_ui_send_data to the caller, this is the common
practice in Linux networking code (think netif_rx) and required
to keep the request functions in psnap and p8022 simple.
. Remove the rpt_status (report status) ev members, not
used at all, not even in the original procom code.
. Convert psnap and p8022 request functions to use
llc_ui_build_and_send_ui_pkt, removing all the prim cruft.
Andrew Morton [Fri, 13 Sep 2002 12:57:07 +0000 (05:57 -0700)]
[PATCH] Use a sync iocb for generic_file_read
This adds support for synchronous iocbs and converts generic_file_read
to use a sync iocb to call into generic_file_aio_read.
The tests I've run with lmbench on a piii-866 showed no difference in
file re-read speed when forced to use a completion path via aio_complete
and an -EIOCBQUEUED return from generic_file_aio_read -- people with
slower machines might want to test this to see if we can tune it any
better. Also, a bug fix to correct a missing call into the aio code
from the fork code is present. This patch sets things up for making
generic_file_aio_read actually asynchronous.
Andrew Morton [Fri, 13 Sep 2002 12:57:02 +0000 (05:57 -0700)]
[PATCH] readv/writev speedup
This is Janet Morgan's patch which converts the readv/writev code
to submit all segments for IO before waiting on them, rather than
submitting each segment separately.
This is a critical performance fix for O_DIRECT reads and writes.
Prior to this change, O_DIRECT vectored IO was forced to wait for
completion against each segment of the iovec rather than submitting all
segments and waiting on the lot. ie: for ten segments, this code will
be ten times faster.
There will also be moderate improvements for buffered IO - smaller code
paths, plus writev() only takes i_sem once.
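From userspace, the interface being sped up is the standard vectored IO
call. The sketch below hands three segments to the kernel in a single
writev() and reads them back through a pipe; the sk_ helper names are made
up for the example, and a pipe does not exercise the O_DIRECT path this
patch is about.

```c
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

/* Submit three buffers with a single writev() call. Returns bytes written or -1. */
ssize_t sk_gather_write(int fd)
{
	static char a[] = "seg0,", b[] = "seg1,", c[] = "seg2\n";
	struct iovec iov[3] = {
		{ .iov_base = a, .iov_len = strlen(a) },
		{ .iov_base = b, .iov_len = strlen(b) },
		{ .iov_base = c, .iov_len = strlen(c) },
	};

	return writev(fd, iov, 3);
}

/* Round trip through a pipe; returns 0 if the segments arrive as one stream. */
int sk_roundtrip(void)
{
	static const char want[] = "seg0,seg1,seg2\n";
	char got[32];
	int fds[2], ok;
	ssize_t n;

	if (pipe(fds) != 0)
		return -1;
	ok = sk_gather_write(fds[1]) == (ssize_t)strlen(want);
	n = read(fds[0], got, sizeof(got));
	ok = ok && n == (ssize_t)strlen(want) &&
	     memcmp(got, want, (size_t)n) == 0;
	close(fds[0]);
	close(fds[1]);
	return ok ? 0 : -1;
}
```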
The patch ended up quite large unfortunately - turned out that the only
sane way to implement this without duplicating significant amounts of
code (the generic_file_write() bounds checking, all the O_DIRECT
handling, etc) was to redo generic_file_read() and generic_file_write()
to take an iovec/nr_segs pair rather than `buf, count'.
New exported functions generic_file_readv() and generic_file_writev()
have been added:
If a driver does not use these in their file_operations then they will
continue to use the old readv/writev code, which sits in a loop calling
fops->read() or fops->write().
ext2, ext3, JFS and the blockdev driver are currently using this
capability.
Some coding cleanups were made in fs/read_write.c. Mainly:
- pass "READ" or "WRITE" around to indicate the direction of the
operation, rather than the (confusing, inverted)
VERIFY_READ/VERIFY_WRITE.
- Use the identifier `nr_segs' everywhere to indicate the iovec
length rather than `count', which is often used to indicate the
number of bytes in the syscall. It was confusing the heck out of me.
- Some cleanups to the raw driver.
- Some additional generality in fs/direct_io.c: the core `struct dio'
used to be a "populate-and-go" thing. Janet has broken that up so
you can initialise a struct dio once, then loop around feeding it
more file segments, then wait on completion against everything.
- In a couple of places we needed to handle the situation where we
knew, a-priori, that the user was going to get a short read or write.
File size limit exceeded, read past i_size, etc. We handled that by
shortening the iovec in-place with iov_shorten(). Which is not
particularly pretty, but neither were the alternatives.
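Shortening an iovec in place, as described in the last item, can be
sketched like this in userspace. The name sk_iov_shorten and its signature
are assumptions for illustration and are not claimed to match the
iov_shorten() in the patch.

```c
#include <stddef.h>
#include <sys/uio.h>

/*
 * Trim iov in place so that it describes at most 'to' bytes.
 * Returns the number of segments still in use.
 */
unsigned long sk_iov_shorten(struct iovec *iov, unsigned long nr_segs, size_t to)
{
	unsigned long seg = 0;
	size_t len = 0;

	while (seg < nr_segs) {
		seg++;
		if (len + iov->iov_len >= to) {
			/* clip the segment that reaches or crosses the limit */
			iov->iov_len = to - len;
			break;
		}
		len += iov->iov_len;
		iov++;
	}
	return seg;
}
```

The caller's iovec is modified, which is the "not particularly pretty"
part: anything that needs the original lengths afterwards has to keep its
own copy.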
Intermediate patch for the PF_LLC SOCK_DGRAM prim clean-up, now
PF_LLC is prims in the sending side, now to hack the core to
not use prims to send to PF_LLC.
This also fixes a skb leak on llc_sap_state_process.
David S. Miller [Fri, 13 Sep 2002 07:45:06 +0000 (00:45 -0700)]
[SPARC]: Update ide headers. WARNING: this is known broken, fixes coming from Jens Axboe.
- Jens needs to separate out the IN/OUT macros to separate what accesses
are to the IDE_DATA register and the rest. On big-endian platforms
the IDE_DATA register should be accessed in big-endian for it to all
work out correctly or at least be compatible with the behavior existing
before the IDE platform macro interface changes in 2.5.x
This implements the 'keep the initial thread around until every thread
in the group exits' concept in a different, less intrusive way, along
the lines of your suggestions. There is no exit_done completion handling anymore,
freeing of the task is still done by wait4(). This has the following
side-effect: detached threads/processes can only be started within a
thread group, not in a standalone way.
(This also fixes the bugs introduced by the ->exit_done code, which made
it possible for a zombie task to be reactivated.)
I've introduced the p->group_leader pointer, which can/will be used for
other purposes in the future as well - since from now on the thread
group leader is always existent. Right now it's used to notify the
parent of the thread group leader from the last non-leader thread that
exits [if the thread group leader is a zombie already].
I distilled the attached fix-patch from Daniel's bigger patch - it
includes all fixes for all currently known ptrace related breakages,
which include things like bad behavior (crash) if the tracer process
dies unexpectedly.
Neil Brown [Thu, 12 Sep 2002 08:42:59 +0000 (01:42 -0700)]
[PATCH] kNFSd 14: Filehandle lookup makes use of new export table structure.
Filehandle lookup currently breaks out the interesting pieces of
a filehandle and passes them to exp_get or exp_get_fsid, which put the
pieces back into a filehandle fragment.
We define a new interface "exp_find" which does a lookup based on
a filehandle fragment to avoid this double handling.
In the process, common code in exp_get_key and exp_get_fsid_key is united
into exp_find_key.
Also, filehandle composition now uses the mk_fsid_v? inline functions.
Neil Brown [Thu, 12 Sep 2002 08:42:43 +0000 (01:42 -0700)]
[PATCH] kNFSd 13: Separate out the multiple keys in the export hash table.
Currently each entry in the export table has two hash chains
going through it, one for hash-by-dev/ino, one for hash-by-fsid.
This is contrary to the goal of a simple hash table structure.
The two hash tables per client are replaced by one which stores 'exp_key's
which contain the key (as a file handle fragment) and a pointer to the
real export entry.
The export entries are then all stored in a single hash table indexed
by client+vfsmount+dentry.
Neil Brown [Thu, 12 Sep 2002 08:42:25 +0000 (01:42 -0700)]
[PATCH] kNFSd 12: Change exp_parent to walk directory tree, not hash table.
Currently get_parent (needed to find the exportpoint
above a given dentry) walks the hash table of export points
checking each with is_subdir. Now it walks up the d_parent
link checking each for membership in the hashtable.
nfsd_lookup currently does that walk too (when crossing
a mountpoint backwards) so the code gets unified.
This approach makes more sense as we move towards a cache
for export information that can be filled on demand.
It also assumes less about the hash table (which will change).
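The walk-up-the-parents idea can be sketched with a toy dentry type. The
sk_ names are hypothetical, the 'exported' flag stands in for the hash
table membership test, and starting the walk at the dentry itself is an
assumption; this is not the nfsd code.

```c
#include <stdbool.h>
#include <stddef.h>

struct sk_dentry {
	struct sk_dentry *parent;	/* the root points at itself */
	bool exported;			/* stand-in for "is in the export table" */
};

/*
 * Walk from d towards the root and return the first exported dentry,
 * or NULL if none is found before the root is passed.
 */
const struct sk_dentry *sk_find_export_above(const struct sk_dentry *d)
{
	while (d) {
		if (d->exported)
			return d;
		if (d->parent == d)	/* reached the root */
			return NULL;
		d = d->parent;
	}
	return NULL;
}
```

The cost is bounded by the depth of the dentry rather than by the number
of export points, and nothing here depends on how the table is laid out.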
Neil Brown [Thu, 12 Sep 2002 08:42:09 +0000 (01:42 -0700)]
[PATCH] kNFSd 11: Remove problematic "security" checks when NFS exporting.
The nfs server currently doesn't allow you to export both a
directory and an ancestor of that directory on the same filesystem.
This check is more of a problem than a solution and can be
done in user-space if needed, so it is removed.
The potential for a security problem is because the files
below the lower directory could be accessed as though they were under
either of the export points, and so the access control that is
applied might not be what is expected (by the naive admin).
e.g. export /a as readwrite and /a/b as readonly. Then /a/b/c
can be accessed readwrite as it is in /a which might not be the
intent. Alerting the user to this can be done in userspace though.
The current restriction also stops exporting / as readonly and
/tmp as read-write which some people want to do. Providing
/tmp is also exported subtree_check (the default) there is no
security issue here.