Problem w/ exec()-family after restart.

From: Paul H. Hargrove (PHHargrove_at_lbl.gov)
Date: Tue Oct 22 2002 - 19:55:10 PDT


Tonight after our conference call I was able to trace the problem while 
on the subway.  It seems this is another manifestation of the parentage 
problem.

When you invoke execle() or one of its relatives, libpthread must 
terminate all the other threads.  To do this the main thread write()s a 
request to the pthread manager thread, which sends the pthread 
cancellation signal to all the other threads and waits for them before 
exiting itself.  Meanwhile the main thread waits for the manager thread 
to exit.  The problem arises because, after a restart, we are not yet 
rebuilding the proper parent-child relationships among the threads.  So 
one or more of these waits fails.
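
To illustrate why a wrong parent-child relationship breaks a wait, here 
is a minimal sketch.  It is not the libpthread code: it uses ordinary 
processes and plain waitpid() rather than the manager's cloned threads, 
and the helper name try_wait() is made up for this example.  It only 
shows that the kernel refuses a wait on anything that is not the 
caller's own child, failing with ECHILD.

```python
import errno
import os


def try_wait(pid):
    """Wait on pid; return 0 on success, or the errno if the wait fails.

    try_wait() is a hypothetical helper for this illustration only.
    """
    try:
        os.waitpid(pid, 0)
    except ChildProcessError as exc:
        # The kernel only lets a task wait on its own children.
        return exc.errno
    return 0


if __name__ == "__main__":
    # Case 1: a real child.  The parentage is correct, so the wait works.
    child = os.fork()
    if child == 0:
        os._exit(0)
    print("wait on own child:", try_wait(child))

    # Case 2: our own parent.  It exists, but it is not our child,
    # so the wait is rejected with ECHILD.
    result = try_wait(os.getppid())
    print("wait on non-child:", errno.errorcode.get(result, result))
```

By analogy, if the restarted threads do not have the parent the waiter 
expects, a wait issued during exec() can fail in the same way instead 
of reaping the thread.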

I found that the failure is intermittent: sometimes I would get lucky 
and the exec() would work, and sometimes it would not.

This is not a problem we can fix from user-space.  It must be fixed in 
the kernel.  This is among the things that Eric is working to fix before 
SC2002.

-Paul

-- 
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-495-2998