Thanks

drbj153_at_iitg.ernet.in
Date: Mon Nov 03 2008 - 20:27:43 PST

  • Next message: Neal Becker: "problem with permission()"
    Many thanks. For your information, checkpointing is now working.
    If I face any further problems, I will let you know.
    You replied despite being so busy, and that is more than enough.
    Thank you again.
    
    ---Dhruba
    
    > I am sorry about the slow reply.  I am very busy right now.
    >
    > We don't have LAM/MPI installed anywhere for testing of our own.  However, I
    > have tried the following simple non-MPI program based on your code:
    >
    > #include <stdio.h>
    > int main(int argc, char **argv)
    > {
    >    int i;
    >    scanf("%d",&i); printf("1st read: %d\n", i);
    >    scanf("%d",&i); printf("2nd read: %d\n", i);
    >    return 0;
    > }
    >
    > If I checkpoint while the program is blocked at the first scanf(), and then I
    > restart, I find that the application is not responding to input.  However, if
    > I hit ^Z and then type "fg" to the shell the application behaves normally:
    >
    > $ ./bin/cr_restart context.2307
    > [ENTER]
    > [ENTER]
    > [^Z]
    > [1]+  Stopped                 ./bin/cr_restart context.2307
    > $ fg
    > ./bin/cr_restart context.2307
    > 1
    > 1st read: 1
    > 1
    > 2nd read: 1
    >
    >
    > So, there does appear to be something odd about how the read() has been
    > restarted.  We don't normally deal much with applications with standard input,
    > but this certainly seems like a BLCR bug.
    >
    > My recommendation is to avoid using I/O in this way.  In general, reading
    > stdin in an MPI program is poorly defined anyway (for instance, does only rank
    > 0 get the input, or is it cloned for all ranks?).
    >
    > I am guessing you wanted a way to cause your program to wait for a checkpoint
    > to be taken.  In my own test codes, I often call "pause()" for this reason.
    > Because BLCR's checkpoints are initiated using signals, the pause() will
    > return only after the checkpoint has been taken.
    >
    > Let us (checkpoint_at_lbl_dot_gov) know if you need any more assistance, but be
    > warned that our response is likely to be slow between now and the end of
    > November.
    >
    >
    > -Paul
    >
    >
    > [email protected] wrote:
    >>
    >> Please respond to my previous mail. I still have not been able to work out
    >> what to do next. Please reply; I will be thankful to you.
    >>
    >> Thanking you.
    >>
    >>> Dear Paul,
    >>>
    >>> I have run a simple program following the instructions in the LAM/MPI
    >>> documentation. I ran mpirun once on the head node only and once across
    >>> all the nodes. In both cases "lamcheckpoint" succeeded and generated
    >>> the context files (i.e. context.mpirun.3270, context.3270-n0-3271, etc.)
    >>> for all the processes. Up to this step I think everything is OK.
    >>>
    >>> Note also that once the program starts, it asks for input. At that point
    >>> I checkpointed the program and killed it.
    >>>
    >>> The problem is with restart. When I give the "lamrestart" command, the job
    >>> restarts, but it does not behave as the program should: it does not
    >>> respond. In this situation the process cannot be killed. The status of
    >>> the PID for the job is Dl+.
    >>>
    >>> Am I doing this right, or is there something wrong with my test program?
    >>> For your convenience I have attached my test program.
    >>>
    >>> Please advise me on how to proceed.
    >>>
    >>>
    >>> Thanking you.
    >>>
    >>> Dhruba
    >>> IIT Guwahati, India
    >>>
    >>>
    >>>
    >>>> If you have not yet done so, please read the instructions for "lamcheckpoint",
    >>>> "lamrestart" and "checkpoint/restart of MPI jobs" - these are sections 7.2,
    >>>> 7.9 and 9.5 in the LAM/MPI User Guide:
    >>>> http://www.lam-mpi.org/download/files/7.1.4-user.pdf
    >>>>
    >>>> If after following the instructions in the User Guide, you still have
    >>>> questions, you should ask again with some information about how the restart
    >>>> fails.  For instance, if there are any error messages or syslog messages from
    >>>> the compute nodes that might explain the failure.
    >>>>
    >>>> -Paul
    >>>>
    >>>>
    >>>> [email protected] wrote:
    >>>>> Dear sir,
    >>>>>
    >>>>> I am working on a project named "Fault tolerance using checkpoint and
    >>>>> recovery protocol using cluster based Distributed system" in the Computer
    >>>>> Science and Engineering Department, Indian Institute of
    >>>>> Technology, Guwahati, India.
    >>>>>
    >>>>> I have already set up a cluster with one head node and six client nodes
    >>>>> using OSCAR 5.0, and installed the LAM/MPI beta version integrated with BLCR.
    >>>>> Now I have a problem restarting the checkpointed process. Can you tell me
    >>>>> the proper procedure for checkpointing an MPI program?
    >>>>>
    >>>>> Thanking you.
    >>>>>
    >>>>> Dhruba
    >>>>> IIT Guwahati,India
    >>>>>
    >>>>>
    >>>>
    >>>> --
    >>>> Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    >>>> Future Technologies Group
    >>>> HPC Research Department                   Tel: +1-510-495-2352
    >>>> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
    >>>>
    >>
    >>
    >>
    >>
    >
    >
    > --
    > Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    > Future Technologies Group
    > HPC Research Department                   Tel: +1-510-495-2352
    > Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
    >
    >
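
The pause() suggestion quoted above can be sketched roughly as follows. This is only an illustration under stated assumptions, not BLCR's or Paul's actual test code: it makes no BLCR calls, the function names (install_standin_handler, pause_until_signal, standin_signal_seen) are hypothetical, and whatever signal the caller passes in is merely a stand-in for the signal that initiates a checkpoint. It shows only the general POSIX behavior the advice relies on: pause() blocks until a signal handler has run, then returns -1 with errno set to EINTR.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Set by the handler so the caller can confirm that a signal arrived. */
static volatile sig_atomic_t signal_seen = 0;

static void on_standin_signal(int signo)
{
    (void)signo;
    signal_seen = 1;
}

/* Hypothetical helper: install a handler for a stand-in signal.
 * In a real BLCR run the checkpoint machinery handles its own signal;
 * this handler exists only so the sketch works without BLCR.
 * Returns 0 on success, -1 on failure (as sigaction() does). */
int install_standin_handler(int signo)
{
    struct sigaction sa;

    assert(signo > 0);
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_standin_signal;
    sigemptyset(&sa.sa_mask);
    return sigaction(signo, &sa, NULL);
}

/* Hypothetical helper: block until a handled signal has been delivered.
 * pause() returns only after a signal handler has run, and then it
 * returns -1 with errno == EINTR.  Returns 0 in that case, -1 otherwise. */
int pause_until_signal(void)
{
    errno = 0;
    int rc = pause();
    return (rc == -1 && errno == EINTR) ? 0 : -1;
}

/* Reports whether the stand-in handler has run. */
int standin_signal_seen(void)
{
    return signal_seen;
}
```

A test program would call pause_until_signal() at the point where it should wait, in place of a blocking scanf(), and carry on once the call returns. To try the sketch without BLCR, install the handler for SIGALRM and call alarm(1) before pausing.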
    
