Re: dlopen() libcr.so, and problem with C++ compilers?

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Fri Feb 13 2009 - 11:56:48 PST

    Alan,
      Thanks for your interest in BLCR.  Please see my answers below.
    -Paul
    
    Alan Woodland wrote:
    > Hi,
    >
    > I've been working on using BLCR with my application, and I've
    > encountered a few issues:
    >
    > Q1:
    >
    > I've been trying to integrate BLCR into one of my applications such
    > that it will transparently work provided the machine the application
    > is running on has the BLCR library and kernel modules available.
    >
    > I was under the impression from the documentation that a sensible way
    > to make this work on both machines with and without BLCR was to use
    > dlopen()/dlsym() at run time, but the problem is that
    > cr_initialize_restart_args_t and cr_initialize_checkpoint_args_t are
    > both macros, which means they're not symbols in libcr.so - my only
    > options are to use the private interfaces they call (nasty) or link at
    > compile time here.
    >
    > Any suggestions for a better work-around than using the private
    > interfaces? It makes the software engineer in me die a little to do
    > that!
    >   
    
    I am glad to hear that you have that inner software engineer in 
    you.  The point of making internal interfaces internal is that we need 
    to change them from time to time, and they DO change.  I know of at 
    least one high-profile project that is stuck with an older version of 
    BLCR because they are using some internal interfaces that changed.
    
    The OpenMPI project is doing pretty much what you are (one build that 
    works both w/ and w/o BLCR present) using dlopen().  The only difference 
    is that you will need one level of indirection.  You are almost there 
    when you say "or link at compile time". What you need is to build 
    "myblcrsupport.o" or "myblcrsupport.so" that does link to libcr.so at 
    compile time, but the rest of your project does not.  In that 
    object/library your calls to the initializer macros will be expanded.  
    Then it is the "myblcrsupport.{o,so}" that you dlopen() instead of libcr 
    (or in addition to it, depending how you want to deal with RTLD_NOW vs 
    RTLD_LAZY).
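
    As a rough sketch of the application side of that indirection (not 
    taken from OpenMPI or from BLCR itself): the path passed in and the 
    exported symbol "my_blcr_checkpoint" are hypothetical names for 
    whatever your support library provides.  The support library is where 
    <libcr.h> gets included and where the initializer macros expand; the 
    code below needs nothing from BLCR at build time.

```c
#include <dlfcn.h>
#include <stddef.h>

/* Hypothetical entry point exported by the support library.  Inside that
 * library the function would include <libcr.h> and use the public
 * initializer macros; the application only ever sees this signature. */
typedef int (*my_checkpoint_fn)(int fd);

struct blcr_support {
    void *handle;                /* dlopen() handle, NULL if unavailable */
    my_checkpoint_fn checkpoint; /* NULL if unavailable */
};

/* Try to load the support library.  Returns 1 if checkpoint support is
 * available and 0 otherwise, so the caller can simply run without it. */
int blcr_support_load(struct blcr_support *s, const char *path)
{
    void *h;

    s->handle = NULL;
    s->checkpoint = NULL;

    /* RTLD_NOW makes a missing libcr.so fail here rather than at the
     * first call into the support library. */
    h = dlopen(path, RTLD_NOW | RTLD_LOCAL);
    if (h == NULL)
        return 0;

    /* Cast idiom recommended by POSIX for storing a dlsym() result in a
     * function pointer. */
    *(void **)(&s->checkpoint) = dlsym(h, "my_blcr_checkpoint");
    if (s->checkpoint == NULL) {
        dlclose(h);
        return 0;
    }

    s->handle = h;
    return 1;
}
```

    On older glibc versions this needs -ldl at link time.  If the load 
    fails the application just carries on with checkpointing disabled.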
    
    > Q2:
    >
    > This one's quite minor - I should really just use a C compiler instead
    > I guess...
    >
    > The macro  CR_RSTRT_RELOCATE_SIZE(CR_MAX_RSTRT_RELOC) has made life in
    > C++ harder. It evaluates to:
    >
    > (sizeof(struct cr_rstrt_relocate) + (16) * sizeof(struct cr_rstrt_relocate_pair))
    >
    > But I think in C++ this is one of those subtle C/C++ differences. In
    > C++ I think it needs to be
    > sizeof(cr_rstrt_relocate::cr_rstrt_relocate_pair)?
    > (sizeof(struct cr_rstrt_relocate) + (CR_MAX_RSTRT_RELOC *
    > sizeof(cr_rstrt_relocate::cr_rstrt_relocate_pair))) works instead.
    >
    > If cr_rstrt_relocate_pair were defined and declared outside of
    > cr_rstrt_relocate that would work for both C and C++ with the current
    > macro? Or a #ifdef __cplusplus for two versions of that macro?
    >
    >   
    
    I see the point here and since I almost never write C++ code myself 
    (though I read it just fine), I missed this problem when writing this 
    macro.  I think the preferred solution is your first: define the 
    relocate_pair at file scope, rather than in a nested scope.  I will try 
    to get that change in 0.8.1 (expected early March or when the 2.6.29 
    kernel is released).
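
    To illustrate the file-scope layout with stand-in names (the demo_* 
    identifiers and field names below are made up for this example and are 
    not the definitions in libcr.h): in C a struct declared inside another 
    struct is visible at file scope anyway, while in C++ it is nested in 
    the outer type's scope.  Declaring the pair first means the same 
    "struct demo_relocate_pair" spelling resolves in both languages.

```c
#include <stddef.h>
#include <stdlib.h>

/* Pair type declared at file scope, so "struct demo_relocate_pair"
 * names the same type whether the header is read as C or as C++. */
struct demo_relocate_pair {
    char *oldpath;
    char *newpath;
};

struct demo_relocate {
    unsigned int count;
    struct demo_relocate_pair pairs[]; /* C99 flexible array member */
};

/* Size of a table holding cnt pairs; needs no scope-qualified name. */
#define DEMO_RELOCATE_SIZE(cnt) \
    (sizeof(struct demo_relocate) + (cnt) * sizeof(struct demo_relocate_pair))

/* Allocate a zeroed table with room for cnt pairs.  Returns NULL on
 * allocation failure; the caller releases it with free(). */
struct demo_relocate *demo_relocate_alloc(unsigned int cnt)
{
    struct demo_relocate *r = calloc(1, DEMO_RELOCATE_SIZE(cnt));
    if (r != NULL)
        r->count = cnt;
    return r;
}
```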
    
    As a side note, you are lucky you didn't try a C++ compiler with the 
    BLCR 0.7.x series.  Back then the "newpath" member in struct 
    cr_rstrt_relocate_pair was named "new"!!
    
    > Q3:
    >
    > Is it safe to write things into the file before the checkpoint itself?
    > I want to write information about relocations that will be needed by
    > my application into the same file. It seems to be working, but would
    > it be better being written after? Could it ever seek to the beginning
    > of a file explicitly during loading? Or will it always just start from
    > where the file was when it was given it? Does it ignore extra bits at
    > the end of a file? It seems to work fine with extra info at the
    > beginning of the file provided I make the seek on the file handle to
    > the appropriate point first. Is this 'as designed' and guaranteed to
    > work with future versions?
    >   
    
    By design, BLCR will never seek to the beginning of the context file (no 
    seeks at all, in fact).  This was decided both to allow exactly what you 
    are doing now, as well as to allow a checkpoint to be sent through a 
    non-seekable channel such as a pipe between processes or a socket 
    between nodes.  So, it *is* guaranteed to continue working.   The one 
    thing you might want to be aware of is that if there is an error while 
    checkpointing (or restarting), there is no guarantee about how many 
    bytes have been written (or read).  For instance, a failed checkpoint 
    may have written a useless partial file, while a failed restart may have 
    read only a portion of the file that was written at checkpoint time.  
    So, if you plan to have some way to recover from such a failure, then 
    you may need to take this into account if you ever place your own 
    data after BLCR's data.
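
    One way to arrange data placed before the context, as a sketch only: 
    a length-prefixed header, so that on restart you can consume exactly 
    your own bytes and leave the descriptor positioned where the context 
    begins, with no seek needed.  The app_header_* names and the framing 
    are assumptions made for this example, not anything BLCR defines.

```c
#include <errno.h>
#include <stdint.h>
#include <unistd.h>

/* Write exactly len bytes, retrying on short writes and EINTR. */
static int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Read exactly len bytes; an early EOF counts as an error. */
static int read_all(int fd, void *buf, size_t len)
{
    char *p = buf;
    while (len > 0) {
        ssize_t n = read(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (n == 0)
            return -1;
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Write a native-endian length followed by the application data.
 * Call this before the checkpoint is written to the same descriptor. */
int app_header_write(int fd, const void *data, uint32_t len)
{
    if (write_all(fd, &len, sizeof len) != 0)
        return -1;
    return write_all(fd, data, len);
}

/* Read the header back without seeking.  On success the descriptor is
 * positioned at the first byte after the header. */
int app_header_read(int fd, void *data, uint32_t cap, uint32_t *len_out)
{
    uint32_t len;
    if (read_all(fd, &len, sizeof len) != 0)
        return -1;
    if (len > cap)
        return -1;
    if (read_all(fd, data, len) != 0)
        return -1;
    *len_out = len;
    return 0;
}
```

    Since nothing here seeks, the same framing works over a pipe or a 
    socket as well as a regular file.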
    
    > Thanks,
    > Alan
    >   
    
    -- 
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group                 Tel: +1-510-495-2352
    HPC Research Department                   Fax: +1-510-486-6900
    Lawrence Berkeley National Laboratory     
    
