Re: Problem using BLCR with mvapich2

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Thu Sep 20 2007 - 10:38:35 PDT

    Patrice,
      In general, questions about checkpointing with MVAPICH2 should be 
    sent to the MVAPICH2 folks (you may have already done that; I am not 
    certain).  However, in this case I think I can guess the most likely 
    problem.
    
    My guess is that LD_LIBRARY_PATH and/or LD_PRELOAD have been set on the 
    "front end" but not on the machines where mpd will spawn the MPI 
    application processes, or they have been set in a manner (such as .login 
    files) that mpd is not reading on the "remote" nodes.  I understand that 
    in your case this is a single machine, but it is still possible that the 
    environment variables are not set in the context of the mpd daemons that 
    actually spawn MPI application processes.
    
    You may try a command like
      mpdrun -n 1 env | grep LD_
    to see what values (if any) the LD_LIBRARY_PATH and LD_PRELOAD variables 
    have in processes spawned by mpd.
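
    To compare the two environments side by side, the same check can be
    wrapped in a small shell function.  This is only a sketch:
    show_ld_vars is a hypothetical helper name, not part of MPICH2 or
    MVAPICH2, and it simply runs env through whatever launcher command it
    is given.

```shell
# show_ld_vars [LAUNCHER ...]
# Run `env` through the optional launcher command and print only the
# variables whose names start with LD_ (LD_LIBRARY_PATH, LD_PRELOAD, ...).
# With no arguments, `env` runs directly in the current shell's environment.
show_ld_vars() {
    "$@" env | grep '^LD_' || echo "(no LD_ variables set)"
}
```

    Running "show_ld_vars" with no arguments shows the values in the
    current shell, and "show_ld_vars mpdrun -n 1" shows what a process
    spawned by mpd sees.  If the second prints no LD_ variables, or
    different values, that points at how mpd was started.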
    
    If my guess above isn't correct or doesn't provide enough information 
    for you to resolve the problem, then you will need to ask the MVAPICH2 
    (or MPICH2) folks about how mpd is handling the environment for the 
    spawned processes.
    
    Sorry I cannot be of more help, but since ldd finds the libraries I can 
    only assume the problem is related to how mpd is starting the processes.
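
    As a rough illustration of why a library that ldd finds in one shell
    can still be missing for a spawned process, the sketch below checks
    whether a given soname exists in any directory of a colon-separated
    search path.  find_lib is a hypothetical helper; it only looks at the
    path you pass in, skips empty entries, and ignores the loader's cache
    and default directories, so it is not a substitute for ldd.

```shell
# find_lib SONAME SEARCH_PATH
# Print the first directory in the colon-separated SEARCH_PATH that contains
# a file named SONAME.  Return 1 if no directory does.
# Simplification in this sketch: empty path entries are skipped.
find_lib() {
    soname=$1
    rest=$2
    while :; do
        dir=${rest%%:*}                  # first remaining entry
        if [ -n "$dir" ] && [ -e "$dir/$soname" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
        case $rest in
            *:*) rest=${rest#*:} ;;      # drop the entry just checked
            *)   return 1 ;;             # no entries left
        esac
    done
}
```

    For example, run find_lib libcr.so.0 "$LD_LIBRARY_PATH" in your login
    shell, and then again with the LD_LIBRARY_PATH value reported by a
    process spawned by mpd, to see whether /usr/local/lib is visible in
    both.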
    
    -Paul
    
    Patrice Martinez wrote:
    > Hello,
    > I'm trying to run linpack benchmark using blcr and mvapich2 (and 
    > Infiniband).
    >
    > I'm using:
    > blcr-0.6.0,
    > mvapich2-1.0 compiled with blcr support
    > OFED-1.2.5.1,
    > linpack linked with mvapich2 and ofed libs
    >
    > For this test I'm using a single em64t computer, running a 2.6.21 
    > kernel on top of RHEL U4:
    > uname -a
    > Linux twing 2.6.21.5 #1 SMP Wed Jun 13 10:29:09 CEST 2007 x86_64 x86_64 x86_64 GNU/Linux
    >
    > BLCR is compiled with this kernel, the modules are inserted, and the 
    > following env vars are set as follows:
    >
    > echo $LD_LIBRARY_PATH
    > /opt/intel/fce/10.0.023/lib:/opt/intel/cc/9.1.039/lib::/usr/local/lib
    >
    > export LD_PRELOAD=/usr/local/lib/libcr.so.0:/lib64/tls/libpthread.so.0
    >
    >
    > I started mpdboot:
    > mpdboot --ncpus=4
    >
    > Then I try to run linpack:
    > mpiexec -n 4 ./xhpl
    > /usr/local/bin/mpdroot: error while loading shared libraries: libcr.so.0: cannot open shared object file: No such file or directory
    > mpiexec_twing (__init__ 1171): forked process failed; status=127
    > CTRL+C Caught... exiting
    >
    > It doesn't work, even though ldd can locate libcr:
    >
    > ldd /usr/local/bin/mpdroot
    >         /usr/local/lib/libcr.so (0x0000002a95557000)
    >         /lib64/tls/libpthread.so.0 (0x000000323fa00000)
    >         libibverbs.so.1 => /usr/lib64/libibverbs.so.1 (0x0000002a95690000)
    >         libibumad.so.1 => /usr/lib64/libibumad.so.1 (0x0000002a9579b000)
    >         libc.so.6 => /lib64/tls/libc.so.6 (0x000000323ef00000)
    >         libdl.so.2 => /lib64/libdl.so.2 (0x000000323ed00000)
    >         /lib64/ld-linux-x86-64.so.2 (0x000000323eb00000)
    >         libibcommon.so.1 => /usr/lib64/libibcommon.so.1 (0x0000002a958a6000)
    >
    >
    > ldd ./xhpl
    >         libpthread.so.0 => /lib64/tls/libpthread.so.0 (0x000000323fa00000)
    >         libibverbs.so.1 => /usr/lib64/libibverbs.so.1 (0x00002b4a28296000)
    >         libibumad.so.1 => /usr/lib64/libibumad.so.1 (0x00002b4a283a1000)
    >         libcr.so.0 => /usr/local/lib/libcr.so.0 (0x00002b4a284ab000)
    >         libc.so.6 => /lib64/tls/libc.so.6 (0x000000323ef00000)
    >         /lib64/ld-linux-x86-64.so.2 (0x000000323eb00000)
    >         libdl.so.2 => /lib64/libdl.so.2 (0x000000323ed00000)
    >         libibcommon.so.1 => /usr/lib64/libibcommon.so.1 (0x00002b4a285b4000)
    >
    >
    > Any idea what's wrong with this?
    >
    > Linpack runs well when linked with a build of mvapich2 compiled 
    > without blcr support.
    > -- 
    >
    > Cordialement/Best regards
    >
    > Patrice Martinez
    >
    > Linux Kernel Architect.
    > Bull, Architect of an Open World
    >
    > OFFICE : B1-405
    > PHONE  : +33 (0)4 76 29 74 69
    > EMAIL  : Patrice.martinez_at_bull_dot_net
    > ADDR   : BULL, 1 rue de Provence, BP 208, 38432 Echirolles Cedex, FRANCE
    >
    > Bull recrute : http://www.bull.fr/emploi
    >
    >   
    
    
    -- 
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group                 
    HPC Research Department                   Tel: +1-510-495-2352
    Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
    
