Re: cr_restart: ->cri_syscall(CR_OP_RSTRT_REAP): Invalid argument

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Fri Nov 04 2005 - 09:48:35 PST

  • Next message: Michael Klemm: "Re: cr_restart: ->cri_syscall(CR_OP_RSTRT_REAP): Invalid argument"
    I may have a solution.  The attached patch should cause BLCR to store
    the actual contents of any deleted mmapped file, rather than storing just
    the filename.  This should solve the problem if the file is not still
    open within NSCD (and thus potentially changing).  However, if NSCD is
    also attached to the file (via open() or mmap()) and expects to
    communicate with the application through this file, then there is no
    good way for BLCR to save and restore this "communication channel" - the
    best we could hope for in that case would be to "undelete" the file by
    linking it back into the filesystem with its original name.  That is
    likely to create a "leak" of such files and so I'd not consider it a
    general-purpose solution.
    Let me know if this patch works or not so I can include it in the next
    release (which I am hoping to put out next week).
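
As a rough illustration of the test the patch relies on, here is a minimal user-space sketch. It assumes, as the patch does, that the kernel marks an unlinked mapped file by appending " (deleted)" to its path (this is what shows up in /proc/<pid>/maps). The names has_deleted_suffix and count_deleted_mappings are hypothetical helpers made up for this example; they are not part of BLCR or vmadump.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Suffix appended to the path of an unlinked file; the same literal
 * string is compared against in the patch. */
#define DELETED_SUFFIX " (deleted)"

/* Hypothetical helper: return 1 if `name` ends in " (deleted)" and has
 * at least one character before the suffix, else 0.  This mirrors the
 * patch's (namelen > 10) && !strcmp(name + namelen - 10, ...) check. */
static int has_deleted_suffix(const char *name)
{
    size_t len, slen;

    assert(name != NULL);
    len = strlen(name);
    slen = strlen(DELETED_SUFFIX);      /* 10 characters */

    return len > slen && strcmp(name + len - slen, DELETED_SUFFIX) == 0;
}

/* Hypothetical helper: read a maps-style stream line by line, print each
 * line that ends in " (deleted)", and return how many were found.
 * Lines longer than the buffer are not handled specially in this sketch. */
static int count_deleted_mappings(FILE *maps)
{
    char line[4096];
    int count = 0;

    while (fgets(line, sizeof line, maps) != NULL) {
        line[strcspn(line, "\n")] = '\0';   /* strip trailing newline */
        if (has_deleted_suffix(line)) {
            puts(line);
            count++;
        }
    }
    return count;
}
```

To check a running application by hand, one could open /proc/<pid>/maps with fopen() and pass the stream to count_deleted_mappings(); any NSCD cache file that has been unlinked while still mapped should then be listed.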
    
    -Paul
    
    Michael Klemm wrote:
    > Hi Paul!
    >
    > We did some investigation here, and found the cause of the corrupted
    > checkpoints. (Call me Sherlock by now :-) ).
    >
    > Paul H. Hargrove wrote:
    >>   The second, less likely, option is that BLCR is terribly confused.  If
    >> you could 'ls -l /proc/<pid>/fd' and 'cat /proc/<pid>/maps' for the
    >> running application, look for the /var/run/nscd file in either place and
    >> let me know what you find.  If it is in either place, then BLCR is not
    >> confused.
    >
    > The file name "/var/run/nscd/xxxxxxxx" that was sketched by Christian is
    > a cache file of the NSCD (Name Service Cache Daemon) of Linux.  Today, I
    > disabled the service on the machine and the checkpoints can be restarted
    > now.
    >
    > It looks like BLCR gets confused by the mmap of NSCD's cache file.
    >  For now, we're perfectly satisfied as long as our local admin won't
    > complain about the missing NSCD.  However, for his thesis, Christian
    > will be forced to make tests on the cluster of our local computing
    > center.  On these machines, getting the NSCD disabled won't be that easy.
    >
    > Do you have any hints on how to solve the problem?
    >
    > Best regards
    >     -michael
    >
    > -- 
    > Computer Science Department 2, University of Erlangen-Nuremberg
    > Martensstrasse 3, D-91058 Erlangen, Germany
    > phone: ++49 (0)9131 85-28995, fax: ++49 (0)9131 85-28809
    > web: http://www2.informatik.uni-erlangen.de/~klemm
    
    
    -- 
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group                 
    HPC Research Department                   Tel: +1-510-495-2352
    Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
    
    
    Index: vmadump/vmadump.c
    ===================================================================
    RCS file: /var/local/cvs/lbnl_cr/vmadump/vmadump.c,v
    retrieving revision 1.59
    diff -u -u -r1.59 vmadump.c
    --- vmadump/vmadump.c	26 Sep 2005 20:19:59 -0000	1.59
    +++ vmadump/vmadump.c	2 Nov 2005 22:38:55 -0000
    @@ -1210,7 +1210,10 @@
     	    filename = default_map_name(map->vm_file, buffer, PAGE_SIZE);
     	head.namelen = strlen(filename);
     
    -	if (map->vm_flags & VM_IO) {
    +	if ((head.namelen > 10) && !strcmp(filename + head.namelen - 10, " (deleted)")) {
    +	    /* Region is a deleted file */
    +	    head.namelen = 0;
    +	} else if (map->vm_flags & VM_IO) {
     	    /* Region is an IO map. */
     	    
     	    /* Never store the contents of a VM_IO region */
    Index: vmadump4/vmadump_common.c
    ===================================================================
    RCS file: /var/local/cvs/lbnl_cr/vmadump4/vmadump_common.c,v
    retrieving revision 1.17
    diff -u -u -r1.17 vmadump_common.c
    --- vmadump4/vmadump_common.c	27 Oct 2005 18:29:32 -0000	1.17
    +++ vmadump4/vmadump_common.c	2 Nov 2005 22:38:55 -0000
    @@ -924,7 +924,10 @@
     	    filename = default_map_name(map->vm_file, buffer, PAGE_SIZE);
     	head.namelen = strlen(filename);
     
    -	if (map->vm_flags & VM_IO) {
    +	if ((head.namelen > 10) && !strcmp(filename + head.namelen - 10, " (deleted)")) {
    +	    /* Region is a deleted file */
    +	    head.namelen = 0;
    +	} else if (map->vm_flags & VM_IO) {
     	    /* Region is an IO map. */
     
     	    /* Never store the contents of a VM_IO region */
    
