12cd07c9a9
The problem was tracked to use of the grpcomm.onesided_barrier to control daemon/mpirun termination. This relied on messaging -and- required that the program counter jump from the errmgr back to grpcomm. On rare occasions, this jump did not occur, causing mpirun to hang.

This patch looks more invasive than it is - most of the affected files simply had one or two lines removed. The essence of the change is:

* pulled the job_complete and quit routines out of orterun and orted_main and put them in a common place
* modified the errmgr to directly call the new routines when termination is detected
* removed the grpcomm.onesided_barrier and its associated RML tag
* add a new "num_routes" API to the routed framework that reports back the number of dependent routes. When route_lost is called, the daemon's list of "children" is checked and adjusted if that route went to a "leaf" in the routing tree
* use connection termination between daemons to track rollup of the daemon tree. Daemons and HNP now terminate once num_routes returns zero

Also picked up in this commit is the addition of a new bool flag to the app_context struct, and increasing the job_control field from 8 to 16 bits. Both trivial.

This commit was SVN r23429.
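The rollup described in the last two bullets can be sketched roughly as follows. This is only an illustration, not the code from this commit: the `sketch_` names, the fixed-size child array, and the integer child ids are all hypothetical stand-ins for the routed framework's real data structures.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a daemon's list of "children" in the routing tree. */
#define SKETCH_MAX_CHILDREN 8

typedef struct {
    int    child_ids[SKETCH_MAX_CHILDREN]; /* daemons whose routes depend on this one */
    size_t num_children;                   /* number of valid entries in child_ids */
} sketch_route_table_t;

/* Analogue of the "num_routes" query: how many dependent routes remain. */
static size_t sketch_num_routes(const sketch_route_table_t *t)
{
    return t->num_children;
}

/*
 * Analogue of route_lost: if the lost route went to one of our children,
 * drop it from the list. Returns true once no dependent routes remain,
 * i.e. when this daemon would be free to terminate.
 */
static bool sketch_route_lost(sketch_route_table_t *t, int child_id)
{
    for (size_t i = 0; i < t->num_children; i++) {
        if (t->child_ids[i] == child_id) {
            /* Order does not matter, so overwrite with the last entry. */
            t->child_ids[i] = t->child_ids[t->num_children - 1];
            t->num_children--;
            break;
        }
    }
    return sketch_num_routes(t) == 0;
}
```

Under these assumptions, a daemon with children 3 and 5 keeps running after losing the route to 3, ignores the loss of a route that was never a child, and reports that it can terminate only after the route to 5 is also lost.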
49 lines
1.3 KiB
C
/*
 * Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
 *                         University Research and Technology
 *                         Corporation.  All rights reserved.
 * Copyright (c) 2004-2006 The University of Tennessee and The University
 *                         of Tennessee Research Foundation.  All rights
 *                         reserved.
 * Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
 *                         University of Stuttgart.  All rights reserved.
 * Copyright (c) 2004-2005 The Regents of the University of California.
 *                         All rights reserved.
 * $COPYRIGHT$
 *
 * Additional copyrights may follow
 *
 * $HEADER$
 */

/**
 * @file
 *
 * Locks to prevent loops inside ORTE
 */
#ifndef ORTE_LOCKS_H
#define ORTE_LOCKS_H

#include "orte_config.h"

#include "opal/sys/atomic.h"

BEGIN_C_DECLS

/* for everyone */
ORTE_DECLSPEC extern opal_atomic_lock_t orte_finalize_lock;

/* for HNPs */
ORTE_DECLSPEC extern opal_atomic_lock_t orte_abort_inprogress_lock;
ORTE_DECLSPEC extern opal_atomic_lock_t orte_jobs_complete_lock;
ORTE_DECLSPEC extern opal_atomic_lock_t orte_quit_lock;

/**
 * Initialize the locks
 */
ORTE_DECLSPEC int orte_locks_init(void);

END_C_DECLS

#endif /* #ifndef ORTE_LOCKS_H */
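The header only declares the locks, so the following is a minimal sketch of how a lock of this kind can "prevent loops": the first caller takes the lock and does the work, and any later or re-entrant call sees it already held and returns immediately. It uses C11 `atomic_flag` as a stand-in for `opal_atomic_lock_t`, and the `sketch_` names are hypothetical; the real ORTE code would presumably use OPAL's own primitives from `opal/sys/atomic.h`, which are not reproduced here.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Stand-in for a lock such as orte_finalize_lock (assumption: C11 atomic_flag). */
static atomic_flag sketch_finalize_lock = ATOMIC_FLAG_INIT;

/* Counts how many times the guarded body actually ran. */
static int sketch_finalize_runs = 0;

/*
 * Run-once guard: returns true if this call performed the finalize work,
 * false if another (possibly re-entrant) call had already claimed it.
 */
static bool sketch_finalize(void)
{
    /* test_and_set returns the previous value: true means already held. */
    if (atomic_flag_test_and_set(&sketch_finalize_lock)) {
        return false;
    }
    sketch_finalize_runs++;  /* the real work would go here */
    return true;
}
```

The lock is never released in this sketch, which is deliberate: the goal is to ensure the guarded path is entered at most once, not to serialize repeated entries.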