/* -*- C -*-
 *
 * Copyright (c) 2004-2007 The Trustees of Indiana University and Indiana
 *                         University Research and Technology
 *                         Corporation. All rights reserved.
 * Copyright (c) 2004-2006 The University of Tennessee and The University
 *                         of Tennessee Research Foundation. All rights
 *                         reserved.
 * Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
 *                         University of Stuttgart. All rights reserved.
 * Copyright (c) 2004-2005 The Regents of the University of California.
 *                         All rights reserved.
 * $COPYRIGHT$
 *
 * Additional copyrights may follow
 *
 * $HEADER$
 *
 */
#ifndef NS_REPLICA_H
#define NS_REPLICA_H

#include "orte_config.h"
#include "orte/orte_types.h"
#include "orte/orte_constants.h"
#include "opal/threads/mutex.h"
#include "opal/class/opal_object.h"
#include "orte/class/orte_pointer_array.h"
#include "orte/dss/dss.h"
#include "orte/mca/oob/oob_types.h"
#include "orte/mca/ns/base/base.h"

#if defined(c_plusplus) || defined(__cplusplus)
extern "C" {
#endif

Bring over the update to terminate orteds that are generated by a dynamic spawn such as comm_spawn. This introduces the concept of a job "family" - i.e., jobs that have a parent/child relationship. Comm_spawn'ed jobs have a parent (the one that spawned them). We track that relationship throughout the lineage - i.e., if a comm_spawned job in turn calls comm_spawn, then it has a parent (the one that spawned it) and a "root" job (the original job that started things).
Accordingly, there are new APIs to the name service to support the ability to get a job's parent, root, immediate children, and all its descendants. In addition, the terminate_job, terminate_orted, and signal_job APIs for the PLS have been modified to accept attributes that define the extent of their actions. For example, doing a "terminate_job" with an attribute of ORTE_NS_INCLUDE_DESCENDANTS will terminate the given jobid AND all jobs that descended from it.
I have tested this capability on a MacBook under rsh, Odin under SLURM, and LANL's Flash (bproc). It worked successfully on non-MPI jobs (both simple and including a spawn), and MPI jobs (again, both simple and with a spawn).
This commit was SVN r12597.
2006-11-14 19:34:59 +00:00
/*
 * globals
 */
#define NS_REPLICA_MAX_STRING_SIZE  256

/*
 * object for tracking vpids and jobids for job families
 * This structure is used to track the parent-child relationship between
 * jobs. The "root" of the family is the initial parent - each child has
 * a record under that parent. Any child that subsequently spawns its own
 * children will form a list of jobids beneath them.
 *
 * each object records the jobid of the job it represents, and the next vpid
 * that will be assigned when a range is requested.
 */
typedef struct {
    opal_list_item_t super;
    orte_jobid_t jobid;
    orte_vpid_t next_vpid;
    opal_list_t children;
} orte_ns_replica_jobitem_t;
OBJ_CLASS_DECLARATION(orte_ns_replica_jobitem_t);
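/*
 * Illustrative sketch only, not part of this header's API. It shows one way
 * the job-family record described above could behave: each record carries a
 * jobid, the next vpid to hand out, and a list of child records. The names
 * sketch_job_t, sketch_add_child and sketch_reserve_range are hypothetical,
 * and plain pointers stand in for opal_list_t / opal_list_item_t so the
 * fragment compiles on its own. The real replica code may differ.
 *
```c
#include <assert.h>
#include <stddef.h>

// Hypothetical stand-in for orte_ns_replica_jobitem_t (assumption: plain
// pointers replace the OPAL list machinery).
typedef struct sketch_job {
    unsigned jobid;                   // job this record represents
    unsigned next_vpid;               // next vpid to hand out for this job
    struct sketch_job *first_child;   // head of this job's list of children
    struct sketch_job *next_sibling;  // next entry in the parent's child list
} sketch_job_t;

// Append child to the end of parent's child list.
static void sketch_add_child(sketch_job_t *parent, sketch_job_t *child)
{
    assert(parent != NULL && child != NULL);
    child->next_sibling = NULL;
    if (parent->first_child == NULL) {
        parent->first_child = child;
        return;
    }
    sketch_job_t *last = parent->first_child;
    while (last->next_sibling != NULL) {
        last = last->next_sibling;
    }
    last->next_sibling = child;
}

// Reserve `range` consecutive vpids and return the first one of the block.
// Assumption: the block starts at next_vpid, which then advances by range.
static unsigned sketch_reserve_range(sketch_job_t *job, unsigned range)
{
    assert(job != NULL);
    unsigned start = job->next_vpid;
    job->next_vpid += range;
    return start;
}
```
 *
 * In this sketch a spawned job is attached with sketch_add_child(), and each
 * call to sketch_reserve_range() returns the start of a fresh block of vpids.
 */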

struct orte_ns_replica_tagitem_t {
    opal_object_t super;
    orte_rml_tag_t tag;  /**< OOB tag */
    char *name;          /**< Name associated with tag */
};
typedef struct orte_ns_replica_tagitem_t orte_ns_replica_tagitem_t;

OBJ_CLASS_DECLARATION(orte_ns_replica_tagitem_t);

struct orte_ns_replica_dti_t {
    opal_object_t super;
    orte_data_type_t id;  /**< data type id */
    char *name;           /**< Name associated with data type */
};
typedef struct orte_ns_replica_dti_t orte_ns_replica_dti_t;

OBJ_CLASS_DECLARATION(orte_ns_replica_dti_t);

/*
 * globals needed within component
 */
typedef struct {
    size_t max_size, block_size;
    orte_nodeid_t next_nodeid;
    orte_pointer_array_t *nodenames;
    orte_jobid_t num_jobids;
    opal_list_t jobs;
    orte_pointer_array_t *tags;
    orte_rml_tag_t num_tags;
    orte_pointer_array_t *dts;
    orte_data_type_t num_dts;
    int debug;
    bool isolate;
    opal_mutex_t mutex;
} orte_ns_replica_globals_t;

extern orte_ns_replica_globals_t orte_ns_replica;

/*
 * Module open / close
 */
int orte_ns_replica_open(void);
int orte_ns_replica_close(void);


/*
 * Startup / Shutdown
 */
mca_ns_base_module_t* orte_ns_replica_init(int *priority);
int orte_ns_replica_module_init(void);
int orte_ns_replica_finalize(void);

/*
 * oob interface
 */

void orte_ns_replica_recv(int status, orte_process_name_t* sender,
                          orte_buffer_t* buffer, orte_rml_tag_t tag, void* cbdata);

/*
 * NODE FUNCTIONS
 */
int orte_ns_replica_create_nodeids(orte_nodeid_t **nodeids, orte_std_cntr_t *nnodes, char **nodenames);

int orte_ns_replica_get_node_info(char ***nodenames, orte_std_cntr_t num_nodes, orte_nodeid_t *nodeids);

/*
 * JOB FUNCTIONS
 */
int orte_ns_replica_create_jobid(orte_jobid_t *jobid, opal_list_t *attrs);

int orte_ns_replica_get_job_descendants(orte_jobid_t **descendants, orte_std_cntr_t *num_desc, orte_jobid_t job);

int orte_ns_replica_get_job_children(orte_jobid_t **descendants, orte_std_cntr_t *num_desc, orte_jobid_t job);

int orte_ns_replica_get_root_job(orte_jobid_t *root_job, orte_jobid_t job);

int orte_ns_replica_get_parent_job(orte_jobid_t *parent, orte_jobid_t job);

int orte_ns_replica_get_job_family(orte_jobid_t **family, orte_std_cntr_t *num_members, orte_jobid_t job);

int orte_ns_replica_reserve_range(orte_jobid_t job,
                                  orte_vpid_t range,
                                  orte_vpid_t *startvpid);

int orte_ns_replica_get_vpid_range(orte_jobid_t job, orte_vpid_t *range);

/*
 * GENERAL FUNCTIONS
 */
int orte_ns_replica_get_peers(orte_process_name_t **procs,
                              orte_std_cntr_t *num_procs, opal_list_t *attrs);

int orte_ns_replica_assign_rml_tag(orte_rml_tag_t *tag,
                                   char *name);


int orte_ns_replica_define_data_type(const char *name,
                                     orte_data_type_t *type);

int orte_ns_replica_create_my_name(void);


/*
 * DIAGNOSTIC FUNCTIONS
 */
int orte_ns_replica_dump_jobs(void);
int orte_ns_replica_dump_jobs_fn(orte_buffer_t *buffer);

int orte_ns_replica_dump_tags(void);
int orte_ns_replica_dump_tags_fn(orte_buffer_t *buffer);

int orte_ns_replica_dump_datatypes(void);
int orte_ns_replica_dump_datatypes_fn(orte_buffer_t *buffer);

int orte_ns_replica_ft_event(int state);

/*
 * INTERNAL SUPPORT FUNCTIONS
 */

/* find a job's record, wherever it may be located on the list of job families.
 * this function searches the entire list of job families, traversing the list
 * of all jobs in each family, until it finds the specified job. It then returns
 * a pointer to that job's info structure. It returns
 * NULL (without logging an error) if no record is found
 */
orte_ns_replica_jobitem_t* orte_ns_replica_find_job(orte_jobid_t job);

/* find the root job for the specified job.
 * this function searches the entire list of job families, traversing the list
 * of all jobs in each family, until it finds the specified job. It then returns
 * a pointer to the root job's info structure for that job family. It returns
 * NULL (without logging an error) if no record is found
 */
orte_ns_replica_jobitem_t* orte_ns_replica_find_root_job(orte_jobid_t job);

/* find a job's record on a specified root's family tree.
 * this function finds the family record for the specified root job. It then
 * traverses the children of that root until it finds the specified job, and then
 * returns a pointer to that job's info structure. If root=jobid, then it will
 * return a pointer to the root job's info structure. It returns
 * NULL (without logging an error) if no record is found
 */
orte_ns_replica_jobitem_t* orte_ns_replica_search_job_family_tree(orte_jobid_t root, orte_jobid_t jobid);

/* given a job's record, create a flattened list of descendants below it */
void orte_ns_replica_construct_flattened_tree(opal_list_t *tree, orte_ns_replica_jobitem_t *ptr);
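/*
 * Illustrative sketch only, not the replica's implementation. It shows one
 * plausible way to flatten a job's descendants: walk the child lists depth
 * first and append each jobid to an output array. The names flat_job_t and
 * flat_collect are hypothetical, plain pointers stand in for opal_list_t,
 * and both the visiting order and the exclusion of the starting job itself
 * are assumptions.
 *
```c
#include <assert.h>
#include <stddef.h>

// Hypothetical, simplified job record (assumption: not the ORTE type).
typedef struct flat_job {
    unsigned jobid;
    struct flat_job *first_child;   // head of this job's child list
    struct flat_job *next_sibling;  // next entry in the parent's child list
} flat_job_t;

// Append the jobid of every descendant of `job` (not `job` itself) to `out`,
// depth first. `count` is the number of entries already stored. Returns the
// new total; writes stop once `cap` entries are stored, counting continues.
static size_t flat_collect(const flat_job_t *job, unsigned *out,
                           size_t cap, size_t count)
{
    assert(job != NULL);
    for (const flat_job_t *c = job->first_child; c != NULL; c = c->next_sibling) {
        if (count < cap) {
            out[count] = c->jobid;
        }
        count++;
        count = flat_collect(c, out, cap, count);  // descend before moving on
    }
    return count;
}
```
 *
 * A caller would pass count = 0 for the starting job and compare the return
 * value with cap to detect truncation.
 */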

/* search down a tree, following all the children's branches, to find the specified
 * job. Return a pointer to that object, and a pointer to the parent object.
 * This function is called recursively, so the ptr to the current object being
 * looked at is passed into it
 */
orte_ns_replica_jobitem_t *down_search(orte_ns_replica_jobitem_t *ptr,
                                       orte_ns_replica_jobitem_t **parent_ptr,
                                       orte_jobid_t job);
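/*
 * Illustrative sketch only, not the replica's implementation. It shows how a
 * recursive downward search can return both the matching record and its
 * parent, as the comment above describes. The names ds_job_t and
 * ds_down_search are hypothetical, plain pointers stand in for opal_list_t,
 * and setting the parent to NULL when the starting record itself matches is
 * an assumption.
 *
```c
#include <assert.h>
#include <stddef.h>

// Hypothetical, simplified job record (assumption: not the ORTE type).
typedef struct ds_job {
    unsigned jobid;
    struct ds_job *first_child;   // head of this job's child list
    struct ds_job *next_sibling;  // next entry in the parent's child list
} ds_job_t;

// Search the subtree rooted at `ptr` for `job`. On success, return its record
// and store its parent's record in *parent_ptr (NULL if `ptr` itself matches).
// On failure, return NULL and leave *parent_ptr untouched.
static ds_job_t *ds_down_search(ds_job_t *ptr, ds_job_t **parent_ptr,
                                unsigned job)
{
    assert(ptr != NULL && parent_ptr != NULL);
    if (ptr->jobid == job) {
        *parent_ptr = NULL;
        return ptr;
    }
    for (ds_job_t *c = ptr->first_child; c != NULL; c = c->next_sibling) {
        ds_job_t *found = ds_down_search(c, parent_ptr, job);
        if (found != NULL) {
            if (found == c) {
                *parent_ptr = ptr;  // direct child: current record is the parent
            }
            return found;           // deeper match: parent was set further down
        }
    }
    return NULL;
}
```
 *
 * The parent is recorded at the level where the match is a direct child, so
 * outer levels of the recursion only pass the result back up.
 */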

ORTE_MODULE_DECLSPEC extern mca_ns_base_component_t mca_ns_replica_component;

#if defined(c_plusplus) || defined(__cplusplus)
}
#endif
#endif