
These changes were mostly captured in a prior RFC (except for #2 below) and are aimed specifically at improving startup performance and setting up the remaining modifications described in that RFC.

The commit has been tested for C/R and Cray operations, and on Odin (SLURM, rsh) and RoadRunner (TM). I tried to update all environments, but obviously could not test them all. I know that Windows needs some work, and have highlighted what is known to be needed in the odls process component.

This represents a lot of work by Brian, Tim P, Josh, and myself, with much advice from Jeff and others. For posterity, I have appended a copy of the email describing the work that was done:

As we have repeatedly noted, the modex operation in MPI_Init is the single greatest consumer of time during startup. To date, we have executed that operation as an ORTE stage gate that held the process until a startup message containing all required modex (and OOB contact info - see #3 below) info could be sent to it. Each process would send its data to the HNP's registry, which assembled and sent the message when all processes had reported in.

In addition, ORTE had taken responsibility for monitoring process status as it progressed through a series of "stage gates". The process reported its status at each gate, and ORTE would then send a "release" message once all procs had reported in.

The incoming changes revamp these procedures in three ways:

1. eliminating the ORTE stage gate system and cleanly delineating responsibility between the OMPI and ORTE layers for MPI init/finalize. The modex stage gate (STG1) has been replaced by a collective operation in the modex itself that performs an allgather on the required modex info. The allgather is implemented using the orte_grpcomm framework since the BTLs are not active at that point. At the moment, the grpcomm framework only has a "basic" component analogous to OMPI's "basic" coll framework - I would recommend that the MPI team create additional, more advanced components to improve performance of this step.
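To make the semantics concrete, here is a minimal standalone sketch of what the modex allgather produces. The function name and signature are hypothetical, not the actual grpcomm API: each of nprocs contributes a fixed-size blob, and every proc ends up with the rank-ordered concatenation of all blobs (the real implementation routes this through orte_grpcomm because the BTLs are not yet active during MPI_Init).

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch of the allgather semantics used by the modex:
 * gather every rank's contribution into one rank-ordered buffer.
 * Returns a malloc'd buffer of blob_len * nprocs bytes (caller frees),
 * or NULL on allocation failure. */
static unsigned char *modex_allgather(unsigned char **contribs,
                                      size_t blob_len, int nprocs)
{
    int rank;
    unsigned char *all = (unsigned char *) malloc(blob_len * (size_t) nprocs);
    if (NULL == all) {
        return NULL;
    }
    for (rank = 0; rank < nprocs; ++rank) {
        memcpy(all + (size_t) rank * blob_len, contribs[rank], blob_len);
    }
    return all;
}
```

The "basic" grpcomm component realizes this by gathering to a root and broadcasting the result; a smarter component could use recursive doubling or a ring, which is exactly the improvement suggested above.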

The other stage gates have been replaced by orte_grpcomm barrier functions. We tried to use MPI barriers instead (since the BTLs are active at that point), but - as we discussed on the telecon - these are not currently true barriers, so the job would hang when we fell through while messages were still in flight. Note that the grpcomm barrier doesn't actually resolve that problem, but Brian has pointed out that we are unlikely to ever see it violated. Again, you might want to spend a little time on an advanced barrier algorithm as the one in "basic" is very simplistic.
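The "basic" barrier is essentially a fan-in counter with a broadcast release: no one proceeds until everyone has reported in. A toy single-address-space sketch (the struct and function names are illustrative, not the grpcomm implementation):

```c
/* Illustrative fan-in barrier: the coordinator counts arrivals and
 * signals release only when all nprocs have reported in.  In the real
 * "basic" grpcomm component the count lives on a root process and the
 * release is an OOB broadcast. */
typedef struct {
    int arrived;   /* procs that have reported in so far */
    int nprocs;    /* total procs participating */
} basic_barrier_t;

/* Report one arrival; returns 1 when the barrier releases, else 0. */
static int barrier_report(basic_barrier_t *b)
{
    ++b->arrived;
    if (b->arrived == b->nprocs) {
        b->arrived = 0;   /* reset so the barrier can be reused */
        return 1;
    }
    return 0;
}
```

A tree-based fan-in/fan-out would reduce the O(nprocs) messages at the root to O(log nprocs) depth, which is the kind of "advanced barrier algorithm" suggested above.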

Summarizing this change: ORTE no longer tracks process state nor has direct responsibility for synchronizing jobs. This is now done via collective operations within the MPI layer, albeit using ORTE collective communication services. I -strongly- urge the MPI team to implement advanced collective algorithms to improve the performance of this critical procedure.


2. reducing the volume of data exchanged during modex. Data in the modex consisted of the process name, the name of the node where that process is located (expressed as a string), plus a string representation of all contact info. The nodename was required in order for the modex to determine whether the process was local or not - in addition, some people like to have it available to print pretty error messages when a connection fails.

The size of this data has been reduced in three ways:

(a) reducing the size of the process name itself. The process name consisted of two 32-bit fields for the jobid and vpid - far more range than any current system, or any system likely to exist in the near future, requires. Accordingly, the default size of these fields has been reduced to 16-bits, which means you can have 32k procs in each of 32k jobs. Since the daemons must have a vpid, and we require one daemon/node, this also restricts the default configuration to 32k nodes.

To support any future "mega-clusters", a configuration option --enable-jumbo-apps has been added. This option increases the jobid and vpid field sizes to 32-bits. Someday, if necessary, someone can add yet another option to increase them to 64-bits, I suppose.
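The effect of the configure option can be pictured as conditional field widths on the name struct. This is a sketch under the assumption that the fields are plain fixed-width integers; the actual ORTE typedefs and limits may differ (e.g. reserved sentinel values are one reason the usable range is 32k rather than 64k):

```c
#include <stdint.h>

/* Sketch of configurable name-field widths.  ORTE_ENABLE_JUMBO_APPS
 * is set by --enable-jumbo-apps; the real headers may differ. */
#if ORTE_ENABLE_JUMBO_APPS
typedef uint32_t orte_jobid_t;
typedef uint32_t orte_vpid_t;
#else
typedef uint16_t orte_jobid_t;   /* ~32k jobs (some values reserved) */
typedef uint16_t orte_vpid_t;    /* ~32k procs per job */
#endif

typedef struct {
    orte_jobid_t jobid;
    orte_vpid_t  vpid;
} orte_process_name_t;            /* 4 bytes default, 8 with jumbo apps */
```

Halving each field halves the size of every process name carried in the modex, which is where the aggregate savings discussed below come from.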

(b) replacing the string nodename with an integer nodeid. Since we have one daemon/node, the nodeid corresponds to the local daemon's vpid. This replaces an often lengthy string with only 2 (or at most 4) bytes, a substantial reduction.

(c) when the mca param requesting that nodenames be sent (to support pretty error messages) is set, a second mca param is now used to request the FQDN - otherwise, the domain name is stripped (by default) from the nodename to save space. If someone wants to combine those into a single param somehow (perhaps with an argument?), they are welcome to do so - I didn't want to alter what people are already using.

While these may seem like small savings, they actually amount to a significant impact when aggregated across the entire modex operation. Since every proc must receive the modex data regardless of the collective used to send it, just reducing the size of the process name removes nearly 400MBytes of communication from a 32k proc job (admittedly, much of this comm may occur in parallel). So it does add up pretty quickly.


3. routing RML messages to reduce connections. The default messaging system remains point-to-point - i.e., each proc opens a socket to every proc it communicates with and sends its messages directly. A new option uses the orteds as routers - i.e., each proc only opens a single socket to its local orted. All messages are sent from the proc to the orted, which forwards the message to the orted on the node where the intended recipient proc is located - that orted then forwards the message to its local proc (the recipient). This greatly reduces the connection storm we have encountered during startup.
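The next-hop decision in the routed mode is trivial precisely because there is one orted per node: a proc always hands messages to its local orted, and an orted forwards straight to the orted on the target's node. A hedged sketch (the types and function below are illustrative, not the routed framework's API; the nodeid-to-orted mapping stands in for the launch-time process map every orted receives):

```c
/* Illustrative endpoint: which node am I on, and am I an orted? */
typedef struct {
    int nodeid;     /* vpid of the daemon on this endpoint's node */
    int is_daemon;  /* nonzero for an orted, zero for an app proc */
} endpoint_t;

/* Return the nodeid whose orted gets the message next on the way
 * from `self` toward `target`.  An app proc always routes through
 * its local orted; an orted forwards directly to the orted on the
 * target's node, which then delivers to its local proc. */
static int next_hop_nodeid(endpoint_t self, endpoint_t target)
{
    if (!self.is_daemon) {
        return self.nodeid;   /* proc -> local orted, the only socket it has */
    }
    return target.nodeid;     /* orted -> target's orted (or itself, locally) */
}
```

With this scheme each proc holds exactly one socket regardless of job size, which is what eliminates the connection storm.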

It also has the benefit of removing the sharing of every proc's OOB contact info with every other proc. The orted routing tables are populated during launch since every orted gets a map of where every proc is being placed. Each proc, therefore, only needs to know the contact info for its local daemon, which is passed in via the environment when the proc is fork/exec'd by the daemon. This alone removes ~50 bytes/process of communication that was in the current STG1 startup message - so for our 32k proc job, this saves us roughly 32k*50 = 1.6MBytes sent to 32k procs = 51GBytes of messaging.

Note that you can use the new routing method by specifying -mca routed tree - if you so desire. This mode will become the default at some point in the future.


There are a few minor additional changes in the commit that I'll just note in passing:

* propagation of command line mca params to the orteds - fixes ticket #1073. See note there for details.

* requiring "finalize" prior to "exit" for MPI procs - fixes ticket #1144. See note there for details.

* cleanup of some stale header files

This commit was SVN r16364.
This commit is contained in:
Ralph Castain 2007-10-05 19:48:23 +00:00
parent ada43fef9e
commit 54b2cf747e
138 changed files: 4212 additions and 2933 deletions


@ -517,6 +517,25 @@ AC_DEFINE_UNQUOTED([OPAL_ENABLE_TRACE], [$opal_want_trace],
[Enable run-time tracing of internal functions])
#
# Jumbo application support
#
AC_MSG_CHECKING([if want jumbo app support])
AC_ARG_ENABLE([jumbo-apps],
[AC_HELP_STRING([--enable-jumbo-apps],
[Enable support for applications in excess of 32K processes and/or 32K jobs, or running on clusters in excess of 32k nodes (default: disabled)])])
if test "$enable_jumbo_apps" = "yes"; then
AC_MSG_RESULT([yes])
orte_want_jumbo_apps=1
else
AC_MSG_RESULT([no])
orte_want_jumbo_apps=0
fi
AC_DEFINE_UNQUOTED([ORTE_ENABLE_JUMBO_APPS], [$orte_want_jumbo_apps],
[Enable support for applications in excess of 32K processes and/or 32K jobs, or running on clusters in excess of 32k nodes])
#
# Cross-compile data
#


@ -1141,8 +1141,10 @@ ompi_proc_t **ompi_comm_get_rprocs ( ompi_communicator_t *local_comm,
goto err_exit;
}
/* decode the names into a proc-list */
rc = ompi_proc_unpack(rbuf, rsize, &rprocs );
/* decode the names into a proc-list -- will never add a new proc
as the result of this operation, so no need to get the newprocs
list or call PML add_procs(). */
rc = ompi_proc_unpack(rbuf, rsize, &rprocs, NULL, NULL);
OBJ_RELEASE(rbuf);
err_exit:


@ -51,6 +51,7 @@
#include "ompi/info/info.h"
#include "ompi/constants.h"
#include "ompi/mca/pml/pml.h"
#include "ompi/runtime/ompi_module_exchange.h"
#include "orte/util/proc_info.h"
#include "orte/dss/dss.h"
@ -63,6 +64,7 @@
#include "orte/mca/rmgr/base/base.h"
#include "orte/mca/smr/smr_types.h"
#include "orte/mca/rml/rml.h"
#include "orte/mca/grpcomm/grpcomm.h"
#include "orte/runtime/runtime.h"
@ -86,8 +88,8 @@ int ompi_comm_connect_accept ( ompi_communicator_t *comm, int root,
ompi_group_t *group=comm->c_local_group;
orte_process_name_t *rport=NULL, tmp_port_name;
orte_buffer_t *nbuf=NULL, *nrbuf=NULL;
ompi_proc_t **proc_list=NULL;
int i,j;
ompi_proc_t **proc_list=NULL, **new_proc_list;
int i,j, new_proc_len;
ompi_group_t *new_group_pointer;
size = ompi_comm_size ( comm );
@ -219,11 +221,77 @@ int ompi_comm_connect_accept ( ompi_communicator_t *comm, int root,
goto exit;
}
rc = ompi_proc_unpack(nrbuf, rsize, &rprocs);
rc = ompi_proc_unpack(nrbuf, rsize, &rprocs, &new_proc_len, &new_proc_list);
if ( OMPI_SUCCESS != rc ) {
goto exit;
}
/* If we added new procs, we need to do the modex and then call
PML add_procs */
if (new_proc_len > 0) {
opal_list_t all_procs;
orte_namelist_t *name;
orte_buffer_t mdx_buf, rbuf;
OBJ_CONSTRUCT(&all_procs, opal_list_t);
if (send_first) {
for (i = 0 ; i < group->grp_proc_count ; ++i) {
name = OBJ_NEW(orte_namelist_t);
name->name = &(ompi_group_peer_lookup(group, i)->proc_name);
opal_list_append(&all_procs, &name->item);
}
for (i = 0 ; i < rsize ; ++i) {
name = OBJ_NEW(orte_namelist_t);
name->name = &(rprocs[i]->proc_name);
opal_list_append(&all_procs, &name->item);
}
} else {
for (i = 0 ; i < rsize ; ++i) {
name = OBJ_NEW(orte_namelist_t);
name->name = &(rprocs[i]->proc_name);
opal_list_append(&all_procs, &name->item);
}
for (i = 0 ; i < group->grp_proc_count ; ++i) {
name = OBJ_NEW(orte_namelist_t);
name->name = &(ompi_group_peer_lookup(group, i)->proc_name);
opal_list_append(&all_procs, &name->item);
}
}
OBJ_CONSTRUCT(&mdx_buf, orte_buffer_t);
if (OMPI_SUCCESS != (rc = ompi_modex_get_my_buffer(&mdx_buf))) {
ORTE_ERROR_LOG(rc);
goto exit;
}
OBJ_CONSTRUCT(&rbuf, orte_buffer_t);
if (OMPI_SUCCESS != (rc = orte_grpcomm.allgather_list(&all_procs,
&mdx_buf,
&rbuf))) {
ORTE_ERROR_LOG(rc);
goto exit;
}
OBJ_DESTRUCT(&mdx_buf);
if (OMPI_SUCCESS != (rc = ompi_modex_process_data(&rbuf))) {
ORTE_ERROR_LOG(rc);
goto exit;
}
OBJ_DESTRUCT(&rbuf);
/*
while (NULL != (item = opal_list_remove_first(&all_procs))) {
OBJ_RELEASE(item);
}
OBJ_DESTRUCT(&all_procs);
*/
MCA_PML_CALL(add_procs(new_proc_list, new_proc_len));
}
OBJ_RELEASE(nrbuf);
if ( rank == root ) {
OBJ_RELEASE(nbuf);
@ -407,7 +475,6 @@ ompi_comm_start_processes(int count, char **array_of_commands,
orte_std_cntr_t num_apps, ai;
orte_jobid_t new_jobid=ORTE_JOBID_INVALID;
orte_app_context_t **apps=NULL;
orte_proc_state_t state;
opal_list_t attributes;
opal_list_item_t *item;
@ -651,6 +718,7 @@ ompi_comm_start_processes(int count, char **array_of_commands,
return MPI_ERR_SPAWN;
}
#if 0
/* tell the RTE that we want to be cross-connected to the children so we receive
* their ORTE-level information - e.g., OOB contact info - when they
* reach the STG1 stage gate
@ -664,6 +732,7 @@ ompi_comm_start_processes(int count, char **array_of_commands,
opal_progress_event_users_decrement();
return MPI_ERR_SPAWN;
}
#endif
/* check for timing request - get stop time and report elapsed time if so */
if (timing) {


@ -459,6 +459,7 @@ int mca_pml_ob1_ft_event( int state )
ompi_proc_t** procs = NULL;
size_t num_procs;
int ret, p;
orte_buffer_t mdx_buf, rbuf;
if(OPAL_CRS_CHECKPOINT == state) {
;
@ -469,6 +470,10 @@ int mca_pml_ob1_ft_event( int state )
else if(OPAL_CRS_RESTART == state) {
/*
* Get a list of processes
* NOTE: Do *not* call ompi_proc_finalize as there are many places in
the code that point to individual procs in this structure. For our
* needs here we only need to fix up the modex, bml and pml
* references.
*/
procs = ompi_proc_all(&num_procs);
if(NULL == procs) {
@ -487,6 +492,9 @@ int mca_pml_ob1_ft_event( int state )
return ret;
}
/*
* Make sure the modex is NULL so it can be re-initialized
*/
for(p = 0; p < (int)num_procs; ++p) {
if( NULL != procs[p]->proc_modex ) {
OBJ_RELEASE(procs[p]->proc_modex);
@ -494,6 +502,9 @@ int mca_pml_ob1_ft_event( int state )
}
}
/*
* Init the modex structures
*/
if (OMPI_SUCCESS != (ret = ompi_modex_init())) {
opal_output(0,
"pml:ob1: ft_event(Restart): modex_init Failed %d",
@ -501,6 +512,16 @@ int mca_pml_ob1_ft_event( int state )
return ret;
}
/*
* Load back up the hostname/arch information into the modex
*/
if (OMPI_SUCCESS != (ret = ompi_proc_publish_info())) {
opal_output(0,
"pml:ob1: ft_event(Restart): proc_init Failed %d",
ret);
return ret;
}
}
else if(OPAL_CRS_TERM == state ) {
;
@ -527,52 +548,61 @@ int mca_pml_ob1_ft_event( int state )
}
else if(OPAL_CRS_RESTART == state) {
/*
* Re-subscribe to the modex information
* Exchange the modex information once again
*/
if (OMPI_SUCCESS != (ret = ompi_modex_subscribe_job(ORTE_PROC_MY_NAME->jobid))) {
OBJ_CONSTRUCT(&mdx_buf, orte_buffer_t);
if (OMPI_SUCCESS != (ret = ompi_modex_get_my_buffer(&mdx_buf))) {
opal_output(0,
"pml:ob1: ft_event(Restart): Failed to subscribe to the modex information %d",
"pml:ob1: ft_event(Restart): Failed ompi_modex_get_my_buffer() = %d",
ret);
return ret;
}
opal_output_verbose(10, ompi_cr_output,
"pml:ob1: ft_event(Restart): Enter Stage Gate 1");
if (ORTE_SUCCESS != (ret = orte_smr.set_proc_state(orte_process_info.my_name,
ORTE_PROC_STATE_AT_STG1, 0))) {
opal_output(0,
"pml:ob1: ft_event(Restart): Stage Gate 1 Failed %d",
ret);
return ret;
}
if (ORTE_SUCCESS != (ret = orte_grpcomm.xcast_gate(orte_gpr.deliver_notify_msg))) {
opal_output(0,
"pml:ob1: ft_event(Restart): Stage Gate 1 Failed %d",
ret);
return ret;
}
if( OMPI_SUCCESS != (ret = mca_pml_ob1_add_procs(procs, num_procs) ) ) {
opal_output(0, "pml:ob1: readd_procs: Failed in add_procs (%d)", ret);
return ret;
}
/*
* Set the STAGE 2 State
* Do the allgather exchange of information
*/
opal_output_verbose(10, ompi_cr_output,
"pml:ob1: ft_event(Restart): Enter Stage Gate 2");
if (ORTE_SUCCESS != (ret = orte_smr.set_proc_state(orte_process_info.my_name,
ORTE_PROC_STATE_AT_STG2, 0))) {
opal_output(0,"pml:ob1: ft_event(Restart): Stage Gate 1 Failed %d",
OBJ_CONSTRUCT(&rbuf, orte_buffer_t);
if (OMPI_SUCCESS != (ret = orte_grpcomm.allgather(&mdx_buf, &rbuf))) {
opal_output(0,
"pml:ob1: ft_event(Restart): Failed orte_grpcomm.allgather() = %d",
ret);
return ret;
}
OBJ_DESTRUCT(&mdx_buf);
/*
* Process the modex data into the proc structures
*/
if (OMPI_SUCCESS != (ret = ompi_modex_process_data(&rbuf))) {
opal_output(0,
"pml:ob1: ft_event(Restart): Failed ompi_modex_process_data() = %d",
ret);
return ret;
}
OBJ_DESTRUCT(&rbuf);
/*
* Fill in remote proc information
*/
if (OMPI_SUCCESS != (ret = ompi_proc_get_info())) {
opal_output(0,
"pml:ob1: ft_event(Restart): Failed ompi_proc_get_info() = %d",
ret);
return ret;
}
if (ORTE_SUCCESS != (ret = orte_grpcomm.xcast_gate(orte_gpr.deliver_notify_msg))) {
opal_output(0,"pml:ob1: ft_event(Restart): Stage Gate 1 Failed %d",
ret);
/*
* Startup the PML stack now that the modex is running again
* Add the new procs
*/
if( OMPI_SUCCESS != (ret = mca_pml_ob1_add_procs(procs, num_procs) ) ) {
opal_output(0, "pml:ob1: ft_event(Restart): Failed in add_procs (%d)", ret);
return ret;
}
/* Is this barrier necessary ? JJH */
if (OMPI_SUCCESS != (ret = orte_grpcomm.barrier())) {
opal_output(0, "pml:ob1: ft_event(Restart): Failed in orte_grpcomm.barrier (%d)", ret);
return ret;
}


@ -104,11 +104,9 @@ void ompi_proc_destruct(ompi_proc_t* proc)
int ompi_proc_init(void)
{
orte_process_name_t *peers;
orte_std_cntr_t i, npeers, datalen;
void *data;
orte_buffer_t* buf;
uint32_t ui32;
orte_std_cntr_t i, npeers;
int rc;
uint32_t ui32;
OBJ_CONSTRUCT(&ompi_proc_list, opal_list_t);
OBJ_CONSTRUCT(&ompi_proc_lock, opal_mutex_t);
@ -132,13 +130,43 @@ int ompi_proc_init(void)
rc = ompi_arch_compute_local_id(&ui32);
if (OMPI_SUCCESS != rc) return rc;
ompi_proc_local_proc->proc_nodeid = orte_system_info.nodeid;
ompi_proc_local_proc->proc_arch = ui32;
ompi_proc_local_proc->proc_hostname = strdup(orte_system_info.nodename);
if (ompi_mpi_keep_peer_hostnames) {
if (ompi_mpi_keep_fqdn_hostnames) {
/* use the entire FQDN name */
ompi_proc_local_proc->proc_hostname = strdup(orte_system_info.nodename);
} else {
/* use the unqualified name */
char *tmp, *ptr;
tmp = strdup(orte_system_info.nodename);
if (NULL != (ptr = strchr(tmp, '.'))) {
*ptr = '\0';
}
ompi_proc_local_proc->proc_hostname = strdup(tmp);
free(tmp);
}
}
rc = ompi_proc_publish_info();
return rc;
}
int ompi_proc_publish_info(void)
{
orte_std_cntr_t datalen;
void *data;
orte_buffer_t* buf;
int rc;
/* pack our local data for others to use */
buf = OBJ_NEW(orte_buffer_t);
rc = ompi_proc_pack(&ompi_proc_local_proc, 1, buf);
if (OMPI_SUCCESS != rc) return rc;
if (OMPI_SUCCESS != rc) {
ORTE_ERROR_LOG(rc);
return rc;
}
/* send our data into the ether */
rc = orte_dss.unload(buf, &data, &datalen);
@ -169,6 +197,7 @@ ompi_proc_get_info(void)
char *hostname;
void *data;
size_t datalen;
orte_nodeid_t nodeid;
if (ORTE_EQUAL != orte_ns.compare_fields(ORTE_NS_CMP_JOBID,
&ompi_proc_local_proc->proc_name,
@ -189,7 +218,7 @@ ompi_proc_get_info(void)
if (OMPI_SUCCESS != ret)
goto out;
/* This isn't needed here, but packed just so that you
/* This isn't needed here, but packed just so that you
could, in theory, use the unpack code on this proc. We
don't, because we aren't adding procs, but need to
update them */
@ -197,23 +226,34 @@ ompi_proc_get_info(void)
if (ret != ORTE_SUCCESS)
goto out;
ret = orte_dss.unpack(buf, &arch, &count, ORTE_UINT32);
if (ret != ORTE_SUCCESS)
goto out;
ret = orte_dss.unpack(buf, &hostname, &count, ORTE_STRING);
if (ret != ORTE_SUCCESS)
ret = orte_dss.unpack(buf, &nodeid, &count, ORTE_NODEID);
if (ret != ORTE_SUCCESS) {
ORTE_ERROR_LOG(ret);
goto out;
}
ret = orte_dss.unpack(buf, &arch, &count, ORTE_UINT32);
if (ret != ORTE_SUCCESS) {
ORTE_ERROR_LOG(ret);
goto out;
}
ret = orte_dss.unpack(buf, &hostname, &count, ORTE_STRING);
if (ret != ORTE_SUCCESS) {
ORTE_ERROR_LOG(ret);
goto out;
}
/* Free the buffer for the next proc */
OBJ_RELEASE(buf);
} else if (OMPI_ERR_NOT_IMPLEMENTED == ret) {
arch = ompi_proc_local_proc->proc_arch;
hostname = strdup("");
ret = ORTE_SUCCESS;
ret = ORTE_SUCCESS;
} else {
goto out;
}
proc->proc_nodeid = nodeid;
proc->proc_arch = arch;
/* if arch is different than mine, create a new convertor for this proc */
if (proc->proc_arch != ompi_proc_local_proc->proc_arch) {
@ -229,16 +269,14 @@ ompi_proc_get_info(void)
ret = OMPI_ERR_NOT_SUPPORTED;
goto out;
#endif
} else if (0 == strcmp(hostname, orte_system_info.nodename)) {
}
if (ompi_proc_local_proc->proc_nodeid == proc->proc_nodeid) {
proc->proc_flags |= OMPI_PROC_FLAG_LOCAL;
}
/* Save the hostname */
if (ompi_mpi_keep_peer_hostnames) {
/* the dss code will have strdup'ed this for us -- no need
to do so again */
proc->proc_hostname = hostname;
}
/* Save the hostname. The dss code will have strdup'ed this
for us -- no need to do so again */
proc->proc_hostname = hostname;
}
out:
@ -415,16 +453,25 @@ ompi_proc_pack(ompi_proc_t **proclist, int proclistsize, orte_buffer_t* buf)
for (i=0; i<proclistsize; i++) {
rc = orte_dss.pack(buf, &(proclist[i]->proc_name), 1, ORTE_NAME);
if(rc != ORTE_SUCCESS) {
ORTE_ERROR_LOG(rc);
OPAL_THREAD_UNLOCK(&ompi_proc_lock);
return rc;
}
rc = orte_dss.pack(buf, &(proclist[i]->proc_nodeid), 1, ORTE_NODEID);
if(rc != ORTE_SUCCESS) {
ORTE_ERROR_LOG(rc);
OPAL_THREAD_UNLOCK(&ompi_proc_lock);
return rc;
}
rc = orte_dss.pack(buf, &(proclist[i]->proc_arch), 1, ORTE_UINT32);
if(rc != ORTE_SUCCESS) {
ORTE_ERROR_LOG(rc);
OPAL_THREAD_UNLOCK(&ompi_proc_lock);
return rc;
}
rc = orte_dss.pack(buf, &(proclist[i]->proc_hostname), 1, ORTE_STRING);
if(rc != ORTE_SUCCESS) {
ORTE_ERROR_LOG(rc);
OPAL_THREAD_UNLOCK(&ompi_proc_lock);
return rc;
}
@ -435,7 +482,9 @@ ompi_proc_pack(ompi_proc_t **proclist, int proclistsize, orte_buffer_t* buf)
int
ompi_proc_unpack(orte_buffer_t* buf, int proclistsize, ompi_proc_t ***proclist)
ompi_proc_unpack(orte_buffer_t* buf,
int proclistsize, ompi_proc_t ***proclist,
int *newproclistsize, ompi_proc_t ***newproclist)
{
int i;
size_t newprocs_len = 0;
@ -460,17 +509,26 @@ ompi_proc_unpack(orte_buffer_t* buf, int proclistsize, ompi_proc_t ***proclist)
char *new_hostname;
bool isnew = false;
int rc;
orte_nodeid_t new_nodeid;
rc = orte_dss.unpack(buf, &new_name, &count, ORTE_NAME);
if (rc != ORTE_SUCCESS) {
ORTE_ERROR_LOG(rc);
return rc;
}
rc = orte_dss.unpack(buf, &new_nodeid, &count, ORTE_NODEID);
if (rc != ORTE_SUCCESS) {
ORTE_ERROR_LOG(rc);
return rc;
}
rc = orte_dss.unpack(buf, &new_arch, &count, ORTE_UINT32);
if (rc != ORTE_SUCCESS) {
ORTE_ERROR_LOG(rc);
return rc;
}
rc = orte_dss.unpack(buf, &new_hostname, &count, ORTE_STRING);
if (rc != ORTE_SUCCESS) {
ORTE_ERROR_LOG(rc);
return rc;
}
@ -478,6 +536,7 @@ ompi_proc_unpack(orte_buffer_t* buf, int proclistsize, ompi_proc_t ***proclist)
if (isnew) {
newprocs[newprocs_len++] = plist[i];
plist[i]->proc_nodeid = new_nodeid;
plist[i]->proc_arch = new_arch;
/* if arch is different than mine, create a new convertor for this proc */
@ -494,16 +553,21 @@ ompi_proc_unpack(orte_buffer_t* buf, int proclistsize, ompi_proc_t ***proclist)
return OMPI_ERR_NOT_SUPPORTED;
#endif
}
if (ompi_proc_local_proc->proc_nodeid == plist[i]->proc_nodeid) {
plist[i]->proc_flags |= OMPI_PROC_FLAG_LOCAL;
}
/* Save the hostname */
if (ompi_mpi_keep_peer_hostnames) {
plist[i]->proc_hostname = new_hostname;
}
plist[i]->proc_hostname = new_hostname;
}
}
if (newprocs_len > 0) MCA_PML_CALL(add_procs(newprocs, newprocs_len));
if (newprocs != NULL) free(newprocs);
if (NULL != newproclistsize) *newproclistsize = newprocs_len;
if (NULL != newproclist) {
*newproclist = newprocs;
} else if (newprocs != NULL) {
free(newprocs);
}
*proclist = plist;
return OMPI_SUCCESS;


@ -54,6 +54,8 @@ struct ompi_proc_t {
opal_list_item_t super;
/** this process' name */
orte_process_name_t proc_name;
/** "nodeid" on which the proc resides */
orte_nodeid_t proc_nodeid;
/** PML specific proc data */
struct mca_pml_base_endpoint_t* proc_pml;
/** BML specific proc data */
@ -119,6 +121,23 @@ OMPI_DECLSPEC extern ompi_proc_t* ompi_proc_local_proc;
*/
int ompi_proc_init(void);
/**
* Publish local process information
*
* Used by ompi_proc_init() and elsewhere in the code to refresh any
* local information not easily determined by the run-time ahead of time
* (architecture and hostname).
*
* @note While an ompi_proc_t will exist with mostly valid information
* for each process in the MPI_COMM_WORLD at the conclusion of this
* call, some information will not be immediately available. This
* includes the architecture and hostname, which will be available by
* the conclusion of the stage gate.
*
* @retval OMPI_SUCCESS Information available in the modex
* @retval OMPI_ERROR Failure due to unspecified error
*/
int ompi_proc_publish_info(void);
/**
* Get data exchange information from remote processes
@ -267,18 +286,37 @@ int ompi_proc_pack(ompi_proc_t **proclist, int proclistsize,
* provided in the buffer. The lookup actions are always entirely
* local. The proclist returned is a list of pointers to all procs in
* the buffer, whether they were previously known or are new to this
* process. PML_ADD_PROCS will be called on the list of new processes
* discovered during this operation.
* process.
*
* @note In previous versions of this function, the PML's add_procs()
* function was called for any new processes discovered as a result of
* this operation. That is no longer the case -- the caller must use
* the newproclist information to call add_procs() if necessary.
*
* @note The reference count for procs created as a result of this
* operation will be set to 1. Existing procs will not have their
* reference count changed. The reference count of a proc at the
* return of this function is the same regardless of whether NULL is
* provided for newproclist. The user is responsible for freeing the
* newproclist array.
*
* @param[in] buf orte_buffer containing the packed names
* @param[in] proclistsize number of expected proc-pointers
* @param[out] proclist list of process pointers
* @param[out] newproclistsize Number of new procs added as a result
* of the unpack operation. NULL may be
* provided if information is not needed.
* @param[out] newproclist List of new procs added as a result of
* the unpack operation. NULL may be
provided if information is not needed.
*
* Return value:
* OMPI_SUCCESS on success
* OMPI_ERROR else
*/
int ompi_proc_unpack(orte_buffer_t *buf, int proclistsize, ompi_proc_t ***proclist);
int ompi_proc_unpack(orte_buffer_t *buf,
int proclistsize, ompi_proc_t ***proclist,
int *newproclistsize, ompi_proc_t ***newproclist);
END_C_DECLS


@ -48,6 +48,7 @@
#include "orte/util/proc_info.h"
#include "orte/mca/snapc/snapc.h"
#include "orte/mca/snapc/base/base.h"
#include "orte/mca/smr/smr.h"
#include "ompi/constants.h"
#include "ompi/mca/pml/pml.h"
@ -334,6 +335,12 @@ static int ompi_cr_coord_post_restart(void) {
opal_output_verbose(10, ompi_cr_output,
"ompi_cr: coord_post_restart: ompi_cr_coord_post_restart()");
/* register myself to require that I finalize before exiting */
if (ORTE_SUCCESS != (ret = orte_smr.register_sync())) {
exit_status = ret;
goto cleanup;
}
/*
* Notify PML
* - Will notify BML and BTL's


@ -152,7 +152,7 @@ ompi_modex_destruct(ompi_modex_proc_data_t * modex)
}
OBJ_CLASS_INSTANCE(ompi_modex_proc_data_t, opal_object_t,
ompi_modex_construct, ompi_modex_destruct);
ompi_modex_construct, ompi_modex_destruct);
@ -196,15 +196,15 @@ ompi_modex_module_destruct(ompi_modex_module_data_t * module)
{
opal_list_item_t *item;
while (NULL != (item = opal_list_remove_first(&module->module_cbs))) {
OBJ_RELEASE(item);
OBJ_RELEASE(item);
}
OBJ_DESTRUCT(&module->module_cbs);
}
OBJ_CLASS_INSTANCE(ompi_modex_module_data_t,
opal_list_item_t,
ompi_modex_module_construct,
ompi_modex_module_destruct);
opal_list_item_t,
ompi_modex_module_construct,
ompi_modex_module_destruct);
/**
* Callback data for modex updates
@ -220,43 +220,12 @@ struct ompi_modex_cb_t {
typedef struct ompi_modex_cb_t ompi_modex_cb_t;
OBJ_CLASS_INSTANCE(ompi_modex_cb_t,
opal_list_item_t,
NULL,
NULL);
opal_list_item_t,
NULL,
NULL);
/**
* Container for segment subscription data
*
* Track segments we have subscribed to. Any jobid segment we are
* subscribed to for updates will have one of these containers,
* hopefully put on the ompi_modex_subscriptions list.
*/
struct ompi_modex_subscription_t {
opal_list_item_t item;
orte_jobid_t jobid;
};
typedef struct ompi_modex_subscription_t ompi_modex_subscription_t;
OBJ_CLASS_INSTANCE(ompi_modex_subscription_t,
opal_list_item_t,
NULL,
NULL);
/**
* Global modex list for tracking subscriptions
*
* A list of ompi_modex_subscription_t structures, each representing a
* jobid to which we have subscribed for modex updates.
*
* \note The ompi_modex_lock mutex should be held whenever this list
* is being updated or searched.
*/
static opal_list_t ompi_modex_subscriptions;
/**
* Global modex list of proc data
*
@ -278,14 +247,24 @@ static opal_mutex_t ompi_modex_lock;
static opal_mutex_t ompi_modex_string_lock;
/*
* Global buffer we use to collect modex info for later
* transmission
*/
static orte_buffer_t ompi_modex_buffer;
static orte_std_cntr_t ompi_modex_num_entries;
int
ompi_modex_init(void)
{
OBJ_CONSTRUCT(&ompi_modex_data, opal_hash_table_t);
OBJ_CONSTRUCT(&ompi_modex_subscriptions, opal_list_t);
OBJ_CONSTRUCT(&ompi_modex_lock, opal_mutex_t);
OBJ_CONSTRUCT(&ompi_modex_string_lock, opal_mutex_t);
OBJ_CONSTRUCT(&ompi_modex_buffer, orte_buffer_t);
ompi_modex_num_entries = 0;
opal_hash_table_init(&ompi_modex_data, 256);
return OMPI_SUCCESS;
@ -295,17 +274,12 @@ ompi_modex_init(void)
int
ompi_modex_finalize(void)
{
opal_list_item_t *item;
opal_hash_table_remove_all(&ompi_modex_data);
OBJ_DESTRUCT(&ompi_modex_data);
while (NULL != (item = opal_list_remove_first(&ompi_modex_subscriptions)))
OBJ_RELEASE(item);
OBJ_DESTRUCT(&ompi_modex_subscriptions);
OBJ_DESTRUCT(&ompi_modex_string_lock);
OBJ_DESTRUCT(&ompi_modex_lock);
OBJ_DESTRUCT(&ompi_modex_buffer);
return OMPI_SUCCESS;
}
@ -326,11 +300,11 @@ ompi_modex_lookup_module(ompi_modex_proc_data_t *proc_data,
{
ompi_modex_module_data_t *module_data = NULL;
for (module_data = (ompi_modex_module_data_t *) opal_list_get_first(&proc_data->modex_module_data);
module_data != (ompi_modex_module_data_t *) opal_list_get_end(&proc_data->modex_module_data);
module_data = (ompi_modex_module_data_t *) opal_list_get_next(module_data)) {
if (mca_base_component_compatible(&module_data->component, component) == 0) {
return module_data;
}
module_data != (ompi_modex_module_data_t *) opal_list_get_end(&proc_data->modex_module_data);
module_data = (ompi_modex_module_data_t *) opal_list_get_next(module_data)) {
if (mca_base_component_compatible(&module_data->component, component) == 0) {
return module_data;
}
}
if (create_if_not_found) {
@ -365,7 +339,7 @@ ompi_modex_lookup_orte_proc(orte_process_name_t *orte_proc)
orte_hash_table_get_proc(&ompi_modex_data, orte_proc);
if (NULL == proc_data) {
/* The proc clearly exists, so create a modex structure
for it and try to subscribe */
for it */
proc_data = OBJ_NEW(ompi_modex_proc_data_t);
if (NULL == proc_data) {
opal_output(0, "ompi_modex_lookup_orte_proc: unable to allocate ompi_modex_proc_data_t\n");
@ -403,8 +377,6 @@ ompi_modex_lookup_proc(ompi_proc_t *proc)
OBJ_RETAIN(proc_data);
proc->proc_modex = &proc_data->super.super;
OPAL_THREAD_UNLOCK(&ompi_modex_lock);
/* verify that we have subscribed to this segment */
ompi_modex_subscribe_job(proc->proc_name.jobid);
} else {
OPAL_THREAD_UNLOCK(&ompi_modex_lock);
}
@ -415,345 +387,233 @@ ompi_modex_lookup_proc(ompi_proc_t *proc)
/**
* Callback for registry notifications.
* Get the local buffer's data
*/
static void
ompi_modex_registry_callback(orte_gpr_notify_data_t * data,
void *cbdata)
int
ompi_modex_get_my_buffer(orte_buffer_t *buf)
{
orte_std_cntr_t i, j, k;
orte_gpr_value_t **values, *value;
orte_gpr_keyval_t **keyval;
orte_process_name_t *proc_name;
int rc;
OPAL_THREAD_LOCK(&ompi_modex_lock);
/* put our process name in the buffer so it can be unpacked later */
if (ORTE_SUCCESS != (rc = orte_dss.pack(buf, ORTE_PROC_MY_NAME, 1, ORTE_NAME))) {
ORTE_ERROR_LOG(rc);
OPAL_THREAD_UNLOCK(&ompi_modex_lock);
return rc;
}
/* put the number of entries into the buffer */
if (ORTE_SUCCESS != (rc = orte_dss.pack(buf, &ompi_modex_num_entries, 1, ORTE_STD_CNTR))) {
ORTE_ERROR_LOG(rc);
OPAL_THREAD_UNLOCK(&ompi_modex_lock);
return rc;
}
/* if there are entries, copy the data across */
if (0 < ompi_modex_num_entries) {
if (ORTE_SUCCESS != (orte_dss.copy_payload(buf, &ompi_modex_buffer))) {
ORTE_ERROR_LOG(rc);
OPAL_THREAD_UNLOCK(&ompi_modex_lock);
return rc;
}
}
OPAL_THREAD_UNLOCK(&ompi_modex_lock);
return ORTE_SUCCESS;
}
/**
* Process modex data
*/
int
ompi_modex_process_data(orte_buffer_t *buf)
{
orte_std_cntr_t i, j, num_procs, num_entries;
opal_list_item_t *item;
void *bytes = NULL;
orte_std_cntr_t cnt;
orte_process_name_t proc_name;
ompi_modex_proc_data_t *proc_data;
ompi_modex_module_data_t *module_data;
mca_base_component_t component;
int rc;
/* process the callback */
values = (orte_gpr_value_t **) (data->values)->addr;
for (i = 0, k = 0; k < data->cnt &&
i < (data->values)->size; i++) {
if (NULL != values[i]) {
k++;
value = values[i];
if (0 < value->cnt) { /* needs to be at least one keyval */
/* Find the process name in the keyvals */
keyval = value->keyvals;
for (j = 0; j < value->cnt; j++) {
if (0 != strcmp(keyval[j]->key, ORTE_PROC_NAME_KEY)) continue;
/* this is the process name - extract it */
if (ORTE_SUCCESS != orte_dss.get((void**)&proc_name, keyval[j]->value, ORTE_NAME)) {
opal_output(0, "ompi_modex_registry_callback: unable to extract process name\n");
return; /* nothing we can do */
}
goto GOTNAME;
/* extract the number of entries in the buffer */
cnt=1;
if (ORTE_SUCCESS != (rc = orte_dss.unpack(buf, &num_procs, &cnt, ORTE_STD_CNTR))) {
ORTE_ERROR_LOG(rc);
return rc;
}
/* process the buffer */
for (i=0; i < num_procs; i++) {
/* unpack the process name */
cnt=1;
if (ORTE_SUCCESS != (rc = orte_dss.unpack(buf, &proc_name, &cnt, ORTE_NAME))) {
ORTE_ERROR_LOG(rc);
return rc;
}
/* look up the modex data structure */
proc_data = ompi_modex_lookup_orte_proc(&proc_name);
if (proc_data == NULL) {
/* report the error */
opal_output(0, "ompi_modex_process_data: received modex info for unknown proc %s\n",
ORTE_NAME_PRINT(&proc_name));
return OMPI_ERR_NOT_FOUND;
}
/* unpack the number of entries for this proc */
cnt=1;
if (ORTE_SUCCESS != (rc = orte_dss.unpack(buf, &num_entries, &cnt, ORTE_STD_CNTR))) {
ORTE_ERROR_LOG(rc);
return rc;
}
OPAL_THREAD_LOCK(&proc_data->modex_lock);
/*
* Extract the component name and version - there is one entry per
* component type/name/version - and process them all
*/
for (j = 0; j < num_entries; j++) {
size_t num_bytes;
char *ptr;
cnt = 1;
if (ORTE_SUCCESS != (rc = orte_dss.unpack(buf, &ptr, &cnt, ORTE_STRING))) {
ORTE_ERROR_LOG(rc);
return rc;
}
strncpy(component.mca_type_name, ptr, MCA_BASE_MAX_COMPONENT_NAME_LEN);
free(ptr);
cnt = 1;
if (ORTE_SUCCESS != (rc = orte_dss.unpack(buf, &ptr, &cnt, ORTE_STRING))) {
ORTE_ERROR_LOG(rc);
return rc;
}
strncpy(component.mca_component_name, ptr, MCA_BASE_MAX_COMPONENT_NAME_LEN);
free(ptr);
cnt = 1;
if (ORTE_SUCCESS != (rc = orte_dss.unpack(buf,
&component.mca_component_major_version, &cnt, ORTE_INT32))) {
ORTE_ERROR_LOG(rc);
return rc;
}
cnt = 1;
if (ORTE_SUCCESS != (rc = orte_dss.unpack(buf,
&component.mca_component_minor_version, &cnt, ORTE_INT32))) {
ORTE_ERROR_LOG(rc);
return rc;
}
cnt = 1;
if (ORTE_SUCCESS != (rc = orte_dss.unpack(buf, &num_bytes, &cnt, ORTE_SIZE))) {
ORTE_ERROR_LOG(rc);
return rc;
}
if (num_bytes != 0) {
if (NULL == (bytes = malloc(num_bytes))) {
ORTE_ERROR_LOG(ORTE_ERR_OUT_OF_RESOURCE);
return ORTE_ERR_OUT_OF_RESOURCE;
}
opal_output(0, "ompi_modex_registry_callback: unable to find process name in notify message\n");
return; /* if the name wasn't here, there is nothing we can do */
GOTNAME:
/* look up the modex data structure */
proc_data = ompi_modex_lookup_orte_proc(proc_name);
if (proc_data == NULL) continue;
OPAL_THREAD_LOCK(&proc_data->modex_lock);
/*
* Extract the component name and version from the keyval object's key
* Could be multiple keyvals returned since there is one for each
* component type/name/version - process them all
*/
keyval = value->keyvals;
for (j = 0; j < value->cnt; j++) {
orte_buffer_t buffer;
opal_list_item_t *item;
char *ptr;
void *bytes = NULL;
orte_std_cntr_t cnt;
size_t num_bytes;
orte_byte_object_t *bo;
if (strcmp(keyval[j]->key, OMPI_MODEX_KEY) != 0)
continue;
OBJ_CONSTRUCT(&buffer, orte_buffer_t);
if (ORTE_SUCCESS != (rc = orte_dss.get((void **) &bo, keyval[j]->value, ORTE_BYTE_OBJECT))) {
ORTE_ERROR_LOG(rc);
continue;
}
if (ORTE_SUCCESS != (rc = orte_dss.load(&buffer, bo->bytes, bo->size))) {
ORTE_ERROR_LOG(rc);
continue;
}
cnt = 1;
if (ORTE_SUCCESS != (rc = orte_dss.unpack(&buffer, &ptr, &cnt, ORTE_STRING))) {
ORTE_ERROR_LOG(rc);
continue;
}
strncpy(component.mca_type_name, ptr, MCA_BASE_MAX_COMPONENT_NAME_LEN);
free(ptr);
cnt = 1;
if (ORTE_SUCCESS != (rc = orte_dss.unpack(&buffer, &ptr, &cnt, ORTE_STRING))) {
ORTE_ERROR_LOG(rc);
continue;
}
strncpy(component.mca_component_name, ptr, MCA_BASE_MAX_COMPONENT_NAME_LEN);
free(ptr);
cnt = 1;
if (ORTE_SUCCESS != (rc = orte_dss.unpack(&buffer,
&component.mca_component_major_version, &cnt, ORTE_INT32))) {
ORTE_ERROR_LOG(rc);
continue;
}
cnt = 1;
if (ORTE_SUCCESS != (rc = orte_dss.unpack(&buffer,
&component.mca_component_minor_version, &cnt, ORTE_INT32))) {
ORTE_ERROR_LOG(rc);
continue;
}
cnt = 1;
if (ORTE_SUCCESS != (rc = orte_dss.unpack(&buffer, &num_bytes, &cnt, ORTE_SIZE))) {
ORTE_ERROR_LOG(rc);
continue;
}
if (num_bytes != 0) {
if (NULL == (bytes = malloc(num_bytes))) {
ORTE_ERROR_LOG(ORTE_ERR_OUT_OF_RESOURCE);
continue;
}
cnt = (orte_std_cntr_t) num_bytes;
if (ORTE_SUCCESS != (rc = orte_dss.unpack(&buffer, bytes, &cnt, ORTE_BYTE))) {
ORTE_ERROR_LOG(rc);
continue;
}
num_bytes = cnt;
} else {
bytes = NULL;
}
/*
* Lookup the corresponding modex structure
*/
if (NULL == (module_data = ompi_modex_lookup_module(proc_data,
&component,
true))) {
opal_output(0, "ompi_modex_registry_callback: ompi_modex_lookup_module failed\n");
OBJ_RELEASE(data);
OPAL_THREAD_UNLOCK(&proc_data->modex_lock);
return;
}
module_data->module_data = bytes;
module_data->module_data_size = num_bytes;
proc_data->modex_received_data = true;
opal_condition_signal(&proc_data->modex_cond);
if (opal_list_get_size(&module_data->module_cbs)) {
ompi_proc_t *proc = ompi_proc_find(proc_name);
if (NULL != proc) {
OPAL_THREAD_LOCK(&proc->proc_lock);
/* call any registered callbacks */
for (item = opal_list_get_first(&module_data->module_cbs);
item != opal_list_get_end(&module_data->module_cbs);
item = opal_list_get_next(item)) {
ompi_modex_cb_t *cb = (ompi_modex_cb_t *) item;
cb->cbfunc(&module_data->component,
proc, bytes, num_bytes, cb->cbdata);
}
OPAL_THREAD_UNLOCK(&proc->proc_lock);
}
}
cnt = (orte_std_cntr_t) num_bytes;
if (ORTE_SUCCESS != (rc = orte_dss.unpack(buf, bytes, &cnt, ORTE_BYTE))) {
ORTE_ERROR_LOG(rc);
free(bytes);
return rc;
}
num_bytes = cnt;
} else {
bytes = NULL;
}
/*
* Lookup the corresponding modex structure
*/
if (NULL == (module_data = ompi_modex_lookup_module(proc_data,
&component,
true))) {
opal_output(0, "ompi_modex_process_data: ompi_modex_lookup_module failed\n");
OPAL_THREAD_UNLOCK(&proc_data->modex_lock);
} /* if value[i]->cnt > 0 */
} /* if value[i] != NULL */
}
}
int
ompi_modex_subscribe_job(orte_jobid_t jobid)
{
char *segment, *sub_name, *trig_name;
orte_gpr_subscription_id_t sub_id;
opal_list_item_t *item;
ompi_modex_subscription_t *subscription;
int rc;
char *keys[] = {
ORTE_PROC_NAME_KEY,
OMPI_MODEX_KEY,
NULL
};
/* check for an existing subscription */
OPAL_THREAD_LOCK(&ompi_modex_lock);
for (item = opal_list_get_first(&ompi_modex_subscriptions) ;
item != opal_list_get_end(&ompi_modex_subscriptions) ;
item = opal_list_get_next(item)) {
subscription = (ompi_modex_subscription_t *) item;
if (subscription->jobid == jobid) {
OPAL_THREAD_UNLOCK(&ompi_modex_lock);
return OMPI_SUCCESS;
}
}
OPAL_THREAD_UNLOCK(&ompi_modex_lock);
/* otherwise - subscribe to get this jobid's contact info */
if (ORTE_SUCCESS != (rc = orte_schema.get_std_subscription_name(&sub_name,
OMPI_MODEX_SUBSCRIPTION, jobid))) {
ORTE_ERROR_LOG(rc);
return rc;
}
/* attach to the stage-1 standard trigger */
if (ORTE_SUCCESS != (rc = orte_schema.get_std_trigger_name(&trig_name,
ORTE_STG1_TRIGGER, jobid))) {
ORTE_ERROR_LOG(rc);
free(sub_name);
return rc;
}
/* define the segment */
if (ORTE_SUCCESS != (rc = orte_schema.get_job_segment_name(&segment, jobid))) {
ORTE_ERROR_LOG(rc);
free(sub_name);
free(trig_name);
return rc;
}
if (jobid != orte_process_info.my_name->jobid) {
if (ORTE_SUCCESS != (rc = orte_gpr.subscribe_N(&sub_id, NULL, NULL,
ORTE_GPR_NOTIFY_ADD_ENTRY |
ORTE_GPR_NOTIFY_VALUE_CHG |
ORTE_GPR_NOTIFY_PRE_EXISTING,
ORTE_GPR_KEYS_OR | ORTE_GPR_TOKENS_OR | ORTE_GPR_STRIPPED,
segment,
NULL, /* look at all
* containers on this
* segment */
2, keys,
ompi_modex_registry_callback, NULL))) {
ORTE_ERROR_LOG(rc);
free(sub_name);
free(trig_name);
free(segment);
return rc;
}
} else {
if (ORTE_SUCCESS != (rc = orte_gpr.subscribe_N(&sub_id, trig_name, sub_name,
ORTE_GPR_NOTIFY_ADD_ENTRY |
ORTE_GPR_NOTIFY_VALUE_CHG |
ORTE_GPR_NOTIFY_STARTS_AFTER_TRIG,
ORTE_GPR_KEYS_OR | ORTE_GPR_TOKENS_OR | ORTE_GPR_STRIPPED,
segment,
NULL, /* look at all
* containers on this
* segment */
2, keys,
ompi_modex_registry_callback, NULL))) {
ORTE_ERROR_LOG(rc);
free(sub_name);
free(trig_name);
free(segment);
return rc;
return OMPI_ERR_NOT_FOUND;
}
module_data->module_data = bytes;
module_data->module_data_size = num_bytes;
proc_data->modex_received_data = true;
opal_condition_signal(&proc_data->modex_cond);
if (opal_list_get_size(&module_data->module_cbs)) {
ompi_proc_t *proc = ompi_proc_find(&proc_name);
if (NULL != proc) {
OPAL_THREAD_LOCK(&proc->proc_lock);
/* call any registered callbacks */
for (item = opal_list_get_first(&module_data->module_cbs);
item != opal_list_get_end(&module_data->module_cbs);
item = opal_list_get_next(item)) {
ompi_modex_cb_t *cb = (ompi_modex_cb_t *) item;
cb->cbfunc(&module_data->component,
proc, bytes, num_bytes, cb->cbdata);
}
OPAL_THREAD_UNLOCK(&proc->proc_lock);
}
}
}
OPAL_THREAD_UNLOCK(&proc_data->modex_lock);
}
free(sub_name);
free(trig_name);
free(segment);
/* add this jobid to our list of subscriptions */
OPAL_THREAD_LOCK(&ompi_modex_lock);
subscription = OBJ_NEW(ompi_modex_subscription_t);
subscription->jobid = jobid;
opal_list_append(&ompi_modex_subscriptions, &subscription->item);
OPAL_THREAD_UNLOCK(&ompi_modex_lock);
return OMPI_SUCCESS;
}
int
ompi_modex_send(mca_base_component_t * source_component,
const void *data,
size_t size)
const void *data,
size_t size)
{
orte_jobid_t jobid;
int rc;
orte_buffer_t buffer;
orte_std_cntr_t i, num_tokens;
char *ptr, *segment, **tokens;
orte_byte_object_t bo;
orte_data_value_t value = ORTE_DATA_VALUE_EMPTY;
char *ptr;
/* get location in GPR for the data */
jobid = ORTE_PROC_MY_NAME->jobid;
if (ORTE_SUCCESS != (rc = orte_schema.get_job_segment_name(&segment, jobid))) {
ORTE_ERROR_LOG(rc);
return rc;
}
if (ORTE_SUCCESS != (rc = orte_schema.get_proc_tokens(&tokens,
&num_tokens, orte_process_info.my_name))) {
ORTE_ERROR_LOG(rc);
free(segment);
return rc;
}
OBJ_CONSTRUCT(&buffer, orte_buffer_t);
/* Pack the component name information into the buffer */
OPAL_THREAD_LOCK(&ompi_modex_lock);
/* Pack the component name information into the local buffer */
ptr = source_component->mca_type_name;
if (ORTE_SUCCESS != (rc = orte_dss.pack(&buffer, &ptr, 1, ORTE_STRING))) {
ORTE_ERROR_LOG(rc);
goto cleanup;
if (ORTE_SUCCESS != (rc = orte_dss.pack(&ompi_modex_buffer, &ptr, 1, ORTE_STRING))) {
ORTE_ERROR_LOG(rc);
goto cleanup;
}
ptr = source_component->mca_component_name;
if (ORTE_SUCCESS != (rc = orte_dss.pack(&buffer, &ptr, 1, ORTE_STRING))) {
ORTE_ERROR_LOG(rc);
goto cleanup;
if (ORTE_SUCCESS != (rc = orte_dss.pack(&ompi_modex_buffer, &ptr, 1, ORTE_STRING))) {
ORTE_ERROR_LOG(rc);
goto cleanup;
}
if (ORTE_SUCCESS != (rc = orte_dss.pack(&buffer, &source_component->mca_component_major_version, 1, ORTE_INT32))) {
ORTE_ERROR_LOG(rc);
goto cleanup;
if (ORTE_SUCCESS != (rc = orte_dss.pack(&ompi_modex_buffer, &source_component->mca_component_major_version, 1, ORTE_INT32))) {
ORTE_ERROR_LOG(rc);
goto cleanup;
}
if (ORTE_SUCCESS != (rc = orte_dss.pack(&buffer, &source_component->mca_component_minor_version, 1, ORTE_INT32))) {
ORTE_ERROR_LOG(rc);
goto cleanup;
if (ORTE_SUCCESS != (rc = orte_dss.pack(&ompi_modex_buffer, &source_component->mca_component_minor_version, 1, ORTE_INT32))) {
ORTE_ERROR_LOG(rc);
goto cleanup;
}
if (ORTE_SUCCESS != (rc = orte_dss.pack(&buffer, &size, 1, ORTE_SIZE))) {
ORTE_ERROR_LOG(rc);
goto cleanup;
if (ORTE_SUCCESS != (rc = orte_dss.pack(&ompi_modex_buffer, &size, 1, ORTE_SIZE))) {
ORTE_ERROR_LOG(rc);
goto cleanup;
}
/* Pack the actual data into the buffer */
if (0 != size) {
if (ORTE_SUCCESS != (rc = orte_dss.pack(&buffer, (void *) data, size, ORTE_BYTE))) {
ORTE_ERROR_LOG(rc);
goto cleanup;
}
if (ORTE_SUCCESS != (rc = orte_dss.pack(&ompi_modex_buffer, (void *) data, size, ORTE_BYTE))) {
ORTE_ERROR_LOG(rc);
goto cleanup;
}
}
if (ORTE_SUCCESS != (rc = orte_dss.unload(&buffer, (void **) &(bo.bytes), &(bo.size)))) {
ORTE_ERROR_LOG(rc);
goto cleanup;
}
OBJ_DESTRUCT(&buffer);
/* track the number of entries */
++ompi_modex_num_entries;
/* setup the data_value structure to hold the byte object */
if (ORTE_SUCCESS != (rc = orte_dss.set(&value, (void *) &bo, ORTE_BYTE_OBJECT))) {
ORTE_ERROR_LOG(rc);
goto cleanup;
}
/* Put data in registry */
rc = orte_gpr.put_1(ORTE_GPR_TOKENS_AND | ORTE_GPR_KEYS_OR,
segment, tokens, OMPI_MODEX_KEY, &value);
cleanup:
free(segment);
for (i = 0; i < num_tokens; i++) {
free(tokens[i]);
tokens[i] = NULL;
}
if (NULL != tokens)
free(tokens);
cleanup:
OPAL_THREAD_UNLOCK(&ompi_modex_lock);
return rc;
}
@@ -773,7 +633,7 @@ ompi_modex_recv(mca_base_component_t * component,
"null")) {
return OMPI_ERR_NOT_IMPLEMENTED;
}
proc_data = ompi_modex_lookup_proc(proc);
if (NULL == proc_data) return OMPI_ERR_NOT_FOUND;
@@ -781,7 +641,7 @@ ompi_modex_recv(mca_base_component_t * component,
/* wait until data is available */
while (proc_data->modex_received_data == false) {
opal_condition_wait(&proc_data->modex_cond, &proc_data->modex_lock);
opal_condition_wait(&proc_data->modex_cond, &proc_data->modex_lock);
}
/* look up module */
@@ -790,17 +650,19 @@ ompi_modex_recv(mca_base_component_t * component,
/* copy the data out to the user */
if ((NULL == module_data) ||
(module_data->module_data_size == 0)) {
*buffer = NULL;
*size = 0;
opal_output(0, "modex recv: no module avail or zero byte size");
*buffer = NULL;
*size = 0;
} else {
void *copy = malloc(module_data->module_data_size);
if (copy == NULL) {
void *copy = malloc(module_data->module_data_size);
if (copy == NULL) {
OPAL_THREAD_UNLOCK(&proc_data->modex_lock);
return OMPI_ERR_OUT_OF_RESOURCE;
}
memcpy(copy, module_data->module_data, module_data->module_data_size);
*buffer = copy;
*size = module_data->module_data_size;
return OMPI_ERR_OUT_OF_RESOURCE;
}
memcpy(copy, module_data->module_data, module_data->module_data_size);
*buffer = copy;
*size = module_data->module_data_size;
}
OPAL_THREAD_UNLOCK(&proc_data->modex_lock);
@@ -826,8 +688,8 @@ ompi_modex_recv_nb(mca_base_component_t *component,
/* lookup / create module */
module_data = ompi_modex_lookup_module(proc_data, component, true);
if (NULL == module_data) {
OPAL_THREAD_UNLOCK(&proc_data->modex_lock);
return OMPI_ERR_OUT_OF_RESOURCE;
OPAL_THREAD_UNLOCK(&proc_data->modex_lock);
return OMPI_ERR_OUT_OF_RESOURCE;
}
/* register the callback */

@@ -51,6 +51,7 @@
#include <sys/types.h>
#endif
#include "orte/dss/dss_types.h"
#include "orte/mca/ns/ns_types.h"
struct mca_base_component_t;
@@ -248,26 +249,40 @@ OMPI_DECLSPEC int ompi_modex_recv_string(const char* key,
/**
* Subscribe to resource updates for a specific job
* Retrieve the contents of the local modex buffer
*
* Generally called during process initialization, after all the data
* has been loaded into the module exchange system, but before the
* data is actually used.
*
* Intended to help the scalability of start-up by not subscribing to
* the job updates until all data is in the system (and not firing
* updates along the way) and launching the asynchronous request for
* the data before it is actually needed later in init.
* Each component will "send" its data on its own. The modex
* collects that data into a local static buffer. At some point,
* we need to provide a copy of the collected info so someone
* (usually mpi_init) can send it to everyone else. This function
* xfers the payload in the local static buffer into the provided
* buffer, thus resetting the local buffer for future use.
*
* @note This function is probably not useful outside of application
* initialization code.
*
* @param[in] jobid Jobid for which information is needed
* @param[in] *buf Pointer to the target buffer
*
* @retval OMPI_SUCCESS Successfully subscribed to information
* @retval OMPI_SUCCESS Successfully retrieved the buffer
* @retval OMPI_ERROR An unspecified error occurred
*/
OMPI_DECLSPEC int ompi_modex_subscribe_job(orte_jobid_t jobid);
OMPI_DECLSPEC int ompi_modex_get_my_buffer(orte_buffer_t *buf);
/**
* Process the data in a modex buffer
*
* Given a buffer containing a set of modex entries, this
* function will destructively read the buffer, adding the
* modex info to each proc. An error will be returned if
* modex info is found for a proc that is not yet in the
* ompi_proc table
*
* @param[in] *buf Pointer to a buffer containing the data
*
* @retval OMPI_SUCCESS Successfully exchanged information
* @retval OMPI_ERROR An unspecified error occurred
*/
OMPI_DECLSPEC int ompi_modex_process_data(orte_buffer_t *buf);
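The two functions documented above share a simple length-prefixed wire format: each entry carries a component type/name string, a major/minor version, and a sized byte payload. As a minimal standalone sketch of that pack/unpack pattern - using a plain byte buffer in place of `orte_buffer_t`/`orte_dss`, with all names invented for illustration:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for orte_buffer_t: a growable byte stream with a read cursor. */
typedef struct { uint8_t *data; size_t len, cap, rd; } mini_buf;

static void mb_put(mini_buf *b, const void *p, size_t n) {
    if (b->len + n > b->cap) {
        b->cap = (b->cap ? b->cap * 2 : 64) + n;
        b->data = realloc(b->data, b->cap);
    }
    memcpy(b->data + b->len, p, n);
    b->len += n;
}

static int mb_get(mini_buf *b, void *p, size_t n) {
    if (b->rd + n > b->len) return -1;   /* buffer underflow */
    memcpy(p, b->data + b->rd, n);
    b->rd += n;
    return 0;
}

/* Pack one modex-style entry: component name, version, then a sized payload. */
static void pack_entry(mini_buf *b, const char *comp,
                       int32_t maj, int32_t min,
                       const void *bytes, uint32_t nbytes) {
    uint32_t slen = (uint32_t)strlen(comp) + 1;
    mb_put(b, &slen, sizeof slen);
    mb_put(b, comp, slen);
    mb_put(b, &maj, sizeof maj);
    mb_put(b, &min, sizeof min);
    mb_put(b, &nbytes, sizeof nbytes);
    if (nbytes) mb_put(b, bytes, nbytes);
}

/* Destructively read one entry back, mirroring ompi_modex_process_data. */
static int unpack_entry(mini_buf *b, char *comp, size_t compsz,
                        int32_t *maj, int32_t *min,
                        void **bytes, uint32_t *nbytes) {
    uint32_t slen;
    if (mb_get(b, &slen, sizeof slen) || slen > compsz) return -1;
    if (mb_get(b, comp, slen)) return -1;
    if (mb_get(b, maj, sizeof *maj) || mb_get(b, min, sizeof *min)) return -1;
    if (mb_get(b, nbytes, sizeof *nbytes)) return -1;
    *bytes = NULL;
    if (*nbytes) {
        *bytes = malloc(*nbytes);
        if (!*bytes || mb_get(b, *bytes, *nbytes)) { free(*bytes); return -1; }
    }
    return 0;
}
```

Because the reader advances a cursor rather than copying, a second pass over the same buffer would fail - the same "destructive read" semantic the doc comment above describes.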
/**

@@ -135,32 +135,23 @@ int ompi_mpi_finalize(void)
MPI lifetime, to get better latency when not using TCP */
opal_progress_event_users_increment();
/* mark that I called finalize before exiting */
if (ORTE_SUCCESS != (ret = orte_smr.register_sync())) {
ORTE_ERROR_LOG(ret);
return ret;
}
/* If maffinity was setup, tear it down */
if (ompi_mpi_maffinity_setup) {
opal_maffinity_base_close();
}
/* begin recording compound command */
/* if (OMPI_SUCCESS != (ret = orte_gpr.begin_compound_cmd())) {
return ret;
}
*/
/* Set process status to "at stg3" */
if (ORTE_SUCCESS != (ret = orte_smr.set_proc_state(orte_process_info.my_name,
ORTE_PROC_STATE_AT_STG3, 0))) {
ORTE_ERROR_LOG(ret);
}
/* execute the compound command - no return data requested
*/
/* if (OMPI_SUCCESS != (ret = orte_gpr.exec_compound_cmd())) {
return ret;
}
*/
/*
* Wait for everyone to get here
*/
if (ORTE_SUCCESS != (ret = orte_grpcomm.xcast_gate(orte_gpr.deliver_notify_msg))) {
/* wait for everyone to reach this point
This is a grpcomm barrier instead of an MPI barrier because an
MPI barrier doesn't ensure that all messages have been transmitted
before exiting, so the possibility of a stranded message exists.
*/
if (OMPI_SUCCESS != (ret = orte_grpcomm.barrier())) {
ORTE_ERROR_LOG(ret);
return ret;
}
@@ -308,23 +299,6 @@ int ompi_mpi_finalize(void)
return ret;
}
/* Set process status to "finalized" */
if (ORTE_SUCCESS != (ret = orte_smr.set_proc_state(orte_process_info.my_name,
ORTE_PROC_STATE_FINALIZED, 0))) {
ORTE_ERROR_LOG(ret);
}
/*
* Wait for everyone to get here. This is necessary to allow the smr
* to update the job state for singletons. Otherwise, we finalize
* the RTE while the smr is trying to do the update - which causes
* an ugly race condition
*/
if (ORTE_SUCCESS != (ret = orte_grpcomm.xcast_gate(orte_gpr.deliver_notify_msg))) {
ORTE_ERROR_LOG(ret);
return ret;
}
/* Leave the RTE */
if (OMPI_SUCCESS != (ret = orte_finalize())) {

@@ -217,8 +217,7 @@ int ompi_mpi_init(int argc, char **argv, int requested, int *provided)
ompi_proc_t** procs;
size_t nprocs;
char *error = NULL;
bool compound_cmd = false;
orte_buffer_t *cmd_buffer = NULL;
orte_buffer_t mdx_buf, rbuf;
bool timing = false;
int param, value;
struct timeval ompistart, ompistop;
@@ -246,39 +245,17 @@ int ompi_mpi_init(int argc, char **argv, int requested, int *provided)
/* Setup ORTE stage 1, note that we are not infrastructre */
if (ORTE_SUCCESS != (ret = orte_init_stage1(false))) {
error = "ompi_mpi_init: orte_init_stage1 failed";
if (ORTE_SUCCESS != (ret = orte_init(ORTE_NON_INFRASTRUCTURE))) {
error = "ompi_mpi_init: orte_init failed";
goto error;
}
/* If we are not the seed nor a singleton, AND we have not set the
orte_debug flag, then start recording the compound command that
starts us up. if we are the seed or a singleton, then don't do
this - the registry is local, so we'll just drive it
directly */
if (orte_process_info.seed ||
NULL == orte_process_info.ns_replica ||
orte_debug_flag) {
compound_cmd = false;
} else {
cmd_buffer = OBJ_NEW(orte_buffer_t);
if (ORTE_SUCCESS != (ret = orte_gpr.begin_compound_cmd(cmd_buffer))) {
ORTE_ERROR_LOG(ret);
error = "ompi_mpi_init: orte_gpr.begin_compound_cmd failed";
goto error;
}
compound_cmd = true;
}
/* Now do the things that hit the registry */
if (ORTE_SUCCESS != (ret = orte_init_stage2(ORTE_STG1_TRIGGER))) {
ORTE_ERROR_LOG(ret);
error = "ompi_mpi_init: orte_init_stage2 failed";
/* register myself to require that I finalize before exiting */
if (ORTE_SUCCESS != (ret = orte_smr.register_sync())) {
error = "ompi_mpi_init: register sync failed";
goto error;
}
/* check for timing request - get stop time and report elapsed time if so */
if (timing) {
gettimeofday(&ompistop, NULL);
@@ -343,6 +320,14 @@ int ompi_mpi_init(int argc, char **argv, int requested, int *provided)
goto error;
}
/* Initialize module exchange - this MUST happen before proc_init
* as proc_init needs to send modex info!
*/
if (OMPI_SUCCESS != (ret = ompi_modex_init())) {
error = "ompi_modex_init() failed";
goto error;
}
/* Initialize OMPI procs */
if (OMPI_SUCCESS != (ret = ompi_proc_init())) {
error = "mca_proc_init() failed";
@@ -400,13 +385,6 @@ int ompi_mpi_init(int argc, char **argv, int requested, int *provided)
relevant functions (e.g., MPI_FILE_*, MPI_CART_*, MPI_GRAPH_*),
so they are not opened here. */
/* Initialize module exchange */
if (OMPI_SUCCESS != (ret = ompi_modex_init())) {
error = "ompi_modex_init() failed";
goto error;
}
/* Select which MPI components to use */
if (OMPI_SUCCESS !=
@@ -514,68 +492,50 @@ int ompi_mpi_init(int argc, char **argv, int requested, int *provided)
error = "ompi_attr_init() failed";
goto error;
}
/* do module exchange */
if (OMPI_SUCCESS != (ret = ompi_modex_subscribe_job(ORTE_PROC_MY_NAME->jobid))) {
error = "ompi_modex_subscribe_job() failed";
goto error;
}
/* Let system know we are at STG1 Barrier */
if (ORTE_SUCCESS != (ret = orte_smr.set_proc_state(orte_process_info.my_name,
ORTE_PROC_STATE_AT_STG1, 0))) {
ORTE_ERROR_LOG(ret);
error = "set process state failed";
goto error;
}
/* check for timing request - get stop time and report elapsed time if so */
if (timing) {
gettimeofday(&ompistop, NULL);
opal_output(0, "ompi_mpi_init[%ld]: time from completion of orte_init to exec_compound_cmd %ld usec",
opal_output(0, "ompi_mpi_init[%ld]: time from completion of orte_init to modex %ld usec",
(long)ORTE_PROC_MY_NAME->vpid,
(long int)((ompistop.tv_sec - ompistart.tv_sec)*1000000 +
(ompistop.tv_usec - ompistart.tv_usec)));
gettimeofday(&ompistart, NULL);
}
/* if the compound command is operative, execute it */
if (compound_cmd) {
if (OMPI_SUCCESS != (ret = orte_gpr.exec_compound_cmd(cmd_buffer))) {
ORTE_ERROR_LOG(ret);
error = "ompi_rte_init: orte_gpr.exec_compound_cmd failed";
goto error;
}
OBJ_RELEASE(cmd_buffer);
/* get the modex buffer so we can exchange it */
OBJ_CONSTRUCT(&mdx_buf, orte_buffer_t);
if (OMPI_SUCCESS != (ret = ompi_modex_get_my_buffer(&mdx_buf))) {
error = "ompi_modex_get_my_buffer() failed";
goto error;
}
/* execute the exchange - this function also acts as a barrier
* as it will not return until the exchange is complete
*/
OBJ_CONSTRUCT(&rbuf, orte_buffer_t);
if (OMPI_SUCCESS != (ret = orte_grpcomm.allgather(&mdx_buf, &rbuf))) {
error = "orte_grpcomm_allgather failed";
goto error;
}
OBJ_DESTRUCT(&mdx_buf);
/* check for timing request - get stop time and report elapsed time if so */
/* process the modex data into the proc structures */
if (OMPI_SUCCESS != (ret = ompi_modex_process_data(&rbuf))) {
error = "ompi_modex_process_data failed";
goto error;
}
OBJ_DESTRUCT(&rbuf);
if (timing) {
gettimeofday(&ompistop, NULL);
opal_output(0, "ompi_mpi_init[%ld]: time to execute compound command %ld usec",
opal_output(0, "ompi_mpi_init[%ld]: time to execute modex %ld usec",
(long)ORTE_PROC_MY_NAME->vpid,
(long int)((ompistop.tv_sec - ompistart.tv_sec)*1000000 +
(ompistop.tv_usec - ompistart.tv_usec)));
gettimeofday(&ompistart, NULL);
}
/* FIRST BARRIER - WAIT FOR XCAST STG1 MESSAGE TO ARRIVE */
if (ORTE_SUCCESS != (ret = orte_grpcomm.xcast_gate(orte_gpr.deliver_notify_msg))) {
ORTE_ERROR_LOG(ret);
error = "ompi_mpi_init: failed to see all procs register\n";
goto error;
}
/* check for timing request - get start time */
if (timing) {
gettimeofday(&ompistop, NULL);
opal_output(0, "ompi_mpi_init[%ld]: time to execute xcast %ld usec",
(long)ORTE_PROC_MY_NAME->vpid,
(long int)((ompistop.tv_sec - ompistart.tv_sec)*1000000 +
(ompistop.tv_usec - ompistart.tv_usec)));
gettimeofday(&ompistart, NULL);
}
/* Fill in remote proc information */
if (OMPI_SUCCESS != (ret = ompi_proc_get_info())) {
ORTE_ERROR_LOG(ret);
@@ -655,37 +615,12 @@ int ompi_mpi_init(int argc, char **argv, int requested, int *provided)
orte_system_info.nodename);
}
/* Let system know we are at STG2 Barrier */
if (ORTE_SUCCESS != (ret = orte_smr.set_proc_state(orte_process_info.my_name,
ORTE_PROC_STATE_AT_STG2, 0))) {
ORTE_ERROR_LOG(ret);
error = "set process state failed";
/* wait for everyone to reach this point */
if (OMPI_SUCCESS != (ret = orte_grpcomm.barrier())) {
error = "orte_grpcomm_barrier failed";
goto error;
}
/* check for timing request - get stop time and report elapsed time if so */
if (timing) {
gettimeofday(&ompistop, NULL);
opal_output(0, "ompi_mpi_init[%ld]: time from stage1 to stage2 %ld usec",
(long)ORTE_PROC_MY_NAME->vpid,
(long int)((ompistop.tv_sec - ompistart.tv_sec)*1000000 +
(ompistop.tv_usec - ompistart.tv_usec)));
}
/* Second barrier -- wait for XCAST STG2 MESSAGE to arrive */
if (ORTE_SUCCESS != (ret = orte_grpcomm.xcast_gate(orte_gpr.deliver_notify_msg))) {
ORTE_ERROR_LOG(ret);
error = "ompi_mpi_init: failed to see all procs register\n";
goto error;
}
/* check for timing request - get start time */
if (timing) {
gettimeofday(&ompistart, NULL);
}
/* wire up the oob interface, if requested. Do this here because
it will go much faster before the event library is switched
into non-blocking mode */

@@ -52,6 +52,7 @@ bool ompi_mpi_paffinity_alone = false;
bool ompi_mpi_abort_print_stack = false;
int ompi_mpi_abort_delay = 0;
bool ompi_mpi_keep_peer_hostnames = true;
bool ompi_mpi_keep_fqdn_hostnames = false;
bool ompi_mpi_leave_pinned = false;
bool ompi_mpi_leave_pinned_pipeline = false;
bool ompi_have_sparse_group_storage = OPAL_INT_TO_BOOL(OMPI_GROUP_SPARSE);
@@ -165,6 +166,11 @@ int ompi_mpi_register_params(void)
false, false, 1, &value);
ompi_mpi_keep_peer_hostnames = OPAL_INT_TO_BOOL(value);
mca_base_param_reg_int_name("mpi", "keep_fqdn_hostnames",
"If nonzero, use the FQDN host name when saving hostnames. This can add quite a bit of memory usage to each MPI process.",
false, false, 1, &value);
ompi_mpi_keep_fqdn_hostnames = OPAL_INT_TO_BOOL(value);
/* MPI_ABORT controls */
mca_base_param_reg_int_name("mpi", "abort_delay",

@@ -106,6 +106,14 @@ OMPI_DECLSPEC extern bool ompi_mpi_paffinity_alone;
*/
OMPI_DECLSPEC extern bool ompi_mpi_keep_peer_hostnames;
/**
* Whether or not to use the FQDN for the peer hostnames. Storing
* FQDNs can eat up a good bit of memory as well as a lot of
* communication during startup - both can be reduced by storing
* just the short hostname instead of the FQDN
*/
OMPI_DECLSPEC extern bool ompi_mpi_keep_fqdn_hostnames;
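The short-versus-FQDN trade-off described above comes down to truncating at the first dot. A hypothetical helper (not part of this commit) sketches what storing the short form means:

```c
#include <assert.h>
#include <string.h>

/* Illustrative only: when keep_fqdn_hostnames is false, a name like
 * "node12.cluster.example.org" can be cut at the first '.' so only
 * "node12" is stored per peer. Names without a dot are left alone. */
static void shorten_hostname(char *name) {
    char *dot = strchr(name, '.');
    if (NULL != dot) {
        *dot = '\0';
    }
}
```

For a large job this saves the FQDN suffix once per peer in every process, which is where the memory and startup-communication savings come from.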
/**
* Whether an MPI_ABORT should print out a stack trace or not.
*/

@@ -138,24 +138,28 @@ static inline int opal_condition_timedwait(opal_condition_t *c,
absolute.tv_sec = abstime->tv_sec;
absolute.tv_usec = abstime->tv_nsec * 1000;
gettimeofday(&tv,NULL);
while (c->c_signaled == 0 &&
(tv.tv_sec <= absolute.tv_sec ||
(tv.tv_sec == absolute.tv_sec && tv.tv_usec < absolute.tv_usec))) {
opal_mutex_unlock(m);
opal_progress();
gettimeofday(&tv,NULL);
opal_mutex_lock(m);
if (c->c_signaled == 0) {
do {
opal_mutex_unlock(m);
opal_progress();
gettimeofday(&tv,NULL);
opal_mutex_lock(m);
} while (c->c_signaled == 0 &&
(tv.tv_sec <= absolute.tv_sec ||
(tv.tv_sec == absolute.tv_sec && tv.tv_usec < absolute.tv_usec)));
}
#endif
} else {
absolute.tv_sec = abstime->tv_sec;
absolute.tv_usec = abstime->tv_nsec * 1000;
gettimeofday(&tv,NULL);
while (c->c_signaled == 0 &&
(tv.tv_sec <= absolute.tv_sec ||
(tv.tv_sec == absolute.tv_sec && tv.tv_usec < absolute.tv_usec))) {
opal_progress();
gettimeofday(&tv,NULL);
if (c->c_signaled == 0) {
do {
opal_progress();
gettimeofday(&tv,NULL);
} while (c->c_signaled == 0 &&
(tv.tv_sec <= absolute.tv_sec ||
(tv.tv_sec == absolute.tv_sec && tv.tv_usec < absolute.tv_usec)));
}
}
@@ -163,7 +167,7 @@ static inline int opal_condition_timedwait(opal_condition_t *c,
m->m_lock_debug++;
#endif
c->c_signaled--;
if (c->c_signaled != 0) c->c_signaled--;
c->c_waiting--;
return rc;
}

@@ -53,7 +53,6 @@ int orte_gpr_replica_recv_put_cmd(orte_buffer_t *buffer, orte_buffer_t *answer)
cnt = 1;
if (ORTE_SUCCESS != (rc = orte_dss.unpack(buffer, &num_values, &cnt, ORTE_STD_CNTR))) {
ORTE_ERROR_LOG(rc);
free(values);
ret = rc;
goto RETURN_ERROR;
}

@@ -39,7 +39,6 @@ BEGIN_C_DECLS
* globals needed within component
*/
typedef struct {
int output;
int xcast_linear_xover;
int xcast_binomial_xover;
orte_std_cntr_t num_active;
@@ -67,10 +66,6 @@ int orte_grpcomm_basic_finalize(void);
* xcast interfaces
*/
void orte_ns_replica_recv(int status, orte_process_name_t* sender,
orte_buffer_t* buffer, orte_rml_tag_t tag, void* cbdata);
ORTE_MODULE_DECLSPEC extern orte_grpcomm_base_component_t mca_grpcomm_basic_component;
extern orte_grpcomm_base_module_t orte_grpcomm_basic_module;

@@ -47,8 +47,13 @@
#include "grpcomm_basic.h"
#define XCAST_LINEAR_XOVER_DEFAULT 10
#define XCAST_BINOMIAL_XOVER_DEFAULT INT_MAX
/* set the default xovers to always force linear
* this is a tmp workaround for a problem in the
* rml that prevents the daemons from sending
* messages to their local procs
*/
#define XCAST_LINEAR_XOVER_DEFAULT 2
#define XCAST_BINOMIAL_XOVER_DEFAULT 16
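The crossover defaults above select an xcast algorithm from the daemon count. A small illustrative helper - the mode names and function are assumptions for this sketch, not the component's actual code - shows the selection logic:

```c
#include <assert.h>

/* Hypothetical mode names; the component compares the daemon count
 * against two crossover points to pick a broadcast algorithm. */
typedef enum { XCAST_DIRECT, XCAST_LINEAR, XCAST_BINOMIAL } xcast_mode_t;

static xcast_mode_t pick_xcast_mode(long num_daemons,
                                    long linear_xover, long binomial_xover) {
    if (num_daemons < linear_xover) {
        return XCAST_DIRECT;       /* so few daemons we just send directly */
    }
    if (num_daemons < binomial_xover) {
        return XCAST_LINEAR;       /* mid-size jobs: one send per daemon */
    }
    return XCAST_BINOMIAL;         /* large jobs: log-depth tree fan-out */
}
```

With the workaround defaults of 2 and 16 above, jobs with a single daemon go direct, 2-15 daemons go linear, and only 16 or more use the binomial tree.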
/*
@@ -81,7 +86,6 @@ orte_grpcomm_basic_globals_t orte_grpcomm_basic;
/* Open the component */
int orte_grpcomm_basic_open(void)
{
int value;
char *mode;
mca_base_component_t *c = &mca_grpcomm_basic_component.grpcomm_version;
@@ -90,16 +94,6 @@ int orte_grpcomm_basic_open(void)
OBJ_CONSTRUCT(&orte_grpcomm_basic.cond, opal_condition_t);
orte_grpcomm_basic.num_active = 0;
/* register parameters */
mca_base_param_reg_int(c, "verbose",
"Verbosity level for the grpcomm basic component",
false, false, 0, &value);
if (value != 0) {
orte_grpcomm_basic.output = opal_output_open(NULL);
} else {
orte_grpcomm_basic.output = -1;
}
mca_base_param_reg_int(c, "xcast_linear_xover",
"Number of daemons where use of linear xcast mode is to begin",
false, false, XCAST_LINEAR_XOVER_DEFAULT, &orte_grpcomm_basic.xcast_linear_xover);

@@ -40,25 +40,9 @@
#include "orte/mca/rml/rml.h"
#include "orte/runtime/params.h"
#include "orte/mca/grpcomm/base/base.h"
#include "grpcomm_basic.h"
/* API functions */
static int xcast_nb(orte_jobid_t job,
orte_buffer_t *buffer,
orte_rml_tag_t tag);
static int xcast(orte_jobid_t job,
orte_buffer_t *buffer,
orte_rml_tag_t tag);
static int xcast_gate(orte_gpr_trigger_cb_fn_t cbfunc);
orte_grpcomm_base_module_t orte_grpcomm_basic_module = {
xcast,
xcast_nb,
xcast_gate
};
/* Local functions */
static int xcast_binomial_tree(orte_jobid_t job,
orte_buffer_t *buffer,
@@ -108,7 +92,10 @@ static int xcast_nb(orte_jobid_t job,
struct timeval start, stop;
orte_vpid_t num_daemons;
opal_output(orte_grpcomm_basic.output, "oob_xcast_nb: sent to job %ld tag %ld", (long)job, (long)tag);
OPAL_OUTPUT_VERBOSE((1, orte_grpcomm_base_output,
"%s xcast_nb sent to job %ld tag %ld",
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
(long)job, (long)tag));
/* if there is no message to send, then just return ok */
if (NULL == buffer) {
@@ -127,9 +114,11 @@ static int xcast_nb(orte_jobid_t job,
return rc;
}
opal_output(orte_grpcomm_basic.output, "oob_xcast_nb: num_daemons %ld linear xover: %ld binomial xover: %ld",
(long)num_daemons, (long)orte_grpcomm_basic.xcast_linear_xover,
(long)orte_grpcomm_basic.xcast_binomial_xover);
OPAL_OUTPUT_VERBOSE((1, orte_grpcomm_base_output,
"%s xcast_nb: num_daemons %ld linear xover: %ld binomial xover: %ld",
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
(long)num_daemons, (long)orte_grpcomm_basic.xcast_linear_xover,
(long)orte_grpcomm_basic.xcast_binomial_xover));
if (num_daemons < 2) {
/* if there is only one daemon in the system, then we must
@@ -183,7 +172,10 @@ static int xcast(orte_jobid_t job,
struct timeval start, stop;
orte_vpid_t num_daemons;
opal_output(orte_grpcomm_basic.output, "oob_xcast: sent to job %ld tag %ld", (long)job, (long)tag);
OPAL_OUTPUT_VERBOSE((1, orte_grpcomm_base_output,
"%s xcast sent to job %ld tag %ld",
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
(long)job, (long)tag));
/* if there is no message to send, then just return ok */
if (NULL == buffer) {
@ -202,9 +194,11 @@ static int xcast(orte_jobid_t job,
return rc;
}
opal_output(orte_grpcomm_basic.output, "oob_xcast: num_daemons %ld linear xover: %ld binomial xover: %ld",
(long)num_daemons, (long)orte_grpcomm_basic.xcast_linear_xover,
(long)orte_grpcomm_basic.xcast_binomial_xover);
OPAL_OUTPUT_VERBOSE((1, orte_grpcomm_base_output,
"%s xcast: num_daemons %ld linear xover: %ld binomial xover: %ld",
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
(long)num_daemons, (long)orte_grpcomm_basic.xcast_linear_xover,
(long)orte_grpcomm_basic.xcast_binomial_xover));
if (num_daemons < 2) {
/* if there is only one daemon in the system, then we must
@ -263,10 +257,13 @@ static int xcast_binomial_tree(orte_jobid_t job,
int rc;
orte_process_name_t target;
orte_buffer_t *buf;
orte_vpid_t num_daemons;
opal_output(orte_grpcomm_basic.output, "oob_xcast_mode: binomial");
orte_vpid_t nd;
orte_std_cntr_t num_daemons;
OPAL_OUTPUT_VERBOSE((1, orte_grpcomm_base_output,
"%s xcast_binomial",
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME)));
/* this is the HNP end, so it starts the procedure. Since the HNP is always the
* vpid=0 at this time, we take advantage of that fact to figure out who we
* should send this to on the first step
@ -288,10 +285,11 @@ static int xcast_binomial_tree(orte_jobid_t job,
/* get the number of daemons currently in the system and tell the daemon so
* it can properly route
*/
if (ORTE_SUCCESS != (rc = orte_ns.get_vpid_range(0, &num_daemons))) {
if (ORTE_SUCCESS != (rc = orte_ns.get_vpid_range(0, &nd))) {
ORTE_ERROR_LOG(rc);
goto CLEANUP;
}
num_daemons = (orte_std_cntr_t)nd;
if (ORTE_SUCCESS != (rc = orte_dss.pack(buf, &num_daemons, 1, ORTE_STD_CNTR))) {
ORTE_ERROR_LOG(rc);
goto CLEANUP;
@ -331,10 +329,10 @@ static int xcast_binomial_tree(orte_jobid_t job,
goto CLEANUP;
}
if (orte_timing) {
opal_output(0, "xcast %s: mode binomial buffer size %ld",
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME), (long)buf->bytes_used);
}
OPAL_OUTPUT_VERBOSE((2, orte_grpcomm_base_output,
"%s xcast_binomial: buffer size %ld",
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
(long)buf->bytes_used));
/* all we need to do is send this to ourselves - our relay logic
* will ensure everyone else gets it!
@ -343,8 +341,12 @@ static int xcast_binomial_tree(orte_jobid_t job,
target.vpid = 0;
++orte_grpcomm_basic.num_active;
opal_output(orte_grpcomm_basic.output, "%s xcast to %s",
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME), ORTE_NAME_PRINT(&target));
OPAL_OUTPUT_VERBOSE((2, orte_grpcomm_base_output,
"xcast_binomial: num_active now %ld sending %s => %s",
(long)orte_grpcomm_basic.num_active,
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
ORTE_NAME_PRINT(&target)));
if (0 > (rc = orte_rml.send_buffer_nb(&target, buf, ORTE_RML_TAG_ORTED_ROUTED,
0, xcast_send_cb, NULL))) {
ORTE_ERROR_LOG(ORTE_ERR_COMM_FAILURE);
@ -371,8 +373,10 @@ static int xcast_linear(orte_jobid_t job,
orte_vpid_t i, range;
orte_process_name_t dummy;
opal_output(orte_grpcomm_basic.output, "oob_xcast_mode: linear");
OPAL_OUTPUT_VERBOSE((1, orte_grpcomm_base_output,
"%s xcast_linear",
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME)));
/* since we have to pack some additional info into the buffer to be
* sent to the daemons, we create a new buffer into which we will
* put the intermediate payload - i.e., the info that goes to the
@ -421,10 +425,10 @@ static int xcast_linear(orte_jobid_t job,
goto CLEANUP;
}
if (orte_timing) {
opal_output(0, "xcast %s: mode linear buffer size %ld",
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME), (long)buf->bytes_used);
}
OPAL_OUTPUT_VERBOSE((2, orte_grpcomm_base_output,
"%s xcast_linear: buffer size %ld",
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
(long)buf->bytes_used));
/* get the number of daemons out there */
orte_ns.get_vpid_range(0, &range);
@ -438,12 +442,19 @@ static int xcast_linear(orte_jobid_t job,
orte_grpcomm_basic.num_active += range;
OPAL_THREAD_UNLOCK(&orte_grpcomm_basic.mutex);
OPAL_OUTPUT_VERBOSE((1, orte_grpcomm_base_output,
"%s xcast_linear: num_active now %ld",
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
(long)orte_grpcomm_basic.num_active));
/* send the message to each daemon as fast as we can */
dummy.jobid = 0;
for (i=0; i < range; i++) {
dummy.vpid = i;
opal_output(orte_grpcomm_basic.output, "%s xcast to %s",
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME), ORTE_NAME_PRINT(&dummy));
OPAL_OUTPUT_VERBOSE((2, orte_grpcomm_base_output,
"xcast_linear: %s => %s",
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
ORTE_NAME_PRINT(&dummy)));
if (0 > (rc = orte_rml.send_buffer_nb(&dummy, buf, ORTE_RML_TAG_ORTED_ROUTED,
0, xcast_send_cb, NULL))) {
if (ORTE_ERR_ADDRESSEE_UNKNOWN != rc) {
@ -479,8 +490,10 @@ static int xcast_direct(orte_jobid_t job,
opal_list_t attrs;
opal_list_item_t *item;
opal_output(orte_grpcomm_basic.output, "oob_xcast_mode: direct");
OPAL_OUTPUT_VERBOSE((1, orte_grpcomm_base_output,
"%s xcast_direct",
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME)));
/* need to get the job peers so we know who to send the message to */
OBJ_CONSTRUCT(&attrs, opal_list_t);
orte_rmgr.add_attribute(&attrs, ORTE_NS_USE_JOBID, ORTE_JOBID, &job, ORTE_RMGR_ATTR_OVERRIDE);
@ -493,10 +506,10 @@ static int xcast_direct(orte_jobid_t job,
OBJ_RELEASE(item);
OBJ_DESTRUCT(&attrs);
if (orte_timing) {
opal_output(0, "xcast %s: mode direct buffer size %ld",
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME), (long)buffer->bytes_used);
}
OPAL_OUTPUT_VERBOSE((2, orte_grpcomm_base_output,
"%s xcast_direct: buffer size %ld",
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
(long)buffer->bytes_used));
/* we have to account for all of the messages we are about to send
* because the non-blocking send can come back almost immediately - before
@ -507,9 +520,16 @@ static int xcast_direct(orte_jobid_t job,
orte_grpcomm_basic.num_active += n;
OPAL_THREAD_UNLOCK(&orte_grpcomm_basic.mutex);
opal_output(orte_grpcomm_basic.output, "oob_xcast_direct: num_active now %ld", (long)orte_grpcomm_basic.num_active);
OPAL_OUTPUT_VERBOSE((1, orte_grpcomm_base_output,
"%s xcast_direct: num_active now %ld",
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
(long)orte_grpcomm_basic.num_active));
for(i=0; i<n; i++) {
OPAL_OUTPUT_VERBOSE((2, orte_grpcomm_base_output,
"xcast_direct: %s => %s",
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
ORTE_NAME_PRINT(peers+i)));
if (0 > (rc = orte_rml.send_buffer_nb(peers+i, buffer, tag, 0, xcast_send_cb, NULL))) {
if (ORTE_ERR_ADDRESSEE_UNKNOWN != rc) {
ORTE_ERROR_LOG(ORTE_ERR_COMM_FAILURE);
@ -533,35 +553,264 @@ CLEANUP:
return rc;
}
static int xcast_gate(orte_gpr_trigger_cb_fn_t cbfunc)
static int allgather(orte_buffer_t *sbuf, orte_buffer_t *rbuf)
{
orte_process_name_t name;
int rc;
orte_std_cntr_t i;
orte_buffer_t rbuf;
orte_gpr_notify_message_t *mesg;
orte_buffer_t tmpbuf;
OBJ_CONSTRUCT(&rbuf, orte_buffer_t);
rc = orte_rml.recv_buffer(ORTE_NAME_WILDCARD, &rbuf, ORTE_RML_TAG_XCAST_BARRIER, 0);
if(rc < 0) {
OBJ_DESTRUCT(&rbuf);
/* everything happens within my jobid */
name.jobid = ORTE_PROC_MY_NAME->jobid;
if (0 != ORTE_PROC_MY_NAME->vpid) {
/* everyone but rank=0 sends data */
name.vpid = 0;
if (0 > orte_rml.send_buffer(&name, sbuf, ORTE_RML_TAG_ALLGATHER, 0)) {
ORTE_ERROR_LOG(ORTE_ERR_COMM_FAILURE);
return ORTE_ERR_COMM_FAILURE;
}
OPAL_OUTPUT_VERBOSE((2, orte_grpcomm_base_output,
"%s allgather buffer sent",
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME)));
/* now receive the final result from rank=0 */
if (0 > orte_rml.recv_buffer(ORTE_NAME_WILDCARD, rbuf, ORTE_RML_TAG_ALLGATHER, 0)) {
ORTE_ERROR_LOG(ORTE_ERR_COMM_FAILURE);
return ORTE_ERR_COMM_FAILURE;
}
OPAL_OUTPUT_VERBOSE((2, orte_grpcomm_base_output,
"%s allgather buffer received",
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME)));
return ORTE_SUCCESS;
}
/* seed the outgoing buffer with the num_procs so it can be unpacked */
if (ORTE_SUCCESS != (rc = orte_dss.pack(rbuf, &orte_process_info.num_procs, 1, ORTE_STD_CNTR))) {
ORTE_ERROR_LOG(rc);
return rc;
}
if (cbfunc != NULL) {
mesg = OBJ_NEW(orte_gpr_notify_message_t);
if (NULL == mesg) {
ORTE_ERROR_LOG(ORTE_ERR_OUT_OF_RESOURCE);
return ORTE_ERR_OUT_OF_RESOURCE;
/* put my own information into the outgoing buffer */
if (ORTE_SUCCESS != (rc = orte_dss.copy_payload(rbuf, sbuf))) {
ORTE_ERROR_LOG(rc);
return rc;
}
OPAL_OUTPUT_VERBOSE((2, orte_grpcomm_base_output,
"%s allgather collecting buffers",
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME)));
/* rank=0 receives everyone else's data */
for (i=1; i < orte_process_info.num_procs; i++) {
name.vpid = (orte_vpid_t)i;
OBJ_CONSTRUCT(&tmpbuf, orte_buffer_t);
if (0 > orte_rml.recv_buffer(&name, &tmpbuf, ORTE_RML_TAG_ALLGATHER, 0)) {
ORTE_ERROR_LOG(ORTE_ERR_COMM_FAILURE);
return ORTE_ERR_COMM_FAILURE;
}
i=1;
if (ORTE_SUCCESS != (rc = orte_dss.unpack(&rbuf, &mesg, &i, ORTE_GPR_NOTIFY_MSG))) {
OPAL_OUTPUT_VERBOSE((2, orte_grpcomm_base_output,
"%s allgather buffer %ld received",
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME), (long)i));
/* append this data to the rbuf */
if (ORTE_SUCCESS != (rc = orte_dss.copy_payload(rbuf, &tmpbuf))) {
ORTE_ERROR_LOG(rc);
OBJ_RELEASE(mesg);
return rc;
}
cbfunc(mesg);
OBJ_RELEASE(mesg);
/* clear out the tmpbuf */
OBJ_DESTRUCT(&tmpbuf);
}
OBJ_DESTRUCT(&rbuf);
OPAL_OUTPUT_VERBOSE((2, orte_grpcomm_base_output,
"%s allgather xcasting collected data",
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME)));
/* xcast the results */
orte_grpcomm.xcast(ORTE_PROC_MY_NAME->jobid, rbuf, ORTE_RML_TAG_ALLGATHER);
/* xcast automatically ensures that the sender -always- gets a copy
* of the message. This is required to ensure proper operation of the
* launch system as the HNP -must- get a copy itself. So we have to
* post our own receive here so that we don't leave a message rattling
* around in our RML
*/
OBJ_CONSTRUCT(&tmpbuf, orte_buffer_t);
if (0 > (rc = orte_rml.recv_buffer(ORTE_NAME_WILDCARD, &tmpbuf, ORTE_RML_TAG_ALLGATHER, 0))) {
ORTE_ERROR_LOG(rc);
return rc;
}
OPAL_OUTPUT_VERBOSE((2, orte_grpcomm_base_output,
"%s allgather buffer received",
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME)));
/* don't need the received buffer - we already have what we need */
OBJ_DESTRUCT(&tmpbuf);
return ORTE_SUCCESS;
}
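The allgather above follows a simple centralized pattern: every non-root rank sends its payload to vpid 0, vpid 0 concatenates the payloads in rank order (prefixed with the process count), then xcasts the combined buffer back to everyone. A minimal single-process sketch of that collection step, with fixed-size entries standing in for the packed buffers (the sizes and names here are illustrative, not ORTE's):

```c
#include <string.h>

#define NPROCS    4
#define ENTRY_LEN 8

/* Root-side collection step of the centralized allgather: copy each
 * rank's contribution into the combined buffer in rank order, then
 * hand the same combined buffer to every rank (the "xcast"). */
static void toy_allgather(const char send[NPROCS][ENTRY_LEN],
                          char recv[NPROCS][ENTRY_LEN])
{
    char combined[NPROCS][ENTRY_LEN];

    for (int i = 0; i < NPROCS; i++) {        /* root gathers */
        memcpy(combined[i], send[i], ENTRY_LEN);
    }
    memcpy(recv, combined, sizeof combined);  /* release to all */
}
```

The real component interleaves this with RML sends/receives and packs a leading process count so receivers know how many entries to unpack.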
static int allgather_list(opal_list_t *names, orte_buffer_t *sbuf, orte_buffer_t *rbuf)
{
opal_list_item_t *item;
orte_namelist_t *peer, *root;
orte_std_cntr_t i, num_peers;
orte_buffer_t tmpbuf;
int rc;
/* the first entry on the list is the "root" that collects
* all the data - everyone else just sends and gets back
* the results
*/
root = (orte_namelist_t*)opal_list_get_first(names);
if (ORTE_EQUAL != orte_dss.compare(root->name, ORTE_PROC_MY_NAME, ORTE_NAME)) {
/* everyone but root sends data */
if (0 > orte_rml.send_buffer(root->name, sbuf, ORTE_RML_TAG_ALLGATHER_LIST, 0)) {
ORTE_ERROR_LOG(ORTE_ERR_COMM_FAILURE);
return ORTE_ERR_COMM_FAILURE;
}
/* now receive the final result */
if (0 > orte_rml.recv_buffer(root->name, rbuf, ORTE_RML_TAG_ALLGATHER_LIST, 0)) {
ORTE_ERROR_LOG(ORTE_ERR_COMM_FAILURE);
return ORTE_ERR_COMM_FAILURE;
}
return ORTE_SUCCESS;
}
/* count how many peers are participating, including myself */
num_peers = (orte_std_cntr_t)opal_list_get_size(names);
/* seed the outgoing buffer with the num_procs so it can be unpacked */
if (ORTE_SUCCESS != (rc = orte_dss.pack(rbuf, &num_peers, 1, ORTE_STD_CNTR))) {
ORTE_ERROR_LOG(rc);
return rc;
}
/* put my own information into the outgoing buffer */
if (ORTE_SUCCESS != (rc = orte_dss.copy_payload(rbuf, sbuf))) {
ORTE_ERROR_LOG(rc);
return rc;
}
/* root receives everyone else's data */
for (i=1; i < num_peers; i++) {
/* receive the buffer from this process */
OBJ_CONSTRUCT(&tmpbuf, orte_buffer_t);
if (0 > orte_rml.recv_buffer(ORTE_NAME_WILDCARD, &tmpbuf, ORTE_RML_TAG_ALLGATHER_LIST, 0)) {
ORTE_ERROR_LOG(ORTE_ERR_COMM_FAILURE);
return ORTE_ERR_COMM_FAILURE;
}
/* append this data to the rbuf */
if (ORTE_SUCCESS != (rc = orte_dss.copy_payload(rbuf, &tmpbuf))) {
ORTE_ERROR_LOG(rc);
return rc;
}
/* clear out the tmpbuf */
OBJ_DESTRUCT(&tmpbuf);
}
/* broadcast the results */
for (item = opal_list_get_first(names);
item != opal_list_get_end(names);
item = opal_list_get_next(item)) {
peer = (orte_namelist_t*)item;
/* skip myself */
if (ORTE_EQUAL == orte_dss.compare(root->name, peer->name, ORTE_NAME)) {
continue;
}
/* transmit the buffer to this process */
if (0 > orte_rml.send_buffer(peer->name, rbuf, ORTE_RML_TAG_ALLGATHER_LIST, 0)) {
ORTE_ERROR_LOG(ORTE_ERR_COMM_FAILURE);
return ORTE_ERR_COMM_FAILURE;
}
}
return ORTE_SUCCESS;
}
static int barrier(void)
{
orte_process_name_t name;
orte_std_cntr_t i;
orte_buffer_t buf;
int rc;
/* everything happens within the same jobid */
name.jobid = ORTE_PROC_MY_NAME->jobid;
/* All non-root ranks send and receive a zero-length message. */
if (0 != ORTE_PROC_MY_NAME->vpid) {
name.vpid = 0;
OBJ_CONSTRUCT(&buf, orte_buffer_t);
i=0;
orte_dss.pack(&buf, &i, 1, ORTE_STD_CNTR); /* put something meaningless here */
rc = orte_rml.send_buffer(&name,&buf,ORTE_RML_TAG_BARRIER,0);
if (rc < 0) {
ORTE_ERROR_LOG(rc);
return rc;
}
OBJ_DESTRUCT(&buf);
/* get the release from rank=0 */
OBJ_CONSTRUCT(&buf, orte_buffer_t);
rc = orte_rml.recv_buffer(ORTE_NAME_WILDCARD,&buf,ORTE_RML_TAG_BARRIER,0);
if (rc < 0) {
ORTE_ERROR_LOG(rc);
return rc;
}
OBJ_DESTRUCT(&buf);
return ORTE_SUCCESS;
}
for (i = 1; i < orte_process_info.num_procs; i++) {
name.vpid = (orte_vpid_t)i;
OBJ_CONSTRUCT(&buf, orte_buffer_t);
OPAL_OUTPUT_VERBOSE((2, orte_grpcomm_base_output,
"%s barrier %ld received",
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME), (long)i));
rc = orte_rml.recv_buffer(&name,&buf,ORTE_RML_TAG_BARRIER,0);
if (rc < 0) {
ORTE_ERROR_LOG(rc);
return rc;
}
OBJ_DESTRUCT(&buf);
}
OPAL_OUTPUT_VERBOSE((2, orte_grpcomm_base_output,
"%s barrier xcasting release",
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME)));
/* xcast the release */
OBJ_CONSTRUCT(&buf, orte_buffer_t);
orte_dss.pack(&buf, &i, 1, ORTE_STD_CNTR); /* put something meaningless here */
orte_grpcomm.xcast(ORTE_PROC_MY_NAME->jobid, &buf, ORTE_RML_TAG_BARRIER);
OBJ_DESTRUCT(&buf);
/* xcast automatically ensures that the sender -always- gets a copy
* of the message. This is required to ensure proper operation of the
* launch system as the HNP -must- get a copy itself. So we have to
* post our own receive here so that we don't leave a message rattling
* around in our RML
*/
OBJ_CONSTRUCT(&buf, orte_buffer_t);
if (0 > (rc = orte_rml.recv_buffer(ORTE_NAME_WILDCARD, &buf, ORTE_RML_TAG_BARRIER, 0))) {
ORTE_ERROR_LOG(rc);
return rc;
}
OPAL_OUTPUT_VERBOSE((2, orte_grpcomm_base_output,
"%s barrier release received",
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME)));
OBJ_DESTRUCT(&buf);
return ORTE_SUCCESS;
}
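The barrier implements the same fan-in/fan-out shape: each non-root rank sends one message to vpid 0, vpid 0 counts arrivals, and once all num_procs-1 peers have reported it xcasts a single release that every rank (vpid 0 included, since xcast always delivers a copy to the sender) consumes. A toy single-process model of that message accounting (the struct and counts are illustrative, not ORTE's):

```c
#include <stdbool.h>

#define NPROCS 4

/* State held (conceptually) by vpid 0. */
typedef struct {
    int arrivals;          /* barrier messages received from peers  */
    int releases_pending;  /* xcast release copies not yet consumed */
} toy_barrier_t;

/* A rank reaches the barrier: non-roots "send" one message to vpid 0.
 * When the last peer arrives, the release is "xcast" to all NPROCS
 * ranks - including vpid 0, which must drain its own copy. */
static void toy_barrier_arrive(toy_barrier_t *b, int vpid)
{
    if (vpid != 0) {
        b->arrivals++;
    }
    if (b->arrivals == NPROCS - 1) {
        b->releases_pending = NPROCS;
    }
}

/* A rank consumes its release copy; true once the release happened. */
static bool toy_barrier_released(toy_barrier_t *b)
{
    if (b->releases_pending > 0) {
        b->releases_pending--;
        return true;
    }
    return false;
}
```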
orte_grpcomm_base_module_t orte_grpcomm_basic_module = {
xcast,
xcast_nb,
allgather,
allgather_list,
barrier
};
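For scale, this component crosses over from direct to linear to binomial xcast as the daemon count grows (the xcast_linear_xover / xcast_binomial_xover params above). A binomial fan-out doubles the number of message holders each round, so full coverage takes ceil(log2(num_daemons)) rounds rather than num_daemons-1 sequential sends from the root. A small sketch of that round count (the relay schedule itself is an assumed shape, not ORTE's exact routing code):

```c
/* Rounds needed for a binomial broadcast: the root holds the message
 * initially, and every current holder relays to one new daemon per
 * round, doubling coverage each time. */
static int binomial_rounds(int num_daemons)
{
    int rounds = 0;
    int covered = 1;               /* vpid 0 (the root) starts with it */
    while (covered < num_daemons) {
        covered *= 2;              /* each holder forwards once */
        rounds++;
    }
    return rounds;
}
```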


@ -25,20 +25,10 @@
#include <sys/time.h>
#endif /* HAVE_SYS_TIME_H */
#include "opal/threads/condition.h"
#include "opal/util/output.h"
#include "opal/util/bit_ops.h"
#include "orte/util/proc_info.h"
#include "orte/dss/dss.h"
#include "orte/mca/gpr/gpr.h"
#include "orte/mca/ns/ns_types.h"
#include "orte/mca/errmgr/errmgr.h"
#include "orte/mca/ns/ns.h"
#include "orte/mca/rmgr/rmgr.h"
#include "orte/mca/smr/smr.h"
#include "orte/mca/odls/odls_types.h"
#include "orte/mca/rml/rml.h"
#include "orte/runtime/params.h"
#include "orte/mca/rml/rml_types.h"
#include "grpcomm_cnos.h"
@ -55,14 +45,18 @@ static int xcast(orte_jobid_t job,
orte_buffer_t *buffer,
orte_rml_tag_t tag);
static int xcast_gate(orte_gpr_trigger_cb_fn_t cbfunc);
static int orte_grpcomm_cnos_barrier(void);
static int allgather(orte_buffer_t *sbuf, orte_buffer_t *rbuf);
static int allgather_list(opal_list_t *names, orte_buffer_t *sbuf, orte_buffer_t *rbuf);
orte_grpcomm_base_module_t orte_grpcomm_cnos_module = {
xcast,
xcast_nb,
xcast_gate
allgather,
allgather_list,
orte_grpcomm_cnos_barrier
};
@ -88,26 +82,6 @@ static int xcast(orte_jobid_t job,
return ORTE_SUCCESS;
}
static int xcast_gate(orte_gpr_trigger_cb_fn_t cbfunc)
{
int rc = ORTE_SUCCESS;
orte_grpcomm_cnos_barrier();
if (NULL != cbfunc) {
orte_gpr_notify_message_t *msg;
msg = OBJ_NEW(orte_gpr_notify_message_t);
if (NULL == msg) {
ORTE_ERROR_LOG(ORTE_ERR_OUT_OF_RESOURCE);
return ORTE_ERR_OUT_OF_RESOURCE;
}
cbfunc(msg);
OBJ_RELEASE(msg);
}
return rc;
}
static int
orte_grpcomm_cnos_barrier(void)
{
@ -117,3 +91,29 @@ orte_grpcomm_cnos_barrier(void)
return ORTE_SUCCESS;
}
static int allgather(orte_buffer_t *sbuf, orte_buffer_t *rbuf)
{
int rc;
orte_std_cntr_t zero=0;
/* seed the outgoing buffer with num_procs=0 so it won't be unpacked */
if (ORTE_SUCCESS != (rc = orte_dss.pack(rbuf, &zero, 1, ORTE_STD_CNTR))) {
ORTE_ERROR_LOG(rc);
return rc;
}
return rc;
}
static int allgather_list(opal_list_t *names, orte_buffer_t *sbuf, orte_buffer_t *rbuf)
{
int rc;
orte_std_cntr_t zero=0;
/* seed the outgoing buffer with num_procs=0 so it won't be unpacked */
if (ORTE_SUCCESS != (rc = orte_dss.pack(rbuf, &zero, 1, ORTE_STD_CNTR))) {
ORTE_ERROR_LOG(rc);
return rc;
}
return rc;
}


@ -38,6 +38,7 @@
#include "orte/orte_types.h"
#include "opal/mca/mca.h"
#include "opal/class/opal_list.h"
#include "orte/dss/dss_types.h"
#include "orte/mca/gpr/gpr_types.h"
@ -63,16 +64,24 @@ typedef int (*orte_grpcomm_base_module_xcast_nb_fn_t)(orte_jobid_t job,
orte_buffer_t *buffer,
orte_rml_tag_t tag);
/* Wait for receipt of an xcast message */
typedef int (*orte_grpcomm_base_module_xcast_gate_fn_t)(orte_gpr_trigger_cb_fn_t cbfunc);
/* allgather - gather data from all procs */
typedef int (*orte_grpcomm_base_module_allgather_fn_t)(orte_buffer_t *sbuf, orte_buffer_t *rbuf);
typedef int (*orte_grpcomm_base_module_allgather_list_fn_t)(opal_list_t *names,
orte_buffer_t *sbuf, orte_buffer_t *rbuf);
/* barrier function */
typedef int (*orte_grpcomm_base_module_barrier_fn_t)(void);
/*
* Ver 2.0
*/
struct orte_grpcomm_base_module_2_0_0_t {
orte_grpcomm_base_module_xcast_fn_t xcast;
orte_grpcomm_base_module_xcast_nb_fn_t xcast_nb;
orte_grpcomm_base_module_xcast_gate_fn_t xcast_gate;
orte_grpcomm_base_module_xcast_fn_t xcast;
orte_grpcomm_base_module_xcast_nb_fn_t xcast_nb;
orte_grpcomm_base_module_allgather_fn_t allgather;
orte_grpcomm_base_module_allgather_list_fn_t allgather_list;
orte_grpcomm_base_module_barrier_fn_t barrier;
};
typedef struct orte_grpcomm_base_module_2_0_0_t orte_grpcomm_base_module_2_0_0_t;


@ -199,6 +199,20 @@ int orte_ns_base_open(void)
return rc;
}
tmp = ORTE_NODEID;
if (ORTE_SUCCESS != (rc = orte_dss.register_type(orte_ns_base_pack_nodeid,
orte_ns_base_unpack_nodeid,
(orte_dss_copy_fn_t)orte_ns_base_copy_nodeid,
(orte_dss_compare_fn_t)orte_ns_base_compare_nodeid,
(orte_dss_size_fn_t)orte_ns_base_std_size,
(orte_dss_print_fn_t)orte_ns_base_std_print,
(orte_dss_release_fn_t)orte_ns_base_std_release,
ORTE_DSS_UNSTRUCTURED,
"ORTE_NODEID", &tmp))) {
ORTE_ERROR_LOG(rc);
return rc;
}
/* Open up all available components */
if (ORTE_SUCCESS !=


@ -92,8 +92,10 @@ char* orte_ns_base_print_name_args(const orte_process_name_t *name)
if (NULL == name) {
snprintf(ptr->buffers[ptr->cntr++], ORTE_PRINT_NAME_ARGS_MAX_SIZE, "[NO-NAME]");
} else {
snprintf(ptr->buffers[ptr->cntr++], ORTE_PRINT_NAME_ARGS_MAX_SIZE, "[%ld,%ld]", (long)name->jobid, (long)name->vpid);
}
snprintf(ptr->buffers[ptr->cntr++],
ORTE_PRINT_NAME_ARGS_MAX_SIZE,
"[%ld,%ld]", ORTE_NAME_ARGS(name));
}
return ptr->buffers[ptr->cntr-1];
}


@ -106,6 +106,7 @@ int orte_ns_base_convert_string_to_process_name(orte_process_name_t **name,
/* check for error */
if (NULL == token) {
ORTE_ERROR_LOG(ORTE_ERR_BAD_PARAM);
return ORTE_ERR_BAD_PARAM;
}
@ -134,6 +135,7 @@ int orte_ns_base_convert_string_to_process_name(orte_process_name_t **name,
/* check for error */
if (NULL == token) {
ORTE_ERROR_LOG(ORTE_ERR_BAD_PARAM);
return ORTE_ERR_BAD_PARAM;
}


@ -57,9 +57,19 @@ typedef uint8_t orte_ns_cmd_flag_t;
* typedefs above and in ns_types.h
*/
#define ORTE_NS_CMD ORTE_INT8
#define ORTE_NODEID_T ORTE_INT32
#if ORTE_ENABLE_JUMBO_APPS
#define ORTE_JOBID_T ORTE_INT32
#define ORTE_VPID_T ORTE_INT32
#define ORTE_NODEID_T ORTE_INT32
#else
#define ORTE_JOBID_T ORTE_INT16
#define ORTE_VPID_T ORTE_INT16
#define ORTE_NODEID_T ORTE_INT16
#endif
/*
* define flag values for remote commands - only used internally


@ -55,11 +55,6 @@ extern "C" {
#define ORTE_NS_INCLUDE_DESCENDANTS "orte-ns-include-desc"
#define ORTE_NS_INCLUDE_CHILDREN "orte-ns-include-child"
#define ORTE_NS_USE_JOB_FAMILY "orte-ns-use-job-family"
#define ORTE_NAME_ARGS(n) \
(long) ((NULL == n) ? (long)-1 : (long)(n)->jobid), \
(long) ((NULL == n) ? (long)-1 : (long)(n)->vpid)
/*
@ -71,7 +66,11 @@ extern "C" {
#define ORTE_NS_CMP_VPID 0x04
#define ORTE_NS_CMP_ALL 0Xff
/*
#define ORTE_NAME_ARGS(n) \
(long) ((NULL == n) ? (long)-1 : (long)(n)->jobid), \
(long) ((NULL == n) ? (long)-1 : (long)(n)->vpid)
/*
* general typedefs & structures
*/
/** Set the allowed range for ids in each space
@ -81,11 +80,61 @@ extern "C" {
* HTON and NTOH macros below must be updated, as well as the MIN /
* MAX macros below and the datatype packing representations in
* ns_private.h
*
* NOTE: Be sure to keep the jobid and vpid types the same size! Due
* to padding rules, it won't save anything to have one larger than
* the other, and it will cause problems in the communication subsystems
*/
typedef orte_std_cntr_t orte_jobid_t;
typedef orte_std_cntr_t orte_nodeid_t;
typedef orte_std_cntr_t orte_vpid_t;
#if ORTE_ENABLE_JUMBO_APPS
typedef orte_std_cntr_t orte_jobid_t;
#define ORTE_JOBID_MAX ORTE_STD_CNTR_MAX
#define ORTE_JOBID_MIN ORTE_STD_CNTR_MIN
typedef orte_std_cntr_t orte_vpid_t;
#define ORTE_VPID_MAX ORTE_STD_CNTR_MAX
#define ORTE_VPID_MIN ORTE_STD_CNTR_MIN
typedef orte_std_cntr_t orte_nodeid_t;
#define ORTE_NODEID_MAX ORTE_STD_CNTR_MAX
#define ORTE_NODEID_MIN ORTE_STD_CNTR_MIN
#define ORTE_PROCESS_NAME_HTON(n) \
do { \
n.jobid = htonl(n.jobid); \
n.vpid = htonl(n.vpid); \
} while (0)
#define ORTE_PROCESS_NAME_NTOH(n) \
do { \
n.jobid = ntohl(n.jobid); \
n.vpid = ntohl(n.vpid); \
} while (0)
#else
typedef int16_t orte_jobid_t;
#define ORTE_JOBID_MAX INT16_MAX
#define ORTE_JOBID_MIN INT16_MIN
typedef int16_t orte_vpid_t;
#define ORTE_VPID_MAX INT16_MAX
#define ORTE_VPID_MIN INT16_MIN
typedef int16_t orte_nodeid_t;
#define ORTE_NODEID_MAX INT16_MAX
#define ORTE_NODEID_MIN INT16_MIN
#define ORTE_PROCESS_NAME_HTON(n) \
do { \
n.jobid = htons(n.jobid); \
n.vpid = htons(n.vpid); \
} while (0)
#define ORTE_PROCESS_NAME_NTOH(n) \
do { \
n.jobid = ntohs(n.jobid); \
n.vpid = ntohs(n.vpid); \
} while (0)
#endif
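The new HTON/NTOH macros wrap their two assignments in do { } while (0), where the versions they replace were bare statement pairs. The difference matters under an unbraced if: only the first statement of a bare pair is guarded by the condition. A minimal demonstration (the struct and macro names here are illustrative stand-ins, not the ORTE definitions):

```c
#include <arpa/inet.h>
#include <stdint.h>

typedef struct { uint32_t jobid; uint32_t vpid; } toy_name_t;

/* Bare statement pair: under "if (cond) UNSAFE_HTON(n)" only the
 * jobid conversion is conditional; the vpid conversion always runs. */
#define UNSAFE_HTON(n) n.jobid = htonl(n.jobid); n.vpid = htonl(n.vpid);

/* do/while(0) makes the macro a single statement, so the whole body
 * is governed by the if, and the caller supplies the trailing ';'. */
#define SAFE_HTON(n)                \
    do {                            \
        n.jobid = htonl(n.jobid);   \
        n.vpid = htonl(n.vpid);     \
    } while (0)
```

This is the standard "swallow the semicolon" idiom for multi-statement macros; it also prevents a dangling-else from binding to the wrong if.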
typedef uint8_t orte_ns_cmp_bitmask_t; /**< Bit mask for comparing process names */
struct orte_process_name_t {
@ -100,20 +149,6 @@ ORTE_DECLSPEC extern char* orte_ns_base_print_name_args(const orte_process_name_
#define ORTE_NAME_PRINT(n) \
orte_ns_base_print_name_args(n)
/*
* define maximum value for id's in any field
*/
#define ORTE_JOBID_MAX ORTE_STD_CNTR_MAX
#define ORTE_VPID_MAX ORTE_STD_CNTR_MAX
#define ORTE_NODEID_MAX ORTE_STD_CNTR_MAX
/*
* define minimum value for id's in any field
*/
#define ORTE_JOBID_MIN ORTE_STD_CNTR_MIN
#define ORTE_VPID_MIN ORTE_STD_CNTR_MIN
#define ORTE_NODEID_MIN ORTE_STD_CNTR_MIN
/*
* define invalid values
*/
@ -143,25 +178,6 @@ ORTE_DECLSPEC extern orte_process_name_t orte_ns_name_invalid; /** instantiated
#define ORTE_PROC_MY_HNP &orte_ns_name_my_hnp
ORTE_DECLSPEC extern orte_process_name_t orte_ns_name_my_hnp; /** instantiated in orte/mca/ns/base/ns_base_open.c */
/**
* Convert process name from host to network byte order.
*
* @param name
*/
#define ORTE_PROCESS_NAME_HTON(n) \
n.jobid = htonl(n.jobid); \
n.vpid = htonl(n.vpid);
/**
* Convert process name from network to host byte order.
*
* @param name
*/
#define ORTE_PROCESS_NAME_NTOH(n) \
n.jobid = ntohl(n.jobid); \
n.vpid = ntohl(n.vpid);
/** List of names for general use
*/
struct orte_namelist_t {


@ -16,6 +16,8 @@
# $HEADER$
#
dist_pkgdata_DATA = help-ns-replica.txt
sources = \
ns_replica.h \
ns_replica_class_instances.h \

orte/mca/ns/replica/help-ns-replica.txt (new file)

@ -0,0 +1,33 @@
# -*- text -*-
#
# Copyright (c) 2004-2006 The Trustees of Indiana University and Indiana
# University Research and Technology
# Corporation. All rights reserved.
# Copyright (c) 2004-2006 The University of Tennessee and The University
# of Tennessee Research Foundation. All rights
# reserved.
# Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
# University of Stuttgart. All rights reserved.
# Copyright (c) 2004-2005 The Regents of the University of California.
# All rights reserved.
# $COPYRIGHT$
#
# Additional copyrights may follow
#
# $HEADER$
#
[out-of-jobids]
The system has exhausted its available jobids - the application is attempting
to spawn too many jobs and will be aborted.
This may be resolved by increasing the number of available jobids by
re-configuring Open MPI with the --enable-jumbo-dynamics option, and then
re-running the application.
#
[out-of-vpids]
The system has exhausted its available ranks - the application is attempting
to spawn too many processes and will be aborted.
This may be resolved by increasing the number of available ranks by
re-configuring Open MPI with the --enable-jumbo-apps option, and then
re-running the application.


@ -25,6 +25,7 @@
#include "opal/threads/mutex.h"
#include "opal/util/output.h"
#include "opal/util/show_help.h"
#include "opal/util/trace.h"
#include "orte/dss/dss.h"
@ -50,6 +51,14 @@ int orte_ns_replica_create_jobid(orte_jobid_t *jobid, opal_list_t *attrs)
*jobid = ORTE_JOBID_INVALID;
/* is a jobid available, or are we at the max? */
if (ORTE_JOBID_MAX == orte_ns_replica.num_jobids) {
/* at max - alert user to situation */
opal_show_help("help-ns-replica.txt", "out-of-jobids", true);
OPAL_THREAD_UNLOCK(&orte_ns_replica.mutex);
return ORTE_ERR_OUT_OF_RESOURCE;
}
/* check for attributes */
if (NULL != (attr = orte_rmgr.find_attribute(attrs, ORTE_NS_USE_PARENT))) {
/* declares the specified jobid to be the parent of the new one */
@ -249,6 +258,7 @@ int orte_ns_replica_get_parent_job(orte_jobid_t *parent_job, orte_jobid_t job)
item != opal_list_get_end(&orte_ns_replica.jobs);
item = opal_list_get_next(item)) {
root = (orte_ns_replica_jobitem_t*)item;
parent = root;
if (NULL != (ptr = down_search(root, &parent, job))) {
goto REPORT;
}
@ -335,8 +345,9 @@ int orte_ns_replica_reserve_range(orte_jobid_t job, orte_vpid_t range,
return rc;
}
/* get here if the range isn't available */
ORTE_ERROR_LOG(ORTE_ERR_OUT_OF_RESOURCE);
/* get here if the range isn't available - alert user */
opal_show_help("help-ns-replica.txt", "out-of-vpids", true);
OPAL_THREAD_UNLOCK(&orte_ns_replica.mutex);
return ORTE_ERR_OUT_OF_RESOURCE;
}


@ -60,6 +60,7 @@ static void orte_odls_child_constructor(orte_odls_child_t *ptr)
ptr->state = ORTE_PROC_STATE_UNDEF;
ptr->exit_code = 0;
ptr->cpu_set = 0xffffffff;
ptr->sync_required = false;
}
static void orte_odls_child_destructor(orte_odls_child_t *ptr)


@ -59,6 +59,7 @@ typedef struct orte_odls_child_t {
orte_proc_state_t state; /* the state of the process */
int exit_code; /* process exit code */
unsigned long cpu_set;
bool sync_required; /* require sync before termination */
} orte_odls_child_t;
ORTE_DECLSPEC OBJ_CLASS_DECLARATION(orte_odls_child_t);


@ -59,7 +59,8 @@ orte_odls_base_module_t orte_odls_bproc_module = {
orte_odls_bproc_launch_local_procs,
orte_odls_bproc_kill_local_procs,
orte_odls_bproc_signal_local_procs,
orte_odls_bproc_deliver_message
orte_odls_bproc_deliver_message,
orte_odls_bproc_get_local_proc_names
};
static int odls_bproc_make_dir(char *directory);
@ -68,7 +69,7 @@ static char * odls_bproc_get_base_dir_name(int proc_rank, orte_jobid_t jobid,
static void odls_bproc_delete_dir_tree(char * path);
static int odls_bproc_remove_dir(void);
static void odls_bproc_send_cb(int status, orte_process_name_t * peer,
orte_buffer_t* buffer, int tag, void* cbdata);
orte_buffer_t* buffer, orte_rml_tag_t tag, void* cbdata);
static int odls_bproc_setup_stdio(orte_process_name_t *proc_name,
int proc_rank, orte_jobid_t jobid,
orte_std_cntr_t app_context, bool connect_stdin);
@ -338,7 +339,8 @@ odls_bproc_remove_dir()
*/
static void
odls_bproc_send_cb(int status, orte_process_name_t * peer,
orte_buffer_t* buffer, int tag, void* cbdata)
orte_buffer_t* buffer,
orte_rml_tag_t tag, void* cbdata)
{
OBJ_RELEASE(buffer);
}
@ -539,7 +541,7 @@ cleanup:
* @retval error
*/
int
orte_odls_bproc_launch_local_procs(orte_gpr_notify_data_t *data,)
orte_odls_bproc_launch_local_procs(orte_gpr_notify_data_t *data)
{
odls_bproc_child_t *child;
opal_list_item_t* item;
@ -573,7 +575,7 @@ orte_odls_bproc_launch_local_procs(orte_gpr_notify_data_t *data,)
* from the parent/front-end process, as bproc4 does not currently allow the
* process to intercept the signal
*/
setpgid(0,0);
setpgid(0,0);
/* set the flag indicating this node is not included in the launch data */
node_included = false;
@ -668,7 +670,7 @@ orte_odls_bproc_launch_local_procs(orte_gpr_notify_data_t *data,)
/* setup some values we'll need to drop my uri for each child */
orte_ns.convert_jobid_to_string(&job_str, jobid);
my_uri = orte_rml.get_uri();
my_uri = orte_rml.get_contact_info();
/* set up the io files for our children */
for(item = opal_list_get_first(&mca_odls_bproc_component.children);
@ -748,7 +750,7 @@ CALLHOME:
if(ORTE_SUCCESS != rc) {
ORTE_ERROR_LOG(rc);
}
rc = mca_oob_send_packed_nb(ORTE_PROC_MY_HNP, ack, ORTE_RML_TAG_BPROC, 0,
rc = orte_rml.send_buffer_nb(ORTE_PROC_MY_HNP, ack, ORTE_RML_TAG_BPROC, 0,
odls_bproc_send_cb, NULL);
if (0 > rc) {
ORTE_ERROR_LOG(rc);
@ -821,6 +823,40 @@ int orte_odls_bproc_deliver_message(orte_jobid_t job, orte_buffer_t *buffer, ort
return ORTE_SUCCESS;
}
int orte_odls_bproc_get_local_proc_names(opal_list_t *names, orte_jobid_t job)
{
opal_list_item_t *item;
orte_odls_child_t *child;
orte_namelist_t *nitem;
/* protect operations involving the global list of children */
OPAL_THREAD_LOCK(&mca_odls_bproc_component.lock);
for (item = opal_list_get_first(&mca_odls_bproc_component.children);
item != opal_list_get_end(&mca_odls_bproc_component.children);
item = opal_list_get_next(item)) {
child = (orte_odls_child_t*)item;
/* do we have a child from the specified job? Because the
* job could be given as a WILDCARD value, we must use
* the dss.compare function to check for equality.
*/
if (ORTE_EQUAL != orte_dss.compare(&job, &(child->name->jobid), ORTE_JOBID)) {
continue;
}
/* add this name to the list */
nitem = OBJ_NEW(orte_namelist_t);
orte_dss.copy((void**)&nitem->name, child->name, ORTE_NAME);
opal_list_append(names, &nitem->item);
}
opal_condition_signal(&mca_odls_bproc_component.cond);
OPAL_THREAD_UNLOCK(&mca_odls_bproc_component.lock);
return ORTE_SUCCESS;
}
/**
* Finalizes the bproc module. Cleanup tmp directory/files
* used for I/O forwarding.


@ -33,6 +33,7 @@
#include "opal/mca/mca.h"
#include "opal/threads/condition.h"
#include "opal/class/opal_list.h"
#include "orte/mca/gpr/gpr_types.h"
#include "orte/mca/rmaps/rmaps_types.h"
@ -64,6 +65,7 @@ int orte_odls_bproc_launch_local_procs(orte_gpr_notify_data_t *data);
int orte_odls_bproc_kill_local_procs(orte_jobid_t job, bool set_state);
int orte_odls_bproc_signal_local_procs(const orte_process_name_t* proc_name, int32_t signal);
int orte_odls_bproc_deliver_message(orte_jobid_t job, orte_buffer_t *buffer, orte_rml_tag_t tag);
int orte_odls_bproc_get_local_proc_names(opal_list_t *names, orte_jobid_t job);
/**
* ODLS bproc_orted component


@ -44,3 +44,14 @@ Could not execute the executable "%s": %s
This could mean that your PATH or executable name is wrong, or that you do not
have the necessary permissions. Please ensure that the executable is able to be
found and executed.
#
[nodeid-out-of-range]
The id of a node is out of the allowed range.
Value given: %ld
Max value allowed: %ld
This may be resolved by increasing the number of available node ids by
re-configuring Open MPI with the --enable-jumbo-clusters option, and then
re-running the application.


@@ -90,6 +90,8 @@
#include "orte/mca/gpr/gpr.h"
#include "orte/mca/rmaps/base/base.h"
#include "orte/mca/smr/smr.h"
#include "orte/mca/routed/routed.h"
#if OPAL_ENABLE_FT == 1
#include "orte/mca/snapc/snapc.h"
#endif
@@ -107,6 +109,12 @@ static int orte_odls_default_signal_local_procs(const orte_process_name_t *proc,
int32_t signal);
static int orte_odls_default_deliver_message(orte_jobid_t job, orte_buffer_t *buffer, orte_rml_tag_t tag);
static int orte_odls_default_extract_proc_map_info(orte_process_name_t *daemon,
orte_process_name_t *proc,
orte_gpr_value_t *value);
static int orte_odls_default_require_sync(orte_process_name_t *proc);
static void set_handler_default(int sig);
orte_odls_base_module_t orte_odls_default_module = {
@@ -114,7 +122,9 @@ orte_odls_base_module_t orte_odls_default_module = {
orte_odls_default_launch_local_procs,
orte_odls_default_kill_local_procs,
orte_odls_default_signal_local_procs,
orte_odls_default_deliver_message
orte_odls_default_deliver_message,
orte_odls_default_extract_proc_map_info,
orte_odls_default_require_sync
};
int orte_odls_default_get_add_procs_data(orte_gpr_notify_data_t **data,
@@ -204,8 +214,8 @@ int orte_odls_default_get_add_procs_data(orte_gpr_notify_data_t **data,
}
if (ORTE_SUCCESS != (rc = orte_gpr.create_keyval(&(value->keyvals[0]),
ORTE_PROC_NAME_KEY,
ORTE_NAME, &proc->name))) {
ORTE_VPID_KEY,
ORTE_VPID, &(node->daemon->vpid)))) {
ORTE_ERROR_LOG(rc);
OBJ_RELEASE(ndat);
OBJ_RELEASE(value);
@@ -213,17 +223,17 @@
}
if (ORTE_SUCCESS != (rc = orte_gpr.create_keyval(&(value->keyvals[1]),
ORTE_PROC_APP_CONTEXT_KEY,
ORTE_STD_CNTR, &proc->app_idx))) {
ORTE_VPID_KEY,
ORTE_VPID, &(proc->name.vpid)))) {
ORTE_ERROR_LOG(rc);
OBJ_RELEASE(ndat);
OBJ_RELEASE(value);
return rc;
}
if (ORTE_SUCCESS != (rc = orte_gpr.create_keyval(&(value->keyvals[2]),
ORTE_NODE_NAME_KEY,
ORTE_STRING, node->nodename))) {
ORTE_PROC_APP_CONTEXT_KEY,
ORTE_STD_CNTR, &proc->app_idx))) {
ORTE_ERROR_LOG(rc);
OBJ_RELEASE(ndat);
OBJ_RELEASE(value);
@@ -747,6 +757,20 @@ GOTCHILD:
aborted = true;
free(abort_file);
} else {
/* okay, it terminated normally - check to see if a sync was required and
* if it was received
*/
if (child->sync_required) {
/* if this is set, then we required a sync and didn't get it, so this
* is considered an abnormal termination and treated accordingly
*/
aborted = true;
opal_output(orte_odls_globals.output, "odls: child process %s terminated normally "
"but did not provide a required sync - it "
"will be treated as an abnormal termination",
ORTE_NAME_PRINT(child->name));
goto MOVEON;
}
opal_output(orte_odls_globals.output, "odls: child process %s terminated normally",
ORTE_NAME_PRINT(child->name));
}
@@ -954,6 +978,19 @@ static int odls_default_fork_local_proc(
free(param);
free(uri);
/* pass a nodeid to the proc - for now, set this to our vpid as
* this is a globally unique number and we have a one-to-one
* mapping of daemons to nodes
*/
if (ORTE_SUCCESS != (rc = orte_ns.convert_nodeid_to_string(&param2, (orte_nodeid_t)ORTE_PROC_MY_NAME->vpid))) {
ORTE_ERROR_LOG(rc);
return rc;
}
param = mca_base_param_environ_variable("orte","nodeid",NULL);
opal_setenv(param, param2, true, &environ_copy);
free(param);
free(param2);
/* setup yield schedule and processor affinity
* We default here to always setting the affinity processor if we want
* it. The processor affinity system then determines
@@ -1137,13 +1174,12 @@ static int odls_default_fork_local_proc(
int orte_odls_default_launch_local_procs(orte_gpr_notify_data_t *data)
{
int rc;
orte_std_cntr_t i, j, kv, kv2, *sptr, total_slots_alloc = 0;
orte_std_cntr_t i, j, kv, *sptr, total_slots_alloc = 0;
orte_gpr_value_t *value, **values;
orte_gpr_keyval_t *kval;
orte_app_context_t *app;
orte_jobid_t job;
orte_vpid_t *vptr, start, range;
char *node_name;
opal_list_t app_context_list;
orte_odls_child_t *child;
odls_default_app_context_t *app_item;
@@ -1154,6 +1190,7 @@ int orte_odls_default_launch_local_procs(orte_gpr_notify_data_t *data)
bool node_included;
char *job_str, *uri_file, *my_uri, *session_dir=NULL, *slot_str;
FILE *fp;
orte_process_name_t daemon, proc;
/* parse the returned data to create the required structures
* for a fork launch. Since the data will contain information
@@ -1172,6 +1209,14 @@ int orte_odls_default_launch_local_procs(orte_gpr_notify_data_t *data)
opal_output(orte_odls_globals.output, "odls: setting up launch for job %ld", (long)job);
/* setup the routing table for communications - we need to do this
* prior to launch as the procs may want to communicate right away
*/
if (ORTE_SUCCESS != (rc = orte_routed.init_routes(job, data))) {
ORTE_ERROR_LOG(rc);
return rc;
}
/* We need to create a list of the app_contexts
* so we can know what to launch - the process info only gives
* us an index into the app_context array, not the app_context
@@ -1187,6 +1232,10 @@ int orte_odls_default_launch_local_procs(orte_gpr_notify_data_t *data)
/* set the flag indicating this node is not included in the launch data */
node_included = false;
/* init the daemon and proc objects */
daemon.jobid = 0;
proc.jobid = job;
values = (orte_gpr_value_t**)(data->values)->addr;
for (j=0, i=0; i < data->cnt && j < (data->values)->size; j++) { /* loop through all returned values */
if (NULL != values[j]) {
@@ -1257,96 +1306,73 @@
} /* end for loop to process global data */
} else {
/* this must have come from one of the process containers, so it must
* contain data for a proc structure - see if it
* belongs to this node
*/
for (kv=0; kv < value->cnt; kv++) {
kval = value->keyvals[kv];
if (strcmp(kval->key, ORTE_NODE_NAME_KEY) == 0) {
/* Most C-compilers will bark if we try to directly compare the string in the
* kval data area against a regular string, so we need to "get" the data
* so we can access it */
if (ORTE_SUCCESS != (rc = orte_dss.get((void**)&node_name, kval->value, ORTE_STRING))) {
ORTE_ERROR_LOG(rc);
return rc;
}
/* if this is our node...must also protect against a zero-length string */
if (NULL != node_name && 0 == strcmp(node_name, orte_system_info.nodename)) {
/* indicate that there is something for us to do */
node_included = true;
/* ...harvest the info into a new child structure */
child = OBJ_NEW(orte_odls_child_t);
for (kv2 = 0; kv2 < value->cnt; kv2++) {
kval = value->keyvals[kv2];
if(strcmp(kval->key, ORTE_PROC_NAME_KEY) == 0) {
/* copy the name into the child object */
if (ORTE_SUCCESS != (rc = orte_dss.copy((void**)&(child->name), kval->value->data, ORTE_NAME))) {
ORTE_ERROR_LOG(rc);
return rc;
}
continue;
}
if(strcmp(kval->key, ORTE_PROC_CPU_LIST_KEY) == 0) {
/* copy the name into the child object */
if (ORTE_SUCCESS != (rc = orte_dss.copy((void**)&slot_str, kval->value->data, ORTE_STRING))) {
ORTE_ERROR_LOG(rc);
return rc;
}
if (NULL != slot_str) {
if (ORTE_SUCCESS != (rc = slot_list_to_cpu_set(slot_str, child))){
ORTE_ERROR_LOG(rc);
free(slot_str);
return rc;
}
free(slot_str);
}
continue;
}
if(strcmp(kval->key, ORTE_PROC_APP_CONTEXT_KEY) == 0) {
if (ORTE_SUCCESS != (rc = orte_dss.get((void**)&sptr, kval->value, ORTE_STD_CNTR))) {
ORTE_ERROR_LOG(rc);
return rc;
}
child->app_idx = *sptr; /* save the index into the app_context objects */
continue;
}
if(strcmp(kval->key, ORTE_PROC_LOCAL_RANK_KEY) == 0) {
if (ORTE_SUCCESS != (rc = orte_dss.get((void**)&vptr, kval->value, ORTE_VPID))) {
ORTE_ERROR_LOG(rc);
return rc;
}
child->local_rank = *vptr; /* save the local_rank */
continue;
}
if(strcmp(kval->key, ORTE_NODE_NUM_PROCS_KEY) == 0) {
if (ORTE_SUCCESS != (rc = orte_dss.get((void**)&sptr, kval->value, ORTE_STD_CNTR))) {
ORTE_ERROR_LOG(rc);
return rc;
}
child->num_procs = *sptr; /* save the number of procs from this job on this node */
continue;
}
if(strcmp(kval->key, ORTE_NODE_OVERSUBSCRIBED_KEY) == 0) {
if (ORTE_SUCCESS != (rc = orte_dss.get((void**)&bptr, kval->value, ORTE_BOOL))) {
ORTE_ERROR_LOG(rc);
return rc;
}
oversubscribed = *bptr;
continue;
}
} /* kv2 */
/* protect operation on the global list of children */
OPAL_THREAD_LOCK(&orte_odls_default.mutex);
opal_list_append(&orte_odls_default.children, &child->super);
opal_condition_signal(&orte_odls_default.cond);
OPAL_THREAD_UNLOCK(&orte_odls_default.mutex);
* contain data for a proc structure - get the name of the daemon and proc
*/
if (ORTE_SUCCESS != (rc = orte_odls_default_extract_proc_map_info(&daemon, &proc, value))) {
ORTE_ERROR_LOG(rc);
return rc;
}
/* does this proc belong to us? */
if (ORTE_EQUAL != orte_dss.compare(&(ORTE_PROC_MY_NAME->vpid), &daemon.vpid, ORTE_VPID)) {
/* evidently not - ignore it */
continue;
}
/* yes it does - indicate that we need to do something */
node_included = true;
/* harvest the info into a new child structure, taking advantage
* of our knowledge of the ordering of the data itself
*/
child = OBJ_NEW(orte_odls_child_t);
if (ORTE_SUCCESS != (rc = orte_dss.copy((void**)&child->name, &proc, ORTE_NAME))) {
ORTE_ERROR_LOG(rc);
return rc;
}
/* 3rd posn - app_idx */
if (ORTE_SUCCESS != (rc = orte_dss.get((void**)&sptr, value->keyvals[2]->value, ORTE_STD_CNTR))) {
ORTE_ERROR_LOG(rc);
return rc;
}
child->app_idx = *sptr; /* save the index into the app_context objects */
}
/* 4th posn - local rank */
if (ORTE_SUCCESS != (rc = orte_dss.get((void**)&vptr, value->keyvals[3]->value, ORTE_VPID))) {
ORTE_ERROR_LOG(rc);
return rc;
}
child->local_rank = *vptr; /* save the local_rank */
/* 5th posn - cpu list */
if (ORTE_SUCCESS != (rc = orte_dss.copy((void**)&slot_str, value->keyvals[4]->value->data, ORTE_STRING))) {
ORTE_ERROR_LOG(rc);
return rc;
}
if (NULL != slot_str) {
if (ORTE_SUCCESS != (rc = slot_list_to_cpu_set(slot_str, child))){
ORTE_ERROR_LOG(rc);
free(slot_str);
return rc;
}
} /* for kv */
free(slot_str);
}
/* 6th posn - number of local procs */
if (ORTE_SUCCESS != (rc = orte_dss.get((void**)&sptr, value->keyvals[5]->value, ORTE_STD_CNTR))) {
ORTE_ERROR_LOG(rc);
return rc;
}
child->num_procs = *sptr; /* save the number of procs from this job on this node */
/* protect operation on the global list of children */
OPAL_THREAD_LOCK(&orte_odls_default.mutex);
opal_list_append(&orte_odls_default.children, &child->super);
opal_condition_signal(&orte_odls_default.cond);
OPAL_THREAD_UNLOCK(&orte_odls_default.mutex);
}
} /* for j */
}
}
/* if there is nothing for us to do, just return */
@@ -1678,7 +1704,8 @@ int orte_odls_default_deliver_message(orte_jobid_t job, orte_buffer_t *buffer, o
* job could be given as a WILDCARD value, we must use
* the dss.compare function to check for equality.
*/
if (ORTE_EQUAL != orte_dss.compare(&job, &(child->name->jobid), ORTE_JOBID)) {
if (!child->alive ||
ORTE_EQUAL != orte_dss.compare(&job, &(child->name->jobid), ORTE_JOBID)) {
continue;
}
opal_output(orte_odls_globals.output, "odls: sending message to tag %lu on child %s",
@@ -1686,7 +1713,11 @@ int orte_odls_default_deliver_message(orte_jobid_t job, orte_buffer_t *buffer, o
/* if so, send the message */
rc = orte_rml.send_buffer(child->name, buffer, tag, 0);
if (rc < 0) {
if (rc < 0 && rc != ORTE_ERR_ADDRESSEE_UNKNOWN) {
/* ignore if the addressee is unknown as a race condition could
* have allowed the child to exit before we send it a barrier
* due to the vagaries of the event library
*/
ORTE_ERROR_LOG(rc);
}
}
@@ -1696,6 +1727,92 @@ int orte_odls_default_deliver_message(orte_jobid_t job, orte_buffer_t *buffer, o
return ORTE_SUCCESS;
}
static int orte_odls_default_extract_proc_map_info(orte_process_name_t *daemon,
orte_process_name_t *proc,
orte_gpr_value_t *value)
{
int rc;
orte_vpid_t *vptr;
/* vpid of daemon that will host these procs is in first position */
if (ORTE_SUCCESS != (rc = orte_dss.get((void**)&vptr, value->keyvals[0]->value, ORTE_VPID))) {
ORTE_ERROR_LOG(rc);
return rc;
}
daemon->vpid = *vptr;
/* vpid of proc is in second position */
if (ORTE_SUCCESS != (rc = orte_dss.get((void**)&vptr, value->keyvals[1]->value, ORTE_VPID))) {
ORTE_ERROR_LOG(rc);
return rc;
}
proc->vpid = *vptr;
return ORTE_SUCCESS;
}
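The extraction routine depends on the positional layout written by get_add_procs_data: the hosting daemon's vpid sits in keyval slot 0 and the proc's vpid in slot 1. A minimal standalone sketch of that convention, using plain arrays and hypothetical simplified types in place of the real orte_gpr_value_t and orte_process_name_t:

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the ORTE types - illustrative only, not the
 * real orte_gpr_value_t/orte_process_name_t definitions. */
typedef unsigned int vpid_t;

typedef struct {
    vpid_t vpid;
} name_t;

/* A "value" here is just an array of vpids standing in for the
 * positional keyval list built by get_add_procs_data. */
typedef struct {
    vpid_t slots[2];   /* [0] = hosting daemon, [1] = proc */
    size_t cnt;
} value_t;

/* Mirror of the positional convention: daemon vpid first, proc vpid second. */
int extract_proc_map_info(name_t *daemon, name_t *proc, const value_t *value)
{
    if (value->cnt < 2) {
        return -1;             /* malformed value */
    }
    daemon->vpid = value->slots[0];
    proc->vpid   = value->slots[1];
    return 0;
}
```

The payoff of the fixed ordering is that the launcher can index keyvals directly instead of string-comparing every key, which is where part of the startup savings comes from.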
static int orte_odls_default_require_sync(orte_process_name_t *proc)
{
orte_buffer_t buffer;
opal_list_item_t *item;
orte_odls_child_t *child;
int8_t dummy;
int rc;
bool found=false;
/* protect operations involving the global list of children */
OPAL_THREAD_LOCK(&orte_odls_default.mutex);
for (item = opal_list_get_first(&orte_odls_default.children);
item != opal_list_get_end(&orte_odls_default.children);
item = opal_list_get_next(item)) {
child = (orte_odls_child_t*)item;
/* find this child */
if (ORTE_EQUAL == orte_dss.compare(proc, child->name, ORTE_NAME)) {
opal_output(orte_odls_globals.output, "odls: registering sync on child %s",
ORTE_NAME_PRINT(child->name));
child->sync_required = !child->sync_required;
found = true;
break;
}
}
/* if it wasn't found on the list, then we need to add it - must have
* come from a singleton
*/
if (!found) {
child = OBJ_NEW(orte_odls_child_t);
if (ORTE_SUCCESS != (rc = orte_dss.copy((void**)&child->name, proc, ORTE_NAME))) {
ORTE_ERROR_LOG(rc);
return rc;
}
opal_list_append(&orte_odls_default.children, &child->super);
/* we don't know any other info about the child, so just indicate it's
* alive and set the sync
*/
child->alive = true;
child->sync_required = !child->sync_required;
}
/* ack the call */
OBJ_CONSTRUCT(&buffer, orte_buffer_t);
orte_dss.pack(&buffer, &dummy, 1, ORTE_INT8); /* put anything in */
opal_output(orte_odls_globals.output, "odls: sending sync ack to child %s",
ORTE_NAME_PRINT(proc));
if (0 > (rc = orte_rml.send_buffer(proc, &buffer, ORTE_RML_TAG_SYNC, 0))) {
ORTE_ERROR_LOG(rc);
OBJ_DESTRUCT(&buffer);
return rc;
}
OBJ_DESTRUCT(&buffer);
opal_condition_signal(&orte_odls_default.cond);
OPAL_THREAD_UNLOCK(&orte_odls_default.mutex);
return ORTE_SUCCESS;
}
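Because require_sync simply toggles the flag, a proc that registers its sync during init and de-registers during finalize ends with sync_required back at false; a proc that dies in between leaves it true, which the waitpid handler treats as an abnormal termination. A toy model of that handshake, under the assumption of a simplified child structure (not the real orte_odls_child_t):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative model of the sync handshake: require_sync() is invoked once
 * when a proc registers during init and again when it de-registers during
 * finalize, so the flag toggles.  A proc that dies between the two calls is
 * left with sync_required == true. */
typedef struct {
    bool sync_required;
    bool alive;
} child_t;

void require_sync(child_t *child)
{
    child->sync_required = !child->sync_required;
}

/* What the waitpid handler checks on a normal exit status. */
bool terminated_abnormally(const child_t *child)
{
    return child->sync_required;   /* sync registered but never completed */
}
```

The toggle trick avoids a second "sync complete" flag: the same daemon command serves both ends of the handshake.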
static void set_handler_default(int sig)
{


@@ -80,6 +80,17 @@ typedef int (*orte_odls_base_module_signal_local_process_fn_t)(const orte_proces
typedef int (*orte_odls_base_module_deliver_message_fn_t)(orte_jobid_t job, orte_buffer_t *buffer,
orte_rml_tag_t tag);
/**
* Extract the mapping of a daemon-proc pair
*/
typedef int (*orte_odls_base_module_extract_proc_map_info_fn_t)(orte_process_name_t *daemon,
orte_process_name_t *proc,
orte_gpr_value_t *value);
/**
* Register to require sync before termination
*/
typedef int (*orte_odls_base_module_require_sync_fn_t)(orte_process_name_t *proc);
/**
* pls module version 1.3.0
*/
@@ -89,6 +100,8 @@ struct orte_odls_base_module_1_3_0_t {
orte_odls_base_module_kill_local_processes_fn_t kill_local_procs;
orte_odls_base_module_signal_local_process_fn_t signal_local_procs;
orte_odls_base_module_deliver_message_fn_t deliver_message;
orte_odls_base_module_extract_proc_map_info_fn_t extract_proc_map_info;
orte_odls_base_module_require_sync_fn_t require_sync;
};
/** shorten orte_odls_base_module_1_3_0_t declaration */


@@ -49,7 +49,8 @@ typedef uint8_t orte_daemon_cmd_flag_t;
#define ORTE_DAEMON_ROUTE_BINOMIAL (orte_daemon_cmd_flag_t) 12
#define ORTE_DAEMON_WARMUP_LOCAL_CONN (orte_daemon_cmd_flag_t) 13
#define ORTE_DAEMON_NULL_CMD (orte_daemon_cmd_flag_t) 14
#define ORTE_DAEMON_SYNC_BY_PROC (orte_daemon_cmd_flag_t) 15
/* define some useful attributes for dealing with orteds */
#define ORTE_DAEMON_SOFT_KILL "orted-soft-kill"
#define ORTE_DAEMON_HARD_KILL "orted-hard-kill"


@@ -1253,9 +1253,111 @@ int orte_odls_process_deliver_message(orte_jobid_t job, orte_buffer_t *buffer, o
return ORTE_SUCCESS;
}
static int orte_odls_process_extract_proc_map_info(orte_process_name_t *daemon,
orte_process_name_t *proc,
orte_gpr_value_t *value)
{
int rc;
orte_vpid_t *vptr;
#if 0
/*** NOTE: YOU WILL NEED TO REVISE THIS TO REFLECT HOW YOU STORED
THE DATA IN YOUR GET_ADD_PROCS_DATA ROUTINE. YOU MAY WISH
TO REVISE THAT ROUTINE, AND YOUR LAUNCH ROUTINE WHERE YOU PARSE
THAT DATA, TO REFLECT CHANGES IN THE DEFAULT COMPONENT AS SOME
EFFICIENCIES AND FEATURES HAVE BEEN ADDED
****/
/* vpid of daemon that will host these procs is in first position */
if (ORTE_SUCCESS != (rc = orte_dss.get((void**)&vptr, value->keyvals[0]->value, ORTE_VPID))) {
ORTE_ERROR_LOG(rc);
return rc;
}
daemon->vpid = *vptr;
/* vpid of proc is in second position */
if (ORTE_SUCCESS != (rc = orte_dss.get((void**)&vptr, value->keyvals[1]->value, ORTE_VPID))) {
ORTE_ERROR_LOG(rc);
return rc;
}
proc->vpid = *vptr;
return ORTE_SUCCESS;
#endif
return ORTE_ERR_NOT_IMPLEMENTED;
}
static int orte_odls_process_require_sync(orte_process_name_t *proc)
{
orte_buffer_t buffer;
opal_list_item_t *item;
orte_odls_child_t *child;
int8_t dummy;
int rc;
bool found=false;
/* protect operations involving the global list of children */
OPAL_THREAD_LOCK(&orte_odls_process.mutex);
for (item = opal_list_get_first(&orte_odls_process.children);
item != opal_list_get_end(&orte_odls_process.children);
item = opal_list_get_next(item)) {
child = (orte_odls_child_t*)item;
/* find this child */
if (ORTE_EQUAL == orte_dss.compare(proc, child->name, ORTE_NAME)) {
opal_output(orte_odls_globals.output, "odls: registering sync on child %s",
ORTE_NAME_PRINT(child->name));
child->sync_required = !child->sync_required;
found = true;
break;
}
}
/* if it wasn't found on the list, then we need to add it - must have
* come from a singleton
*/
if (!found) {
child = OBJ_NEW(orte_odls_child_t);
if (ORTE_SUCCESS != (rc = orte_dss.copy((void**)&child->name, proc, ORTE_NAME))) {
ORTE_ERROR_LOG(rc);
return rc;
}
opal_list_append(&orte_odls_process.children, &child->super);
/* we don't know any other info about the child, so just indicate it's
* alive and set the sync
*/
child->alive = true;
child->sync_required = !child->sync_required;
}
/* ack the call */
OBJ_CONSTRUCT(&buffer, orte_buffer_t);
orte_dss.pack(&buffer, &dummy, 1, ORTE_INT8); /* put anything in */
opal_output(orte_odls_globals.output, "odls: sending sync ack to child %s",
ORTE_NAME_PRINT(proc));
if (0 > (rc = orte_rml.send_buffer(proc, &buffer, ORTE_RML_TAG_SYNC, 0))) {
ORTE_ERROR_LOG(rc);
OBJ_DESTRUCT(&buffer);
return rc;
}
OBJ_DESTRUCT(&buffer);
opal_condition_signal(&orte_odls_process.cond);
OPAL_THREAD_UNLOCK(&orte_odls_process.mutex);
return ORTE_SUCCESS;
}
orte_odls_base_module_1_3_0_t orte_odls_process_module = {
orte_odls_process_get_add_procs_data,
orte_odls_process_launch_local_procs,
orte_odls_process_kill_local_procs,
orte_odls_process_signal_local_proc
orte_odls_process_signal_local_proc,
orte_odls_process_deliver_message,
orte_odls_process_extract_proc_map_info,
orte_odls_process_require_sync
};


@@ -82,21 +82,29 @@ typedef int (*mca_oob_base_module_ping_fn_t)(const orte_process_name_t*,
/**
* Implementation of mca_oob_send_nb().
*
* @param peer (IN) Opaque name of peer process.
* @param msg (IN) Array of iovecs describing user buffers and lengths.
* @param count (IN) Number of elements in iovec array.
* @param tag (IN) User defined tag for matching send/recv.
* @param flags (IN) Currently unused.
* @param cbfunc (IN) Callback function on send completion.
* @param cbdata (IN) User data that is passed to callback function.
* @return OMPI error code (<0) on error number of bytes actually sent.
*
*/
* Send an oob message
*
* Send an oob message. All oob sends are non-blocking, and cbfunc
* will be called when the message has been sent. When cbfunc is
* called, the message has been injected into the network but no guarantee
* is made about whether the target has received the message.
*
* @param[in] target Destination process name
* @param[in] origin Origin process for the message, for the purposes
* of message matching. This can be different from
* the process calling send().
* @param[in] msg Array of iovecs describing user buffers and lengths.
* @param[in] count Number of elements in iovec array.
* @param[in] tag User defined tag for matching send/recv.
* @param[in] flags Currently unused.
* @param[in] cbfunc Callback function on send completion.
* @param[in] cbdata User data that is passed to callback function.
*
* @return OMPI error code (<0) on error, otherwise the number of bytes actually sent.
*/
typedef int (*mca_oob_base_module_send_nb_fn_t)(
orte_process_name_t* peer,
orte_process_name_t* target,
orte_process_name_t* origin,
struct iovec* msg,
int count,
int tag,


@@ -115,28 +115,14 @@ int mca_oob_tcp_ping(const orte_process_name_t*, const char* uri, const struct t
* Non-blocking versions of send/recv.
*/
/**
* Non-blocking version of mca_oob_send().
*
* @param peer (IN) Opaque name of peer process.
* @param msg (IN) Array of iovecs describing user buffers and lengths.
* @param count (IN) Number of elements in iovec array.
* @param tag (IN) User defined tag for matching send/recv.
* @param flags (IN) Currently unused.
* @param cbfunc (IN) Callback function on send completion.
* @param cbdata (IN) User data that is passed to callback function.
* @return OMPI error code (<0) on error number of bytes actually sent.
*
*/
int mca_oob_tcp_send_nb(
orte_process_name_t* peer,
struct iovec* msg,
int count,
int tag,
int flags,
orte_rml_callback_fn_t cbfunc,
void* cbdata);
int mca_oob_tcp_send_nb(orte_process_name_t* target,
orte_process_name_t* origin,
struct iovec* msg,
int count,
int tag,
int flags,
orte_rml_callback_fn_t cbfunc,
void* cbdata);
/**
* Non-blocking version of mca_oob_recv().


@@ -35,6 +35,7 @@
* Header used by tcp oob protocol.
*/
struct mca_oob_tcp_hdr_t {
orte_process_name_t msg_origin;
orte_process_name_t msg_src;
orte_process_name_t msg_dst;
uint32_t msg_type; /**< type of message */
@@ -47,6 +48,7 @@ typedef struct mca_oob_tcp_hdr_t mca_oob_tcp_hdr_t;
* Convert the message header to host byte order
*/
#define MCA_OOB_TCP_HDR_NTOH(h) \
ORTE_PROCESS_NAME_NTOH((h)->msg_origin); \
ORTE_PROCESS_NAME_NTOH((h)->msg_src); \
ORTE_PROCESS_NAME_NTOH((h)->msg_dst); \
(h)->msg_type = ntohl((h)->msg_type); \
@@ -57,6 +59,7 @@ typedef struct mca_oob_tcp_hdr_t mca_oob_tcp_hdr_t;
* Convert the message header to network byte order
*/
#define MCA_OOB_TCP_HDR_HTON(h) \
ORTE_PROCESS_NAME_HTON((h)->msg_origin); \
ORTE_PROCESS_NAME_HTON((h)->msg_src); \
ORTE_PROCESS_NAME_HTON((h)->msg_dst); \
(h)->msg_type = htonl((h)->msg_type); \


@@ -332,9 +332,10 @@ bool mca_oob_tcp_msg_recv_handler(mca_oob_tcp_msg_t* msg, struct mca_oob_tcp_pee
msg->msg_rwnum = 0;
}
if (mca_oob_tcp_component.tcp_debug >= OOB_TCP_DEBUG_INFO) {
opal_output(0, "%s-%s mca_oob_tcp_msg_recv_handler: size %lu\n",
opal_output(0, "%s-%s (origin: %s) mca_oob_tcp_msg_recv_handler: size %lu\n",
ORTE_NAME_PRINT(orte_process_info.my_name),
ORTE_NAME_PRINT(&(peer->peer_name)),
ORTE_NAME_PRINT(&(msg->msg_hdr.msg_origin)),
(unsigned long)(msg->msg_hdr.msg_size) );
}
}
@@ -483,7 +484,7 @@ static void mca_oob_tcp_msg_data(mca_oob_tcp_msg_t* msg, mca_oob_tcp_peer_t* pee
OPAL_THREAD_LOCK(&mca_oob_tcp_component.tcp_match_lock);
/* match msg against posted receives */
post = mca_oob_tcp_msg_match_post(&peer->peer_name, msg->msg_hdr.msg_tag);
post = mca_oob_tcp_msg_match_post(&msg->msg_hdr.msg_origin, msg->msg_hdr.msg_tag);
if(NULL != post) {
if(NULL == post->msg_uiov || 0 == post->msg_ucnt) {
@@ -519,7 +520,7 @@ static void mca_oob_tcp_msg_data(mca_oob_tcp_msg_t* msg, mca_oob_tcp_peer_t* pee
post->msg_hdr.msg_tag,
post->msg_cbdata);
} else {
mca_oob_tcp_msg_complete(post, &peer->peer_name);
mca_oob_tcp_msg_complete(post, &msg->msg_hdr.msg_origin);
}
OPAL_THREAD_LOCK(&mca_oob_tcp_component.tcp_match_lock);
@@ -593,7 +594,7 @@ mca_oob_tcp_msg_t* mca_oob_tcp_msg_match_recv(orte_process_name_t* name, int tag
msg != (mca_oob_tcp_msg_t*) opal_list_get_end(&mca_oob_tcp_component.tcp_msg_recv);
msg = (mca_oob_tcp_msg_t*) opal_list_get_next(msg)) {
if(ORTE_EQUAL == orte_dss.compare(name, &msg->msg_peer, ORTE_NAME)) {
if(ORTE_EQUAL == orte_dss.compare(name, &msg->msg_hdr.msg_origin, ORTE_NAME)) {
if (tag == msg->msg_hdr.msg_tag) {
return msg;
}
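Matching posted receives against msg_origin rather than the peer the bytes arrived from matters once messages are routed: the immediate sender can be a forwarding daemon, while the header carries the logical originator end to end. A small illustration under those assumptions, with toy integer names standing in for orte_process_name_t:

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of why posted receives match on the message *origin* rather
 * than the immediate peer: with routed OOB traffic, the process we read
 * the bytes from (src) may be a forwarding daemon, not the logical sender. */
typedef struct {
    int origin;   /* logical sender, carried in the header end-to-end */
    int src;      /* immediate hop the bytes arrived from */
    int tag;
} hdr_t;

/* Find the first queued message whose origin and tag match the posted recv. */
const hdr_t *match_recv(const hdr_t *msgs, size_t n, int name, int tag)
{
    for (size_t i = 0; i < n; i++) {
        if (msgs[i].origin == name && msgs[i].tag == tag) {
            return &msgs[i];
        }
    }
    return NULL;
}
```

Matching on src instead would make every routed message appear to come from the daemon, so receives posted for the real sender would never complete.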


@@ -58,6 +58,7 @@
#include "orte/mca/gpr/gpr.h"
#include "orte/mca/ns/ns.h"
#include "orte/mca/errmgr/errmgr.h"
#include "orte/mca/routed/routed.h"
#include "oob_tcp.h"
#include "oob_tcp_peer.h"
@@ -513,6 +514,11 @@ static void mca_oob_tcp_peer_connected(mca_oob_tcp_peer_t* peer, int sd)
opal_event_del(&peer->peer_timer_event);
peer->peer_state = MCA_OOB_TCP_CONNECTED;
peer->peer_retries = 0;
/* Since we have a direct connection established to this peer, use
the connection as a direct route between peers */
orte_routed.update_route(&peer->peer_name, &peer->peer_name);
if(opal_list_get_size(&peer->peer_send_queue) > 0) {
if(NULL == peer->peer_send_msg) {
peer->peer_send_msg = (mca_oob_tcp_msg_t*)


@@ -119,6 +119,7 @@ int mca_oob_tcp_recv_nb(
}
/* fill in the header */
msg->msg_hdr.msg_origin = *peer;
if (NULL == orte_process_info.my_name) {
msg->msg_hdr.msg_src = *ORTE_NAME_INVALID;
} else {


@@ -89,7 +89,8 @@ static int mca_oob_tcp_send_self(
*
*/
int mca_oob_tcp_send_nb(
orte_process_name_t* name,
orte_process_name_t* target,
orte_process_name_t* origin,
struct iovec* iov,
int count,
int tag,
@@ -97,7 +98,7 @@ int mca_oob_tcp_send_nb(
orte_rml_callback_fn_t cbfunc,
void* cbdata)
{
mca_oob_tcp_peer_t* peer = mca_oob_tcp_peer_lookup(name);
mca_oob_tcp_peer_t* peer = mca_oob_tcp_peer_lookup(target);
mca_oob_tcp_msg_t* msg;
int size;
int rc;
@@ -127,12 +128,13 @@ int mca_oob_tcp_send_nb(
msg->msg_hdr.msg_type = MCA_OOB_TCP_DATA;
msg->msg_hdr.msg_size = size;
msg->msg_hdr.msg_tag = tag;
msg->msg_hdr.msg_origin = *origin;
if (NULL == orte_process_info.my_name) {
msg->msg_hdr.msg_src = *ORTE_NAME_INVALID;
} else {
msg->msg_hdr.msg_src = *orte_process_info.my_name;
}
msg->msg_hdr.msg_dst = *name;
msg->msg_hdr.msg_dst = *target;
/* create one additional iovect that will hold the size of the message */
msg->msg_type = MCA_OOB_TCP_POSTED;
@@ -152,7 +154,7 @@ int mca_oob_tcp_send_nb(
msg->msg_complete = false;
msg->msg_peer = peer->peer_name;
if (ORTE_EQUAL == mca_oob_tcp_process_name_compare(name, orte_process_info.my_name)) { /* local delivery */
if (ORTE_EQUAL == mca_oob_tcp_process_name_compare(target, orte_process_info.my_name)) { /* local delivery */
rc = mca_oob_tcp_send_self(peer,msg,iov,count);
if (rc < 0 ) {
return rc;


@@ -31,6 +31,8 @@
#include "orte/util/univ_info.h"
#include "orte/mca/rml/rml.h"
#include "orte/runtime/params.h"
#include "orte/mca/ns/ns.h"
#include "orte/mca/errmgr/errmgr.h"
#include "orte/mca/pls/base/pls_private.h"
@@ -64,14 +66,25 @@ void orte_pls_base_purge_mca_params(char ***env)
int orte_pls_base_orted_append_basic_args(int *argc, char ***argv,
int *proc_name_index,
int *node_name_index,
orte_std_cntr_t num_procs)
int *node_name_index)
{
char *param = NULL, *contact_info = NULL;
int loc_id;
char * amca_param_path = NULL;
char * amca_param_prefix = NULL;
char * tmp_force = NULL;
char *purge[] = {
"seed",
"rds",
"ras",
"rmaps",
"pls",
"rmgr",
NULL
};
int i, j, cnt, rc;
bool pass;
orte_vpid_t total_num_daemons;
/* check for debug flags */
if (orte_debug_flag) {
@@ -97,9 +110,15 @@
opal_argv_append(argc, argv, "<template>");
}
/* tell the daemon how many procs are in the daemon's job */
/* get the total number of daemons that will be in the system */
if (ORTE_SUCCESS != (rc = orte_ns.get_vpid_range(0, &total_num_daemons))) {
ORTE_ERROR_LOG(rc);
return rc;
}
/* pass that number along */
opal_argv_append(argc, argv, "--num_procs");
asprintf(&param, "%lu", (unsigned long)(num_procs));
asprintf(&param, "%lu", (unsigned long)(total_num_daemons));
opal_argv_append(argc, argv, param);
free(param);
@@ -145,6 +164,31 @@
free(contact_info);
free(param);
/* pass along any cmd line MCA params provided to mpirun,
* being sure to "purge" any that would cause problems
* on backend nodes
*/
cnt = opal_argv_count(orted_cmd_line);
for (i=0; i < cnt; i+=3) {
/* check to see if this is on the purge list */
pass = true;
for (j=0; NULL != purge[j]; j++) {
/* the ith position holds -mca, so need to check
* against the i+1st position to find the param
*/
if (0 == strcmp(orted_cmd_line[i+1],purge[j])) {
/* on purge list - skip it */
pass = false;
break;
}
}
if (pass) {
opal_argv_append(argc, argv, orted_cmd_line[i]);
opal_argv_append(argc, argv, orted_cmd_line[i+1]);
opal_argv_append(argc, argv, orted_cmd_line[i+2]);
}
}
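The purge loop above walks the stored command line as ("-mca", key, value) triples and drops any triple whose key names a framework that must not be forwarded to backend nodes. A self-contained sketch of the same filter, with a hypothetical filter_mca_params helper in place of the real opal_argv_append-based code:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Frameworks that would cause problems if passed to backend daemons. */
static const char *purge[] = { "seed", "rds", "ras", "rmaps", "pls", "rmgr", NULL };

/* cmd is a NULL-terminated argv-style list of ("-mca", key, value) triples;
 * triples whose key is on the purge list are skipped.  Returns the number of
 * strings copied into out (also NULL-terminated). */
size_t filter_mca_params(const char **cmd, const char **out)
{
    size_t n = 0;
    for (size_t i = 0; cmd[i] != NULL; i += 3) {
        int pass = 1;
        for (size_t j = 0; purge[j] != NULL; j++) {
            /* position i holds "-mca", so the param name is at i+1 */
            if (0 == strcmp(cmd[i + 1], purge[j])) {
                pass = 0;   /* on the purge list - skip it */
                break;
            }
        }
        if (pass) {
            out[n++] = cmd[i];
            out[n++] = cmd[i + 1];
            out[n++] = cmd[i + 2];
        }
    }
    out[n] = NULL;
    return n;
}
```

So `mpirun -mca pls rsh -mca btl tcp ...` forwards only the btl setting to the orteds, since pls is purged.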
/*
* Pass along the Aggregate MCA Parameter Sets
*/


@@ -30,6 +30,8 @@
#include "orte/mca/rml/base/rml_contact.h"
#include "orte/mca/grpcomm/grpcomm.h"
#include "orte/mca/odls/odls.h"
#include "orte/mca/smr/smr.h"
#include "orte/runtime/orte_wakeup.h"
#include "orte/mca/pls/base/pls_private.h"
@@ -72,6 +74,47 @@ int orte_pls_base_launch_apps(orte_job_map_t *map)
return rc;
}
void orte_pls_base_daemon_failed(orte_jobid_t job, bool callback_active, pid_t pid,
int status, orte_job_state_t state)
{
int src[3] = {-1, -1, -1};
orte_buffer_t ack;
int rc;
if (callback_active) {
/* if we failed while launching daemons, we need to fake a message to
* the daemon callback system so it can break out of its receive loop
*/
src[2] = pid;
if(WIFSIGNALED(status)) {
src[1] = WTERMSIG(status);
}
OBJ_CONSTRUCT(&ack, orte_buffer_t);
if (ORTE_SUCCESS != (rc = orte_dss.pack(&ack, &src, 3, ORTE_INT))) {
ORTE_ERROR_LOG(rc);
}
rc = orte_rml.send_buffer(ORTE_PROC_MY_NAME, &ack, ORTE_RML_TAG_ORTED_CALLBACK, 0);
if (0 > rc) {
ORTE_ERROR_LOG(rc);
}
OBJ_DESTRUCT(&ack);
}
/* The usual reasons for a daemon to exit abnormally are all a pretty good
indication that things in general are going to fall apart.
Set the job state as indicated so orterun's exit status
will be non-zero
*/
if (ORTE_SUCCESS != (rc = orte_smr.set_job_state(job, state))) {
ORTE_ERROR_LOG(rc);
}
/* forcibly terminate the job so orterun can exit */
if (ORTE_SUCCESS != (rc = orte_wakeup(job))) {
ORTE_ERROR_LOG(rc);
}
}
int orte_pls_base_daemon_callback(orte_std_cntr_t num_daemons)
{
orte_std_cntr_t i;
@@ -82,6 +125,7 @@ int orte_pls_base_daemon_callback(orte_std_cntr_t num_daemons)
orte_buffer_t *buf;
orte_gpr_notify_data_t *data=NULL;
orte_rml_cmd_flag_t command;
orte_vpid_t total_num_daemons;
for(i = 0; i < num_daemons; i++) {
OBJ_CONSTRUCT(&ack, orte_buffer_t);
@@ -99,6 +143,7 @@
* actual number of packed entries up to the number we specify here
*/
idx = 4;
src[0]=src[1]=src[2]=src[3]=0;
rc = orte_dss.unpack(&ack, &src, &idx, ORTE_INT);
if(ORTE_SUCCESS != rc) {
ORTE_ERROR_LOG(rc);
@@ -149,7 +194,14 @@
OBJ_DESTRUCT(&handoff); /* done with this */
}
/* all done launching - update everyone's contact info so all daemons
/* all done launching - update the num_procs in my local structure */
if (ORTE_SUCCESS != (rc = orte_ns.get_vpid_range(0, &total_num_daemons))) {
ORTE_ERROR_LOG(rc);
return rc;
}
orte_process_info.num_procs = total_num_daemons;
/* update everyone's contact info so all daemons
* can talk to each other
*/
name.jobid = 0;


@@ -71,6 +71,8 @@ typedef uint8_t orte_pls_cmd_flag_t;
ORTE_DECLSPEC int orte_pls_base_orted_signal_local_procs(orte_jobid_t job, int32_t signal, opal_list_t *attrs);
ORTE_DECLSPEC int orte_pls_base_launch_apps(orte_job_map_t *map);
ORTE_DECLSPEC void orte_pls_base_daemon_failed(orte_jobid_t job, bool callback_active,
pid_t pid, int status, orte_job_state_t state);
ORTE_DECLSPEC int orte_pls_base_daemon_callback(orte_std_cntr_t num_daemons);
@@ -95,8 +97,7 @@ typedef uint8_t orte_pls_cmd_flag_t;
int *argc,
char ***argv,
int *proc_name_index,
int *node_name_index,
orte_std_cntr_t num_procs);
int *node_name_index);
#if defined(c_plusplus) || defined(__cplusplus)
}


@@ -87,3 +87,11 @@ nodes and therefore cannot continue.
On node %d the process pid was %d and errno was set to %d.
For reference, we tried to launch %s
[proc-io-setup-failed]
The daemons failed to set up the I/O forwarding subsystem to support
the application. This is a fatal error so we are aborting. For
reference, the internal error code was:
error: %s


@@ -284,12 +284,13 @@ static void orte_pls_bproc_waitpid_daemon_cb(pid_t wpid, int status, void *data)
if(WIFSIGNALED(status)) {
src[1] = WTERMSIG(status);
}
opal_output(0, "%s detected daemon %ld exit during launch on %ld", ORTE_NAME_PRINT(ORTE_PROC_MY_NAME), (long)src[2], (long)src[3]);
OBJ_CONSTRUCT(&ack, orte_buffer_t);
rc = orte_dss.pack(&ack, &src, 4, ORTE_INT);
if(ORTE_SUCCESS != rc) {
ORTE_ERROR_LOG(rc);
}
rc = mca_oob_send_packed(ORTE_PROC_MY_NAME, &ack, ORTE_RML_TAG_BPROC, 0);
rc = orte_rml.send_buffer(ORTE_PROC_MY_NAME, &ack, ORTE_RML_TAG_ORTED_CALLBACK, 0);
if(0 > rc) {
ORTE_ERROR_LOG(rc);
}
@@ -377,7 +378,7 @@ static void orte_pls_bproc_setup_env(char *** env)
/* ns replica contact info */
if(NULL == orte_process_info.ns_replica) {
orte_dss.copy((void**)&orte_process_info.ns_replica, orte_process_info.my_name, ORTE_NAME);
orte_process_info.ns_replica_uri = orte_rml.get_uri();
orte_process_info.ns_replica_uri = orte_rml.get_contact_info();
}
var = mca_base_param_environ_variable("ns","replica","uri");
opal_setenv(var,orte_process_info.ns_replica_uri, true, env);
@@ -392,7 +393,7 @@ static void orte_pls_bproc_setup_env(char *** env)
/* gpr replica contact info */
if(NULL == orte_process_info.gpr_replica) {
orte_dss.copy((void**)&orte_process_info.gpr_replica, orte_process_info.my_name, ORTE_NAME);
orte_process_info.gpr_replica_uri = orte_rml.get_uri();
orte_process_info.gpr_replica_uri = orte_rml.get_contact_info();
}
var = mca_base_param_environ_variable("gpr","replica","uri");
opal_setenv(var,orte_process_info.gpr_replica_uri, true, env);
@@ -431,6 +432,7 @@ static void orte_pls_bproc_setup_env(char *** env)
static int orte_pls_bproc_launch_daemons(orte_job_map_t *map, char ***envp) {
int * daemon_list = NULL;
int num_daemons = 0;
orte_vpid_t range;
int total_num_daemons = 0;
int rc, i;
int * pids = NULL;
@@ -462,13 +464,8 @@ static int orte_pls_bproc_launch_daemons(orte_job_map_t *map, char ***envp) {
* Since we are going to "hold" until all the messages have arrived,
* we need to know how many are coming
*/
total_num_daemons = map->num_nodes;
/* account for any reuse of daemons */
if (ORTE_SUCCESS != (rc = orte_pls_base_launch_on_existing_daemons(map))) {
ORTE_ERROR_LOG(rc);
goto cleanup;
}
orte_ns.get_vpid_range(0, &range);
total_num_daemons = range;
/* get the number of new daemons to be launched for this job and allocate an array for
* their names so we can pass that to bproc - populate the list
@@ -635,11 +632,8 @@ static int orte_pls_bproc_launch_daemons(orte_job_map_t *map, char ***envp) {
}
}
/* setup the callbacks - this needs to be done *after* we store the
* daemon info so that short-lived apps don't cause mpirun to
* try and terminate the orteds before we record them
*/
if (!mca_pls_bproc_component.do_not_launch) {
if (!mca_pls_bproc_component.do_not_launch) {
/* setup the callbacks in case a daemon dies before we finish */
for (i=0; i < num_daemons; i++) {
rc = orte_wait_cb(pids[i], orte_pls_bproc_waitpid_daemon_cb,
&daemon_list[i]);
@@ -648,40 +642,47 @@ static int orte_pls_bproc_launch_daemons(orte_job_map_t *map, char ***envp) {
goto cleanup;
}
}
/* wait for the new daemons to callback */
if (ORTE_SUCCESS != (rc = orte_pls_base_daemon_callback(num_daemons))) {
ORTE_ERROR_LOG(rc);
goto cleanup;
}
WAITFORCOMM:
/* tell the daemons to set up the pty/pipes and IO forwarding
* which the user apps will use
*/
if (ORTE_SUCCESS != (rc = orte_pls_base_launch_apps(map))) {
ORTE_ERROR_LOG(rc);
goto cleanup;
}
/* wait for communication back from the daemons, which indicates they have
 * successfully set up the pty/pipes and IO forwarding which the user apps
* will use */
 * successfully performed that preparation - this comes from ALL daemons
* as we xcast the launch command across everyone
*/
for(i = 0; i < total_num_daemons; i++) {
orte_buffer_t ack;
int src[4];
int src;
OBJ_CONSTRUCT(&ack, orte_buffer_t);
rc = mca_oob_recv_packed(ORTE_NAME_WILDCARD, &ack, ORTE_RML_TAG_BPROC);
rc = orte_rml.recv_buffer(ORTE_NAME_WILDCARD, &ack, ORTE_RML_TAG_BPROC, 0);
if(0 > rc) {
ORTE_ERROR_LOG(rc);
OBJ_DESTRUCT(&ack);
goto cleanup;
}
idx = 4;
idx = 1;
rc = orte_dss.unpack(&ack, &src, &idx, ORTE_INT);
if(ORTE_SUCCESS != rc) {
ORTE_ERROR_LOG(rc);
}
OBJ_DESTRUCT(&ack);
if(-1 == src[0]) {
/* one of the daemons has failed to properly launch. The error is sent
* by orte_pls_bproc_waitpid_daemon_cb */
if(-1 == src[1]) { /* did not die on a signal */
opal_show_help("help-pls-bproc.txt", "daemon-died-no-signal", true,
src[2], src[3]);
} else { /* died on a signal */
opal_show_help("help-pls-bproc.txt", "daemon-died-signal", true,
src[2], src[3], src[1]);
}
rc = ORTE_ERROR;
ORTE_ERROR_LOG(rc);
if(src < 0) {
/* one of the daemons has failed to properly setup the required
* support. The error is sent by orte_pls_bproc_waitpid_daemon_cb
*/
opal_show_help("help-pls-bproc.txt", "proc-io-setup-failed",
true, ORTE_ERROR_NAME(src));
rc = src;
goto cleanup;
}
}
@@ -713,12 +714,13 @@ cleanup:
/* check for failed launch - if so, force terminate */
if (!daemons_launched) {
if (ORTE_SUCCESS != (rc = orte_smr.set_job_state(map->job, ORTE_JOB_STATE_FAILED_TO_START))) {
ORTE_ERROR_LOG(rc);
int ret;
if (ORTE_SUCCESS != (ret = orte_smr.set_job_state(map->job, ORTE_JOB_STATE_FAILED_TO_START))) {
ORTE_ERROR_LOG(ret);
}
if (ORTE_SUCCESS != (rc = orte_wakeup(map->job))) {
ORTE_ERROR_LOG(rc);
if (ORTE_SUCCESS != (ret = orte_wakeup(map->job))) {
ORTE_ERROR_LOG(ret);
}
}
@@ -1114,7 +1116,7 @@ int orte_pls_bproc_launch(orte_jobid_t jobid) {
}
/* For Bproc, we need to know how many slots were allocated on each
* node so the spawned processes can computer their name. Only Bproc
* node so the spawned processes can compute their name. Only Bproc
* needs to do this, so we choose not to modify the mapped_node struct
* to hold this info - bproc can go get it.
*
@@ -1123,7 +1125,7 @@ int orte_pls_bproc_launch(orte_jobid_t jobid) {
* the data for the first node on the map
*/
map_node = (orte_mapped_node_t*)opal_list_get_first(&map->nodes);
if (NULL == (ras_node = orte_ras.node_lookup(map_node->cell, map_node->nodename))) {
if (NULL == (ras_node = orte_ras.node_lookup(map_node->nodename))) {
ORTE_ERROR_LOG(ORTE_ERR_NOT_FOUND);
goto cleanup;
}
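The revised bproc flow above replaces the old per-daemon OOB handshake with a single collection step: mpirun blocks in `orte_pls_base_daemon_callback` until every daemon reports in, and after the launch command is xcast it waits for one ack per daemon, treating any negative status as a failed I/O setup. A minimal sketch of that "collect N acks, abort on first negative status" pattern (the names `ack_t` and `collect_daemon_acks` are invented for illustration; the real loop lives in the ORTE pls base):

```c
#include <assert.h>

/* Hypothetical model of the consolidated callback pattern. Each daemon
 * sends back a single int status; a negative value means it failed to
 * set up its support (mirroring the "if (src < 0)" check in the diff). */
typedef struct { int status; } ack_t;

/* Returns 0 if all daemons reported success, or the first negative
 * status encountered, so the caller can abort the launch. */
int collect_daemon_acks(const ack_t *acks, int num_daemons)
{
    for (int i = 0; i < num_daemons; i++) {
        if (acks[i].status < 0) {
            return acks[i].status;   /* a daemon failed its setup */
        }
    }
    return 0;                        /* all daemons ready */
}
```

The design point is that failure detection collapses into one receive loop instead of being scattered across launcher-specific code paths.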


@@ -134,10 +134,6 @@ static int orte_pls_gridengine_fill_orted_path(char** orted_path)
*/
static void orte_pls_gridengine_wait_daemon(pid_t pid, int status, void* cbdata)
{
int rc;
orte_buffer_t ack;
int src[3] = {-1, -1};
if (! WIFEXITED(status) || ! WEXITSTATUS(status) == 0) {
/* tell the user something went wrong. We need to do this BEFORE we
* set the state to ABORTED as that action will cause a trigger to
@@ -166,35 +162,10 @@ static void orte_pls_gridengine_wait_daemon(pid_t pid, int status, void* cbdata)
opal_output(0, "No extra status information is available: %d.", status);
}
/* need to fake a message to the daemon callback system so it can break out
* of its receive loop
/* report that the daemon has failed so we break out of the daemon
* callback receive and can exit
*/
src[2] = pid;
if(WIFSIGNALED(status)) {
src[1] = WTERMSIG(status);
}
OBJ_CONSTRUCT(&ack, orte_buffer_t);
if (ORTE_SUCCESS != (rc = orte_dss.pack(&ack, &src, 3, ORTE_INT))) {
ORTE_ERROR_LOG(rc);
}
rc = orte_rml.send_buffer(ORTE_PROC_MY_NAME, &ack, ORTE_RML_TAG_ORTED_CALLBACK, 0);
if (0 > rc) {
ORTE_ERROR_LOG(rc);
}
OBJ_DESTRUCT(&ack);
/* The usual reasons for qrsh to exit abnormally all are a pretty good
indication that the child processes aren't going to start up properly.
Set the job state to indicate we failed to launch so orterun's exit status
will be non-zero and forcibly terminate the job so orterun can exit
*/
if (ORTE_SUCCESS != (rc = orte_smr.set_job_state(active_job, ORTE_JOB_STATE_FAILED_TO_START))) {
ORTE_ERROR_LOG(rc);
}
if (ORTE_SUCCESS != (rc = orte_wakeup(active_job))) {
ORTE_ERROR_LOG(rc);
}
orte_pls_base_daemon_failed(active_job, true, pid, status, ORTE_JOB_STATE_FAILED_TO_START);
}
}
@@ -288,8 +259,7 @@ int orte_pls_gridengine_launch_job(orte_jobid_t jobid)
*/
orte_pls_base_orted_append_basic_args(&argc, &argv,
&proc_name_index,
&node_name_index2,
map->num_nodes);
&node_name_index2);
/* setup environment. The environment is common to all the daemons
* so we only need to do this once
@@ -578,13 +548,7 @@ cleanup:
/* check for failed launch - if so, force terminate */
if (failed_launch) {
if (ORTE_SUCCESS != (rc = orte_smr.set_job_state(jobid, ORTE_JOB_STATE_FAILED_TO_START))) {
ORTE_ERROR_LOG(rc);
}
if (ORTE_SUCCESS != (rc = orte_wakeup(jobid))) {
ORTE_ERROR_LOG(rc);
}
orte_pls_base_daemon_failed(jobid, false, -1, 0, ORTE_JOB_STATE_FAILED_TO_START);
}
return rc;


@@ -223,9 +223,7 @@ static int pls_lsf_launch_job(orte_jobid_t jobid)
/* Add basic orted command line options */
orte_pls_base_orted_append_basic_args(&argc, &argv,
&proc_name_index,
NULL,
map->num_nodes /* need total #daemons here */
);
NULL);
/* force orted to use the lsf sds */
opal_argv_append(&argc, &argv, "--ns-nds");
@@ -359,14 +357,7 @@ cleanup:
/* check for failed launch - if so, force terminate */
if (failed_launch) {
if (ORTE_SUCCESS !=
(rc = orte_smr.set_job_state(jobid,
ORTE_JOB_STATE_FAILED_TO_START))) {
ORTE_ERROR_LOG(rc);
}
if (ORTE_SUCCESS != (rc = orte_wakeup(jobid))) {
ORTE_ERROR_LOG(rc);
}
orte_pls_base_daemon_failed(jobid, false, -1, 0, ORTE_JOB_STATE_FAILED_TO_START);
}
return rc;


@@ -558,8 +558,7 @@ int orte_pls_process_launch(orte_jobid_t jobid)
/* Add basic orted command line options */
orte_pls_base_orted_append_basic_args(&argc, &argv,
&proc_name_index,
&node_name_index2,
map->num_nodes);
&node_name_index2);
if (mca_pls_process_component.debug) {
param = opal_argv_join(argv, ' ');


@@ -274,10 +274,7 @@ static int orte_pls_rsh_fill_exec_path ( char ** exec_path)
static void orte_pls_rsh_wait_daemon(pid_t pid, int status, void* cbdata)
{
int rc;
unsigned long deltat;
orte_buffer_t ack;
int src[3] = {-1, -1};
if (! WIFEXITED(status) || ! WEXITSTATUS(status) == 0) {
/* tell the user something went wrong */
@@ -302,36 +299,10 @@ static void orte_pls_rsh_wait_daemon(pid_t pid, int status, void* cbdata)
} else {
opal_output(0, "No extra status information is available: %d.", status);
}
/* need to fake a message to the daemon callback system so it can break out
* of its receive loop
/* report that the daemon has failed so we break out of the daemon
* callback receive and can exit
*/
src[2] = pid;
if(WIFSIGNALED(status)) {
src[1] = WTERMSIG(status);
}
OBJ_CONSTRUCT(&ack, orte_buffer_t);
if (ORTE_SUCCESS != (rc = orte_dss.pack(&ack, &src, 3, ORTE_INT))) {
ORTE_ERROR_LOG(rc);
}
rc = orte_rml.send_buffer(ORTE_PROC_MY_NAME, &ack, ORTE_RML_TAG_ORTED_CALLBACK, 0);
if (0 > rc) {
ORTE_ERROR_LOG(rc);
}
OBJ_DESTRUCT(&ack);
/* The usual reasons for ssh to exit abnormally all are a pretty good
indication that the child processes aren't going to start up properly.
Set the job state to indicate we failed to launch so orterun's exit status
will be non-zero and forcibly terminate the job so orterun can exit
*/
if (ORTE_SUCCESS != (rc = orte_smr.set_job_state(active_job, ORTE_JOB_STATE_FAILED_TO_START))) {
ORTE_ERROR_LOG(rc);
}
if (ORTE_SUCCESS != (rc = orte_wakeup(active_job))) {
ORTE_ERROR_LOG(rc);
}
orte_pls_base_daemon_failed(active_job, true, pid, status, ORTE_JOB_STATE_FAILED_TO_START);
} /* if abnormal exit */
/* release any waiting threads */
@@ -555,8 +526,7 @@ int orte_pls_rsh_launch(orte_jobid_t jobid)
*/
orte_pls_base_orted_append_basic_args(&argc, &argv,
&proc_name_index,
&node_name_index2,
map->num_nodes);
&node_name_index2);
local_exec_index_end = argc;
if (mca_pls_rsh_component.debug) {
@@ -958,13 +928,7 @@ launch_apps:
/* check for failed launch - if so, force terminate */
if (failed_launch) {
if (ORTE_SUCCESS != (rc = orte_smr.set_job_state(jobid, ORTE_JOB_STATE_FAILED_TO_START))) {
ORTE_ERROR_LOG(rc);
}
if (ORTE_SUCCESS != (rc = orte_wakeup(jobid))) {
ORTE_ERROR_LOG(rc);
}
orte_pls_base_daemon_failed(jobid, false, -1, 0, ORTE_JOB_STATE_FAILED_TO_START);
}
return rc;


@@ -249,8 +249,7 @@ static int pls_slurm_launch_job(orte_jobid_t jobid)
/* Add basic orted command line options, including debug flags */
orte_pls_base_orted_append_basic_args(&argc, &argv,
&proc_name_index,
NULL,
num_nodes);
NULL);
/* force orted to use the slurm sds */
opal_argv_append(&argc, &argv, "--ns-nds");
@@ -392,13 +391,7 @@ cleanup:
/* check for failed launch - if so, force terminate */
if (failed_launch) {
if (ORTE_SUCCESS != (rc = orte_smr.set_job_state(jobid, ORTE_JOB_STATE_FAILED_TO_START))) {
ORTE_ERROR_LOG(rc);
}
if (ORTE_SUCCESS != (rc = orte_wakeup(jobid))) {
ORTE_ERROR_LOG(rc);
}
orte_pls_base_daemon_failed(jobid, false, -1, 0, ORTE_JOB_STATE_FAILED_TO_START);
}
return rc;
@@ -504,8 +497,6 @@ static void srun_wait_cb(pid_t pid, int status, void* cbdata){
wakes up - otherwise, do nothing!
*/
int rc;
if (0 != status) {
if (failed_launch) {
/* we have a problem during launch */
@@ -514,10 +505,10 @@ static void srun_wait_cb(pid_t pid, int status, void* cbdata){
opal_output(0, "ERROR: on one or more remote nodes, lack of authority to execute");
opal_output(0, "ERROR: on one or more specified nodes, or other factors.");
/* set the job state so we know it failed to start */
if (ORTE_SUCCESS != (rc = orte_smr.set_job_state(active_job, ORTE_JOB_STATE_FAILED_TO_START))) {
ORTE_ERROR_LOG(rc);
}
/* report that the daemon has failed so we break out of the daemon
* callback receive and exit
*/
orte_pls_base_daemon_failed(active_job, true, pid, status, ORTE_JOB_STATE_FAILED_TO_START);
} else {
/* an orted must have died unexpectedly after launch */
@@ -525,15 +516,9 @@ static void srun_wait_cb(pid_t pid, int status, void* cbdata){
opal_output(0, "ERROR: during execution of the application with a non-zero");
opal_output(0, "ERROR: status of %ld. This is a fatal error.", (long)status);
/* set the job state so we know it aborted */
if (ORTE_SUCCESS != (rc = orte_smr.set_job_state(active_job, ORTE_JOB_STATE_ABORTED))) {
ORTE_ERROR_LOG(rc);
}
/* report that the daemon has failed so we exit */
orte_pls_base_daemon_failed(active_job, false, pid, status, ORTE_JOB_STATE_ABORTED);
}
/* force termination of the job */
if (ORTE_SUCCESS != (rc = orte_wakeup(active_job))) {
ORTE_ERROR_LOG(rc);
}
}
}


@@ -526,8 +526,7 @@ int orte_pls_submit_launch(orte_jobid_t jobid)
*/
orte_pls_base_orted_append_basic_args(&argc, &argv,
&proc_name_index,
&node_name_index2,
map->num_nodes);
&node_name_index2);
local_exec_index_end = argc;
if (mca_pls_submit_component.debug) {


@@ -207,8 +207,7 @@ static int pls_tm_launch_job(orte_jobid_t jobid)
/* Add basic orted command line options */
orte_pls_base_orted_append_basic_args(&argc, &argv,
&proc_name_index,
&node_name_index,
map->num_nodes);
&node_name_index);
if (mca_pls_tm_component.debug) {
param = opal_argv_join(argv, ' ');
@@ -233,8 +232,6 @@ static int pls_tm_launch_job(orte_jobid_t jobid)
/* setup environment */
env = opal_argv_copy(environ);
var = mca_base_param_environ_variable("seed",NULL,NULL);
opal_setenv(var, "0", true, &env);
/* clean out any MCA component selection directives that
* won't work on remote nodes
@@ -452,13 +449,7 @@ launch_apps:
/* check for failed launch - if so, force terminate */
if (failed_launch) {
if (ORTE_SUCCESS != (rc = orte_smr.set_job_state(jobid, ORTE_JOB_STATE_FAILED_TO_START))) {
ORTE_ERROR_LOG(rc);
}
if (ORTE_SUCCESS != (rc = orte_wakeup(jobid))) {
ORTE_ERROR_LOG(rc);
}
orte_pls_base_daemon_failed(jobid, false, -1, 0, ORTE_JOB_STATE_FAILED_TO_START);
}
/* check for timing request - get stop time and process if so */


@@ -37,7 +37,6 @@
#include "orte/mca/errmgr/errmgr.h"
#include "orte/mca/ras/ras.h"
#include "orte/mca/ras/base/ras_private.h"
#include "orte/runtime/runtime_types.h"
#include "orte/mca/rds/rds.h"
#include "orte/mca/rds/base/rds_private.h"


@@ -39,6 +39,7 @@
#include "orte/mca/odls/odls.h"
#include "orte/mca/rmaps/rmaps.h"
#include "orte/mca/grpcomm/grpcomm.h"
#include "orte/mca/routed/routed.h"
#include "orte/mca/smr/smr.h"
#include "orte/runtime/runtime.h"
@@ -62,7 +63,6 @@ int orte_rmgr_base_proc_stage_gate_mgr(orte_gpr_notify_message_t *msg)
{
int rc;
orte_jobid_t job;
orte_buffer_t *buffer;
OPAL_TRACE(1);
@@ -102,21 +102,6 @@ int orte_rmgr_base_proc_stage_gate_mgr(orte_gpr_notify_message_t *msg)
ORTE_ERROR_LOG(rc);
goto CLEANUP;
}
} else if (orte_schema.check_std_trigger_name(msg->target, ORTE_STG2_TRIGGER)) {
if (ORTE_SUCCESS != (rc = orte_smr.set_job_state(job, ORTE_JOB_STATE_AT_STG2))) {
ORTE_ERROR_LOG(rc);
goto CLEANUP;
}
} else if (orte_schema.check_std_trigger_name(msg->target, ORTE_STG3_TRIGGER)) {
if (ORTE_SUCCESS != (rc = orte_smr.set_job_state(job, ORTE_JOB_STATE_AT_STG3))) {
ORTE_ERROR_LOG(rc);
goto CLEANUP;
}
} else if (orte_schema.check_std_trigger_name(msg->target, ORTE_ALL_FINALIZED_TRIGGER)) {
if (ORTE_SUCCESS != (rc = orte_smr.set_job_state(job, ORTE_JOB_STATE_FINALIZED))) {
ORTE_ERROR_LOG(rc);
goto CLEANUP;
}
} else if (orte_schema.check_std_trigger_name(msg->target, ORTE_ALL_TERMINATED_TRIGGER)) {
if (ORTE_SUCCESS != (rc = orte_smr.set_job_state(job, ORTE_JOB_STATE_TERMINATED))) {
ORTE_ERROR_LOG(rc);
@@ -134,14 +119,12 @@ int orte_rmgr_base_proc_stage_gate_mgr(orte_gpr_notify_message_t *msg)
}
}
#if 0
/* check to see if this came from a trigger that does not require we send
* out a message
*/
if (!orte_schema.check_std_trigger_name(msg->target, ORTE_STARTUP_TRIGGER) &&
!orte_schema.check_std_trigger_name(msg->target, ORTE_STG1_TRIGGER) &&
!orte_schema.check_std_trigger_name(msg->target, ORTE_STG2_TRIGGER) &&
!orte_schema.check_std_trigger_name(msg->target, ORTE_STG3_TRIGGER) &&
!orte_schema.check_std_trigger_name(msg->target, ORTE_ALL_FINALIZED_TRIGGER)) {
!orte_schema.check_std_trigger_name(msg->target, ORTE_STG1_TRIGGER)) {
return ORTE_SUCCESS;
}
@@ -169,7 +152,18 @@ int orte_rmgr_base_proc_stage_gate_mgr(orte_gpr_notify_message_t *msg)
ORTE_ERROR_LOG(rc);
}
OBJ_RELEASE(buffer);
#endif
if (orte_schema.check_std_trigger_name(msg->target, ORTE_STG1_TRIGGER)) {
if (ORTE_SUCCESS != (rc = orte_routed.init_routes(job, NULL))) {
ORTE_ERROR_LOG(rc);
return rc;
}
} else {
opal_output(0, "rmgr_stage_gate: got trigger %s", msg->target);
rc = ORTE_SUCCESS;
}
CLEANUP:
return rc;


@@ -400,18 +400,6 @@ static void orte_rmgr_proxy_callback(orte_gpr_notify_data_t *data, void *cbdata)
(*cbfunc)(jobid,ORTE_PROC_STATE_AT_STG1);
continue;
}
if(strcmp(keyval->key, ORTE_PROC_NUM_AT_STG2) == 0) {
(*cbfunc)(jobid,ORTE_PROC_STATE_AT_STG2);
continue;
}
if(strcmp(keyval->key, ORTE_PROC_NUM_AT_STG3) == 0) {
(*cbfunc)(jobid,ORTE_PROC_STATE_AT_STG3);
continue;
}
if(strcmp(keyval->key, ORTE_PROC_NUM_FINALIZED) == 0) {
(*cbfunc)(jobid,ORTE_PROC_STATE_FINALIZED);
continue;
}
if(strcmp(keyval->key, ORTE_PROC_NUM_TERMINATED) == 0) {
(*cbfunc)(jobid,ORTE_PROC_STATE_TERMINATED);
continue;


@@ -288,24 +288,6 @@ static void orte_rmgr_urm_callback(orte_gpr_notify_data_t *data, void *cbdata)
(*cbfunc)(jobid,ORTE_PROC_STATE_AT_STG1);
continue;
}
if(strcmp(keyval->key, ORTE_PROC_NUM_AT_STG2) == 0) {
(*cbfunc)(jobid,ORTE_PROC_STATE_AT_STG2);
continue;
}
if(strcmp(keyval->key, ORTE_PROC_NUM_AT_STG3) == 0) {
(*cbfunc)(jobid,ORTE_PROC_STATE_AT_STG3);
continue;
}
if(strcmp(keyval->key, ORTE_PROC_NUM_FINALIZED) == 0) {
#if OPAL_ENABLE_FT == 1
/* Stop tracking this job */
if(ORTE_SUCCESS != (rc = orte_snapc.release_job(jobid))) {
ORTE_ERROR_LOG(rc);
}
#endif
(*cbfunc)(jobid,ORTE_PROC_STATE_FINALIZED);
continue;
}
if(strcmp(keyval->key, ORTE_PROC_NUM_TERMINATED) == 0) {
#if OPAL_ENABLE_FT == 1
/* Stop tracking this job */


@@ -37,6 +37,7 @@ orte_rml_base_open(void)
{
int ret;
orte_data_type_t tmp;
int param, value;
/* Initialize globals */
OBJ_CONSTRUCT(&orte_rml_base_components, opal_list_t);
@@ -67,6 +68,17 @@ orte_rml_base_open(void)
false, false,
NULL, NULL);
/* register parameters */
param = mca_base_param_reg_int_name("rml", "base_verbose",
"Verbosity level for the rml framework",
false, false, 0, &value);
if (value != 0) {
orte_rml_base_output = opal_output_open(NULL);
} else {
orte_rml_base_output = -1;
}
/* Open up all available components */
ret = mca_base_components_open("rml",
orte_rml_base_output,
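The hunk above adds a framework-level `rml_base_verbose` MCA parameter: a nonzero value opens a real output stream, while zero leaves the stream id at -1, which silences the `OPAL_OUTPUT_VERBOSE` calls sprinkled through the new routing code. A tiny model of that gating idiom (`resolve_output_stream` is an invented stand-in for the `mca_base_param_reg_int_name` / `opal_output_open` pair):

```c
/* Sketch, under assumption: stream id 1 stands in for whatever
 * opal_output_open() would return; -1 is the conventional
 * "verbose output disabled" sentinel used by the RML base. */
int resolve_output_stream(int verbose_level)
{
    if (verbose_level != 0) {
        return 1;    /* a real stream: verbose macros will print */
    }
    return -1;       /* disabled: verbose macros become no-ops */
}
```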


@@ -12,6 +12,7 @@
#include "orte/mca/errmgr/errmgr.h"
#include "orte/mca/ns/ns.h"
#include "orte/mca/gpr/gpr.h"
#include "orte/mca/routed/routed.h"
#include "orte/mca/oob/oob_types.h"
extern opal_list_t orte_rml_base_subscriptions;
@@ -233,6 +234,7 @@ orte_rml_base_contact_info_notify(orte_gpr_notify_data_t* data,
orte_gpr_value_t **values, *value;
orte_gpr_keyval_t *keyval;
char *contact_info;
orte_process_name_t name;
/* process the callback */
values = (orte_gpr_value_t**)(data->values)->addr;
@@ -249,6 +251,9 @@
continue;
orte_dss.get((void**)&(contact_info), keyval->value, ORTE_STRING);
orte_rml.set_contact_info(contact_info);
/* also have to set the route, so extract the process name */
orte_rml_base_parse_uris(contact_info, &name, NULL);
orte_routed.update_route(&name, &name);
}
}
}


@@ -147,7 +147,7 @@ int
orte_rml_cnos_send(orte_process_name_t * peer,
struct iovec *msg, int count, int tag, int flags)
{
return ORTE_ERR_NOT_SUPPORTED;
return ORTE_SUCCESS;
}
int
@@ -155,7 +155,7 @@ orte_rml_cnos_send_buffer(orte_process_name_t * peer,
orte_buffer_t * buffer,
orte_rml_tag_t tag, int flags)
{
return ORTE_ERR_NOT_SUPPORTED;
return ORTE_SUCCESS;
}
int
@@ -163,14 +163,14 @@ orte_rml_cnos_recv(orte_process_name_t * peer,
struct iovec *msg,
int count, orte_rml_tag_t tag, int flags)
{
return ORTE_ERR_NOT_SUPPORTED;
return ORTE_SUCCESS;
}
int
orte_rml_cnos_recv_buffer(orte_process_name_t * peer,
orte_buffer_t * buf, orte_rml_tag_t tag, int flags)
{
return ORTE_ERR_NOT_SUPPORTED;
return ORTE_SUCCESS;
}
int
@@ -217,7 +217,7 @@ orte_rml_cnos_recv_buffer_nb(orte_process_name_t * peer,
int
orte_rml_cnos_recv_cancel(orte_process_name_t * peer, orte_rml_tag_t tag)
{
return ORTE_ERR_NOT_SUPPORTED;
return ORTE_SUCCESS;
}
int orte_rml_cnos_add_exception_handler(orte_rml_exception_callback_t cbfunc)


@@ -32,6 +32,8 @@ struct orte_rml_oob_module_t {
mca_oob_t *active_oob;
opal_list_t exceptions;
opal_mutex_t exceptions_lock;
opal_list_t queued_routing_messages;
opal_mutex_t queued_lock;
};
typedef struct orte_rml_oob_module_t orte_rml_oob_module_t;
@@ -100,6 +102,12 @@ struct orte_rml_oob_msg_t {
typedef struct orte_rml_oob_msg_t orte_rml_oob_msg_t;
OBJ_CLASS_DECLARATION(orte_rml_oob_msg_t);
struct orte_rml_oob_queued_msg_t {
opal_list_item_t super;
struct iovec payload[1];
};
typedef struct orte_rml_oob_queued_msg_t orte_rml_oob_queued_msg_t;
OBJ_CLASS_DECLARATION(orte_rml_oob_queued_msg_t);
int orte_rml_oob_init(void);
int orte_rml_oob_fini(void);


@@ -136,6 +136,8 @@ rml_oob_init(int* priority)
OBJ_CONSTRUCT(&orte_rml_oob_module.exceptions, opal_list_t);
OBJ_CONSTRUCT(&orte_rml_oob_module.exceptions_lock, opal_mutex_t);
OBJ_CONSTRUCT(&orte_rml_oob_module.queued_routing_messages, opal_list_t);
OBJ_CONSTRUCT(&orte_rml_oob_module.queued_lock, opal_mutex_t);
orte_rml_oob_module.active_oob = &mca_oob;
orte_rml_oob_module.active_oob->oob_exception_callback =
@@ -144,20 +146,20 @@ rml_oob_init(int* priority)
return &orte_rml_oob_module.super;
}
static struct iovec route_recv_iov[1];
int
orte_rml_oob_init(void)
{
int ret;
struct iovec iov[1];
ret = orte_rml_oob_module.active_oob->oob_init();
iov[0].iov_base = NULL;
iov[0].iov_len = 0;
route_recv_iov[0].iov_base = NULL;
route_recv_iov[0].iov_len = 0;
ret = orte_rml_oob_module.active_oob->oob_recv_nb(ORTE_NAME_WILDCARD,
iov, 1,
route_recv_iov, 1,
ORTE_RML_TAG_RML_ROUTE,
ORTE_RML_ALLOC|ORTE_RML_PERSISTENT,
rml_oob_recv_route_callback,
@@ -178,6 +180,8 @@ orte_rml_oob_fini(void)
}
OBJ_DESTRUCT(&orte_rml_oob_module.exceptions);
OBJ_DESTRUCT(&orte_rml_oob_module.exceptions_lock);
OBJ_DESTRUCT(&orte_rml_oob_module.queued_routing_messages);
OBJ_DESTRUCT(&orte_rml_oob_module.queued_lock);
orte_rml_oob_module.active_oob->oob_exception_callback = NULL;
return ORTE_SUCCESS;
@@ -282,6 +286,125 @@ msg_destruct(orte_rml_oob_msg_t *msg)
OBJ_CLASS_INSTANCE(orte_rml_oob_msg_t, opal_object_t,
msg_construct, msg_destruct);
static void
queued_msg_construct(orte_rml_oob_queued_msg_t *msg)
{
msg->payload[0].iov_base = NULL;
msg->payload[0].iov_len = 0;
}
static void
queued_msg_destruct(orte_rml_oob_queued_msg_t *msg)
{
if (NULL != msg->payload[0].iov_base) free(msg->payload[0].iov_base);
}
OBJ_CLASS_INSTANCE(orte_rml_oob_queued_msg_t, opal_list_item_t,
queued_msg_construct, queued_msg_destruct);
static void
rml_oob_recv_route_queued_send_callback(int status,
struct orte_process_name_t* peer,
struct iovec* iov,
int count,
orte_rml_tag_t tag,
void* cbdata)
{
orte_rml_oob_queued_msg_t *qmsg = (orte_rml_oob_queued_msg_t*) cbdata;
OBJ_RELEASE(qmsg);
}
static int
rml_oob_queued_progress(void)
{
orte_rml_oob_queued_msg_t *qmsg;
orte_rml_oob_msg_header_t *hdr;
int real_tag;
int ret;
orte_process_name_t next, origin;
int count = 0;
while (true) {
OPAL_THREAD_LOCK(&orte_rml_oob_module.queued_lock);
qmsg = (orte_rml_oob_queued_msg_t*) opal_list_remove_first(&orte_rml_oob_module.queued_routing_messages);
if (0 == opal_list_get_size(&orte_rml_oob_module.queued_routing_messages)) {
opal_progress_unregister(rml_oob_queued_progress);
}
OPAL_THREAD_UNLOCK(&orte_rml_oob_module.queued_lock);
if (NULL == qmsg) break;
hdr = (orte_rml_oob_msg_header_t*) qmsg->payload;
origin = hdr->origin;
next = orte_routed.get_route(&hdr->destination);
if (next.vpid == ORTE_VPID_INVALID) {
opal_output(0,
"%s tried routing message to %s, can't find route",
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
ORTE_NAME_PRINT(&hdr->destination));
abort();
}
if (0 == orte_ns.compare_fields(ORTE_NS_CMP_ALL, &next, ORTE_PROC_MY_NAME)) {
opal_output(0, "%s trying to get message to %s, routing loop",
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
ORTE_NAME_PRINT(&hdr->destination));
abort();
}
if (0 == orte_ns.compare_fields(ORTE_NS_CMP_ALL, &next, &hdr->destination)) {
real_tag = hdr->tag;
} else {
real_tag = ORTE_RML_TAG_RML_ROUTE;
}
OPAL_OUTPUT_VERBOSE((1, orte_rml_base_output,
"%s routing message from %s for %s to %s (tag: %d)",
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
ORTE_NAME_PRINT(&hdr->origin),
ORTE_NAME_PRINT(&hdr->destination),
ORTE_NAME_PRINT(&next),
hdr->tag));
ORTE_RML_OOB_MSG_HEADER_HTON(*hdr);
ret = orte_rml_oob_module.active_oob->oob_send_nb(&next,
&origin,
qmsg->payload,
1,
real_tag,
0,
rml_oob_recv_route_queued_send_callback,
qmsg);
if (ORTE_SUCCESS != ret) {
if (ORTE_ERR_ADDRESSEE_UNKNOWN == ret) {
/* still no route -- try again */
OPAL_THREAD_LOCK(&orte_rml_oob_module.queued_lock);
opal_list_append(&orte_rml_oob_module.queued_routing_messages,
&qmsg->super);
if (1 == opal_list_get_size(&orte_rml_oob_module.queued_routing_messages)) {
opal_progress_register(rml_oob_queued_progress);
}
OPAL_THREAD_UNLOCK(&orte_rml_oob_module.queued_lock);
} else {
opal_output(0,
"%s failed to send message to %s: %s (rc = %d)",
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
ORTE_NAME_PRINT(&next),
opal_strerror(ret),
ret);
abort();
}
}
count++;
}
return count;
}
static void
rml_oob_recv_route_send_callback(int status,
struct orte_process_name_t* peer,
@@ -290,12 +413,9 @@ rml_oob_recv_route_send_callback(int status,
orte_rml_tag_t tag,
void* cbdata)
{
/* BWB -- propagate errors here... */
if (NULL != iov[0].iov_base) free(iov[0].iov_base);
}
static void
rml_oob_recv_route_callback(int status,
struct orte_process_name_t* peer,
@@ -308,29 +428,84 @@ rml_oob_recv_route_callback(int status,
(orte_rml_oob_msg_header_t*) iov[0].iov_base;
int real_tag;
int ret;
orte_process_name_t next;
orte_process_name_t next, origin;
/* BWB -- propagate errors here... */
assert(status >= 0);
ORTE_RML_OOB_MSG_HEADER_NTOH(*hdr);
origin = hdr->origin;
next = orte_routed.get_route(&hdr->destination);
if (next.vpid == ORTE_VPID_INVALID) {
ORTE_ERROR_LOG(ORTE_ERR_ADDRESSEE_UNKNOWN);
opal_output(0,
"%s tried routing message to %s, can't find route",
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
ORTE_NAME_PRINT(&hdr->destination));
abort();
}
if (0 == orte_ns.compare_fields(ORTE_NS_CMP_ALL, &next, peer)) {
if (0 == orte_ns.compare_fields(ORTE_NS_CMP_ALL, &next, ORTE_PROC_MY_NAME)) {
opal_output(0, "%s trying to get message to %s, routing loop",
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
ORTE_NAME_PRINT(&hdr->destination));
abort();
}
if (0 == orte_ns.compare_fields(ORTE_NS_CMP_ALL, &next, &hdr->destination)) {
real_tag = hdr->tag;
} else {
real_tag = ORTE_RML_TAG_RML_ROUTE;
}
OPAL_OUTPUT_VERBOSE((1, orte_rml_base_output,
"%s routing message from %s for %s to %s (tag: %d)",
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
ORTE_NAME_PRINT(&hdr->origin),
ORTE_NAME_PRINT(&hdr->destination),
ORTE_NAME_PRINT(&next),
hdr->tag));
ORTE_RML_OOB_MSG_HEADER_HTON(*hdr);
ret = orte_rml_oob_module.active_oob->oob_send_nb(&next,
&origin,
iov,
count,
real_tag,
0,
rml_oob_recv_route_send_callback,
NULL);
assert(ret == ORTE_SUCCESS);
if (ORTE_SUCCESS != ret) {
if (ORTE_ERR_ADDRESSEE_UNKNOWN == ret) {
/* no route -- queue and hope we find a route */
orte_rml_oob_queued_msg_t *qmsg = OBJ_NEW(orte_rml_oob_queued_msg_t);
OPAL_OUTPUT_VERBOSE((1, orte_rml_base_output,
"%s: no OOB information for %s. Queuing for later.",
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
ORTE_NAME_PRINT(&next)));
ORTE_RML_OOB_MSG_HEADER_NTOH(*hdr);
qmsg->payload[0].iov_base = malloc(iov[0].iov_len);
if (NULL == qmsg->payload[0].iov_base) abort();
qmsg->payload[0].iov_len = iov[0].iov_len;
memcpy(qmsg->payload[0].iov_base, iov[0].iov_base, iov[0].iov_len);
OPAL_THREAD_LOCK(&orte_rml_oob_module.queued_lock);
opal_list_append(&orte_rml_oob_module.queued_routing_messages,
&qmsg->super);
if (1 == opal_list_get_size(&orte_rml_oob_module.queued_routing_messages)) {
opal_progress_register(rml_oob_queued_progress);
}
OPAL_THREAD_UNLOCK(&orte_rml_oob_module.queued_lock);
} else {
opal_output(0,
"%s failed to send message to %s: %s (rc = %d)",
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
ORTE_NAME_PRINT(&next),
opal_strerror(ret),
ret);
abort();
}
}
}
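The two hunks above implement the same queue-and-retry discipline: if forwarding a routed message fails with `ORTE_ERR_ADDRESSEE_UNKNOWN`, the message is copied onto `queued_routing_messages` and a progress function is registered to drain the list once contact info arrives. A toy model of that logic (all names below are invented; the real code uses `opal_list_t`, thread locks, and `opal_progress_register`):

```c
#include <stdlib.h>

#define ERR_ADDRESSEE_UNKNOWN (-42)   /* stand-in error code */

typedef struct qmsg { struct qmsg *next; int dest; } qmsg_t;
static qmsg_t *pending = NULL;        /* queued_routing_messages analogue */

/* try_send: succeeds (0) only when a route to dest is known */
static int try_send(int dest, int route_known)
{
    (void)dest;
    return route_known ? 0 : ERR_ADDRESSEE_UNKNOWN;
}

/* On an unknown addressee, queue the message instead of dropping it */
void send_or_queue(int dest, int route_known)
{
    if (try_send(dest, route_known) == ERR_ADDRESSEE_UNKNOWN) {
        qmsg_t *m = malloc(sizeof(*m));
        m->dest = dest;
        m->next = pending;
        pending = m;                  /* opal_list_append analogue */
    }
}

/* Progress function: retry queued messages; returns how many were sent */
int progress_queued(int route_known)
{
    int count = 0;
    while (pending != NULL && try_send(pending->dest, route_known) == 0) {
        qmsg_t *m = pending;
        pending = m->next;
        free(m);
        count++;
    }
    return count;
}
```

The design choice mirrors the diff: rather than aborting when a route is momentarily unknown during startup, the message waits until `update_route` has populated the routing table.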


@@ -10,8 +10,10 @@
#include "rml_oob.h"
#include "opal/util/output.h"
#include "orte/mca/oob/oob.h"
#include "orte/mca/oob/base/base.h"
#include "orte/mca/rml/base/base.h"
#include "orte/dss/dss.h"
@@ -26,6 +28,14 @@ orte_rml_recv_msg_callback(int status,
orte_rml_oob_msg_t *msg = (orte_rml_oob_msg_t*) cbdata;
orte_rml_oob_msg_header_t *hdr =
(orte_rml_oob_msg_header_t*) iov[0].iov_base;
ORTE_RML_OOB_MSG_HEADER_NTOH(*hdr);
OPAL_OUTPUT_VERBOSE((1, orte_rml_base_output,
"%s recv from %s for %s (tag %d)",
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
ORTE_NAME_PRINT(&hdr->origin),
ORTE_NAME_PRINT(&hdr->destination),
hdr->tag));
if (msg->msg_type == ORTE_RML_BLOCKING_RECV) {
/* blocking send */
@@ -38,7 +48,6 @@ orte_rml_recv_msg_callback(int status,
status -= sizeof(orte_rml_oob_msg_header_t);
}
ORTE_RML_OOB_MSG_HEADER_NTOH(*hdr);
msg->msg_cbfunc.iov(status, &hdr->origin, iov + 1, count - 1,
hdr->tag, msg->msg_cbdata);
if (!msg->msg_persistent) OBJ_RELEASE(msg);
@@ -49,7 +58,6 @@ orte_rml_recv_msg_callback(int status,
iov[1].iov_base,
iov[1].iov_len);
ORTE_RML_OOB_MSG_HEADER_NTOH(*hdr);
msg->msg_cbfunc.buffer(status, &hdr->origin, &msg->msg_recv_buffer,
hdr->tag, msg->msg_cbdata);


@ -10,12 +10,14 @@
#include "rml_oob.h"
#include "opal/util/output.h"
#include "orte/mca/routed/routed.h"
#include "orte/mca/oob/oob.h"
#include "orte/mca/oob/base/base.h"
#include "orte/dss/dss.h"
#include "orte/mca/ns/ns.h"
#include "orte/mca/errmgr/errmgr.h"
#include "orte/mca/rml/base/base.h"
static void
orte_rml_send_msg_callback(int status,
@@ -83,6 +85,7 @@ orte_rml_oob_send(orte_process_name_t* peer,
next = orte_routed.get_route(peer);
if (next.vpid == ORTE_VPID_INVALID) {
ORTE_ERROR_LOG(ORTE_ERR_ADDRESSEE_UNKNOWN);
opal_output(0, "%s attempted to send to %s", ORTE_NAME_PRINT(ORTE_PROC_MY_NAME), ORTE_NAME_PRINT(peer));
return ORTE_ERR_ADDRESSEE_UNKNOWN;
}
msg->msg_data = (struct iovec *) malloc(sizeof(struct iovec) * (count + 1));
@@ -106,7 +109,15 @@ orte_rml_oob_send(orte_process_name_t* peer,
real_tag = ORTE_RML_TAG_RML_ROUTE;
}
OPAL_OUTPUT_VERBOSE((1, orte_rml_base_output,
"rml_send %s -> %s (router %s, tag %d, %d)",
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
ORTE_NAME_PRINT(peer),
ORTE_NAME_PRINT(&next),
tag,
real_tag));
ret = orte_rml_oob_module.active_oob->oob_send_nb(&next,
ORTE_PROC_MY_NAME,
msg->msg_data,
count + 1,
real_tag,
@@ -177,7 +188,14 @@ orte_rml_oob_send_nb(orte_process_name_t* peer,
real_tag = ORTE_RML_TAG_RML_ROUTE;
}
OPAL_OUTPUT_VERBOSE((1, orte_rml_base_output,
"rml_send_nb %s -> %s (router %s, tag %d, %d)",
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
ORTE_NAME_PRINT(peer),
ORTE_NAME_PRINT(&next),
tag, real_tag));
ret = orte_rml_oob_module.active_oob->oob_send_nb(&next,
ORTE_PROC_MY_NAME,
msg->msg_data,
count + 1,
real_tag,
@@ -247,6 +265,9 @@ orte_rml_oob_send_buffer_nb(orte_process_name_t* peer,
next = orte_routed.get_route(peer);
if (next.vpid == ORTE_VPID_INVALID) {
ORTE_ERROR_LOG(ORTE_ERR_ADDRESSEE_UNKNOWN);
opal_output(0, "%s unable to find address for %s",
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
ORTE_NAME_PRINT(peer));
return ORTE_ERR_ADDRESSEE_UNKNOWN;
}
@@ -272,7 +293,15 @@ orte_rml_oob_send_buffer_nb(orte_process_name_t* peer,
OBJ_RETAIN(buffer);
OPAL_OUTPUT_VERBOSE((1, orte_rml_base_output,
"rml_send_buffer_nb %s -> %s (router %s, tag %d, %d)",
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
ORTE_NAME_PRINT(peer),
ORTE_NAME_PRINT(&next),
tag, real_tag));
ret = orte_rml_oob_module.active_oob->oob_send_nb(&next,
ORTE_PROC_MY_NAME,
msg->msg_data,
2,
real_tag,

@@ -86,8 +86,17 @@ BEGIN_C_DECLS
#define ORTE_RML_TAG_COMM_CID_INTRA 28
#define ORTE_RML_TAG_ALLGATHER 29
#define ORTE_RML_TAG_ALLGATHER_LIST 30
#define ORTE_RML_TAG_BARRIER 31
#define ORTE_RML_TAG_INIT_ROUTES 32
#define ORTE_RML_TAG_SYNC 33
/* For FileM RSH Component */
#define ORTE_RML_TAG_FILEM_RSH 29
#define ORTE_RML_TAG_FILEM_RSH 34
/* For CRCP Coord Component */
#define OMPI_CRCP_COORD_BOOKMARK_TAG 4242

orte/mca/routed/cnos/Makefile.am (new file)

@@ -0,0 +1,46 @@
#
# Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
# University Research and Technology
# Corporation. All rights reserved.
# Copyright (c) 2004-2005 The University of Tennessee and The University
# of Tennessee Research Foundation. All rights
# reserved.
# Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
# University of Stuttgart. All rights reserved.
# Copyright (c) 2004-2005 The Regents of the University of California.
# All rights reserved.
# $COPYRIGHT$
#
# Additional copyrights may follow
#
# $HEADER$
#
sources = \
routed_cnos.h \
routed_cnos_module.c \
routed_cnos_component.c
# Make the output library in this directory, and name it either
# mca_<type>_<name>.la (for DSO builds) or libmca_<type>_<name>.la
# (for static builds).
if OMPI_BUILD_routed_cnos_DSO
component_noinst =
component_install = mca_routed_cnos.la
else
component_noinst = libmca_routed_cnos.la
component_install =
endif
mcacomponentdir = $(pkglibdir)
mcacomponent_LTLIBRARIES = $(component_install)
mca_routed_cnos_la_SOURCES = $(sources)
mca_routed_cnos_la_LDFLAGS = -module -avoid-version
mca_routed_cnos_la_LIBADD = \
$(top_ompi_builddir)/orte/libopen-rte.la \
$(top_ompi_builddir)/opal/libopen-pal.la
noinst_LTLIBRARIES = $(component_noinst)
libmca_routed_cnos_la_SOURCES = $(sources)
libmca_routed_cnos_la_LDFLAGS = -module -avoid-version

orte/mca/routed/cnos/configure.m4 (new file)

@@ -0,0 +1,40 @@
# -*- shell-script -*-
#
# Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
# University Research and Technology
# Corporation. All rights reserved.
# Copyright (c) 2004-2005 The University of Tennessee and The University
# of Tennessee Research Foundation. All rights
# reserved.
# Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
# University of Stuttgart. All rights reserved.
# Copyright (c) 2004-2005 The Regents of the University of California.
# All rights reserved.
# $COPYRIGHT$
#
# Additional copyrights may follow
#
# $HEADER$
#
# MCA_routed_cnos_CONFIG([action-if-found], [action-if-not-found])
# -----------------------------------------------------------
AC_DEFUN([MCA_routed_cnos_CONFIG],[
routed_cnos_happy="no"
# see if we should enable super secret utcp support
if test "$with_routed_cnos" = "utcp" ; then
routed_cnos_happy="yes"
routed_cnos_barrier=0
else
# check for cnos functions
AC_CHECK_FUNC([cnos_barrier],
[routed_cnos_happy="yes"
routed_cnos_barrier=1],
[routed_cnos_happy="no"
routed_cnos_barrier=0])
fi
AC_DEFINE_UNQUOTED([OMPI_ROUTED_CNOS_HAVE_BARRIER], [$routed_cnos_barrier],
[whether to use cnos_barrier or not])
AS_IF([test "$routed_cnos_happy" = "yes"], [$1], [$2])
])dnl

orte/mca/routed/cnos/configure.params (new file)

@@ -0,0 +1,24 @@
# -*- shell-script -*-
#
# Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
# University Research and Technology
# Corporation. All rights reserved.
# Copyright (c) 2004-2005 The University of Tennessee and The University
# of Tennessee Research Foundation. All rights
# reserved.
# Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
# University of Stuttgart. All rights reserved.
# Copyright (c) 2004-2005 The Regents of the University of California.
# All rights reserved.
# Copyright (c) 2007 Los Alamos National Security, LLC. All rights
# reserved.
# $COPYRIGHT$
#
# Additional copyrights may follow
#
# $HEADER$
#
# Specific to this module
PARAM_CONFIG_FILES="Makefile"

@@ -1,8 +1,9 @@
/*
* Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
/* -*- C -*-
*
* Copyright (c) 2004-2007 The Trustees of Indiana University and Indiana
* University Research and Technology
* Corporation. All rights reserved.
* Copyright (c) 2004-2005 The University of Tennessee and The University
* Copyright (c) 2004-2006 The University of Tennessee and The University
* of Tennessee Research Foundation. All rights
* reserved.
* Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
@@ -14,18 +15,32 @@
* Additional copyrights may follow
*
* $HEADER$
*/
/**
* @file
*
* Internal Run-Time interface functionality
*/
#ifndef OMPI_RUNTIME_INTERNAL_H
#define OMPI_RUNTIME_INTERNAL_H
#ifndef ROUTED_CNOS_H
#define ROUTED_CNOS_H
#include "orte_config.h"
#include "orte/orte_types.h"
#include "orte/orte_constants.h"
#include "opal/threads/mutex.h"
#include "opal/threads/condition.h"
#include "opal/class/opal_object.h"
#include "orte/mca/routed/routed.h"
BEGIN_C_DECLS
/*
* Module open / close
*/
orte_routed_module_t* orte_routed_cnos_init(int *priority);
ORTE_MODULE_DECLSPEC extern orte_routed_component_t mca_routed_cnos_component;
extern orte_routed_module_t orte_routed_cnos_module;
END_C_DECLS
#endif

@@ -0,0 +1,69 @@
/* -*- C -*-
*
* Copyright (c) 2004-2007 The Trustees of Indiana University and Indiana
* University Research and Technology
* Corporation. All rights reserved.
* Copyright (c) 2004-2005 The University of Tennessee and The University
* of Tennessee Research Foundation. All rights
* reserved.
* Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
* University of Stuttgart. All rights reserved.
* Copyright (c) 2004-2005 The Regents of the University of California.
* All rights reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
/** @file:
*
*/
/*
* includes
*/
#include "orte_config.h"
#include "orte/orte_constants.h"
#include "orte/orte_types.h"
#include "opal/mca/mca.h"
#include "opal/mca/base/mca_base_param.h"
#include "routed_cnos.h"
/*
* Struct of function pointers that need to be initialized
*/
orte_routed_component_t mca_routed_cnos_component = {
{
ORTE_ROUTED_BASE_VERSION_1_0_0,
"cnos", /* MCA module name */
ORTE_MAJOR_VERSION, /* MCA module major version */
ORTE_MINOR_VERSION, /* MCA module minor version */
ORTE_RELEASE_VERSION, /* MCA module release version */
NULL, /* module open */
NULL /* module close */
},
{
/* The component is checkpoint ready */
MCA_BASE_METADATA_PARAM_CHECKPOINT
},
orte_routed_cnos_init /* component init */
};
/*
* instantiate globals needed within cnos component
*/
orte_routed_module_t* orte_routed_cnos_init(int *priority)
{
/* we are the default, so set a low priority so we can be overridden */
*priority = 50;
return &orte_routed_cnos_module;
}

orte/mca/routed/cnos/routed_cnos_module.c (new file)

@@ -0,0 +1,91 @@
/*
* Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
* University Research and Technology
* Corporation. All rights reserved.
* Copyright (c) 2004-2005 The University of Tennessee and The University
* of Tennessee Research Foundation. All rights
* reserved.
* Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
* University of Stuttgart. All rights reserved.
* Copyright (c) 2004-2005 The Regents of the University of California.
* All rights reserved.
* Copyright (c) 2007 Sun Microsystems, Inc. All rights reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
#include "orte_config.h"
#include "orte/orte_constants.h"
#include "orte/mca/ns/ns_types.h"
#include "orte/mca/gpr/gpr_types.h"
#include "orte/mca/routed/base/base.h"
#include "routed_cnos.h"
#if OMPI_ROUTED_CNOS_HAVE_BARRIER
#include <catamount/cnos_mpi_os.h>
#endif
/* API functions */
static int orte_routed_cnos_finalize(void);
static int orte_routed_cnos_update_route(orte_process_name_t *target,
orte_process_name_t *route);
static orte_process_name_t orte_routed_cnos_get_route(orte_process_name_t *target);
static int orte_routed_cnos_init_routes(orte_jobid_t job, orte_gpr_notify_data_t *ndat);
static int orte_routed_cnos_warmup_routes(void);
orte_routed_module_t orte_routed_cnos_module = {
orte_routed_cnos_finalize,
orte_routed_cnos_update_route,
orte_routed_cnos_get_route,
orte_routed_cnos_init_routes,
orte_routed_cnos_warmup_routes
};
static int
orte_routed_cnos_finalize(void)
{
return ORTE_SUCCESS;
}
static int
orte_routed_cnos_update_route(orte_process_name_t *target,
orte_process_name_t *route)
{
OPAL_OUTPUT_VERBOSE((1, orte_routed_base_output,
"routed_cnos_update: %s --> %s",
ORTE_NAME_PRINT(target),
ORTE_NAME_PRINT(route)));
return ORTE_SUCCESS;
}
static orte_process_name_t
orte_routed_cnos_get_route(orte_process_name_t *target)
{
OPAL_OUTPUT_VERBOSE((1, orte_routed_base_output,
"routed_cnos_get(%s) --> %s",
ORTE_NAME_PRINT(target),
ORTE_NAME_PRINT(target)));
return *target;
}
static int orte_routed_cnos_init_routes(orte_jobid_t job, orte_gpr_notify_data_t *ndat)
{
return ORTE_SUCCESS;
}
static int orte_routed_cnos_warmup_routes(void)
{
return ORTE_SUCCESS;
}

@@ -27,6 +27,7 @@
#include "opal/mca/mca.h"
#include "orte/mca/rml/rml_types.h"
#include "orte/mca/gpr/gpr_types.h"
#include "opal/mca/crs/crs.h"
#include "opal/mca/crs/base/base.h"
@@ -126,6 +127,9 @@ typedef int (*orte_routed_module_update_route_fn_t)(orte_process_name_t *target,
typedef orte_process_name_t (*orte_routed_module_get_route_fn_t)(orte_process_name_t *target);
typedef int (*orte_routed_module_init_routes_fn_t)(orte_jobid_t job, orte_gpr_notify_data_t *ndat);
typedef int (*orte_routed_module_warmup_routes_fn_t)(void);
/* ******************************************************************** */
@@ -143,6 +147,8 @@ struct orte_routed_module_t {
orte_routed_module_update_route_fn_t update_route;
orte_routed_module_get_route_fn_t get_route;
orte_routed_module_init_routes_fn_t init_routes;
orte_routed_module_warmup_routes_fn_t warmup_routes;
};
/** Convenience typedef */
typedef struct orte_routed_module_t orte_routed_module_t;

@@ -9,16 +9,21 @@
*/
#include "orte_config.h"
#include "routed_tree.h"
#include "orte/orte_constants.h"
#include "opal/util/output.h"
#include "orte/orte_constants.h"
#include "orte/mca/routed/base/base.h"
#include "orte/mca/errmgr/errmgr.h"
#include "orte/mca/ns/ns.h"
#include "orte/mca/routed/routed.h"
#include "orte/mca/rmaps/rmaps.h"
#include "orte/mca/grpcomm/grpcomm.h"
#include "orte/mca/odls/odls.h"
#include "orte/mca/smr/smr.h"
#include "orte/mca/rml/base/rml_contact.h"
#include "orte/mca/routed/base/base.h"
#include "routed_tree.h"
int
orte_routed_tree_update_route(orte_process_name_t *target,
@@ -30,7 +35,8 @@ orte_routed_tree_update_route(orte_process_name_t *target,
}
OPAL_OUTPUT_VERBOSE((1, orte_routed_base_output,
"routed_tree_update: [%s] --> [%s]",
"%s routed_tree_update: %s --> %s",
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
ORTE_NAME_PRINT(target),
ORTE_NAME_PRINT(route)));
@@ -102,6 +108,12 @@ orte_routed_tree_get_route(orte_process_name_t *target)
orte_process_name_t ret;
opal_list_item_t *item;
/* if it is me, then the route is just direct */
if (ORTE_EQUAL == orte_dss.compare(ORTE_PROC_MY_NAME, target, ORTE_NAME)) {
ret = *target;
goto found;
}
/* check exact matches */
for (item = opal_list_get_first(&orte_routed_tree_module.peer_list) ;
item != opal_list_get_end(&orte_routed_tree_module.peer_list) ;
@@ -120,10 +132,267 @@ orte_routed_tree_get_route(orte_process_name_t *target)
found:
OPAL_OUTPUT_VERBOSE((1, orte_routed_base_output,
"routed_tree_get([%s]) --> [%s]",
OPAL_OUTPUT_VERBOSE((2, orte_routed_base_output,
"%s routed_tree_get(%s) --> %s",
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
ORTE_NAME_PRINT(target),
ORTE_NAME_PRINT(&ret)));
return ret;
}
int orte_routed_tree_init_routes(orte_jobid_t job, orte_gpr_notify_data_t *ndat)
{
/* the tree module routes all proc communications through
* the local daemon. Daemons must identify which of their
* daemon-peers is "hosting" the specified recipient and
* route the message to that daemon. Daemon contact info
* is handled elsewhere, so all we need to do here is
* ensure that the procs are told to route through their
* local daemon, and that daemons are told how to route
* for each proc
*/
int rc;
/* if I am a daemon or HNP, then I have to extract the routing info for this job
* from the data sent to me for launch and update the routing tables to
* point at the daemon for each proc
*/
if (orte_process_info.daemon || orte_process_info.seed) {
orte_std_cntr_t i, j;
orte_process_name_t daemon, proc;
orte_gpr_value_t **values, *value;
OPAL_OUTPUT_VERBOSE((1, orte_routed_base_output,
"%s routed_tree: init routes for daemon/seed job %ld",
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME), (long)job));
if (0 == job) {
if (NULL == ndat) {
/* if ndat is NULL, then this is being called during init,
* so just seed the routing table with a path back to the HNP...
*/
if (ORTE_SUCCESS != (rc = orte_routed_tree_update_route(ORTE_PROC_MY_HNP,
ORTE_PROC_MY_HNP))) {
ORTE_ERROR_LOG(rc);
return rc;
}
/* ...and register our contact info with the HNP */
if (ORTE_SUCCESS != (rc = orte_rml_base_register_contact_info())) {
ORTE_ERROR_LOG(rc);
return rc;
}
} else {
/* ndat != NULL means we are getting an update of RML info
* for the daemons - so update our contact info and routes
*/
orte_rml_base_contact_info_notify(ndat, NULL);
}
return ORTE_SUCCESS;
}
/* if ndat=NULL, then I can just ignore anything else - this is
* being called because there are places where other routing
* algos need to be called and we don't
*/
if (NULL == ndat) {
OPAL_OUTPUT_VERBOSE((2, orte_routed_base_output,
"%s routed_tree: no routing info provided for daemons",
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME)));
return ORTE_SUCCESS;
}
/* for any other job, extract the contact map from the launch
* message since it contains info from every daemon
*/
OPAL_OUTPUT_VERBOSE((2, orte_routed_base_output,
"%s routed_tree: extract proc routing info",
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME)));
values = (orte_gpr_value_t**)(ndat->values)->addr;
daemon.jobid = 0;
proc.jobid = job;
for (j=0, i=0; i < ndat->cnt && j < (ndat->values)->size; j++) { /* loop through all returned values */
if (NULL != values[j]) {
i++;
value = values[j];
if (NULL != value->tokens) {
/* this came from the globals container, so ignore it */
continue;
}
/* this must have come from one of the process containers, so it must
* contain data for a proc structure - extract what we need
*/
if (ORTE_SUCCESS != (rc = orte_odls.extract_proc_map_info(&daemon, &proc, value))) {
ORTE_ERROR_LOG(rc);
return rc;
}
if (0 != orte_ns.compare_fields(ORTE_NS_CMP_ALL, ORTE_PROC_MY_NAME, &daemon)) {
/* Setup the route to the remote proc via its daemon */
if (ORTE_SUCCESS != (rc = orte_routed_tree_update_route(&proc, &daemon))) {
ORTE_ERROR_LOG(rc);
return rc;
}
} else {
/* setup the route for my own procs as they may not have talked
* to me yet - if they have, this will simply overwrite the existing
* entry, so no harm done
*/
if (ORTE_SUCCESS != (rc = orte_routed_tree_update_route(&proc, &proc))) {
ORTE_ERROR_LOG(rc);
return rc;
}
}
}
}
OPAL_OUTPUT_VERBOSE((2, orte_routed_base_output,
"%s routed_tree: completed init routes",
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME)));
return ORTE_SUCCESS;
}
/* I must be a proc - just setup my route to the local daemon */
{
int id;
char *rml_uri;
if (ORTE_EQUAL == orte_dss.compare(ORTE_NAME_INVALID, &orte_process_info.my_daemon, ORTE_NAME)) {
/* the daemon wasn't previously defined, so look for it */
id = mca_base_param_register_string("orte", "local_daemon", "uri", NULL, NULL);
mca_base_param_lookup_string(id, &rml_uri);
if (NULL == rml_uri) {
/* in this module, we absolutely MUST have this information - if
* we didn't get it, then error out
*/
opal_output(0, "%s ERROR: Failed to identify the local daemon's URI",
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME));
opal_output(0, "%s ERROR: This is a fatal condition when the tree router",
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME));
opal_output(0, "%s ERROR: has been selected - either select the unity router",
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME));
opal_output(0, "%s ERROR: or ensure that the local daemon info is provided",
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME));
return ORTE_ERR_FATAL;
}
/* Set the contact info in the RML - this won't actually establish
* the connection, but just tells the RML how to reach the daemon
* if/when we attempt to send to it
*/
if (ORTE_SUCCESS != (rc = orte_rml.set_contact_info(rml_uri))) {
ORTE_ERROR_LOG(rc);
free(rml_uri);
return(rc);
}
/* extract the daemon's name so we can update the routing table */
if (ORTE_SUCCESS != (rc = orte_rml_base_parse_uris(rml_uri, &orte_process_info.my_daemon, NULL))) {
ORTE_ERROR_LOG(rc);
free(rml_uri);
return rc;
}
free(rml_uri); /* done with this */
}
/* setup the route to all other procs to flow through the daemon */
if (ORTE_SUCCESS != (rc = orte_routed_tree_update_route(ORTE_NAME_WILDCARD, &orte_process_info.my_daemon))) {
ORTE_ERROR_LOG(rc);
return rc;
}
/* set my proc state - this will fire the corresponding trigger so
* everything else in this procedure can happen. Note that it involves
* a communication, which means that a connection to the local daemon
* will be initiated. Thus, the local daemon will subsequently know
* my contact info
*/
if (ORTE_SUCCESS != (rc = orte_smr.set_proc_state(ORTE_PROC_MY_NAME, ORTE_PROC_STATE_AT_STG1, 0))) {
ORTE_ERROR_LOG(rc);
return rc;
}
return ORTE_SUCCESS;
}
}
int orte_routed_tree_warmup_routes(void)
{
orte_std_cntr_t i, j, simultaneous, world_size, istop;
orte_process_name_t next, prev;
struct iovec inmsg[1], outmsg[1];
int ret;
if (orte_process_info.seed) {
/* the HNP does not need to participate as it already has
* a warmed-up connection to every daemon, so just return
*/
return ORTE_SUCCESS;
}
OPAL_OUTPUT_VERBOSE((1, orte_routed_base_output,
"%s routed_tree: warming up daemon wireup for %ld procs",
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME), (long)orte_process_info.num_procs));
world_size = orte_process_info.num_procs;
istop = world_size/2;
simultaneous = 1;
if (world_size < simultaneous) {
simultaneous = world_size;
}
next.jobid = 0;
prev.jobid = 0;
inmsg[0].iov_base = outmsg[0].iov_base = NULL;
inmsg[0].iov_len = outmsg[0].iov_len = 0;
for (i = 1 ; i <= istop ; i += simultaneous) {
#if 0
if (simultaneous > (istop - i)) {
/* only fill in the rest */
simultaneous = istop - i;
}
#endif
/* the HNP does not need to participate as it already has
* a warmed-up connection to every daemon, so we exclude
* vpid=0 from both send and receive
*/
for (j = 0 ; j < simultaneous ; ++j) {
next.vpid = (ORTE_PROC_MY_NAME->vpid + (i + j )) % world_size;
if (next.vpid == 0) {
continue;
}
OPAL_OUTPUT_VERBOSE((1, orte_routed_base_output,
"%s routed_tree: daemon wireup sending to %s",
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME), ORTE_NAME_PRINT(&next)));
/* sends do not wait for a match */
ret = orte_rml.send(&next,
outmsg,
1,
ORTE_RML_TAG_WIREUP,
0);
if (ret < 0) return ret;
}
for (j = 0 ; j < simultaneous ; ++j) {
prev.vpid = (ORTE_PROC_MY_NAME->vpid - (i + j) + world_size) % world_size;
if (prev.vpid == 0) {
continue;
}
OPAL_OUTPUT_VERBOSE((1, orte_routed_base_output,
"%s routed_tree: daemon wireup recving from %s",
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME), ORTE_NAME_PRINT(&prev)));
ret = orte_rml.recv(&prev,
inmsg,
1,
ORTE_RML_TAG_WIREUP,
0);
if (ret < 0) return ret;
}
}
return ORTE_SUCCESS;
}

@@ -47,6 +47,9 @@ int orte_routed_tree_update_route(orte_process_name_t *target,
orte_process_name_t orte_routed_tree_get_route(orte_process_name_t *target);
int orte_routed_tree_init_routes(orte_jobid_t job, orte_gpr_notify_data_t *ndat);
int orte_routed_tree_warmup_routes(void);
END_C_DECLS

@@ -54,9 +54,10 @@ orte_routed_component_t mca_routed_tree_component = {
orte_routed_tree_module_t orte_routed_tree_module = {
{
orte_routed_tree_finalize,
orte_routed_tree_update_route,
orte_routed_tree_get_route
orte_routed_tree_get_route,
orte_routed_tree_init_routes,
orte_routed_tree_warmup_routes
}
};
@@ -66,7 +67,7 @@ OBJ_CLASS_INSTANCE(orte_routed_tree_entry_t, opal_list_item_t, NULL, NULL);
static orte_routed_module_t*
routed_tree_init(int* priority)
{
*priority = 0;
*priority = 5;
OBJ_CONSTRUCT(&orte_routed_tree_module.peer_list, opal_list_t);
OBJ_CONSTRUCT(&orte_routed_tree_module.vpid_wildcard_list, opal_list_t);

@@ -25,6 +25,9 @@ int orte_routed_unity_update_route(orte_process_name_t *target,
orte_process_name_t orte_routed_unity_get_route(orte_process_name_t *target);
int orte_routed_unity_init_routes(orte_jobid_t job, orte_gpr_notify_data_t *ndat);
int orte_routed_unity_warmup_routes(void);
END_C_DECLS

@@ -9,16 +9,24 @@
*/
#include "orte_config.h"
#include "routed_unity.h"
#include "orte/orte_constants.h"
#include "opal/util/output.h"
#include "opal/mca/base/base.h"
#include "opal/mca/base/mca_base_param.h"
#include "orte/orte_constants.h"
#include "orte/mca/routed/base/base.h"
#include "orte/mca/errmgr/errmgr.h"
#include "orte/mca/ns/ns.h"
#include "orte/mca/gpr/gpr.h"
#include "orte/mca/grpcomm/grpcomm.h"
#include "orte/mca/odls/odls_types.h"
#include "orte/mca/smr/smr.h"
#include "orte/mca/rml/base/rml_contact.h"
#include "orte/mca/sds/base/base.h"
#include "orte/mca/routed/base/base.h"
#include "routed_unity.h"
static orte_routed_module_t* routed_unity_init(int* priority);
@@ -53,10 +61,11 @@ orte_routed_component_t mca_routed_unity_component = {
};
orte_routed_module_t orte_routed_unity_module = {
orte_routed_unity_finalize,
orte_routed_unity_update_route,
orte_routed_unity_get_route
orte_routed_unity_finalize,
orte_routed_unity_update_route,
orte_routed_unity_get_route,
orte_routed_unity_init_routes,
orte_routed_unity_warmup_routes
};
static orte_routed_module_t*
@@ -80,7 +89,8 @@ orte_routed_unity_update_route(orte_process_name_t *target,
orte_process_name_t *route)
{
OPAL_OUTPUT_VERBOSE((1, orte_routed_base_output,
"routed_unity_update: %s --> %s",
"%s routed_unity_update: %s --> %s",
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
ORTE_NAME_PRINT(target),
ORTE_NAME_PRINT(route)));
return ORTE_SUCCESS;
@@ -91,8 +101,319 @@ orte_process_name_t
orte_routed_unity_get_route(orte_process_name_t *target)
{
OPAL_OUTPUT_VERBOSE((1, orte_routed_base_output,
"routed_unity_get(%s) --> %s",
"%s routed_unity_get(%s) --> %s",
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
ORTE_NAME_PRINT(target),
ORTE_NAME_PRINT(target)));
return *target;
}
int orte_routed_unity_init_routes(orte_jobid_t job, orte_gpr_notify_data_t *ndata)
{
/* the unity module just sends direct to everyone, so it requires
* that the RML get loaded with contact info from all of our peers.
* We also look for and provide contact info for our local daemon
* so we can use it if needed
*/
int rc;
int id;
orte_buffer_t buf;
orte_std_cntr_t cnt;
char *rml_uri;
orte_gpr_notify_data_t *ndat;
/* if I am a daemon... */
if (orte_process_info.daemon) {
OPAL_OUTPUT_VERBOSE((1, orte_routed_base_output,
"%s routed_unity: init routes for daemon job %ld",
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME), (long)job));
if (0 == job) {
if (NULL == ndata) {
/* if ndata is NULL, then this is being called during init,
* so just register our contact info with the HNP */
if (ORTE_SUCCESS != (rc = orte_rml_base_register_contact_info())) {
ORTE_ERROR_LOG(rc);
return rc;
}
} else {
/* ndata != NULL means we are getting an update of RML info
* for the daemons - so update our contact info and routes so
* that any relayed xcast (e.g., binomial) will be able to
* send messages
*/
orte_rml_base_contact_info_notify(ndata, NULL);
}
}
/* since the daemons in the unity component don't route messages,
* there is nothing for them to do except when the job=0
*/
return ORTE_SUCCESS;
}
/* if I am the HNP, then... */
if (orte_process_info.seed) {
#if 0
orte_proc_t **procs;
orte_job_t *jdata;
#endif
OPAL_OUTPUT_VERBOSE((1, orte_routed_base_output,
"%s routed_unity: init routes for HNP job %ld",
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME), (long)job));
/* if this is for my own job, then ignore it - we handle
* updates of daemon contact info separately, so this
* shouldn't get called during daemon startup. This situation
* would occur, though, when we are doing orte_init within the HNP
* itself, so there really isn't anything to do anyway
*/
if (0 == job) {
/* register our contact info */
if (ORTE_SUCCESS != (rc = orte_rml_base_register_contact_info())) {
ORTE_ERROR_LOG(rc);
return rc;
}
return ORTE_SUCCESS;
}
/* gather up all the RML contact info for the indicated job */
#if 0
/* this code pertains to the revised ORTE */
/* look up the job data for this job */
if (orte_job_data->size < job ||
(NULL == (jdata = (orte_job_t*)orte_job_data->addr[job]))) {
ORTE_ERROR_LOG(ORTE_ERR_BAD_PARAM);
return ORTE_ERR_BAD_PARAM;
}
OBJ_CONSTRUCT(&buf, orte_buffer_t);
/* load in the number of data entries we'll be inserting */
if (ORTE_SUCCESS != (rc = orte_dss.pack(&buf, &jdata->num_procs, 1, ORTE_STD_CNTR))) {
ORTE_ERROR_LOG(rc);
OBJ_DESTRUCT(&buf);
return rc;
}
/* pack the RML contact info for each proc */
procs = (orte_proc_t**)jdata->procs->addr;
for (i=0; i < jdata->num_procs; i++) {
if (NULL == procs[i]) {
ORTE_ERROR_LOG(ORTE_ERR_NOT_FOUND);
OBJ_DESTRUCT(&buf);
return ORTE_ERR_NOT_FOUND;
}
if (ORTE_SUCCESS != (rc = orte_dss.pack(&buf, &procs[i]->rml_uri, 1, ORTE_STRING))) {
ORTE_ERROR_LOG(rc);
OBJ_DESTRUCT(&buf);
return rc;
}
}
#endif
{
orte_process_name_t name;
/* if ndata != NULL, then we can ignore it - some routing algos
* need to call init_routes during launch, but we don't
*/
if (NULL != ndata) {
OPAL_OUTPUT_VERBOSE((2, orte_routed_base_output,
"%s routed_unity: no data to process for HNP",
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME)));
return ORTE_SUCCESS;
}
name.jobid = job;
name.vpid = ORTE_VPID_WILDCARD;
ndat = OBJ_NEW(orte_gpr_notify_data_t);
if (ORTE_SUCCESS != (rc = orte_rml_base_get_contact_info(&name, &ndat))) {
ORTE_ERROR_LOG(rc);
return rc;
}
/* does this job have a parent? */
if (ORTE_SUCCESS != (rc = orte_ns.get_parent_job(&name.jobid, job))) {
ORTE_ERROR_LOG(rc);
return rc;
}
if (name.jobid != job) {
/* yes it does - so get that contact info and send it along as well.
* get_contact_info will simply add to the ndat structure
*/
if (ORTE_SUCCESS != (rc = orte_rml_base_get_contact_info(&name, &ndat))) {
ORTE_ERROR_LOG(rc);
return rc;
}
}
/* have to add in contact info for all daemons since, depending upon
* selected xcast mode, it may be necessary for this proc to send
* directly to a daemon on another node
*/
name.jobid = 0;
name.vpid = ORTE_VPID_WILDCARD;
if (ORTE_SUCCESS != (rc = orte_rml_base_get_contact_info(&name, &ndat))) {
ORTE_ERROR_LOG(rc);
return rc;
}
/* pack the results for transmission */
OBJ_CONSTRUCT(&buf, orte_buffer_t);
if (ORTE_SUCCESS != (rc = orte_dss.pack(&buf, &ndat, 1, ORTE_GPR_NOTIFY_DATA))) {
ORTE_ERROR_LOG(rc);
OBJ_DESTRUCT(&buf);
OBJ_RELEASE(ndat);
return rc;
}
OBJ_RELEASE(ndat);
}
/* send it to all of the procs */
OPAL_OUTPUT_VERBOSE((2, orte_routed_base_output,
"%s routed_unity: xcasting info to procs",
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME)));
if (ORTE_SUCCESS != (rc = orte_grpcomm.xcast(job, &buf, ORTE_RML_TAG_INIT_ROUTES))) {
ORTE_ERROR_LOG(rc);
OBJ_DESTRUCT(&buf);
return rc;
}
OBJ_DESTRUCT(&buf);
return ORTE_SUCCESS;
}
/* guess I am an application process - see if the local daemon's
* contact info is given. We may not always get this in every
* environment, so allow it not to be found.
*/
OPAL_OUTPUT_VERBOSE((1, orte_routed_base_output,
"%s routed_unity: init routes for proc job %ld",
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME), (long)job));
id = mca_base_param_register_string("orte", "local_daemon", "uri", NULL, NULL);
mca_base_param_lookup_string(id, &rml_uri);
if (NULL != rml_uri) {
orte_daemon_cmd_flag_t command=ORTE_DAEMON_WARMUP_LOCAL_CONN;
/* Set the contact info in the RML - this establishes
* the connection so the daemon knows how to reach us.
* We have to do this as any non-direct xcast will come
* via our local daemon - and if it doesn't know how to
* reach us, then it will error out the message
*/
/* set the contact info into the hash table */
if (ORTE_SUCCESS != (rc = orte_rml.set_contact_info(rml_uri))) {
ORTE_ERROR_LOG(rc);
return(rc);
}
/* extract the daemon's name and store it */
if (ORTE_SUCCESS != (rc = orte_rml_base_parse_uris(rml_uri, &orte_process_info.my_daemon, NULL))) {
ORTE_ERROR_LOG(rc);
free(rml_uri);
return rc;
}
free(rml_uri);
/* we need to send a very small message to get the oob to establish
* the connection - the oob will leave the connection "alive"
* thereafter so we can communicate readily
*/
OBJ_CONSTRUCT(&buf, orte_buffer_t);
/* tell the daemon this is a message to warmup the connection */
if (ORTE_SUCCESS != (rc = orte_dss.pack(&buf, &command, 1, ORTE_DAEMON_CMD))) {
ORTE_ERROR_LOG(rc);
OBJ_DESTRUCT(&buf);
return rc;
}
/* do the send - it will be ignored on the far end, so don't worry about
* getting a response
*/
if (0 > orte_rml.send_buffer(&orte_process_info.my_daemon, &buf, ORTE_RML_TAG_DAEMON, 0)) {
ORTE_ERROR_LOG(ORTE_ERR_CONNECTION_FAILED);
OBJ_DESTRUCT(&buf);
return ORTE_ERR_CONNECTION_FAILED;
}
OBJ_DESTRUCT(&buf);
}
/* send our contact info to the HNP */
if (ORTE_SUCCESS != (rc = orte_rml_base_register_contact_info())) {
ORTE_ERROR_LOG(rc);
return rc;
}
/* set my proc state - this will fire the corresponding trigger so I can
* get my contact info back
*/
if (ORTE_SUCCESS != (rc = orte_smr.set_proc_state(ORTE_PROC_MY_NAME, ORTE_PROC_STATE_AT_STG1, 0))) {
ORTE_ERROR_LOG(rc);
return rc;
}
/* now setup a blocking receive and wait right here until we get
* the contact info for all of our peers
*/
OBJ_CONSTRUCT(&buf, orte_buffer_t);
rc = orte_rml.recv_buffer(ORTE_NAME_WILDCARD, &buf, ORTE_RML_TAG_INIT_ROUTES, 0);
if (ORTE_SUCCESS != rc) {
ORTE_ERROR_LOG(rc);
OBJ_DESTRUCT(&buf);
return rc;
}
#if 0
/* this code pertains to the revised ORTE */
/* unpack the number of data entries */
cnt = 1;
if (ORTE_SUCCESS != (rc = orte_dss.unpack(&buf, &num_entries, &cnt, ORTE_STD_CNTR))) {
ORTE_ERROR_LOG(rc);
OBJ_DESTRUCT(&buf);
return rc;
}
opal_output(0, "routed: init_routes proc got %ld entries", (long)num_entries);
/* update the RML with that info */
for (i=0; i < num_entries; i++) {
cnt = 1;
if (ORTE_SUCCESS != (rc = orte_dss.unpack(&buf, &rml_uri, &cnt, ORTE_STRING))) {
ORTE_ERROR_LOG(rc);
OBJ_DESTRUCT(&buf);
return rc;
}
opal_output(0, "routed: init_routes proc got uri %s", rml_uri);
if (ORTE_SUCCESS != (rc = orte_rml.set_contact_info(rml_uri))) {
ORTE_ERROR_LOG(rc);
OBJ_DESTRUCT(&buf);
return(rc);
}
free(rml_uri);
}
#endif
{
ndat = OBJ_NEW(orte_gpr_notify_data_t);
cnt = 1;
if (ORTE_SUCCESS != (rc = orte_dss.unpack(&buf, &ndat, &cnt, ORTE_GPR_NOTIFY_DATA))) {
ORTE_ERROR_LOG(rc);
OBJ_DESTRUCT(&buf);
return rc;
}
orte_rml_base_contact_info_notify(ndat, NULL);
OBJ_RELEASE(ndat);
}
OBJ_DESTRUCT(&buf);
return ORTE_SUCCESS;
}
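For reference, the `#if 0` branch above follows a classic count-then-entries unpack pattern: pull an entry count off the buffer, then loop that many times pulling one URI string per iteration. Stripped of the DSS machinery, the same pattern can be sketched in plain C — the flat buffer layout (a count followed by NUL-terminated strings) and the `decode_uris` helper are hypothetical stand-ins, not ORTE API:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the ORTE_STD_CNTR + ORTE_STRING unpack loop:
 * the buffer holds an entry count followed by that many NUL-terminated
 * URI strings. Returns a malloc'd array of copies, or NULL on error. */
static char **decode_uris(const char *buf, size_t len, int32_t *num_out)
{
    int32_t count;
    if (len < sizeof(count)) return NULL;
    memcpy(&count, buf, sizeof(count));            /* unpack the entry count */
    if (count < 0) return NULL;
    const char *p = buf + sizeof(count);
    const char *end = buf + len;
    char **uris = calloc((size_t)count, sizeof(char *));
    for (int32_t i = 0; i < count; i++) {          /* unpack each URI string */
        const char *nul = memchr(p, '\0', (size_t)(end - p));
        if (NULL == nul) {                         /* truncated buffer - bail */
            for (int32_t j = 0; j < i; j++) free(uris[j]);
            free(uris);
            return NULL;
        }
        size_t n = (size_t)(nul - p);
        uris[i] = malloc(n + 1);
        memcpy(uris[i], p, n + 1);                 /* copy string + its NUL */
        p = nul + 1;
    }
    *num_out = count;
    return uris;
}
```

As in the `#if 0` code, a failure partway through cleans up everything unpacked so far rather than leaking the earlier entries.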
int orte_routed_unity_warmup_routes(void)
{
/* in the unity component, the daemons do not need to warmup their
* connections as they are not used to route messages. Hence, we
* just return success and ignore this call
*/
OPAL_OUTPUT_VERBOSE((1, orte_routed_base_output,
"%s routed_unity: warmup routes",
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME)));
return ORTE_SUCCESS;
}


@ -26,7 +26,6 @@ libmca_sds_la_SOURCES += \
base/sds_base_open.c \
base/sds_base_select.c \
base/sds_base_interface.c \
base/sds_base_orted_contact.c \
base/sds_base_universe.c \
base/sds_base_get.c \
base/sds_base_put.c


@ -39,9 +39,7 @@ int orte_sds_env_get(void)
int num_procs;
int local_rank;
int num_local_procs;
int rc;
int id;
char *local_daemon_uri = NULL;
id = mca_base_param_register_int("ns", "nds", "vpid_start", NULL, -1);
mca_base_param_lookup_int(id, &vpid_start);
@ -74,18 +72,6 @@ int orte_sds_env_get(void)
id = mca_base_param_register_int("ns", "nds", "num_local_procs", NULL, 0);
mca_base_param_lookup_int(id, &num_local_procs);
orte_process_info.num_local_procs = (orte_std_cntr_t)num_local_procs;
id = mca_base_param_register_string("orte", "local_daemon", "uri", NULL, NULL);
mca_base_param_lookup_string(id, &local_daemon_uri);
if (NULL != local_daemon_uri) {
/* if we are a daemon, then we won't have this param set, so allow
* it not to be found
*/
if (ORTE_SUCCESS != (rc = orte_sds_base_contact_orted(local_daemon_uri))) {
ORTE_ERROR_LOG(rc);
return(rc);
}
}
return ORTE_SUCCESS;
}


@ -1,81 +0,0 @@
/*
* Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
* University Research and Technology
* Corporation. All rights reserved.
* Copyright (c) 2004-2005 The University of Tennessee and The University
* of Tennessee Research Foundation. All rights
* reserved.
* Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
* University of Stuttgart. All rights reserved.
* Copyright (c) 2004-2005 The Regents of the University of California.
* All rights reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
#include "orte_config.h"
#include "orte/orte_constants.h"
#ifdef HAVE_SYS_TIME_H
#include <sys/time.h>
#endif
#include "orte/dss/dss.h"
#include "orte/mca/errmgr/errmgr.h"
#include "orte/mca/ns/ns_types.h"
#include "orte/mca/rml/rml.h"
#include "orte/mca/rml/base/rml_contact.h"
#include "orte/mca/odls/odls_types.h"
#include "orte/mca/sds/base/base.h"
int orte_sds_base_contact_orted(char *orted_contact_info)
{
orte_buffer_t buffer;
int rc;
orte_process_name_t orted;
orte_daemon_cmd_flag_t command=ORTE_DAEMON_WARMUP_LOCAL_CONN;
/* set the contact info into the OOB's hash table */
if (ORTE_SUCCESS != (rc = orte_rml.set_contact_info(orted_contact_info))) {
ORTE_ERROR_LOG(rc);
return(rc);
}
/* extract the daemon's name from the uri */
if (ORTE_SUCCESS != (rc = orte_rml_base_parse_uris(orted_contact_info, &orted, NULL))) {
ORTE_ERROR_LOG(rc);
return rc;
}
/* we need to send a very small message to get the oob to establish
* the connection - the oob will leave the connection "alive"
* thereafter so we can communicate readily
*/
OBJ_CONSTRUCT(&buffer, orte_buffer_t);
/* tell the daemon this is a message to warmup the connection */
if (ORTE_SUCCESS != (rc = orte_dss.pack(&buffer, &command, 1, ORTE_DAEMON_CMD))) {
ORTE_ERROR_LOG(rc);
OBJ_DESTRUCT(&buffer);
return rc;
}
/* do the send - it will be ignored on the far end, so don't worry about
* getting a response
*/
if (0 > orte_rml.send_buffer(&orted, &buffer, ORTE_RML_TAG_DAEMON, 0)) {
ORTE_ERROR_LOG(ORTE_ERR_CONNECTION_FAILED);
OBJ_DESTRUCT(&buffer);
return ORTE_ERR_CONNECTION_FAILED;
}
OBJ_DESTRUCT(&buffer);
return ORTE_SUCCESS;
}


@ -39,6 +39,8 @@
#include "orte/mca/errmgr/errmgr.h"
#include "orte/mca/ns/ns.h"
#include "orte/mca/rml/rml.h"
#include "orte/mca/rml/base/rml_contact.h"
#include "orte/mca/routed/routed.h"
#include "orte/runtime/params.h"
#include "orte/runtime/runtime.h"
@ -366,58 +368,88 @@ static int fork_hnp(void)
exit(1);
} else {
/* I am the parent - wait to hear something back and
* report results
*/
close(p[1]); /* parent closes the write - orted will write its contact info to it*/
close(death_pipe[0]); /* parent closes the death_pipe's read */
/* setup the buffer to read the uri */
/* setup the buffer to read the name + uri */
buffer_length = ORTE_URI_MSG_LGTH;
chunk = ORTE_URI_MSG_LGTH-1;
num_chars_read = 0;
orted_uri = (char*)malloc(buffer_length);
while (1) {
while (chunk == (rc = read(p[0], &orted_uri[num_chars_read], chunk))) {
/* we read an entire buffer - better get more */
num_chars_read += chunk;
buffer_length += ORTE_URI_MSG_LGTH;
orted_uri = realloc((void*)orted_uri, buffer_length);
}
num_chars_read += rc;
if (num_chars_read <= 0) {
/* we didn't get anything back - this is bad */
ORTE_ERROR_LOG(ORTE_ERR_HNP_COULD_NOT_START);
free(orted_uri);
return ORTE_ERR_HNP_COULD_NOT_START;
}
/* we got something back - let's hope it was the uri.
* Set the contact info into our RML - it will bark
* if the returned info isn't a uri
*/
if (ORTE_SUCCESS != (rc = orte_rml.set_contact_info(orted_uri))) {
ORTE_ERROR_LOG(rc);
free(orted_uri);
return rc;
}
/* okay, the HNP is now setup. We actually don't need to
* restart ourselves as we haven't really done anything yet.
* So set our name to be [0,1,0] since we know that's what
* it should be, set the HNP info in our globals, and tell
* orte_init that those things are done
*/
orte_universe_info.seed_uri = strdup(orted_uri);
orte_process_info.ns_replica_uri = strdup(orted_uri);
orte_process_info.gpr_replica_uri = strdup(orted_uri);
/* indicate we are a singleton so orte_init knows what to do */
orte_process_info.singleton = true;
/* all done - report success */
free(orted_uri);
return ORTE_SUCCESS;
while (chunk == (rc = read(p[0], &orted_uri[num_chars_read], chunk))) {
/* we read an entire buffer - better get more */
num_chars_read += chunk;
buffer_length += ORTE_URI_MSG_LGTH;
orted_uri = realloc((void*)orted_uri, buffer_length);
}
num_chars_read += rc;
if (num_chars_read <= 0) {
/* we didn't get anything back - this is bad */
ORTE_ERROR_LOG(ORTE_ERR_HNP_COULD_NOT_START);
free(orted_uri);
return ORTE_ERR_HNP_COULD_NOT_START;
}
/* parse the name from the returned info */
if (']' != orted_uri[strlen(orted_uri)-1]) {
ORTE_ERROR_LOG(ORTE_ERR_COMM_FAILURE);
free(orted_uri);
return ORTE_ERR_COMM_FAILURE;
}
orted_uri[strlen(orted_uri)-1] = '\0';
if (NULL == (param = strrchr(orted_uri, '['))) {
ORTE_ERROR_LOG(ORTE_ERR_COMM_FAILURE);
free(orted_uri);
return ORTE_ERR_COMM_FAILURE;
}
*param = '\0'; /* terminate the string */
param++;
if (ORTE_SUCCESS != (rc = orte_ns.convert_string_to_process_name(&orte_process_info.my_name, param))) {
ORTE_ERROR_LOG(rc);
free(orted_uri);
return rc;
}
/* we got something back - let's hope it was the uri.
* Set the contact info into our RML - it will bark
* if the returned info isn't a uri
*/
if (ORTE_SUCCESS != (rc = orte_rml.set_contact_info(orted_uri))) {
ORTE_ERROR_LOG(rc);
free(orted_uri);
return rc;
}
/* extract the name, noting that the HNP is also my local daemon,
* and define the route as direct
*/
if (ORTE_SUCCESS != (rc = orte_rml_base_parse_uris(orted_uri, &orte_process_info.my_daemon, NULL))) {
ORTE_ERROR_LOG(rc);
free(orted_uri);
return rc;
}
if (ORTE_SUCCESS != (rc = orte_routed.update_route(&orte_process_info.my_daemon,
&orte_process_info.my_daemon))) {
ORTE_ERROR_LOG(rc);
free(orted_uri);
return rc;
}
/* okay, the HNP is now setup. We actually don't need to
* restart ourselves as we haven't really done anything yet.
* So set the HNP info in our globals, and tell
* orte_init that those things are done
*/
orte_universe_info.seed_uri = strdup(orted_uri);
orte_process_info.ns_replica_uri = strdup(orted_uri);
orte_process_info.gpr_replica_uri = strdup(orted_uri);
/* indicate we are a singleton so orte_init knows what to do */
orte_process_info.singleton = true;
/* all done - report success */
free(orted_uri);
return ORTE_SUCCESS;
}
#else
/* someone will have to devise a Windows equivalent */
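The name extraction above relies on the orted appending `[name]` to its URI before writing it down the pipe (see the matching `asprintf(&tmp, "%s[%s]", ...)` in orted_main). A standalone sketch of that split, using the same trailing-`]` and `strrchr('[')` checks — the `split_uri_name` helper is illustrative, not part of ORTE:

```c
#include <assert.h>
#include <string.h>

/* Split a "uri[name]" string in place, mirroring the checks in fork_hnp():
 * the last character must be ']' and the name starts after the final '['.
 * Returns 0 on success; *name_out points inside the (modified) input. */
static int split_uri_name(char *msg, char **name_out)
{
    size_t len = strlen(msg);
    if (0 == len || ']' != msg[len - 1]) {
        return -1;                 /* malformed: no trailing ']' */
    }
    msg[len - 1] = '\0';           /* strip the ']' */
    char *param = strrchr(msg, '[');
    if (NULL == param) {
        return -1;                 /* malformed: no '[' */
    }
    *param = '\0';                 /* terminate the uri portion */
    *name_out = param + 1;         /* name follows the '[' */
    return 0;
}
```

On success the original buffer holds just the URI (ready for `set_contact_info`-style use) and `*name_out` holds the name string (ready for conversion to a process name).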


@ -42,9 +42,14 @@ orte_sds_singleton_set_name(void)
{
int rc;
if (ORTE_SUCCESS != (rc = orte_ns.create_my_name())) {
ORTE_ERROR_LOG(rc);
return rc;
/*
* If we do not have a name at this point, then ask for one.
*/
if( NULL == ORTE_PROC_MY_NAME ) {
if (ORTE_SUCCESS != (rc = orte_ns.create_my_name())) {
ORTE_ERROR_LOG(rc);
return rc;
}
}
orte_process_info.num_procs = 1;


@ -65,6 +65,7 @@ orte_smr_base_module_t orte_smr = {
orte_smr_base_init_job_stage_gates,
orte_smr_base_define_alert_monitor,
orte_smr_base_job_stage_gate_subscribe,
orte_smr_base_register_sync,
orte_smr_base_module_finalize_not_available
};


@ -25,11 +25,12 @@
#include <string.h>
#include "orte/util/proc_info.h"
#include "orte/mca/schema/schema.h"
#include "orte/mca/errmgr/errmgr.h"
#include "orte/mca/gpr/gpr.h"
#include "orte/mca/ns/ns.h"
#include "orte/mca/odls/odls_types.h"
#include "orte/mca/smr/base/smr_private.h"
@ -147,27 +148,6 @@ int orte_smr_base_set_proc_state(orte_process_name_t *proc,
}
break;
case ORTE_PROC_STATE_AT_STG2:
if (ORTE_SUCCESS != (rc = orte_gpr.create_keyval(&(value->keyvals[0]), ORTE_PROC_NUM_AT_STG2, ORTE_UNDEF, NULL))) {
ORTE_ERROR_LOG(rc);
goto cleanup;
}
break;
case ORTE_PROC_STATE_AT_STG3:
if (ORTE_SUCCESS != (rc = orte_gpr.create_keyval(&(value->keyvals[0]), ORTE_PROC_NUM_AT_STG3, ORTE_UNDEF, NULL))) {
ORTE_ERROR_LOG(rc);
goto cleanup;
}
break;
case ORTE_PROC_STATE_FINALIZED:
if (ORTE_SUCCESS != (rc = orte_gpr.create_keyval(&(value->keyvals[0]), ORTE_PROC_NUM_FINALIZED, ORTE_UNDEF, NULL))) {
ORTE_ERROR_LOG(rc);
goto cleanup;
}
break;
case ORTE_PROC_STATE_TERMINATED:
if (ORTE_SUCCESS != (rc = orte_gpr.create_keyval(&(value->keyvals[0]), ORTE_PROC_NUM_TERMINATED, ORTE_UNDEF, NULL))) {
ORTE_ERROR_LOG(rc);
@ -213,3 +193,47 @@ cleanup:
return rc;
}
int orte_smr_base_register_sync(void)
{
orte_buffer_t buffer, ack;
int rc;
orte_daemon_cmd_flag_t command=ORTE_DAEMON_SYNC_BY_PROC;
/* we need to send a very small message to get the oob to establish
* the connection - the oob will leave the connection "alive"
* thereafter so we can communicate readily
*/
OBJ_CONSTRUCT(&buffer, orte_buffer_t);
/* tell the daemon to sync */
if (ORTE_SUCCESS != (rc = orte_dss.pack(&buffer, &command, 1, ORTE_DAEMON_CMD))) {
ORTE_ERROR_LOG(rc);
OBJ_DESTRUCT(&buffer);
return rc;
}
/* send the sync command to our daemon */
if (0 > (rc = orte_rml.send_buffer(&orte_process_info.my_daemon, &buffer, ORTE_RML_TAG_DAEMON, 0))) {
ORTE_ERROR_LOG(rc);
OBJ_DESTRUCT(&buffer);
return rc;
}
OBJ_DESTRUCT(&buffer);
/* get the ack - need this to ensure that the sync communication
* gets serviced by the event library on the orted prior to the
* process exiting
*/
OBJ_CONSTRUCT(&ack, orte_buffer_t);
if (0 > orte_rml.recv_buffer(&orte_process_info.my_daemon, &ack, ORTE_RML_TAG_SYNC, 0)) {
ORTE_ERROR_LOG(ORTE_ERR_COMM_FAILURE);
OBJ_DESTRUCT(&ack);
return ORTE_ERR_COMM_FAILURE;
}
OBJ_DESTRUCT(&ack);
return ORTE_SUCCESS;
}
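`register_sync` deliberately blocks on an ack so the orted's event library has serviced the sync before the proc can exit. The same send-then-wait-for-ack shape can be sketched with plain pipes in place of the RML — the single-byte command/ack protocol and both helpers are hypothetical:

```c
#include <assert.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical one-byte stand-in for the RML sync/ack exchange:
 * write the sync command, then block until the peer acknowledges.
 * Returns the ack byte, or -1 on error. */
static int sync_with_daemon(int to_daemon, int from_daemon)
{
    char cmd = 'S';                      /* ORTE_DAEMON_SYNC_BY_PROC stand-in */
    if (1 != write(to_daemon, &cmd, 1)) return -1;
    char ack = 0;
    if (1 != read(from_daemon, &ack, 1)) return -1;  /* block until serviced */
    return ack;
}

/* Toy "daemon": service one sync request, then ack it. */
static int run_toy_daemon(int from_proc, int to_proc)
{
    char cmd = 0;
    if (1 != read(from_proc, &cmd, 1) || 'S' != cmd) return -1;
    char ack = 'A';
    return (1 == write(to_proc, &ack, 1)) ? 0 : -1;
}
```

The blocking read is the point: without it, the proc could exit while its sync message still sits unserviced in the daemon's queue.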


@ -64,10 +64,7 @@ int orte_smr_base_init_job_stage_gates(orte_jobid_t job,
ORTE_PROC_NUM_TERMINATED,
/* the following stage gates need data routed through them */
ORTE_PROC_NUM_AT_ORTE_STARTUP,
ORTE_PROC_NUM_AT_STG1,
ORTE_PROC_NUM_AT_STG2,
ORTE_PROC_NUM_AT_STG3,
ORTE_PROC_NUM_FINALIZED
ORTE_PROC_NUM_AT_STG1
};
char* trig_names[] = {
/* this ordering needs to be identical to that in the array above! */
@ -78,9 +75,6 @@ int orte_smr_base_init_job_stage_gates(orte_jobid_t job,
/* the following triggers need data routed through them */
ORTE_STARTUP_TRIGGER,
ORTE_STG1_TRIGGER,
ORTE_STG2_TRIGGER,
ORTE_STG3_TRIGGER,
ORTE_ALL_FINALIZED_TRIGGER
};
@ -293,9 +287,6 @@ int orte_smr_base_job_stage_gate_subscribe(orte_jobid_t job,
ORTE_PROC_STATE_LAUNCHED,
ORTE_PROC_STATE_RUNNING,
ORTE_PROC_STATE_AT_STG1,
ORTE_PROC_STATE_AT_STG2,
ORTE_PROC_STATE_AT_STG3,
ORTE_PROC_STATE_FINALIZED,
ORTE_PROC_STATE_TERMINATED
};
char* keys[] = {
@ -304,9 +295,6 @@ int orte_smr_base_job_stage_gate_subscribe(orte_jobid_t job,
ORTE_PROC_NUM_LAUNCHED,
ORTE_PROC_NUM_RUNNING,
ORTE_PROC_NUM_AT_STG1,
ORTE_PROC_NUM_AT_STG2,
ORTE_PROC_NUM_AT_STG3,
ORTE_PROC_NUM_FINALIZED,
ORTE_PROC_NUM_TERMINATED
};
char* trig_names[] = {
@ -315,9 +303,6 @@ int orte_smr_base_job_stage_gate_subscribe(orte_jobid_t job,
ORTE_ALL_LAUNCHED_TRIGGER,
ORTE_ALL_RUNNING_TRIGGER,
ORTE_STG1_TRIGGER,
ORTE_STG2_TRIGGER,
ORTE_STG3_TRIGGER,
ORTE_ALL_FINALIZED_TRIGGER,
ORTE_ALL_TERMINATED_TRIGGER
};
char* tokens[] = {


@ -94,6 +94,7 @@ int orte_smr_base_begin_monitoring_not_available(orte_job_map_t *map,
orte_gpr_trigger_cb_fn_t cbfunc,
void *user_tag);
int orte_smr_base_register_sync(void);
int orte_smr_base_module_finalize_not_available (void);


@ -405,7 +405,6 @@ int orte_smr_bproc_begin_monitoring(orte_job_map_t *map, orte_gpr_trigger_cb_fn_
node = (orte_mapped_node_t*)item;
newnode = OBJ_NEW(orte_smr_node_state_tracker_t);
newnode->cell = node->cell;
newnode->nodename = strdup(node->nodename);
opal_list_append(&active_node_list, &newnode->super);
}


@ -84,6 +84,11 @@ typedef int (*orte_smr_base_module_get_job_state_fn_t)(orte_job_state_t *state,
typedef int (*orte_smr_base_module_set_job_state_fn_t)(orte_jobid_t jobid,
orte_job_state_t state);
/*
* Register a sync request
*/
typedef int (*orte_smr_base_module_register_sync_fn_t)(void);
/*
* Define the job-specific standard stage gates
* This function creates all of the ORTE-standard stage gates.
@ -176,6 +181,9 @@ struct orte_smr_base_module_1_3_0_t {
orte_smr_base_module_job_stage_gate_init_fn_t init_job_stage_gates;
orte_smr_base_module_define_alert_monitor_fn_t define_alert_monitor;
orte_smr_base_module_job_stage_gate_subscribe_fn_t job_stage_gate_subscribe;
/* REGISTER FUNCTION */
orte_smr_base_module_register_sync_fn_t register_sync;
/* FINALIZE */
orte_smr_base_module_finalize_fn_t finalize;
};


@ -44,10 +44,7 @@ typedef uint16_t orte_proc_state_t;
#define ORTE_PROC_STATE_INIT 0x0001 /* process entry has been created by rmaps */
#define ORTE_PROC_STATE_LAUNCHED 0x0002 /* process has been launched by pls */
#define ORTE_PROC_STATE_AT_STG1 0x0004 /* process is at Stage Gate 1 barrier in orte_init */
#define ORTE_PROC_STATE_AT_STG2 0x0008 /* process is at Stage Gate 2 barrier in orte_init */
#define ORTE_PROC_STATE_RUNNING 0x0010 /* process has exited orte_init and is running */
#define ORTE_PROC_STATE_AT_STG3 0x0020 /* process is at Stage Gate 3 barrier in orte_finalize */
#define ORTE_PROC_STATE_FINALIZED 0x0040 /* process has completed orte_finalize and is running */
#define ORTE_PROC_STATE_TERMINATED 0x0080 /* process has terminated and is no longer running */
#define ORTE_PROC_STATE_ABORTED 0x0100 /* process aborted */
#define ORTE_PROC_STATE_FAILED_TO_START 0x0200 /* process failed to start */
@ -58,7 +55,7 @@ typedef uint16_t orte_proc_state_t;
/** define some common shorthands for when we want to be alerted */
#define ORTE_PROC_STATE_ALL 0xffff /* alert on ALL triggers */
#define ORTE_PROC_STAGE_GATES_ONLY ORTE_PROC_STATE_AT_STG1 | ORTE_PROC_STATE_AT_STG2 | ORTE_PROC_STATE_AT_STG3 | ORTE_PROC_STATE_FINALIZED
#define ORTE_PROC_STAGE_GATES_ONLY ORTE_PROC_STATE_AT_STG1
#define ORTE_PROC_STATE_NONE 0x0000 /* don't alert on any triggers */
/*
@ -71,10 +68,7 @@ typedef uint16_t orte_job_state_t;
#define ORTE_JOB_STATE_INIT 0x0001 /* job entry has been created by rmaps */
#define ORTE_JOB_STATE_LAUNCHED 0x0002 /* job has been launched by pls */
#define ORTE_JOB_STATE_AT_STG1 0x0004 /* all processes are at Stage Gate 1 barrier in orte_init */
#define ORTE_JOB_STATE_AT_STG2 0x0008 /* all processes are at Stage Gate 2 barrier in orte_init */
#define ORTE_JOB_STATE_RUNNING 0x0010 /* all processes have exited orte_init and are running */
#define ORTE_JOB_STATE_AT_STG3 0x0020 /* all processes are at Stage Gate 3 barrier in orte_finalize */
#define ORTE_JOB_STATE_FINALIZED 0x0040 /* all processes have completed orte_finalize and is running */
#define ORTE_JOB_STATE_TERMINATED 0x0080 /* all processes have terminated and are no longer running */
#define ORTE_JOB_STATE_ABORTED 0x0100 /* at least one process aborted, causing job to abort */
#define ORTE_JOB_STATE_FAILED_TO_START 0x0200 /* at least one process failed to start */
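The surviving state flags are still combined and tested bitwise; with the STG2/STG3/FINALIZED states removed, `ORTE_PROC_STAGE_GATES_ONLY` collapses to just `ORTE_PROC_STATE_AT_STG1`. A quick illustration using the values above (`state_triggers_alert` is a hypothetical helper, not ORTE API):

```c
#include <assert.h>
#include <stdint.h>

typedef uint16_t orte_proc_state_t;

/* Surviving process-state flags from the header above */
#define ORTE_PROC_STATE_INIT        0x0001
#define ORTE_PROC_STATE_LAUNCHED    0x0002
#define ORTE_PROC_STATE_AT_STG1     0x0004
#define ORTE_PROC_STATE_RUNNING     0x0010
#define ORTE_PROC_STATE_TERMINATED  0x0080
#define ORTE_PROC_STATE_ALL         0xffff
#define ORTE_PROC_STAGE_GATES_ONLY  ORTE_PROC_STATE_AT_STG1

/* Does this state fall within the requested alert mask? */
static int state_triggers_alert(orte_proc_state_t state, orte_proc_state_t mask)
{
    return 0 != (state & mask);
}
```

Because each state occupies its own bit, a caller interested only in stage gates now gets alerted for exactly one transition per proc instead of four.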


@ -47,11 +47,6 @@ ORTE_DECLSPEC void orte_daemon_recv_routed(int status, orte_process_name_t* send
orte_buffer_t *buffer, orte_rml_tag_t tag,
void* cbdata);
ORTE_DECLSPEC void orte_daemon_recv_gate(int status, orte_process_name_t* sender,
orte_buffer_t *buffer, orte_rml_tag_t tag,
void* cbdata);
#if defined(c_plusplus) || defined(__cplusplus)
}
#endif


@ -77,6 +77,7 @@
#include "orte/mca/rmgr/base/rmgr_private.h"
#include "orte/mca/odls/odls.h"
#include "orte/mca/pls/pls.h"
#include "orte/mca/routed/routed.h"
#include "orte/runtime/runtime.h"
@ -87,6 +88,7 @@
/*
* Globals
*/
static bool warmup_routes;
static int binomial_route_msg(orte_process_name_t *sender,
orte_buffer_t *buf,
@ -115,6 +117,9 @@ void orte_daemon_recv_routed(int status, orte_process_name_t* sender,
ORTE_NAME_PRINT(sender));
}
/* init the warmup routes flag */
warmup_routes = false;
/* unpack the routing algorithm */
n = 1;
if (ORTE_SUCCESS != (ret = orte_dss.unpack(buffer, &routing_mode, &n, ORTE_DAEMON_CMD))) {
@ -136,6 +141,13 @@ void orte_daemon_recv_routed(int status, orte_process_name_t* sender,
}
CLEANUP:
/* see if we need to warmup any daemon-to-daemon routes */
if (warmup_routes) {
if (ORTE_SUCCESS != (ret = orte_routed.warmup_routes())) {
ORTE_ERROR_LOG(ret);
}
}
OPAL_THREAD_UNLOCK(&orted_comm_mutex);
/* reissue the non-blocking receive */
@ -162,11 +174,21 @@ void orte_daemon_recv(int status, orte_process_name_t* sender,
ORTE_NAME_PRINT(sender));
}
/* init the warmup routes flag */
warmup_routes = false;
/* process the command */
if (ORTE_SUCCESS != (ret = process_commands(sender, buffer, tag))) {
ORTE_ERROR_LOG(ret);
}
/* see if we need to warmup any daemon-to-daemon routes */
if (warmup_routes) {
if (ORTE_SUCCESS != (ret = orte_routed.warmup_routes())) {
ORTE_ERROR_LOG(ret);
}
}
OPAL_THREAD_UNLOCK(&orted_comm_mutex);
/* reissue the non-blocking receive */
@ -177,40 +199,6 @@ void orte_daemon_recv(int status, orte_process_name_t* sender,
}
}
void orte_daemon_recv_gate(int status, orte_process_name_t* sender,
orte_buffer_t *buffer, orte_rml_tag_t tag,
void* cbdata)
{
int rc;
orte_std_cntr_t i;
orte_gpr_notify_message_t *mesg;
mesg = OBJ_NEW(orte_gpr_notify_message_t);
if (NULL == mesg) {
ORTE_ERROR_LOG(ORTE_ERR_OUT_OF_RESOURCE);
return;
}
i=1;
if (ORTE_SUCCESS != (rc = orte_dss.unpack(buffer, &mesg, &i, ORTE_GPR_NOTIFY_MSG))) {
ORTE_ERROR_LOG(rc);
OBJ_RELEASE(mesg);
return;
}
if (ORTE_SUCCESS != (rc = orte_gpr.deliver_notify_msg(mesg))) {
ORTE_ERROR_LOG(rc);
}
OBJ_RELEASE(mesg);
/* reissue the non-blocking receive */
rc = orte_rml.recv_buffer_nb(ORTE_NAME_WILDCARD, ORTE_RML_TAG_XCAST_BARRIER,
ORTE_RML_NON_PERSISTENT, orte_daemon_recv_gate, NULL);
if (rc != ORTE_SUCCESS && rc != ORTE_ERR_NOT_IMPLEMENTED) {
ORTE_ERROR_LOG(rc);
}
}
static int process_commands(orte_process_name_t* sender,
orte_buffer_t *buffer,
orte_rml_tag_t tag)
@ -229,8 +217,6 @@ static int process_commands(orte_process_name_t* sender,
char *contact_info;
orte_buffer_t *answer;
orte_rml_cmd_flag_t rml_cmd;
orte_gpr_notify_message_t *mesg;
char *unpack_ptr;
/* unpack the command */
n = 1;
@ -365,41 +351,15 @@ static int process_commands(orte_process_name_t* sender,
* of the current recv.
*
* The problem here is that, for messages where we need to relay
* them along the orted chain, the xcast_barrier and rml_update
* messages contain contact info we may well need in order to do
* them along the orted chain, the rml_update
* message contains contact info we may well need in order to do
* the relay! So we need to process those messages immediately.
* The only way to accomplish that is to (a) detect that the
* buffer is intended for those tags, and then (b) process
* those buffers here.
*
* NOTE: in the case of xcast_barrier, we *also* must send the
* message along anyway so that it will release us from the
* barrier! So we will process that info twice - can't be helped
* and won't harm anything
*/
if (ORTE_RML_TAG_XCAST_BARRIER == target_tag) {
/* need to preserve the relay buffer's pointers so it can be
* unpacked again at the barrier
*/
unpack_ptr = relay->unpack_ptr;
mesg = OBJ_NEW(orte_gpr_notify_message_t);
n = 1;
if (ORTE_SUCCESS != (ret = orte_dss.unpack(relay, &mesg, &n, ORTE_GPR_NOTIFY_MSG))) {
ORTE_ERROR_LOG(ret);
OBJ_RELEASE(mesg);
goto CLEANUP;
}
orte_gpr.deliver_notify_msg(mesg);
OBJ_RELEASE(mesg);
/* restore the unpack ptr in the buffer */
relay->unpack_ptr = unpack_ptr;
/* make sure we queue this up for later delivery to release us from the barrier */
if ((ret = orte_rml.send_buffer(ORTE_PROC_MY_NAME, relay, target_tag, 0)) < 0) {
ORTE_ERROR_LOG(ret);
} else {
ret = ORTE_SUCCESS;
}
} else if (ORTE_RML_TAG_RML_INFO_UPDATE == target_tag) {
if (ORTE_RML_TAG_RML_INFO_UPDATE == target_tag) {
n = 1;
if (ORTE_SUCCESS != (ret = orte_dss.unpack(relay, &rml_cmd, &n, ORTE_RML_CMD))) {
ORTE_ERROR_LOG(ret);
@ -409,7 +369,17 @@ static int process_commands(orte_process_name_t* sender,
ORTE_ERROR_LOG(ret);
goto CLEANUP;
}
orte_rml_base_contact_info_notify(ndat, NULL);
/* initialize the routes to my peers */
if (ORTE_SUCCESS != (ret = orte_routed.init_routes(0, ndat))) {
ORTE_ERROR_LOG(ret);
goto CLEANUP;
}
/* set the warmup flag so we can warmup the routes between all
* daemons, as required by the routed framework. We have to set
* the flag here, but do the actual warmup later, to avoid blocking
* any relayed xcast (e.g., binomial)
*/
warmup_routes = true;
} else {
/* just deliver it to ourselves */
if ((ret = orte_rml.send_buffer(ORTE_PROC_MY_NAME, relay, target_tag, 0)) < 0) {
@ -539,6 +509,19 @@ static int process_commands(orte_process_name_t* sender,
ret = ORTE_SUCCESS;
break;
/**** SYNC FROM LOCAL PROC ****/
case ORTE_DAEMON_SYNC_BY_PROC:
if (orte_debug_daemons_flag) {
opal_output(0, "%s orted_recv: received sync from local proc %s",
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
ORTE_NAME_PRINT(sender));
}
if (ORTE_SUCCESS != (ret = orte_odls.require_sync(sender))) {
ORTE_ERROR_LOG(ret);
goto CLEANUP;
}
break;
default:
ORTE_ERROR_LOG(ORTE_ERR_BAD_PARAM);
ret = ORTE_ERR_BAD_PARAM;
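The `warmup_routes` flag used above is a defer-until-after-processing pattern: the command handler only records that warmup is needed, and the actual work happens once, after the relay has gone out, so it cannot block a relayed xcast. A generic sketch of the pattern (all names hypothetical):

```c
#include <assert.h>
#include <stdbool.h>

static bool warmup_needed;   /* set during command processing */
static int  warmups_done;    /* counts actual warmup work performed */

/* Command handler: may request a warmup but never performs it inline,
 * so it cannot block anything the handler still has to relay. */
static void process_command(int cmd)
{
    if (1 == cmd) {          /* stand-in for an RML_INFO_UPDATE command */
        warmup_needed = true;
    }
}

/* Receive-loop body: process everything first, then warm up at most once. */
static void handle_commands(const int *cmds, int n)
{
    warmup_needed = false;   /* init the flag, as in orte_daemon_recv */
    for (int i = 0; i < n; i++) {
        process_command(cmds[i]);
    }
    if (warmup_needed) {     /* deferred work runs exactly once per batch */
        warmups_done++;
    }
}
```

In the orted the same reset/set/check sequence happens under `orted_comm_mutex`, with `orte_routed.warmup_routes()` as the deferred work.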


@ -377,7 +377,7 @@ int orte_daemon(int argc, char *argv[])
* up incorrect infrastructure that only a singleton would
* require.
*/
if (ORTE_SUCCESS != (ret = orte_init_stage1(ORTE_INFRASTRUCTURE))) {
if (ORTE_SUCCESS != (ret = orte_init(ORTE_INFRASTRUCTURE))) {
ORTE_ERROR_LOG(ret);
return ret;
}
@ -413,33 +413,6 @@ int orte_daemon(int argc, char *argv[])
OBJ_RELEASE(buffer);
return ret;
}
/* Begin recording registry actions */
if (ORTE_SUCCESS != (ret = orte_gpr.begin_compound_cmd(buffer))) {
ORTE_ERROR_LOG(ret);
OBJ_RELEASE(buffer);
return ret;
}
}
/* tell orte_init that we don't want any subscriptions registered by passing
* a NULL trigger name
*/
if (ORTE_SUCCESS != (ret = orte_init_stage2(NULL))) {
ORTE_ERROR_LOG(ret);
OBJ_RELEASE(buffer);
return ret;
}
/* if we aren't a seed, then we need to stop the compound_cmd mode here so
* that other subsystems can use it
*/
if (!orte_process_info.seed) {
if (ORTE_SUCCESS != (ret = orte_gpr.stop_compound_cmd())) {
ORTE_ERROR_LOG(ret);
OBJ_RELEASE(buffer);
return ret;
}
}
/* Set signal handlers to catch kill signals so we can properly clean up
@ -462,13 +435,6 @@ int orte_daemon(int argc, char *argv[])
opal_signal_add(&sigusr2_handler, NULL);
#endif /* __WINDOWS__ */
/* if requested, report my uri to the indicated pipe */
if (orted_globals.uri_pipe > 0) {
write(orted_globals.uri_pipe, orte_universe_info.seed_uri,
strlen(orte_universe_info.seed_uri)+1); /* need to add 1 to get the NULL */
close(orted_globals.uri_pipe);
}
/* setup stdout/stderr */
if (orte_debug_daemons_file_flag) {
/* if we are debugging to a file, then send stdout/stderr to
@ -519,21 +485,40 @@ int orte_daemon(int argc, char *argv[])
/* a daemon should *always* yield the processor when idle */
opal_progress_set_yield_when_idle(true);
/* setup to listen for xcast stage gate commands. We need to do this because updates to the
* contact info for dynamically spawned daemons will come to the gate RML-tag
*/
ret = orte_rml.recv_buffer_nb(ORTE_NAME_WILDCARD, ORTE_RML_TAG_XCAST_BARRIER,
ORTE_RML_NON_PERSISTENT, orte_daemon_recv_gate, NULL);
if (ret != ORTE_SUCCESS && ret != ORTE_ERR_NOT_IMPLEMENTED) {
ORTE_ERROR_LOG(ret);
OBJ_RELEASE(buffer);
return ret;
}
/* if requested, report my uri to the indicated pipe */
/* if requested, obtain and report a new process name and my uri to the indicated pipe */
if (orted_globals.uri_pipe > 0) {
write(orted_globals.uri_pipe, orte_universe_info.seed_uri,
strlen(orte_universe_info.seed_uri)+1); /* need to add 1 to get the NULL */
orte_process_name_t name;
char *tmp, *nptr;
orte_app_context_t *app;
/* setup the singleton's job */
orte_ns.create_jobid(&name.jobid, NULL);
orte_ns.reserve_range(name.jobid, 1, &name.vpid);
app = OBJ_NEW(orte_app_context_t);
app->app = strdup("singleton");
app->num_procs = 1;
if (ORTE_SUCCESS !=
(ret = orte_rmgr_base_put_app_context(name.jobid, &app, 1))) {
ORTE_ERROR_LOG(ret);
return ret;
}
OBJ_RELEASE(app);
/* setup stage gates for singleton */
orte_rmgr_base_proc_stage_gate_init(name.jobid);
/* create a string that contains our uri + the singleton's name */
orte_ns.get_proc_name_string(&nptr, &name);
asprintf(&tmp, "%s[%s]", orte_universe_info.seed_uri, nptr);
free(nptr);
/* pass that info to the singleton */
write(orted_globals.uri_pipe, tmp, strlen(tmp)+1); /* need to add 1 to get the NULL */
/* cleanup */
free(tmp);
}
/* if we were given a pipe to monitor for singleton termination, set that up */


@ -25,8 +25,6 @@ headers += \
runtime/orte_wait.h \
runtime/orte_wakeup.h \
runtime/runtime.h \
runtime/runtime_internal.h \
runtime/runtime_types.h \
runtime/params.h \
runtime/orte_cr.h
@ -35,12 +33,8 @@ libopen_rte_la_SOURCES += \
runtime/orte_finalize.c \
runtime/orte_init.c \
runtime/orte_params.c \
runtime/orte_init_stage1.c \
runtime/orte_init_stage2.c \
runtime/orte_monitor.c \
runtime/orte_restart.c \
runtime/orte_system_finalize.c \
runtime/orte_system_init.c \
runtime/orte_universe_exists.c \
runtime/orte_wait.c \
runtime/orte_wakeup.c \


@ -68,6 +68,8 @@
#include "orte/mca/schema/base/base.h"
#include "orte/mca/rmgr/rmgr.h"
#include "orte/mca/rmgr/base/base.h"
#include "orte/mca/routed/base/base.h"
#include "orte/mca/routed/routed.h"
#include "orte/mca/rml/rml.h"
#include "orte/mca/rml/base/base.h"
#include "orte/mca/iof/iof.h"
@ -490,6 +492,14 @@ static int orte_cr_coord_post_restart(void) {
exit_status = ret;
}
/*
* Re-enable communication through the RML
*/
if (ORTE_SUCCESS != (ret = orte_rml.enable_comm())) {
exit_status = ret;
goto cleanup;
}
/*
* Notify NS
*/
@ -514,6 +524,14 @@ static int orte_cr_coord_post_restart(void) {
goto cleanup;
}
/*
* Re-exchange the routes
*/
if (ORTE_SUCCESS != (ret = orte_routed.init_routes(ORTE_PROC_MY_NAME->jobid, NULL))) {
exit_status = ret;
goto cleanup;
}
/*
* Send new PID to GPR
* The checkpointer could have used a proxy program to boot us


@ -1,15 +1,14 @@
/*
* Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
* Copyright (c) 2004-2007 The Trustees of Indiana University and Indiana
* University Research and Technology
* Corporation. All rights reserved.
* Copyright (c) 2004-2005 The University of Tennessee and The University
* Copyright (c) 2004-2006 The University of Tennessee and The University
* of Tennessee Research Foundation. All rights
* reserved.
* Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
* University of Stuttgart. All rights reserved.
* Copyright (c) 2004-2005 The Regents of the University of California.
* All rights reserved.
* Copyright (c) 2007 Cisco Systems, Inc. All rights reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
@ -22,9 +21,40 @@
#include "orte_config.h"
#include "orte/orte_constants.h"
#include "orte/runtime/params.h"
#include "opal/event/event.h"
#include "opal/util/if.h"
#include "opal/util/os_path.h"
#include "opal/runtime/opal.h"
#include "orte/runtime/orte_cr.h"
#include "orte/runtime/runtime.h"
#include "orte/runtime/params.h"
#include "orte/runtime/orte_wait.h"
#include "orte/mca/rml/base/base.h"
#include "orte/mca/routed/base/base.h"
#include "orte/dss/dss.h"
#include "orte/mca/ns/base/base.h"
#include "orte/mca/gpr/base/base.h"
#include "orte/mca/errmgr/base/base.h"
#include "orte/mca/grpcomm/base/base.h"
#include "orte/mca/rds/base/base.h"
#include "orte/mca/ras/base/base.h"
#include "orte/mca/rmaps/base/base.h"
#include "orte/mca/pls/base/base.h"
#include "orte/mca/schema/base/base.h"
#include "orte/mca/iof/base/base.h"
#include "orte/mca/rmgr/base/base.h"
#include "orte/mca/filem/base/base.h"
#if OPAL_ENABLE_FT == 1
#include "orte/mca/snapc/base/base.h"
#endif
#include "orte/mca/odls/base/base.h"
#include "orte/util/session_dir.h"
#include "orte/util/sys_info.h"
#include "orte/util/proc_info.h"
#include "orte/util/univ_info.h"
/**
* Leave ORTE.
@ -36,20 +66,70 @@
*/
int orte_finalize(void)
{
char *contact_path;
if (!orte_initialized) {
return ORTE_SUCCESS;
}
/* We have now entered the finalization stage */
orte_universe_info.state = ORTE_UNIVERSE_STATE_FINALIZE;
/* if I'm the seed, remove the universe contact info file */
if (orte_process_info.seed) {
contact_path = opal_os_path(false, orte_process_info.universe_session_dir,
"universe-setup.txt", NULL);
unlink(contact_path);
free(contact_path);
}
orte_cr_finalize();
/* finalize the orte system */
orte_system_finalize();
/* rmgr and odls close depend on wait/iof */
orte_rmgr_base_close();
#if OPAL_ENABLE_FT == 1
orte_snapc_base_close();
#endif
orte_filem_base_close();
orte_odls_base_close();
orte_wait_finalize();
orte_iof_base_close();
orte_ns_base_close();
orte_gpr_base_close();
orte_schema_base_close();
/* finalize selected modules so they can de-register
* their receives
*/
orte_rds_base_close();
orte_ras_base_close();
orte_rmaps_base_close();
orte_pls_base_close();
/* the errmgr close function retains the base
* module so that error logging can continue
*/
orte_errmgr_base_close();
/* now can close the rml and its friendly group comm */
orte_grpcomm_base_close();
orte_rml_base_close();
orte_routed_base_close();
orte_dss_close();
orte_session_dir_finalize(orte_process_info.my_name);
/* clean out the global structures */
orte_sys_info_finalize();
orte_proc_info_finalize();
orte_univ_info_finalize();
/* finalize the opal utilities */
opal_finalize();
orte_initialized = false;
return ORTE_SUCCESS;
}


@ -5,15 +5,18 @@
* Copyright (c) 2004-2005 The University of Tennessee and The University
* of Tennessee Research Foundation. All rights
* reserved.
* Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
* University of Stuttgart. All rights reserved.
* Copyright (c) 2004-2005 The Regents of the University of California.
* All rights reserved.
* Copyright (c) 2006 Los Alamos National Security, LLC. All rights
* reserved.
* Copyright (c) 2007 Cisco Systems, Inc. All rights reserved.
*
* $COPYRIGHT$
*
*
* Additional copyrights may follow
*
*
* $HEADER$
*/
@@ -21,37 +24,812 @@
#include "orte_config.h"
#include <sys/types.h>
#ifdef HAVE_UNISTD_H
#include <unistd.h>
#endif
#include "orte/orte_constants.h"
#include "orte/mca/errmgr/errmgr.h"
#include "opal/event/event.h"
#include "opal/util/output.h"
#include "opal/util/show_help.h"
#include "opal/threads/mutex.h"
#include "opal/runtime/opal.h"
#include "opal/runtime/opal_cr.h"
#include "opal/mca/mca.h"
#include "opal/mca/base/base.h"
#include "opal/mca/base/mca_base_param.h"
#include "opal/util/os_path.h"
#include "opal/util/cmd_line.h"
#include "opal/util/malloc.h"
#include "orte/dss/dss.h"
#include "orte/mca/rml/base/base.h"
#include "orte/mca/rml/base/rml_contact.h"
#include "orte/mca/routed/base/base.h"
#include "orte/mca/routed/routed.h"
#include "orte/mca/errmgr/base/base.h"
#include "orte/mca/grpcomm/base/base.h"
#include "orte/mca/iof/base/base.h"
#include "orte/mca/ns/base/base.h"
#include "orte/mca/sds/base/base.h"
#include "orte/mca/gpr/base/base.h"
#include "orte/mca/ras/base/base.h"
#include "orte/mca/rds/base/base.h"
#include "orte/mca/pls/base/base.h"
#include "orte/mca/rmgr/base/base.h"
#include "orte/mca/odls/base/base.h"
#include "orte/mca/rmaps/base/base.h"
#if OPAL_ENABLE_FT == 1
#include "orte/mca/snapc/base/base.h"
#endif
#include "orte/mca/filem/base/base.h"
#include "orte/mca/schema/base/base.h"
#include "orte/mca/smr/base/base.h"
#include "orte/util/univ_info.h"
#include "orte/util/proc_info.h"
#include "orte/util/session_dir.h"
#include "orte/util/sys_info.h"
#include "orte/util/universe_setup_file_io.h"
/* these are to be cleaned up for 2.0 */
#include "orte/mca/ras/base/ras_private.h"
#include "orte/mca/rmgr/base/rmgr_private.h"
#include "orte/runtime/runtime.h"
#include "orte/runtime/orte_wait.h"
#include "orte/runtime/params.h"
#include "opal/runtime/opal.h"
#include "orte/runtime/runtime.h"
#include "orte/runtime/orte_cr.h"
/**
* Initialize and set up a process in ORTE.
*
* @retval ORTE_SUCCESS Upon success.
* @retval ORTE_ERROR Upon failure.
*/
int orte_init(bool infrastructure, bool barrier)
int orte_init(bool infrastructure)
{
int rc;
int ret;
char *error = NULL;
char *jobid_str = NULL;
char *procid_str = NULL;
char *contact_path = NULL;
if (orte_initialized) {
return ORTE_SUCCESS;
}
if (ORTE_SUCCESS != (rc = opal_init())) {
ORTE_ERROR_LOG(rc);
return rc;
/* initialize the opal layer */
if (ORTE_SUCCESS != (ret = opal_init())) {
ORTE_ERROR_LOG(ret);
return ret;
}
if (ORTE_SUCCESS != (rc = orte_system_init(infrastructure, barrier))) {
ORTE_ERROR_LOG(rc);
return rc;
/* register handler for errnum -> string conversion */
opal_error_register("ORTE", ORTE_ERR_BASE, ORTE_ERR_MAX, orte_err2str);
/* Register all MCA Params */
if (ORTE_SUCCESS != (ret = orte_register_params(infrastructure))) {
error = "orte_register_params";
goto error;
}
/* Ensure the system_info structure is instantiated and initialized */
if (ORTE_SUCCESS != (ret = orte_sys_info())) {
error = "orte_sys_info";
goto error;
}
/* Ensure the process info structure is instantiated and initialized */
if (ORTE_SUCCESS != (ret = orte_proc_info())) {
error = "orte_proc_info";
goto error;
}
/* Ensure the universe_info structure is instantiated and initialized */
if (ORTE_SUCCESS != (ret = orte_univ_info())) {
error = "orte_univ_info";
goto error;
}
/*
* Initialize the data storage service.
*/
if (ORTE_SUCCESS != (ret = orte_dss_open())) {
error = "orte_dss_open";
goto error;
}
/*
* Open the name services to ensure access to local functions
*/
if (ORTE_SUCCESS != (ret = orte_ns_base_open())) {
error = "orte_ns_base_open";
goto error;
}
/* Open the error manager to activate error logging - needs local name services */
if (ORTE_SUCCESS != (ret = orte_errmgr_base_open())) {
error = "orte_errmgr_base_open";
goto error;
}
/***** ERROR LOGGING NOW AVAILABLE *****/
/*
* Internal startup
*/
if (ORTE_SUCCESS != (ret = orte_wait_init())) {
ORTE_ERROR_LOG(ret);
error = "orte_wait_init";
goto error;
}
/*
* Runtime Messaging Layer
*/
if (ORTE_SUCCESS != (ret = orte_rml_base_open())) {
ORTE_ERROR_LOG(ret);
error = "orte_rml_base_open";
goto error;
}
/*
* Runtime Messaging Layer
*/
if (ORTE_SUCCESS != (ret = orte_rml_base_select())) {
ORTE_ERROR_LOG(ret);
error = "orte_rml_base_select";
goto error;
}
/*
* Routed system
*/
if (ORTE_SUCCESS != (ret = orte_routed_base_open())) {
ORTE_ERROR_LOG(ret);
error = "orte_routed_base_open";
goto error;
}
/*
* Routed system
*/
if (ORTE_SUCCESS != (ret = orte_routed_base_select())) {
ORTE_ERROR_LOG(ret);
error = "orte_routed_base_select";
goto error;
}
/*
* Group communications
*/
if (ORTE_SUCCESS != (ret = orte_grpcomm_base_open())) {
ORTE_ERROR_LOG(ret);
error = "orte_grpcomm_base_open";
goto error;
}
if (ORTE_SUCCESS != (ret = orte_grpcomm_base_select())) {
ORTE_ERROR_LOG(ret);
error = "orte_grpcomm_base_select";
goto error;
}
/*
* Registry
*/
if (ORTE_SUCCESS != (ret = orte_gpr_base_open())) {
ORTE_ERROR_LOG(ret);
error = "orte_gpr_base_open";
goto error;
}
/*
* Initialize the daemon launch system so those types
* are registered (needed by the sds to talk to its
* local daemon)
*/
if (ORTE_SUCCESS != (ret = orte_odls_base_open())) {
ORTE_ERROR_LOG(ret);
error = "orte_odls_base_open";
goto error;
}
/*
* Initialize schema utilities
*/
if (ORTE_SUCCESS != (ret = orte_schema_base_open())) {
ORTE_ERROR_LOG(ret);
error = "orte_schema_base_open";
goto error;
}
/*
* Initialize and select the Startup Discovery Service. This must
* be done here since some environments have different requirements
* for detecting/connecting to a universe. Note that this does
* *not* set our name - that will come later
*/
if (ORTE_SUCCESS != (ret = orte_sds_base_open())) {
ORTE_ERROR_LOG(ret);
error = "orte_sds_base_open";
goto error;
}
if (ORTE_SUCCESS != (ret = orte_sds_base_select())) {
ORTE_ERROR_LOG(ret);
error = "orte_sds_base_select";
goto error;
}
/* Try to connect to the universe. If we don't find one and are a
* singleton, this will startup a new HNP and define our name
* within it - in which case, we will skip the name discovery
* process since we already have one
*/
if (ORTE_SUCCESS != (ret = orte_sds_base_contact_universe())) {
ORTE_ERROR_LOG(ret);
error = "orte_sds_base_contact_universe";
goto error;
}
/*
* Name Server
*/
if (ORTE_SUCCESS != (ret = orte_ns_base_select())) {
ORTE_ERROR_LOG(ret);
error = "orte_ns_base_select";
goto error;
}
/*
* Registry
*/
if (ORTE_SUCCESS != (ret = orte_gpr_base_select())) {
ORTE_ERROR_LOG(ret);
error = "orte_gpr_base_select";
goto error;
}
/* set contact info for ns/gpr */
if(NULL != orte_process_info.ns_replica_uri) {
orte_rml.set_contact_info(orte_process_info.ns_replica_uri);
}
if(NULL != orte_process_info.gpr_replica_uri) {
orte_rml.set_contact_info(orte_process_info.gpr_replica_uri);
}
/* Set my name */
if (ORTE_SUCCESS != (ret = orte_sds_base_set_name())) {
ORTE_ERROR_LOG(ret);
error = "orte_sds_base_set_name";
goto error;
}
/* all done with sds - clean up and call it a day */
orte_sds_base_close();
/*
* Now that we know for certain if we are an HNP and/or a daemon,
* setup the resource management frameworks. This includes
* selecting the daemon launch framework - that framework "knows"
* what to do if it isn't in a daemon.
*/
if (ORTE_SUCCESS != (ret = orte_rds_base_open())) {
ORTE_ERROR_LOG(ret);
error = "orte_rds_base_open";
goto error;
}
if (ORTE_SUCCESS != (ret = orte_rds_base_select())) {
ORTE_ERROR_LOG(ret);
error = "orte_rds_base_select";
goto error;
}
if (ORTE_SUCCESS != (ret = orte_ras_base_open())) {
ORTE_ERROR_LOG(ret);
error = "orte_ras_base_open";
goto error;
}
if (ORTE_SUCCESS != (ret = orte_ras_base_find_available())) {
ORTE_ERROR_LOG(ret);
error = "orte_ras_base_find_available";
goto error;
}
if (ORTE_SUCCESS != (ret = orte_rmaps_base_open())) {
ORTE_ERROR_LOG(ret);
error = "orte_rmaps_base_open";
goto error;
}
if (ORTE_SUCCESS != (ret = orte_rmaps_base_find_available())) {
ORTE_ERROR_LOG(ret);
error = "orte_rmaps_base_find_available";
goto error;
}
if (ORTE_SUCCESS != (ret = orte_pls_base_open())) {
ORTE_ERROR_LOG(ret);
error = "orte_pls_base_open";
goto error;
}
if (ORTE_SUCCESS != (ret = orte_pls_base_select())) {
ORTE_ERROR_LOG(ret);
error = "orte_pls_base_select";
goto error;
}
if (ORTE_SUCCESS != (ret = orte_odls_base_select())) {
ORTE_ERROR_LOG(ret);
error = "orte_odls_base_select";
goto error;
}
if (ORTE_SUCCESS != (ret = orte_rmgr_base_open())) {
ORTE_ERROR_LOG(ret);
error = "orte_rmgr_base_open";
goto error;
}
if (ORTE_SUCCESS != (ret = orte_rmgr_base_select())) {
ORTE_ERROR_LOG(ret);
error = "orte_rmgr_base_select";
goto error;
}
/*
* setup the state monitor
*/
if (ORTE_SUCCESS != (ret = orte_smr_base_open())) {
ORTE_ERROR_LOG(ret);
error = "orte_smr_base_open";
goto error;
}
if (ORTE_SUCCESS != (ret = orte_smr_base_select())) {
ORTE_ERROR_LOG(ret);
error = "orte_smr_base_select";
goto error;
}
/*
* setup the errmgr -- open has been done way before
*/
if (ORTE_SUCCESS != (ret = orte_errmgr_base_select())) {
ORTE_ERROR_LOG(ret);
error = "orte_errmgr_base_select";
goto error;
}
/* enable communication with the rml */
if (ORTE_SUCCESS != (ret = orte_rml.enable_comm())) {
ORTE_ERROR_LOG(ret);
error = "orte_rml.enable_comm";
goto error;
}
/* if I'm the seed, set the seed uri to be me! */
if (orte_process_info.seed) {
if (NULL != orte_universe_info.seed_uri) {
free(orte_universe_info.seed_uri);
}
orte_universe_info.seed_uri = orte_rml.get_contact_info();
/* and make sure that the daemon flag is NOT set so that
* components unique to non-HNP orteds can be selected
*/
orte_process_info.daemon = false;
}
/* setup my session directory */
if (ORTE_SUCCESS != (ret = orte_ns.get_jobid_string(&jobid_str, orte_process_info.my_name))) {
ORTE_ERROR_LOG(ret);
error = "orte_ns.get_jobid_string";
goto error;
}
if (ORTE_SUCCESS != (ret = orte_ns.get_vpid_string(&procid_str, orte_process_info.my_name))) {
ORTE_ERROR_LOG(ret);
error = "orte_ns.get_vpid_string";
goto error;
}
if (orte_debug_flag) {
opal_output(0, "%s setting up session dir with",
ORTE_NAME_PRINT(orte_process_info.my_name));
if (NULL != orte_process_info.tmpdir_base) {
opal_output(0, "\ttmpdir %s", orte_process_info.tmpdir_base);
}
opal_output(0, "\tuniverse %s", orte_universe_info.name);
opal_output(0, "\tuser %s", orte_system_info.user);
opal_output(0, "\thost %s", orte_system_info.nodename);
opal_output(0, "\tjobid %s", jobid_str);
opal_output(0, "\tprocid %s", procid_str);
}
if (ORTE_SUCCESS != (ret = orte_session_dir(true,
orte_process_info.tmpdir_base,
orte_system_info.user,
orte_system_info.nodename, NULL,
orte_universe_info.name,
jobid_str, procid_str))) {
if (jobid_str != NULL) free(jobid_str);
if (procid_str != NULL) free(procid_str);
ORTE_ERROR_LOG(ret);
error = "orte_session_dir";
goto error;
}
if (NULL != jobid_str) {
free(jobid_str);
}
if (NULL != procid_str) {
free(procid_str);
}
/* Once the session directory location has been established, set
the opal_output default file location to be in the
proc-specific session directory. */
opal_output_set_output_file_info(orte_process_info.proc_session_dir,
"output-", NULL, NULL);
/* if i'm the seed, get my contact info and write my setup file for others to find */
if (orte_process_info.seed) {
contact_path = opal_os_path(false, orte_process_info.universe_session_dir,
"universe-setup.txt", NULL);
if (orte_debug_flag) {
opal_output(0, "%s contact_file %s",
ORTE_NAME_PRINT(orte_process_info.my_name), contact_path);
}
if (ORTE_SUCCESS != (ret = orte_write_universe_setup_file(contact_path, &orte_universe_info))) {
if (orte_debug_flag) {
opal_output(0, "%s couldn't write setup file", ORTE_NAME_PRINT(orte_process_info.my_name));
}
} else if (orte_debug_flag) {
opal_output(0, "%s wrote setup file", ORTE_NAME_PRINT(orte_process_info.my_name));
}
free(contact_path);
}
/*
* Initialize the selected modules now that all components/names are available.
*/
if (ORTE_SUCCESS != (ret = orte_ns.init())) {
ORTE_ERROR_LOG(ret);
error = "orte_ns.init";
goto error;
}
if (ORTE_SUCCESS != (ret = orte_gpr.init())) {
ORTE_ERROR_LOG(ret);
error = "orte_gpr.init";
goto error;
}
/*
* setup I/O forwarding system
*/
if (ORTE_SUCCESS != (ret = orte_iof_base_open())) {
ORTE_ERROR_LOG(ret);
error = "orte_iof_base_open";
goto error;
}
if (ORTE_SUCCESS != (ret = orte_iof_base_select())) {
ORTE_ERROR_LOG(ret);
error = "orte_iof_base_select";
goto error;
}
if (orte_process_info.seed) {
/* if we are an HNP, we have to setup an app_context for ourselves so
* various frameworks can find their required info
*/
orte_app_context_t *app;
orte_gpr_value_t *value;
app = OBJ_NEW(orte_app_context_t);
app->app = strdup("orted");
app->num_procs = 1;
if (ORTE_SUCCESS != (ret = orte_rmgr.store_app_context(ORTE_PROC_MY_NAME->jobid, &app, 1))) {
ORTE_ERROR_LOG(ret);
error = "HNP store app context";
goto error;
}
OBJ_RELEASE(app);
if (ORTE_SUCCESS != (ret = orte_gpr.create_value(&value,
ORTE_GPR_OVERWRITE|ORTE_GPR_TOKENS_AND,
"orte-job-0", 2,
0))) {
ORTE_ERROR_LOG(ret);
error = "HNP create value";
goto error;
}
/* store the node name and the daemon's name */
if (ORTE_SUCCESS != (ret = orte_gpr.create_keyval(&(value->keyvals[0]), ORTE_NODE_NAME_KEY,
ORTE_STRING, orte_system_info.nodename))) {
ORTE_ERROR_LOG(ret);
error = "HNP create keyval";
goto error;
}
if (ORTE_SUCCESS != (ret = orte_gpr.create_keyval(&(value->keyvals[1]), ORTE_PROC_NAME_KEY,
ORTE_NAME, ORTE_PROC_MY_NAME))) {
ORTE_ERROR_LOG(ret);
error = "HNP create keyval";
goto error;
}
/* set the tokens */
if (ORTE_SUCCESS != (ret = orte_schema.get_proc_tokens(&(value->tokens), &(value->num_tokens), ORTE_PROC_MY_NAME))) {
ORTE_ERROR_LOG(ret);
error = "HNP get proc tokens";
goto error;
}
/* insert values */
if (ORTE_SUCCESS != (ret = orte_gpr.put(1, &value))) {
ORTE_ERROR_LOG(ret);
error = "HNP put values";
goto error;
}
OBJ_RELEASE(value);
/* and set our state to LAUNCHED */
if (ORTE_SUCCESS != (ret = orte_smr.set_proc_state(ORTE_PROC_MY_NAME, ORTE_PROC_STATE_LAUNCHED, 0))) {
ORTE_ERROR_LOG(ret);
error = "HNP could not set launched state";
goto error;
}
}
if (orte_process_info.singleton) {
/* since all frameworks are now open and active, walk through
* the spawn sequence to ensure that we collect info on all
* available resources. Although we are a singleton and hence
* don't need to be spawned, we may choose to dynamically spawn
* additional processes. If we do that, then we need to know
* about any resources that have been allocated to us - executing
* the RDS and RAS frameworks is the only way to get that info.
*
* THIS ONLY SHOULD BE DONE FOR SINGLETONS - DO NOT DO IT
* FOR ANY OTHER CASE
*/
orte_gpr_value_t *values[2];
orte_std_cntr_t one=1, zero=0;
orte_proc_state_t init=ORTE_PROC_STATE_INIT;
orte_vpid_t lrank=0, vone=1;
char *segment;
/* ensure we read the allocation, even if we are not sitting on a node that is within it */
if (ORTE_SUCCESS != (ret = orte_ras.allocate_job(ORTE_PROC_MY_NAME->jobid, NULL))) {
ORTE_ERROR_LOG(ret);
error = "singleton could not allocate job";
goto error;
}
/* let's deal with the procs first - start by getting their job segment name */
if (ORTE_SUCCESS != (ret = orte_schema.get_job_segment_name(&segment, ORTE_PROC_MY_NAME->jobid))) {
ORTE_ERROR_LOG(ret);
error = "singleton could not create job segment name";
goto error;
}
if (ORTE_SUCCESS != (ret = orte_gpr.create_value(&(values[0]),
ORTE_GPR_OVERWRITE|ORTE_GPR_TOKENS_AND,
segment, 7, 0))) {
ORTE_ERROR_LOG(ret);
error = "singleton could not create gpr value";
goto error;
}
if (ORTE_SUCCESS != (ret = orte_gpr.create_keyval(&(values[0]->keyvals[0]), ORTE_PROC_RANK_KEY, ORTE_VPID, &(ORTE_PROC_MY_NAME->vpid)))) {
ORTE_ERROR_LOG(ret);
error = "singleton could not create keyval";
goto error;
}
if (ORTE_SUCCESS != (ret = orte_gpr.create_keyval(&(values[0]->keyvals[1]), ORTE_PROC_NAME_KEY, ORTE_NAME, ORTE_PROC_MY_NAME))) {
ORTE_ERROR_LOG(ret);
error = "singleton could not create keyval";
goto error;
}
if (ORTE_SUCCESS != (ret = orte_gpr.create_keyval(&(values[0]->keyvals[2]), ORTE_NODE_NAME_KEY, ORTE_STRING, orte_system_info.nodename))) {
ORTE_ERROR_LOG(ret);
error = "singleton could not create keyval";
goto error;
}
if (ORTE_SUCCESS != (ret = orte_gpr.create_keyval(&(values[0]->keyvals[3]), ORTE_PROC_APP_CONTEXT_KEY, ORTE_STD_CNTR, &zero))) {
ORTE_ERROR_LOG(ret);
error = "singleton could not create keyval";
goto error;
}
if (ORTE_SUCCESS != (ret = orte_gpr.create_keyval(&(values[0]->keyvals[4]), ORTE_PROC_STATE_KEY, ORTE_PROC_STATE, &init))) {
ORTE_ERROR_LOG(ret);
error = "singleton could not create keyval";
goto error;
}
if (ORTE_SUCCESS != (ret = orte_gpr.create_keyval(&(values[0]->keyvals[5]), ORTE_PROC_LOCAL_RANK_KEY, ORTE_VPID, &lrank))) {
ORTE_ERROR_LOG(ret);
error = "singleton could not create keyval";
goto error;
}
if (ORTE_SUCCESS != (ret = orte_gpr.create_keyval(&(values[0]->keyvals[6]), ORTE_NODE_NUM_PROCS_KEY, ORTE_STD_CNTR, &one))) {
ORTE_ERROR_LOG(ret);
error = "singleton could not create keyval";
goto error;
}
if (ORTE_SUCCESS != (ret = orte_schema.get_proc_tokens(&(values[0]->tokens), &(values[0]->num_tokens), ORTE_PROC_MY_NAME))) {
ORTE_ERROR_LOG(ret);
error = "singleton could not get gpr tokens";
goto error;
}
if (ORTE_SUCCESS != (ret = orte_gpr.create_value(&values[1],
ORTE_GPR_OVERWRITE|ORTE_GPR_TOKENS_AND,
segment, 3, 1))) {
ORTE_ERROR_LOG(ret);
error = "singleton could not create gpr value";
goto error;
}
if (ORTE_SUCCESS != (ret = orte_gpr.create_keyval(&(values[1]->keyvals[0]), ORTE_PROC_NUM_AT_INIT, ORTE_STD_CNTR, &one))) {
ORTE_ERROR_LOG(ret);
error = "singleton could not create keyval";
goto error;
}
if (ORTE_SUCCESS != (ret = orte_gpr.create_keyval(&(values[1]->keyvals[1]), ORTE_JOB_VPID_START_KEY, ORTE_VPID, &(ORTE_PROC_MY_NAME->vpid)))) {
ORTE_ERROR_LOG(ret);
error = "singleton could not create keyval";
goto error;
}
if (ORTE_SUCCESS != (ret = orte_gpr.create_keyval(&(values[1]->keyvals[2]), ORTE_JOB_VPID_RANGE_KEY, ORTE_VPID, &vone))) {
ORTE_ERROR_LOG(ret);
error = "singleton could not create keyval";
goto error;
}
values[1]->tokens[0] = strdup(ORTE_JOB_GLOBALS); /* counter is in the job's globals container */
/* insert all values in one call */
if (ORTE_SUCCESS != (ret = orte_gpr.put(2, values))) {
ORTE_ERROR_LOG(ret);
error = "singleton could not put its launch data";
goto error;
}
OBJ_RELEASE(values[0]);
OBJ_RELEASE(values[1]);
free(segment);
/* wireup our io */
if (ORTE_SUCCESS != (ret = orte_iof.iof_pull(ORTE_PROC_MY_NAME, ORTE_NS_CMP_JOBID, ORTE_IOF_STDOUT, 1))) {
ORTE_ERROR_LOG(ret);
error = "singleton could not setup iof";
goto error;
}
if (ORTE_SUCCESS != (ret = orte_iof.iof_pull(ORTE_PROC_MY_NAME, ORTE_NS_CMP_JOBID, ORTE_IOF_STDERR, 2))) {
ORTE_ERROR_LOG(ret);
error = "singleton could not setup iof";
goto error;
}
if (ORTE_SUCCESS != (ret = orte_iof.iof_push(ORTE_PROC_MY_NAME, ORTE_NS_CMP_JOBID, ORTE_IOF_STDIN, 0))) {
ORTE_ERROR_LOG(ret);
error = "singleton could not setup iof";
goto error;
}
/* setup the errmgr, as required */
if (ORTE_SUCCESS != (ret = orte_errmgr.register_job(ORTE_PROC_MY_NAME->jobid))) {
ORTE_ERROR_LOG(ret);
error = "singleton could not setup errmgr callbacks";
goto error;
}
/* set our state to LAUNCHED */
if (ORTE_SUCCESS != (ret = orte_smr.set_proc_state(ORTE_PROC_MY_NAME, ORTE_PROC_STATE_LAUNCHED, 0))) {
ORTE_ERROR_LOG(ret);
error = "singleton could not set launched state";
goto error;
}
}
/* setup the routed info - the selected routed component
* will know what to do. Some may put us in a blocking
* receive here so they can get ALL of the contact info
* from our peers. Others may just find the local daemon's
* contact info and immediately return.
*/
if( !orte_universe_info.console ) {
if (ORTE_SUCCESS != (ret = orte_routed.init_routes(ORTE_PROC_MY_NAME->jobid, NULL))) {
ORTE_ERROR_LOG(ret);
error = "orte_routed.init_routes";
goto error;
}
}
/*
* Setup the FileM
*/
if (ORTE_SUCCESS != (ret = orte_filem_base_open())) {
ORTE_ERROR_LOG(ret);
error = "orte_filem_base_open";
goto error;
}
if (ORTE_SUCCESS != (ret = orte_filem_base_select())) {
ORTE_ERROR_LOG(ret);
error = "orte_filem_base_select";
goto error;
}
#if OPAL_ENABLE_FT == 1
/*
* Setup the SnapC
*/
if (ORTE_SUCCESS != (ret = orte_snapc_base_open())) {
ORTE_ERROR_LOG(ret);
error = "orte_snapc_base_open";
goto error;
}
if (ORTE_SUCCESS != (ret = orte_snapc_base_select(orte_process_info.seed, !orte_process_info.daemon))) {
ORTE_ERROR_LOG(ret);
error = "orte_snapc_base_select";
goto error;
}
/* Need to figure out if we are an application or part of ORTE */
if(infrastructure ||
orte_process_info.seed ||
orte_process_info.daemon) {
/* ORTE doesn't need the OPAL CR stuff */
opal_cr_set_enabled(false);
}
else {
/* The application does however */
opal_cr_set_enabled(true);
}
#else
opal_cr_set_enabled(false);
#endif
/*
* Initialize the CR setup
* Note: Always do this, even in non-FT builds.
* If we don't some user level tools may hang.
*/
if (ORTE_SUCCESS != (ret = orte_cr_init())) {
ORTE_ERROR_LOG(ret);
error = "orte_cr_init";
goto error;
}
/* Since we are now finished with init, change the state to running */
orte_universe_info.state = ORTE_UNIVERSE_STATE_RUNNING;
/* startup the receive if we are not the HNP - unless we are a singleton,
* in which case we must start it up in case we do a comm_spawn!
*/
if (orte_process_info.singleton ||
(!orte_process_info.seed && orte_process_info.daemon)) {
if (ORTE_SUCCESS != (ret = orte_rml_base_comm_start())) {
ORTE_ERROR_LOG(ret);
error = "orte_rml_base_comm_start";
goto error;
}
}
/* All done */
orte_initialized = true;
return ORTE_SUCCESS;
error:
if (ret != ORTE_SUCCESS) {
opal_show_help("help-orte-runtime",
"orte_init:startup:internal-failure",
true, error, ORTE_ERROR_NAME(ret), ret);
}
return ret;
}

Some files were not shown because too many files changed in this diff.