These changes were mostly captured in a prior RFC (except for #2 below) and are aimed specifically at improving startup performance and setting up the remaining modifications described in that RFC.
The commit has been tested for C/R and Cray operations, and on Odin (SLURM, rsh) and RoadRunner (TM). I tried to update all environments, but obviously could not test them. I know that Windows needs some work, and have highlighted what is known to be needed in the odls process component.

This represents a lot of work by Brian, Tim P, Josh, and myself, with much advice from Jeff and others. For posterity, I have appended a copy of the email describing the work that was done:

As we have repeatedly noted, the modex operation in MPI_Init is the single greatest consumer of time during startup. To date, we have executed that operation as an ORTE stage gate that held the process until a startup message containing all required modex info (and OOB contact info - see #3 below) could be sent to it. Each process would send its data to the HNP's registry, which assembled and sent the message when all processes had reported in. In addition, ORTE had taken responsibility for monitoring process status as it progressed through a series of "stage gates". The process reported its status at each gate, and ORTE would then send a "release" message once all procs had reported in.

The incoming changes revamp these procedures in three ways:

1. Eliminating the ORTE stage gate system and cleanly delineating responsibility between the OMPI and ORTE layers for MPI init/finalize. The modex stage gate (STG1) has been replaced by a collective operation in the modex itself that performs an allgather on the required modex info. The allgather is implemented using the orte_grpcomm framework, since the BTLs are not active at that point. At the moment, the grpcomm framework only has a "basic" component, analogous to OMPI's "basic" coll framework - I would recommend that the MPI team create additional, more advanced components to improve the performance of this step.

The other stage gates have been replaced by orte_grpcomm barrier functions. We tried to use MPI barriers instead (since the BTLs are active at that point), but - as we discussed on the telecon - these are not currently true barriers, so the job would hang when we fell through while messages were still in flight. Note that the grpcomm barrier doesn't actually resolve that problem, but Brian has pointed out that we are unlikely to ever see it violated. Again, you might want to spend a little time on an advanced barrier algorithm, as the one in "basic" is very simplistic.

Summarizing this change: ORTE no longer tracks process state, nor does it have direct responsibility for synchronizing jobs. This is now done via collective operations within the MPI layer, albeit using ORTE collective communication services. I -strongly- urge the MPI team to implement advanced collective algorithms to improve the performance of this critical procedure.
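Condensed to its core, the new exchange looks like this - a sketch assembled from the pml_ob1 restart path in the diff below (all function names are from this commit; error paths abbreviated):

    orte_buffer_t mdx_buf, rbuf;
    int ret;

    /* pack this proc's modex entries (component names/versions + payload) */
    OBJ_CONSTRUCT(&mdx_buf, orte_buffer_t);
    if (OMPI_SUCCESS != (ret = ompi_modex_get_my_buffer(&mdx_buf))) {
        return ret;
    }

    /* collective exchange via the grpcomm framework - the BTLs are not
     * active yet, so this cannot use MPI collectives */
    OBJ_CONSTRUCT(&rbuf, orte_buffer_t);
    if (OMPI_SUCCESS != (ret = orte_grpcomm.allgather(&mdx_buf, &rbuf))) {
        return ret;
    }
    OBJ_DESTRUCT(&mdx_buf);

    /* unpack everyone's entries into the local proc structures */
    if (OMPI_SUCCESS != (ret = ompi_modex_process_data(&rbuf))) {
        return ret;
    }
    OBJ_DESTRUCT(&rbuf);

(The dynamic-process path in ompi_comm_connect_accept() uses the same pattern with orte_grpcomm.allgather_list() over an explicit list of participants.)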
2. Reducing the volume of data exchanged during the modex. Data in the modex consisted of the process name, the name of the node where that process is located (expressed as a string), plus a string representation of all contact info. The nodename was required so the modex could determine whether a process was local or not - in addition, some people like to have it for printing pretty error messages when a connection fails. The size of this data has been reduced in three ways (a sizing sketch follows the message below):

(a) Reducing the size of the process name itself. The process name consisted of two 32-bit fields for the jobid and vpid. This is far larger than any current system, or any system likely to exist in the near future, can support. Accordingly, the default size of these fields has been reduced to 16 bits, which means you can have 32k procs in each of 32k jobs. Since the daemons must have a vpid, and we require one daemon per node, this also restricts the default configuration to 32k nodes. To support any future "mega-clusters", a configure option --enable-jumbo-apps has been added. This option increases the jobid and vpid field sizes to 32 bits. Someday, if necessary, someone can add yet another option to increase them to 64 bits, I suppose.

(b) Replacing the string nodename with an integer nodeid. Since we have one daemon per node, the nodeid corresponds to the local daemon's vpid. This replaces an often lengthy string with only 2 (or at most 4) bytes - a substantial reduction.

(c) When the mca param requesting that nodenames be sent (to support pretty error messages) is set, a second mca param is now used to request the FQDN - otherwise, the domain name is stripped from the message (by default) to save space. If someone wants to combine those into a single param somehow (perhaps with an argument?), they are welcome to do so - I didn't want to alter what people are already using.

While these may seem like small savings, they add up to a significant impact when aggregated across the entire modex operation. Since every proc must receive the modex data regardless of the collective used to send it, just reducing the size of the process name removes nearly 400 MBytes of communication from a 32k-proc job (admittedly, much of this comm may occur in parallel). So it does add up pretty quickly.

3. Routing RML messages to reduce connections. The default messaging system remains point-to-point - i.e., each proc opens a socket to every proc it communicates with and sends its messages directly. A new option uses the orteds as routers - i.e., each proc only opens a single socket to its local orted. All messages are sent from the proc to the orted, which forwards the message to the orted on the node where the intended recipient proc is located - that orted then forwards the message to its local proc (the recipient). This greatly reduces the connection storm we have encountered during startup. It also has the benefit of removing the sharing of every proc's OOB contact info with every other proc. The orted routing tables are populated during launch, since every orted gets a map of where every proc is being placed. Each proc, therefore, only needs to know the contact info for its local daemon, which is passed in via the environment when the proc is fork/exec'd by the daemon. This alone removes ~50 bytes/process of communication that was in the current STG1 startup message - so for our 32k-proc job, this saves us roughly 32k * 50 bytes = 1.6 MBytes sent to 32k procs = 51 GBytes of messaging. Note that you can use the new routing method by specifying -mca routed tree, if you so desire. This mode will become the default at some point in the future.

There are a few minor additional changes in the commit that I'll just note in passing:

* propagation of command-line mca params to the orteds - fixes ticket #1073. See the note there for details.
* requiring "finalize" prior to "exit" for MPI procs - fixes ticket #1144. See the note there for details.
* cleanup of some stale header files

This commit was SVN r16364.
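To make #2(a) concrete, here is a hedged sketch of the sizing switch. The typedef names (in particular orte_vpid_t) are assumptions for illustration only - the real definitions live in the ORTE headers and are merely governed by the ORTE_ENABLE_JUMBO_APPS define added in the configure chunk below:

    /* Sketch only, not the verbatim ORTE headers: illustrates the
     * field-width selection wired up by --enable-jumbo-apps. */
    #if ORTE_ENABLE_JUMBO_APPS
    typedef uint32_t orte_jobid_t;   /* jumbo: beyond 32k jobs */
    typedef uint32_t orte_vpid_t;    /* jumbo: beyond 32k procs/nodes */
    #else
    typedef uint16_t orte_jobid_t;   /* default: up to 32k jobs */
    typedef uint16_t orte_vpid_t;    /* default: up to 32k procs per job,
                                      * and 32k nodes (one daemon vpid each) */
    #endif

    struct orte_process_name_t {
        orte_jobid_t jobid;
        orte_vpid_t vpid;
    };  /* 4 bytes by default, versus 8 bytes with two 32-bit fields */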
@@ -517,6 +517,25 @@ AC_DEFINE_UNQUOTED([OPAL_ENABLE_TRACE], [$opal_want_trace],
[Enable run-time tracing of internal functions])


#
# Jumbo application support
#

AC_MSG_CHECKING([if want jumbo app support])
AC_ARG_ENABLE([jumbo-apps],
[AC_HELP_STRING([--enable-jumbo-apps],
[Enable support for applications in excess of 32K processes and/or 32K jobs, or running on clusters in excess of 32k nodes (default: disabled)])])
if test "$enable_jumbo_apps" = "yes"; then
AC_MSG_RESULT([yes])
orte_want_jumbo_apps=1
else
AC_MSG_RESULT([no])
orte_want_jumbo_apps=0
fi
AC_DEFINE_UNQUOTED([ORTE_ENABLE_JUMBO_APPS], [$orte_want_jumbo_apps],
[Enable support for applications in excess of 32K processes and/or 32K jobs, or running on clusters in excess of 32k nodes])


#
# Cross-compile data
#
@@ -1141,8 +1141,10 @@ ompi_proc_t **ompi_comm_get_rprocs ( ompi_communicator_t *local_comm,
goto err_exit;
}

/* decode the names into a proc-list */
rc = ompi_proc_unpack(rbuf, rsize, &rprocs );
/* decode the names into a proc-list -- will never add a new proc
as the result of this operation, so no need to get the newprocs
list or call PML add_procs(). */
rc = ompi_proc_unpack(rbuf, rsize, &rprocs, NULL, NULL);
OBJ_RELEASE(rbuf);

err_exit:
@@ -51,6 +51,7 @@
#include "ompi/info/info.h"
#include "ompi/constants.h"
#include "ompi/mca/pml/pml.h"
#include "ompi/runtime/ompi_module_exchange.h"

#include "orte/util/proc_info.h"
#include "orte/dss/dss.h"
@@ -63,6 +64,7 @@
#include "orte/mca/rmgr/base/base.h"
#include "orte/mca/smr/smr_types.h"
#include "orte/mca/rml/rml.h"
#include "orte/mca/grpcomm/grpcomm.h"

#include "orte/runtime/runtime.h"

@@ -86,8 +88,8 @@ int ompi_comm_connect_accept ( ompi_communicator_t *comm, int root,
ompi_group_t *group=comm->c_local_group;
orte_process_name_t *rport=NULL, tmp_port_name;
orte_buffer_t *nbuf=NULL, *nrbuf=NULL;
ompi_proc_t **proc_list=NULL;
int i,j;
ompi_proc_t **proc_list=NULL, **new_proc_list;
int i,j, new_proc_len;
ompi_group_t *new_group_pointer;

size = ompi_comm_size ( comm );
@@ -219,11 +221,77 @@ int ompi_comm_connect_accept ( ompi_communicator_t *comm, int root,
goto exit;
}

rc = ompi_proc_unpack(nrbuf, rsize, &rprocs);
rc = ompi_proc_unpack(nrbuf, rsize, &rprocs, &new_proc_len, &new_proc_list);
if ( OMPI_SUCCESS != rc ) {
goto exit;
}

/* If we added new procs, we need to do the modex and then call
PML add_procs */
if (new_proc_len > 0) {
opal_list_t all_procs;
orte_namelist_t *name;
orte_buffer_t mdx_buf, rbuf;

OBJ_CONSTRUCT(&all_procs, opal_list_t);

if (send_first) {
for (i = 0 ; i < group->grp_proc_count ; ++i) {
name = OBJ_NEW(orte_namelist_t);
name->name = &(ompi_group_peer_lookup(group, i)->proc_name);
opal_list_append(&all_procs, &name->item);
}

for (i = 0 ; i < rsize ; ++i) {
name = OBJ_NEW(orte_namelist_t);
name->name = &(rprocs[i]->proc_name);
opal_list_append(&all_procs, &name->item);
}
} else {
for (i = 0 ; i < rsize ; ++i) {
name = OBJ_NEW(orte_namelist_t);
name->name = &(rprocs[i]->proc_name);
opal_list_append(&all_procs, &name->item);
}

for (i = 0 ; i < group->grp_proc_count ; ++i) {
name = OBJ_NEW(orte_namelist_t);
name->name = &(ompi_group_peer_lookup(group, i)->proc_name);
opal_list_append(&all_procs, &name->item);
}
}

OBJ_CONSTRUCT(&mdx_buf, orte_buffer_t);
if (OMPI_SUCCESS != (rc = ompi_modex_get_my_buffer(&mdx_buf))) {
ORTE_ERROR_LOG(rc);
goto exit;
}

OBJ_CONSTRUCT(&rbuf, orte_buffer_t);
if (OMPI_SUCCESS != (rc = orte_grpcomm.allgather_list(&all_procs,
&mdx_buf,
&rbuf))) {
ORTE_ERROR_LOG(rc);
goto exit;
}
OBJ_DESTRUCT(&mdx_buf);

if (OMPI_SUCCESS != (rc = ompi_modex_process_data(&rbuf))) {
ORTE_ERROR_LOG(rc);
goto exit;
}
OBJ_DESTRUCT(&rbuf);

/*
while (NULL != (item = opal_list_remove_first(&all_procs))) {
OBJ_RELEASE(item);
}
OBJ_DESTRUCT(&all_procs);
*/

MCA_PML_CALL(add_procs(new_proc_list, new_proc_len));
}

OBJ_RELEASE(nrbuf);
if ( rank == root ) {
OBJ_RELEASE(nbuf);
@@ -407,7 +475,6 @@ ompi_comm_start_processes(int count, char **array_of_commands,
orte_std_cntr_t num_apps, ai;
orte_jobid_t new_jobid=ORTE_JOBID_INVALID;
orte_app_context_t **apps=NULL;
orte_proc_state_t state;

opal_list_t attributes;
opal_list_item_t *item;
@@ -651,6 +718,7 @@ ompi_comm_start_processes(int count, char **array_of_commands,
return MPI_ERR_SPAWN;
}

#if 0
/* tell the RTE that we want to be cross-connected to the children so we receive
* their ORTE-level information - e.g., OOB contact info - when they
* reach the STG1 stage gate
@@ -664,6 +732,7 @@ ompi_comm_start_processes(int count, char **array_of_commands,
opal_progress_event_users_decrement();
return MPI_ERR_SPAWN;
}
#endif

/* check for timing request - get stop time and report elapsed time if so */
if (timing) {
@@ -459,6 +459,7 @@ int mca_pml_ob1_ft_event( int state )
ompi_proc_t** procs = NULL;
size_t num_procs;
int ret, p;
orte_buffer_t mdx_buf, rbuf;

if(OPAL_CRS_CHECKPOINT == state) {
;
@@ -469,6 +470,10 @@ int mca_pml_ob1_ft_event( int state )
else if(OPAL_CRS_RESTART == state) {
/*
* Get a list of processes
* NOTE: Do *not* call ompi_proc_finalize as there are many places in
* the code that point to indv. procs in this strucutre. For our
* needs here we only need to fix up the modex, bml and pml
* references.
*/
procs = ompi_proc_all(&num_procs);
if(NULL == procs) {
@@ -487,6 +492,9 @@ int mca_pml_ob1_ft_event( int state )
return ret;
}

/*
* Make sure the modex is NULL so it can be re-initalized
*/
for(p = 0; p < (int)num_procs; ++p) {
if( NULL != procs[p]->proc_modex ) {
OBJ_RELEASE(procs[p]->proc_modex);
@@ -494,6 +502,9 @@ int mca_pml_ob1_ft_event( int state )
}
}

/*
* Init the modex structures
*/
if (OMPI_SUCCESS != (ret = ompi_modex_init())) {
opal_output(0,
"pml:ob1: ft_event(Restart): modex_init Failed %d",
@@ -501,6 +512,16 @@ int mca_pml_ob1_ft_event( int state )
return ret;
}

/*
* Load back up the hostname/arch information into the modex
*/
if (OMPI_SUCCESS != (ret = ompi_proc_publish_info())) {
opal_output(0,
"pml:ob1: ft_event(Restart): proc_init Failed %d",
ret);
return ret;
}

}
else if(OPAL_CRS_TERM == state ) {
;
@@ -527,52 +548,61 @@ int mca_pml_ob1_ft_event( int state )
}
else if(OPAL_CRS_RESTART == state) {
/*
* Re-subscribe to the modex information
* Exchange the modex information once again
*/
if (OMPI_SUCCESS != (ret = ompi_modex_subscribe_job(ORTE_PROC_MY_NAME->jobid))) {
OBJ_CONSTRUCT(&mdx_buf, orte_buffer_t);
if (OMPI_SUCCESS != (ret = ompi_modex_get_my_buffer(&mdx_buf))) {
opal_output(0,
"pml:ob1: ft_event(Restart): Failed to subscribe to the modex information %d",
"pml:ob1: ft_event(Restart): Failed ompi_modex_get_my_buffer() = %d",
ret);
return ret;
}

opal_output_verbose(10, ompi_cr_output,
"pml:ob1: ft_event(Restart): Enter Stage Gate 1");
if (ORTE_SUCCESS != (ret = orte_smr.set_proc_state(orte_process_info.my_name,
ORTE_PROC_STATE_AT_STG1, 0))) {
opal_output(0,
"pml:ob1: ft_event(Restart): Stage Gate 1 Failed %d",
ret);
return ret;
}

if (ORTE_SUCCESS != (ret = orte_grpcomm.xcast_gate(orte_gpr.deliver_notify_msg))) {
opal_output(0,
"pml:ob1: ft_event(Restart): Stage Gate 1 Failed %d",
ret);
return ret;
}

if( OMPI_SUCCESS != (ret = mca_pml_ob1_add_procs(procs, num_procs) ) ) {
opal_output(0, "pml:ob1: readd_procs: Failed in add_procs (%d)", ret);
return ret;
}

/*
* Set the STAGE 2 State
* Do the allgather exchange of information
*/
opal_output_verbose(10, ompi_cr_output,
"pml:ob1: ft_event(Restart): Enter Stage Gate 2");
if (ORTE_SUCCESS != (ret = orte_smr.set_proc_state(orte_process_info.my_name,
ORTE_PROC_STATE_AT_STG2, 0))) {
opal_output(0,"pml:ob1: ft_event(Restart): Stage Gate 1 Failed %d",
OBJ_CONSTRUCT(&rbuf, orte_buffer_t);
if (OMPI_SUCCESS != (ret = orte_grpcomm.allgather(&mdx_buf, &rbuf))) {
opal_output(0,
"pml:ob1: ft_event(Restart): Failed orte_grpcomm.allgather() = %d",
ret);
return ret;
}
OBJ_DESTRUCT(&mdx_buf);

/*
* Process the modex data into the proc structures
*/
if (OMPI_SUCCESS != (ret = ompi_modex_process_data(&rbuf))) {
opal_output(0,
"pml:ob1: ft_event(Restart): Failed ompi_modex_process_data() = %d",
ret);
return ret;
}
OBJ_DESTRUCT(&rbuf);

/*
* Fill in remote proc information
*/
if (OMPI_SUCCESS != (ret = ompi_proc_get_info())) {
opal_output(0,
"pml:ob1: ft_event(Restart): Failed ompi_proc_get_info() = %d",
ret);
return ret;
}

if (ORTE_SUCCESS != (ret = orte_grpcomm.xcast_gate(orte_gpr.deliver_notify_msg))) {
opal_output(0,"pml:ob1: ft_event(Restart): Stage Gate 1 Failed %d",
ret);
/*
* Startup the PML stack now that the modex is running again
* Add the new procs
*/
if( OMPI_SUCCESS != (ret = mca_pml_ob1_add_procs(procs, num_procs) ) ) {
opal_output(0, "pml:ob1: fr_event(Restart): Failed in add_procs (%d)", ret);
return ret;
}

/* Is this barrier necessary ? JJH */
if (OMPI_SUCCESS != (ret = orte_grpcomm.barrier())) {
opal_output(0, "pml:ob1: fr_event(Restart): Failed in orte_grpcomm.barrier (%d)", ret);
return ret;
}

ompi/proc/proc.c
@@ -104,11 +104,9 @@ void ompi_proc_destruct(ompi_proc_t* proc)
int ompi_proc_init(void)
{
orte_process_name_t *peers;
orte_std_cntr_t i, npeers, datalen;
void *data;
orte_buffer_t* buf;
uint32_t ui32;
orte_std_cntr_t i, npeers;
int rc;
uint32_t ui32;

OBJ_CONSTRUCT(&ompi_proc_list, opal_list_t);
OBJ_CONSTRUCT(&ompi_proc_lock, opal_mutex_t);
@@ -132,13 +130,43 @@ int ompi_proc_init(void)
rc = ompi_arch_compute_local_id(&ui32);
if (OMPI_SUCCESS != rc) return rc;

ompi_proc_local_proc->proc_nodeid = orte_system_info.nodeid;
ompi_proc_local_proc->proc_arch = ui32;
ompi_proc_local_proc->proc_hostname = strdup(orte_system_info.nodename);
if (ompi_mpi_keep_peer_hostnames) {
if (ompi_mpi_keep_fqdn_hostnames) {
/* use the entire FQDN name */
ompi_proc_local_proc->proc_hostname = strdup(orte_system_info.nodename);
} else {
/* use the unqualified name */
char *tmp, *ptr;
tmp = strdup(orte_system_info.nodename);
if (NULL != (ptr = strchr(tmp, '.'))) {
*ptr = '\0';
}
ompi_proc_local_proc->proc_hostname = strdup(tmp);
free(tmp);
}
}

rc = ompi_proc_publish_info();

return rc;
}

int ompi_proc_publish_info(void)
{
orte_std_cntr_t datalen;
void *data;
orte_buffer_t* buf;
int rc;

/* pack our local data for others to use */
buf = OBJ_NEW(orte_buffer_t);
rc = ompi_proc_pack(&ompi_proc_local_proc, 1, buf);
if (OMPI_SUCCESS != rc) return rc;
if (OMPI_SUCCESS != rc) {
ORTE_ERROR_LOG(rc);
return rc;
}

/* send our data into the ether */
rc = orte_dss.unload(buf, &data, &datalen);
@@ -169,6 +197,7 @@ ompi_proc_get_info(void)
char *hostname;
void *data;
size_t datalen;
orte_nodeid_t nodeid;

if (ORTE_EQUAL != orte_ns.compare_fields(ORTE_NS_CMP_JOBID,
&ompi_proc_local_proc->proc_name,
@@ -189,7 +218,7 @@ ompi_proc_get_info(void)
if (OMPI_SUCCESS != ret)
goto out;

/* This isn't needed here, but packed just so that you
/* This isn't needed here, but packed just so that you
could, in theory, use the unpack code on this proc. We
don't,because we aren't adding procs, but need to
update them */
@@ -197,23 +226,34 @@ ompi_proc_get_info(void)
if (ret != ORTE_SUCCESS)
goto out;

ret = orte_dss.unpack(buf, &arch, &count, ORTE_UINT32);
if (ret != ORTE_SUCCESS)
goto out;
ret = orte_dss.unpack(buf, &hostname, &count, ORTE_STRING);
if (ret != ORTE_SUCCESS)
ret = orte_dss.unpack(buf, &nodeid, &count, ORTE_NODEID);
if (ret != ORTE_SUCCESS) {
ORTE_ERROR_LOG(ret);
goto out;
}

ret = orte_dss.unpack(buf, &arch, &count, ORTE_UINT32);
if (ret != ORTE_SUCCESS) {
ORTE_ERROR_LOG(ret);
goto out;
}

ret = orte_dss.unpack(buf, &hostname, &count, ORTE_STRING);
if (ret != ORTE_SUCCESS) {
ORTE_ERROR_LOG(ret);
goto out;
}
/* Free the buffer for the next proc */
OBJ_RELEASE(buf);
} else if (OMPI_ERR_NOT_IMPLEMENTED == ret) {
arch = ompi_proc_local_proc->proc_arch;
hostname = strdup("");
ret = ORTE_SUCCESS;
ret = ORTE_SUCCESS;
} else {
goto out;
}


proc->proc_nodeid = nodeid;
proc->proc_arch = arch;
/* if arch is different than mine, create a new convertor for this proc */
if (proc->proc_arch != ompi_proc_local_proc->proc_arch) {
@@ -229,16 +269,14 @@ ompi_proc_get_info(void)
ret = OMPI_ERR_NOT_SUPPORTED;
goto out;
#endif
} else if (0 == strcmp(hostname, orte_system_info.nodename)) {
}
if (ompi_proc_local_proc->proc_nodeid == proc->proc_nodeid) {
proc->proc_flags |= OMPI_PROC_FLAG_LOCAL;
}

/* Save the hostname */
if (ompi_mpi_keep_peer_hostnames) {
/* the dss code will have strdup'ed this for us -- no need
to do so again */
proc->proc_hostname = hostname;
}
/* Save the hostname. The dss code will have strdup'ed this
for us -- no need to do so again */
proc->proc_hostname = hostname;
}

out:
@@ -415,16 +453,25 @@ ompi_proc_pack(ompi_proc_t **proclist, int proclistsize, orte_buffer_t* buf)
for (i=0; i<proclistsize; i++) {
rc = orte_dss.pack(buf, &(proclist[i]->proc_name), 1, ORTE_NAME);
if(rc != ORTE_SUCCESS) {
ORTE_ERROR_LOG(rc);
OPAL_THREAD_UNLOCK(&ompi_proc_lock);
return rc;
}
rc = orte_dss.pack(buf, &(proclist[i]->proc_nodeid), 1, ORTE_NODEID);
if(rc != ORTE_SUCCESS) {
ORTE_ERROR_LOG(rc);
OPAL_THREAD_UNLOCK(&ompi_proc_lock);
return rc;
}
rc = orte_dss.pack(buf, &(proclist[i]->proc_arch), 1, ORTE_UINT32);
if(rc != ORTE_SUCCESS) {
ORTE_ERROR_LOG(rc);
OPAL_THREAD_UNLOCK(&ompi_proc_lock);
return rc;
}
rc = orte_dss.pack(buf, &(proclist[i]->proc_hostname), 1, ORTE_STRING);
if(rc != ORTE_SUCCESS) {
ORTE_ERROR_LOG(rc);
OPAL_THREAD_UNLOCK(&ompi_proc_lock);
return rc;
}
@@ -435,7 +482,9 @@ ompi_proc_pack(ompi_proc_t **proclist, int proclistsize, orte_buffer_t* buf)


int
ompi_proc_unpack(orte_buffer_t* buf, int proclistsize, ompi_proc_t ***proclist)
ompi_proc_unpack(orte_buffer_t* buf,
int proclistsize, ompi_proc_t ***proclist,
int *newproclistsize, ompi_proc_t ***newproclist)
{
int i;
size_t newprocs_len = 0;
@@ -460,17 +509,26 @@ ompi_proc_unpack(orte_buffer_t* buf, int proclistsize, ompi_proc_t ***proclist)
char *new_hostname;
bool isnew = false;
int rc;
orte_nodeid_t new_nodeid;

rc = orte_dss.unpack(buf, &new_name, &count, ORTE_NAME);
if (rc != ORTE_SUCCESS) {
ORTE_ERROR_LOG(rc);
return rc;
}
rc = orte_dss.unpack(buf, &new_nodeid, &count, ORTE_NODEID);
if (rc != ORTE_SUCCESS) {
ORTE_ERROR_LOG(rc);
return rc;
}
rc = orte_dss.unpack(buf, &new_arch, &count, ORTE_UINT32);
if (rc != ORTE_SUCCESS) {
ORTE_ERROR_LOG(rc);
return rc;
}
rc = orte_dss.unpack(buf, &new_hostname, &count, ORTE_STRING);
if (rc != ORTE_SUCCESS) {
ORTE_ERROR_LOG(rc);
return rc;
}

@@ -478,6 +536,7 @@ ompi_proc_unpack(orte_buffer_t* buf, int proclistsize, ompi_proc_t ***proclist)
if (isnew) {
newprocs[newprocs_len++] = plist[i];

plist[i]->proc_nodeid = new_nodeid;
plist[i]->proc_arch = new_arch;

/* if arch is different than mine, create a new convertor for this proc */
@@ -494,16 +553,21 @@ ompi_proc_unpack(orte_buffer_t* buf, int proclistsize, ompi_proc_t ***proclist)
return OMPI_ERR_NOT_SUPPORTED;
#endif
}
if (ompi_proc_local_proc->proc_nodeid == plist[i]->proc_nodeid) {
plist[i]->proc_flags |= OMPI_PROC_FLAG_LOCAL;
}

/* Save the hostname */
if (ompi_mpi_keep_peer_hostnames) {
plist[i]->proc_hostname = new_hostname;
}
plist[i]->proc_hostname = new_hostname;
}
}

if (newprocs_len > 0) MCA_PML_CALL(add_procs(newprocs, newprocs_len));
if (newprocs != NULL) free(newprocs);
if (NULL != newproclistsize) *newproclistsize = newprocs_len;
if (NULL != newproclist) {
*newproclist = newprocs;
} else if (newprocs != NULL) {
free(newprocs);
}

*proclist = plist;
return OMPI_SUCCESS;
@@ -54,6 +54,8 @@ struct ompi_proc_t {
opal_list_item_t super;
/** this process' name */
orte_process_name_t proc_name;
/** "nodeid" on which the proc resides */
orte_nodeid_t proc_nodeid;
/** PML specific proc data */
struct mca_pml_base_endpoint_t* proc_pml;
/** BML specific proc data */
@@ -119,6 +121,23 @@ OMPI_DECLSPEC extern ompi_proc_t* ompi_proc_local_proc;
*/
int ompi_proc_init(void);

/**
* Publish local process information
*
* Used by ompi_proc_init() and elsewhere in the code to refresh any
* local information not easily determined by the run-time ahead of time
* (architecture and hostname).
*
* @note While an ompi_proc_t will exist with mostly valid information
* for each process in the MPI_COMM_WORLD at the conclusion of this
* call, some information will not be immediately available. This
* includes the architecture and hostname, which will be available by
* the conclusion of the stage gate.
*
* @retval OMPI_SUCESS Information available in the modex
* @retval OMPI_ERRROR Failure due to unspecified error
*/
int ompi_proc_publish_info(void);

/**
* Get data exchange information from remote processes
@@ -267,18 +286,37 @@ int ompi_proc_pack(ompi_proc_t **proclist, int proclistsize,
* provided in the buffer. The lookup actions are always entirely
* local. The proclist returned is a list of pointers to all procs in
* the buffer, whether they were previously known or are new to this
* process. PML_ADD_PROCS will be called on the list of new processes
* discovered during this operation.
* process.
*
* @note In previous versions of this function, The PML's add_procs()
* function was called for any new processes discovered as a result of
* this operation. That is no longer the case -- the caller must use
* the newproclist information to call add_procs() if necessary.
*
* @note The reference count for procs created as a result of this
* operation will be set to 1. Existing procs will not have their
* reference count changed. The reference count of a proc at the
* return of this function is the same regardless of whether NULL is
* provided for newproclist. The user is responsible for freeing the
* newproclist array.
*
* @param[in] buf orte_buffer containing the packed names
* @param[in] proclistsize number of expected proc-pointres
* @param[out] proclist list of process pointers
* @param[out] newproclistsize Number of new procs added as a result
* of the unpack operation. NULL may be
* provided if information is not needed.
* @param[out] newproclist List of new procs added as a result of
* the unpack operation. NULL may be
* provided if informationis not needed.
*
* Return value:
* OMPI_SUCCESS on success
* OMPI_ERROR else
*/
int ompi_proc_unpack(orte_buffer_t *buf, int proclistsize, ompi_proc_t ***proclist);
int ompi_proc_unpack(orte_buffer_t *buf,
int proclistsize, ompi_proc_t ***proclist,
int *newproclistsize, ompi_proc_t ***newproclist);


END_C_DECLS
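Given the documentation above, a minimal (hypothetical) caller of the new ompi_proc_unpack() signature would look like the following sketch - `buf` and `nprocs` are assumed to be in scope, and per the @note the add_procs call is now the caller's responsibility:

    orte_buffer_t *buf;        /* assumed: filled earlier by ompi_proc_pack() */
    int nprocs;                /* assumed: number of procs packed into buf */
    ompi_proc_t **procs, **new_procs;
    int new_len, rc;

    rc = ompi_proc_unpack(buf, nprocs, &procs, &new_len, &new_procs);
    if (OMPI_SUCCESS == rc && new_len > 0) {
        /* unpack no longer calls the PML for us */
        MCA_PML_CALL(add_procs(new_procs, new_len));
    }
    if (NULL != new_procs) {
        free(new_procs);   /* caller owns the newproclist array */
    }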
@@ -48,6 +48,7 @@
#include "orte/util/proc_info.h"
#include "orte/mca/snapc/snapc.h"
#include "orte/mca/snapc/base/base.h"
#include "orte/mca/smr/smr.h"

#include "ompi/constants.h"
#include "ompi/mca/pml/pml.h"
@@ -334,6 +335,12 @@ static int ompi_cr_coord_post_restart(void) {
opal_output_verbose(10, ompi_cr_output,
"ompi_cr: coord_post_restart: ompi_cr_coord_post_restart()");

/* register myself to require that I finalize before exiting */
if (ORTE_SUCCESS != (ret = orte_smr.register_sync())) {
exit_status = ret;
goto cleanup;
}

/*
* Notify PML
* - Will notify BML and BTL's
@@ -152,7 +152,7 @@ ompi_modex_destruct(ompi_modex_proc_data_t * modex)
}

OBJ_CLASS_INSTANCE(ompi_modex_proc_data_t, opal_object_t,
ompi_modex_construct, ompi_modex_destruct);
ompi_modex_construct, ompi_modex_destruct);



@@ -196,15 +196,15 @@ ompi_modex_module_destruct(ompi_modex_module_data_t * module)
{
opal_list_item_t *item;
while (NULL != (item = opal_list_remove_first(&module->module_cbs))) {
OBJ_RELEASE(item);
OBJ_RELEASE(item);
}
OBJ_DESTRUCT(&module->module_cbs);
}

OBJ_CLASS_INSTANCE(ompi_modex_module_data_t,
opal_list_item_t,
ompi_modex_module_construct,
ompi_modex_module_destruct);
opal_list_item_t,
ompi_modex_module_construct,
ompi_modex_module_destruct);

/**
* Callback data for modex updates
@@ -220,43 +220,12 @@ struct ompi_modex_cb_t {
typedef struct ompi_modex_cb_t ompi_modex_cb_t;

OBJ_CLASS_INSTANCE(ompi_modex_cb_t,
opal_list_item_t,
NULL,
NULL);
opal_list_item_t,
NULL,
NULL);



/**
* Container for segment subscription data
*
* Track segments we have subscribed to. Any jobid segment we are
* subscribed to for updates will have one of these containers,
* hopefully put on the ompi_modex_subscriptions list.
*/
struct ompi_modex_subscription_t {
opal_list_item_t item;
orte_jobid_t jobid;
};
typedef struct ompi_modex_subscription_t ompi_modex_subscription_t;

OBJ_CLASS_INSTANCE(ompi_modex_subscription_t,
opal_list_item_t,
NULL,
NULL);


/**
* Global modex list for tracking subscriptions
*
* A list of ompi_modex_subscription_t structures, each representing a
* jobid to which we have subscribed for modex updates.
*
* \note The ompi_modex_lock mutex should be held whenever this list
* is being updated or searched.
*/
static opal_list_t ompi_modex_subscriptions;


/**
* Global modex list of proc data
*
@@ -278,14 +247,24 @@ static opal_mutex_t ompi_modex_lock;

static opal_mutex_t ompi_modex_string_lock;

/*
* Global buffer we use to collect modex info for later
* transmission
*/
static orte_buffer_t ompi_modex_buffer;
static orte_std_cntr_t ompi_modex_num_entries;


int
ompi_modex_init(void)
{
OBJ_CONSTRUCT(&ompi_modex_data, opal_hash_table_t);
OBJ_CONSTRUCT(&ompi_modex_subscriptions, opal_list_t);
OBJ_CONSTRUCT(&ompi_modex_lock, opal_mutex_t);
OBJ_CONSTRUCT(&ompi_modex_string_lock, opal_mutex_t);


OBJ_CONSTRUCT(&ompi_modex_buffer, orte_buffer_t);
ompi_modex_num_entries = 0;

opal_hash_table_init(&ompi_modex_data, 256);

return OMPI_SUCCESS;
@@ -295,17 +274,12 @@ ompi_modex_init(void)
int
ompi_modex_finalize(void)
{
opal_list_item_t *item;

opal_hash_table_remove_all(&ompi_modex_data);
OBJ_DESTRUCT(&ompi_modex_data);

while (NULL != (item = opal_list_remove_first(&ompi_modex_subscriptions)))
OBJ_RELEASE(item);
OBJ_DESTRUCT(&ompi_modex_subscriptions);

OBJ_DESTRUCT(&ompi_modex_string_lock);
OBJ_DESTRUCT(&ompi_modex_lock);
OBJ_DESTRUCT(&ompi_modex_buffer);

return OMPI_SUCCESS;
}
@@ -326,11 +300,11 @@ ompi_modex_lookup_module(ompi_modex_proc_data_t *proc_data,
{
ompi_modex_module_data_t *module_data = NULL;
for (module_data = (ompi_modex_module_data_t *) opal_list_get_first(&proc_data->modex_module_data);
module_data != (ompi_modex_module_data_t *) opal_list_get_end(&proc_data->modex_module_data);
module_data = (ompi_modex_module_data_t *) opal_list_get_next(module_data)) {
if (mca_base_component_compatible(&module_data->component, component) == 0) {
return module_data;
}
module_data != (ompi_modex_module_data_t *) opal_list_get_end(&proc_data->modex_module_data);
module_data = (ompi_modex_module_data_t *) opal_list_get_next(module_data)) {
if (mca_base_component_compatible(&module_data->component, component) == 0) {
return module_data;
}
}

if (create_if_not_found) {
@@ -365,7 +339,7 @@ ompi_modex_lookup_orte_proc(orte_process_name_t *orte_proc)
orte_hash_table_get_proc(&ompi_modex_data, orte_proc);
if (NULL == proc_data) {
/* The proc clearly exists, so create a modex structure
for it and try to subscribe */
for it */
proc_data = OBJ_NEW(ompi_modex_proc_data_t);
if (NULL == proc_data) {
opal_output(0, "ompi_modex_lookup_orte_proc: unable to allocate ompi_modex_proc_data_t\n");
@@ -403,8 +377,6 @@ ompi_modex_lookup_proc(ompi_proc_t *proc)
OBJ_RETAIN(proc_data);
proc->proc_modex = &proc_data->super.super;
OPAL_THREAD_UNLOCK(&ompi_modex_lock);
/* verify that we have subscribed to this segment */
ompi_modex_subscribe_job(proc->proc_name.jobid);
} else {
OPAL_THREAD_UNLOCK(&ompi_modex_lock);
}
@@ -415,345 +387,233 @@ ompi_modex_lookup_proc(ompi_proc_t *proc)


/**
* Callback for registry notifications.
* Get the local buffer's data
*/
static void
ompi_modex_registry_callback(orte_gpr_notify_data_t * data,
void *cbdata)
int
ompi_modex_get_my_buffer(orte_buffer_t *buf)
{
orte_std_cntr_t i, j, k;
orte_gpr_value_t **values, *value;
orte_gpr_keyval_t **keyval;
orte_process_name_t *proc_name;
int rc;

OPAL_THREAD_LOCK(&ompi_modex_lock);
/* put our process name in the buffer so it can be unpacked later */
if (ORTE_SUCCESS != (rc = orte_dss.pack(buf, ORTE_PROC_MY_NAME, 1, ORTE_NAME))) {
ORTE_ERROR_LOG(rc);
OPAL_THREAD_UNLOCK(&ompi_modex_lock);
return rc;
}

/* put the number of entries into the buffer */
if (ORTE_SUCCESS != (rc = orte_dss.pack(buf, &ompi_modex_num_entries, 1, ORTE_STD_CNTR))) {
ORTE_ERROR_LOG(rc);
OPAL_THREAD_UNLOCK(&ompi_modex_lock);
return rc;
}

/* if there are entries, copy the data across */
if (0 < ompi_modex_num_entries) {
if (ORTE_SUCCESS != (orte_dss.copy_payload(buf, &ompi_modex_buffer))) {
ORTE_ERROR_LOG(rc);
OPAL_THREAD_UNLOCK(&ompi_modex_lock);
return rc;
}
}

OPAL_THREAD_UNLOCK(&ompi_modex_lock);
return ORTE_SUCCESS;
}

/**
* Process modex data
*/
int
ompi_modex_process_data(orte_buffer_t *buf)
{
orte_std_cntr_t i, j, num_procs, num_entries;
opal_list_item_t *item;
void *bytes = NULL;
orte_std_cntr_t cnt;
orte_process_name_t proc_name;
ompi_modex_proc_data_t *proc_data;
ompi_modex_module_data_t *module_data;
mca_base_component_t component;
int rc;

/* process the callback */
values = (orte_gpr_value_t **) (data->values)->addr;
for (i = 0, k = 0; k < data->cnt &&
i < (data->values)->size; i++) {
if (NULL != values[i]) {
k++;
value = values[i];
if (0 < value->cnt) { /* needs to be at least one keyval */
/* Find the process name in the keyvals */
keyval = value->keyvals;
for (j = 0; j < value->cnt; j++) {
if (0 != strcmp(keyval[j]->key, ORTE_PROC_NAME_KEY)) continue;
/* this is the process name - extract it */
if (ORTE_SUCCESS != orte_dss.get((void**)&proc_name, keyval[j]->value, ORTE_NAME)) {
opal_output(0, "ompi_modex_registry_callback: unable to extract process name\n");
return; /* nothing we can do */
}
goto GOTNAME;
/* extract the number of entries in the buffer */
cnt=1;
if (ORTE_SUCCESS != (rc = orte_dss.unpack(buf, &num_procs, &cnt, ORTE_STD_CNTR))) {
ORTE_ERROR_LOG(rc);
return rc;
}

/* process the buffer */
for (i=0; i < num_procs; i++) {
/* unpack the process name */
cnt=1;
if (ORTE_SUCCESS != (rc = orte_dss.unpack(buf, &proc_name, &cnt, ORTE_NAME))) {
ORTE_ERROR_LOG(rc);
return rc;
}

/* look up the modex data structure */
proc_data = ompi_modex_lookup_orte_proc(&proc_name);
if (proc_data == NULL) {
/* report the error */
opal_output(0, "ompi_modex_process_data: received modex info for unknown proc %s\n",
ORTE_NAME_PRINT(&proc_name));
return OMPI_ERR_NOT_FOUND;
}

/* unpack the number of entries for this proc */
cnt=1;
if (ORTE_SUCCESS != (rc = orte_dss.unpack(buf, &num_entries, &cnt, ORTE_STD_CNTR))) {
ORTE_ERROR_LOG(rc);
return rc;
}

OPAL_THREAD_LOCK(&proc_data->modex_lock);

/*
* Extract the component name and version - since there is one for each
* component type/name/version - process them all
*/
for (j = 0; j < num_entries; j++) {
size_t num_bytes;
char *ptr;

cnt = 1;
if (ORTE_SUCCESS != (rc = orte_dss.unpack(buf, &ptr, &cnt, ORTE_STRING))) {
ORTE_ERROR_LOG(rc);
return rc;
}
strcpy(component.mca_type_name, ptr);
free(ptr);

cnt = 1;
if (ORTE_SUCCESS != (rc = orte_dss.unpack(buf, &ptr, &cnt, ORTE_STRING))) {
ORTE_ERROR_LOG(rc);
return rc;
}
strcpy(component.mca_component_name, ptr);
free(ptr);

cnt = 1;
if (ORTE_SUCCESS != (rc = orte_dss.unpack(buf,
&component.mca_component_major_version, &cnt, ORTE_INT32))) {
ORTE_ERROR_LOG(rc);
return rc;
}
cnt = 1;
if (ORTE_SUCCESS != (rc = orte_dss.unpack(buf,
&component.mca_component_minor_version, &cnt, ORTE_INT32))) {
ORTE_ERROR_LOG(rc);
return rc;
}
cnt = 1;
if (ORTE_SUCCESS != (rc = orte_dss.unpack(buf, &num_bytes, &cnt, ORTE_SIZE))) {
ORTE_ERROR_LOG(rc);
return rc;
}
if (num_bytes != 0) {
if (NULL == (bytes = malloc(num_bytes))) {
ORTE_ERROR_LOG(ORTE_ERR_OUT_OF_RESOURCE);
return ORTE_ERR_OUT_OF_RESOURCE;
}
opal_output(0, "ompi_modex_registry_callback: unable to find process name in notify message\n");
return; /* if the name wasn't here, there is nothing we can do */
GOTNAME:
/* look up the modex data structure */
proc_data = ompi_modex_lookup_orte_proc(proc_name);
if (proc_data == NULL) continue;

OPAL_THREAD_LOCK(&proc_data->modex_lock);

/*
* Extract the component name and version from the keyval object's key
* Could be multiple keyvals returned since there is one for each
* component type/name/version - process them all
*/
keyval = value->keyvals;
for (j = 0; j < value->cnt; j++) {
orte_buffer_t buffer;
opal_list_item_t *item;
char *ptr;
void *bytes = NULL;
orte_std_cntr_t cnt;
size_t num_bytes;
orte_byte_object_t *bo;

if (strcmp(keyval[j]->key, OMPI_MODEX_KEY) != 0)
continue;

OBJ_CONSTRUCT(&buffer, orte_buffer_t);
if (ORTE_SUCCESS != (rc = orte_dss.get((void **) &bo, keyval[j]->value, ORTE_BYTE_OBJECT))) {
ORTE_ERROR_LOG(rc);
continue;
}
if (ORTE_SUCCESS != (rc = orte_dss.load(&buffer, bo->bytes, bo->size))) {
ORTE_ERROR_LOG(rc);
continue;
}
cnt = 1;
if (ORTE_SUCCESS != (rc = orte_dss.unpack(&buffer, &ptr, &cnt, ORTE_STRING))) {
ORTE_ERROR_LOG(rc);
continue;
}
strncpy(component.mca_type_name, ptr, MCA_BASE_MAX_COMPONENT_NAME_LEN);
free(ptr);

cnt = 1;
if (ORTE_SUCCESS != (rc = orte_dss.unpack(&buffer, &ptr, &cnt, ORTE_STRING))) {
ORTE_ERROR_LOG(rc);
continue;
}
strncpy(component.mca_component_name, ptr, MCA_BASE_MAX_COMPONENT_NAME_LEN);
free(ptr);

cnt = 1;
if (ORTE_SUCCESS != (rc = orte_dss.unpack(&buffer,
&component.mca_component_major_version, &cnt, ORTE_INT32))) {
ORTE_ERROR_LOG(rc);
continue;
}
cnt = 1;
if (ORTE_SUCCESS != (rc = orte_dss.unpack(&buffer,
&component.mca_component_minor_version, &cnt, ORTE_INT32))) {
ORTE_ERROR_LOG(rc);
continue;
}
cnt = 1;
if (ORTE_SUCCESS != (rc = orte_dss.unpack(&buffer, &num_bytes, &cnt, ORTE_SIZE))) {
ORTE_ERROR_LOG(rc);
continue;
}
if (num_bytes != 0) {
if (NULL == (bytes = malloc(num_bytes))) {
ORTE_ERROR_LOG(ORTE_ERR_OUT_OF_RESOURCE);
continue;
}
cnt = (orte_std_cntr_t) num_bytes;
if (ORTE_SUCCESS != (rc = orte_dss.unpack(&buffer, bytes, &cnt, ORTE_BYTE))) {
ORTE_ERROR_LOG(rc);
continue;
}
num_bytes = cnt;
} else {
bytes = NULL;
}

/*
* Lookup the corresponding modex structure
*/
if (NULL == (module_data = ompi_modex_lookup_module(proc_data,
&component,
true))) {
opal_output(0, "ompi_modex_registry_callback: ompi_modex_lookup_module failed\n");
OBJ_RELEASE(data);
OPAL_THREAD_UNLOCK(&proc_data->modex_lock);
return;
}
module_data->module_data = bytes;
module_data->module_data_size = num_bytes;
proc_data->modex_received_data = true;
opal_condition_signal(&proc_data->modex_cond);

if (opal_list_get_size(&module_data->module_cbs)) {
ompi_proc_t *proc = ompi_proc_find(proc_name);

if (NULL != proc) {
OPAL_THREAD_LOCK(&proc->proc_lock);
/* call any registered callbacks */
for (item = opal_list_get_first(&module_data->module_cbs);
item != opal_list_get_end(&module_data->module_cbs);
item = opal_list_get_next(item)) {
ompi_modex_cb_t *cb = (ompi_modex_cb_t *) item;
cb->cbfunc(&module_data->component,
proc, bytes, num_bytes, cb->cbdata);
}
OPAL_THREAD_UNLOCK(&proc->proc_lock);
}
}
cnt = (orte_std_cntr_t) num_bytes;
if (ORTE_SUCCESS != (rc = orte_dss.unpack(buf, bytes, &cnt, ORTE_BYTE))) {
ORTE_ERROR_LOG(rc);
return rc;
}
num_bytes = cnt;
} else {
bytes = NULL;
}

/*
* Lookup the corresponding modex structure
*/
if (NULL == (module_data = ompi_modex_lookup_module(proc_data,
&component,
true))) {
opal_output(0, "ompi_modex_process_data: ompi_modex_lookup_module failed\n");
OPAL_THREAD_UNLOCK(&proc_data->modex_lock);
} /* if value[i]->cnt > 0 */
} /* if value[i] != NULL */
}
}


int
ompi_modex_subscribe_job(orte_jobid_t jobid)
{
char *segment, *sub_name, *trig_name;
orte_gpr_subscription_id_t sub_id;
opal_list_item_t *item;
ompi_modex_subscription_t *subscription;
int rc;
char *keys[] = {
ORTE_PROC_NAME_KEY,
OMPI_MODEX_KEY,
NULL
};

/* check for an existing subscription */
OPAL_THREAD_LOCK(&ompi_modex_lock);
for (item = opal_list_get_first(&ompi_modex_subscriptions) ;
item != opal_list_get_end(&ompi_modex_subscriptions) ;
item = opal_list_get_next(item)) {
subscription = (ompi_modex_subscription_t *) item;
if (subscription->jobid == jobid) {
OPAL_THREAD_UNLOCK(&ompi_modex_lock);
return OMPI_SUCCESS;
}
}
OPAL_THREAD_UNLOCK(&ompi_modex_lock);

/* otherwise - subscribe to get this jobid's contact info */
if (ORTE_SUCCESS != (rc = orte_schema.get_std_subscription_name(&sub_name,
OMPI_MODEX_SUBSCRIPTION, jobid))) {
ORTE_ERROR_LOG(rc);
return rc;
}
/* attach to the stage-1 standard trigger */
if (ORTE_SUCCESS != (rc = orte_schema.get_std_trigger_name(&trig_name,
ORTE_STG1_TRIGGER, jobid))) {
ORTE_ERROR_LOG(rc);
free(sub_name);
return rc;
}
/* define the segment */
if (ORTE_SUCCESS != (rc = orte_schema.get_job_segment_name(&segment, jobid))) {
ORTE_ERROR_LOG(rc);
free(sub_name);
free(trig_name);
return rc;
}
if (jobid != orte_process_info.my_name->jobid) {
if (ORTE_SUCCESS != (rc = orte_gpr.subscribe_N(&sub_id, NULL, NULL,
ORTE_GPR_NOTIFY_ADD_ENTRY |
ORTE_GPR_NOTIFY_VALUE_CHG |
ORTE_GPR_NOTIFY_PRE_EXISTING,
ORTE_GPR_KEYS_OR | ORTE_GPR_TOKENS_OR | ORTE_GPR_STRIPPED,
segment,
NULL, /* look at all
* containers on this
* segment */
2, keys,
ompi_modex_registry_callback, NULL))) {
ORTE_ERROR_LOG(rc);
free(sub_name);
free(trig_name);
free(segment);
return rc;
}
} else {
if (ORTE_SUCCESS != (rc = orte_gpr.subscribe_N(&sub_id, trig_name, sub_name,
ORTE_GPR_NOTIFY_ADD_ENTRY |
ORTE_GPR_NOTIFY_VALUE_CHG |
ORTE_GPR_NOTIFY_STARTS_AFTER_TRIG,
ORTE_GPR_KEYS_OR | ORTE_GPR_TOKENS_OR | ORTE_GPR_STRIPPED,
segment,
NULL, /* look at all
* containers on this
* segment */
2, keys,
ompi_modex_registry_callback, NULL))) {
ORTE_ERROR_LOG(rc);
free(sub_name);
free(trig_name);
free(segment);
return rc;
return OMPI_ERR_NOT_FOUND;
}
module_data->module_data = bytes;
module_data->module_data_size = num_bytes;
proc_data->modex_received_data = true;
opal_condition_signal(&proc_data->modex_cond);

if (opal_list_get_size(&module_data->module_cbs)) {
ompi_proc_t *proc = ompi_proc_find(&proc_name);

if (NULL != proc) {
OPAL_THREAD_LOCK(&proc->proc_lock);
/* call any registered callbacks */
for (item = opal_list_get_first(&module_data->module_cbs);
item != opal_list_get_end(&module_data->module_cbs);
item = opal_list_get_next(item)) {
ompi_modex_cb_t *cb = (ompi_modex_cb_t *) item;
cb->cbfunc(&module_data->component,
proc, bytes, num_bytes, cb->cbdata);
}
OPAL_THREAD_UNLOCK(&proc->proc_lock);
}
}
}
OPAL_THREAD_UNLOCK(&proc_data->modex_lock);
}
free(sub_name);
free(trig_name);
free(segment);

/* add this jobid to our list of subscriptions */
OPAL_THREAD_LOCK(&ompi_modex_lock);
subscription = OBJ_NEW(ompi_modex_subscription_t);
subscription->jobid = jobid;
opal_list_append(&ompi_modex_subscriptions, &subscription->item);
OPAL_THREAD_UNLOCK(&ompi_modex_lock);

return OMPI_SUCCESS;
}


int
ompi_modex_send(mca_base_component_t * source_component,
const void *data,
size_t size)
const void *data,
size_t size)
{
orte_jobid_t jobid;
int rc;
orte_buffer_t buffer;
orte_std_cntr_t i, num_tokens;
char *ptr, *segment, **tokens;
orte_byte_object_t bo;
orte_data_value_t value = ORTE_DATA_VALUE_EMPTY;
char *ptr;

/* get location in GPR for the data */
jobid = ORTE_PROC_MY_NAME->jobid;
if (ORTE_SUCCESS != (rc = orte_schema.get_job_segment_name(&segment, jobid))) {
ORTE_ERROR_LOG(rc);
return rc;
}
if (ORTE_SUCCESS != (rc = orte_schema.get_proc_tokens(&tokens,
&num_tokens, orte_process_info.my_name))) {
ORTE_ERROR_LOG(rc);
free(segment);
return rc;
}

OBJ_CONSTRUCT(&buffer, orte_buffer_t);

/* Pack the component name information into the buffer */
OPAL_THREAD_LOCK(&ompi_modex_lock);

/* Pack the component name information into the local buffer */
ptr = source_component->mca_type_name;
if (ORTE_SUCCESS != (rc = orte_dss.pack(&buffer, &ptr, 1, ORTE_STRING))) {
ORTE_ERROR_LOG(rc);
goto cleanup;
if (ORTE_SUCCESS != (rc = orte_dss.pack(&ompi_modex_buffer, &ptr, 1, ORTE_STRING))) {
ORTE_ERROR_LOG(rc);
goto cleanup;
}
ptr = source_component->mca_component_name;
if (ORTE_SUCCESS != (rc = orte_dss.pack(&buffer, &ptr, 1, ORTE_STRING))) {
ORTE_ERROR_LOG(rc);
goto cleanup;
if (ORTE_SUCCESS != (rc = orte_dss.pack(&ompi_modex_buffer, &ptr, 1, ORTE_STRING))) {
ORTE_ERROR_LOG(rc);
goto cleanup;
}
if (ORTE_SUCCESS != (rc = orte_dss.pack(&buffer, &source_component->mca_component_major_version, 1, ORTE_INT32))) {
ORTE_ERROR_LOG(rc);
goto cleanup;
if (ORTE_SUCCESS != (rc = orte_dss.pack(&ompi_modex_buffer, &source_component->mca_component_major_version, 1, ORTE_INT32))) {
ORTE_ERROR_LOG(rc);
goto cleanup;
}
if (ORTE_SUCCESS != (rc = orte_dss.pack(&buffer, &source_component->mca_component_minor_version, 1, ORTE_INT32))) {
ORTE_ERROR_LOG(rc);
goto cleanup;
if (ORTE_SUCCESS != (rc = orte_dss.pack(&ompi_modex_buffer, &source_component->mca_component_minor_version, 1, ORTE_INT32))) {
ORTE_ERROR_LOG(rc);
goto cleanup;
}
if (ORTE_SUCCESS != (rc = orte_dss.pack(&buffer, &size, 1, ORTE_SIZE))) {
ORTE_ERROR_LOG(rc);
goto cleanup;
if (ORTE_SUCCESS != (rc = orte_dss.pack(&ompi_modex_buffer, &size, 1, ORTE_SIZE))) {
ORTE_ERROR_LOG(rc);
goto cleanup;
}

/* Pack the actual data into the buffer */
if (0 != size) {
if (ORTE_SUCCESS != (rc = orte_dss.pack(&buffer, (void *) data, size, ORTE_BYTE))) {
ORTE_ERROR_LOG(rc);
goto cleanup;
}
if (ORTE_SUCCESS != (rc = orte_dss.pack(&ompi_modex_buffer, (void *) data, size, ORTE_BYTE))) {
ORTE_ERROR_LOG(rc);
goto cleanup;
}
}
if (ORTE_SUCCESS != (rc = orte_dss.unload(&buffer, (void **) &(bo.bytes), &(bo.size)))) {
ORTE_ERROR_LOG(rc);
goto cleanup;
}
OBJ_DESTRUCT(&buffer);
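For reference, here is the wire layout of the exchanged modex data as it can be read from the pack/unpack calls above. This is a summary of the diff, not new code; the note about the leading proc count being supplied by the allgather step is an inference from ompi_modex_process_data():

    /* Collected modex buffer, as consumed by ompi_modex_process_data():
     *
     *   ORTE_STD_CNTR   number of contributing procs (apparently supplied
     *                   by the grpcomm allgather step)
     *   then, per proc (built by ompi_modex_get_my_buffer()):
     *     ORTE_NAME      the proc's name
     *     ORTE_STD_CNTR  number of modex entries for that proc
     *     per entry (appended by ompi_modex_send()):
     *       ORTE_STRING  component type name (e.g., "btl")
     *       ORTE_STRING  component name
     *       ORTE_INT32   component major version
     *       ORTE_INT32   component minor version
     *       ORTE_SIZE    number of payload bytes
     *       ORTE_BYTE[]  the component's payload
     */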