/* -*- Mode: C; c-basic-offset:4 ; -*- */
/*
 * Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
 *                         University Research and Technology
 *                         Corporation. All rights reserved.
 * Copyright (c) 2004-2007 The University of Tennessee and The University
 *                         of Tennessee Research Foundation. All rights
 *                         reserved.
 * Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
 *                         University of Stuttgart. All rights reserved.
 * Copyright (c) 2004-2005 The Regents of the University of California.
 *                         All rights reserved.
 * Copyright (c) 2006-2007 University of Houston. All rights reserved.
 * Copyright (c) 2006-2007 Los Alamos National Security, LLC. All rights
 *                         reserved.
 * Copyright (c) 2007      Cisco, Inc. All rights reserved.
 *
 * $COPYRIGHT$
 *
 * Additional copyrights may follow
 *
 * $HEADER$
 */

#include "ompi_config.h"
#include <string.h>
#include <stdio.h>
#ifdef HAVE_SYS_UIO_H
#include <sys/uio.h>
#endif
#ifdef HAVE_NET_UIO_H
#include <net/uio.h>
#endif
#ifdef HAVE_UNISTD_H
#include <unistd.h>
#endif
#ifdef HAVE_SYS_TIME_H
#include <sys/time.h>
#endif  /* HAVE_SYS_TIME_H */

#include "opal/util/opal_environ.h"
#include "opal/util/printf.h"
#include "opal/util/convert.h"
#include "opal/threads/mutex.h"
#include "opal/util/bit_ops.h"
#include "opal/util/argv.h"

#include "ompi/communicator/communicator.h"
#include "ompi/request/request.h"
#include "ompi/errhandler/errhandler.h"
#include "ompi/proc/proc.h"
#include "ompi/info/info.h"
#include "ompi/constants.h"
#include "ompi/mca/pml/pml.h"
These changes were mostly captured in a prior RFC (except for #2 below) and are aimed specifically at improving startup performance and setting up the remaining modifications described in that RFC.
The commit has been tested for C/R and Cray operations, and on Odin (SLURM, rsh) and RoadRunner (TM). I tried to update all environments, but obviously could not test them. I know that Windows needs some work, and have highlighted what is known to be needed in the odls process component.
This represents a lot of work by Brian, Tim P, Josh, and myself, with much advice from Jeff and others. For posterity, I have appended a copy of the email describing the work that was done:
As we have repeatedly noted, the modex operation in MPI_Init is the single greatest consumer of time during startup. To-date, we have executed that operation as an ORTE stage gate that held the process until a startup message containing all required modex (and OOB contact info - see #3 below) info could be sent to it. Each process would send its data to the HNP's registry, which assembled and sent the message when all processes had reported in.
In addition, ORTE had taken responsibility for monitoring process status as it progressed through a series of "stage gates". The process reported its status at each gate, and ORTE would then send a "release" message once all procs had reported in.
The incoming changes revamp these procedures in three ways:
1. eliminating the ORTE stage gate system and cleanly delineating responsibility between the OMPI and ORTE layers for MPI init/finalize. The modex stage gate (STG1) has been replaced by a collective operation in the modex itself that performs an allgather on the required modex info. The allgather is implemented using the orte_grpcomm framework since the BTL's are not active at that point. At the moment, the grpcomm framework only has a "basic" component analogous to OMPI's "basic" coll framework - I would recommend that the MPI team create additional, more advanced components to improve performance of this step.
The other stage gates have been replaced by orte_grpcomm barrier functions. We tried to use MPI barriers instead (since the BTL's are active at that point), but - as we discussed on the telecon - these are not currently true barriers so the job would hang when we fell through while messages were still in process. Note that the grpcomm barrier doesn't actually resolve that problem, but Brian has pointed out that we are unlikely to ever see it violated. Again, you might want to spend a little time on an advanced barrier algorithm as the one in "basic" is very simplistic.
Summarizing this change: ORTE no longer tracks process state nor has direct responsibility for synchronizing jobs. This is now done via collective operations within the MPI layer, albeit using ORTE collective communication services. I -strongly- urge the MPI team to implement advanced collective algorithms to improve the performance of this critical procedure.
2. reducing the volume of data exchanged during modex. Data in the modex consisted of the process name, the name of the node where that process is located (expressed as a string), plus a string representation of all contact info. The nodename was required in order for the modex to determine if the process was local or not - in addition, some people like to have it to print pretty error messages when a connection failed.
The size of this data has been reduced in three ways:
(a) reducing the size of the process name itself. The process name consisted of two 32-bit fields for the jobid and vpid. This is far larger than any current system, or system likely to exist in the near future, can support. Accordingly, the default size of these fields has been reduced to 16-bits, which means you can have 32k procs in each of 32k jobs. Since the daemons must have a vpid, and we require one daemon/node, this also restricts the default configuration to 32k nodes.
To support any future "mega-clusters", a configuration option --enable-jumbo-apps has been added. This option increases the jobid and vpid field sizes to 32-bits. Someday, if necessary, someone can add yet another option to increase them to 64-bits, I suppose.
(b) replacing the string nodename with an integer nodeid. Since we have one daemon/node, the nodeid corresponds to the local daemon's vpid. This replaces an often lengthy string with only 2 (or at most 4) bytes, a substantial reduction.
(c) when the mca param requesting that nodenames be sent to support pretty error messages is set, a second mca param is now used to request the FQDN - otherwise, the domain name is stripped (by default) from the message to save space. If someone wants to combine those into a single param somehow (perhaps with an argument?), they are welcome to do so - I didn't want to alter what people are already using.
While these may seem like small savings, they actually amount to a significant impact when aggregated across the entire modex operation. Since every proc must receive the modex data regardless of the collective used to send it, just reducing the size of the process name removes nearly 400MBytes of communication from a 32k proc job (admittedly, much of this comm may occur in parallel). So it does add up pretty quickly.
3. routing RML messages to reduce connections. The default messaging system remains point-to-point - i.e., each proc opens a socket to every proc it communicates with and sends its messages directly. A new option uses the orteds as routers - i.e., each proc only opens a single socket to its local orted. All messages are sent from the proc to the orted, which forwards the message to the orted on the node where the intended recipient proc is located - that orted then forwards the message to its local proc (the recipient). This greatly reduces the connection storm we have encountered during startup.
It also has the benefit of removing the sharing of every proc's OOB contact with every other proc. The orted routing tables are populated during launch since every orted gets a map of where every proc is being placed. Each proc, therefore, only needs to know the contact info for its local daemon, which is passed in via the environment when the proc is fork/exec'd by the daemon. This alone removes ~50 bytes/process of communication that was in the current STG1 startup message - so for our 32k proc job, this saves us roughly 32k*50 = 1.6MBytes sent to 32k procs = 51GBytes of messaging.
Note that you can use the new routing method by specifying -mca routed tree - if you so desire. This mode will become the default at some point in the future.
There are a few minor additional changes in the commit that I'll just note in passing:
* propagation of command line mca params to the orteds - fixes ticket #1073. See note there for details.
* requiring of "finalize" prior to "exit" for MPI procs - fixes ticket #1144. See note there for details.
* cleanup of some stale header files
This commit was SVN r16364.
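The savings estimate in item 3 above can be checked with a few lines of C. This is only a sketch under stated assumptions: the function names are hypothetical and not part of Open MPI, "32k" is taken as 32,000 to match the quoted 51GBytes, and the 16-bit limit assumes the id field is signed, which the message implies but does not state.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: total bytes no longer sent when each of nprocs
 * receivers stops getting bytes_per_proc of OOB contact info for every
 * one of the nprocs senders. Inputs are the message's rough estimates. */
uint64_t oob_bytes_saved(uint64_t nprocs, uint64_t bytes_per_proc)
{
    uint64_t per_receiver = nprocs * bytes_per_proc;  /* 32k * 50 = 1.6 MB */
    return per_receiver * nprocs;                     /* times 32k receivers */
}

/* Hypothetical helper: number of non-negative values in a signed field of
 * the given width (assumption: the 32k limit comes from a signed 16-bit id). */
uint64_t signed_field_capacity(unsigned bits)
{
    assert(bits >= 1 && bits <= 64);
    return (uint64_t)1 << (bits - 1);
}
```

With nprocs = 32000 and bytes_per_proc = 50, oob_bytes_saved returns 51,200,000,000 bytes, in line with the roughly 51GBytes quoted in the message; signed_field_capacity(16) returns 32768, the "32k" limit on procs and jobs.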
#include "ompi/runtime/ompi_module_exchange.h"

#include "orte/util/proc_info.h"
#include "orte/dss/dss.h"
#include "orte/mca/ns/ns.h"
#include "orte/mca/gpr/gpr.h"
#include "orte/mca/errmgr/errmgr.h"
#include "orte/mca/ras/ras_types.h"
#include "orte/mca/rmaps/rmaps_types.h"
#include "orte/mca/rmgr/rmgr.h"
#include "orte/mca/rmgr/base/base.h"
#include "orte/mca/smr/smr_types.h"
#include "orte/mca/rml/rml.h"
#include "orte/mca/grpcomm/grpcomm.h"

#include "orte/runtime/runtime.h"

Clean up the way procs are added to the global process list after MPI_INIT:
* Do not add new procs to the global list during modex callback or
when sharing orte names during accept/connect. For modex, we
cache the modex info for later, in case that proc ever does get
added to the global proc list. For accept/connect orte name
exchange between the roots, we only need the orte name, so no
need to add a proc structure anyway. The procs will be added
to the global process list during the proc exchange later in
the wireup process
* Rename proc_get_namebuf and proc_get_proclist to proc_pack
and proc_unpack and extend them to include all information
needed to build that proc struct on a remote node (which
includes ORTE name, architecture, and hostname). Change
unpack to call pml_add_procs for the entire list of new
procs at once, rather than one at a time.
* Remove ompi_proc_find_and_add from the public proc
interface and make it a private function. This function
would add a half-created proc to the global proc list, so
making it harder to call is a good thing.
This means that there are only two ways to add new procs into the global proc list at this time: during MPI_INIT via the call to ompi_proc_init, where my job is added to the list, and via ompi_proc_unpack using a buffer from a packed proc list sent to us by someone else. Currently, this is enough to implement MPI semantics. We can extend the interface more if we like, but that may require HNP communication to get the remote proc information and I wanted to avoid that if at all possible.
Refs trac:564
This commit was SVN r12798.
The following Trac tickets were found above:
Ticket 564 --> https://svn.open-mpi.org/trac/ompi/ticket/564
static int ompi_comm_get_rport (orte_process_name_t *port,
                                int send_first, struct ompi_proc_t *proc,
                                orte_rml_tag_t tag, orte_process_name_t *rport);


int ompi_comm_connect_accept ( ompi_communicator_t *comm, int root,
                               orte_process_name_t *port, int send_first,
                               ompi_communicator_t **newcomm, orte_rml_tag_t tag )
{
    int size, rsize, rank, rc;
    orte_std_cntr_t num_vals;
    orte_std_cntr_t rnamebuflen = 0;
    int rnamebuflen_int = 0;
    void *rnamebuf=NULL;

    ompi_communicator_t *newcomp=MPI_COMM_NULL;
    ompi_proc_t **rprocs=NULL;
    ompi_group_t *group=comm->c_local_group;
    orte_process_name_t *rport=NULL, tmp_port_name;
    orte_buffer_t *nbuf=NULL, *nrbuf=NULL;
    ompi_proc_t **proc_list=NULL, **new_proc_list;
    int i,j, new_proc_len;
    ompi_group_t *new_group_pointer;

    size = ompi_comm_size ( comm );
    rank = ompi_comm_rank ( comm );

    /* tell the progress engine to tick the event library more
       often, to make sure that the OOB messages get sent */
    opal_progress_event_users_increment();

    if ( rank == root ) {
        /* The process receiving first does not yet have the contact
           information of the remote process. Therefore, we have to
           exchange that.
        */

        if(!OMPI_GROUP_IS_DENSE(group)) {
            proc_list = (ompi_proc_t **) calloc (group->grp_proc_count,
                                                 sizeof (ompi_proc_t *));
            for(i=0 ; i<group->grp_proc_count ; i++)
                proc_list[i] = ompi_group_peer_lookup(group,i);
        }

        if ( OMPI_COMM_JOIN_TAG != (int)tag ) {
            if(OMPI_GROUP_IS_DENSE(group)){
                rc = ompi_comm_get_rport(port,send_first,
                                         group->grp_proc_pointers[rank], tag,
                                         &tmp_port_name);
            }
            else {
                rc = ompi_comm_get_rport(port,send_first,
                                         proc_list[rank], tag,
                                         &tmp_port_name);
            }
            if (OMPI_SUCCESS != rc) {
                return rc;
            }
            rport = &tmp_port_name;
        } else {
            rport = port;
        }

        /* Generate the message buffer containing the number of processes and the list of
           participating processes */
        nbuf = OBJ_NEW(orte_buffer_t);
        if (NULL == nbuf) {
            return OMPI_ERROR;
        }

        if (ORTE_SUCCESS != (rc = orte_dss.pack(nbuf, &size, 1, ORTE_INT))) {
            ORTE_ERROR_LOG(rc);
            goto exit;
        }

        if(OMPI_GROUP_IS_DENSE(group)) {
            ompi_proc_pack(group->grp_proc_pointers, size, nbuf);
        }
        else {
            ompi_proc_pack(proc_list, size, nbuf);
        }

        nrbuf = OBJ_NEW(orte_buffer_t);
        if (NULL == nrbuf ) {
            rc = OMPI_ERROR;
            goto exit;
        }

|
|
|
|
/* Exchange the number and the list of processes in the groups */
|
|
|
|
if ( send_first ) {
|
|
|
|
rc = orte_rml.send_buffer(rport, nbuf, tag, 0);
|
2007-07-20 01:34:02 +00:00
|
|
|
rc = orte_rml.recv_buffer(rport, nrbuf, tag, 0);
if (0 > rc) {
ORTE_ERROR_LOG(rc);
goto exit;
}
|
2006-02-07 03:32:36 +00:00
|
|
|
} else {
|
2007-07-20 01:34:02 +00:00
|
|
|
rc = orte_rml.recv_buffer(rport, nrbuf, tag, 0);
if (0 > rc) {
ORTE_ERROR_LOG(rc);
goto exit;
}
|
2006-02-07 03:32:36 +00:00
|
|
|
rc = orte_rml.send_buffer(rport, nbuf, tag, 0);
if (0 > rc) {
ORTE_ERROR_LOG(rc);
goto exit;
}
|
|
|
|
}
|
|
|
|
|
|
|
|
if (ORTE_SUCCESS != (rc = orte_dss.unload(nrbuf, &rnamebuf, &rnamebuflen))) {
|
2006-08-18 21:12:03 +00:00
|
|
|
ORTE_ERROR_LOG(rc);
|
2006-02-07 03:32:36 +00:00
|
|
|
goto exit;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2005-09-28 23:50:42 +00:00
|
|
|
/* First convert the size_t to an int so we can cast in the bcast to a void *
|
2006-08-15 19:54:10 +00:00
|
|
|
* if we don't, big- and little-endian hosts will disagree on the value.
|
|
|
|
* THIS IS NO LONGER REQUIRED AS THE LENGTH IS NOW A STD_CNTR_T, WHICH
|
|
|
|
* CORRELATES TO AN INT32
|
|
|
|
*/
|
2006-08-18 21:12:03 +00:00
|
|
|
rnamebuflen_int = (int)rnamebuflen;
|
2005-09-28 23:50:42 +00:00
|
|
|
|
2005-04-12 21:59:13 +00:00
|
|
|
/* bcast the buffer-length to all processes in the local comm */
|
2007-08-19 03:37:49 +00:00
|
|
|
rc = comm->c_coll.coll_bcast (&rnamebuflen_int, 1, MPI_INT, root, comm,
|
|
|
|
comm->c_coll.coll_bcast_module);
|
2004-08-05 16:31:30 +00:00
|
|
|
if ( OMPI_SUCCESS != rc ) {
|
|
|
|
goto exit;
|
|
|
|
}
|
2006-08-18 21:12:03 +00:00
|
|
|
rnamebuflen = rnamebuflen_int;
|
2004-08-05 16:31:30 +00:00
|
|
|
|
2005-04-12 21:59:13 +00:00
|
|
|
if ( rank != root ) {
|
2006-02-07 03:32:36 +00:00
|
|
|
/* non root processes need to allocate the buffer manually */
|
|
|
|
rnamebuf = (char *) malloc(rnamebuflen);
|
|
|
|
if ( NULL == rnamebuf ) {
|
|
|
|
rc = OMPI_ERR_OUT_OF_RESOURCE;
|
|
|
|
goto exit;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/* bcast list of processes to all procs in local group
|
2004-08-05 16:31:30 +00:00
|
|
|
and reconstruct the data. Note that ompi_proc_unpack
|
|
|
|
adds any processes that were not yet known to our
|
|
|
|
process pool.
|
|
|
|
*/
|
2007-08-19 03:37:49 +00:00
|
|
|
rc = comm->c_coll.coll_bcast (rnamebuf, rnamebuflen_int, MPI_BYTE, root, comm,
|
|
|
|
comm->c_coll.coll_bcast_module);
|
2004-08-05 16:31:30 +00:00
|
|
|
if ( OMPI_SUCCESS != rc ) {
|
2006-02-07 03:32:36 +00:00
|
|
|
goto exit;
|
2005-04-12 21:59:13 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
nrbuf = OBJ_NEW(orte_buffer_t);
|
|
|
|
if (NULL == nrbuf) {
rc = OMPI_ERROR;
|
2006-02-07 03:32:36 +00:00
|
|
|
goto exit;
|
2005-04-12 21:59:13 +00:00
|
|
|
}
|
2006-02-07 03:32:36 +00:00
|
|
|
if ( ORTE_SUCCESS != ( rc = orte_dss.load(nrbuf, rnamebuf, rnamebuflen))) {
|
2006-08-18 21:12:03 +00:00
|
|
|
ORTE_ERROR_LOG(rc);
|
2006-02-07 03:32:36 +00:00
|
|
|
goto exit;
|
2005-04-12 21:59:13 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
num_vals = 1;
|
2006-02-07 03:32:36 +00:00
|
|
|
if (ORTE_SUCCESS != (rc = orte_dss.unpack(nrbuf, &rsize, &num_vals, ORTE_INT))) {
|
2006-08-18 21:12:03 +00:00
|
|
|
ORTE_ERROR_LOG(rc);
|
2006-02-07 03:32:36 +00:00
|
|
|
goto exit;
|
2004-08-05 16:31:30 +00:00
|
|
|
}
|
2005-04-12 21:59:13 +00:00
|
|
|
|
These changes were mostly captured in a prior RFC (except for #2 below) and are aimed specifically at improving startup performance and setting up the remaining modifications described in that RFC.
The commit has been tested for C/R and Cray operations, and on Odin (SLURM, rsh) and RoadRunner (TM). I tried to update all environments, but obviously could not test them. I know that Windows needs some work, and have highlighted what is known to be needed in the odls process component.
This represents a lot of work by Brian, Tim P, Josh, and myself, with much advice from Jeff and others. For posterity, I have appended a copy of the email describing the work that was done:
As we have repeatedly noted, the modex operation in MPI_Init is the single greatest consumer of time during startup. To-date, we have executed that operation as an ORTE stage gate that held the process until a startup message containing all required modex (and OOB contact info - see #3 below) info could be sent to it. Each process would send its data to the HNP's registry, which assembled and sent the message when all processes had reported in.
In addition, ORTE had taken responsibility for monitoring process status as it progressed through a series of "stage gates". The process reported its status at each gate, and ORTE would then send a "release" message once all procs had reported in.
The incoming changes revamp these procedures in three ways:
1. eliminating the ORTE stage gate system and cleanly delineating responsibility between the OMPI and ORTE layers for MPI init/finalize. The modex stage gate (STG1) has been replaced by a collective operation in the modex itself that performs an allgather on the required modex info. The allgather is implemented using the orte_grpcomm framework since the BTL's are not active at that point. At the moment, the grpcomm framework only has a "basic" component analogous to OMPI's "basic" coll framework - I would recommend that the MPI team create additional, more advanced components to improve performance of this step.
The other stage gates have been replaced by orte_grpcomm barrier functions. We tried to use MPI barriers instead (since the BTL's are active at that point), but - as we discussed on the telecon - these are not currently true barriers so the job would hang when we fell through while messages were still in process. Note that the grpcomm barrier doesn't actually resolve that problem, but Brian has pointed out that we are unlikely to ever see it violated. Again, you might want to spend a little time on an advanced barrier algorithm as the one in "basic" is very simplistic.
Summarizing this change: ORTE no longer tracks process state nor has direct responsibility for synchronizing jobs. This is now done via collective operations within the MPI layer, albeit using ORTE collective communication services. I -strongly- urge the MPI team to implement advanced collective algorithms to improve the performance of this critical procedure.
2. reducing the volume of data exchanged during modex. Data in the modex consisted of the process name, the name of the node where that process is located (expressed as a string), plus a string representation of all contact info. The nodename was required in order for the modex to determine if the process was local or not - in addition, some people like to have it to print pretty error messages when a connection failed.
The size of this data has been reduced in three ways:
(a) reducing the size of the process name itself. The process name consisted of two 32-bit fields for the jobid and vpid. This is far larger than any current system, or system likely to exist in the near future, can support. Accordingly, the default size of these fields has been reduced to 16-bits, which means you can have 32k procs in each of 32k jobs. Since the daemons must have a vpid, and we require one daemon/node, this also restricts the default configuration to 32k nodes.
To support any future "mega-clusters", a configuration option --enable-jumbo-apps has been added. This option increases the jobid and vpid field sizes to 32-bits. Someday, if necessary, someone can add yet another option to increase them to 64-bits, I suppose.
(b) replacing the string nodename with an integer nodeid. Since we have one daemon/node, the nodeid corresponds to the local daemon's vpid. This replaces an often lengthy string with only 2 (or at most 4) bytes, a substantial reduction.
(c) when the mca param requesting that nodenames be sent to support pretty error messages is set, a second mca param is now used to request FQDN - otherwise, the domain name is stripped (by default) from the message to save space. If someone wants to combine those into a single param somehow (perhaps with an argument?), they are welcome to do so - I didn't want to alter what people are already using.
While these may seem like small savings, they actually amount to a significant impact when aggregated across the entire modex operation. Since every proc must receive the modex data regardless of the collective used to send it, just reducing the size of the process name removes nearly 400MBytes of communication from a 32k proc job (admittedly, much of this comm may occur in parallel). So it does add up pretty quickly.
3. routing RML messages to reduce connections. The default messaging system remains point-to-point - i.e., each proc opens a socket to every proc it communicates with and sends its messages directly. A new option uses the orteds as routers - i.e., each proc only opens a single socket to its local orted. All messages are sent from the proc to the orted, which forwards the message to the orted on the node where the intended recipient proc is located - that orted then forwards the message to its local proc (the recipient). This greatly reduces the connection storm we have encountered during startup.
It also has the benefit of removing the sharing of every proc's OOB contact with every other proc. The orted routing tables are populated during launch since every orted gets a map of where every proc is being placed. Each proc, therefore, only needs to know the contact info for its local daemon, which is passed in via the environment when the proc is fork/exec'd by the daemon. This alone removes ~50 bytes/process of communication that was in the current STG1 startup message - so for our 32k proc job, this saves us roughly 32k*50 = 1.6MBytes sent to 32k procs = 51GBytes of messaging.
Note that you can use the new routing method by specifying -mca routed tree - if you so desire. This mode will become the default at some point in the future.
There are a few minor additional changes in the commit that I'll just note in passing:
* propagation of command line mca params to the orteds - fixes ticket #1073. See note there for details.
* requiring of "finalize" prior to "exit" for MPI procs - fixes ticket #1144. See note there for details.
* cleanup of some stale header files
This commit was SVN r16364.
2007-10-05 19:48:23 +00:00
|
|
|
rc = ompi_proc_unpack(nrbuf, rsize, &rprocs, &new_proc_len, &new_proc_list);
|
2004-08-05 16:31:30 +00:00
|
|
|
if ( OMPI_SUCCESS != rc ) {
|
2006-02-07 03:32:36 +00:00
|
|
|
goto exit;
|
2004-08-05 16:31:30 +00:00
|
|
|
}
|
2006-02-07 03:32:36 +00:00
|
|
|
|
2007-10-05 19:48:23 +00:00
|
|
|
/* If we added new procs, we need to do the modex and then call
|
|
|
|
PML add_procs */
|
|
|
|
if (new_proc_len > 0) {
|
|
|
|
opal_list_t all_procs;
|
|
|
|
orte_namelist_t *name;
|
|
|
|
orte_buffer_t mdx_buf, rbuf;
|
|
|
|
|
|
|
|
OBJ_CONSTRUCT(&all_procs, opal_list_t);
|
|
|
|
|
|
|
|
if (send_first) {
|
|
|
|
for (i = 0 ; i < group->grp_proc_count ; ++i) {
|
|
|
|
name = OBJ_NEW(orte_namelist_t);
|
|
|
|
name->name = &(ompi_group_peer_lookup(group, i)->proc_name);
|
|
|
|
opal_list_append(&all_procs, &name->item);
|
|
|
|
}
|
|
|
|
|
|
|
|
for (i = 0 ; i < rsize ; ++i) {
|
|
|
|
name = OBJ_NEW(orte_namelist_t);
|
|
|
|
name->name = &(rprocs[i]->proc_name);
|
|
|
|
opal_list_append(&all_procs, &name->item);
|
|
|
|
}
|
|
|
|
} else {
|
|
|
|
for (i = 0 ; i < rsize ; ++i) {
|
|
|
|
name = OBJ_NEW(orte_namelist_t);
|
|
|
|
name->name = &(rprocs[i]->proc_name);
|
|
|
|
opal_list_append(&all_procs, &name->item);
|
|
|
|
}
|
|
|
|
|
|
|
|
for (i = 0 ; i < group->grp_proc_count ; ++i) {
|
|
|
|
name = OBJ_NEW(orte_namelist_t);
|
|
|
|
name->name = &(ompi_group_peer_lookup(group, i)->proc_name);
|
|
|
|
opal_list_append(&all_procs, &name->item);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
OBJ_CONSTRUCT(&mdx_buf, orte_buffer_t);
|
|
|
|
if (OMPI_SUCCESS != (rc = ompi_modex_get_my_buffer(&mdx_buf))) {
|
|
|
|
ORTE_ERROR_LOG(rc);
|
|
|
|
goto exit;
|
|
|
|
}
|
|
|
|
|
|
|
|
OBJ_CONSTRUCT(&rbuf, orte_buffer_t);
|
|
|
|
if (OMPI_SUCCESS != (rc = orte_grpcomm.allgather_list(&all_procs,
|
|
|
|
&mdx_buf,
|
|
|
|
&rbuf))) {
|
|
|
|
ORTE_ERROR_LOG(rc);
|
|
|
|
goto exit;
|
|
|
|
}
|
|
|
|
OBJ_DESTRUCT(&mdx_buf);
|
|
|
|
|
|
|
|
if (OMPI_SUCCESS != (rc = ompi_modex_process_data(&rbuf))) {
|
|
|
|
ORTE_ERROR_LOG(rc);
|
|
|
|
goto exit;
|
|
|
|
}
|
|
|
|
OBJ_DESTRUCT(&rbuf);
|
|
|
|
|
|
|
|
/*
|
|
|
|
while (NULL != (item = opal_list_remove_first(&all_procs))) {
|
|
|
|
OBJ_RELEASE(item);
|
|
|
|
}
|
|
|
|
OBJ_DESTRUCT(&all_procs);
|
|
|
|
*/
|
|
|
|
|
|
|
|
MCA_PML_CALL(add_procs(new_proc_list, new_proc_len));
|
|
|
|
}
|
|
|
|
|
2005-03-14 20:57:21 +00:00
|
|
|
OBJ_RELEASE(nrbuf);
|
2004-09-16 10:07:42 +00:00
|
|
|
if ( rank == root ) {
|
2006-02-07 03:32:36 +00:00
|
|
|
OBJ_RELEASE(nbuf);
|
2004-09-16 10:07:42 +00:00
|
|
|
}
|
2006-02-07 03:32:36 +00:00
|
|
|
|
2007-08-04 00:41:26 +00:00
|
|
|
new_group_pointer=ompi_group_allocate(rsize);
|
|
|
|
if( NULL == new_group_pointer ) {
|
|
|
|
return MPI_ERR_GROUP;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* put group elements in the list */
|
|
|
|
for (j = 0; j < rsize; j++) {
|
|
|
|
new_group_pointer->grp_proc_pointers[j] = rprocs[j];
|
|
|
|
} /* end proc loop */
|
|
|
|
|
|
|
|
/* increment proc reference counters */
|
|
|
|
ompi_group_increment_proc_count(new_group_pointer);
|
|
|
|
|
|
|
|
/* set up communicator structure */
|
|
|
|
rc = ompi_comm_set ( &newcomp, /* new comm */
|
|
|
|
comm, /* old comm */
|
|
|
|
group->grp_proc_count, /* local_size */
|
|
|
|
NULL, /* local_procs */
|
|
|
|
rsize, /* remote_size */
|
|
|
|
NULL , /* remote_procs */
|
|
|
|
NULL, /* attrs */
|
|
|
|
comm->error_handler, /* error handler */
|
|
|
|
NULL, /* topo component */
|
2007-09-13 14:00:59 +00:00
|
|
|
group, /* local group */
|
|
|
|
new_group_pointer /* remote group */
|
2007-08-04 00:41:26 +00:00
|
|
|
);
|
2004-08-05 16:31:30 +00:00
|
|
|
if ( NULL == newcomp ) {
|
2006-02-07 03:32:36 +00:00
|
|
|
rc = OMPI_ERR_OUT_OF_RESOURCE;
|
|
|
|
goto exit;
|
2004-08-05 16:31:30 +00:00
|
|
|
}
|
2006-02-07 03:32:36 +00:00
|
|
|
|
2007-08-04 00:41:26 +00:00
|
|
|
ompi_group_decrement_proc_count (new_group_pointer);
|
|
|
|
OBJ_RELEASE(new_group_pointer);
|
|
|
|
new_group_pointer = MPI_GROUP_NULL;
|
|
|
|
|
2004-08-05 16:31:30 +00:00
|
|
|
/* allocate comm_cid */
|
|
|
|
rc = ompi_comm_nextcid ( newcomp, /* new communicator */
|
|
|
|
comm, /* old communicator */
|
|
|
|
NULL, /* bridge comm */
|
|
|
|
&root, /* local leader */
|
|
|
|
rport, /* remote leader */
|
|
|
|
OMPI_COMM_CID_INTRA_OOB, /* mode */
|
|
|
|
send_first ); /* send or recv first */
|
|
|
|
if ( OMPI_SUCCESS != rc ) {
|
|
|
|
goto exit;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* activate comm and init coll-component */
|
|
|
|
rc = ompi_comm_activate ( newcomp, /* new communicator */
|
|
|
|
comm, /* old communicator */
|
|
|
|
NULL, /* bridge comm */
|
|
|
|
&root, /* local leader */
|
|
|
|
rport, /* remote leader */
|
|
|
|
OMPI_COMM_CID_INTRA_OOB, /* mode */
|
|
|
|
send_first, /* send or recv first */
|
2007-08-19 03:37:49 +00:00
|
|
|
0); /* sync_flag */
|
2004-08-05 16:31:30 +00:00
|
|
|
if ( OMPI_SUCCESS != rc ) {
|
|
|
|
goto exit;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Question: do we have to re-start some low level stuff
|
|
|
|
to enable the usage of fast communication devices
|
2006-02-07 03:32:36 +00:00
|
|
|
between the two worlds ?
|
2004-08-05 16:31:30 +00:00
|
|
|
*/
|
2006-02-07 03:32:36 +00:00
|
|
|
|
|
|
|
|
2004-08-05 16:31:30 +00:00
|
|
|
exit:
|
2005-04-21 14:58:25 +00:00
|
|
|
/* done with OOB and such - slow our tick rate again */
|
2005-07-03 21:57:43 +00:00
|
|
|
opal_progress();
|
2006-11-22 02:06:52 +00:00
|
|
|
opal_progress_event_users_decrement();
|
2004-09-16 10:07:42 +00:00
|
|
|
|
2004-08-05 16:31:30 +00:00
|
|
|
if ( NULL != rprocs ) {
|
|
|
|
free ( rprocs );
|
|
|
|
}
|
2007-08-04 00:41:26 +00:00
|
|
|
if ( NULL != proc_list ) {
|
|
|
|
free ( proc_list );
|
|
|
|
}
|
2004-08-05 16:31:30 +00:00
|
|
|
if ( OMPI_SUCCESS != rc ) {
|
2007-08-25 12:18:55 +00:00
|
|
|
if ( MPI_COMM_NULL != newcomp && NULL != newcomp ) {
|
2004-08-05 16:31:30 +00:00
|
|
|
OBJ_RELEASE(newcomp);
|
|
|
|
newcomp = MPI_COMM_NULL;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
*newcomm = newcomp;
|
|
|
|
return rc;
|
|
|
|
}
|
|
|
|
|
2004-10-26 11:37:58 +00:00
|
|
|
/**********************************************************************/
|
|
|
|
/**********************************************************************/
|
|
|
|
/**********************************************************************/
|
2004-08-05 16:31:30 +00:00
|
|
|
/*
|
|
|
|
* This routine is necessary, since in the connect/accept case, the processes
|
|
|
|
* executing the connect operation have the OOB contact information of the
|
2006-02-07 03:32:36 +00:00
|
|
|
* leader of the remote group; however, the processes executing the
|
|
|
|
* accept get their own port_name = OOB contact information passed in as
|
2004-08-05 16:31:30 +00:00
|
|
|
* an argument. That information does not tell them how to reach the remote leader.
|
2006-02-07 03:32:36 +00:00
|
|
|
*
|
Clean up the way procs are added to the global process list after MPI_INIT:
* Do not add new procs to the global list during modex callback or
when sharing orte names during accept/connect. For modex, we
cache the modex info for later, in case that proc ever does get
added to the global proc list. For accept/connect orte name
exchange between the roots, we only need the orte name, so no
need to add a proc structure anyway. The procs will be added
to the global process list during the proc exchange later in
the wireup process
* Rename proc_get_namebuf and proc_get_proclist to proc_pack
and proc_unpack and extend them to include all information
needed to build that proc struct on a remote node (which
includes ORTE name, architecture, and hostname). Change
unpack to call pml_add_procs for the entire list of new
procs at once, rather than one at a time.
* Remove ompi_proc_find_and_add from the public proc
interface and make it a private function. This function
would add a half-created proc to the global proc list, so
making it harder to call is a good thing.
This means that there's only two ways to add new procs into the global proc list at this time: During MPI_INIT via the call to ompi_proc_init, where my job is added to the list and via ompi_proc_unpack using a buffer from a packed proc list sent to us by someone else. Currently, this is enough to implement MPI semantics. We can extend the interface more if we like, but that may require HNP communication to get the remote proc information and I wanted to avoid that if at all possible.
Refs trac:564
This commit was SVN r12798.
The following Trac tickets were found above:
Ticket 564 --> https://svn.open-mpi.org/trac/ompi/ticket/564
2006-12-07 19:56:54 +00:00
|
|
|
* Therefore, the two root processes exchange this information at this
|
|
|
|
* point.
|
2004-08-05 16:31:30 +00:00
|
|
|
*
|
|
|
|
*/
|
2006-12-07 19:56:54 +00:00
|
|
|
int ompi_comm_get_rport(orte_process_name_t *port, int send_first,
|
|
|
|
ompi_proc_t *proc, orte_rml_tag_t tag,
|
|
|
|
orte_process_name_t *rport_name)
|
2004-08-05 16:31:30 +00:00
|
|
|
{
|
2004-09-16 10:07:42 +00:00
|
|
|
int rc;
|
2006-08-15 19:54:10 +00:00
|
|
|
orte_std_cntr_t num_vals;
|
2004-08-05 16:31:30 +00:00
|
|
|
|
|
|
|
if ( send_first ) {
|
2005-03-14 20:57:21 +00:00
|
|
|
orte_buffer_t *sbuf;
|
2004-08-05 16:31:30 +00:00
|
|
|
|
2005-03-14 20:57:21 +00:00
|
|
|
sbuf = OBJ_NEW(orte_buffer_t);
|
|
|
|
if (NULL == sbuf) {
|
2006-12-07 19:56:54 +00:00
|
|
|
return OMPI_ERROR;
|
2005-03-14 20:57:21 +00:00
|
|
|
}
|
2006-08-21 14:26:11 +00:00
|
|
|
if (ORTE_SUCCESS != (rc = orte_dss.pack(sbuf, &(proc->proc_name), 1, ORTE_NAME))) {
|
2006-08-18 21:12:03 +00:00
|
|
|
ORTE_ERROR_LOG(rc);
|
2006-08-21 14:26:11 +00:00
|
|
|
OBJ_RELEASE(sbuf);
|
2006-12-07 19:56:54 +00:00
|
|
|
return rc;
|
2006-08-21 14:26:11 +00:00
|
|
|
}
|
2004-08-12 22:41:42 +00:00
|
|
|
|
2006-09-29 20:28:45 +00:00
|
|
|
rc = orte_rml.send_buffer(port, sbuf, tag, 0);
|
|
|
|
OBJ_RELEASE(sbuf);
|
2007-04-06 19:18:31 +00:00
|
|
|
if ( 0 > rc ) {
|
|
|
|
ORTE_ERROR_LOG(rc);
|
|
|
|
return rc;
|
|
|
|
}
|
|
|
|
|
2006-12-07 19:56:54 +00:00
|
|
|
*rport_name = *port;
|
2007-04-06 19:18:31 +00:00
|
|
|
} else {
|
2005-03-14 20:57:21 +00:00
|
|
|
orte_buffer_t *rbuf;
|
2004-08-05 16:31:30 +00:00
|
|
|
|
2005-03-14 20:57:21 +00:00
|
|
|
rbuf = OBJ_NEW(orte_buffer_t);
|
|
|
|
if (NULL == rbuf) {
|
2006-12-07 19:56:54 +00:00
|
|
|
return ORTE_ERROR;
|
2005-03-14 20:57:21 +00:00
|
|
|
}
|
2007-07-20 01:34:02 +00:00
|
|
|
if (ORTE_SUCCESS != (rc = orte_rml.recv_buffer(ORTE_NAME_WILDCARD, rbuf, tag, 0))) {
|
2006-08-21 14:26:11 +00:00
|
|
|
ORTE_ERROR_LOG(rc);
|
|
|
|
OBJ_RELEASE(rbuf);
|
2006-12-07 19:56:54 +00:00
|
|
|
return rc;
|
2006-08-21 14:26:11 +00:00
|
|
|
}
|
2006-09-29 20:28:45 +00:00
|
|
|
|
2005-03-14 20:57:21 +00:00
|
|
|
num_vals = 1;
|
2006-12-07 19:56:54 +00:00
|
|
|
if (ORTE_SUCCESS != (rc = orte_dss.unpack(rbuf, rport_name, &num_vals, ORTE_NAME))) {
|
2006-08-18 21:12:03 +00:00
|
|
|
ORTE_ERROR_LOG(rc);
|
2006-08-21 14:26:11 +00:00
|
|
|
OBJ_RELEASE(rbuf);
|
2006-12-07 19:56:54 +00:00
|
|
|
return rc;
|
2005-03-14 20:57:21 +00:00
|
|
|
}
|
|
|
|
OBJ_RELEASE(rbuf);
|
2004-12-02 13:28:10 +00:00
|
|
|
}
|
|
|
|
|
2006-12-07 19:56:54 +00:00
|
|
|
return OMPI_SUCCESS;
|
2004-08-05 16:31:30 +00:00
|
|
|
}
|
2004-09-29 12:41:55 +00:00
|
|
|
|
|
|
|
|
2004-10-26 11:37:58 +00:00
|
|
|
/**********************************************************************/
|
|
|
|
/**********************************************************************/
|
|
|
|
/**********************************************************************/
|
2004-12-16 15:42:02 +00:00
|
|
|
int
|
|
|
|
ompi_comm_start_processes(int count, char **array_of_commands,
|
2006-02-07 03:32:36 +00:00
|
|
|
char ***array_of_argv,
|
|
|
|
int *array_of_maxprocs,
|
|
|
|
MPI_Info *array_of_info,
|
2004-12-16 15:42:02 +00:00
|
|
|
char *port_name)
|
2004-09-29 12:41:55 +00:00
|
|
|
{
|
2006-02-07 03:32:36 +00:00
|
|
|
int rc, i, j, counter;
|
2004-12-23 13:04:58 +00:00
|
|
|
int have_wdir=0;
|
2006-08-09 20:48:51 +00:00
|
|
|
bool have_prefix;
|
2005-04-14 16:05:17 +00:00
|
|
|
int valuelen=OMPI_PATH_MAX, flag=0;
|
|
|
|
char cwd[OMPI_PATH_MAX];
|
2006-07-27 23:45:33 +00:00
|
|
|
char host[OMPI_PATH_MAX]; /*** should define OMPI_HOST_MAX ***/
|
2006-08-09 20:48:51 +00:00
|
|
|
char prefix[OMPI_PATH_MAX];
|
|
|
|
char *base_prefix;
|
2005-04-14 16:05:17 +00:00
|
|
|
|
2006-08-15 19:54:10 +00:00
|
|
|
orte_std_cntr_t num_apps, ai;
|
2006-10-10 23:59:48 +00:00
|
|
|
orte_jobid_t new_jobid=ORTE_JOBID_INVALID;
|
2005-04-14 16:05:17 +00:00
|
|
|
orte_app_context_t **apps=NULL;
|
2006-10-17 16:06:17 +00:00
|
|
|
|
|
|
|
opal_list_t attributes;
|
2006-10-18 20:02:16 +00:00
|
|
|
opal_list_item_t *item;
|
2005-04-14 16:05:17 +00:00
|
|
|
|
2006-10-31 23:32:39 +00:00
|
|
|
bool timing = false;
|
|
|
|
struct timeval ompistart, ompistop;
|
|
|
|
int param, value;
|
|
|
|
|
2004-09-29 12:41:55 +00:00
|
|
|
/* parse the info object */
|
2006-02-07 03:32:36 +00:00
|
|
|
/* check potentially for:
|
2004-09-29 12:41:55 +00:00
|
|
|
- "host": desired host where to spawn the processes
|
2006-08-09 20:48:51 +00:00
|
|
|
- "prefix": the path to the root of the directory tree where ompi
|
|
|
|
executables and libraries can be found
|
2004-09-29 12:41:55 +00:00
|
|
|
- "arch": desired architecture
|
|
|
|
- "wdir": directory, where executable can be found
|
|
|
|
- "path": list of directories where to look for the executable
|
|
|
|
- "file": filename, where additional information is provided.
|
|
|
|
- "soft": see page 92 of MPI-2.
|
|
|
|
*/
|
|
|
|
|
2005-04-21 14:58:25 +00:00
|
|
|
/* make sure the progress engine properly trips the event library */
|
2006-11-22 02:06:52 +00:00
|
|
|
opal_progress_event_users_increment();
|
2006-08-09 20:48:51 +00:00
|
|
|
|
2006-10-31 23:32:39 +00:00
|
|
|
/* check to see if we want timing information */
|
|
|
|
param = mca_base_param_reg_int_name("ompi", "timing",
|
|
|
|
"Request that critical timing loops be measured",
|
|
|
|
false, false, 0, &value);
|
|
|
|
if (value != 0) {
|
|
|
|
timing = true;
|
|
|
|
if (0 != gettimeofday(&ompistart, NULL)) {
|
|
|
|
opal_output(0, "ompi_comm_start_processes: could not obtain start time");
|
|
|
|
ompistart.tv_sec = 0;
|
|
|
|
ompistart.tv_usec = 0;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2006-10-17 16:06:17 +00:00
|
|
|
/* setup to record the attributes */
|
|
|
|
OBJ_CONSTRUCT(&attributes, opal_list_t);
|
|
|
|
|
2006-08-09 20:48:51 +00:00
|
|
|
/* we want to be able to default the prefix to the one used for this job
|
|
|
|
* so that the ompi executables and libraries can be found. the user can
|
|
|
|
* later override this value by providing an MPI_Info value. for now, though,
|
|
|
|
* let's get the default value off the registry
|
|
|
|
*/
|
2007-04-06 19:18:31 +00:00
|
|
|
rc = orte_rmgr.get_app_context(orte_process_info.my_name->jobid, &apps, &num_apps);
|
|
|
|
if (ORTE_SUCCESS != rc) {
|
2006-08-09 20:48:51 +00:00
|
|
|
ORTE_ERROR_LOG(rc);
|
|
|
|
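/* undo the increment done at function entry before bailing out */
opal_progress_event_users_decrement();
return rc;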
|
|
|
|
}
|
|
|
|
/* we'll just use the prefix from the first member of the app_context array.
|
|
|
|
* this shouldn't matter as they all should be the same. it could be NULL, of
|
|
|
|
* course (user might not have specified it), so we need to protect against that.
|
2006-10-10 23:59:48 +00:00
|
|
|
*
|
|
|
|
* It's possible that no app_contexts are returned (e.g., during a comm_spawn
|
|
|
|
* from a singleton), so check first
|
2006-08-09 20:48:51 +00:00
|
|
|
*/
|
2006-10-10 23:59:48 +00:00
|
|
|
if (NULL != apps && NULL != apps[0]->prefix_dir) {
|
2006-08-09 20:48:51 +00:00
|
|
|
base_prefix = strdup(apps[0]->prefix_dir);
|
|
|
|
} else {
|
|
|
|
base_prefix = NULL;
|
|
|
|
}
|
|
|
|
/* cleanup the memory we used */
|
2007-04-06 19:18:31 +00:00
|
|
|
if(NULL != apps) {
|
|
|
|
for (ai = 0; ai < num_apps; ai++) {
|
|
|
|
OBJ_RELEASE(apps[ai]);
|
|
|
|
}
|
|
|
|
free(apps);
|
2006-08-09 20:48:51 +00:00
|
|
|
}
|
2004-09-29 12:41:55 +00:00
|
|
|
|
2005-04-11 12:59:43 +00:00
|
|
|
/* Convert the list of commands to an array of orte_app_context_t
|
|
|
|
pointers */
|
|
|
|
apps = (orte_app_context_t**)malloc(count * sizeof(orte_app_context_t *));
|
|
|
|
if (NULL == apps) {
|
|
|
|
ORTE_ERROR_LOG(ORTE_ERR_OUT_OF_RESOURCE);
|
|
|
|
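/* undo the increment done at function entry before bailing out */
opal_progress_event_users_decrement();
return ORTE_ERR_OUT_OF_RESOURCE;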
|
2004-09-29 12:41:55 +00:00
|
|
|
}
|
2005-04-11 12:59:43 +00:00
|
|
|
for (i = 0; i < count; ++i) {
|
|
|
|
apps[i] = OBJ_NEW(orte_app_context_t);
|
|
|
|
if (NULL == apps[i]) {
|
|
|
|
ORTE_ERROR_LOG(ORTE_ERR_OUT_OF_RESOURCE);
|
|
|
|
/* rollback what was already done */
|
|
|
|
for (j=0; j < i; j++) OBJ_RELEASE(apps[j]);
|
2006-11-22 02:06:52 +00:00
|
|
|
opal_progress_event_users_decrement();
|
2005-04-11 12:59:43 +00:00
|
|
|
return ORTE_ERR_OUT_OF_RESOURCE;
|
2004-12-16 15:42:02 +00:00
|
|
|
}
|
2005-04-11 12:59:43 +00:00
|
|
|
/* copy over the name of the executable */
|
|
|
|
apps[i]->app = strdup(array_of_commands[i]);
|
|
|
|
if (NULL == apps[i]->app) {
|
|
|
|
ORTE_ERROR_LOG(ORTE_ERR_OUT_OF_RESOURCE);
|
|
|
|
/* rollback what was already done */
|
|
|
|
for (j=0; j < i; j++) OBJ_RELEASE(apps[j]);
|
2006-11-22 02:06:52 +00:00
|
|
|
opal_progress_event_users_decrement();
|
2005-04-11 12:59:43 +00:00
|
|
|
return ORTE_ERR_OUT_OF_RESOURCE;
|
2004-12-16 15:42:02 +00:00
|
|
|
}
|
2005-04-11 12:59:43 +00:00
|
|
|
/* record the number of procs to be generated */
|
|
|
|
apps[i]->num_procs = array_of_maxprocs[i];
|
2005-04-18 18:57:24 +00:00
|
|
|
|
2005-04-11 12:59:43 +00:00
|
|
|
/* copy over the argv array */
|
2006-07-28 15:47:16 +00:00
|
|
|
counter = 1;
|
2005-04-18 18:57:24 +00:00
|
|
|
|
2005-04-11 12:59:43 +00:00
|
|
|
if (MPI_ARGVS_NULL != array_of_argv &&
|
|
|
|
MPI_ARGV_NULL != array_of_argv[i]) {
|
|
|
|
/* first need to find out how many entries there are */
|
|
|
|
j=0;
|
|
|
|
while (NULL != array_of_argv[i][j]) {
|
|
|
|
j++;
|
|
|
|
}
|
2006-02-07 03:32:36 +00:00
|
|
|
counter += j;
|
|
|
|
}
|
2006-07-28 15:47:16 +00:00
|
|
|
|
|
|
|
/* now copy them over, ensuring to NULL terminate the array */
|
|
|
|
apps[i]->argv = (char**)malloc((1 + counter) * sizeof(char*));
|
|
|
|
if (NULL == apps[i]->argv) {
|
|
|
|
ORTE_ERROR_LOG(ORTE_ERR_OUT_OF_RESOURCE);
|
|
|
|
/* rollback what was already done */
|
|
|
|
for (j=0; j < i; j++) {
|
|
|
|
OBJ_RELEASE(apps[j]);
|
|
|
|
}
|
2006-11-22 02:06:52 +00:00
|
|
|
opal_progress_event_users_decrement();
|
2006-07-28 15:47:16 +00:00
|
|
|
return ORTE_ERR_OUT_OF_RESOURCE;
|
|
|
|
}
|
|
|
|
apps[i]->argv[0] = strdup(array_of_commands[i]);
|
|
|
|
for (j=1; j < counter; j++) {
|
|
|
|
apps[i]->argv[j] = strdup(array_of_argv[i][j-1]);
|
|
|
|
}
|
|
|
|
apps[i]->argv[counter] = NULL;
|
2006-02-07 03:32:36 +00:00
|
|
|
|
|
|
|
|
2005-04-11 12:59:43 +00:00
|
|
|
/* the environment gets set by the launcher
|
|
|
|
* all we need to do is add the specific values
|
|
|
|
* needed for comm_spawn
|
|
|
|
*/
|
2006-02-07 03:32:36 +00:00
|
|
|
/* Add environment variable with the contact information for the
|
|
|
|
child processes.
|
2006-07-28 15:47:16 +00:00
|
|
|
*/
|
2006-02-07 03:32:36 +00:00
|
|
|
counter = 1;
|
|
|
|
apps[i]->env = (char**)malloc((1+counter) * sizeof(char*));
|
2005-04-11 12:59:43 +00:00
|
|
|
if (NULL == apps[i]->env) {
|
|
|
|
ORTE_ERROR_LOG(ORTE_ERR_OUT_OF_RESOURCE);
|
|
|
|
/* rollback what was already done */
|
|
|
|
for (j=0; j < i; j++) OBJ_RELEASE(apps[j]);
|
2006-11-22 02:06:52 +00:00
|
|
|
opal_progress_event_users_decrement();
|
2005-04-11 12:59:43 +00:00
|
|
|
return ORTE_ERR_OUT_OF_RESOURCE;
|
|
|
|
}
|
2005-04-14 16:28:15 +00:00
|
|
|
asprintf(&(apps[i]->env[0]), "OMPI_PARENT_PORT=%s", port_name);
|
2005-04-11 12:59:43 +00:00
|
|
|
apps[i]->env[1] = NULL;
|
2005-11-16 22:20:33 +00:00
|
|
|
for (j = 0; NULL != environ[j]; ++j) {
|
|
|
|
if (0 == strncmp("OMPI_", environ[j], 5)) {
|
2006-02-07 03:32:36 +00:00
|
|
|
opal_argv_append_nosize(&apps[i]->env, environ[j]);
|
2005-11-16 22:20:33 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2006-07-27 23:45:33 +00:00
|
|
|
/* Check for well-known info keys */
|
2006-02-07 03:32:36 +00:00
|
|
|
have_wdir = 0;
|
2006-08-09 20:48:51 +00:00
|
|
|
have_prefix = false;
|
2005-04-11 12:59:43 +00:00
|
|
|
if ( array_of_info != NULL && array_of_info[i] != MPI_INFO_NULL ) {
|
2006-07-27 23:45:33 +00:00
|
|
|
|
|
|
|
/* check for 'wdir' */
|
2005-04-11 12:59:43 +00:00
|
|
|
ompi_info_get (array_of_info[i], "wdir", valuelen, cwd, &flag);
|
|
|
|
if ( flag ) {
|
2007-09-27 23:30:40 +00:00
|
|
|
apps[i]->cwd = strdup(cwd);
|
2005-04-11 12:59:43 +00:00
|
|
|
have_wdir = 1;
|
|
|
|
}
|
2006-07-27 23:45:33 +00:00
|
|
|
|
|
|
|
/* check for 'host' */
|
|
|
|
ompi_info_get (array_of_info[i], "host", sizeof(host), host, &flag);
|
|
|
|
if ( flag ) {
|
|
|
|
apps[i]->num_map = 1;
|
|
|
|
apps[i]->map_data = (orte_app_context_map_t **) malloc(sizeof(orte_app_context_map_t *));
|
|
|
|
apps[i]->map_data[0] = OBJ_NEW(orte_app_context_map_t);
|
|
|
|
apps[i]->map_data[0]->map_type = ORTE_APP_CONTEXT_MAP_HOSTNAME;
|
|
|
|
apps[i]->map_data[0]->map_data = strdup(host);
|
|
|
|
}
|
|
|
|
|
|
|
|
/* 'path', 'arch', 'file', 'soft' -- to be implemented */
|
2006-08-09 20:48:51 +00:00
|
|
|
|
|
|
|
/* check for 'ompi_prefix' (OMPI-specific -- to effect the same
|
|
|
|
* behavior as --prefix option to orterun)
|
|
|
|
*/
|
|
|
|
ompi_info_get (array_of_info[i], "ompi_prefix", sizeof(prefix), prefix, &flag);
|
|
|
|
if ( flag ) {
|
|
|
|
apps[i]->prefix_dir = strdup(prefix);
|
|
|
|
have_prefix = true;
|
|
|
|
}
|
2005-04-11 12:59:43 +00:00
|
|
|
}
|
2006-02-07 03:32:36 +00:00
|
|
|
|
|
|
|
/* default value: If the user did not tell us where to look for the
|
2005-04-11 12:59:43 +00:00
|
|
|
executable, we assume the current working directory */
|
|
|
|
if ( !have_wdir ) {
|
|
|
|
getcwd(cwd, OMPI_PATH_MAX);
|
|
|
|
apps[i]->cwd = strdup(cwd);
|
2004-12-16 15:42:02 +00:00
|
|
|
}
|
2006-08-09 20:48:51 +00:00
|
|
|
|
|
|
|
/* if the user told us a new prefix, then we leave it alone. otherwise, if
|
|
|
|
* a prefix had been provided before, copy that one into the new app_context
|
|
|
|
* for use by the spawned children
|
|
|
|
*/
|
|
|
|
if ( !have_prefix && NULL != base_prefix) {
|
|
|
|
apps[i]->prefix_dir = strdup(base_prefix);
|
|
|
|
}
|
|
|
|
|
2005-04-11 12:59:43 +00:00
|
|
|
/* leave the map info alone - the launcher will
|
|
|
|
* decide where to put things
|
|
|
|
*/
|
2004-12-16 15:42:02 +00:00
|
|
|
} /* for (i = 0 ; i < count ; ++i) */
|
2004-09-29 12:41:55 +00:00
|
|
|
|
2006-08-09 20:48:51 +00:00
|
|
|
/* cleanup */
|
2007-04-06 19:18:31 +00:00
|
|
|
if (NULL != base_prefix) {
|
|
|
|
free(base_prefix);
|
|
|
|
}
|
2004-09-29 12:41:55 +00:00
|
|
|
|
Commit the orted-failed-to-start code. This correctly causes the system to detect the failure of an orted to start and allows the system to terminate all procs/orteds that *did* start.
The primary change that underlies all this is in the OOB. Specifically, the problem in the code until now has been that the OOB attempts to resolve an address when we call the "send" to an unknown recipient. The OOB would then wait forever if that recipient never actually started (and hence, never reported back its OOB contact info). In the case of an orted that failed to start, we would correctly detect that the orted hadn't started, but then we would attempt to order all orteds (including the one that failed to start) to die. This would cause the OOB to "hang" the system.
Unfortunately, revising how the OOB resolves addresses introduced a number of additional problems. Specifically, and most troublesome, was the fact that comm_spawn involved the immediate transmission of the rendezvous point from parent-to-child after the child was spawned. The current code used the OOB address resolution as a "barrier" - basically, the parent would attempt to send the info to the child, and then "hold" there until the child's contact info had arrived (meaning the child had started) and the send could be completed.
Note that this also caused comm_spawn to "hang" the entire system if the child never started... The app-failed-to-start helped improve that behavior - this code provides additional relief.
With this change, the OOB will return an ADDRESSEE_UNKNOWN error if you attempt to send to a recipient whose contact info isn't already in the OOB's hash tables. To resolve comm_spawn issues, we also now force the cross-sharing of connection info between parent and child jobs during spawn.
Finally, to aid in setting triggers to the right values, we introduce the "arith" API for the GPR. This function allows you to atomically change the value in a registry location (either divide, multiply, add, or subtract) by the provided operand. It is equivalent to first fetching the value using a "get", then modifying it, and then putting the result back into the registry via a "put".
This commit was SVN r14711.
2007-05-21 18:31:28 +00:00
|
|
|
/* tell the RTE that we want the new job to be a child of this process's job */
|
Bring over the update to terminate orteds that are generated by a dynamic spawn such as comm_spawn. This introduces the concept of a job "family" - i.e., jobs that have a parent/child relationship. Comm_spawn'ed jobs have a parent (the one that spawned them). We track that relationship throughout the lineage - i.e., if a comm_spawned job in turn calls comm_spawn, then it has a parent (the one that spawned it) and a "root" job (the original job that started things).
Accordingly, there are new APIs to the name service to support the ability to get a job's parent, root, immediate children, and all its descendants. In addition, the terminate_job, terminate_orted, and signal_job APIs for the PLS have been modified to accept attributes that define the extent of their actions. For example, doing a "terminate_job" with an attribute of ORTE_NS_INCLUDE_DESCENDANTS will terminate the given jobid AND all jobs that descended from it.
I have tested this capability on a MacBook under rsh, Odin under SLURM, and LANL's Flash (bproc). It worked successfully on non-MPI jobs (both simple and including a spawn), and MPI jobs (again, both simple and with a spawn).
This commit was SVN r12597.
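The job "family" relationship described above can be pictured as a table of parent links. The sketch below is a simplified model, not the name service's real data structure or API: the array layout and the names DEMO_JOBID_INVALID, demo_job_root and demo_job_is_descendant are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_JOBID_INVALID (-1)   /* hypothetical marker for "no parent" */

/* parent[j] is the job that spawned job j, or DEMO_JOBID_INVALID if job j
 * was not spawned by another job. Returns the root of job's family. */
static int demo_job_root(const int *parent, size_t njobs, int job)
{
    assert(parent != NULL);
    assert(job >= 0 && (size_t)job < njobs);
    while (DEMO_JOBID_INVALID != parent[job]) {
        job = parent[job];                       /* climb one generation */
        assert(job >= 0 && (size_t)job < njobs);
    }
    return job;
}

/* True if ancestor appears anywhere on job's chain of parents.
 * A job is not considered its own descendant. */
static bool demo_job_is_descendant(const int *parent, size_t njobs,
                                   int job, int ancestor)
{
    assert(parent != NULL);
    assert(job >= 0 && (size_t)job < njobs);
    for (int p = parent[job]; DEMO_JOBID_INVALID != p; p = parent[p]) {
        if (p == ancestor) {
            return true;
        }
    }
    return false;
}
```

In this model, an action that includes descendants would apply to every job j for which demo_job_is_descendant(parent, njobs, j, target) holds, in addition to the target itself.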
2006-11-14 19:34:59 +00:00
|
|
|
if (ORTE_SUCCESS != (rc = orte_rmgr.add_attribute(&attributes, ORTE_NS_USE_PARENT,
|
|
|
|
ORTE_JOBID, &(orte_process_info.my_name->jobid),
|
|
|
|
ORTE_RMGR_ATTR_OVERRIDE))) {
|
|
|
|
ORTE_ERROR_LOG(rc);
|
|
|
|
OBJ_DESTRUCT(&attributes);
|
2006-11-22 02:06:52 +00:00
|
|
|
opal_progress_event_users_decrement();
|
2006-11-14 19:34:59 +00:00
|
|
|
return MPI_ERR_SPAWN;
|
|
|
|
}
|
|
|
|
|
2006-10-17 16:06:17 +00:00
|
|
|
/* tell the RTE that we want the children to run inside of our allocation -
|
|
|
|
* don't go get one just for them
|
|
|
|
*/
|
2006-10-18 14:01:44 +00:00
|
|
|
if (ORTE_SUCCESS != (rc = orte_rmgr.add_attribute(&attributes, ORTE_RAS_USE_PARENT_ALLOCATION,
|
2006-10-18 20:02:16 +00:00
|
|
|
ORTE_JOBID, &(orte_process_info.my_name->jobid),
|
|
|
|
ORTE_RMGR_ATTR_OVERRIDE))) {
|
|
|
|
ORTE_ERROR_LOG(rc);
|
|
|
|
OBJ_DESTRUCT(&attributes);
|
2006-11-22 02:06:52 +00:00
|
|
|
opal_progress_event_users_decrement();
|
2006-10-18 20:02:16 +00:00
|
|
|
return MPI_ERR_SPAWN;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* tell the RTE that we want the children mapped the same way as their parent */
|
|
|
|
if (ORTE_SUCCESS != (rc = orte_rmgr.add_attribute(&attributes, ORTE_RMAPS_USE_PARENT_PLAN,
|
|
|
|
ORTE_JOBID, &(orte_process_info.my_name->jobid),
|
|
|
|
ORTE_RMGR_ATTR_OVERRIDE))) {
|
2006-10-17 16:06:17 +00:00
|
|
|
ORTE_ERROR_LOG(rc);
|
|
|
|
OBJ_DESTRUCT(&attributes);
|
2006-11-22 02:06:52 +00:00
|
|
|
opal_progress_event_users_decrement();
|
2006-10-17 16:06:17 +00:00
|
|
|
return MPI_ERR_SPAWN;
|
|
|
|
}
|
|
|
|
|
These changes were mostly captured in a prior RFC (except for #2 below) and are aimed specifically at improving startup performance and setting up the remaining modifications described in that RFC.
The commit has been tested for C/R and Cray operations, and on Odin (SLURM, rsh) and RoadRunner (TM). I tried to update all environments, but obviously could not test them. I know that Windows needs some work, and have highlighted what is known to be needed in the odls process component.
This represents a lot of work by Brian, Tim P, Josh, and myself, with much advice from Jeff and others. For posterity, I have appended a copy of the email describing the work that was done:
As we have repeatedly noted, the modex operation in MPI_Init is the single greatest consumer of time during startup. To date, we have executed that operation as an ORTE stage gate that held the process until a startup message containing all required modex (and OOB contact info - see #3 below) info could be sent to it. Each process would send its data to the HNP's registry, which assembled and sent the message when all processes had reported in.
In addition, ORTE had taken responsibility for monitoring process status as it progressed through a series of "stage gates". The process reported its status at each gate, and ORTE would then send a "release" message once all procs had reported in.
The incoming changes revamp these procedures in three ways:
1. eliminating the ORTE stage gate system and cleanly delineating responsibility between the OMPI and ORTE layers for MPI init/finalize. The modex stage gate (STG1) has been replaced by a collective operation in the modex itself that performs an allgather on the required modex info. The allgather is implemented using the orte_grpcomm framework since the BTL's are not active at that point. At the moment, the grpcomm framework only has a "basic" component analogous to OMPI's "basic" coll framework - I would recommend that the MPI team create additional, more advanced components to improve performance of this step.
The other stage gates have been replaced by orte_grpcomm barrier functions. We tried to use MPI barriers instead (since the BTL's are active at that point), but - as we discussed on the telecon - these are not currently true barriers so the job would hang when we fell through while messages were still in process. Note that the grpcomm barrier doesn't actually resolve that problem, but Brian has pointed out that we are unlikely to ever see it violated. Again, you might want to spend a little time on an advanced barrier algorithm as the one in "basic" is very simplistic.
Summarizing this change: ORTE no longer tracks process state nor has direct responsibility for synchronizing jobs. This is now done via collective operations within the MPI layer, albeit using ORTE collective communication services. I -strongly- urge the MPI team to implement advanced collective algorithms to improve the performance of this critical procedure.
2. reducing the volume of data exchanged during modex. Data in the modex consisted of the process name, the name of the node where that process is located (expressed as a string), plus a string representation of all contact info. The nodename was required in order for the modex to determine if the process was local or not - in addition, some people like to have it to print pretty error messages when a connection failed.
The size of this data has been reduced in three ways:
(a) reducing the size of the process name itself. The process name consisted of two 32-bit fields for the jobid and vpid. This is far larger than any current system, or system likely to exist in the near future, can support. Accordingly, the default size of these fields has been reduced to 16 bits, which means you can have 32k procs in each of 32k jobs. Since the daemons must have a vpid, and we require one daemon/node, this also restricts the default configuration to 32k nodes.
To support any future "mega-clusters", a configuration option --enable-jumbo-apps has been added. This option increases the jobid and vpid field sizes to 32 bits. Someday, if necessary, someone can add yet another option to increase them to 64 bits, I suppose.
(b) replacing the string nodename with an integer nodeid. Since we have one daemon/node, the nodeid corresponds to the local daemon's vpid. This replaces an often lengthy string with only 2 (or at most 4) bytes, a substantial reduction.
(c) when the mca param requesting that nodenames be sent (to support pretty error messages) is set, a second mca param is now used to request the FQDN - otherwise, the domain name is stripped (by default) from the message to save space. If someone wants to combine those into a single param somehow (perhaps with an argument?), they are welcome to do so - I didn't want to alter what people are already using.
While these may seem like small savings, they actually amount to a significant impact when aggregated across the entire modex operation. Since every proc must receive the modex data regardless of the collective used to send it, just reducing the size of the process name removes nearly 400MBytes of communication from a 32k proc job (admittedly, much of this comm may occur in parallel). So it does add up pretty quickly.
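To make the size reductions in (a) and (b) concrete, here is a minimal sketch in C. The names `sketch_id_t`, `sketch_name_t` and `SKETCH_JUMBO_APPS` are hypothetical and only mirror the description above; they are not the actual ORTE definitions. The signed 16-bit type is an assumption, inferred from the stated 32k limit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the build-time switch behind
 * --enable-jumbo-apps; the real macro name is not given in the text. */
#ifdef SKETCH_JUMBO_APPS
typedef int32_t sketch_id_t;
#define SKETCH_ID_MAX INT32_MAX
#else
typedef int16_t sketch_id_t;   /* assumed signed: max 32767, i.e. "32k" */
#define SKETCH_ID_MAX INT16_MAX
#endif

/* A process name is a (jobid, vpid) pair. */
typedef struct {
    sketch_id_t jobid;
    sketch_id_t vpid;
} sketch_name_t;

/* Bytes needed to carry one name plus one nodeid. The nodeid is the local
 * daemon's vpid, so it is assumed to have the same width as a vpid. */
size_t sketch_modex_header_bytes(void)
{
    return sizeof(sketch_name_t) + sizeof(sketch_id_t);
}

/* Bytes saved per name relative to the old layout of two 32-bit fields. */
size_t sketch_name_bytes_saved(void)
{
    size_t old_bytes = 2 * sizeof(int32_t);
    assert(sizeof(sketch_name_t) <= old_bytes);
    return old_bytes - sizeof(sketch_name_t);
}
```

In the default (non-jumbo) build of this sketch, a name shrinks from 8 bytes to 4, and the nodeid adds 2 bytes in place of a nodename string.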
3. routing RML messages to reduce connections. The default messaging system remains point-to-point - i.e., each proc opens a socket to every proc it communicates with and sends its messages directly. A new option uses the orteds as routers - i.e., each proc only opens a single socket to its local orted. All messages are sent from the proc to the orted, which forwards the message to the orted on the node where the intended recipient proc is located - that orted then forwards the message to its local proc (the recipient). This greatly reduces the connection storm we have encountered during startup.
It also has the benefit of removing the sharing of every proc's OOB contact with every other proc. The orted routing tables are populated during launch since every orted gets a map of where every proc is being placed. Each proc, therefore, only needs to know the contact info for its local daemon, which is passed in via the environment when the proc is fork/exec'd by the daemon. This alone removes ~50 bytes/process of communication that was in the current STG1 startup message - so for our 32k proc job, this saves us roughly 32k*50 = 1.6MBytes sent to 32k procs = 51GBytes of messaging.
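The estimate above can be reproduced with a few lines of C. This is only a sketch of the arithmetic; the function name `sketch_bytes_saved` is hypothetical, and the 50 bytes/process figure is taken from the text as given.

```c
#include <assert.h>
#include <stdint.h>

/* Total bytes removed from the startup exchange when every process no
 * longer receives bytes_per_proc bytes about each of the nprocs processes. */
uint64_t sketch_bytes_saved(uint64_t nprocs, uint64_t bytes_per_proc)
{
    assert(nprocs > 0);
    uint64_t per_message = nprocs * bytes_per_proc; /* one startup message */
    return per_message * nprocs;                    /* sent to every proc  */
}
```

Reading "32k" as 32,000 gives 1.6 MBytes per message and 51.2 GBytes in total, which matches the figure quoted above; reading it as 32,768 gives roughly 53.7 GBytes.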
Note that you can use the new routing method by specifying -mca routed tree - if you so desire. This mode will become the default at some point in the future.
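The hop sequence described for the routed mode can be sketched as follows. This is an illustration under stated assumptions, not the routed framework's code: `sketch_hop_t`, `sketch_plan_route` and the `proc_node` array (mapping each proc to the node that hosts it, standing in for the map every orted receives at launch) are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical hop kinds for a message routed through the daemons. */
typedef enum {
    SKETCH_HOP_LOCAL_DAEMON,   /* sender -> orted on the sender's node   */
    SKETCH_HOP_REMOTE_DAEMON,  /* orted -> orted on the recipient's node */
    SKETCH_HOP_DELIVER         /* orted -> recipient proc                */
} sketch_hop_kind_t;

typedef struct {
    sketch_hop_kind_t kind;
    int node;                  /* node whose daemon handles this hop */
} sketch_hop_t;

/* proc_node[i] is the node hosting proc i. Fills path (capacity 3) with
 * the hops from src to dst and returns the number of hops. */
size_t sketch_plan_route(const int *proc_node, size_t nprocs,
                         size_t src, size_t dst, sketch_hop_t path[3])
{
    assert(NULL != proc_node && src < nprocs && dst < nprocs);
    size_t n = 0;

    /* The sender only ever talks to its own local daemon. */
    path[n++] = (sketch_hop_t){ SKETCH_HOP_LOCAL_DAEMON, proc_node[src] };

    /* Cross-node traffic is relayed daemon to daemon. */
    if (proc_node[dst] != proc_node[src]) {
        path[n++] = (sketch_hop_t){ SKETCH_HOP_REMOTE_DAEMON, proc_node[dst] };
    }

    /* The daemon on the recipient's node hands the message over. */
    path[n++] = (sketch_hop_t){ SKETCH_HOP_DELIVER, proc_node[dst] };
    return n;
}
```

The point of the design is visible in the first hop: whatever the destination, the sending proc needs exactly one socket, to its local daemon.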
There are a few minor additional changes in the commit that I'll just note in passing:
* propagation of command line mca params to the orteds - fixes ticket #1073. See note there for details.
* requiring of "finalize" prior to "exit" for MPI procs - fixes ticket #1144. See note there for details.
* cleanup of some stale header files
This commit was SVN r16364.
2007-10-05 19:48:23 +00:00
|
|
|
#if 0
|
2007-04-25 17:36:26 +00:00
|
|
|
/* tell the RTE that we want to be cross-connected to the children so we receive
|
Commit the orted-failed-to-start code. This correctly causes the system to detect the failure of an orted to start and allows the system to terminate all procs/orteds that *did* start.
The primary change that underlies all this is in the OOB. Specifically, the problem in the code until now has been that the OOB attempts to resolve an address when we call the "send" to an unknown recipient. The OOB would then wait forever if that recipient never actually started (and hence, never reported back its OOB contact info). In the case of an orted that failed to start, we would correctly detect that the orted hadn't started, but then we would attempt to order all orteds (including the one that failed to start) to die. This would cause the OOB to "hang" the system.
Unfortunately, revising how the OOB resolves addresses introduced a number of additional problems. Specifically, and most troublesome, was the fact that comm_spawn involved the immediate transmission of the rendezvous point from parent-to-child after the child was spawned. The current code used the OOB address resolution as a "barrier" - basically, the parent would attempt to send the info to the child, and then "hold" there until the child's contact info had arrived (meaning the child had started) and the send could be completed.
Note that this also caused comm_spawn to "hang" the entire system if the child never started... The app-failed-to-start helped improve that behavior - this code provides additional relief.
With this change, the OOB will return an ADDRESSEE_UNKNOWN error if you attempt to send to a recipient whose contact info isn't already in the OOB's hash tables. To resolve comm_spawn issues, we also now force the cross-sharing of connection info between parent and child jobs during spawn.
Finally, to aid in setting triggers to the right values, we introduce the "arith" API for the GPR. This function allows you to atomically change the value in a registry location (either divide, multiply, add, or subtract) by the provided operand. It is equivalent to first fetching the value using a "get", then modifying it, and then putting the result back into the registry via a "put".
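The semantics of such an atomic "arith" call can be sketched in C as below. How the GPR actually guarantees atomicity is not described here, so this sketch substitutes a C11 compare-and-swap loop on a single integer; `sketch_registry_arith` and the `SKETCH_*` operation names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

typedef enum { SKETCH_ADD, SKETCH_SUB, SKETCH_MUL, SKETCH_DIV } sketch_arith_op_t;

/* Plain (non-atomic) application of one operation. */
static int sketch_apply(sketch_arith_op_t op, int value, int operand)
{
    switch (op) {
    case SKETCH_ADD: return value + operand;
    case SKETCH_SUB: return value - operand;
    case SKETCH_MUL: return value * operand;
    case SKETCH_DIV: return value / operand;
    }
    return value;
}

/* Atomically replaces *slot with (*slot op operand) and returns the new
 * value. Behaves like get + modify + put, but no other update can slip in
 * between the read and the write. */
int sketch_registry_arith(atomic_int *slot, sketch_arith_op_t op, int operand)
{
    assert(!(SKETCH_DIV == op && 0 == operand));
    int expected = atomic_load(slot);
    int desired;
    do {
        desired = sketch_apply(op, expected, operand);
        /* On failure, expected is refreshed with the current value. */
    } while (!atomic_compare_exchange_weak(slot, &expected, desired));
    return desired;
}
```

A caller adjusting a trigger level would, for example, call `sketch_registry_arith(&level, SKETCH_ADD, 1)` instead of issuing a separate get and put.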
This commit was SVN r14711.
2007-05-21 18:31:28 +00:00
|
|
|
* their ORTE-level information - e.g., OOB contact info - when they
|
|
|
|
* reach the STG1 stage gate
|
2007-04-25 17:36:26 +00:00
|
|
|
*/
|
2007-05-21 18:31:28 +00:00
|
|
|
state = ORTE_PROC_STATE_AT_STG1;
|
2007-04-25 17:36:26 +00:00
|
|
|
if (ORTE_SUCCESS != (rc = orte_rmgr.add_attribute(&attributes, ORTE_RMGR_XCONNECT_AT_SPAWN,
|
2007-05-21 18:31:28 +00:00
|
|
|
ORTE_PROC_STATE, &state,
|
2007-04-25 17:36:26 +00:00
|
|
|
ORTE_RMGR_ATTR_OVERRIDE))) {
|
|
|
|
ORTE_ERROR_LOG(rc);
|
|
|
|
OBJ_DESTRUCT(&attributes);
|
|
|
|
opal_progress_event_users_decrement();
|
|
|
|
return MPI_ERR_SPAWN;
|
|
|
|
}
|
These changes were mostly captured in a prior RFC (except for #2 below) and are aimed specifically at improving startup performance and setting up the remaining modifications described in that RFC.
The commit has been tested for C/R and Cray operations, and on Odin (SLURM, rsh) and RoadRunner (TM). I tried to update all environments, but obviously could not test them. I know that Windows needs some work, and have highlighted what is known to be needed in the odls process component.
This represents a lot of work by Brian, Tim P, Josh, and myself, with much advice from Jeff and others. For posterity, I have appended a copy of the email describing the work that was done:
2007-10-05 19:48:23 +00:00
|
|
|
#endif
|
2007-04-25 17:36:26 +00:00
|
|
|
|
2006-10-31 23:32:39 +00:00
|
|
|
/* check for timing request - get stop time and report elapsed time if so */
|
|
|
|
if (timing) {
|
|
|
|
if (0 != gettimeofday(&ompistop, NULL)) {
|
|
|
|
opal_output(0, "ompi_comm_start_procs: could not obtain stop time");
|
|
|
|
} else {
|
|
|
|
opal_output(0, "ompi_comm_start_procs: time from start to prepare to spawn %ld usec",
|
|
|
|
(long int)((ompistop.tv_sec - ompistart.tv_sec)*1000000 +
|
|
|
|
(ompistop.tv_usec - ompistart.tv_usec)));
|
|
|
|
if (0 != gettimeofday(&ompistart, NULL)) {
|
|
|
|
opal_output(0, "ompi_comm_start_procs: could not obtain new start time");
|
|
|
|
ompistart.tv_sec = ompistop.tv_sec;
|
|
|
|
ompistart.tv_usec = ompistop.tv_usec;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2005-04-11 12:59:43 +00:00
|
|
|
/* spawn procs */
|
2007-04-06 19:18:31 +00:00
|
|
|
rc = orte_rmgr.spawn_job(apps, count, &new_jobid, 0, NULL, NULL,
|
|
|
|
ORTE_PROC_STATE_NONE, &attributes);
|
|
|
|
if (ORTE_SUCCESS != rc) {
|
2006-02-07 03:32:36 +00:00
|
|
|
ORTE_ERROR_LOG(rc);
|
2006-11-22 02:06:52 +00:00
|
|
|
opal_progress_event_users_decrement();
|
2005-04-21 14:58:25 +00:00
|
|
|
return MPI_ERR_SPAWN;
|
2004-09-29 12:41:55 +00:00
|
|
|
}
|
2006-02-07 03:32:36 +00:00
|
|
|
|
2006-10-31 23:32:39 +00:00
|
|
|
/* check for timing request - get stop time and report elapsed time if so */
|
|
|
|
if (timing) {
|
|
|
|
if (0 != gettimeofday(&ompistop, NULL)) {
|
|
|
|
opal_output(0, "ompi_comm_start_procs: could not obtain stop time");
|
|
|
|
} else {
|
|
|
|
opal_output(0, "ompi_comm_start_procs: time to spawn %ld usec",
|
|
|
|
(long int)((ompistop.tv_sec - ompistart.tv_sec)*1000000 +
|
|
|
|
(ompistop.tv_usec - ompistart.tv_usec)));
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2005-04-14 16:05:17 +00:00
|
|
|
/* clean up */
|
2006-11-22 02:06:52 +00:00
|
|
|
opal_progress_event_users_decrement();
|
2007-04-06 19:18:31 +00:00
|
|
|
while (NULL != (item = opal_list_remove_first(&attributes))) {
|
|
|
|
OBJ_RELEASE(item);
|
|
|
|
}
|
2006-10-18 20:02:16 +00:00
|
|
|
OBJ_DESTRUCT(&attributes);
|
|
|
|
|
2005-04-14 16:05:17 +00:00
|
|
|
for ( i=0; i<count; i++) {
|
2007-04-06 19:18:31 +00:00
|
|
|
OBJ_RELEASE(apps[i]);
|
2004-12-16 15:42:02 +00:00
|
|
|
}
|
2005-04-14 16:05:17 +00:00
|
|
|
free (apps);
|
|
|
|
|
2004-09-29 12:41:55 +00:00
|
|
|
return OMPI_SUCCESS;
|
|
|
|
}
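The elapsed-time expression is written out twice in the function above. A minimal sketch of a helper that factors it out is shown below; `sketch_elapsed_usec` is a hypothetical name, not an existing OMPI or OPAL function.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

/* Microseconds between two gettimeofday() samples. The tv_usec difference
 * may be negative; adding it to the scaled tv_sec difference still yields
 * the correct total. */
long sketch_elapsed_usec(const struct timeval *start, const struct timeval *stop)
{
    assert(NULL != start && NULL != stop);
    return (long)(stop->tv_sec - start->tv_sec) * 1000000L
         + (long)(stop->tv_usec - start->tv_usec);
}
```

With such a helper, each timing report reduces to one call, for example `opal_output(0, "... %ld usec", sketch_elapsed_usec(&ompistart, &ompistop))`.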
|
2006-02-07 03:32:36 +00:00
|
|
|
|
2004-10-26 11:37:58 +00:00
|
|
|
/**********************************************************************/
|
|
|
|
/**********************************************************************/
|
|
|
|
/**********************************************************************/
|
2004-09-29 12:41:55 +00:00
|
|
|
int ompi_comm_dyn_init (void)
|
|
|
|
{
|
2004-10-01 14:06:23 +00:00
|
|
|
char *envvarname=NULL, *port_name=NULL;
|
2004-09-29 12:41:55 +00:00
|
|
|
char *oob_port=NULL;
|
2005-03-14 20:57:21 +00:00
|
|
|
int root=0, send_first=1, rc;
|
|
|
|
orte_rml_tag_t tag;
|
2004-10-01 14:06:23 +00:00
|
|
|
ompi_communicator_t *newcomm=NULL;
|
2005-03-14 20:57:21 +00:00
|
|
|
orte_process_name_t *port_proc_name=NULL;
|
2004-11-05 12:58:14 +00:00
|
|
|
ompi_group_t *group = NULL;
|
|
|
|
ompi_errhandler_t *errhandler = NULL;
|
2004-09-29 12:41:55 +00:00
|
|
|
|
|
|
|
/* check for appropriate env variable */
|
2005-04-14 16:05:17 +00:00
|
|
|
asprintf(&envvarname, "OMPI_PARENT_PORT");
|
2004-09-29 12:41:55 +00:00
|
|
|
port_name = getenv(envvarname);
|
|
|
|
free (envvarname);
|
2006-02-07 03:32:36 +00:00
|
|
|
|
2004-09-29 12:41:55 +00:00
|
|
|
/* if env-variable is set, parse port and call comm_connect_accept */
|
|
|
|
if (NULL != port_name ) {
|
2007-04-06 19:18:31 +00:00
|
|
|
ompi_communicator_t *oldcomm;
|
2004-10-29 18:39:01 +00:00
|
|
|
|
2007-04-06 19:18:31 +00:00
|
|
|
/* split the content of the environment variable into
|
|
|
|
its pieces, which are : port_name and tag */
|
|
|
|
oob_port = ompi_parse_port (port_name, &tag);
|
|
|
|
rc = orte_ns.convert_string_to_process_name(&port_proc_name, oob_port);
|
|
|
|
if (ORTE_SUCCESS != rc) {
|
|
|
|
ORTE_ERROR_LOG(rc);
|
|
|
|
return rc;
|
|
|
|
}
|
2005-03-14 20:57:21 +00:00
|
|
|
|
2007-04-06 19:18:31 +00:00
|
|
|
rc = ompi_comm_connect_accept (MPI_COMM_WORLD, root, port_proc_name,
|
|
|
|
send_first, &newcomm, tag );
|
|
|
|
if (ORTE_SUCCESS != rc) {
|
|
|
|
return rc;
|
|
|
|
}
|
2005-09-28 06:13:51 +00:00
|
|
|
|
2007-04-06 19:18:31 +00:00
|
|
|
/* Set the parent communicator */
|
|
|
|
ompi_mpi_comm_parent = newcomm;
|
2004-09-29 12:41:55 +00:00
|
|
|
|
2007-04-06 19:18:31 +00:00
|
|
|
/* originally, we set comm_parent to comm_null (in comm_init),
|
|
|
|
* now we have to decrease the reference counters of the corresponding
|
|
|
|
* objects
|
|
|
|
*/
|
2004-11-05 12:58:14 +00:00
|
|
|
|
2007-04-06 19:18:31 +00:00
|
|
|
oldcomm = &ompi_mpi_comm_null;
|
|
|
|
OBJ_RELEASE(oldcomm);
|
2004-11-05 12:58:14 +00:00
|
|
|
group = &ompi_mpi_group_null;
|
2007-04-06 19:18:31 +00:00
|
|
|
OBJ_RELEASE(group);
|
|
|
|
errhandler = &ompi_mpi_errors_are_fatal;
|
|
|
|
OBJ_RELEASE(errhandler);
|
2004-10-26 17:25:49 +00:00
|
|
|
|
2007-04-06 19:18:31 +00:00
|
|
|
/* Set name for debugging purposes */
|
|
|
|
snprintf(newcomm->c_name, MPI_MAX_OBJECT_NAME, "MPI_COMM_PARENT");
|
2004-09-29 12:41:55 +00:00
|
|
|
}
|
2006-02-07 03:32:36 +00:00
|
|
|
|
2004-09-29 12:41:55 +00:00
|
|
|
return OMPI_SUCCESS;
|
|
|
|
}
|
2004-10-26 11:37:58 +00:00
|
|
|
|
|
|
|
/**********************************************************************/
|
|
|
|
/**********************************************************************/
|
|
|
|
/**********************************************************************/
|
|
|
|
/* this routine runs through the list of communicators and
|
|
|
|
does the disconnect for all dynamic communicators */
|
2004-10-26 14:54:23 +00:00
|
|
|
int ompi_comm_dyn_finalize (void)
|
2004-10-26 11:37:58 +00:00
|
|
|
{
|
2004-10-26 14:54:23 +00:00
|
|
|
int i,j=0, max=0;
|
|
|
|
ompi_comm_disconnect_obj **objs=NULL;
|
|
|
|
ompi_communicator_t *comm=NULL;
|
|
|
|
|
|
|
|
if ( 1 <ompi_comm_num_dyncomm ) {
|
2007-04-06 19:18:31 +00:00
|
|
|
objs = (ompi_comm_disconnect_obj **)malloc (ompi_comm_num_dyncomm*
|
|
|
|
sizeof(ompi_comm_disconnect_obj*));
|
|
|
|
if ( NULL == objs ) {
|
|
|
|
return OMPI_ERR_OUT_OF_RESOURCE;
|
|
|
|
}
|
2006-02-07 03:32:36 +00:00
|
|
|
|
2007-12-21 06:02:00 +00:00
|
|
|
max = opal_pointer_array_get_size(&ompi_mpi_communicators);
|
2007-04-06 19:18:31 +00:00
|
|
|
for ( i=3; i<max; i++ ) {
|
2007-12-21 06:02:00 +00:00
|
|
|
comm = (ompi_communicator_t*)opal_pointer_array_get_item(&ompi_mpi_communicators,i);
|
2007-04-06 19:18:31 +00:00
|
|
|
if ( OMPI_COMM_IS_DYNAMIC(comm)) {
|
|
|
|
objs[j++]=ompi_comm_disconnect_init(comm);
|
|
|
|
}
|
2006-02-07 03:32:36 +00:00
|
|
|
}
|
|
|
|
|
2007-04-06 19:18:31 +00:00
|
|
|
if ( j != ompi_comm_num_dyncomm+1 ) {
|
|
|
|
free (objs);
|
|
|
|
return OMPI_ERROR;
|
|
|
|
}
|
2006-02-07 03:32:36 +00:00
|
|
|
|
2007-04-06 19:18:31 +00:00
|
|
|
ompi_comm_disconnect_waitall (ompi_comm_num_dyncomm, objs);
|
|
|
|
free (objs);
|
2006-02-07 03:32:36 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
|
2004-10-26 14:54:23 +00:00
|
|
|
return OMPI_SUCCESS;
|
2004-10-26 11:37:58 +00:00
|
|
|
}
|
2004-10-26 14:54:23 +00:00
|
|
|
|
2004-10-26 11:37:58 +00:00
|
|
|
/**********************************************************************/
|
|
|
|
/**********************************************************************/
|
|
|
|
/**********************************************************************/
|
|
|
|
|
|
|
|
ompi_comm_disconnect_obj *ompi_comm_disconnect_init ( ompi_communicator_t *comm)
|
|
|
|
{
|
|
|
|
ompi_comm_disconnect_obj *obj=NULL;
|
|
|
|
int ret;
|
|
|
|
int i;
|
|
|
|
|
|
|
|
obj = (ompi_comm_disconnect_obj *) calloc(1,sizeof(ompi_comm_disconnect_obj));
|
2007-04-06 19:18:31 +00:00
|
|
|
if ( NULL == obj ) {
|
|
|
|
return NULL;
|
2004-10-26 11:37:58 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
if ( OMPI_COMM_IS_INTER(comm) ) {
|
2007-04-06 19:18:31 +00:00
|
|
|
obj->size = ompi_comm_remote_size (comm);
|
|
|
|
} else {
|
|
|
|
obj->size = ompi_comm_size (comm);
|
2004-10-26 11:37:58 +00:00
|
|
|
}
|
2006-02-07 03:32:36 +00:00
|
|
|
|
2004-10-26 11:37:58 +00:00
|
|
|
obj->comm = comm;
|
|
|
|
obj->reqs = (ompi_request_t **) malloc(2*obj->size*sizeof(ompi_request_t *));
|
|
|
|
if ( NULL == obj->reqs ) {
|
2007-04-06 19:18:31 +00:00
|
|
|
free (obj);
|
|
|
|
return NULL;
|
2004-10-26 11:37:58 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
/* initiate all isend_irecvs. We use a dummy buffer stored on
|
|
|
|
the object, since we are sending zero size messages anyway. */
|
|
|
|
for ( i=0; i < obj->size; i++ ) {
|
2007-04-06 19:18:31 +00:00
|
|
|
ret = MCA_PML_CALL(irecv (&(obj->buf), 0, MPI_INT, i,
|
|
|
|
OMPI_COMM_BARRIER_TAG, comm,
|
|
|
|
&(obj->reqs[2*i])));
|
|
|
|
|
|
|
|
if ( OMPI_SUCCESS != ret ) {
|
|
|
|
free (obj->reqs);
|
|
|
|
free (obj);
|
|
|
|
return NULL;
|
|
|
|
}
|
2006-02-07 03:32:36 +00:00
|
|
|
|
2007-04-06 19:18:31 +00:00
|
|
|
ret = MCA_PML_CALL(isend (&(obj->buf), 0, MPI_INT, i,
|
|
|
|
OMPI_COMM_BARRIER_TAG,
|
|
|
|
MCA_PML_BASE_SEND_SYNCHRONOUS,
|
|
|
|
comm, &(obj->reqs[2*i+1])));
|
2006-02-07 03:32:36 +00:00
|
|
|
|
2007-04-06 19:18:31 +00:00
|
|
|
if ( OMPI_SUCCESS != ret ) {
|
|
|
|
free (obj->reqs);
|
|
|
|
free (obj);
|
|
|
|
return NULL;
|
|
|
|
}
|
2006-02-07 03:32:36 +00:00
|
|
|
}
|
|
|
|
|
2004-10-26 11:37:58 +00:00
|
|
|
/* return handle */
|
|
|
|
return obj;
|
|
|
|
}
|
|
|
|
/**********************************************************************/
|
|
|
|
/**********************************************************************/
|
|
|
|
/**********************************************************************/
|
|
|
|
/* - count how many requests are active
|
|
|
|
* - generate a request array large enough to hold
|
|
|
|
all active requests
|
|
|
|
* - call waitall on the overall request array
|
|
|
|
* - free the objects
|
|
|
|
*/
|
|
|
|
void ompi_comm_disconnect_waitall (int count, ompi_comm_disconnect_obj **objs)
|
|
|
|
{
|
2006-02-07 03:32:36 +00:00
|
|
|
|
2004-10-26 11:37:58 +00:00
|
|
|
ompi_request_t **reqs=NULL;
|
|
|
|
char *treq=NULL;
|
|
|
|
int totalcount = 0;
|
|
|
|
int i;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
for (i=0; i<count; i++) {
|
2007-04-06 19:18:31 +00:00
|
|
|
if (NULL == objs[i]) {
|
|
|
|
printf("Error in comm_disconnect_waitall\n");
|
|
|
|
return;
|
|
|
|
}
|
2006-02-07 03:32:36 +00:00
|
|
|
|
2007-04-06 19:18:31 +00:00
|
|
|
totalcount += objs[i]->size;
|
2006-02-07 03:32:36 +00:00
|
|
|
}
|
|
|
|
|
2004-10-26 11:37:58 +00:00
|
|
|
reqs = (ompi_request_t **) malloc (2*totalcount*sizeof(ompi_request_t *));
|
|
|
|
if ( NULL == reqs ) {
|
2007-04-06 19:18:31 +00:00
|
|
|
printf("ompi_comm_disconnect_waitall: error allocating memory\n");
|
|
|
|
return;
|
2004-10-26 11:37:58 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
/* generate a single, large array of pending requests */
|
|
|
|
treq = (char *)reqs;
|
|
|
|
for (i=0; i<count; i++) {
|
2007-04-06 19:18:31 +00:00
|
|
|
memcpy (treq, objs[i]->reqs, 2*objs[i]->size * sizeof(ompi_request_t *));
|
|
|
|
treq += 2*objs[i]->size * sizeof(ompi_request_t *);
|
2004-10-26 11:37:58 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
/* force all non-blocking all-to-alls to finish */
|
|
|
|
ret = ompi_request_wait_all (2*totalcount, reqs, MPI_STATUSES_IGNORE);
|
|
|
|
|
|
|
|
/* Finally, free everything */
|
|
|
|
for (i=0; i< count; i++ ) {
|
2007-04-06 19:18:31 +00:00
|
|
|
if (NULL != objs[i]->reqs ) {
|
|
|
|
free (objs[i]->reqs );
|
|
|
|
free (objs[i]);
|
|
|
|
}
|
2006-02-07 03:32:36 +00:00
|
|
|
}
|
|
|
|
|
2004-10-26 11:37:58 +00:00
|
|
|
free (reqs);
|
|
|
|
|
|
|
|
/* decrease the counter for dynamic communicators by 'count'.
|
|
|
|
Note that this approach requires that we use
|
|
|
|
these routines only for communicators which have been flagged dynamic */
|
|
|
|
ompi_comm_num_dyncomm -=count;
|
|
|
|
|
|
|
|
return;
|
|
|
|
}
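The "single, large array" step in the function above can be sketched in isolation as follows. This is an illustration only: `sketch_disconnect_obj_t` is a hypothetical stand-in that keeps just the `size` and `reqs` fields, with `void *` in place of `ompi_request_t *`, and it advances a typed cursor rather than a `char *`.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for ompi_comm_disconnect_obj. */
typedef struct {
    int size;     /* number of peers; the object holds 2*size requests */
    void **reqs;  /* stand-in for ompi_request_t **                    */
} sketch_disconnect_obj_t;

/* Concatenates the request arrays of all objects into one malloc'ed array.
 * Returns NULL on a NULL object, a negative size, or allocation failure;
 * on success *total receives the number of requests copied. */
void **sketch_flatten_requests(int count, sketch_disconnect_obj_t **objs,
                               size_t *total)
{
    assert(NULL != objs && NULL != total);

    /* First pass: validate and count, one irecv plus one isend per peer. */
    size_t n = 0;
    for (int i = 0; i < count; i++) {
        if (NULL == objs[i] || objs[i]->size < 0) {
            return NULL;
        }
        n += 2 * (size_t)objs[i]->size;
    }

    void **flat = malloc((n > 0 ? n : 1) * sizeof(void *));
    if (NULL == flat) {
        return NULL;
    }

    /* Second pass: copy each object's requests behind the previous ones. */
    void **cursor = flat;
    for (int i = 0; i < count; i++) {
        size_t len = 2 * (size_t)objs[i]->size;
        if (len > 0) {
            memcpy(cursor, objs[i]->reqs, len * sizeof(void *));
        }
        cursor += len;
    }

    *total = n;
    return flat;
}
```

The caller owns the returned array and frees it after the wait completes, as the function above does with `reqs`.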
|
|
|
|
|
|
|
|
/**********************************************************************/
|
|
|
|
/**********************************************************************/
|
|
|
|
/**********************************************************************/
|
|
|
|
#define OMPI_COMM_MAXJOBIDS 64
|
|
|
|
void ompi_comm_mark_dyncomm (ompi_communicator_t *comm)
|
|
|
|
{
|
Bring over the update to terminate orteds that are generated by a dynamic spawn such as comm_spawn. This introduces the concept of a job "family" - i.e., jobs that have a parent/child relationship. Comm_spawn'ed jobs have a parent (the one that spawned them). We track that relationship throughout the lineage - i.e., if a comm_spawned job in turn calls comm_spawn, then it has a parent (the one that spawned it) and a "root" job (the original job that started things).
Accordingly, there are new APIs to the name service to support the ability to get a job's parent, root, immediate children, and all its descendants. In addition, the terminate_job, terminate_orted, and signal_job APIs for the PLS have been modified to accept attributes that define the extent of their actions. For example, doing a "terminate_job" with an attribute of ORTE_NS_INCLUDE_DESCENDANTS will terminate the given jobid AND all jobs that descended from it.
I have tested this capability on a MacBook under rsh, Odin under SLURM, and LANL's Flash (bproc). It worked successfully on non-MPI jobs (both simple and including a spawn), and MPI jobs (again, both simple and with a spawn).
This commit was SVN r12597.
2006-11-14 19:34:59 +00:00
|
|
|
int i, j, numjobids=0;
|
2004-10-26 11:37:58 +00:00
|
|
|
int size, rsize;
|
|
|
|
int found;
|
2005-03-14 20:57:21 +00:00
|
|
|
orte_jobid_t jobids[OMPI_COMM_MAXJOBIDS], thisjobid;
|
2004-10-26 11:37:58 +00:00
|
|
|
ompi_group_t *grp=NULL;
|
2007-08-04 00:41:26 +00:00
|
|
|
ompi_proc_t *proc = NULL;
|
2004-10-26 11:37:58 +00:00
|
|
|
|
|
|
|
/* special case for MPI_COMM_NULL */
|
|
|
|
if ( comm == MPI_COMM_NULL ) {
|
2007-04-06 19:18:31 +00:00
|
|
|
return;
|
2004-10-26 11:37:58 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
size = ompi_comm_size (comm);
|
|
|
|
rsize = ompi_comm_remote_size(comm);
|
|
|
|
|
|
|
|
/* loop over all processes in local group and count number
|
|
|
|
of different jobids. */
|
|
|
|
grp = comm->c_local_group;
|
|
|
|
for (i=0; i< size; i++) {
|
2007-09-13 14:00:59 +00:00
|
|
|
proc = ompi_group_peer_lookup(grp,i);
|
|
|
|
thisjobid = proc->proc_name.jobid;
|
2006-11-14 19:34:59 +00:00
|
|
|
found = 0;
|
|
|
|
for ( j=0; j<numjobids; j++) {
|
|
|
|
if (thisjobid == jobids[j]) {
|
|
|
|
found = 1;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
if (!found && numjobids < OMPI_COMM_MAXJOBIDS) { /* never write past the end of jobids[] */
|
|
|
|
jobids[numjobids++] = thisjobid;
|
2006-02-07 03:32:36 +00:00
|
|
|
}
|
2004-10-26 11:37:58 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
/* if inter-comm, loop over all processes in remote_group
|
|
|
|
and count number of different jobids */
|
|
|
|
grp = comm->c_remote_group;
|
|
|
|
for (i=0; i< rsize; i++) {
|
2007-09-13 14:00:59 +00:00
|
|
|
proc = ompi_group_peer_lookup(grp,i);
|
|
|
|
thisjobid = proc->proc_name.jobid;
|
2006-11-14 19:34:59 +00:00
|
|
|
found = 0;
|
|
|
|
for ( j=0; j<numjobids; j++) {
|
|
|
|
if ( thisjobid == jobids[j]) {
|
|
|
|
found = 1;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
if (!found && numjobids < OMPI_COMM_MAXJOBIDS) { /* never write past the end of jobids[] */
|
|
|
|
jobids[numjobids++] = thisjobid;
|
2006-02-07 03:32:36 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2004-10-26 11:37:58 +00:00
|
|
|
/* if the number of jobids is larger than one, set the disconnect flag */
|
|
|
|
if ( numjobids > 1 ) {
|
2007-04-06 19:18:31 +00:00
|
|
|
ompi_comm_num_dyncomm++;
|
|
|
|
OMPI_COMM_SET_DYNAMIC(comm);
|
2004-10-26 11:37:58 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
return;
|
|
|
|
}
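
The jobid scan in ompi_comm_mark_dyncomm can be illustrated in isolation. The following is a minimal sketch, not the Open MPI implementation: `sketch_jobid_t`, `sketch_collect_jobids` and `sketch_is_dynamic` are hypothetical names introduced here, and plain arrays of job IDs stand in for the local and remote groups. It assumes that only the question "more than one distinct jobid?" matters, so the table of seen IDs can be capped without changing the result.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for orte_jobid_t; the real type is defined by ORTE. */
typedef unsigned int sketch_jobid_t;

/*
 * Record the distinct job IDs found in ids[0..n-1] in seen[], storing at
 * most max_seen of them. *numseen carries the running count across calls,
 * so a local and a remote group can be scanned one after the other.
 */
static void sketch_collect_jobids(const sketch_jobid_t *ids, size_t n,
                                  sketch_jobid_t *seen, size_t max_seen,
                                  size_t *numseen)
{
    size_t i, j;

    assert(seen != NULL && numseen != NULL && *numseen <= max_seen);

    for (i = 0; i < n; i++) {
        int found = 0;
        for (j = 0; j < *numseen; j++) {
            if (ids[i] == seen[j]) {
                found = 1;
                break;
            }
        }
        /* The bound check keeps the write inside seen[]. */
        if (!found && *numseen < max_seen) {
            seen[(*numseen)++] = ids[i];
        }
    }
}

/* Returns 1 if the two groups together span more than one job, else 0. */
static int sketch_is_dynamic(const sketch_jobid_t *local, size_t size,
                             const sketch_jobid_t *remote, size_t rsize)
{
    sketch_jobid_t seen[2];   /* two slots suffice to detect "more than one" */
    size_t numseen = 0;

    sketch_collect_jobids(local, size, seen, 2, &numseen);
    sketch_collect_jobids(remote, rsize, seen, 2, &numseen);
    return numseen > 1;
}
```

For an intra-communicator the remote group is empty, so a call such as `sketch_is_dynamic(local, size, NULL, 0)` only inspects the local IDs. Sharing one helper between both scans also avoids duplicating the lookup loop.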
|