2007-07-09 17:16:34 +00:00
/*
 * Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
 *                         University Research and Technology
 *                         Corporation. All rights reserved.
 * Copyright (c) 2004-2005 The University of Tennessee and The University
 *                         of Tennessee Research Foundation. All rights
 *                         reserved.
 * Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
 *                         University of Stuttgart. All rights reserved.
 * Copyright (c) 2004-2005 The Regents of the University of California.
 *                         All rights reserved.
 * Copyright (c) 2006-2007 Los Alamos National Security, LLC. All rights
 *                         reserved.
 * $COPYRIGHT$
 *
 * Additional copyrights may follow
 *
 * $HEADER$
 */

/** @file
 * Open MPI module-related data transfer mechanism
 *
 * A system for publishing module-related data for global
 * initialization. Known simply as the "modex", this interface
 * provides a system for sharing data, particularly data related to
 * modules and their availability on the system.
 *
 * The modex system is tightly integrated into the general run-time
 * initialization system and takes advantage of global update periods
 * to minimize the amount of network traffic. All updates are also
 * stored in the general purpose registry, and can be read at any time
 * during the life of the process. Care should be taken not to call
 * the blocking receive during the first stage of global
 * initialization, as data will not be available and the process will
 * likely hang.
 *
 * @note For the purpose of this interface, two components are
 * "corresponding" if:
 * - they share the same major and minor MCA version number
 * - they have the same type name string
 * - they share the same major and minor type version number
 * - they have the same component name string
 * - they share the same major and minor component version number
 */

#ifndef MCA_OMPI_MODULE_EXCHANGE_H
#define MCA_OMPI_MODULE_EXCHANGE_H

#ifdef HAVE_SYS_TYPES_H
#include <sys/types.h>
#endif

These changes were mostly captured in a prior RFC (except for #2 below) and are aimed specifically at improving startup performance and setting up the remaining modifications described in that RFC.
The commit has been tested for C/R and Cray operations, and on Odin (SLURM, rsh) and RoadRunner (TM). I tried to update all environments, but obviously could not test them. I know that Windows needs some work, and have highlighted what is known to be needed in the odls process component.
This represents a lot of work by Brian, Tim P, Josh, and myself, with much advice from Jeff and others. For posterity, I have appended a copy of the email describing the work that was done:
As we have repeatedly noted, the modex operation in MPI_Init is the single greatest consumer of time during startup. To-date, we have executed that operation as an ORTE stage gate that held the process until a startup message containing all required modex (and OOB contact info - see #3 below) info could be sent to it. Each process would send its data to the HNP's registry, which assembled and sent the message when all processes had reported in.
In addition, ORTE had taken responsibility for monitoring process status as it progressed through a series of "stage gates". The process reported its status at each gate, and ORTE would then send a "release" message once all procs had reported in.
The incoming changes revamp these procedures in three ways:
1. eliminating the ORTE stage gate system and cleanly delineating responsibility between the OMPI and ORTE layers for MPI init/finalize. The modex stage gate (STG1) has been replaced by a collective operation in the modex itself that performs an allgather on the required modex info. The allgather is implemented using the orte_grpcomm framework since the BTL's are not active at that point. At the moment, the grpcomm framework only has a "basic" component analogous to OMPI's "basic" coll framework - I would recommend that the MPI team create additional, more advanced components to improve performance of this step.
The other stage gates have been replaced by orte_grpcomm barrier functions. We tried to use MPI barriers instead (since the BTL's are active at that point), but - as we discussed on the telecon - these are not currently true barriers so the job would hang when we fell through while messages were still in process. Note that the grpcomm barrier doesn't actually resolve that problem, but Brian has pointed out that we are unlikely to ever see it violated. Again, you might want to spend a little time on an advanced barrier algorithm as the one in "basic" is very simplistic.
Summarizing this change: ORTE no longer tracks process state nor has direct responsibility for synchronizing jobs. This is now done via collective operations within the MPI layer, albeit using ORTE collective communication services. I -strongly- urge the MPI team to implement advanced collective algorithms to improve the performance of this critical procedure.
2. reducing the volume of data exchanged during modex. Data in the modex consisted of the process name, the name of the node where that process is located (expressed as a string), plus a string representation of all contact info. The nodename was required in order for the modex to determine if the process was local or not - in addition, some people like to have it to print pretty error messages when a connection failed.
The size of this data has been reduced in three ways:
(a) reducing the size of the process name itself. The process name consisted of two 32-bit fields for the jobid and vpid. This is far larger than any current system, or system likely to exist in the near future, can support. Accordingly, the default size of these fields has been reduced to 16-bits, which means you can have 32k procs in each of 32k jobs. Since the daemons must have a vpid, and we require one daemon/node, this also restricts the default configuration to 32k nodes.
To support any future "mega-clusters", a configuration option --enable-jumbo-apps has been added. This option increases the jobid and vpid field sizes to 32-bits. Someday, if necessary, someone can add yet another option to increase them to 64-bits, I suppose.
(b) replacing the string nodename with an integer nodeid. Since we have one daemon/node, the nodeid corresponds to the local daemon's vpid. This replaces an often lengthy string with only 2 (or at most 4) bytes, a substantial reduction.
(c) when the mca param requesting that nodenames be sent to support pretty error messages is set, a second mca param is now used to request the FQDN - otherwise, the domain name is stripped (by default) from the message to save space. If someone wants to combine those into a single param somehow (perhaps with an argument?), they are welcome to do so - I didn't want to alter what people are already using.
While these may seem like small savings, they actually amount to a significant impact when aggregated across the entire modex operation. Since every proc must receive the modex data regardless of the collective used to send it, just reducing the size of the process name removes nearly 400MBytes of communication from a 32k proc job (admittedly, much of this comm may occur in parallel). So it does add up pretty quickly.
3. routing RML messages to reduce connections. The default messaging system remains point-to-point - i.e., each proc opens a socket to every proc it communicates with and sends its messages directly. A new option uses the orteds as routers - i.e., each proc only opens a single socket to its local orted. All messages are sent from the proc to the orted, which forwards the message to the orted on the node where the intended recipient proc is located - that orted then forwards the message to its local proc (the recipient). This greatly reduces the connection storm we have encountered during startup.
It also has the benefit of removing the sharing of every proc's OOB contact with every other proc. The orted routing tables are populated during launch since every orted gets a map of where every proc is being placed. Each proc, therefore, only needs to know the contact info for its local daemon, which is passed in via the environment when the proc is fork/exec'd by the daemon. This alone removes ~50 bytes/process of communication that was in the current STG1 startup message - so for our 32k proc job, this saves us roughly 32k*50 = 1.6MBytes sent to 32k procs = 51GBytes of messaging.
Note that you can use the new routing method by specifying -mca routed tree - if you so desire. This mode will become the default at some point in the future.
There are a few minor additional changes in the commit that I'll just note in passing:
* propagation of command line mca params to the orteds - fixes ticket #1073. See note there for details.
* requiring of "finalize" prior to "exit" for MPI procs - fixes ticket #1144. See note there for details.
* cleanup of some stale header files
This commit was SVN r16364.
2007-10-05 19:48:23 +00:00

#include "orte/dss/dss_types.h"
#include "orte/mca/ns/ns_types.h"

struct mca_base_component_t;
struct ompi_proc_t;

BEGIN_C_DECLS

/**
 * Send a module-specific buffer to all other corresponding MCA
 * modules in peer processes
 *
 * This function takes a contiguous buffer of network-ordered data
 * and makes it available to all other MCA processes during the
 * selection process. Buffers sent by one source_component can only
 * be received by a corresponding module with the same
 * component name.
 *
 * This function is intended to be used during MCA module
 * initialization \em before \em selection (the selection process is
 * defined differently for each component type). Each module will
 * provide a buffer containing meta information and/or parameters
 * that it wants to share with its corresponding modules in peer
 * processes. This information typically contains location /
 * contact information for establishing communication between
 * processes (in a manner that is specific to that module). For
 * example, a TCP-based module could provide its IP address and TCP
 * port where it is waiting on listen(). The peer process receiving
 * this buffer can therefore open a socket to the indicated IP
 * address and TCP port.
 *
 * During the selection process, the MCA framework will effectively
 * perform an "allgather" operation of all modex buffers; every
 * buffer will be available to every peer process (see
 * ompi_modex_recv()).
 *
 * The buffer is copied during the send call and may be modified or
 * free()'ed immediately after the return from this function call.
 *
 * @note Buffer contents are transparent to the MCA framework -- they
 * \em must already either be in network order or be in some format
 * that peer processes will be able to read, regardless of pointer
 * sizes or endian bias.
 *
 * @param[in] source_component A pointer to this module's component
 *                             structure
 * @param[in] buffer           A pointer to the beginning of the buffer to send
 * @param[in] size             Number of bytes in the buffer
 *
 * @retval OMPI_SUCCESS On success
 * @retval OMPI_ERROR   An unspecified error occurred
 */
OMPI_DECLSPEC int ompi_modex_send(struct mca_base_component_t *source_component,
                                  const void *buffer, size_t size);
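
The note on ompi_modex_send() requires the payload to be readable regardless of the sender's byte order. The following is a minimal sketch of how a TCP-based module might pack the IP address and port it listens on before passing the bytes to ompi_modex_send(). It assumes a 6-byte layout (4-byte IPv4 address followed by a 2-byte port); that layout and the example_* names are hypothetical and not part of Open MPI.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical wire layout: 4-byte IPv4 address + 2-byte port. */
#define EXAMPLE_ADDR_BYTES 6

/* Pack ip/port into out (at least EXAMPLE_ADDR_BYTES long) in network
 * byte order.  Returns the number of bytes written, or 0 if ip does
 * not parse as a dotted-quad IPv4 address. */
static size_t example_pack_addr(uint8_t *out, const char *ip, uint16_t port)
{
    struct in_addr addr;
    uint16_t nport = htons(port);

    if (1 != inet_pton(AF_INET, ip, &addr)) {
        return 0;
    }
    memcpy(out, &addr.s_addr, 4);   /* inet_pton() yields network order */
    memcpy(out + 4, &nport, 2);
    return EXAMPLE_ADDR_BYTES;
}

/* Reverse of example_pack_addr(), as the receiving peer would run it.
 * Returns 0 on success, -1 on a malformed buffer. */
static int example_unpack_addr(const uint8_t *in, size_t size,
                               char *ip, size_t ip_len, uint16_t *port)
{
    struct in_addr addr;
    uint16_t nport;

    if (EXAMPLE_ADDR_BYTES != size) {
        return -1;
    }
    memcpy(&addr.s_addr, in, 4);
    memcpy(&nport, in + 4, 2);
    *port = ntohs(nport);
    return (NULL != inet_ntop(AF_INET, &addr, ip, (socklen_t) ip_len)) ? 0 : -1;
}
```

The packed bytes and EXAMPLE_ADDR_BYTES would then be passed as the buffer and size arguments; because the send call copies the buffer, a stack array is sufficient.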

/**
 * Send a buffer to all other peer processes
 *
 * Similar to ompi_modex_send(), but uses a char* key instead of a
 * component name for indexing. All other semantics apply.
 *
 * @note Buffer contents are transparent to the modex -- they \em must
 * already either be in network order or be in some format that peer
 * processes will be able to read, regardless of pointer sizes or
 * endian bias.
 *
 * @param[in] key    A unique key for data storage / lookup
 * @param[in] buffer A pointer to the beginning of the buffer to send
 * @param[in] size   Number of bytes in the buffer
 *
 * @retval OMPI_SUCCESS On success
 * @retval OMPI_ERROR   An unspecified error occurred
 */
OMPI_DECLSPEC int ompi_modex_send_string(const char* key,
                                         const void *buffer, size_t size);

/**
 * Receive a module-specific buffer from a corresponding MCA module
 * in a specific peer process
 *
 * This is the corresponding "get" call to ompi_modex_send().
 * After selection, modules can call this function to receive the
 * buffer sent by their corresponding module on the process
 * source_proc.
 *
 * If a buffer from a corresponding module is found, buffer will be
 * filled with a pointer to a copy of the buffer that was sent by
 * the peer process. It is the caller's responsibility to free this
 * buffer. The size will be filled in with the total size of the
 * buffer.
 *
 * @note If the modex system has received information from a given
 * process, but has not yet received information for the given
 * component, ompi_modex_recv() will return no data. This
 * cannot happen to a process that has gone through the normal
 * startup procedure, but if you believe this can happen with your
 * component, you should use ompi_modex_recv_nb() to receive updates
 * when the information becomes available.
 *
 * @param[in] dest_component A pointer to this module's component struct
 * @param[in] source_proc    Peer process to receive from
 * @param[out] buffer        A pointer to a (void*) that will be filled
 *                           with a pointer to the received buffer
 * @param[out] size          Pointer to a size_t that will be filled with
 *                           the number of bytes in the buffer
 *
 * @retval OMPI_SUCCESS If a corresponding module buffer is found and
 *                      successfully returned to the caller.
 * @retval OMPI_ERR_NOT_IMPLEMENTED Modex support is not available in
 *                      this build of Open MPI (systems like the Cray XT)
 * @retval OMPI_ERR_OUT_OF_RESOURCE No memory could be allocated for the
 *                      buffer.
 */
OMPI_DECLSPEC int ompi_modex_recv(struct mca_base_component_t *dest_component,
                                  struct ompi_proc_t *source_proc,
                                  void **buffer, size_t *size);
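
The ownership rules documented for the send and receive calls (the sender's buffer is copied, and the receiver gets a copy it must free) can be illustrated with a toy in-memory store. This is only a sketch under stated assumptions: toy_send(), toy_recv(), and the fixed-size table are hypothetical stand-ins keyed by string, and returning 0 with a NULL buffer for a missing key is an assumption about what "no data" might look like, not the documented Open MPI behavior.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define TOY_MAX_ENTRIES 8

struct toy_entry {
    char *key;
    void *data;
    size_t size;
};

static struct toy_entry toy_store[TOY_MAX_ENTRIES];
static size_t toy_count = 0;

/* Store a private copy of buffer under key, so the caller may modify
 * or free its own buffer as soon as this returns.  Returns 0 on
 * success, -1 if the table is full or memory runs out. */
static int toy_send(const char *key, const void *buffer, size_t size)
{
    struct toy_entry *e;

    if (TOY_MAX_ENTRIES == toy_count) {
        return -1;
    }
    e = &toy_store[toy_count];
    e->key = malloc(strlen(key) + 1);
    e->data = malloc(size > 0 ? size : 1);
    if (NULL == e->key || NULL == e->data) {
        free(e->key);
        free(e->data);
        return -1;
    }
    strcpy(e->key, key);
    if (size > 0) {
        memcpy(e->data, buffer, size);
    }
    e->size = size;
    toy_count++;
    return 0;
}

/* Hand back a fresh copy of the data stored under key.  The caller
 * owns *buffer and must free() it.  When the key is unknown, *buffer
 * is NULL and *size is 0 (assumed "no data" convention). */
static int toy_recv(const char *key, void **buffer, size_t *size)
{
    size_t i;

    *buffer = NULL;
    *size = 0;
    for (i = 0; i < toy_count; i++) {
        if (0 == strcmp(toy_store[i].key, key)) {
            *buffer = malloc(toy_store[i].size > 0 ? toy_store[i].size : 1);
            if (NULL == *buffer) {
                return -1;
            }
            memcpy(*buffer, toy_store[i].data, toy_store[i].size);
            *size = toy_store[i].size;
            return 0;
        }
    }
    return 0;
}
```

Copying on both sides costs an extra allocation per call, but it means neither side has to reason about the other's buffer lifetime, which matches the contract described in the comments.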

/**
 * Non-blocking modex receive callback
 *
 * Prototype for non-blocking modex receive callback.
 *
 * @param[in] component Pointer to copy of the component struct
 * @param[in] proc      Peer process the information is from
 * @param[in] buffer    Newly updated buffer
 * @param[in] size      Size (in bytes) of buffer
 * @param[in] cbdata    Callback data provided when non-blocking
 *                      receive is posted
 */
typedef void (*ompi_modex_cb_fn_t)(struct mca_base_component_t *component,
                                   struct ompi_proc_t* proc,
                                   void* buffer,
                                   size_t size,
                                   void* cbdata);

/**
 * Register to receive a callback on change to module specific data.
 *
 * The non-blocking version of ompi_modex_recv(). All information
 * about ompi_modex_recv() applies to ompi_modex_recv_nb(), with the
 * exception of what happens when data is available for the given peer
 * process but not the specified module. In that case, no callback
 * will be fired until data is available.
 *
 * @param[in] component A pointer to this module's component struct
 * @param[in] proc      Peer process to receive from
 * @param[in] cbfunc    Callback function when data is available,
 *                      of type ompi_modex_cb_fn_t
 * @param[in] cbdata    Opaque callback data to pass to cbfunc
 *
 * @retval OMPI_SUCCESS Success
 * @retval OMPI_ERR_OUT_OF_RESOURCE No memory could be allocated
 *                      for internal data structures
 */
OMPI_DECLSPEC int ompi_modex_recv_nb(struct mca_base_component_t *component,
                                     struct ompi_proc_t* proc,
                                     ompi_modex_cb_fn_t cbfunc,
                                     void* cbdata);
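
The register-then-notify pattern of the non-blocking receive can be sketched without any Open MPI types. Everything here is hypothetical: toy_cb_fn_t is only shaped after ompi_modex_cb_fn_t, peers are plain int ids, and toy_deliver() stands in for whatever internal event makes data available. Keeping the registration active after it fires is also an assumption, based on the "callback on change" wording.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback type, shaped after ompi_modex_cb_fn_t but
 * using an int peer id instead of the Open MPI structures. */
typedef void (*toy_cb_fn_t)(int peer, void *buffer, size_t size,
                            void *cbdata);

struct toy_pending {
    int active;
    int peer;
    toy_cb_fn_t cbfunc;
    void *cbdata;
};

#define TOY_MAX_PENDING 4
static struct toy_pending toy_pending_list[TOY_MAX_PENDING];

/* Register cbfunc for updates from peer.  Nothing fires here; the
 * callback runs only once data is delivered.  Returns 0 on success,
 * -1 when the table is full. */
static int toy_recv_nb(int peer, toy_cb_fn_t cbfunc, void *cbdata)
{
    size_t i;

    for (i = 0; i < TOY_MAX_PENDING; i++) {
        if (!toy_pending_list[i].active) {
            toy_pending_list[i].active = 1;
            toy_pending_list[i].peer = peer;
            toy_pending_list[i].cbfunc = cbfunc;
            toy_pending_list[i].cbdata = cbdata;
            return 0;
        }
    }
    return -1;
}

/* Simulate data arriving from peer: invoke every callback registered
 * for it and return how many were fired. */
static int toy_deliver(int peer, void *buffer, size_t size)
{
    size_t i;
    int fired = 0;

    for (i = 0; i < TOY_MAX_PENDING; i++) {
        if (toy_pending_list[i].active && toy_pending_list[i].peer == peer) {
            toy_pending_list[i].cbfunc(peer, buffer, size,
                                       toy_pending_list[i].cbdata);
            fired++;
        }
    }
    return fired;
}

/* Example callback state and handler: count calls, remember the size. */
struct toy_seen {
    int calls;
    size_t last_size;
};

static void toy_on_update(int peer, void *buffer, size_t size, void *cbdata)
{
    struct toy_seen *seen = (struct toy_seen *) cbdata;

    (void) peer;
    (void) buffer;
    seen->calls++;
    seen->last_size = size;
}
```

The opaque cbdata pointer is what lets one handler serve several registrations, since each registration can carry its own state.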

/**
 * Receive a buffer from a given peer
 *
 * Similar to ompi_modex_recv(), but uses a char* key instead of a
 * component name for indexing. All other semantics apply.
 *
 * @note If the modex system has received information from a given
 * process, but has not yet received information for the given
 * component, ompi_modex_recv_string() will return no data. This
 * cannot happen to a process that has gone through the normal startup
 * procedure, but if you believe this can happen with your component,
 * you should use ompi_modex_recv_string_nb() to receive updates when
 * the information becomes available.
 *
 * @param[in] key         A unique key for data storage / lookup
 * @param[in] source_proc Peer process to receive from
 * @param[out] buffer     A pointer to a (void*) that will be filled
 *                        with a pointer to the received buffer
 * @param[out] size       Pointer to a size_t that will be filled with
 *                        the number of bytes in the buffer
 *
 * @retval OMPI_SUCCESS If a corresponding module buffer is found and
 *                      successfully returned to the caller.
 * @retval OMPI_ERR_NOT_IMPLEMENTED Modex support is not available in
 *                      this build of Open MPI (systems like the Cray XT)
 * @retval OMPI_ERR_OUT_OF_RESOURCE No memory could be allocated for the
 *                      buffer.
 */
OMPI_DECLSPEC int ompi_modex_recv_string(const char* key,
                                         struct ompi_proc_t *source_proc,
                                         void **buffer, size_t *size);

/**
* Retrieve a copy of the modex buffer
*
|
These changes were mostly captured in a prior RFC (except for #2 below) and are aimed specifically at improving startup performance and setting up the remaining modifications described in that RFC.
The commit has been tested for C/R and Cray operations, and on Odin (SLURM, rsh) and RoadRunner (TM). I tried to update all environments, but obviously could not test them. I know that Windows needs some work, and have highlighted what is know to be needed in the odls process component.
This represents a lot of work by Brian, Tim P, Josh, and myself, with much advice from Jeff and others. For posterity, I have appended a copy of the email describing the work that was done:
As we have repeatedly noted, the modex operation in MPI_Init is the single greatest consumer of time during startup. To-date, we have executed that operation as an ORTE stage gate that held the process until a startup message containing all required modex (and OOB contact info - see #3 below) info could be sent to it. Each process would send its data to the HNP's registry, which assembled and sent the message when all processes had reported in.
In addition, ORTE had taken responsibility for monitoring process status as it progressed through a series of "stage gates". The process reported its status at each gate, and ORTE would then send a "release" message once all procs had reported in.
The incoming changes revamp these procedures in three ways:
1. eliminating the ORTE stage gate system and cleanly delineating responsibility between the OMPI and ORTE layers for MPI init/finalize. The modex stage gate (STG1) has been replaced by a collective operation in the modex itself that performs an allgather on the required modex info. The allgather is implemented using the orte_grpcomm framework since the BTL's are not active at that point. At the moment, the grpcomm framework only has a "basic" component analogous to OMPI's "basic" coll framework - I would recommend that the MPI team create additional, more advanced components to improve performance of this step.
The other stage gates have been replaced by orte_grpcomm barrier functions. We tried to use MPI barriers instead (since the BTL's are active at that point), but - as we discussed on the telecon - these are not currently true barriers so the job would hang when we fell through while messages were still in process. Note that the grpcomm barrier doesn't actually resolve that problem, but Brian has pointed out that we are unlikely to ever see it violated. Again, you might want to spend a little time on an advanced barrier algorithm as the one in "basic" is very simplistic.
Summarizing this change: ORTE no longer tracks process state nor has direct responsibility for synchronizing jobs. This is now done via collective operations within the MPI layer, albeit using ORTE collective communication services. I -strongly- urge the MPI team to implement advanced collective algorithms to improve the performance of this critical procedure.
2. reducing the volume of data exchanged during modex. Data in the modex consisted of the process name, the name of the node where that process is located (expressed as a string), plus a string representation of all contact info. The nodename was required in order for the modex to determine if the process was local or not - in addition, some people like to have it to print pretty error messages when a connection failed.
The size of this data has been reduced in three ways:
(a) reducing the size of the process name itself. The process name consisted of two 32-bit fields for the jobid and vpid. This is far larger than any current system, or system likely to exist in the near future, can support. Accordingly, the default size of these fields has been reduced to 16-bits, which means you can have 32k procs in each of 32k jobs. Since the daemons must have a vpid, and we require one daemon/node, this also restricts the default configuration to 32k nodes.
To support any future "mega-clusters", a configuration option --enable-jumbo-apps has been added. This option increases the jobid and vpid field sizes to 32-bits. Someday, if necessary, someone can add yet another option to increase them to 64-bits, I suppose.
(b) replacing the string nodename with an integer nodeid. Since we have one daemon/node, the nodeid corresponds to the local daemon's vpid. This replaces an often lengthy string with only 2 (or at most 4) bytes, a substantial reduction.
(c) when the mca param requesting that nodenames be sent to support pretty error messages, a second mca param is now used to request FQDN - otherwise, the domain name is stripped (by default) from the message to save space. If someone wants to combine those into a single param somehow (perhaps with an argument?), they are welcome to do so - I didn't want to alter what people are already using.
While these may seem like small savings, they actually amount to a significant impact when aggregated across the entire modex operation. Since every proc must receive the modex data regardless of the collective used to send it, just reducing the size of the process name removes nearly 400MBytes of communication from a 32k proc job (admittedly, much of this comm may occur in parallel). So it does add up pretty quickly.
3. routing RML messages to reduce connections. The default messaging system remains point-to-point - i.e., each proc opens a socket to every proc it communicates with and sends its messages directly. A new option uses the orteds as routers - i.e., each proc only opens a single socket to its local orted. All messages are sent from the proc to the orted, which forwards the message to the orted on the node where the intended recipient proc is located - that orted then forwards the message to its local proc (the recipient). This greatly reduces the connection storm we have encountered during startup.
It also has the benefit of removing the sharing of every proc's OOB contact with every other proc. The orted routing tables are populated during launch since every orted gets a map of where every proc is being placed. Each proc, therefore, only needs to know the contact info for its local daemon, which is passed in via the environment when the proc is fork/exec'd by the daemon. This alone removes ~50 bytes/process of communication that was in the current STG1 startup message - so for our 32k proc job, this saves us roughly 32k*50 = 1.6MBytes per startup message, and since that message is sent to each of 32k procs, about 51GBytes of messaging in total.
Note that you can use the new routing method by specifying -mca routed tree - if you so desire. This mode will become the default at some point in the future.
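The contact-info arithmetic quoted in item 3 can be checked with a small helper. This is only a sketch: the function name is hypothetical, and "32k" is taken to mean 32,000, which is the reading that reproduces the figures quoted above.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Total bytes delivered when each of nprocs processes receives a message
 * carrying bytes_per_proc bytes for every one of the nprocs processes.
 * Hypothetical helper for illustration; not part of ORTE or OMPI.
 */
static uint64_t all_to_all_bytes(uint64_t nprocs, uint64_t bytes_per_proc)
{
    uint64_t per_message = nprocs * bytes_per_proc; /* one startup message */
    return per_message * nprocs;                    /* one copy per proc   */
}
```

With nprocs = 32000 and bytes_per_proc = 50, the per-message size is 1,600,000 bytes (1.6MBytes) and the total is 51,200,000,000 bytes, matching the roughly 51GBytes stated in the text.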
There are a few minor additional changes in the commit that I'll just note in passing:
* propagation of command line mca params to the orteds - fixes ticket #1073. See note there for details.
* requiring of "finalize" prior to "exit" for MPI procs - fixes ticket #1144. See note there for details.
* cleanup of some stale header files
This commit was SVN r16364.
2007-10-05 19:48:23 +00:00
 * Each component will "send" its data on its own. The modex
 * collects that data into a local static buffer. At some point,
 * we need to provide a copy of the collected info so someone
 * (usually mpi_init) can send it to everyone else. This function
 * transfers the payload in the local static buffer into the provided
 * buffer, thus resetting the local buffer for future use.
 *
 * @note This function is probably not useful outside of application
 * initialization code.
 *
 * @param[in] *buf Pointer to the target buffer
 *
 * @retval OMPI_SUCCESS Successfully exchanged information
 * @retval OMPI_ERROR An unspecified error occurred
 */
OMPI_DECLSPEC int ompi_modex_get_my_buffer(orte_buffer_t *buf);

/**
 * Process the data in a modex buffer
 *
 * Given a buffer containing a set of modex entries, this
 * function will destructively read the buffer, adding the
 * modex info to each proc. An error will be returned if
 * modex info is found for a proc that is not yet in the
 * ompi_proc table
 *
 * @param[in] *buf Pointer to a buffer containing the data
 *
 * @retval OMPI_SUCCESS Successfully exchanged information
 * @retval OMPI_ERROR An unspecified error occurred
 */
OMPI_DECLSPEC int ompi_modex_process_data(orte_buffer_t *buf);


/**
 * Initialize the modex system
 *
 * Allocate memory for the local data cache and initialize the
 * module exchange system. Does not cause communication nor any
 * subscriptions to be placed on the registry.
 *
 * @retval OMPI_SUCCESS Successfully initialized modex subsystem
 */
OMPI_DECLSPEC int ompi_modex_init(void);


/**
 * Finalize the modex system
 *
 * Release any memory associated with the modex system, remove all
 * subscriptions on the GPR and end all non-blocking update triggers
 * currently available on the system.
 *
 * @retval OMPI_SUCCESS Successfully shut down modex subsystem
 */
OMPI_DECLSPEC int ompi_modex_finalize(void);

END_C_DECLS

#endif /* MCA_OMPI_MODULE_EXCHANGE_H */