Ralph Castain aec5cd08bd Per the PMIx RFC:
WHAT:    Merge the PMIx branch into the devel repo, creating a new
               OPAL “pmix” framework to abstract PMI support for all RTEs.
               Replace the ORTE daemon-level collectives with a new PMIx
               server and update the ORTE grpcomm framework to support
               server-to-server collectives

WHY:      We’ve had problems dealing with variations in PMI implementations,
               and need to extend the existing PMI definitions to meet exascale
               requirements.

WHEN:   Mon, Aug 25

WHERE:  https://github.com/rhc54/ompi-svn-mirror.git

Several community members have been working on a refactoring of the current PMI support within OMPI. Although the APIs are common, Slurm and Cray implement a different range of capabilities, and package them differently. For example, Cray provides an integrated PMI-1/2 library, while Slurm separates the two and requires the user to specify the one to be used at runtime. In addition, several bugs in the Slurm implementations have required extra workaround code.

All this has led to a slew of #if's in the PMI code, and to bugs when the corner-case logic for one implementation accidentally breaks the other. Extending this support to other implementations would have increased this complexity to an unacceptable level.

Accordingly, we have:

* Created a new OPAL “pmix” framework to abstract the PMI support, with separate components for the Cray, Slurm PMI-1, and Slurm PMI-2 implementations.

* Replaced the current ORTE grpcomm daemon-based collective operation with an integrated PMIx server, and updated the grpcomm APIs to provide more flexible, multi-algorithm support for collective operations. At this time, only the xcast and allgather operations are supported.

* Replaced the current global collective ID with a signature based on the names of the participating procs. This allows an unlimited number of collectives to be executed by any group of processes, subject to the requirement that only one collective can be active at a time for a unique combination of procs. Note that a proc can be involved in any number of simultaneous collectives; it is the specific combination of procs that is subject to the constraint (see the sketch after this list).

* Removed the prior OMPI/OPAL modex code.

* Added new macros for executing modex send/recv to simplify use of the new APIs. The send macros allow the caller to specify whether or not the BTL supports async modex operations - if so, the non-blocking “fence” operation is used when the active PMIx component supports it. Otherwise, the default is the full blocking modex exchange we currently perform.

* Retained the current flag that directs us to use a blocking fence operation, but only to retrieve data upon demand.
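
To make the new flow concrete, the following minimal sketch (not code from this commit; the function names, payload, and callback body are placeholders, and error handling and object lifetimes are elided) shows how a caller might drive a signature-scoped allgather through the new API:

    #include <stdlib.h>
    #include "opal/dss/dss.h"
    #include "orte/runtime/orte_globals.h"
    #include "orte/mca/grpcomm/grpcomm.h"

    static void coll_complete(int status, opal_buffer_t *buf, void *cbdata)
    {
        /* unpack the gathered contributions from buf and signal
         * whatever is waiting via cbdata */
    }

    static int start_allgather(void)
    {
        int32_t my_data = 42;    /* hypothetical contribution */
        opal_buffer_t buf;

        /* sign the collective with the participating procs - here,
         * every proc in our own job */
        orte_grpcomm_signature_t *sig = OBJ_NEW(orte_grpcomm_signature_t);
        sig->signature = (orte_process_name_t*)malloc(sizeof(orte_process_name_t));
        sig->sz = 1;
        sig->signature[0].jobid = ORTE_PROC_MY_NAME->jobid;
        sig->signature[0].vpid = ORTE_VPID_WILDCARD;

        /* pack this proc's contribution */
        OBJ_CONSTRUCT(&buf, opal_buffer_t);
        opal_dss.pack(&buf, &my_data, 1, OPAL_INT32);

        /* non-blocking: coll_complete fires once every participant
         * has reported in */
        return orte_grpcomm.allgather(sig, &buf, coll_complete, NULL);
    }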

This commit was SVN r32570.
2014-08-21 18:56:47 +00:00


/*
 * Copyright (c) 2004-2008 The Trustees of Indiana University and Indiana
 *                         University Research and Technology
 *                         Corporation. All rights reserved.
 * Copyright (c) 2004-2005 The University of Tennessee and The University
 *                         of Tennessee Research Foundation. All rights
 *                         reserved.
 * Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
 *                         University of Stuttgart. All rights reserved.
 * Copyright (c) 2004-2005 The Regents of the University of California.
 *                         All rights reserved.
 * Copyright (c) 2011-2012 Los Alamos National Security, LLC.
 *                         All rights reserved.
 * $COPYRIGHT$
 *
 * Additional copyrights may follow
 *
 * $HEADER$
 */
/** @file:
 *
 * The OpenRTE Group Communications
 *
 * The OpenRTE Group Comm framework provides communication services that
 * span entire jobs or collections of processes. It is not intended to be
 * used for point-to-point communications (the RML does that), nor should
 * it be viewed as a high-performance communication channel for large-scale
 * data transfers.
 */
#ifndef MCA_GRPCOMM_H
#define MCA_GRPCOMM_H
/*
 * includes
 */
#include "orte_config.h"
#include "orte/constants.h"
#include "orte/types.h"
#include "opal/mca/mca.h"
#include "opal/class/opal_list.h"
#include "opal/dss/dss_types.h"
#include "orte/mca/rml/rml_types.h"
BEGIN_C_DECLS
/* define a callback function to be invoked upon
 * collective completion */
typedef void (*orte_grpcomm_cbfunc_t)(int status, opal_buffer_t *buf, void *cbdata);
/* Define a collective signature so we don't need to
 * track global collective id's */
typedef struct {
    opal_object_t super;
    orte_process_name_t *signature;
    size_t sz;
} orte_grpcomm_signature_t;
OBJ_CLASS_DECLARATION(orte_grpcomm_signature_t);
/* Internal component object for tracking ongoing
 * allgather operations */
typedef struct {
    opal_list_item_t super;
    /* collective's signature */
    orte_grpcomm_signature_t *sig;
    /* collection bucket */
    opal_buffer_t bucket;
    /* participating daemons */
    orte_vpid_t *dmns;
    size_t ndmns;
    /* number reported in */
    size_t nreported;
    /* callback function */
    orte_grpcomm_cbfunc_t cbfunc;
    /* user-provided callback data */
    void *cbdata;
} orte_grpcomm_coll_t;
OBJ_CLASS_DECLARATION(orte_grpcomm_coll_t);
/*
 * Component functions - all MUST be provided!
 */

/* initialize the selected module */
typedef int (*orte_grpcomm_base_module_init_fn_t)(void);

/* finalize the selected module */
typedef void (*orte_grpcomm_base_module_finalize_fn_t)(void);

/* Scalably send a message. Caller will provide an array
 * of daemon vpids that are to receive the message. A NULL
 * pointer indicates that all daemons are participating. */
typedef int (*orte_grpcomm_base_module_xcast_fn_t)(orte_vpid_t *vpids,
                                                   size_t nprocs,
                                                   opal_buffer_t *msg);
/* allgather - gather data from all specified daemons. Barrier operations
 * will provide a zero-byte buffer. Caller will provide an array
 * of daemon vpids that are participating in the allgather via the
 * orte_grpcomm_coll_t object. A NULL pointer indicates that all daemons
 * are participating.
 *
 * NOTE: this is a non-blocking call. The callback function cached in
 * the orte_grpcomm_coll_t will be invoked upon completion. */
typedef int (*orte_grpcomm_base_module_allgather_fn_t)(orte_grpcomm_coll_t *coll,
                                                       opal_buffer_t *buf);
/*
 * Ver 3.0 - internal modules
 */
typedef struct {
    orte_grpcomm_base_module_init_fn_t init;
    orte_grpcomm_base_module_finalize_fn_t finalize;
    /* collective operations */
    orte_grpcomm_base_module_xcast_fn_t xcast;
    orte_grpcomm_base_module_allgather_fn_t allgather;
} orte_grpcomm_base_module_t;
/* the Public APIs */

/* Scalably send a message. Caller will provide an array
 * of process names that are to receive the message. A NULL
 * pointer indicates that all known procs are to receive
 * the message. A pointer to a name that includes ORTE_VPID_WILDCARD
 * will send the message to all procs in the specified jobid.
 * The message will be sent to the daemons hosting the specified
 * procs for processing and relay. */
typedef int (*orte_grpcomm_base_API_xcast_fn_t)(orte_grpcomm_signature_t *sig,
                                                orte_rml_tag_t tag,
                                                opal_buffer_t *msg);
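
/* For example (illustrative only - jobid, msg, and the tag name
 * MY_RML_TAG are hypothetical), a message can be xcast to every proc
 * in a given job by signing the operation with ORTE_VPID_WILDCARD:
 *
 *   orte_grpcomm_signature_t *sig = OBJ_NEW(orte_grpcomm_signature_t);
 *   sig->signature = (orte_process_name_t*)malloc(sizeof(orte_process_name_t));
 *   sig->sz = 1;
 *   sig->signature[0].jobid = jobid;
 *   sig->signature[0].vpid = ORTE_VPID_WILDCARD;
 *   orte_grpcomm.xcast(sig, MY_RML_TAG, msg);
 */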
/* allgather - gather data from all specified procs. Barrier operations
 * will provide a zero-byte buffer. Caller will provide an array
 * of application proc vpids that are participating in the allgather. A NULL
 * pointer indicates that all known procs are participating. A pointer
 * to a name that includes ORTE_VPID_WILDCARD indicates that all procs
 * in the specified jobid are contributing.
 *
 * NOTE: this is a non-blocking call. The provided callback function
 * will be invoked upon completion. */
typedef int (*orte_grpcomm_base_API_allgather_fn_t)(orte_grpcomm_signature_t *sig,
                                                    opal_buffer_t *buf,
                                                    orte_grpcomm_cbfunc_t cbfunc,
                                                    void *cbdata);
typedef struct {
    /* collective operations */
    orte_grpcomm_base_API_xcast_fn_t xcast;
    orte_grpcomm_base_API_allgather_fn_t allgather;
} orte_grpcomm_API_module_t;
/*
 * the standard component data structure
 */
struct orte_grpcomm_base_component_3_0_0_t {
    mca_base_component_t base_version;
    mca_base_component_data_t base_data;
};
typedef struct orte_grpcomm_base_component_3_0_0_t orte_grpcomm_base_component_3_0_0_t;
typedef orte_grpcomm_base_component_3_0_0_t orte_grpcomm_base_component_t;
/*
 * Macro for use in components that are of type grpcomm v3.0.0
 */
#define ORTE_GRPCOMM_BASE_VERSION_3_0_0 \
    /* grpcomm v3.0 is chained to MCA v2.0 */ \
    MCA_BASE_VERSION_2_0_0, \
    /* grpcomm v3.0 */ \
    "grpcomm", 3, 0, 0
/* Global structure for accessing grpcomm functions */
ORTE_DECLSPEC extern orte_grpcomm_API_module_t orte_grpcomm;
END_C_DECLS
#endif