openmpi/ompi/mca/bcol/bcol.h

/* -*- Mode: C; c-basic-offset:4 ; indent-tabs-mode:nil -*- */
/*
* Copyright (c) 2009-2012 Oak Ridge National Laboratory. All rights reserved.
* Copyright (c) 2009-2012 Mellanox Technologies. All rights reserved.
* Copyright (c) 2013-2015 Los Alamos National Security, LLC. All rights
* reserved.
* Copyright (c) 2015 Research Organization for Information Science
* and Technology (RIST). All rights reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
#ifndef MCA_BCOL_H
#define MCA_BCOL_H
#include "ompi_config.h"
#include "opal/class/opal_list.h"
#include "ompi/mca/mca.h"
#include "ompi/mca/coll/coll.h"
#include "opal/mca/mpool/mpool.h"
#include "ompi/mca/sbgp/sbgp.h"
#include "ompi/datatype/ompi_datatype.h"
#include "ompi/op/op.h"
#include "ompi/include/ompi/constants.h"
#include "ompi/patterns/net/netpatterns_knomial_tree.h"
#include "opal/util/show_help.h"
#include <limits.h>
#if defined(c_plusplus) || defined(__cplusplus)
extern "C" {
#endif
/* Forward declaration - please do not remove it */
struct ml_buffers_t;
struct mca_bcol_base_coll_fn_comm_attributes_t;
struct mca_bcol_base_coll_fn_invoke_attributes_t;
struct mca_bcol_base_coll_fn_desc_t;
#define NUM_MSG_RANGES 5
#define MSG_RANGE_INITIAL ((1024)*12)
#define MSG_RANGE_INC 10
#define BCOL_THRESHOLD_UNLIMITED (INT_MAX)
/* Maximum size of a bcol's header. This allows us to correctly calculate the message
* thresholds. If the header of any bcol exceeds this value then increase this one
* to match. */
#define BCOL_HEADER_MAX 96
#define BCOL_HEAD_ALIGN 32 /* will turn into an MCA parameter after debug */
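/* Illustrative sketch, not part of the upstream header: a bcol that defines
 * its own header type (here a hypothetical mca_bcol_foo_header_t) can check
 * the limit above at compile time and derive its aligned payload offset:
 *
 *     _Static_assert(sizeof(mca_bcol_foo_header_t) <= BCOL_HEADER_MAX,
 *                    "increase BCOL_HEADER_MAX to match this bcol's header");
 *
 *     size_t offset = (sizeof(mca_bcol_foo_header_t) + BCOL_HEAD_ALIGN - 1)
 *                     & ~((size_t) BCOL_HEAD_ALIGN - 1);
 */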
/*
* Functions supported
*/
enum bcol_coll {
/* blocking functions */
BCOL_ALLGATHER,
BCOL_ALLGATHERV,
BCOL_ALLREDUCE,
BCOL_ALLTOALL,
BCOL_ALLTOALLV,
BCOL_ALLTOALLW,
BCOL_BARRIER,
BCOL_BCAST,
BCOL_EXSCAN,
BCOL_GATHER,
BCOL_GATHERV,
BCOL_REDUCE,
BCOL_REDUCE_SCATTER,
BCOL_SCAN,
BCOL_SCATTER,
BCOL_SCATTERV,
BCOL_FANIN,
BCOL_FANOUT,
/* nonblocking functions */
BCOL_IALLGATHER,
BCOL_IALLGATHERV,
BCOL_IALLREDUCE,
BCOL_IALLTOALL,
BCOL_IALLTOALLV,
BCOL_IALLTOALLW,
BCOL_IBARRIER,
BCOL_IBCAST,
BCOL_IEXSCAN,
BCOL_IGATHER,
BCOL_IGATHERV,
BCOL_IREDUCE,
BCOL_IREDUCE_SCATTER,
BCOL_ISCAN,
BCOL_ISCATTER,
BCOL_ISCATTERV,
BCOL_IFANIN,
BCOL_IFANOUT,
BCOL_SYNC,
/* New function - needed for intermediate steps */
BCOL_REDUCE_TO_LEADER,
BCOL_NUM_OF_FUNCTIONS
};
typedef enum bcol_coll bcol_coll;
typedef enum bcol_elem_type {
BCOL_SINGLE_ELEM_TYPE,
BCOL_MULTI_ELEM_TYPE,
BCOL_NUM_OF_ELEM_TYPES
} bcol_elem_type;
typedef int (*mca_bcol_base_module_coll_support_all_types_fn_t)(bcol_coll coll_name);
typedef int (*mca_bcol_base_module_coll_support_fn_t)(int op, int dtype, bcol_elem_type elem_num);
/*
* Collective function status
*/
enum {
BCOL_FN_NOT_STARTED = (OMPI_ERR_MAX - 1),
BCOL_FN_STARTED = (OMPI_ERR_MAX - 2),
BCOL_FN_COMPLETE = (OMPI_ERR_MAX - 3)
};
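/* Illustrative sketch of the intended usage (an assumption, not upstream
 * code): because these values are carved out below OMPI_ERR_MAX they cannot
 * collide with real OMPI return codes, so a caller can poll an int-returning
 * bcol function until it reports completion:
 *
 *     int rc = bcol_fn(&fn_args, &const_args);
 *     while (BCOL_FN_STARTED == rc || BCOL_FN_NOT_STARTED == rc) {
 *         opal_progress();
 *         rc = bcol_fn(&fn_args, &const_args);
 *     }
 *     return (BCOL_FN_COMPLETE == rc) ? OMPI_SUCCESS : rc;
 */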
/**
* Collective component initialization
*
* Initialize the given collective component. This function should
* initialize any component-level data. It will be called exactly
* once during MPI_INIT.
*
* @note The component framework is not lazily opened, so attempts
* should be made to minimize the amount of memory allocated during
* this function.
*
* @param[in] enable_progress_threads True if the component needs to
* support progress threads
* @param[in] enable_mpi_threads True if the component needs to
* support MPI_THREAD_MULTIPLE
*
* @retval OMPI_SUCCESS Component successfully initialized
* @retval OMPI_ERROR An unspecified error occurred
*/
typedef int (*mca_bcol_base_component_init_query_fn_t)
(bool enable_progress_threads, bool enable_mpi_threads);
/**
* Query whether a component is available for the given sub-group
*
* Query whether the component is available for the given
* sub-group. If the component is available, an array of pointers should be
* allocated and returned (with refcount at 1). The module will not
* be used for collective operations until module_enable() is called
* on the module, but may be destroyed (via OBJ_RELEASE) either before
* or after module_enable() is called. If the module needs to release
* resources obtained during query(), it should do so in the module
* destructor.
*
* A component may return NULL from this function to indicate it does
* not wish to run, or it may return an error from module_enable().
*
* @note The communicator is available for point-to-point
* communication, but other functionality is not available during this
* phase of initialization.
*
* @param[in] sbgp Pointer to sub-group module.
* @param[out] priority Priority setting for component on
* this communicator
* @param[out] num_modules Number of modules that were generated
* for the sub-group module.
*
* @returns An array of pointers to initialized module structures if the
* component can provide modules with the requested functionality, or NULL
* if the component should not be used on the given communicator.
*/
typedef struct mca_bcol_base_module_t **(*mca_bcol_base_component_comm_query_fn_t)
(mca_sbgp_base_module_t *sbgp, int *num_modules);
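/* Illustrative caller sketch (hypothetical variable names, not upstream
 * code) for the query flow described above:
 *
 *     int num_modules = 0;
 *     struct mca_bcol_base_module_t **modules =
 *         component->collm_comm_query(sbgp, &num_modules);
 *     if (NULL == modules) {
 *         // this component will not be used for this sub-group
 *     }
 *     // each modules[i] is refcounted; OBJ_RELEASE it when done
 */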
typedef int (*mca_bcol_barrier_init_fn_t)(struct mca_bcol_base_module_t *bcol_module,
mca_sbgp_base_module_t *sbgp_module);
/*
* Macro for use in components that are of type bcol v2.0.0
*/
#define MCA_BCOL_BASE_VERSION_2_0_0 \
OMPI_MCA_BASE_VERSION_2_1_0("bcol", 2, 0, 0)
/* This is really an abstraction violation, but is the easiest way to get
* started. For memory management we need to know what bcol components
* have compatible memory management schemes. Such compatibility can
* be used to eliminate memory copies between levels in the collective
* operation hierarchy, by having the output buffer of one level be the
* input buffer to the next level
*/
enum {
BCOL_SHARED_MEMORY_UMA=0,
BCOL_SHARED_MEMORY_SOCKET,
BCOL_POINT_TO_POINT,
BCOL_IB_OFFLOAD,
BCOL_SIZE
};
OMPI_DECLSPEC extern int bcol_mpool_compatibility[BCOL_SIZE][BCOL_SIZE];
OMPI_DECLSPEC extern int bcol_mpool_index[BCOL_SIZE][BCOL_SIZE];
/* what are the input parameters? too many void * pointers here */
typedef int (*bcol_register_mem_fn_t)(void *context_data, void *base,
size_t size, void **reg_desc);
/* deregistration function */
typedef int (*bcol_deregister_mem_fn_t)(void *context_data, void *reg_desc);
/* Bcol network context definition */
struct bcol_base_network_context_t {
opal_object_t super;
/* Context id - defined by upper layer, ML */
int context_id;
/* Any context information that the bcol wants to use */
void *context_data;
/* registration function */
bcol_register_mem_fn_t register_memory_fn;
/* deregistration function */
bcol_deregister_mem_fn_t deregister_memory_fn;
};
typedef struct bcol_base_network_context_t bcol_base_network_context_t;
OMPI_DECLSPEC OBJ_CLASS_DECLARATION(bcol_base_network_context_t);
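/* Illustrative sketch (hypothetical variable names): the upper layer can
 * register a payload region with a bcol's network context through the
 * function pointers above, and later deregister it:
 *
 *     bcol_base_network_context_t *nc = component->network_contexts[0];
 *     void *reg_desc = NULL;
 *     int rc = nc->register_memory_fn(nc->context_data, base, size,
 *                                     &reg_desc);
 *     if (OMPI_SUCCESS == rc) {
 *         // ... use the registered region, then ...
 *         nc->deregister_memory_fn(nc->context_data, reg_desc);
 *     }
 */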
/*
* primitive function types
*/
/* bcast */
enum {
/* small data function */
BCOL_BCAST_SMALL_DATA,
/* small data - dynamic decision making supported */
BCOL_BCAST_SMALL_DATA_DYNAMIC,
/* number of functions */
BCOL_NUM_BCAST_FUNCTIONS
};
/**
* BCOL instance.
*/
/* no limit on fragment size - this supports using user buffers rather
* than library buffers
*/
#define FRAG_SIZE_NO_LIMIT -1
/* forward declaration */
struct coll_bcol_collective_description_t;
struct mca_bcol_base_component_2_0_0_t {
/** Base component description */
mca_base_component_t bcol_version;
/** Component initialization function */
mca_bcol_base_component_init_query_fn_t collm_init_query;
/** Query whether component is useable for given communicator */
mca_bcol_base_component_comm_query_fn_t collm_comm_query;
/** If bcol supports all possible data types */
mca_bcol_base_module_coll_support_fn_t coll_support;
/** If bcol supports all possible data types for given collective operation */
mca_bcol_base_module_coll_support_all_types_fn_t coll_support_all_types;
/** Use this flag to prevent multiple init_query calls
in case we have the same bcol on more than a single level */
bool init_done;
/** If collective calls with bcols of this type need to be ordered */
bool need_ordering;
/** MCA parameter: Priority of this component */
int priority;
/** Bcast function pointers */
struct coll_bcol_collective_description_t *
bcast_functions[BCOL_NUM_BCAST_FUNCTIONS];
/** Number of network contexts - need this for resource management */
int n_net_contexts;
/** List of network contexts */
bcol_base_network_context_t **network_contexts;
/*
* Fragmentation support
*/
/** Minimum fragment size */
int min_frag_size;
/** Maximum fragment size */
int max_frag_size;
/** Supports direct use of user-buffers */
bool can_use_user_buffers;
};
typedef struct mca_bcol_base_component_2_0_0_t mca_bcol_base_component_2_0_0_t;
typedef struct mca_bcol_base_component_2_0_0_t mca_bcol_base_component_t;
OMPI_DECLSPEC OBJ_CLASS_DECLARATION(mca_bcol_base_component_t);
/* forward declaration */
struct mca_coll_ml_descriptor_t;
struct mca_bcol_base_payload_buffer_desc_t;
struct mca_bcol_base_route_info_t;
typedef struct {
int order_num; /* Seq num of collective fragment */
int bcols_started; /* How many bcols that need ordering have been started */
int n_fns_need_ordering; /* Number of functions called for bcols that need ordering */
} mca_bcol_base_order_info_t;
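/* Illustrative sketch only (the ordering protocol itself lives at the ML
 * level; "next_order_num" is a hypothetical per-module counter): a bcol
 * that needs ordering would defer work until its fragment's turn comes up:
 *
 *     if (fn_args->order_info.order_num != module->next_order_num) {
 *         return BCOL_FN_NOT_STARTED;   // not our turn yet
 *     }
 *     module->next_order_num++;
 */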
/* structure that encapsulates information propagated among multiple
* fragments; the entire ensemble of fragments must complete in order
* to complete the entire collective
*/
struct bcol_fragment_descriptor_t {
/* start iterator */
int head;
/* end iterator */
int tail;
/* current iteration */
int start_iter;
/* number of full iterations this frag */
int num_iter;
/* end iter */
int end_iter;
};
typedef struct bcol_fragment_descriptor_t bcol_fragment_descriptor_t;
struct bcol_function_args_t {
/* full message sequence number */
int64_t sequence_num;
/* full message descriptor - single copy of fragment invariant
* parameters */
/* Pasha: We don't need this one for the new flow - remove it */
struct mca_coll_ml_descriptor_t *full_message_descriptor;
struct mca_bcol_base_route_info_t *root_route;
/* function status */
int function_status;
/* root, for rooted operations */
int root;
/* input buffer */
const void *sbuf;
void *rbuf;
const void *userbuf;
struct mca_bcol_base_payload_buffer_desc_t *src_desc;
struct mca_bcol_base_payload_buffer_desc_t *dst_desc;
/* ml buffer size */
uint32_t buffer_size;
/* index of buffer in ml payload cache */
int buffer_index;
int count;
struct ompi_datatype_t *dtype;
struct ompi_op_t *op;
int sbuf_offset;
int rbuf_offset;
/* for bcol opaque data */
void *bcol_opaque_data;
/* An output argument used by the BCOL function to tell ML that the result of the BCOL is in rbuf */
bool result_in_rbuf;
bool root_flag; /* True if the rank is root of operation */
bool need_dt_support; /* will trigger alternate code path for some colls */
int status; /* Used for non-blocking collective completion */
uint32_t frag_size; /* fragment size for large messages */
int hier_factor; /* factor used when bcast is invoked as a service function back down
* the tree (in allgather, for example); the pacl_len is not the actual
* length of the data needing bcasting
*/
mca_bcol_base_order_info_t order_info;
bcol_fragment_descriptor_t frag_info;
};
struct mca_bcol_base_route_info_t {
int level;
int rank;
};
typedef struct mca_bcol_base_route_info_t mca_bcol_base_route_info_t;
struct mca_bcol_base_lmngr_block_t {
opal_list_item_t super;
struct mca_coll_ml_lmngr_t *lmngr;
void* base_addr;
};
typedef struct mca_bcol_base_lmngr_block_t mca_bcol_base_lmngr_block_t;
OBJ_CLASS_DECLARATION(mca_bcol_base_lmngr_block_t);
struct mca_bcol_base_memory_block_desc_t {
/* memory block for payload buffers */
struct mca_bcol_base_lmngr_block_t *block;
/* Address offset in bytes -- Indicates free memory in the block */
uint64_t block_addr_offset;
/* size of the memory block */
size_t size_block;
/* number of memory banks */
uint32_t num_banks;
/* number of buffers per bank */
uint32_t num_buffers_per_bank;
/* size of a payload buffer */
uint32_t size_buffer;
/* pointer to buffer descriptors initialized */
struct mca_bcol_base_payload_buffer_desc_t *buffer_descs;
/* index of the next free buffer in the block */
uint64_t next_free_buffer;
uint32_t *bank_release_counters;
/* Counter that defines which bank should be synchronized next.
 * Since collectives may complete out of order, we have to make
 * sure that memory synchronization collectives are started in order! */
int memsync_counter;
/* This array of flags is used to signal that the bank is ready for recycling */
bool *ready_for_memsync;
/* These flags monitor whether a bank is open for use. Usually we expect that
 * the user will do the check only on buffer-zero allocation */
bool *bank_is_busy;
};
/* convenience typedef */
typedef struct mca_bcol_base_memory_block_desc_t mca_bcol_base_memory_block_desc_t;
typedef void (*mca_bcol_base_release_buff_fn_t)(struct mca_bcol_base_memory_block_desc_t *ml_memblock, uint32_t buff_id);
struct mca_bcol_base_payload_buffer_desc_t {
void *base_data_addr; /* buffer address */
void *data_addr; /* buffer address + header offset */
uint64_t generation_number; /* my generation */
uint64_t bank_index; /* my bank */
uint64_t buffer_index; /* my buff index */
};
/* convenience typedef */
typedef struct mca_bcol_base_payload_buffer_desc_t mca_bcol_base_payload_buffer_desc_t;
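/* Illustrative sketch (not part of this header): the bank/buffer geometry
 * above suggests that buffer descriptors are indexed bank-major in
 * buffer_descs. Both the helper name and the bank-major layout are
 * assumptions made for illustration only:
 *
 *   static inline mca_bcol_base_payload_buffer_desc_t *
 *   example_get_buffer_desc(mca_bcol_base_memory_block_desc_t *block,
 *                           uint32_t bank, uint32_t buffer)
 *   {
 *       uint32_t index = bank * block->num_buffers_per_bank + buffer;
 *       return &block->buffer_descs[index];
 *   }
 */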
typedef struct bcol_function_args_t bcol_function_args_t;
/* The collective operation is defined by a series of collective operations
 * invoked through function pointers. Each function may be different,
 * so we store the arguments in a struct and pass a pointer to the struct,
 * using it to hide the differing function signatures.
*
* @param[in] input_args Structure with function arguments
* @param[in] bcol_desc Component specific parameters
* @param[out] status return status of the function
* MCA_BCOL_COMPLETE - function completed
* MCA_BCOL_IN_PROGRESS - function incomplete
*
* @retval OMPI_SUCCESS successful completion
* @retval OMPI_ERROR function returned error
*/
/* forward declaration */
struct mca_bcol_base_module_t;
/* collective function prototype - all functions have the same interface
* so that we can call them via a function pointer */
struct mca_bcol_base_function_t;
typedef int (*mca_bcol_base_module_collective_fn_primitives_t)
(bcol_function_args_t *input_args, struct mca_bcol_base_function_t *const_args);
typedef int (*mca_bcol_base_module_collective_init_fn_primitives_t)
(struct mca_bcol_base_module_t *bcol_module);
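/* Usage sketch (illustrative): because every primitive shares the signature
 * above, the ml level can launch and progress any table entry generically.
 * This loop is a hypothetical caller, not the actual coll/ml scheduler; it
 * assumes the BCOL_FN_* return codes and the BCOL_BCAST index defined
 * elsewhere in this file, and that input_args/const_args are already set up:
 *
 *   mca_bcol_base_module_collective_fn_primitives_t fn =
 *       bcol_module->bcol_function_table[BCOL_BCAST];
 *   int rc;
 *   do {
 *       rc = fn(&input_args, &const_args);
 *   } while (BCOL_FN_STARTED == rc || BCOL_FN_NOT_STARTED == rc);
 */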
/**
* function to query for collective function attributes
*
* @param attribute (IN) the attribute of interest
* @param algorithm_parameters (OUT) the value of attribute for this
* function. If this attribute is not supported,
* OMPI_ERR_NOT_FOUND is returned.
*/
typedef int (*mca_bcol_get_collective_attributes)(int attribute,
void *algorithm_parameters);
/* data structure for tracking the relevant data needed for ml level
* algorithm construction (e.g., function selection), initialization, and
* usage.
*/
struct coll_bcol_collective_description_t {
/* collective initiation function - the first function called */
mca_bcol_base_module_collective_fn_primitives_t coll_fn;
/* collective progress function - called repeatedly until the operation completes */
mca_bcol_base_module_collective_fn_primitives_t progress_fn;
/* function used to query the attributes of this collective */
mca_bcol_get_collective_attributes get_attributes;
/* attributes supported - bit map */
uint64_t attribute;
};
typedef struct coll_bcol_collective_description_t
coll_bcol_collective_description_t;
/* collective operation attributes */
enum {
/* supports dynamic decisions - e.g., do not need to have the collective
* operation fully defined before it can be started
*/
BCOL_ATTRIBUTE_DYNAMIC,
/* number of attributes */
BCOL_NUM_ATTRIBUTES
};
/* For rooted collectives,
* does the algorithm know its data source?
*/
enum {
DATA_SRC_KNOWN=0,
DATA_SRC_UNKNOWN,
DATA_SRC_TYPES
};
enum {
BLOCKING,
NON_BLOCKING
};
/* gvm For selection logic */
struct mca_bcol_base_coll_fn_comm_attributes_t {
int bcoll_type;
int comm_size_min;
int comm_size_max;
int data_src;
int waiting_semantics;
};
typedef struct mca_bcol_base_coll_fn_comm_attributes_t
mca_bcol_base_coll_fn_comm_attributes_t;
struct mca_bcol_base_coll_fn_invoke_attributes_t {
int bcol_msg_min;
int bcol_msg_max;
uint64_t datatype_bitmap; /* Max is OMPI_DATATYPE_MAX_PREDEFINED defined to be 45 */
uint32_t op_types_bitmap; /* bit map of optypes supported */
};
typedef struct mca_bcol_base_coll_fn_invoke_attributes_t
mca_bcol_base_coll_fn_invoke_attributes_t;
struct mca_bcol_base_coll_fn_desc_t {
opal_list_item_t super;
struct mca_bcol_base_coll_fn_comm_attributes_t *comm_attr;
struct mca_bcol_base_coll_fn_invoke_attributes_t *inv_attr;
mca_bcol_base_module_collective_fn_primitives_t coll_fn;
mca_bcol_base_module_collective_fn_primitives_t progress_fn;
};
typedef struct mca_bcol_base_coll_fn_desc_t mca_bcol_base_coll_fn_desc_t;
OBJ_CLASS_DECLARATION(mca_bcol_base_coll_fn_desc_t);
/* end selection logic */
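/* Illustrative sketch (not part of this header): a bcol registering a
 * function with the selection logic above might fill in both attribute
 * structs and append the descriptor to the per-collective list on the
 * module (bcol_fns_table, declared below). The concrete values and the
 * my_bcast_fn/my_bcast_progress_fn primitives are hypothetical:
 *
 *   mca_bcol_base_coll_fn_desc_t *desc =
 *       OBJ_NEW(mca_bcol_base_coll_fn_desc_t);
 *   desc->comm_attr = calloc(1, sizeof(*desc->comm_attr));
 *   desc->inv_attr  = calloc(1, sizeof(*desc->inv_attr));
 *   desc->comm_attr->bcoll_type        = BCOL_BCAST;
 *   desc->comm_attr->comm_size_min     = 2;
 *   desc->comm_attr->comm_size_max     = 64;
 *   desc->comm_attr->data_src          = DATA_SRC_KNOWN;
 *   desc->comm_attr->waiting_semantics = NON_BLOCKING;
 *   desc->inv_attr->bcol_msg_min = 0;
 *   desc->inv_attr->bcol_msg_max = 1024;
 *   desc->coll_fn     = my_bcast_fn;
 *   desc->progress_fn = my_bcast_progress_fn;
 *   opal_list_append(&bcol_module->bcol_fns_table[BCOL_BCAST],
 *                    (opal_list_item_t *) desc);
 */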
typedef int (*mca_bcol_base_module_collective_init_fn_t)
(struct mca_bcol_base_module_t *bcol_module,
mca_sbgp_base_module_t *sbgp_module);
/* per communicator memory initialization function */
typedef int (*mca_bcol_module_mem_init)(struct ml_buffers_t *registered_buffers,
mca_bcol_base_component_t *module);
/* Initialize memory block - ml_memory_block initialization interface function
*
* Invoked at the ml level, used to pass bcol specific registration information
* for the "ml_memory_block"
*
* @param[in] ml_memory_block Pointer to the ml_memory_block. This struct
* contains bcol specific registration information and a callback function
* used for resource recycling.
*
* @param[in] reg_data bcol specific registration data.
*
* @returns On Success: OMPI_SUCCESS
* On Failure: OMPI_ERROR
*
*/
/*typedef int (*mca_bcol_base_init_memory_fn_t)
(struct mca_bcol_base_memory_block_desc_t *ml_block, void *reg_data);*/
typedef int (*mca_bcol_base_init_memory_fn_t)
(struct mca_bcol_base_memory_block_desc_t *payload_block,
uint32_t data_offset,
struct mca_bcol_base_module_t *bcol,
void *reg_data);
typedef int (*mca_common_allgather_init_fn_t)
(struct mca_bcol_base_module_t *bcol_module);
typedef void (*mca_bcol_base_set_thresholds_fn_t)
(struct mca_bcol_base_module_t *bcol_module);
enum {
MCA_BCOL_BASE_ZERO_COPY = 1,
MCA_BCOL_BASE_NO_ML_BUFFER_FOR_LARGE_MSG = 1 << 1,
MCA_BCOL_BASE_NO_ML_BUFFER_FOR_BARRIER = 1 << 2
};
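/* Illustrative sketch: these flags are OR-ed into the supported_mode field
 * of mca_bcol_base_module_t (declared below) and tested with a bitwise AND:
 *
 *   bcol_module->supported_mode = MCA_BCOL_BASE_ZERO_COPY |
 *                                 MCA_BCOL_BASE_NO_ML_BUFFER_FOR_BARRIER;
 *   bool zero_copy =
 *       !!(bcol_module->supported_mode & MCA_BCOL_BASE_ZERO_COPY);
 */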
/* base module */
struct mca_bcol_base_module_t {
/* base coll component */
opal_object_t super;
/* bcol component (Pasha: Do we really need to cache the component?) */
mca_bcol_base_component_t *bcol_component;
/* network context that is used by this bcol;
 only one context per bcol is allowed */
bcol_base_network_context_t *network_context;
/* We are going to use the context index a lot;
 in order to decrease the number of dereferences
 (bcol->network_context->index)
 we cache the value on the bcol */
int context_index;
/* Set of flags that describe features supported by bcol */
uint64_t supported_mode;
/* per communicator memory initialization function */
mca_bcol_module_mem_init init_module;
/* sub-grouping module partner */
mca_sbgp_base_module_t *sbgp_partner_module;
/* size of subgroup - cache this, so we still have access after
 * sbgp_partner_module no longer exists */
int size_of_subgroup;
/* sequence number offset - want to make sure that we start
* id'ing collectives with id 0, so we can have simple
* resource management.
*/
int64_t squence_number_offset;
/* number of times to poll for operation completion before
* breaking out of a non-blocking collective operation
*/
int n_poll_loops;
/* size of the header that will go in the data buffer; should not include
 * any alignment information - let the ml level handle that
 */
uint32_t header_size;
/* Each bcol is assigned a unique value;
 * see if we can get away with a 16-bit id
 */
int16_t bcol_id;
/* FIXME:
 * Since mca_bcol_base_module_t is the only parameter passed
 * into bcol_basesmuma_bcast_init(), add this flag to indicate whether
 * the hdl-based algorithms are enabled.
 */
bool use_hdl;
/*
* Collective function pointers
*/
/* changing function signature - will replace bcol_functions */
mca_bcol_base_module_collective_fn_primitives_t bcol_function_table[BCOL_NUM_OF_FUNCTIONS];
/* Tables hold pointers to functions */
mca_bcol_base_module_collective_init_fn_primitives_t bcol_function_init_table[BCOL_NUM_OF_FUNCTIONS];
opal_list_t bcol_fns_table[BCOL_NUM_OF_FUNCTIONS];
struct mca_bcol_base_coll_fn_desc_t*
filtered_fns_table[DATA_SRC_TYPES][2][BCOL_NUM_OF_FUNCTIONS][NUM_MSG_RANGES+1][OMPI_OP_NUM_OF_TYPES][OMPI_DATATYPE_MAX_PREDEFINED];
/*
* Bcol interface function to pass bcol specific
* info and memory recycling call back
*/
mca_bcol_base_init_memory_fn_t bcol_memory_init;
/*
* netpatterns interface function; we would like to invoke this
* at the ml level
*/
mca_common_allgather_init_fn_t k_nomial_tree;
/* Each bcol caches a list which describes how many ranks
* are "below" each rank in this bcol
*/
int *list_n_connected;
/* offsets for scatter/gather */
int hier_scather_offset;
/* Small message threshold for each collective */
int small_message_thresholds[BCOL_NUM_OF_FUNCTIONS];
/* Set small_message_thresholds array */
mca_bcol_base_set_thresholds_fn_t set_small_msg_thresholds;
/* Pointer to the order counter on the upper layer,
used if the bcol needs to be ordered */
int *next_inorder;
};
typedef struct mca_bcol_base_module_t mca_bcol_base_module_t;
OMPI_DECLSPEC OBJ_CLASS_DECLARATION(mca_bcol_base_module_t);
/* function description */
struct mca_bcol_base_function_t {
int fn_idx;
/* module */
struct mca_bcol_base_module_t *bcol_module;
/*
* The following two parameters are used for bcol modules
* that want to do some optimizations based on the fact that
* n functions from the same bcol module are called in a row.
* For example, in the iboffload case, on the first call one
* will want to initialize the MWR, and start to instantiate
* it, but only post it at the end of the last call.
* The index of this function in a sequence of consecutive
* functions from the same bcol
*/
int index_in_consecutive_same_bcol_calls;
/* number of times functions from this bcol are
* called in order
*/
int n_of_this_type_in_a_row;
/*
* number of times functions from this module are called in the
* collective operation.
*/
int n_of_this_type_in_collective;
int index_of_this_type_in_collective;
};
typedef struct mca_bcol_base_function_t mca_bcol_base_function_t;
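/* Illustrative sketch: a bcol primitive can use the counters above to detect
 * the first and last call in a run of consecutive calls from the same
 * module, e.g. to build a batched work request once and only post it at the
 * end. begin_batch()/post_batch() are hypothetical helpers:
 *
 *   int first = (0 == const_args->index_in_consecutive_same_bcol_calls);
 *   int last  = (const_args->index_in_consecutive_same_bcol_calls ==
 *                const_args->n_of_this_type_in_a_row - 1);
 *   if (first) begin_batch();
 *   if (last)  post_batch();
 */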
struct mca_bcol_base_descriptor_t {
opal_free_list_item_t super;
/* Vasily: will be described in the future */
};
typedef struct mca_bcol_base_descriptor_t mca_bcol_base_descriptor_t;
static inline __opal_attribute_always_inline__ size_t
mca_bcol_base_get_buff_length(ompi_datatype_t *dtype, int count)
{
ptrdiff_t lb, extent;
ompi_datatype_get_extent(dtype, &lb, &extent);
return (size_t) (extent * count);
}
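/* Usage note (illustrative): the length is extent * count, so it covers the
 * full span of a possibly non-contiguous datatype. Inside OMPI the
 * predefined MPI handles are ompi_datatype_t pointers, so a call might look
 * like (handle shown for illustration):
 *
 *   size_t len = mca_bcol_base_get_buff_length(MPI_INT, 128);
 */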
#define MCA_BCOL_CHECK_ORDER(module, bcol_function_args) \
do { \
if (*((module)->next_inorder) != \
(bcol_function_args)->order_info.order_num) { \
return BCOL_FN_NOT_STARTED; \
} \
} while (0)
#define MCA_BCOL_UPDATE_ORDER_COUNTER(module, order_info) \
do { \
(order_info)->bcols_started++; \
if ((order_info)->n_fns_need_ordering == \
(order_info)->bcols_started) { \
++(*((module)->next_inorder)); \
} \
} while (0)
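/* Usage sketch (illustrative): an ordered bcol primitive guards its body
 * with the two macros above. The BCOL_FN_COMPLETE return code follows the
 * convention used elsewhere in this file; the function itself and
 * do_the_collective_work() are hypothetical:
 *
 *   int my_ordered_fn(bcol_function_args_t *args,
 *                     struct mca_bcol_base_function_t *const_args)
 *   {
 *       mca_bcol_base_module_t *module = const_args->bcol_module;
 *       MCA_BCOL_CHECK_ORDER(module, args);
 *       do_the_collective_work(args);
 *       MCA_BCOL_UPDATE_ORDER_COUNTER(module, &args->order_info);
 *       return BCOL_FN_COMPLETE;
 *   }
 */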
#if defined(c_plusplus) || defined(__cplusplus)
}
#endif
#endif /* MCA_BCOL_H */