/*
 * Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
 *                         University Research and Technology
 *                         Corporation.  All rights reserved.
 * Copyright (c) 2004-2005 The University of Tennessee and The University
 *                         of Tennessee Research Foundation.  All rights
 *                         reserved.
 * Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
 *                         University of Stuttgart.  All rights reserved.
 * Copyright (c) 2004-2005 The Regents of the University of California.
 *                         All rights reserved.
 * $COPYRIGHT$
 *
 * Additional copyrights may follow
 *
 * $HEADER$
 */
/** @file */

#include "ompi_config.h"

#include "ompi/constants.h"
#include "ompi/communicator/communicator.h"
#include "ompi/mca/coll/coll.h"
#include "opal/sys/atomic.h"
#include "coll_sm.h"

/**
 * Shared memory barrier.
 *
 * Tree-based algorithm for a barrier: a fan in to rank 0 followed by
 * a fan out using the barrier segments in the shared memory area.
 *
 * There are 2 sets of barrier buffers -- since there can only be, at
 * most, 2 outstanding barriers at any time, there is no need for more
 * than this.  The generalized in-use flags, control, and data
 * segments are not used.
 *
 * The general algorithm is for a given process to wait for its N
 * children to fan in by monitoring a uint32_t in its barrier "in"
 * buffer.  When this value reaches N (i.e., each of the children has
 * atomically incremented the value), then the process atomically
 * increments the uint32_t in its parent's "in" buffer.  Then the
 * process waits for the parent to set a "1" in the process' "out"
 * buffer.  Once this happens, the process writes a "1" in each of its
 * children's "out" buffers, and returns.
 *
 * There are corner cases, of course, such as the root that has no
 * parent, and the leaves that have no children.  But that's the
 * general idea.
 */
int mca_coll_sm_barrier_intra(struct ompi_communicator_t *comm,
                              mca_coll_base_module_t *module)
{
    int rank, buffer_set;
    mca_coll_sm_comm_t *data;
    uint32_t i, num_children;
    volatile uint32_t *me_in, *me_out, *parent, *children = NULL;
    int uint_control_size;
    mca_coll_sm_module_t *sm_module = (mca_coll_sm_module_t*) module;

    /* Lazily enable the module the first time we invoke a collective
       on it */
    if (!sm_module->enabled) {
        int ret;
        if (OMPI_SUCCESS != (ret = ompi_coll_sm_lazy_enable(module, comm))) {
            return ret;
        }
    }

    uint_control_size =
        mca_coll_sm_component.sm_control_size / sizeof(uint32_t);
    data = sm_module->sm_comm_data;
    rank = ompi_comm_rank(comm);
    num_children = data->mcb_tree[rank].mcstn_num_children;
    buffer_set = ((data->mcb_barrier_count++) % 2) * 2;
    me_in = &data->mcb_barrier_control_me[buffer_set];
    /* My *out* buffer sits one control-size stride (sm_control_size
       bytes) past my *in* buffer */
    me_out = (volatile uint32_t*)
        (((char*) me_in) + mca_coll_sm_component.sm_control_size);

    /* Wait for my children to write to my *in* buffer */

    if (0 != num_children) {
        /* Get children *out* buffer */
        children = data->mcb_barrier_control_children + buffer_set +
            uint_control_size;
        SPIN_CONDITION(*me_in == num_children, exit_label1);
        *me_in = 0;
    }

    /* Send to my parent and wait for a response.  Don't poll on the
       parent's out buffer: that would cause a lot of network
       traffic / contention / faults / etc.  Instead, each child polls
       on its own local memory, so only num_children messages are sent
       across the network (vs. num_children *each* time all the
       children poll).  I.e., the memory is polled by only one
       process, and it is changed only *once* by an external
       process. */

    if (0 != rank) {
        /* Get parent *in* buffer */
        parent = &data->mcb_barrier_control_parent[buffer_set];
        opal_atomic_add(parent, 1);

        SPIN_CONDITION(0 != *me_out, exit_label2);
        *me_out = 0;
    }

    /* Send to my children */

    for (i = 0; i < num_children; ++i) {
        children[i * uint_control_size * 4] = 1;
    }

    /* All done!  End state of the control segment:

       me_in: 0
       me_out: 0
    */

    return OMPI_SUCCESS;
}
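
/*
 * Illustrative sketch only; not part of the sm coll component.
 *
 * The fan-in / fan-out scheme described at the top of this file can be
 * tried outside of Open MPI with the minimal model below.  Everything
 * in it is an assumption made for illustration: the sketch_* names are
 * hypothetical, threads stand in for processes, C11 atomics stand in
 * for the shared memory segment and opal_atomic_add(), and the tree is
 * hard-wired as a binary tree (the real tree shape comes from
 * data->mcb_tree).  The block is skipped wherever SPIN_CONDITION is
 * defined, as it must be for the function above to compile, so it is
 * only built when this sketch is copied into a standalone file.
 */
#ifndef SPIN_CONDITION
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

/* Per-"process" control area: 2 buffer sets, each with in and out */
typedef struct {
    _Atomic uint32_t in[2];
    _Atomic uint32_t out[2];
} sketch_ctl_t;

typedef struct {
    sketch_ctl_t *ctl;
    atomic_uint *arrived, *violations;
    int nprocs, rank, rounds;
} sketch_arg_t;

static void sketch_barrier(sketch_ctl_t *ctl, int nprocs, int rank,
                           unsigned round)
{
    int c, first_child = 2 * rank + 1, set = (int) (round % 2);
    uint32_t num_children = 0;

    for (c = first_child; c < first_child + 2 && c < nprocs; ++c) {
        ++num_children;
    }
    /* Fan in: wait for every child to bump my "in" counter */
    if (0 != num_children) {
        while (atomic_load(&ctl[rank].in[set]) != num_children) {
            sched_yield();
        }
        atomic_store(&ctl[rank].in[set], 0);
    }
    /* Tell my parent, then poll only on my own "out" flag */
    if (0 != rank) {
        atomic_fetch_add(&ctl[(rank - 1) / 2].in[set], 1);
        while (0 == atomic_load(&ctl[rank].out[set])) {
            sched_yield();
        }
        atomic_store(&ctl[rank].out[set], 0);
    }
    /* Fan out: release my children */
    for (c = first_child; c < first_child + 2 && c < nprocs; ++c) {
        atomic_store(&ctl[c].out[set], 1);
    }
}

static void *sketch_worker(void *p)
{
    sketch_arg_t *a = p;
    unsigned r;

    for (r = 0; r < (unsigned) a->rounds; ++r) {
        atomic_fetch_add(a->arrived, 1);
        sketch_barrier(a->ctl, a->nprocs, a->rank, r);
        /* After barrier r, everyone must have arrived r + 1 times */
        if (atomic_load(a->arrived) < (r + 1) * (unsigned) a->nprocs) {
            atomic_fetch_add(a->violations, 1);
        }
    }
    return NULL;
}

/* Returns the number of observed barrier violations (0 is expected),
   or -1 on bad arguments / allocation failure */
int sketch_run_barriers(int nprocs, int rounds)
{
    sketch_ctl_t *ctl;
    sketch_arg_t *args;
    pthread_t *tids;
    atomic_uint arrived, violations;
    int i, result;

    if (nprocs < 1 || rounds < 0) {
        return -1;
    }
    ctl = malloc(nprocs * sizeof(*ctl));
    args = malloc(nprocs * sizeof(*args));
    tids = malloc(nprocs * sizeof(*tids));
    if (NULL == ctl || NULL == args || NULL == tids) {
        free(ctl); free(args); free(tids);
        return -1;
    }
    atomic_init(&arrived, 0);
    atomic_init(&violations, 0);
    for (i = 0; i < nprocs; ++i) {
        atomic_init(&ctl[i].in[0], 0);  atomic_init(&ctl[i].in[1], 0);
        atomic_init(&ctl[i].out[0], 0); atomic_init(&ctl[i].out[1], 0);
        args[i] = (sketch_arg_t) { ctl, &arrived, &violations,
                                   nprocs, i, rounds };
    }
    for (i = 0; i < nprocs; ++i) {
        /* A missing thread would leave the others spinning forever */
        if (0 != pthread_create(&tids[i], NULL, sketch_worker, &args[i])) {
            abort();
        }
    }
    for (i = 0; i < nprocs; ++i) {
        pthread_join(tids[i], NULL);
    }
    result = (int) atomic_load(&violations);
    free(ctl); free(args); free(tids);
    return result;
}
#endif /* !SPIN_CONDITION */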