/*
 * Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
 *                         University Research and Technology
 *                         Corporation. All rights reserved.
 * Copyright (c) 2004-2005 The University of Tennessee and The University
 *                         of Tennessee Research Foundation. All rights
 *                         reserved.
 * Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
 *                         University of Stuttgart. All rights reserved.
 * Copyright (c) 2004-2005 The Regents of the University of California.
 *                         All rights reserved.
 * $COPYRIGHT$
 *
 * Additional copyrights may follow
 *
 * $HEADER$
 */

/** @file */

#include "ompi_config.h"

#ifdef HAVE_STRING_H
#include <string.h>
#endif

#include "opal/datatype/opal_convertor.h"
#include "ompi/constants.h"
#include "ompi/communicator/communicator.h"
#include "ompi/datatype/ompi_datatype.h"
#include "ompi/mca/coll/coll.h"
#include "opal/sys/atomic.h"
#include "coll_sm.h"

/**
 * Shared memory broadcast.
 *
 * For the root, the general algorithm is to wait for a set of
 * segments to become available. Once one is, the root claims the set
 * by writing the current operation number and the number of processes
 * using the set to the flag. The root then loops over the set of
 * segments; for each segment, it copies a fragment of the user's
 * buffer into the shared data segment and then writes the data size
 * into its children's control buffers. The process is repeated until
 * all fragments have been written.
 *
 * For non-roots, for each set of buffers, they wait until the current
 * operation number appears in the in-use flag (i.e., written by the
 * root). Then, for each segment, they wait for a nonzero value to
 * appear in their control buffers. If they have children, they copy
 * the data from their parent's shared data segment into their own
 * shared data segment and write the data size into each of their
 * children's control buffers. They then copy the data from their
 * shared [local] data segment into the user's output buffer. The
 * process is repeated until all fragments have been received. If
 * they do not have children, they copy the data directly from the
 * parent's shared data segment into the user's output buffer.
 */
int mca_coll_sm_bcast_intra(void *buff, int count,
                            struct ompi_datatype_t *datatype, int root,
                            struct ompi_communicator_t *comm,
                            mca_coll_base_module_t *module)
{
    struct iovec iov;
    mca_coll_sm_module_t *sm_module = (mca_coll_sm_module_t*) module;
    mca_coll_sm_comm_t *data = sm_module->sm_comm_data;
    int i, ret, rank, size, num_children, src_rank;
    int flag_num, segment_num, max_segment_num;
    int parent_rank;
    size_t total_size, max_data, bytes;
    mca_coll_sm_in_use_flag_t *flag;
    opal_convertor_t convertor;
    mca_coll_sm_tree_node_t *me, *parent, **children;
    mca_coll_sm_data_index_t *index;
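
    /* Rough roles of the locals above: "flag" is the in-use flag that
       guards one set of shared segments, "index" points at the
       per-segment control/data areas within that set, and "convertor"
       packs (root) or unpacks (non-root) data between the user's
       buffer and shared memory. */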

    /* Setup some identities */

    rank = ompi_comm_rank(comm);
    size = ompi_comm_size(comm);

    OBJ_CONSTRUCT(&convertor, opal_convertor_t);
    iov.iov_len = mca_coll_sm_component.sm_fragment_size;
    bytes = 0;
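
    /* The process tree in data->mcb_tree is laid out as if the root
       were rank 0: shifting my rank by (rank + size - root) % size
       locates my node in that rotated numbering, and the non-root code
       below undoes the shift with (mcstn_id + root) % size to recover
       an actual communicator rank. */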
    me = &data->mcb_tree[(rank + size - root) % size];
    parent = me->mcstn_parent;
    children = me->mcstn_children;
    num_children = me->mcstn_num_children;

    /* Only have one top-level decision as to whether I'm the root or
       not. Do this at the slight expense of repeating a little logic
       -- but it's better than a conditional branch in every loop
       iteration. */

    /*********************************************************************
     * Root
     *********************************************************************/

    if (root == rank) {

        /* The root needs a send convertor to pack from the user's
           buffer to shared memory */

        if (OMPI_SUCCESS !=
            (ret =
             opal_convertor_copy_and_prepare_for_send(ompi_mpi_local_convertor,
                                                      &(datatype->super),
                                                      count,
                                                      buff,
                                                      0,
                                                      &convertor))) {
            return ret;
        }
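        /* total_size is the packed size of the entire message; the
           loops below run until that many bytes have been pushed
           through the shared segments. */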
        opal_convertor_get_packed_size(&convertor, &total_size);

        /* Main loop over sending fragments */

        do {
            flag_num = (data->mcb_operation_count++ %
                        mca_coll_sm_component.sm_comm_num_in_use_flags);
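
            /* Claim this set of segments: wait until any previous
               operation is finished with the in-use flag, then mark it
               with the current operation number and a use count of
               (size - 1), so every non-root process must release it
               before this set can be recycled. */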
            FLAG_SETUP(flag_num, flag, data);
            FLAG_WAIT_FOR_IDLE(flag);
            FLAG_RETAIN(flag, size - 1, data->mcb_operation_count - 1);

            /* Loop over all the segments in this set */

            segment_num =
                flag_num * mca_coll_sm_component.sm_segs_per_inuse_flag;
            max_segment_num =
                (flag_num + 1) * mca_coll_sm_component.sm_segs_per_inuse_flag;
            do {
                index = &(data->mcb_data_index[segment_num]);

                /* Copy the fragment from the user buffer to my fragment
                   in the current segment */
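                /* max_data goes in as the fragment capacity and comes
                   back as the number of bytes actually packed (it is
                   what advances "bytes" below). */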
                max_data = mca_coll_sm_component.sm_fragment_size;
                COPY_FRAGMENT_IN(convertor, index, rank, iov, max_data);
                bytes += max_data;

                /* Wait for the write to absolutely complete */
                opal_atomic_wmb();

                /* Tell my children that this fragment is ready */
                PARENT_NOTIFY_CHILDREN(children, num_children, index,
                                       max_data);

                ++segment_num;
            } while (bytes < total_size && segment_num < max_segment_num);
        } while (bytes < total_size);
    }

    /*********************************************************************
     * Non-root
     *********************************************************************/

    else {

        /* Non-root processes need a receive convertor to unpack from
           shared memory to the user's buffer */

        if (OMPI_SUCCESS !=
            (ret =
             opal_convertor_copy_and_prepare_for_recv(ompi_mpi_local_convertor,
                                                      &(datatype->super),
                                                      count,
                                                      buff,
                                                      0,
                                                      &convertor))) {
            return ret;
        }
        opal_convertor_get_packed_size(&convertor, &total_size);

        /* Loop over receiving (and possibly re-sending) the
           fragments */

        do {
            flag_num = (data->mcb_operation_count %
                        mca_coll_sm_component.sm_comm_num_in_use_flags);

            /* Wait for the root to mark this set of segments as
               ours */
            FLAG_SETUP(flag_num, flag, data);
            FLAG_WAIT_FOR_OP(flag, data->mcb_operation_count);
            ++data->mcb_operation_count;
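
            /* Note that non-roots do not retain the flag themselves:
               the root already retained it on behalf of all (size - 1)
               non-root processes, and each one releases it below once
               it has drained this set of segments. */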

            /* Loop over all the segments in this set */

            segment_num =
                flag_num * mca_coll_sm_component.sm_segs_per_inuse_flag;
            max_segment_num =
                (flag_num + 1) * mca_coll_sm_component.sm_segs_per_inuse_flag;
            do {

                /* Pre-calculate some values */
                parent_rank = (parent->mcstn_id + root) % size;
                index = &(data->mcb_data_index[segment_num]);

                /* Wait for my parent to tell me that the segment is ready */
                CHILD_WAIT_FOR_NOTIFY(rank, index, max_data);
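                /* The notification carries the fragment's byte count,
                   which CHILD_WAIT_FOR_NOTIFY returns in max_data; that
                   count is forwarded unchanged to my own children. */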

                /* If I have children, send the data to them */
                if (num_children > 0) {
                    /* Copy the fragment from the parent's portion in
                       the segment to my portion in the segment. */
                    COPY_FRAGMENT_BETWEEN(parent_rank, rank, index, max_data);

                    /* Wait for the write to absolutely complete */
                    opal_atomic_wmb();

                    /* Tell my children that this fragment is ready */
                    PARENT_NOTIFY_CHILDREN(children, num_children, index,
                                           max_data);

                    /* Set the "copy from buffer" to be my local
                       segment buffer so that we don't potentially
                       incur a non-local memory copy from the parent's
                       fan out data segment [again] when copying to
                       the user's buffer */
                    src_rank = rank;
                }

                /* If I don't have any children, set the "copy from
                   buffer" to be my parent's fan out segment to copy
                   directly from my parent */

                else {
                    src_rank = parent_rank;
                }

                /* Copy to my output buffer */
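                /* max_data is reset to the full fragment size as the
                   unpack limit; after COPY_FRAGMENT_OUT it holds the
                   number of bytes actually unpacked, which advances
                   "bytes" toward total_size. */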
                max_data = mca_coll_sm_component.sm_fragment_size;
                COPY_FRAGMENT_OUT(convertor, src_rank, index, iov, max_data);

                bytes += max_data;
                ++segment_num;
            } while (bytes < total_size && segment_num < max_segment_num);

            /* Wait for all copy-out writes to complete before I say
               I'm done with the segments */
            opal_atomic_wmb();

            /* We're finished with this set of segments */
            FLAG_RELEASE(flag);
        } while (bytes < total_size);
    }

    /* Kill the convertor */

    OBJ_DESTRUCT(&convertor);

    /* All done */

    return OMPI_SUCCESS;
}