1
1

6380 Коммитов

Автор SHA1 Сообщение Дата
Mikhail Kurnosov
f6e2d4ab04 coll: Add Rabenseifner's algorithm for Reduce and Allreduce
A component with implementation of R. Rabenseifner's algorithm for Reduce and Allreduce.
This algorithm is a combination of a reduce-scatter implemented with recursive vector halving
and recursive distance doubling, followed either by a gather or an allgather.

Current limitations:
  -- count >= 2^{\floor{\log_2 p}}
  -- commutative operations only
  -- intra-communicators onl

Signed-off-by: Mikhail Kurnosov <mkurnosov@gmail.com>

coll/spacc: Modify implementation to use `ompi_coll_base_sendrecv()`

Replace irecv() + isend() + ompi_request_wait() to ompi_coll_base_sendrecv().

Signed-off-by: Mikhail Kurnosov <mkurnosov@gmail.com>
2017-05-26 14:33:35 +07:00
Gilles Gouaillardet
0f79259b94 osc/rdma: use extent of the appropriate datatype in ompi_osc_rdma_rget_accumulate_internal()
origin_datatype and target_datatype might be different and hence have different extent,
so use either origin_extent or target_extent when appropriate.

Refs open-mpi/ompi#3569

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2017-05-26 13:59:38 +09:00
Geoff Paulsen
93078ad824 Merge pull request #3551 from markalle/1sided_some_single
1sided with some hosts single rank -- Fixes #3548
2017-05-25 13:59:48 -05:00
Mark Allen
df14cbf039 fix for buffer length check (rdma osc w/ odd datatypes)
The osc_rdma_get_remote_segment() has the 3rd and 4th args as
* target_disp
* length
which it uses to determine if the rdma falls within the bounds of
the window or not (actually it only checks the upper bound, but I'm
okay with that).

Anyway the caller previously was passing in the length argument as
    target_datatype->super.size * target_count
which which doesn't really represent the number of bytes after target_disp
for which data exists. In particular I could create a datatype as
    { disp -4, len 4 } and use target_disp 4
and that would be bytes 0-3 of the window where the original code
would think it was bytes 4-7 and could abort at the range check.

Ive changed it to use the opal_datatype_span() function.

Signed-off-by: Mark Allen <markalle@us.ibm.com>
2017-05-24 19:10:39 -04:00
Mark Allen
36f51bca26 yalla with irregular contig datatype -- Fixes 3566
Yalla has a macro PML_YALLA_INIT_MXM_REQ_DATA that checks if a datatype
is contiguous via opal_datatype_is_contiguous_memory_layout(dt,count)
and if so it selects a size and lb that presumably is what will rdma, as
            ompi_datatype_type_size(_dtype, &size); \
            ompi_datatype_type_lb(_dtype, &lb); \

This failed when I gave it a datatype constructed as [ ...] with extent 4.
What I mean by that datatype is
    lens[0] = 3;
    disps[0] = 1;
    types[0] = MPI_CHAR;
    MPI_Type_struct(1, lens, disps, types, &tmpdt);
    MPI_Type_create_resized(tmpdt, 0, 4, &mydt);
So there are 3 chars at offset 1, and the LB is 0 and the UB is 4.

So that macro decides that size=4 and lb=0 and later I suppose size is getting
updated to 3 for the final rdma, and so a send of a buffer
[ 0 1 2 3 ] gets recved as [ 0 1 2 _ ]. I think it should use the true lb
and the true extent.

For "regular" contig datatypes it would be the same, and for the irregular
ones that are still deemed contiguous by that utility function it should
still be the right thing to use.

Signed-off-by: Mark Allen <markalle@us.ibm.com>
2017-05-23 20:56:12 -04:00
Mark Allen
c9f31a8d39 fix for 1sided with some hosts single rank
See bug report
    https://github.com/open-mpi/ompi/issues/3548

If a 1sided test is launched -host hostA:2,hostB:1 some of the ranks
call allocate_state_single() and others call allocate_state_shared().
These functions were producing different values for module->state_size
but that's used when they lookup peer info from each other in
ompi_osc_rdma_peer_setup() so they need to all have matching
module->state_offset values.

This change adds a few unused bytes in the memory allocate_state_single()
creates so it matches.

Signed-off-by: Mark Allen <markalle@us.ibm.com>
2017-05-22 15:10:49 -04:00
Geoff Paulsen
50f9287c03 Merge pull request #2941 from markalle/pr/mpi-info-update2
Finally Merging this in.  MPI_*_get_info/set_info().
Targeting v3.1 release.  @hjelmn were you interested in switching some internal pieces to begin using this?  Should we target v3.1 (or whatever we call the Oct 15th release?)
2017-05-22 09:22:04 -05:00
Mark Allen
482d84b6e5 fixes for Dave's get/set info code
The expected sequence of events for processing info during object creation
is that if there's an incoming info arg, it is opal_info_dup()ed into the obj
at obj->s_info first. Then interested components register callbacks for
keys they want to know about using opal_infosubscribe_infosubscribe().

Inside info_subscribe_subscribe() the specified callback() is called with
whatever matching k/v is in the object's info, or with the default. The
return string from the callback goes into the new k/v stored in info, and
the input k/v is saved as __IN_<key>/<val>. It's saved the same way
whether the input came from info or whether it was a default. A null return
from the callback indicates an ignored key/val, and no k/v is stored for
it, but an __IN_<key>/<val> is still kept so we still have access to the
original.

At MPI_*_set_info() time, opal_infosubscribe_change_info() is used. That
function calls the registered callbacks for each item in the provided info.
If the callback returns non-null, the info is updated with that k/v, or if
the callback returns null, that key is deleted from info. An __IN_<key>/<val>
is saved either way, and overwrites any previously saved value.

When MPI_*_get_info() is called, opal_info_dup_mpistandard() is used, which
allows relatively easy changes in interpretation of the standard, by looking
at both the <key>/<val> and __IN_<key>/<val> in info. Right now it does
  1. includes system extras, eg k/v defaults not expliclty set by the user
  2. omits ignored keys
  3. shows input values, not callback modifications, eg not the internal values

Currently the callbacks are doing things like
    return some_condition ? "true" : "false"
that is, returning static strings that are not to be freed. If the return
strings start becoming more dynamic in the future I don't see how unallocated
strings could support that, so I'd propose a change for the future that
the callback()s registered with info_subscribe_subscribe() do a strdup on
their return, and we change the callers of callback() to free the strings
it returns (there are only two callers).

Rough outline of the smaller changes spread over the less central files:
  comm.c
    initialize comm->super.s_info to NULL
    copy into comm->super.s_info in comm creation calls that provide info
    OBJ_RELEASE comm->super.s_info at free time
  comm_init.c
    initialize comm->super.s_info to NULL
  file.c
    copy into file->super.s_info if file creation provides info
    OBJ_RELEASE file->super.s_info at free time
  win.c
    copy into win->super.s_info if win creation provides info
    OBJ_RELEASE win->super.s_info at free time

  comm_get_info.c
  file_get_info.c
  win_get_info.c
    change_info() if there's no info attached (shouldn't happen if callbacks
      are registered)
    copy the info for the user

The other category of change is generally addressing compiler warnings where
ompi_info_t and opal_info_t were being used a little too interchangably. An
ompi_info_t* contains an opal_info_t*, at &(ompi_info->super)

Also this commit updates the copyrights.

Signed-off-by: Mark Allen <markalle@us.ibm.com>
2017-05-17 01:12:49 -04:00
Todd Kordenbrock
27ee862964 mtl-portals4: in rendezvous, reissue PtlGet() if it fails
This commit fixes a race condition in the rendezvous protocol.  The
race occurs because the sender does not wait for the link event on the
send buffer.  Even though this has not been seen in the wild, it is
possible for the receiver to issue the PtlGet() before the ME is
linked which causes a NAK at the receiver.  This commit resolves this
race by reissuing the PtlGet() when a NAK occurs.

Signed-off-by: Todd Kordenbrock <thkgcode@gmail.com>
2017-05-15 13:11:13 -05:00
David Solt
50aa143ab6 Major structural changes to data types: .super infosubscriber
ompi_communicator_t, ompi_win_t, ompi_file_t all have a super class of type opal_infosubscriber_t instead of a base/super type of opal_object_t (in previous code comm used c_base, but file used super).  It may be a bit bold to say that being a subscriber of MPI_Info is the foundational piece that ties these three things together, but if you object, then I would prefer to turn infosubscriber into a more general name that encompasses other common features rather than create a different super class.  The key here is that we want to be able to pass comm, win and file objects as if they were opal_infosubscriber_t, so that one routine can heandle all 3 types of objects being passed to it.

MPI_INFO_NULL is still an ompi_predefined_info_t type since an MPI_Info is part of ompi but the internal details of the underlying information concept is part of opal.

An ompi_info_t type still exists for exposure to the user, but it is simply a wrapper for the opal object.

Routines such as ompi_info_dup, etc have all been moved to opal_info_dup and related to the opal directory.

Fortran to C translation tables are only used for MPI_Info that is exposed to the application and are therefore part of the ompi_info_t and not the opal_info_t

The data structure changes are primarily in the following files:

    communicator/communicator.h
    ompi/info/info.h
    ompi/win/win.h
    ompi/file/file.h

The following new files were created:

    opal/util/info.h
    opal/util/info.c
    opal/util/info_subscriber.h
    opal/util/info_subscriber.c

This infosubscriber concept is that communicators, files and windows can have subscribers that subscribe to any changes in the info associated with the comm/file/window.  When xxx_set_info is called, the new info is presented to each subscriber who can modify the info in any way they want.  The new value is presented to the next subscriber and so on until all subscribers have had a chance to modify the value.  Therefore, the order of subscribers can make a difference but we hope that there is generally only one subscriber that cares or modifies any given key/value pair.  The final info is then stored and returned by a call to xxx_get_info.

The new model can be seen in the following files:

    ompi/mpi/c/comm_get_info.c
    ompi/mpi/c/comm_set_info.c
    ompi/mpi/c/file_get_info.c
    ompi/mpi/c/file_set_info.c
    ompi/mpi/c/win_get_info.c
    ompi/mpi/c/win_set_info.c

The current subscribers where changed as follows:

    mca/io/ompio/io_ompio_file_open.c
    mca/io/ompio/io_ompio_module.c
    mca/osc/rmda/osc_rdma_component.c (This one actually subscribes to "no_locks")
    mca/osc/sm/osc_sm_component.c (This one actually subscribes to "blocking_fence" and "alloc_shared_contig")

Signed-off-by: Mark Allen <markalle@us.ibm.com>

Conflicts:
	AUTHORS
	ompi/communicator/comm.c
	ompi/debuggers/ompi_mpihandles_dll.c
	ompi/file/file.c
	ompi/file/file.h
	ompi/info/info.c
	ompi/mca/io/ompio/io_ompio.h
	ompi/mca/io/ompio/io_ompio_file_open.c
	ompi/mca/io/ompio/io_ompio_file_set_view.c
	ompi/mca/osc/pt2pt/osc_pt2pt.h
	ompi/mca/sharedfp/addproc/sharedfp_addproc.h
	ompi/mca/sharedfp/addproc/sharedfp_addproc_file_open.c
	ompi/mca/topo/treematch/topo_treematch_dist_graph_create.c
	ompi/mpi/c/lookup_name.c
	ompi/mpi/c/publish_name.c
	ompi/mpi/c/unpublish_name.c
	opal/mca/mpool/base/mpool_base_alloc.c
	opal/util/Makefile.am
2017-05-12 14:41:05 -04:00
Matias A Cabral
644641d06f PSM and PSM2 MTLs check on the max message size allowed by API.
OMPI send and receive mesages use size_t for the lenght while PSM and PSM2
psm(2)mq_send/receive use uint32_t. Type size_t is 64 bits in 64 bits arch.
Therefore, this patch adds a sanity check on the lenght of the message
and fails gracefully.

Signed-off-by: Matias Cabral <matias.a.cabral@intel.com>
2017-05-10 12:45:11 -07:00
Gilles Gouaillardet
a66909b8b4 Merge pull request #3488 from ggouaillardet/topic/romio314_ad_nfs
romio314: ad_nfs fixes for large files from upstream mpich
2017-05-09 16:58:02 +09:00
Gilles Gouaillardet
26f44da429 coll/base: fix mca_coll_base_alltoallv_intra_basic_inplace()
correctly handle the case when a MPI task has no data to send/recv

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2017-05-09 15:19:14 +09:00
Gilles Gouaillardet
eaf050cfe1 romio314: adio/ad_nfs: fix buffer overflows in ADIOI_NFS_{Read,Write}Strided
Refs: models/mpich#2338
Refs: models/mpich#2617

Signed-off-by: Rob Latham <robl@mcs.anl.gov>

(back-ported from upstream commit pmodels/mpich@642db57648)

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2017-05-09 11:11:12 +09:00
Gilles Gouaillardet
02af10ce6e romio314: update NFS read/write routines for large xfers
When we updated UFS and others we left NFS alone.  HDF group would like
a fix, so here we go.

Signed-off-by: Ken Raffenetti <raffenet@mcs.anl.gov>

(back-ported from upstream commit pmodels/mpich@684df9f4c9)

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2017-05-09 11:07:47 +09:00
Jeff Squyres
7185567d50 Merge pull request #3455 from jsquyres/pr/fix-lustre-configure
Lustre configure fixes
2017-05-08 16:49:23 -04:00
Ralph Castain
ef0e0171c9 Implement the changes required to support cross-library coordination. Update PMIx to support intra-process notifications and ensure that we always notify ourselves for events. Add a new ompi/interlib directory where cross-lib coordination code can go, and put the code to declare ourselves there (called from ompi_mpi_init.c).
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-05-08 10:04:50 -07:00
Jeff Squyres
c81bc50198 fs/lustre: remove redundant/dead code
We check for liblustreapi.h in OMPI_CHECK_LUSTRE, so this code was
commented out here.  Might as well fully delete it, since it's
redundant and dead.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2017-05-05 05:28:33 -07:00
Yossi
f56847542e Merge pull request #3347 from alinask/topic/ucx-sync-send
PML UCX: handle a synchronous send.
2017-04-26 18:02:09 +03:00
Alina Sklarevich
49913c692a PML UCX: unite the code for all the sending modes.
Signed-off-by: Alina Sklarevich <alinas@mellanox.com>
2017-04-26 13:17:06 +03:00
Howard Pritchard
462342d148 Merge pull request #3311 from hppritcha/topic/libfabric_moves_to_ofi
common/libfabric: move libfabric to ofi
2017-04-21 07:50:38 -06:00
Howard Pritchard
841192645b common/libfabric: move libfabric to ofi
This PR renames the common library for OFI libfabric from
libfabric to ofi.  There are a number of reasons this
is good to do:

1) its shorter and replaces 9 characters with three for
   function names for what may eventually be a fairly extensive interface
2) OFI is the term used for MTL and RML components that use
   the OFI libfabric interface
3) A planned OSC component will also use the OFI term.
4) Other HPC libraries that can use OFI libfabric tend to use
   the term "ofi" internally and also in their configure options
   relevant to OFI libfabric (i.e. MPICH/CH4, Intel MPI, Sandia SHMEM)

There seem to be comments in places in the Open MPI source
code that indicate that this common library will be going away.
Far from it as we will want to be able to share things like
AV objects between OMPI and possibly OSHMEM components that
use the OFI libfabric interface.

This PR also adds a synonym to the --with-libfabric(-libdir)
configury options: --with-ofi and with-ofi-libdir.

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2017-04-20 13:07:16 -06:00
Gilles Gouaillardet
ded63c5e0c ompi: use ompi_coll_base_sendrecv_actual() whenever possible
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2017-04-20 10:01:28 +09:00
Gilles Gouaillardet
52551d96c1 Merge pull request #3285 from ggouaillardet/topic/coll_zerobyte_messages
coll/base: always send/recv zero-byte messages
2017-04-20 09:22:47 +09:00
Gilles Gouaillardet
fa5cd0dbe5 use ptrdiff_t instead of OPAL_PTRDIFF_TYPE
since Open MPI now requires a C99, and ptrdiff_t type is part of C99,
there is no more need for the abstract OPAL_PTRDIFF_TYPE type.

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2017-04-19 13:41:56 +09:00
Yossi
9ebcafd6d6 Merge pull request #3260 from derbeyn/fix_yalla
Fix yalla PML: MPI_Recv does not return MPI_ERR_TRUNCATE upon overflow
2017-04-18 11:37:48 +03:00
Alina Sklarevich
d93b67257b PML UCX: handle a synchronous send.
MCA_PML_BASE_SEND_SYNCHRONOUS

Signed-off-by: Alina Sklarevich <alinas@mellanox.com>
2017-04-13 18:11:55 +03:00
Alina Sklarevich
eec310c99c PML/UCX/YALLA: Fix the message release call.
Set message to MPI_MESSAGE_NULL.

Signed-off-by: Alina Sklarevich <alinas@mellanox.com>
2017-04-13 14:41:13 +03:00
Nathan Hjelm
12b52b2b2c osc/pt2pt: fix infinite frag allocation loop
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2017-04-10 16:30:47 -06:00
Howard Pritchard
f5942ff23c Merge pull request #3304 from hppritcha/topic/de-ortization-of-ompi
de-ORTEfy the ompi tree
2017-04-07 14:14:41 -06:00
Noah Evans
ef29fb13cb de-ORTEfy the ompi tree
The ompi tree should be runtime independent, but over time a few
ORTE depedent definitions and functions have escaped into the ompi
tree. I'm working on my own runtime so I've used this as an opportunity
to get rid of ORTE dependencies in the ompi/ tree. I still need to go
back and change orte to conform to the new world and these changes are
untested, but I can now compile (but not link) without orte so I'm
commiting this changeset.

Signed-off-by: Noah Evans <noah.evans@gmail.com>
2017-04-07 12:35:58 -06:00
Nadia Derbey
f918d88c3e Fix yalla PML: Update previous commit after Yossofe's review
Signed-off-by: Nadia Derbey <Nadia.Derbey@atos.net>
2017-04-06 07:58:26 +02:00
Gilles Gouaillardet
f3581c8259 coll/base: have alltoallv send/recv zero-bytes messages
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2017-04-05 13:44:17 +09:00
Gilles Gouaillardet
5492edd71e coll/base: have ompi_coll_base_sendrecv() send/recv zero-bytes messages
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2017-04-05 13:44:05 +09:00
Nathan Hjelm
1322e5dee8 Merge pull request #3274 from hjelmn/osc_rdma_fix
osc/rdma: fix typo in atomic code
2017-04-04 00:20:42 -06:00
Gilles Gouaillardet
5dfd4ab6ca coll/tuned: remove set-but-not-used variables
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2017-04-04 13:18:11 +09:00
Nathan Hjelm
fad0803920 osc/rdma: fix typo in atomic code
Fixes #3267

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2017-04-03 15:54:28 -06:00
Nadia Derbey
b6de94e449 Fix yalla PML: MPI_Recv does not return MPI_ERR_TRUNCATE upon overflow
Signed-off-by: Nadia Derbey <Nadia.Derbey@atos.net>
2017-03-30 15:18:31 +02:00
Xin Zhao
ee952fcccd Passing estimated_num_procs to UCX init in PML and SPML.
Signed-off-by: Xin Zhao <xinz@mellanox.com>
2017-03-27 20:36:52 +03:00
Nathan Hjelm
c72fb30eb5 osc/pt2pt: fix typo
Signed-off-by: Nathan Hjelm <hjelmn@me.com>
2017-03-23 09:00:21 -06:00
Xin Zhao
6a99c60fbd Add multithreading support in PML UCX framework.
Signed-off-by: Xin Zhao <xinz@mellanox.com>
2017-03-20 19:55:00 +02:00
Jeff Squyres
760db0d5ce osc/pt2pt: fix compiler warning
Remove unused variable.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2017-03-16 05:46:11 -07:00
Jeff Squyres
1947280865 topo/treematch: squash some compiler warnings
Only define MIN/MAX if they are not already defined.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2017-03-16 05:44:26 -07:00
Nathan Hjelm
37214eda09 Merge pull request #3164 from hjelmn/ob1_pinned
pml/ob1: do not cache leave_pinned
2017-03-14 13:22:18 -06:00
Nathan Hjelm
3e7ef48c13 pml/ob1: do not cache leave_pinned
This commit fixes a bug that disabled both the RDMA pipeline and RDMA
protocols in ob1. ob1 was internally caching the values of
opal_leave_pinned and opal_leave_pinned_pipeline at init time. This is
no longer valid as opal_leave_pinned may be set by any call to a btl's
add_procs.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2017-03-14 09:00:40 -06:00
Valentin Petrov
fe069c9570 Fixes the coll_allgather usage bug
One should use the correct module object when calling
      c_coll.coll_allgather. Otherwise there will be a segfault in the
      case, for example, when hcoll is used. In that case
      c_coll.coll_allgather = mca_coll_hcoll_allgather while
      c_coll.coll_gather_module = tuned.

Signed-off-by: Valentin Petrov <valentinp@mellanox.com>
2017-03-14 09:47:39 +02:00
Alex Mikheev
c081239f88
ompi: pml ucx: fix persistant request init CR changes
Signed-off-by: Alex Mikheev <alexm@mellanox.com>
2017-03-08 13:26:29 +02:00
Alex Mikheev
c113c37a7a
ompi: pml ucx: fix persistant request initialization
Signed-off-by: Alex Mikheev <alexm@mellanox.com>
2017-03-08 10:59:41 +02:00
Nathan Hjelm
0195d15401 osc/pt2pt: flush pending fragments on lock ack
This commit addresses an issue that can occur in cases where a lot of
fragments are outstanding.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2017-03-06 13:58:46 -07:00
Edgar Gabriel
607dc2c039 Merge pull request #3103 from edgargabriel/pr/sharedfp-name-collision-fix
sharedfp/lockedfile and sm: fix the namecollision
2017-03-05 14:46:20 -06:00