1
1

5417 Коммитов

Автор SHA1 Сообщение Дата
Sergey Oblomov
1944295da3 COMMON/UCX: removed ucs stuff
Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
(cherry picked from commit ebc457baf5ded5dd46cd73918a2f69555f408c54)
2019-05-17 09:58:20 +03:00
Sergey Oblomov
fa0a0b1597 COMMON/UCX: init memhooks infra on external hooks only
- initialize memory hooks infrastructure only in case
  if external memory hooks are requested

Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
(cherry picked from commit a0a93060668cd11a783cc94c753efb3129df9dde)
2019-05-17 09:58:12 +03:00
George Bosilca
4946570b24 Remove few warnings identified by @rhc in #5514.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>

(cherry picked from commit open-mpi/ompi@6d11a45f44)
2019-05-11 16:38:31 +09:00
Gilles Gouaillardet
70a864fce3 btl/vader: fix finalize sequence
free the component mpool in mca_btl_vader_component_close()
and after freeing soem objects that depend on it such as
mca_btl_vader_component.vader_frags_user

Thanks Christoph Niethammer for reporting this.

Refs. open-mpi/ompi#6524

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>

(cherry picked from commit open-mpi/ompi@77060cad07)
2019-05-11 13:04:23 +09:00
George Bosilca
48f824327c Fix the leak of fragments for persistent sends.
The rdma_frag attached to the send request was not correctly released
upon request completion, leaking until MPI_Finalize. A quick solution
would have been to add RDMA_FRAG_RETURN at different locations on the
send request completion, but it would have unnecessarily made the
sendreq completion path more complex. Instead, I added the length to
the RDMA fragment so that it can be completed during the remote ack.

Be more explicit on the comment.

The rdma_frag can only be freed once when the peer forced a protocol
change (from RDMA GET to send/recv). Otherwise the fragment will be
returned once all data pertaining to it has been trasnferred.

NOTE: Had to add a typedef for "opal_atomic_size_t" from master into
opal/threads/thread_usage.h into this cherry pick (it is in
opal/include/opal_stdatomic.h on master, but that file does not exist
here on the v4.0.x branch).

Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
(cherry picked from commit a16cf0e4dd6df4dea820fecedd5920df632935b8)
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2019-05-03 06:20:02 -07:00
Mikhail Brinskii
e4ee56d1f3 SPML/UCX: Add shmemx_alltoall_global_nb routine to shmemx.h
The new routine transfers the data asynchronously from the source PE to all
PEs in the OpenSHMEM job. The routine returns immediately. The source and
target buffers are reusable only after the completion of the routine.
After the data is transferred to the target buffers, the counter object
is updated atomically. The counter object can be read either using atomic
operations such as shmem_atomic_fetch or can use point-to-point synchronization
routines such as shmem_wait_until and shmem_test.

Signed-off-by: Mikhail Brinskii <mikhailb@mellanox.com>
(cherry picked from commit 2ef5bd8b3671f1e10caf00d06d66d120eac9c5be)
2019-05-02 21:25:59 +03:00
Xin Zhao
69a80fce9f ompi/oshmem/spml/ucx: use lockfree array to optimize spml_ucx_progress/delete oshmem_barrier in shmem_ctx_destroy
ompi/oshmem/spml/ucx: optimize spml ucx progress

Signed-off-by: Tomislav Janjusic <tomislavj@mellanox.com>
(cherry picked from commit 9c3d00b144641d2929f830279dcc9d163c38e9e1)
2019-03-21 23:59:58 +02:00
Xin Zhao
596997c194 ompi/oshmem/spml/ucx: defer clean up shmem_ctx to shmem_finalize
Signed-off-by: Tomislav Janjusic <tomislavj@mellanox.com>
(cherry picked from commit e1c1ab020227fc18d145379ab29ea86a3cdb66b1)
2019-03-21 23:58:23 +02:00
Howard Pritchard
15cfba5347
Merge pull request #6503 from jjhursey/v4x-rm-hash-pmix3
Do not force 'hash' gds on direct modex
2019-03-19 17:58:26 -05:00
Joshua Hursey
45526fadee Do not force 'hash' gds on direct modex
* Forcing the 'hash' gds component should not be necessary any more.

Port of PR #6498 (component names changed so a cherry-pick would not work)

Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
2019-03-19 10:52:17 -05:00
Nysal Jan K.A
1329cef213 opal/atomics: Add acquire semantics back for spinlocks
This was introduced in commit 9d0b3fe9

Signed-off-by: Nysal Jan K.A <jnysal@in.ibm.com>
(cherry picked from commit 00f27a80fc63053db1aeb42140148d7a3d1379b3)
2019-03-19 19:45:20 +05:30
Gilles Gouaillardet
8da4605589 btl/openib: immediately release the device when no port is allowed
Many thanks to Sergey Oblomov for reporting this issue
and the countless traces provided when troubleshooting it.

This is a one-off commit for the v4.0.x branch since btl/openib has been removed
 from master.

Refs. open-mpi/ompi#6137

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2019-03-19 09:26:11 +09:00
Gilles Gouaillardet
c58c774981 btl/openib: have add_proc() return immediately when the port is disabled.
Fixes an issue introduced in open-mpi/ompi@0a2ce58040

This is a one-off commit for the v4.0.x branch since btl/openib has been removed from master.

Refs. open-mpi/ompi#6137

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2019-03-19 09:24:25 +09:00
Gilles Gouaillardet
d7053a306a btl/openib: delay UCX warning to add_procs()
If UCX is available, then pml/ucx will be used instead of
pml/ob1 + btl/openib, so there is no need to warn about
btl/openib not supporting Infiniband.

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>

(cherry picked from commit open-mpi/ompi@0a2ce58040)
2019-03-19 09:24:00 +09:00
Howard Pritchard
ceb93d7c03
Merge pull request #6491 from bosilca/v4.0.x
v4.0.x: Cherry-pick fixes for issue #6258 from master (vader fixes)
2019-03-15 17:08:52 -06:00
Howard Pritchard
27899b0e8f
Merge pull request #6486 from hoopoepg/topic/check-ucx-params-v4.0
PML/SPML/UCX: added evaluation of mmap events - v4.0
2019-03-14 17:02:46 -06:00
Nathan Hjelm
3df8ed9cc0
btl/vader: fix fragment sizes used by free lists
This commit fixes a bug introduced in
f62d26ddbc8cda4d985cceee531a2ec32406d1f6. That commit changed how
vader allocates fragment memory from the shared memory
segment. Unfortunately, the values used for the fragment sizes did not
include space for the fragment header. This can cause an overrun of
data from one fragment to the header of the next fragment.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2019-03-14 17:25:31 -04:00
Nathan Hjelm
20017d345e
btl/vader: use basic mpool type to handle frag/fbox allocation
This commit updates btl/vader to use an mpool for handling all shared
memory allocations (frags, fboxes).

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2019-03-14 17:21:12 -04:00
Nathan Hjelm
bac6024b5a
mpool: add new base module type "basic"
This commit adds a new mpool base module type: basic. This module can
be used with an opal_free_list_t to allocate space from a
pre-allocated block (such as a shared memory region). The new module
only supports allocation and is not meant for more dynamic use cases
at this time.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2019-03-14 17:20:30 -04:00
Mark Allen
fcf53becc3 opal_hwloc_base_cset2str() off-by-1 in its strncat()
I think the strncat() calls here need to be of the form
    strncat(str, new_str_to_add, len - strlen(new_str_to_addstr) - 1);
since in the OMPI calls len is being used as total number of bytes
in str.

strncat(dest,src,n) on the other hand is documented as writing up to
n chars from the incoming string plus 1 for the null, for n+1 total
bytes it can write.

Signed-off-by: Mark Allen <markalle@us.ibm.com>
(cherry picked from commit 30d60994d258f5f0b7c432efd284d1b6b8333faf)

Conflicts:
	opal/mca/hwloc/base/hwloc_base_util.c
2019-03-14 13:08:25 -04:00
Sergey Oblomov
bed8141088 COMMON/UCX: rewording of hooks suggestion
- also updated output macro

Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
(cherry picked from commit c319cf9adefb69c78a73eb4a83a40dee5b697a53)
2019-03-14 16:48:36 +02:00
Sergey Oblomov
14c271f993 PML/SPML/UCX: added evaluation of mmap events
- there was a set of UCX related issues reported which caused
  by mmap API hooks conflicts. We added diagnostic of such
  problems to simplify bug-resolving pipeline

Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
(cherry picked from commit d8e3562bae700d84873c1d5ca9c45c846d7387ed)
2019-03-14 16:48:25 +02:00
Aurelien Bouteiller
cf34de33eb Avoid a double lock interlock when calling pmix_finalize
Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>
2019-03-08 15:33:17 -05:00
Howard Pritchard
8b2d804661
Merge pull request #6450 from ggouaillardet/topic/v4.0.x/usnic_test
v4.0.x: btl/usnic: fix usnic_btl_run_tests CPPFLAGS
2019-03-06 09:26:09 -07:00
George Bosilca
e4aae6b5c8
Add a test for very large data.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
2019-03-05 19:43:31 -05:00
Gilles Gouaillardet
320a839be9
opal/datatype: correctly handle large datatypes
Always use size_t (instead of converting to an uint32_t) in order to
correctly support large datatypes.

Thanks Ben Menadue for the initial bug report

Refs open-mpi/ompi#6016

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2019-03-05 19:41:39 -05:00
Jeff Squyres
8c4c982271 btl/usnic: amend Makefile.am fix from b4097626ab
Use $(AM_CPPFLAGS) in $(usnic_btl_run_tests_CPPFLAGS) so that we don't
have to replicate hard-coded values.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit 14563770a1d64c465ee1f205c9981de39970bb33)
2019-03-05 09:42:03 -08:00
Gilles Gouaillardet
a9ba07b04e btl/usnic: fix usnic_btl_run_tests CPPFLAGS
do define the OMPI_LIBMPI_NAME macro via the CPPFLAGS.
The issue occurs when Open MPI is configured with
--enable-opal-btl-usnic-unit-tests

Thanks George Marselis for reporting this issue

Refs. open-mpi/ompi#6441

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>

(cherry picked from commit open-mpi/ompi@b4097626ab)
2019-03-05 11:02:13 +09:00
Ralph Castain
0322ad028d Update slurm pmi configury to account for pmix
When Slurm is built against PMIx, some installations place a copy of the
PMIx library that Slurm is linking against in the Slurm PMI location.
Current configury ignores that location. The desired behavior is to look
for a PMIx lib in that location when --with-pmi is given. If the user
also specifies --with-pmix and gives a different location, then override
anything previously found and look for it where the user directed.

Signed-off-by: Ralph Castain <rhc@pmix.org>
(cherry picked from commit cd1b5641beca7f158360983cd31f7297548b0a3c)
2019-03-01 08:39:49 -08:00
Ben Menadue
8bf3a86cb0 Hold off running hwloc:external feature tests until after we decide if we're using the internal or external component. This fixes #6430.
Signed-off-by: Ben Menadue <ben.menadue@nci.org.au>
(cherry picked from commit 17dcc7041ac272c65eb727f45c5628459cbb6055)
2019-02-25 15:09:48 -08:00
Howard Pritchard
0b915b7e56
Merge pull request #6333 from jsquyres/pr/v4.0.x/hwloc-macro-conflict-fixes
v4.0.x: Various minor hwloc cleanups
2019-02-12 09:13:19 -07:00
Howard Pritchard
6513b855cf
Merge pull request #6249 from hjelmn/v4.0.x_fix_issue_6201_in_the_v4.0.x_branch
v4.0.x: btl/vader: don't try to set reachabilty in add_procs if not requested
2019-02-11 13:16:15 -07:00
Howard Pritchard
5dd63405ce
Merge pull request #6368 from jsquyres/pr/v4.0.x/fix-ofi-configury
v4.0.x: fix OFI configury
2019-02-11 13:15:52 -07:00
Howard Pritchard
9e306cee49
Merge pull request #6336 from jsquyres/pr/v4.0.x/fix-datatype-destructor-leak
v4.0.x: opal/datatype: plug a memory leak in opal_datatype_t destructor
2019-02-11 13:14:06 -07:00
Jeff Squyres
7fd62cf745 Remove opal/mca/common/ofi.
It never lived up to its purpose (and has caused amorphous indirect
errors such as https://github.com/open-mpi/ompi/issues/2519), so
delete it.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit dd20174532928e0c9cdbe7b206868e6e4bea9d0b)
2019-02-07 06:39:22 -08:00
Jeff Squyres
9ad871fc38 ofi: revamp OPAL_CHECK_OFI configury
Update the OPAL_CHECK_OFI configury macro:

- Make it safe to call the macro multiple times:
  - The checks only execute the first time it is invoked
  - Subsequent invocations, it just emits a friendly "checking..."
    message so that configure output is sensible/logical
- With the goal of ultimately removing opal/mca/common/ofi, rename the
  output variables from OPAL_CHECK_OFI to be
  opal_ofi_{happy|CPPFLAGS|LDFLAGS|LIBS}.
- Update btl/usnic and mtl/ofi for these new conventions.
- Also, don't use AC_REQUIRE to invoke OPAL_CHECK_OFI because that
  causes the macro to be invoked at a fairly random time, which makes
  configure stdout confusing / hard to grok.
- Remove a little left-over kruft in OPAL_CHECK_OFI, too (which
  resulted in an indenting change, making the change to
  opal_check_ofi.m4 look larger than it really is).

Thanks Alastair McKinstry for the report and initial fix.
Thanks Rashika Kheria for the reminder.

Updated from master cherry pick: the OFI BTL does not exist on the
v4.0.x branch.  Therefore, did not include the OFI BTL changes on
master in this cherry pick.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit f5e1a672ccd5db127e85e1e8f6bcfeb8a8b04527)
2019-02-07 06:36:35 -08:00
Gilles Gouaillardet
0ae48475a1 opal/datatype: reset ptypes in opal_datatype_clone()
Reset ptypes when cloning a datatype in order to prevent
a double free() in the opal_datatype_t destructor.

This fixes a bug introduced in open-mpi/ompi@7c938f070f

Fixes open-mpi/ompi#6346

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>

(cherry picked from commit open-mpi/ompi@b395342c9f)
2019-02-01 14:39:49 +09:00
George Bosilca
8acdc53892 Provide a better fix for #6285.
The issue was a little complicated due to the internal stack used in the
convertor. The main issue was that in the case where we run out of iov
space to save the raw description of the data while hanbdling a
repetition (loop), instead of saving the current position and bailing out
directly we reading of the next predefined type element. It worked in
most cases, except the one identified by the HDF5 test. However, the
biggest issue here was the drop in performance for all ensuing calls to
the convertor pack/unpack, as instead of handling contiguous loops as a
whole (and minimizing the number of memory copies) we copied data
description by data description.

Signed-off-by: George Bosilca <bosilca@icl.utk.edu>

(back-ported from commit open-mpi/ompi@5a82c4fd07)
2019-02-01 09:28:52 +09:00
Gilles Gouaillardet
f7327735a0 opal/datatype: fix opal_convertor_raw
correctly handle the case in which iovec is full and the
last accessed element of the datatype is the beginning of a loop

Refs. open-mpi/ompi#6285

Thanks Axel Huebl for reporting this

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>

(back-ported from commit open-mpi/ompi@0832ab5acc)
2019-02-01 09:26:30 +09:00
Gilles Gouaillardet
90a9c12fdb opal/datatype: plug a memory leak in opal_datatype_t destructor
correctly free ptypes if the datatype is not pre-defined.

Thanks Axel Huebl for reporting this.

Refs. open-mpi/ompi#6291

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
(cherry picked from commit 7c938f070fa8c906918507dbc78fdadcde324610)
2019-01-30 10:41:14 -08:00
Gilles Gouaillardet
c85fd35f27 opal: remove unnecessary #include file
opal_config_bottom.h can only be #include'd in opal_config.h,
so there is no need to #include "opal_config.h" inside.

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
(cherry picked from commit c8790d29de5ef399fd805f27366af6c2dc87ce9e)
2019-01-30 07:33:32 -05:00
Gilles Gouaillardet
f79f14ad93 hwloc/base: fix some off-by-one errors
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
(cherry picked from commit 73d104f6959c95f676a779826f511826e0b17f6a)
2019-01-30 07:33:32 -05:00
Jeff Squyres
788c92b1ce hwloc/external.h: fix a clash with external HWLOC_VERSION[*]
Some macros defined by the embedded hwloc ends up in opal_config.h
because hwloc configury m4 files are slurped into Open MPI.  These
macros are not required here, and they might conflict with an external
hwloc install, so simply #undef them in hwloc/external/external.h
after including <opal_config.h> but before including the external
<hwloc.h>.

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit f22b7d4f46b03554add3ff2254d1c893359aff84)
2019-01-30 07:33:32 -05:00
Howard Pritchard
fb39c7f7e6
Merge pull request #6271 from rhc54/cmr401/pmix3
v4.0.1: Update to PMIx 3.1.2
2019-01-28 15:09:03 -06:00
Ralph Castain
335f8c5100 Update to PMIx 3.1.2
Update the OPAL glue configure code to correctly link the opal/pmix3
component to the hwloc used by OMPI instead of defaulting to the
system-level hwloc. Required a corresponding update to the PMIx hwloc
configure code so we treat hwloc the same way we handle libevent in
embedded scenarios. Roll to PMIx v3.1.2 for plugging of memory leaks and
addition of faster PMIx_Get response

Signed-off-by: Ralph Castain <rhc@pmix.org>
2019-01-25 15:58:53 +09:00
Howard Pritchard
3ef8a8b253
Merge pull request #6280 from hppritcha/topic/mpool_comp_fix_v40x
Update mpool_hugepage_component.c
2019-01-18 07:34:02 -06:00
heasterday
2fc5ab7c8f Update mpool_hugepage_component.c
Signed-off-by: Hunter Easterday <heasterday@lanl.gov>
(cherry picked from commit ad0d2c451e63301e5a3b595f9df67bd5c813955e)
(cherry picked from commit 509380d99fc7e293f18a2dbb495ef73f9f4cbfef)
2019-01-15 08:10:46 -07:00
Nathan Hjelm
b7fbdeb759 btl/vader: minor correction to match ompi coding style
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
(cherry picked from commit edaf08bf6d6ed24573187376921fe67a449851b2)
2019-01-07 16:40:01 -07:00
Nathan Hjelm
140842668f btl/vader: don't try to set reachabilty in add_procs if not requested
This commit fixes a bug where add_procs can incorrectly return an
error when going through the dynamic add_procs path. This doesn't
happen normally, only when pml/ob1 is not in use.

References #6201

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
(cherry picked from commit 30b8336cb40e586e5d926b2b52cd78bf3751e5d3)
2019-01-07 14:20:33 -07:00
Gilles Gouaillardet
61108b6228 pmix/ext3x: fix support for external PMIx v3.1
The PMIX_MODEX and PMIX_INFO_ARRAY macros were removed from the PMIx 3.1 standard.
Open MPI does not really need them (they are only used to be reported as not supported),
so smply #ifdef protect them to support an external PMIx v3.1

The change only need to be done in ext3x/ext3x.c.
But since this file is automatically generated from pmix3x/pmix3x.c, we have to update
the latter file.

Refs. open-mpi/ompi#6247

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>

(back-ported from commit open-mpi/ompi@950ba16aa1)
2019-01-07 20:27:34 +09:00