Commit graph

5573 commits

Author SHA1 Message Date
KAWASHIMA Takahiro
f6b39452f6 opal/datatype: Support short float
The type `short float` is proposed for the C language in ISO/IEC JTC
1/SC 22 WG 14 (C WG), mainly for IEEE 754-2008 binary16, a.k.a.
half-precision floating point or FP16.

With this commit, `short float` and `short float _Complex` are detected
in `configure` and used in Open MPI internal code. `MPI_SHORT_FLOAT`
and its complex number version are not added yet.

This commit changes values of existing `OPAL_DATATYPE_*` macros.
This change does not affect ABI compatibility of `libmpi.so` and the
like because these values are only used in OPAL and OMPI internal code.

Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
2019-02-01 12:40:14 +09:00
Gilles Gouaillardet
b395342c9f opal/datatype: reset ptypes in opal_datatype_clone()
Reset ptypes when cloning a datatype in order to prevent
a double free() in the opal_datatype_t destructor.

This fixes a bug introduced in open-mpi/ompi@7c938f070f

Fixes open-mpi/ompi#6346

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2019-02-01 11:20:13 +09:00
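The double free that b395342c9f guards against can be illustrated with a minimal sketch. The names below (`toy_datatype_t`, `toy_clone`, `toy_destruct`) are hypothetical stand-ins, not the real `opal_datatype_t` API; the sketch only assumes that the destructor frees an owned `ptypes` array.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a datatype that owns a heap-allocated array. */
typedef struct {
    size_t  size;
    size_t *ptypes;   /* owned by this object; freed in the destructor */
} toy_datatype_t;

/* Destructor: release the owned array (free(NULL) is a no-op). */
void toy_destruct(toy_datatype_t *dt)
{
    assert(dt != NULL);
    free(dt->ptypes);
    dt->ptypes = NULL;
}

/* Clone: copy the plain fields, then reset the owned pointer so that the
 * source and the clone never both try to free the same allocation. */
void toy_clone(const toy_datatype_t *src, toy_datatype_t *dst)
{
    assert(src != NULL && dst != NULL);
    memcpy(dst, src, sizeof(*dst));
    dst->ptypes = NULL;   /* without this, both destructors free src->ptypes */
}
```

With the reset in place, destroying both the source and the clone frees the array exactly once.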
bosilca
2cf6944e70
Merge pull request #6326 from bosilca/fix/convertor_raw
Provide a better fix for #6285.
2019-01-31 18:20:46 -05:00
George Bosilca
5a82c4fd07
Provide a better fix for #6285.
The issue was a little complicated due to the internal stack used in the
convertor. The main issue was that in the case where we run out of iov
space to save the raw description of the data while handling a
repetition (loop), instead of saving the current position and bailing out
directly we kept reading the next predefined type element. It worked in
most cases, except the one identified by the HDF5 test. However, the
biggest issue here was the drop in performance for all ensuing calls to
the convertor pack/unpack, as instead of handling contiguous loops as a
whole (and minimizing the number of memory copies) we copied data
description by data description.

Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
2019-01-31 10:01:48 -05:00
Thananon Patinyasakdikul
58244b36d1
Merge pull request #6320 from thananon/pr/wait_sync_fix
opal/threads: reverted #6199
2019-01-30 15:31:27 -05:00
Nathan Hjelm
2c8f745d8d
Merge pull request #6337 from hjelmn/btl_vader_fix_a_stupid_error_in_the_fragment_sizes_used_by_the_free_lists_that_can_cause_weird_results
btl/vader: fix fragment sizes used by free lists
2019-01-30 13:26:57 -07:00
Nathan Hjelm
b51c8f888c btl/vader: fix fragment sizes used by free lists
This commit fixes a bug introduced in
f62d26ddbc. That commit changed how
vader allocates fragment memory from the shared memory
segment. Unfortunately, the values used for the fragment sizes did not
include space for the fragment header. This can cause an overrun of
data from one fragment to the header of the next fragment.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2019-01-30 12:31:34 -07:00
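The overrun described in b51c8f888c comes down to simple size arithmetic. The sketch below uses a hypothetical header type and helper names (`toy_frag_hdr_t`, `toy_frag_alloc_size`, and so on); it is not the vader code, only an illustration of why the per-fragment size must include the header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fragment header that precedes every payload in the segment. */
typedef struct {
    uint64_t len;
    uint64_t flags;
} toy_frag_hdr_t;

/* Bytes to reserve per fragment: header plus payload, not payload alone. */
size_t toy_frag_alloc_size(size_t payload_size)
{
    return sizeof(toy_frag_hdr_t) + payload_size;
}

/* Offset of fragment `index` when fragments are carved back to back. */
size_t toy_frag_offset(size_t index, size_t payload_size)
{
    return index * toy_frag_alloc_size(payload_size);
}

/* Returns 1 if a full payload written into fragment `index` stays clear of
 * the header of fragment `index + 1`. */
int toy_frag_payload_fits(size_t index, size_t payload_size)
{
    size_t payload_end = toy_frag_offset(index, payload_size)
                         + sizeof(toy_frag_hdr_t) + payload_size;
    assert(payload_end >= payload_size);   /* guard against wraparound */
    return payload_end <= toy_frag_offset(index + 1, payload_size);
}
```

If `toy_frag_alloc_size` returned only `payload_size`, the end of each payload would land inside the next fragment's header, which is the kind of overrun the commit message describes.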
Jeff Squyres
2203f8d900
Merge pull request #6185 from ggouaillardet/topic/hwloc_macros
hwloc: remove public hwloc macros from opal_config.h
2019-01-30 07:32:22 -05:00
bosilca
29915fc943
Merge pull request #6292 from ggouaillardet/topic/opal_datatype_destruct
opal/datatype: plug a memory leak in opal_datatype_t destructor
2019-01-29 17:33:18 -05:00
Thananon Patinyasakdikul
56d3e0a43d opal/threads: reverted #6199
This commit reverts PR #6199 because it introduced a deadlock in some cases.
It also removes the assert because its condition is obsolete.

Signed-off-by: Thananon Patinyasakdikul <tpatinya@utk.edu>
2019-01-29 13:34:44 -05:00
Gilles Gouaillardet
c8790d29de opal: remove unnecessary #include file
opal_config_bottom.h can only be #include'd in opal_config.h,
so there is no need to #include "opal_config.h" inside.

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2019-01-29 07:40:54 -08:00
Gilles Gouaillardet
73d104f695 hwloc/base: fix some off-by-one errors
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2019-01-29 07:36:56 -08:00
Jeff Squyres
f22b7d4f46 hwloc/external.h: fix a clash with external HWLOC_VERSION[*]
Some macros defined by the embedded hwloc end up in opal_config.h
because hwloc configury m4 files are slurped into Open MPI.  These
macros are not required here, and they might conflict with an external
hwloc install, so simply #undef them in hwloc/external/external.h
after including <opal_config.h> but before including the external
<hwloc.h>.

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2019-01-29 07:36:01 -08:00
Gilles Gouaillardet
0832ab5acc opal/datatype: fix opal_convertor_raw
correctly handle the case in which iovec is full and the
last accessed element of the datatype is the beginning of a loop

Refs. open-mpi/ompi#6285

Thanks Axel Huebl for reporting this

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2019-01-23 15:38:43 +09:00
Gilles Gouaillardet
7c938f070f opal/datatype: plug a memory leak in opal_datatype_t destructor
correctly free ptypes if the datatype is not pre-defined.

Thanks Axel Huebl for reporting this.

Refs. open-mpi/ompi#6291

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2019-01-22 10:57:57 +09:00
Nathan Hjelm
fe5d5c75f9
Merge pull request #6279 from hjelmn/ompi_memory_usage
mpool: add new base module type "basic"
2019-01-15 09:48:38 -07:00
Nathan Hjelm
f62d26ddbc btl/vader: use basic mpool type to handle frag/fbox allocation
This commit updates btl/vader to use an mpool for handling all shared
memory allocations (frags, fboxes).

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2019-01-14 15:54:16 -07:00
Nathan Hjelm
6ffc7cc96c mpool: add new base module type "basic"
This commit adds a new mpool base module type: basic. This module can
be used with an opal_free_list_t to allocate space from a
pre-allocated block (such as a shared memory region). The new module
only supports allocation and is not meant for more dynamic use cases
at this time.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2019-01-14 15:48:46 -07:00
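An allocation-only pool over a pre-allocated block, as described in 6ffc7cc96c, can be sketched as a bump allocator. Everything here (`toy_pool_t`, `toy_pool_init`, `toy_pool_alloc`) is hypothetical and is not the mpool "basic" module interface; it only shows the general technique.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical allocation-only pool over a caller-provided block. */
typedef struct {
    unsigned char *base;   /* start of the pre-allocated block */
    size_t         size;   /* total bytes in the block */
    size_t         used;   /* bytes handed out so far */
} toy_pool_t;

void toy_pool_init(toy_pool_t *pool, void *block, size_t size)
{
    assert(pool != NULL && block != NULL);
    pool->base = block;
    pool->size = size;
    pool->used = 0;
}

/* Hand out `len` bytes whose offset from the block start is a multiple of
 * `align` (a power of two), or NULL when the block is exhausted.
 * There is deliberately no free(): this pool supports allocation only. */
void *toy_pool_alloc(toy_pool_t *pool, size_t len, size_t align)
{
    size_t start;

    assert(align != 0 && (align & (align - 1)) == 0);
    start = (pool->used + align - 1) & ~(align - 1);
    if (start > pool->size || len > pool->size - start) {
        return NULL;
    }
    pool->used = start + len;
    return pool->base + start;
}
```

A free list could call such an allocator whenever it needs to grow, which matches the "pre-allocated block, allocation only" use case in the commit message.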
Ralph Castain
1e70ea73f4 Update to latest PMIx master (v4.0)
Update the OPAL glue configure code to correctly link the opal/pmix4
component to the hwloc used by OMPI instead of defaulting to the
system-level hwloc. Required a corresponding update to the PMIx hwloc
configure code so we treat hwloc the same way we handle libevent in
embedded scenarios.

Signed-off-by: Ralph Castain <rhc@pmix.org>
2019-01-10 09:37:56 -08:00
Nathan Hjelm
c7867ffc4f opal/runtime: fix teardown ordering
This commit fixes the ordering of the teardown for
opal_finalize_util. The `installdirs` and `if` frameworks need to come
down before the MCA system.

Fixes #6259

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2019-01-09 12:38:12 -07:00
Gilles Gouaillardet
950ba16aa1 pmix/ext3x: fix support for external PMIx v3.1
The PMIX_MODEX and PMIX_INFO_ARRAY macros were removed from the PMIx 3.1 standard.
Open MPI does not really need them (they are only used to be reported as not supported),
so simply #ifdef protect them to support an external PMIx v3.1

Refs. open-mpi/ompi#6247

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2019-01-08 10:11:05 +09:00
Nathan Hjelm
edaf08bf6d btl/vader: minor correction to match ompi coding style
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2019-01-07 14:24:58 -07:00
Nathan Hjelm
30b8336cb4 btl/vader: don't try to set reachability in add_procs if not requested
This commit fixes a bug where add_procs can incorrectly return an
error when going through the dynamic add_procs path. This doesn't
happen normally, only when pml/ob1 is not in use.

References #6201

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2019-01-07 13:00:48 -07:00
Gilles Gouaillardet
78fffa25f2 mpool/hugepage: plug a memory leak
Refs. open-mpi/ompi#6242

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2019-01-07 11:50:40 +09:00
Howard Pritchard
ee26ed9e8f
Merge pull request #6214 from heasterday/topic/fixmpool
Update mpool_hugepage_component.c
2019-01-02 16:24:39 -07:00
Gilles Gouaillardet
0203531695 pmix/pmi4x: refresh to latest PMIx
refresh to pmix/pmix@fae0ee7d94

Refs. open-mpi/ompi#6228

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2018-12-28 12:57:57 +09:00
Ralph Castain
529e17e0d9
Merge pull request #6223 from ggouaillardet/topic/pmix_refresh
pmix/pmi4x: refresh to latest PMIx
2018-12-24 10:51:00 -08:00
bosilca
182a2db2a4
Merge pull request #6029 from ggouaillardet/topic/large_datatypes
opal/datatype: correctly handle large datatypes
2018-12-24 12:49:52 -05:00
Jeff Squyres
a30864c634 opal: fix compiler warning
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-12-23 12:10:23 -08:00
Hunter Easterday
509380d99f Fix white extra whitespace
Signed-off-by: Hunter Easterday <heasterday@lanl.gov>
2018-12-20 14:32:10 -07:00
heasterday
ad0d2c451e Update mpool_hugepage_component.c
Signed-off-by: Hunter Easterday <heasterday@lanl.gov>
2018-12-20 14:28:19 -07:00
bosilca
f96305f96b
Merge pull request #6199 from aravindksg/opal-threads-fix
opal/sync: Fix assert during multi-threaded progress invocation
2018-12-20 13:27:07 -05:00
Nathan Hjelm
cf49957af6
Merge pull request #6207 from hjelmn/opal_cleanup
opal: fix common symbols introduced in opal cleanup
2018-12-19 15:39:39 -07:00
Jeff Squyres
9b88e60fc8 info_get: ensure to copy all requested characters
When querying an info value, copy out exactly as many characters as
the caller asked for -- do not artificially truncate the target just
to ensure that it is \0-terminated.

Specifically: do not use opal_string_copy() to copy info values,
because opal_string_copy() will guarantee to \0-terminate the target,
even if it means truncating the target.  E.g., if the caller calls
opal_info_get_nolock() with valuelen=5, opal_string_copy() will return
"1234\0" -- which is wrong.  This commit fixes the behavior to return
"12345".

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-12-19 13:01:45 -08:00
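The difference between the two copy behaviors discussed in 9b88e60fc8 can be shown with a small sketch. Both helpers are hypothetical and do not reproduce `opal_string_copy()` or the info code; they only contrast "always terminate inside the given length" with "copy exactly the requested number of characters".

```c
#include <assert.h>
#include <string.h>

/* Truncating copy: always writes a terminating '\0' within dest_len bytes,
 * so at most dest_len - 1 characters of src survive. */
void toy_copy_terminated(char *dest, const char *src, size_t dest_len)
{
    size_t n;

    assert(dest != NULL && src != NULL && dest_len > 0);
    n = strlen(src);
    if (n > dest_len - 1) {
        n = dest_len - 1;
    }
    memcpy(dest, src, n);
    dest[n] = '\0';
}

/* Exact copy: writes up to `len` characters of src and adds a '\0' only if
 * src is shorter than `len`. The caller owns termination beyond `len`. */
void toy_copy_exact(char *dest, const char *src, size_t len)
{
    size_t n = 0;

    assert(dest != NULL && src != NULL);
    while (n < len && src[n] != '\0') {
        dest[n] = src[n];
        ++n;
    }
    if (n < len) {
        dest[n] = '\0';
    }
}
```

With a length of 5 and a longer source string, the first helper yields four characters plus the terminator, while the second yields all five requested characters.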
Nathan Hjelm
fd852c8f63 opal: fix common symbols introduced in opal cleanup
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-12-19 12:27:14 -07:00
Nathan Hjelm
4944508603
Merge pull request #6136 from hjelmn/opal_cleanup
opal: clean up init/finalize
2018-12-18 15:23:32 -07:00
Nathan Hjelm
0edfd328f8 opal: clean up init/finalize
This commit contains the following changes:

 - Remove the unused opal_test_init/opal_test_finalize
   functions. These functions are not used by anything in the code
   base or MTT. Tests use opal_init_util/opal_finalize_util instead.

 - Get rid of gotos in opal_init_util and opal_init. Replaced them
   with a cleaner solution.

 - Automatically register cleanup functions in init functions. The
   cleanup functions are executed in the reverse order of the
   initialization functions. The cleanup functions are run in
   opal_finalize_util() before tearing down the class system.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-12-18 14:37:04 -07:00
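The third item in 0edfd328f8 (register cleanup functions during init, run them in reverse order at finalize) can be sketched as a small cleanup stack. All names here are hypothetical; this is not the OPAL implementation, just one way to express the pattern.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_CLEANUPS 16

typedef void (*toy_cleanup_fn_t)(void);

static toy_cleanup_fn_t toy_cleanups[TOY_MAX_CLEANUPS];
static size_t toy_num_cleanups = 0;

/* Trace of the order in which cleanups ran, for demonstration only. */
static char toy_trace[TOY_MAX_CLEANUPS + 1];
static size_t toy_trace_len = 0;

/* Called by each init step once it has succeeded. */
void toy_register_cleanup(toy_cleanup_fn_t fn)
{
    assert(fn != NULL && toy_num_cleanups < TOY_MAX_CLEANUPS);
    toy_cleanups[toy_num_cleanups++] = fn;
}

static void toy_cleanup_a(void) { toy_trace[toy_trace_len++] = 'a'; }
static void toy_cleanup_b(void) { toy_trace[toy_trace_len++] = 'b'; }

/* Two hypothetical init steps; each registers its own cleanup. */
void toy_init_a(void) { toy_register_cleanup(toy_cleanup_a); }
void toy_init_b(void) { toy_register_cleanup(toy_cleanup_b); }

/* Run the registered cleanups last-in, first-out, then leave the stack empty. */
void toy_finalize(void)
{
    while (toy_num_cleanups > 0) {
        toy_cleanups[--toy_num_cleanups]();
    }
}
```

Because each init step registers its own cleanup, a failure halfway through initialization can call the finalize routine and unwind only what was actually set up, which is one way to avoid `goto`-based error paths.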
Gilles Gouaillardet
b0a668457c pmix/pmi4x: refresh to latest PMIx
refresh to pmix/pmix@2d4c2874fd

Refs. open-mpi/ompi#6222

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2018-12-18 17:11:01 +09:00
Ralph Castain
ece7696c2c Fix external PMIx v3 support
Don't erase the source files during "make clean"!

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-12-17 15:54:56 -08:00
Aravind Gopalakrishnan
2f3c5218ff opal/sync: Fix assert during multi-threaded progress invocation
PR #5241 provided an MCA variable to allow multi-threaded opal_progress.
However, it allowed the linked list to be updated even when multiple threads were
allowed to call opal_progress. This caused a scenario in which a more recent thread
could complete its progress and fail the assert(sync ==
wait_sync_list).

Allowing the linked list to be updated only when the number of threads
exceeds the threshold fixes the problem.

Signed-off-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@intel.com>
2018-12-17 12:14:27 -08:00
Gilles Gouaillardet
b89deeb1bb btl/uct: fix a typo in configure.m4
remove whitespace around '=' when setting btl_uct_LIBS

Thanks Ake Sandgren for reporting this

Refs. open-mpi/ompi#6173

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2018-12-11 14:23:53 +09:00
Gilles Gouaillardet
78aa6fdd1d btl/uct: fix a warning
Use the PRIsize_t macro to correctly print a size_t

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2018-12-07 16:16:35 +09:00
George Bosilca
88a693bf71 Add a test for very large data.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
2018-12-06 13:30:58 +09:00
Gilles Gouaillardet
fbb5bb8860 opal/datatype: correctly handle large datatypes
Always use size_t (instead of converting to an uint32_t) in order to
correctly support large datatypes.

Thanks Ben Menadue for the initial bug report

Refs open-mpi/ompi#6016

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2018-12-06 13:30:58 +09:00
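Why fbb5bb8860 keeps sizes in `size_t` can be seen from a short sketch. The helper names are hypothetical, and the example assumes a platform with a 64-bit `size_t`; it only demonstrates the truncation that a 32-bit intermediate introduces.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Total bytes described by `count` elements of `extent` bytes each,
 * kept in size_t from end to end. */
size_t toy_total_bytes(size_t count, size_t extent)
{
    assert(extent == 0 || count <= SIZE_MAX / extent);   /* no size_t overflow */
    return count * extent;
}

/* The same quantity squeezed through a 32-bit value: anything at or above
 * 4 GiB silently wraps modulo 2^32. */
uint32_t toy_total_bytes_narrow(size_t count, size_t extent)
{
    return (uint32_t) toy_total_bytes(count, extent);
}
```

For example, 2^30 + 1 elements of 4 bytes describe 4 GiB + 4 bytes, but the narrowed result is only 4.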
Nathan Hjelm
e07a64c52d btl/uct: fix some issues when using UCX over ugni
Though not a recommended configuration it is possible to use Open MPI
over UCX over uGNI. This configuration had some issues related to the
connection management and tl selection. This commit fixes those
issues.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-12-05 16:30:54 -07:00
Gilles Gouaillardet
fccb3e7514
Merge pull request #6137 from ggouaillardet/topic/ucx_warning
btl/openib: delay UCX warning to add_procs()
2018-12-05 11:50:10 +09:00
Yossi Itigin
83cca9d52a ucx: add owner.txt for components
Signed-off-by: Yossi Itigin <yosefe@mellanox.com>
2018-12-01 17:14:03 +02:00
Gilles Gouaillardet
0a2ce58040 btl/openib: delay UCX warning to add_procs()
If UCX is available, then pml/ucx will be used instead of
pml/ob1 + btl/openib, so there is no need to warn about
btl/openib not supporting Infiniband.

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2018-11-29 16:20:55 +09:00
Ralph Castain
f609542bbf Implement process set name support
Add the --pset option for app_contexts so the user can provide a string
name for each app_context. Use the new PMIx pset attribute to store the
names in the PMIx local storage for retrieval

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-11-27 11:07:58 -08:00
Josh Hursey
3339e2aa33
Merge pull request #5986 from jjhursey/vpid-unpack
Add OPAL_VPID to unpacking
2018-11-21 11:42:02 -06:00
KAWASHIMA Takahiro
cacd6f389c datatype: Remove #if HAVE_[TYPE] for C99 types
Now Open MPI requires a C99 compiler. Checking availability of
the following types is no longer needed.

- `long long` (`signed` and `unsigned`)
- `long double`
- `float _Complex`
- `double _Complex`
- `long double _Complex`

Furthermore, the `#if HAVE_[TYPE]` style checking is not correct.
Availability of C types is checked by `AC_CHECK_TYPES` in `configure.ac`.
`AC_CHECK_TYPES` defines macro `HAVE_[TYPE]` as `1` in `opal_config.h`
if the `[TYPE]` is available. But it does not define `HAVE_[TYPE]`
(instead of defining as `0`) if it is not available. So even if we
need `HAVE_[TYPE]` checking, it should be `#if defined(HAVE_[TYPE])`.

I didn't remove `AC_CHECK_TYPES` for these types in `configure.ac`
since someone may use `HAVE_[TYPE]` macros somewhere.

Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
2018-11-14 09:32:52 +09:00
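The `#if defined(HAVE_[TYPE])` form recommended in cacd6f389c can be sketched as follows. The two macro names are hypothetical stand-ins for what a generated config header might contain. The `defined()` form states the intent explicitly and does not rely on an undefined identifier evaluating to 0 inside `#if`.

```c
/* Stand-ins for a generated config header. A type that was found gets
 * "#define ... 1"; a type that was not found gets no definition at all. */
#define HAVE_TOY_PRESENT_TYPE 1
/* HAVE_TOY_MISSING_TYPE is intentionally left undefined. */

/* Returns 1 because the macro is defined. */
int toy_has_present_type(void)
{
#if defined(HAVE_TOY_PRESENT_TYPE)
    return 1;
#else
    return 0;
#endif
}

/* Returns 0 because the macro was never defined. */
int toy_has_missing_type(void)
{
#if defined(HAVE_TOY_MISSING_TYPE)
    return 1;
#else
    return 0;
#endif
}
```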
Thananon Patinyasakdikul
3dc1629771
Merge pull request #5241 from thananon/opal_progress
Add MCA param for multithread opal_progress().
2018-11-09 12:30:07 -05:00
Gilles Gouaillardet
efe72d3d92
Merge pull request #6055 from ggouaillardet/topic/c11_atomics
Misc C11 atomics related fixes
2018-11-08 09:11:42 +09:00
Gilles Gouaillardet
07970f192a atomic/c11: fix include header path
simply #include "opal_stdint.h" in order to work with --devel-headers

Refs open-mpi/ompi#6053

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2018-11-07 13:23:28 +09:00
Thananon Patinyasakdikul
d9bd54c628 btl/ofi: fixed compiler warning on OSX.
This commit closes #6049

Signed-off-by: Thananon Patinyasakdikul <tpatinya@utk.edu>
2018-11-06 15:37:25 -05:00
Nathan Hjelm
30119ee339 opal/asm: work around possible gcc compiler bug
It seems in some cases (gcc older than v6.0.0) the __atomic_thread_fence is a
no-op with __ATOMIC_ACQUIRE. This appears to be the case with X86_64 so go
ahead and use __ATOMIC_SEQ_CST for the x86_64 read memory barrier. This should
not cause any performance issues as it is equivalent to the memory barrier
in the hand-written atomics.

References #6014

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-11-05 11:45:42 -07:00
Thananon Patinyasakdikul
4e23fedccb opal progress: Added MCA for multithreaded opal_progress.
This commit added MCA param `opal_max_thread_in_progress` to set the
number of threads allowed to do opal_progress concurrently. The default
value is 1.

Components with a multithreaded design can benefit from this change to
parallelize their component progress function.

Signed-off-by: Thananon Patinyasakdikul <tpatinya@utk.edu>
2018-11-01 13:34:20 -04:00
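Capping the number of threads that run a progress loop concurrently, as 4e23fedccb describes, could be sketched with an atomic counter. The names and the counter-based approach are assumptions made for illustration; the commit message does not say how `opal_max_thread_in_progress` is enforced internally.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical state: how many threads are inside the progress loop now,
 * and how many are allowed at once (mirroring the stated default of 1). */
static atomic_int toy_threads_in_progress = 0;
static int toy_max_threads_in_progress = 1;

/* Try to enter the progress loop. Returns false if the cap is reached. */
bool toy_progress_try_enter(void)
{
    int before = atomic_fetch_add(&toy_threads_in_progress, 1);

    if (before >= toy_max_threads_in_progress) {
        /* Over the cap: undo the increment and back off. */
        atomic_fetch_sub(&toy_threads_in_progress, 1);
        return false;
    }
    return true;
}

/* Leave the progress loop, making room for another thread. */
void toy_progress_leave(void)
{
    atomic_fetch_sub(&toy_threads_in_progress, 1);
}
```

A caller would wrap its progress work between a successful `toy_progress_try_enter()` and `toy_progress_leave()`, and skip the work when entry is refused.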
Howard Pritchard
8126779a35 btl/openib: fix a problem with ib query
Under certain circumstances, ibv_exp_query_device was
returning an error due to uninitialized fields in the
extended attributes struct.

Fixes: #5810
Fixes: #5914

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2018-10-31 14:50:15 -06:00
Brian Barrett
a1e85b03aa btl tcp: Fix compile error in IPv6
In 457f058 I broke the TCP BTL with --enable-ipv6.  This patch
fixes the compile error, so IPv6 works again.

Fixed #5996

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2018-10-30 12:31:04 -07:00
Joshua Hursey
a557c4130c Add OPAL_VPID to unpacking
* Needed to properly read PMIx job data like the following
   - `OPAL_PMIX_LOCALLDR`
   - `OPAL_PMIX_RANK`
   - `OPAL_PMIX_GLOBAL_RANK`
   - `OPAL_PMIX_APPLDR`
   - `OPAL_PMIX_APP_RANK`

Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
2018-10-29 14:31:47 -05:00
Gilles Gouaillardet
b205039205 event/external: fix version requirement
Only default to the external component if its version is
greater than or equal to that of the internal libevent (2.0.22)

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2018-10-29 10:12:56 +09:00
Gilles Gouaillardet
35e77a286c event/external: misc configury fixes
 - Always use the external component when configure'd with --with-libevent=external
 - Fix the external libevent library version detection
   by testing _EVENT_NUMERIC_VERSION and EVENT__NUMERIC_VERSION macros
 - Use the event2/event.h header (event.h is deprecated since libevent 2.0)

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2018-10-29 10:12:56 +09:00
Aurelien Bouteiller
f2e6d7891e
Merge pull request #5975 from ICLDisco/export/pmix-fini-threadinterlock
Avoid a double lock interlock when calling pmix_finalize
2018-10-26 10:01:23 -04:00
Gilles Gouaillardet
b715dd2657 btl/uct: fix AC_CHECK_DECLS usage
AC_CHECK_DECLS takes a comma separated list of macros/symbols,
so replace the whitespace separator with a comma.

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2018-10-26 15:36:02 +09:00
Aurelien Bouteiller
50cf707f3f
Avoid a double lock interlock when calling pmix_finalize
Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>
2018-10-25 16:36:58 -04:00
Nathan Hjelm
7b34c9b3fe
Merge pull request #5958 from hjelmn/btl_uct_sync
btl/uct: update for UCT_CB_FLAG_SYNC removal
2018-10-22 23:17:27 -06:00
Yossi Itigin
4c442f2601
Merge pull request #5934 from hoopoepg/topic/suppressed-cov-warn-added-log-msg
COMMON/UCX: suppressed coverity warnings
2018-10-22 11:00:47 +03:00
Nathan Hjelm
1b37328ba8 btl/uct: update for UCT_CB_FLAG_SYNC removal
Signed-off-by: Nathan Hjelm <hjelmn@me.com>
2018-10-21 18:57:42 -06:00
bosilca
690024917d
Merge pull request #5938 from bwbarrett/bugfix/tcp-multi-address
Always use same IP address for module and modex
2018-10-21 09:19:43 -07:00
Sergey Oblomov
1099d5f023 COMMON/UCX: added error code to log output
Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
2018-10-21 11:37:25 +03:00
George Bosilca
dc972f0b92
Fix the PML monitoring.
The monitoring PML hides its existence from the OMPI infrastructure by
removing itself from the list of PML loaded components, remaining hidden
until MPI_Finalize.

Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
2018-10-18 00:29:23 -04:00
Brian Barrett
457f058e73 btl tcp: Simplify modex address selection
Simplify selection of the address to publish for a given BTL TCP
module in the module exchange code.  Rather than looping through
all IP addresses associated with a node, looking for one that
matches the kindex of a module, loop over the modules and
use the address stored in the module structure.  This also
happens to be the address that the source will use to bind()
in a connect() call, so this should eliminate any confusion
(read: bugs) when an interface has multiple IPs associated with
it.

Refs #5818

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2018-10-18 02:23:21 +00:00
Howard Pritchard
7730db9982
Merge pull request #5937 from hppritcha/topic/remove_crs_components
remove some dead crs components
2018-10-17 16:06:36 -06:00
Jeff Squyres
c2608fb597
Merge pull request #5939 from jsquyres/pr/coverity-cid-1440332
opal/os_path: fix minor string overrun
2018-10-17 14:24:42 -07:00
Howard Pritchard
6564d3d217 remove some dead crs components
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2018-10-17 10:29:00 -06:00
Brian Barrett
4f19221af2 btl tcp: Simplify module address storage
Today, a btl tcp module is associated with exactly one IP
address (IPv4 or IPv6).  There's no need to reserve space
for both an IPv4 and IPv6 address in the module structure,
since the module will only be associated with one or the
other.

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2018-10-17 16:21:17 +00:00
Jeff Squyres
290d1c0534 opal/os_path: fix minor string overrun
This is a follow on to 54ca3310ea to fix
a minor bug identified by Coverity.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-10-17 08:57:30 -07:00
Sergey Oblomov
df765595e3 COMMON/UCX: suppressed coverity warnings
- suppressed coverity warnings
- added log messages on failed calls

Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
2018-10-17 16:11:03 +03:00
Brian Barrett
2acc4b7e7f btl tcp: Add workaround for "dropped connection" issue
Work around a race condition in the TCP BTL's proc setup code.
The Cisco MTT results have been failing on TCP tests due to a
"dropped connection" message some percentage of the time.
Some digging shows that the issue happens in a combination of
multiple NICs and multiple threads.  The race is detailed in
https://github.com/open-mpi/ompi/issues/3035#issuecomment-429500032.

This patch doesn't fix the race, but avoids it by forcing
the MPI layer to complete all calls to add_procs across the
entire job before any process leaves MPI_INIT.  It also
reduces the scalability of the TCP BTL by increasing start-up
time, but better than hanging.

The long term fix is to do all endpoint setup in the first
call to add_procs for a given remote proc, removing the
race.  This patch is a workaround until that patch can
be developed.

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2018-10-16 18:33:30 -07:00
Nathan Hjelm
37a3f32e47
Merge pull request #5913 from hjelmn/uct_btl_fixes
btl/uct: fix deadlock in connection code
2018-10-16 19:11:54 -06:00
Nathan Hjelm
707d35deeb btl/uct: fix deadlock in connection code
This commit fixes a deadlock that can occur when using a TL that
supports the connect to endpoint model. The deadlock was occurring
while processing incoming connection requests. This was done from
an active-message callback. For some unknown reason (at this time)
this callback was sometimes hanging. To avoid the issue the connection
active-message is saved for later processing.

At the same time I cleaned up the connection code to eliminate
duplicate messages when possible.

This commit also fixes some bugs in the active-message send path:

 - Correctly set all fragment fields in prepare_src.

 - Fix bug when using buffered-send. We were not reading the return
   code correctly (which is in bytes). This resulted in a message
   getting sent multiple times.

 - Don't try to progress sends from the btl_send function when in an
   active-message callback. It could lead to deep recursion and an
   eventual crash if we get a trace like
   send->progress->am_complete->ob1_callback->send->am_complete...

Closes #5820
Closes #5821

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-10-16 18:28:47 -06:00
Nathan Hjelm
1c001319d4
Merge pull request #5922 from hjelmn/opal_free_list_really_old_race_we_should_really_fix_now
opal/free_list: fix race condition
2018-10-16 15:27:05 -06:00
Brian Barrett
da1189d771
Merge pull request #5916 from bwbarrett/revert/6acebc4
Revert "Handle error cases in TCP BTL"
2018-10-16 13:54:18 -07:00
Nathan Hjelm
5c770a7bec opal/free_list: fix race condition
There was a race condition in opal_free_list_get. Code throughout the
Open MPI codebase was assuming that a NULL return from this function
was due to an out-of-memory condition. In some cases this can lead to
a fatal condition (MPI_Irecv and MPI_Isend in pml/ob1 for
example). Before this commit opal_free_list_get_mt looked like this:

```c
static inline opal_free_list_item_t *opal_free_list_get_mt (opal_free_list_t *flist)
{
    opal_free_list_item_t *item =
        (opal_free_list_item_t*) opal_lifo_pop_atomic (&flist->super);

    if (OPAL_UNLIKELY(NULL == item)) {
        opal_mutex_lock (&flist->fl_lock);
        opal_free_list_grow_st (flist, flist->fl_num_per_alloc);
        opal_mutex_unlock (&flist->fl_lock);
        item = (opal_free_list_item_t *) opal_lifo_pop_atomic (&flist->super);
    }

    return item;
}
```

The problem is that in a multithreaded environment it *is* possible for the
free list to be grown successfully but the thread calling
opal_free_list_get_mt to be left without an item. This happens if
between the calls to opal_lifo_push_atomic in opal_free_list_grow_st
and the call to opal_lifo_pop_atomic other threads pop all the items
added to the free list.

This commit fixes the issue by ensuring the thread that successfully
grew the free list **always** gets a free list item.

Fixes #2921

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-10-16 13:17:09 -06:00
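One way to guarantee that the growing thread in 5c770a7bec always ends up with an item is to hand it one of the new items directly instead of pushing every new item onto the shared list. The sketch below is a toy model with hypothetical names and a simple spin lock; it is not the actual `opal_free_list_get_mt` fix, whose details are not shown in the commit message.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

typedef struct toy_item {
    struct toy_item *next;
} toy_item_t;

typedef struct {
    toy_item_t *head;            /* protected by `lock` in this toy model */
    atomic_flag lock;
    size_t      num_per_alloc;   /* items created per grow */
} toy_free_list_t;

static void toy_lock(toy_free_list_t *fl)
{
    while (atomic_flag_test_and_set_explicit(&fl->lock, memory_order_acquire)) {
        /* spin until the flag is released */
    }
}

static void toy_unlock(toy_free_list_t *fl)
{
    atomic_flag_clear_explicit(&fl->lock, memory_order_release);
}

void toy_free_list_init(toy_free_list_t *fl, size_t num_per_alloc)
{
    assert(fl != NULL);
    fl->head = NULL;
    atomic_flag_clear(&fl->lock);
    fl->num_per_alloc = num_per_alloc;
}

/* Pop one item from the shared list, or NULL if it is empty. */
toy_item_t *toy_pop(toy_free_list_t *fl)
{
    toy_item_t *item;

    toy_lock(fl);
    item = fl->head;
    if (item != NULL) {
        fl->head = item->next;
    }
    toy_unlock(fl);
    return item;
}

/* Grow the list, but keep the first new item private to the caller so that
 * no other thread can pop it before the caller does. */
static toy_item_t *toy_grow_and_take(toy_free_list_t *fl)
{
    toy_item_t *mine = malloc(sizeof(*mine));

    if (mine == NULL) {
        return NULL;   /* a NULL here really does mean out of memory */
    }
    mine->next = NULL;
    for (size_t i = 1; i < fl->num_per_alloc; ++i) {
        toy_item_t *extra = malloc(sizeof(*extra));
        if (extra == NULL) {
            break;
        }
        toy_lock(fl);
        extra->next = fl->head;
        fl->head = extra;
        toy_unlock(fl);
    }
    return mine;
}

toy_item_t *toy_free_list_get(toy_free_list_t *fl)
{
    toy_item_t *item = toy_pop(fl);

    if (item == NULL) {
        item = toy_grow_and_take(fl);
    }
    return item;
}
```

In this model a NULL return from `toy_free_list_get` can only come from a failed allocation, which restores the assumption that callers throughout the code base were already making.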
Nathan Hjelm
9ea5dfa799 class/opal_fifo: fix warning
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-10-15 19:18:31 -06:00
Jeff Squyres
cef6cf0ac5 opal: update some string handling
Make various string handling locations a little more robust.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-10-14 16:04:28 -07:00
Jeff Squyres
e6de42c379 opal/util/info.c: fix compiler warning
Remove unused variable.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-10-14 16:03:28 -07:00
Brian Barrett
5162011428 Revert "Handle error cases in TCP BTL"
This reverts commit 6acebc40a1.

This patch is causing numerous "Socket closed" messages which are
causing most of the failures on Cisco's MTT run.  See
https://github.com/open-mpi/ompi/issues/5849 for more information.

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2018-10-12 15:01:54 -07:00
Brian Barrett
8feff49f6c pmix: Fix filename typo
Looks like a filename was missed when pmix sucked in the installdirs
framework.  Fixing the typo fixes "make ctags" and "make cscope".

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2018-10-12 17:05:01 +00:00
Nathan Hjelm
6ed68da870 btl/uct: use the correct tl interface attributes
It is apparently possible for different instances of the same UCT
transport to have different limits (max short put for example). To
account for this we need to store the attributes per TL context not
per TL. This commit fixes the issue.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-10-11 11:33:17 -06:00
Brian Barrett
902b57919c btl tcp: Print number of endpoints on error
While trying to debug #3035, it's not clear whether there is
an issue with the modex data or printing the address list.
Print the number of endpoints on the error, which will help
determine which case is happening to Cisco.

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2018-10-10 08:29:48 -07:00
Brian Barrett
bd07cc6707 btl tcp: Improve initialization debugging
When creating TCP BTL modules, print more information about the
module's ethernet association, including the first address associated
with the device, as debug output.

Fix a flipped output string for IPv4 and IPv6 addresses in the
modex send code.

Add the addresses being published in the modex to the debugging
output in modex send, to help match failures in endpoint match.

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2018-10-10 08:29:48 -07:00
Gilles Gouaillardet
ca9009e3d0 pmix/ext3x: fix minor typos
refer to PMIx 3x instead of the previous 2x

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2018-10-10 13:18:36 +09:00
Gilles Gouaillardet
b3bd785ce0 pmix/ext3x: add missing header files into the dist tarball
Refs. open-mpi/ompi#5871

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2018-10-10 13:17:54 +09:00
Gilles Gouaillardet
67402f5f19 pmix/ext2x: add missing header files into the dist tarball
Refs. open-mpi/ompi#5871

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2018-10-10 13:17:42 +09:00
Nathan Hjelm
39be6ec15c btl/uct: bug fixes and general improvements
This commit updates the uct btl to change the transports parameter
into a priority list. It also adds the dc_mlx5, rc_mlx5, and ud transports
to the priority list. This will give better out of the box performance for
multi-threaded codes because the *_mlx5 transports can avoid the mlx5
lock inside libmlx5_rdmav2.

This commit also fixes a number of leaks and a possible deadlock when
using RDMA.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-10-09 15:15:45 -06:00
Brian Barrett
e9e4d2a4bc Handle asprintf errors with opal_asprintf wrapper
The Open MPI code base assumed that asprintf always behaved like
the FreeBSD variant, where ptr is set to NULL on error.  However,
the C standard (and Linux) only guarantee that the return code will
be -1 on error and leave ptr undefined.  Rather than fix all the
usage in the code, we use opal_asprintf() wrapper instead, which
guarantees the BSD-like behavior of ptr always being set to NULL.
In addition to being correct, this will fix many, many warnings
in the Open MPI code base.

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2018-10-08 16:43:53 -07:00
Brian Barrett
c087365ead opal/util: Always support BSD behavior of asprintf
Open MPI's developers like to assume that asprintf() always sets
the ptr to NULL on error, but the standard (and Linux glibc) do
not guarantee this.  As a result, we're making opal_asprintf()
always available for developers, which will guarantee that
ptr is set to NULL on error.

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2018-10-08 16:43:53 -07:00
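A wrapper with the BSD-like guarantee described in c087365ead and e9e4d2a4bc can be sketched portably on top of `vsnprintf`. The function name `toy_asprintf` is hypothetical and this is not the `opal_asprintf()` source; the point is only that `*ptr` is set to NULL on every error path.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Format into a freshly allocated string. On success, returns the string
 * length and stores the buffer in *ptr. On any error, returns -1 and
 * guarantees that *ptr is NULL. */
int toy_asprintf(char **ptr, const char *fmt, ...)
{
    va_list ap, ap2;
    char *buf;
    int len;

    assert(ptr != NULL && fmt != NULL);
    *ptr = NULL;                      /* defined value on every error path */

    va_start(ap, fmt);
    va_copy(ap2, ap);
    len = vsnprintf(NULL, 0, fmt, ap);   /* first pass: measure */
    va_end(ap);
    if (len < 0) {
        va_end(ap2);
        return -1;
    }

    buf = malloc((size_t) len + 1);
    if (buf == NULL) {
        va_end(ap2);
        return -1;
    }

    len = vsnprintf(buf, (size_t) len + 1, fmt, ap2);   /* second pass: write */
    va_end(ap2);
    if (len < 0) {
        free(buf);
        return -1;
    }

    *ptr = buf;
    return len;
}
```

Callers can then test either the return value or the pointer, and `free(*ptr)` is safe even after a failure.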
Nathan Hjelm
8291f6722d btl/vader: fix race condition in writing header
Signed-off-by: Nathan Hjelm <hjelmn@me.com>
2018-10-05 16:30:06 -06:00
Ralph Castain
76c11a1496 Remove build product - fix autogen.pl mode
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-10-05 13:38:39 -07:00
Ralph Castain
c498a7e77a Protect PMIx from bad configure entry
Ignore with-hwloc=internal or external as those are meaningless to pmix
(will upstream)

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-10-05 07:07:05 -07:00
Ralph Castain
f379ba9c8e Fail configure if pmix won't build
If we are using the internal PMIx component and the embedded library fails to configure, then fail - don't silently fail to build and then fail in execution

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-10-05 06:29:36 -07:00
Geoff Paulsen
56a55cdb4b
Merge pull request #5786 from jsquyres/pr/string-madness
Replace strcpy() and strncpy() with (new) opal_string_copy()
2018-10-04 16:12:46 -05:00
Ralph Castain
86702b71bc Correctly notify upon process failure
We only need to pass a custom range if the target is a single process.
Otherwise, we let the range be "session".

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-10-03 20:19:42 -07:00
Nathan Hjelm
dfa8d3a81a btl/vader: work around Oracle compiler bug
This commit works around an Oracle C compiler bug in 5.15 (not sure
when it was introduced). The bug is triggered when we chain
assignments of atomic variables. Ex:

_Atomic intptr_t x, y;
intptr_t z = 0;

x = y = z;

Will produce a compiler error of the form:

operand cannot have void type: op "="
assignment type mismatch:
	long "=" void

To work around the issue we are removing the chain assignment and
setting the head and tail on different lines.
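
As a rough illustration of the workaround, assuming a hypothetical structure with atomic head and tail members (the real vader data structures are not shown in this log):

```c
#include <stdint.h>

/* Hypothetical stand-in for a structure with atomic head/tail members. */
struct sketch_ring {
    _Atomic intptr_t head;
    _Atomic intptr_t tail;
};

void sketch_ring_reset(struct sketch_ring *ring, intptr_t value)
{
    /* Avoided form: ring->head = ring->tail = value;
     * Each atomic member is assigned in its own statement instead. */
    ring->tail = value;
    ring->head = value;
}
```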

Fixes #5814

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-10-03 08:55:51 -06:00
Nathan Hjelm
66a7dc4c72 btl/vader: ensure the fast box tag is always read first
On some platforms reading a 64-bit value is non-atomic and it is
possible that the two 32-bit values are read in the wrong order. To
ensure the tag is always read first this commit reads the tag before
reading the full 64-bit value.
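
A sketch of the read ordering, under the assumption of a hypothetical 64-bit header made of a 32-bit tag and a 32-bit size (the actual fast box header layout may differ). The tag is read on its own first; the full value is read only once the tag is seen to be set:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical header layout, for illustration only. */
typedef union {
    struct {
        uint32_t tag;   /* 0 means "no message yet" */
        uint32_t size;
    } data;
    uint64_t ival;      /* the same header viewed as one 64-bit value */
} sketch_fbox_hdr_t;

/* Returns 1 and fills *out if a message is present, 0 otherwise. */
int sketch_fbox_read(const volatile sketch_fbox_hdr_t *src,
                     sketch_fbox_hdr_t *out)
{
    uint32_t tag = src->data.tag;       /* read the tag first */

    if (0 == tag) {
        return 0;
    }

    /* keep the full read from being reordered before the tag read */
    atomic_thread_fence(memory_order_acquire);
    out->ival = src->ival;              /* then read the whole header */
    return 1;
}
```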

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-10-03 07:17:32 -06:00
Ralph Castain
31f6c75498
Merge pull request #5819 from bosilca/fix/local_bind
Fix/local bind
2018-10-02 11:27:36 -07:00
Brian Barrett
a25df3f29e opal: Remove outdated MacOS workaround
Remove the pack/unpack pragma around net/if.h on MacOS, which
was added to fix a bug in MacOS X 10.4.x on 64-bit platforms.
The bug was fixed in Mac OS X 10.5.0 and, sometime in the last
11 years, compilers started emitting warnings about the fact
that the Apple header stomped over the pragma pack settings
from the workaround.  We already don't support versions of MacOS
earlier than 10.5, so there's no point in keeping the workaround.

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2018-10-02 13:35:46 -04:00
Brian Barrett
19e16d5fd0 opal: Disable memory patcher component on MacOS
Open MPI doesn't support any transports on MacOS which require
memory manager hooks.  The memory patcher component uses the
syscall interface, which has been deprecated in recent versions
of MacOS.  Since we don't need it and it emits warnings about
deprecation, disable the memory patcher component on MacOS.

Fixes #5671

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2018-10-02 13:35:15 -04:00
George Bosilca
a3a492b42c
Small pedantic fixes.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
2018-10-02 12:08:18 -04:00
George Bosilca
9164e26e2f
Provide the correct socklen to bind.
Get Brian's patch from #5825 and his log message:
Fix a failure in binding the initiating side of a connection
on MacOS. MacOS doesn't like passing the size of the storage
structure (sockaddr_storage) instead of the expected size of
the structure (sockaddr_in or sockaddr_in6), which was causing
bind() failures. This patch simply changes the structure size
to the expected size.

Add a more clear error message in debug mode.
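
The socklen fix above can be sketched as follows. The helper names are hypothetical and this is not the actual patch; it only shows choosing the length from the address family rather than passing sizeof(struct sockaddr_storage):

```c
#include <netinet/in.h>
#include <sys/socket.h>

/* Hypothetical helper: length bind() expects for the stored address family. */
socklen_t sketch_sockaddr_len(const struct sockaddr_storage *ss)
{
    switch (ss->ss_family) {
    case AF_INET:
        return (socklen_t) sizeof(struct sockaddr_in);
    case AF_INET6:
        return (socklen_t) sizeof(struct sockaddr_in6);
    default:
        /* unknown family: fall back to the storage size */
        return (socklen_t) sizeof(*ss);
    }
}

/* Bind using the family-specific length instead of the storage size. */
int sketch_bind(int fd, const struct sockaddr_storage *ss)
{
    return bind(fd, (const struct sockaddr *) ss, sketch_sockaddr_len(ss));
}
```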

Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
2018-10-02 12:06:40 -04:00
Ralph Castain
fcc1d30ab3
Merge pull request #5775 from karasevb/check_old_topo_key
pmix: check the old topo key to keep compatibility with old RMs
2018-10-02 08:12:42 -07:00
Jeff Squyres
5dae086f7e btl/tcp: output the IP address correctly
Per
https://github.com/open-mpi/ompi/issues/3035#issuecomment-426085673,
it looks like the IP address for a given interface is being stashed in
two places: on the endpoint and on the module.

1. On the endpoint, it is storing the moral equivalent of a
   (struct sockaddr_in.sin_addr).
2. On the module, it is storing a full (struct sockaddr_storage).

The call to opal_net_get_hostname() expects a full (struct sockaddr*)
-- not just the stripped-down (struct sockaddr_in.sin_addr).  Hence,
when the original code was passing in the endpoint's (struct
sockaddr_in.sin_addr) and opal_net_get_hostname() was treating it
like a (struct sockaddr), hilarity ensued (i.e., we got the wrong
output).

This commit eliminates the call to opal_net_get_hostname() and just
calls inet_ntop() directly to convert the (struct
sockaddr_in.sin_addr) to a string.
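
For illustration, converting a bare (struct in_addr) to text with inet_ntop() looks roughly like this; the wrapper name is hypothetical and the real call site in the TCP BTL is not reproduced here:

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Hypothetical wrapper: format an IPv4 address that is stored as a bare
 * struct in_addr (i.e., sockaddr_in.sin_addr), not as a full sockaddr.
 * Returns buf on success or NULL on error, like inet_ntop() itself. */
const char *sketch_addr_to_string(const struct in_addr *addr,
                                  char *buf, socklen_t len)
{
    return inet_ntop(AF_INET, addr, buf, len);
}
```

A buffer of INET_ADDRSTRLEN bytes is large enough for any IPv4 result.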

NOTE: Per the github comment cited above, there can be a disparity
between the IP address cached on the endpoint vs. the IP address
cached on the module.  This only happens with interfaces that have
more than one IP address.  This commit does not fix that issue.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-10-01 16:12:57 -07:00
Jeff Squyres
293a938d29 string_copy: assert() fail if the copy is too long
Add a heuristic: if the string copy is "too long", fail an assert().
This is based on the premise that Open MPI doesn't do large string
copies.  So if we see a dest_len that is over a certain threshold
(currently set at 128K), this is likely a programmer error, and on
debug builds, we should fail an assert().  In production builds, it
will work just fine (assuming that it's not a programmer error).
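
A sketch of such a check, assuming hypothetical names and the 128K threshold mentioned above. Because assert() compiles away when NDEBUG is defined, production builds are unaffected:

```c
#include <assert.h>
#include <stddef.h>

/* Assumed threshold, taken from the 128K figure in the message above. */
#define SKETCH_MAX_COPY_LEN ((size_t) 128 * 1024)

/* Returns nonzero if dest_len looks like a plausible destination size. */
int sketch_copy_len_ok(size_t dest_len)
{
    return dest_len <= SKETCH_MAX_COPY_LEN;
}

/* Debug-only guard: aborts on implausibly large lengths unless NDEBUG
 * is defined, in which case the assert() expands to nothing. */
void sketch_check_copy_len(size_t dest_len)
{
    assert(sketch_copy_len_ok(dest_len));
    (void) dest_len;    /* avoid an unused-parameter warning with NDEBUG */
}
```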

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-10-01 13:34:15 -07:00
Ralph Castain
172e770154 Sync to PMIx master
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-09-28 12:43:32 -07:00
Ralph Castain
c836c4282c Update to PMIx master
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-09-27 20:57:23 -07:00
Jeff Squyres
28e51ad4d4 opal: convert from strncpy() -> opal_string_copy()
In many cases, this was a simple string replace.  In a few places, it
entailed:

1. Updating some comments and removing now-redundant foo[size-1]='\0'
   statements.
2. Updating passing (size-1) to (size) (because opal_string_copy()
   wants the entire destination buffer length).

This commit actually fixes a bunch of potential (yet quite unlikely)
bugs where we could have ended up with non-null-terminated strings.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-09-27 11:56:18 -07:00
Jeff Squyres
e1cf8757ae opal/util/string_copy: add "safe" string copy function
Do essentially the same thing as strncpy(3), but a) ensure to always
terminate the destination buffer with a \0, and b) do not \0-pad to
the right.
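
The two properties can be sketched as below. The function name is hypothetical and this is not the actual opal_string_copy() source; note that dest_len is the full size of the destination buffer, not size-1:

```c
#include <stddef.h>

/* Hypothetical sketch: copy at most dest_len-1 characters, always write a
 * terminating '\0', and leave the rest of the buffer untouched. */
void sketch_string_copy(char *dest, const char *src, size_t dest_len)
{
    size_t i;

    if (0 == dest_len) {
        return;                 /* no room even for the terminator */
    }

    for (i = 0; i + 1 < dest_len && '\0' != src[i]; ++i) {
        dest[i] = src[i];
    }
    dest[i] = '\0';             /* (a) always terminated, (b) no padding */
}
```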

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-09-27 10:48:04 -07:00
Jeff Squyres
38fefcf1cf opal/util/Makefile.am: fix whitespace (remove tabs)
No code or logic changes

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-09-27 09:43:03 -07:00
Jeff Squyres
6bb356ab87 Squash a bunch of harmless compiler warnings.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-09-26 12:15:21 -07:00
Boris Karasev
ed42f568ae pmix: check the old topo key to keep compatibility with old RMs
Signed-off-by: Boris Karasev <karasev.b@gmail.com>
2018-09-25 18:13:54 +03:00
Gilles Gouaillardet
db65dbd9a8 ucx: use the c99 __func__ macro instead
__FUNCTION__ macro was never standardized and should not be used.
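
For reference, a small example of the standard form; the function name is made up for illustration:

```c
/* __func__ is a predefined identifier in C99 and later; it behaves like a
 * static const char array holding the name of the enclosing function. */
const char *sketch_whoami(void)
{
    return __func__;
}
```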

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2018-09-25 11:19:18 +09:00
Jeff Squyres
176da51aec mca_base_var: fix output bug about settable vars
Fix the test that determined whether we output "writeable" or
"read-only" for MCA vars (it was checking the wrong flag).

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-09-22 07:18:45 -07:00
Nathan Hjelm
c6230657a2 opal: fix warning
Fixes #5713

Signed-off-by: Nathan Hjelm <hjelmn@me.com>
2018-09-21 00:51:40 -06:00
Brian Barrett
9344afd485 openib: Disable CUDA async by default
Disable async receive for CUDA under OpenIB.  While a performance
optimization, it also causes incorrect results for transfers
larger than the GPUDirect RDMA limit.  This change has been validated
and approved by Akshay.

References #3972

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2018-09-20 07:58:10 -07:00
Howard Pritchard
dccf780546
Merge pull request #5737 from hppritcha/topic/remove_scif_support
SCIF: remove it
2018-09-19 11:38:27 -06:00
Howard Pritchard
b9ac3d8931 SCIF: remove it
KNC is effectively dead.  Remove corresponding SCIF
support in Open MPI.

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2018-09-19 10:39:52 -06:00
bosilca
b0e2975f86
Merge pull request #5701 from ICLDisco/export/errors_ib
Adding error handling in OpenIB BTL
2018-09-19 09:45:58 -04:00
bosilca
464e1abbab
Merge pull request #5700 from ICLDisco/export/tcp_errors
Handle error cases in TCP BTL
2018-09-19 09:44:38 -04:00
Michael Kuron
17b0f1fcc3 Deal with EOPNOTSUPP returned from getsockopt()
This can be returned when running on QEMU user-mode emulation,
which does not support getsockopt with SO_RCVTIMEO.

Signed-off-by: Michael Kuron <mkuron@icp.uni-stuttgart.de>
2018-09-16 14:55:28 +02:00
Jeff Squyres
3970b06134 misc: passing a bool to va_start() is undefined
According to clang on MacOS, passing a bool parameter -- which
undergoes default argument promotion -- to va_start() results in
undefined behavior.  So just change these params to int and avoid the
issue.
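
A sketch of the pattern, using a made-up variadic function: the last named parameter, which is the one handed to va_start(), is declared as int rather than bool:

```c
#include <stdarg.h>

/* Hypothetical example: sums `count` int arguments and optionally negates
 * the result.  `negate` is logically a boolean, but it is declared as int
 * because it is the parameter passed to va_start(). */
int sketch_sum(int count, int negate, ...)
{
    va_list ap;
    int total = 0;

    va_start(ap, negate);
    for (int i = 0; i < count; ++i) {
        total += va_arg(ap, int);
    }
    va_end(ap);

    return negate ? -total : total;
}
```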

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-09-15 06:04:13 -07:00
Jeff Squyres
8f2620d3af misc: compiler warning fixes
A variety of small compiler warning fixes.  The 2 PMIx fixes are
already committed upstream.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-09-15 06:04:13 -07:00
Nathan Hjelm
6363889cd0
Merge pull request #5696 from hjelmn/vader_5375
btl/vader: ensure that the send tag is always written last
2018-09-14 12:33:35 -06:00
Nathan Hjelm
1071d72130
Merge pull request #5445 from hjelmn/asm_type
Update opal to use C11 atomics if available
2018-09-14 12:32:56 -06:00
Nathan Hjelm
fe6528b0d5 opal/atomic: always use C11 atomics if available
This commit disables the use of both the builtin and hand-written
atomics if proper C11 atomic support is detected. This is the first
step towards requiring the availability of C11 atomics for the C
compiler used to build Open MPI.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-09-14 10:51:05 -06:00
Nathan Hjelm
000f9eed4d opal: add types for atomic variables
This commit updates the entire codebase to use specific opal types for
all atomic variables. This is a change from the prior atomic support
which required the use of the volatile keyword. This is the first step
towards implementing support for C11 atomics as that interface
requires the use of types declared with the _Atomic keyword.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-09-14 10:48:55 -06:00
Nathan Hjelm
850fbff441 btl/vader: ensure that the send tag is always written last
To ensure fast box entries are complete when processed by the
receiving process the tag must be written last. This includes a zero
header for the next fast box entry (in some cases). This commit fixes
two instances where the tag was written too early. In one case, on
32-bit systems it is possible for the tag part of the header to be
written before the size. The second instance is an ordering issue. The
zero header was being written after the fastbox header.
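
The write ordering can be sketched like this, assuming a hypothetical header with separate size and tag fields (not the real vader layout). Everything else is stored first, then a release fence, then the tag:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical header layout, for illustration only. */
typedef struct {
    uint32_t size;
    uint32_t tag;   /* a non-zero tag tells the receiver the entry is ready */
} sketch_send_hdr_t;

void sketch_fbox_publish(volatile sketch_send_hdr_t *hdr,
                         uint32_t size, uint32_t tag)
{
    hdr->size = size;                           /* write everything else first */
    atomic_thread_fence(memory_order_release);  /* order prior stores before the tag */
    hdr->tag = tag;                             /* the tag is written last */
}
```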

Fixes #5375, #5638

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-09-14 10:35:35 -06:00
Ralph Castain
7facb3f3e9 Pickup and deploy network-specific envars
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-09-13 09:24:19 -07:00
Ralph Castain
c7939aeebc Pickup some minor debugging changes
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-09-13 09:11:28 -07:00
Ralph Castain
466cad6cb2 Update master to PMIx v4
Retain ext3x for PMIx 3 compatibility
Get the blasted permissions correct on config files

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-09-13 08:24:17 -07:00
Ralph Castain
fb67b1703f
Merge pull request #5689 from rhc54/topic/psm2
Silence warning of unused function
2018-09-12 16:40:23 -07:00
Ralph Castain
853dc96c5d Silence warning of unused function
Requires protection for HAVE___CLEAR_CACHE

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-09-12 15:03:59 -07:00
Jeff Squyres
78c1469f93
Merge pull request #5682 from selvintxavier/brcm_hcas
Add support for different Broadcom HCAs
2018-09-12 11:59:21 -04:00
Selvin Xavier
a53a6f7650 Add support for different Broadcom HCAs
Adds device ids of different Broadcom adapters from
BCM57XXX and BCM58XXX family of HCAs.

Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
2018-09-12 02:22:34 -07:00
Ralph Castain
83ba589084 Silence warning
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-09-11 18:01:12 -07:00
Nathan Hjelm
9a514c6794
Merge pull request #5647 from hjelmn/clear_cache
patcher/base: improve instruction cache flush for aarch64
2018-09-10 15:43:35 -06:00
Jeff Squyres
a1b879d176
Merge pull request #5651 from jsquyres/pr/dont-let-openib-yell-about-no-nics
btl/openib: don't complain about no NICs
2018-09-07 15:00:24 -04:00
Howard Pritchard
dc02e54320
Merge pull request #5516 from thananon/ofi_send
btl/ofi: Added 2 sided communication support.
2018-09-06 18:39:23 -06:00
Jeff Squyres
098ec55e37 btl/openib: don't complain about no NICs
Since openib is on its long, slow way out the door, don't let it
complain about not being able to find any NICs at run time.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-09-06 11:26:58 -07:00
Nathan Hjelm
1cdbceb095 patcher/base: improve instruction cache flush for aarch64
This commit updates the patcher component to either use the
__clear_cache intrinsic or the correct assembly to flush the
instruction cache.

Fixes #5631

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-09-06 09:47:53 -06:00
Nathan Hjelm
5b9a24f13b
Merge pull request #5648 from hjelmn/btl_uct_fix
btl/uct: add missing opal_mem_hooks_unregister_release call
2018-09-05 13:59:22 -06:00
Nathan Hjelm
36c206d2d6 btl/uct: add missing opal_mem_hooks_unregister_release call
This commit fixes a bug when using the UCT btl with the UCX memory
hooks disabled. We were missing a call to
opal_mem_hooks_unregister_release to remove the btl memory hook
callback.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-09-05 13:01:45 -06:00
Jeff Squyres
4173ac6dd0 mca/base: enforce max string lengths
Ensure that the project, framework, component, and variable names are
lower than max lengths.  This is a follow-on to
992a8e8297, per discussion on
https://github.com/open-mpi/ompi/pull/5642.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-09-05 08:42:00 -07:00
George Bosilca
992a8e8297
Use less memory for the MCA variable names.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
2018-08-31 11:36:40 +02:00
Jeff Squyres
05e5f61fe1 common/verbs-usnic: check that it will actually compile
If someone specifies --with-verbs-usnic, actually do a configury check
to ensure that it will compile (vs. assuming that it will compile if
someone asks for it).

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-08-30 13:10:49 -07:00
Gilles Gouaillardet
ad2c207a7e mpool/memkind: plug a memory leak in mca_mpool_memkind_close()
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2018-08-30 10:07:17 +09:00
Gilles Gouaillardet
7556dd0abb opal/util: plug a memory leak in the opal_infosubscriber_t destructor
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2018-08-30 10:07:17 +09:00
Gilles Gouaillardet
aeddd2f249 pmix/pmix3x: plug a memory leak in external_register()
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2018-08-30 10:07:16 +09:00
Gilles Gouaillardet
6e47c5708e pmix/base: plug a memory leak in opal_pmix_base_select()
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2018-08-30 10:07:16 +09:00
Yossi Itigin
68206a5635
Merge pull request #5569 from hoopoepg/topic/optimize-blocked-calls
PML/UCX: blocked calls optimizations
2018-08-29 14:19:09 +03:00
Yossi Itigin
4bb6845888
Merge pull request #5570 from hoopoepg/topic/opal-mem-hooks-syno
MCA/COMMON/UCX: added synonym to opal_mem_hook variable
2018-08-29 14:16:33 +03:00
Artem Polyakov
00d12c058f
Merge pull request #5596 from karasevb/fix_hwloc_numa_obj
Fixed the NUMA obj detection for hwloc ver >= 2.0.0
2018-08-27 22:15:41 -07:00
Sergey Oblomov
2cd9e04166 PML/UCX: optimization of mprobe call - renamed vars
- renamed internal variables
- used unsigned datatypes

Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
2018-08-27 09:50:39 +03:00
Sergey Oblomov
38e908f83e PML/UCX: optimization of mprobe call
- refactoring of opal/UCX progress calls

Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
2018-08-27 09:50:38 +03:00
Sergey Oblomov
b0f87f2235 PML/UCX: blocked calls optimizations
- added UCX progress priority

Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
2018-08-27 09:50:38 +03:00
Boris Karasev
beb0697f24 Fixed copyrights of prev commit.
Signed-off-by: Boris Karasev <karasev.b@gmail.com>
2018-08-27 09:50:11 +03:00
Gilles Gouaillardet
1665b8db8f
Merge pull request #5519 from extrowerk/haiku_patches
fcntl include bugfix
2018-08-27 09:46:43 +09:00
Sergey Oblomov
b72dd83f05 MCA/COMMON/UCX: added synonyms for common ucx variables
- added synonyms for atomic/osc modules

Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
2018-08-26 18:25:21 +03:00
Boris Karasev
e5291ccc34 Fixed the NUMA obj detection for hwloc ver >= 2.0.0
Since version 2.0.0, hwloc has a new organization of NUMA nodes in the
topology tree. This commit adds detection of the local NUMA object for
hwloc >= 2.0.0, which fixes the process binding policy for the rmaps
mindist component.

Signed-off-by: Boris Karasev <karasev.b@gmail.com>
2018-08-24 19:11:52 +03:00
Jeff Squyres
fe0852bcb4 Miscellaneous compiler warning stomps.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-08-24 07:39:14 -07:00
Nathan Hjelm
0a9c358ef8
Merge pull request #5577 from hjelmn/btl_vader_sc_emu_warnings
btl/vader: clean up debugging and squash warning
2018-08-23 09:41:13 -06:00
Nathan Hjelm
c74cf666a9 btl/vader: clean up debugging and squash warning
References #5512

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-08-22 10:57:32 -06:00
Nathan Hjelm
a99dba6cb7
Merge pull request #5553 from markalle/info_snprintf2
snprintf() length fix for info
2018-08-22 10:27:14 -06:00
Sergey Oblomov
6a7f66d9c2 MCA/COMMON/UCX: renamed synonym to opal_mem_hook variable
Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
2018-08-22 14:12:33 +03:00
Sergey Oblomov
e00f7a68ba MCA/COMMON/UCX: added synonym to opal_mem_hook variable
- added synonym to opal_mem_hook variable to allow
  to print it in opal_info -a

Signed-off-by: Sergey Oblomov <sergeyo@mellanox.com>
2018-08-21 15:05:12 +03:00
Ralph Castain
5cfa2a7fca Complete integration of job_control
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-08-20 16:10:50 -07:00
Ralph Castain
9948084130 Update to PMIx v3.0.1
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-08-20 15:05:24 -07:00
Aurélien Bouteiller
e46c907468
Adding error handling in OpenIB BTL
bugfix: major: openib send credits returned correctly after a fault for pending frags to dead processes; also tweak the default IB retry timeouts to make this happen faster

Make it compile in non-debug builds

Mark the IB endpoint as failed when invoking an error; this resolves UDCM connection deadlocks

Changing the default IB retry timeouts is not a good idea.
We'll need to find another way to speedup credit recovery in failure cases.

Remove ULFM specific cases

Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>
2018-08-16 16:53:17 -04:00
Mark Allen
ee6eb9b12b snprintf() length fix for info
The important part of this fix is a couple places 5 was hard-coded that needed to be
strlen(OPAL_INFO_SAVE_PREFIX).

But also this contains a fix for a gcc 7.3.0 compiler warning about snprintf(). There
was an "if" statement making sure all the arguments had appropriate strlen(), but gcc
still complained about the following snprintf() because the size of the struct element
is iterator->ie_key[OPAL_MAX_INFO_KEY + 1].

Signed-off-by: Mark Allen <markalle@us.ibm.com>
2018-08-16 15:32:04 -04:00
Aurelien Bouteiller
6acebc40a1
Handle error cases in TCP BTL
When an error is returned by the socket operations, trigger the
appropriate error path in the PML to give an opportunity for
rerouting/error handling.

Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>
2018-08-14 15:35:24 -04:00
Nathan Hjelm
dca3516765 btl/vader: move memory barrier to where it belongs
The write memory barrier was intended to precede setting a fast-box
header but instead follows it. This commit moves the memory barrier to
the intended location.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-08-13 10:14:34 -06:00
Nathan Hjelm
e0f73866ef
Merge pull request #5525 from hjelmn/event_threading
opal/progress: protect against multiple threads in event base
2018-08-13 09:21:44 -06:00
Jeff Squyres
01e4570af7 hwloc201/configure.m4: make it safe when used with hwloc:external
The Autoconf AC_CONFIG_* macros can only be instantiated exactly once
for any given file, *and* they must be in a code execution path at run
time for the target file to be generated at the end of configure.

For example, if you want to generate file ABC at the end of configure,
you must invoke the AC_CONFIG_FILES(ABC) macro in a code path that
will get executed when configure is run.

That's pretty straightforward.

What's not straightforward is two corner cases:

1. You cannot invoke the AC_CONFIG_FILES(ABC) macro for the same file
   more than once.  If you do, autoreconf will fail (even before you
   can run configure).
2. If AC_CONFIG_FILES(ABC) is not in a code path that is executed by
   configure, the file ABC is not registered properly, and ABC will
   not be generated at the end of configure.

This applies to hwloc because hwloc's HWLOC_SETUP_CORE macro calls
both AC_CONFIG_FILES and AC_CONFIG_HEADER to setup its Makefiles
(etc.) so that targets like "make distclean" and "make distcheck" will
work properly.  Hence, we *have* to invoke HWLOC_SETUP_CORE.

However, the MCA_opal_hwloc_hwloc201_CONFIG macro has a few side
effects.  It would be nice to be able to do something like this:

```
    if hwloc:external is going to be used:
        Invoke minimal HWLOC_SETUP_CORE (with no side effects)
    else
        Invoke full HWLOC_SETUP_CORE (with side effects)
    fi
```

But we can't, because autoreconf will detect that AC_CONFIG_FILES has
been invoked on the same files more than once (regardless of whether
those code paths will be executed at run time or not).  Kaboom.

Similarly, we can't do this:

```
    if hwloc:external is not going to be used:
        Invoke full HWLOC_SETUP_CORE (with side effects)
    fi
```

Because then hwloc's AC_CONFIG_FILES won't be registered properly when
hwloc:external *is* used (i.e., when the HWLOC_SETUP_CORE macro is not
in a code path that is executed at run time), and targets like "make
distclean" will fail because hwloc's Makefiles won't have been setup.
Kaboom.

But remember that the hwloc framework is a bit special: there will
only ever be 2 components: external and internal.  External is
guaranteed to be configured first because of its priority.  So the
internal component (i.e., this component) immediately knows if it is
going to be used or not based on whether the external component
configuration succeeded or failed.

Specifically: regardless of whether the internal component (i.e., this
component) is going to be used, we have to invoke HWLOC_SETUP_CORE.
But we can manage the side effects: allow the side effects when
this/internal component is going to be used, and avoid the side
effects when this/internal component is not going to be used.

This is a little less clean than I would have liked, but because of
Autoconf's oddity about its AC_CONFIG_* macros, this is the only
solution I could come up with.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-08-11 11:05:23 -07:00
Jeff Squyres
69aa46e167 libevent2022/configure.m4: always invoke sub-configure
In order to make "make distclean" (and friends) work, we need to
*always* invoke the embedded configure script -- even if we know that
we're not going to use this component.

But in cases where we know we're not going to use this component, we
also need to avoid the side effects of the code path that is used when
we *do* want to use this component.  So split the two possibilities
into two different macros:

1. MCA_opal_event_libevent2022_FAKE_CONFIG: which does almost nothing
   except invoke the underlying "configure" script.
2. MCA_opal_event_libevent2022_REAL_CONFIG: which does all the real
   work (including invoking the underlying "configure" script).

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-08-11 06:03:46 -07:00
Jeff Squyres
80df3f040b libevent2022/configure.m4: trivial cleanup
Put argument to AM_CONDITIONAL inside [].  No code or logic changes.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-08-10 10:30:45 -07:00
Jeff Squyres
17aa64e438 libevent2022/configure.m4: minor comment cleanup
Change # -> dnl.  No code or logic changes.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-08-10 10:30:04 -07:00
Jeff Squyres
b063cb6b0f libevent2022: only configure if event:external fails
We know that event:external will be configured first (because of its
priority).  Take advantage of that here in libevent2022 by having it
refuse to configure / politely fail if event:external succeeded.

Also print out some additional lines in configure output indicating
what is going on (i.e., event:external succeeded, so this component
will be skipped, or event:external failed, so this component will be
used).

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-08-08 10:22:38 -07:00
Jeff Squyres
4e5f432786 hwloc201: only configure if hwloc:external fails
We know that hwloc:external will be configured first (because of its
priority).  Take advantage of that here in hwloc201 by having it
refuse to configure / politely fail if hwloc:external succeeded.

Also print out some additional lines in configure output indicating
what is going on (i.e., hwloc:external succeeded, so this component
will be skipped, or hwloc:external failed, so this component will be
used).

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-08-08 10:22:38 -07:00
Nathan Hjelm
bdbb853461 opal/progress: protect against multiple threads in event base
libevent does not support multiple threads calling the event loop on
the same event base. This causes external libevent installations to print out
re-entrant warning messages. This commit fixes the issue by protecting
the call to the event loop with an atomic swap check.
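
One way to picture the guard, as a sketch with hypothetical names (the stand-in loop function below takes the place of the real libevent call, which is not shown in this log):

```c
#include <stdatomic.h>

/* Hypothetical flag: 1 while some thread is inside the event loop. */
atomic_int sketch_loop_active = 0;

/* Stand-in for the real event loop call; pretends one event was handled. */
static int sketch_event_loop(void)
{
    return 1;
}

/* Returns the number of events handled, or 0 if another thread already
 * owns the event base and this call was skipped. */
int sketch_progress_events(void)
{
    int events;

    /* atomic swap: only the thread that sees the old value 0 may enter */
    if (0 != atomic_exchange(&sketch_loop_active, 1)) {
        return 0;
    }

    events = sketch_event_loop();
    atomic_store(&sketch_loop_active, 0);   /* let the next caller in */
    return events;
}
```

Threads that lose the swap simply skip the event loop for that progress call instead of blocking.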

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-08-07 16:13:43 -06:00
Nathan Hjelm
eeae3f9b93
Merge pull request #5517 from bosilca/topic/treematch_warnings
Remove few warnings identified by @rhc in #5514.
2018-08-06 13:25:07 -06:00
Zoltán Mizsei
ac3f8a16ed fcntl include bugfix
Signed-off-by: Zoltán Mizsei <zmizsei@extrowerk.com>
2018-08-06 19:45:59 +02:00
Boris Karasev
57683366ca pmix: added check for pmix fence status
Signed-off-by: Boris Karasev <karasev.b@gmail.com>
2018-08-06 15:01:57 +06:00
George Bosilca
6d11a45f44
Remove few warnings identified by @rhc in #5514.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
2018-08-03 16:21:06 -04:00
Thananon Patinyasakdikul
080115d440 btl/ofi: Added 2 sided communication support.
Two-sided communication support is added so that non-tag-matching
providers can take advantage of this BTL and PML OB1. The current state
is "functional" and not optimized for performance.

Two sided support is disabled by default and can be turned on by mca
parameter: "mca_btl_ofi_mode".

Signed-off-by: Thananon Patinyasakdikul <thananon.patinyasakdikul@intel.com>
2018-08-03 12:30:03 -07:00
Ralph Castain
f7a537cf04 Fix typo
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-07-25 20:05:33 -07:00
Ralph Castain
bcdb1f45ac Fix the multiple pe/proc option
Things got a little out of whack and we weren't actually processing the map-by modifiers, plus an error crept into the display of the binding report. So clean those up.

Thanks to @tonyreina for the error report

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-07-25 18:47:39 -07:00
Ralph Castain
55cefedf9b Cleanup pmix selection check
Allow for versions > 3

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-07-25 16:11:32 -07:00
Nathan Hjelm
47ed8e8830 btl/uct: fix compile warnings/errors
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-07-23 14:04:38 -06:00
Ralph Castain
5cab823979 Leave opal_event_external_support exposed as global var
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-07-23 09:18:48 -07:00
Jeff Squyres
a70ecf5267 event/external: prefer external event component
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-07-23 09:20:27 +09:00