1
1

4193 Коммитов

Автор SHA1 Сообщение Дата
Jeff Squyres
b7e937fea5 Merge pull request #1778 from thananon/usnic_thread_safe
Added MPI_THREAD_MULTIPLE support for btl/usnic.
2016-06-14 18:43:04 -04:00
Jeff Squyres
5071602c59 PSM/PSM2: Disable signal handler hijacking by default
Per discussion on https://github.com/open-mpi/ompi/pull/1767 (and some
subsequent phone calls and off-issue email discussions), the PSM
library is hijacking signal handlers by default.  Specifically: unless
the environment variables `IPATH_NO_BACKTRACE=1` (for PSM / Intel
TrueScale) is set, the library constructor for this library will
hijack various signal handlers for the purpose of invoking its own
error reporting mechanisms.

This may be a bit *surprising*, but is not a *problem*, per se.  The
real problem is that older versions of at least the PSM library do not
unregister these signal handlers upon being unloaded from memory.
Hence, a segv can actually result in a double segv (i.e., the original
segv and then another segv when the now-non-existent signal handler is
invoked).

This PSM signal hijacking subverts Open MPI's own signal reporting
mechanism, which may be a bit surprising for some users (particularly
those who do not have Intel TrueScale).  As such, we disable it by
default so that Open MPI's own error-reporting mechanisms are used.

Additionally, there is a typo in the library destructor for the PSM2
library that may cause problems in the unloading of its signal
handlers.  This problem can be avoided by setting `HFI_NO_BACKTRACE=1`
(for PSM2 / Intel OmniPath).

This is further compounded by the fact that the PSM / PSM2 libraries
can be loaded by the OFI MTL and the usNIC BTL (because they are
loaded by libfabric), even when there is no Intel networking hardware
present.  Having the PSM/PSM2 libraries behave this way when no Intel
hardware is present is clearly undesirable (and is likely to be fixed
in future releases of the PSM/PSM2 libraries).

This commit sets the following two environment variables to disable
this behavior from the PSM/PSM2 libraries (if they are not already
set):

* IPATH_NO_BACKTRACE=1
* HFI_NO_BACKTRACE=1

If the user has set these variables before invoking Open MPI, we will
not override their values (i.e., their preferences will be honored).

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-06-14 11:45:23 -07:00
Thananon Patinyasakdikul
ee85204c12 Added MPI_THREAD_MULTIPLE support for btl/usnic. 2016-06-13 13:47:06 -07:00
Nathan Hjelm
253c91972e arm64: add atomic swap function
This commit adds the opal_atomic_swap_32 and opal_atomic_swap_64
functions. This should improve the performance of btl/vader.

Signed-off-by: Nathan Hjelm <hjelmn@me.com>
2016-06-11 09:46:29 -06:00
Nathan Hjelm
109389dce2 Merge pull request #1634 from hjelmn/cma
cma: add support for MIPS and ARM
2016-06-11 09:20:28 -06:00
Ralph Castain
d58da99dbc Shift to memcpy to avoid Solaris issues 2016-06-09 12:07:17 -07:00
Gilles Gouaillardet
1f651d17c1 opal/util/ethtool: fix (infamous) strncpy usage
the infamous strncpy does not NULL terminate the destination when the buffer is truncated
do it ourself !

fix CID 1362576
2016-06-09 09:54:50 +09:00
Ralph Castain
8fa935534b Abstract the strnlen function for environments that do not have it (e.g., Solaris 10) 2016-06-08 10:12:43 -07:00
Nathan Hjelm
f8957f24af Merge pull request #1768 from hjelmn/cq_fix
btl/openib: fix cq resize calculation
2016-06-07 21:34:36 -06:00
Nathan Hjelm
dd519c55b1 btl/openib: fix cq resize calculation
Before dynamic add_procs the openib_btl_size_queues was called exactly
once for non-dynamic jobs. Now the function is called on each new
connection so the calculation was wrong. Re-wrote the function to
correctly calculate the CQ size and only attempt to adjust the CQ if
the requested size has changed. This fixes a bug when using the openib
btl on psm2 hardware that is caused by the time needed to resize a
CQ. The overhead was causing udcm to timeout and fail.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-06-07 16:05:56 -06:00
Nathan Hjelm
e082ed752a opal/progress: fix warnings
This commit fixes several warning introduced by
open-mpi/ompi@fc26d9c69f .

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-06-06 22:18:24 -06:00
Nathan Hjelm
4a2bd83302 opal/cma: improve Linux CMA detection
This commit improves the CMA detection when the installed glibc doesn't
have support for CMA. In this case we need to verify that the syscall
numbers in opal/include/opal/sys/cma.h are valid for the architecture.
This verification is done by attempting to use CMA while including the
internal header.

Signed-off-by: Nathan Hjelm <hjelmn@me.com>
2016-06-05 22:29:07 -06:00
Gilles Gouaillardet
b707d138fe pmix114/pmix1_client: fix misc memory leaks
Fixes CID 1325146-1325149
2016-06-06 09:33:35 +09:00
Nathan Hjelm
0084ad0d1b opal: add armv8 support
This commit adds assembly support for aarch64.

Signed-off-by: Nathan Hjelm <hjelmn@me.com>
2016-06-03 10:32:21 -06:00
Nathan Hjelm
6169d03ea3 btl: adjust values of new atomic flags
Signed-off-by: Nathan Hjelm <hjelmn@me.com>
2016-06-02 19:21:34 -06:00
Nathan Hjelm
9f43b23725 Merge pull request #1710 from hjelmn/ugni_atomics
Additional ugni atomics
2016-06-02 18:25:49 -06:00
George Bosilca
e1c6b0e4a7 Some compilers are more than picky. 2016-06-03 09:04:34 +09:00
Nathan Hjelm
d9fc855955 Merge pull request #1743 from hjelmn/gcc_atomics_fix
atomic/gcc: add check for 128-bit CAS being lock-free
2016-06-02 16:55:31 -06:00
Nathan Hjelm
d86e41ea13 atomic/gcc: add check for 128-bit CAS being lock-free
Compiler implementations are free to include support for atomics that
use locks. Unfortunately lock-free and lock atomics do not mix. Older
versions of llvm on OS X use locks to provide
__atomic_compare_exchange on 128-bit values but are lock-free on
64-bit values. This screws up our lifo implementation which mixes
64-bit and 128-bit atomics on the same values to improve
performance. This commit adds a configure-time check if 128-bit
atomics are lock free. If they are not then the 128-bit __atomic CAS
is disabled and we check for the __sync version as a fallback.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-06-02 15:59:05 -06:00
Nathan Hjelm
5aab4b2d51 Merge pull request #1662 from ggouaillardet/topic/amd64_atomic
amd64/atomic: silence warnings
2016-06-02 14:10:20 -06:00
George Bosilca
87b1d17e7e Remove warnings.
clang 7.0 with the picky option on is extremely verbose, and complains
about almost everything. Trying to make him happy, at least regarding
the datatype engine.
2016-06-03 00:56:24 +09:00
rhc54
483b9c370a Merge pull request #1741 from rhc54/topic/pmix114
Update to 1.1.4rc3
2016-06-02 06:57:37 -07:00
Nathan Hjelm
fc26d9c69f Merge pull request #1734 from hjelmn/progress_threading
opal/progress: make progress function registration mt safe
2016-06-02 06:35:59 -06:00
Ralph Castain
ecea1e3bb5 Update to 1.1.4rc3 2016-06-01 20:56:07 -07:00
Nathan Hjelm
2fad3b9bc6 opal/progress: make progress function registration mt safe
This commit fixes a bug in opal progress registration that can cause
crashes when a progress function is registered while another thread is
in opal_progress(). Before this commit realloc is used to allocate
more space for progress functions but it is possible for a thread in
opal_progress() to try to read from the array that is freed by realloc
before the array is re-assigned when realloc returns. To prevent this
race use malloc + memcpy to fill the new array and atomically swap out
the old and new array pointers.

Per suggestion we now allocate a default of 8 slots for callbacks and
double the current number when we run out of space.

This commit also fixes leaking the callbacks_lp array.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-06-01 20:57:19 -06:00
George Bosilca
d9fb59bea5 Update the synchronization primitive
Add comments and make sure we correctly return the status of the
synchronization primitive, especially if it was completed with error.
2016-06-02 11:53:56 +09:00
Nathan Hjelm
f33bbfd381 atomic: add support for __atomic builtins (#1735)
* atomic: add support for __atomic builtins

This commit adds support for the gcc __atomic builtins. The __sync
builtins are deprecated and have been replaced by these atomics. In
addition, the new atomics support atomic exchange which was not
supported by __sync.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>

* atomic: add support for transactional memory

This commit adds support for using transactional memory when using
opal atomic locks. This feature is enabled if the __HLE__ feature is
available and the gcc builtin atomics are in use.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-06-01 21:23:47 -04:00
rhc54
b85a5e62ab Merge pull request #1739 from rhc54/topic/pmix
Split the pmix external component into one for the 1.1.4 release, and…
2016-06-01 16:24:44 -07:00
Ralph Castain
12ecf972af Split the pmix external component into one for the 1.1.4 release, and another for the upcoming 2.0 release. Clean up the configury so the components look for a series-specific function instead of running a program.
NOTE: the changes for the 2.0 series are not yet in the PMIx master.
2016-06-01 14:15:24 -07:00
Nathan Hjelm
ceb2912838 Merge pull request #1736 from hjelmn/ugni_fixes
ugni BTL fixes
2016-06-01 14:59:55 -06:00
Jeff Squyres
d175fd692d README.ompi: track patches added to hwloc
Track post-v1.11.3-release patches applied to the hwloc copy embedded
in Open MPI.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-06-01 07:17:05 -07:00
Jeff Squyres
3867bd3640 hwloc.m4: only check for valgrind in non-embedded mode
This fixes https://github.com/open-mpi/ompi/issues/1732: i.e., the
case where the outer project has its own check for
<valgrind/valgrind.h>, but also supplements CPPFLAGS (to find
Valgrind's header files) before doing that check.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>

Ideally, we would tell OMPI to disable autoconf's caching of our
valgrind check result so that its check gets the right result after
adding CPPFLAGS. Not sure if we can do that.

For now, just disable our Valgrind code in embedded mode.
This will keep the x86 backend enabled under Valgrind but
it will auto-disable itself when finding identical APIC ids anyway
(because CPUID returns same outputs for all PUs).

Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>

Fixes open-mpi/ompi#1732

(cherry picked from commit open-mpi/hwloc@8b44fb1c81)
2016-06-01 06:58:53 -07:00
Gilles Gouaillardet
57978a75d0 Merge pull request #1717 from ggouaillardet/topic/lex_cleanup
configury: clean the flex generated .c files
2016-06-01 13:06:21 +09:00
Nathan Hjelm
5d4bcce042 Merge pull request #1700 from shamisp/topic/cma_config
CMA: Fixing logic for CMA system call detection
2016-05-31 20:33:48 -06:00
Nathan Hjelm
340152a635 Merge pull request #1720 from shamisp/topic/vader/max_addr
VADER: Adjusting VADER_MAX_ADDRESS for non x86 platforms.
2016-05-31 20:33:28 -06:00
Gilles Gouaillardet
5f565dfec3 configury: clean the flex generated .c files 2016-06-01 11:13:31 +09:00
Nathan Hjelm
bf10d79914 btl/ugni: remove erroneous unlock
The endpoint lock was being released twice in mca_btl_ugni_get_ep.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-05-31 16:52:53 -06:00
Nathan Hjelm
cc96097873 btl/ugni: fix bug when attempting unaligned get on aries
This commit fixes a programming error when using an aries nic. The
documentation of ugni shows that only the local alignment restriction
for get was lifted on aries. There is still a remote address alignment
restriction.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-05-31 16:52:09 -06:00
Jeff Squyres
5cfee95ea4 hwloc1113: add missing file to Makefile.am
Lack of this file causes a failure when you run autogen.pl on a
distribution tarball.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-05-31 09:57:50 -07:00
Nathan Hjelm
60519c2b4e cma: add support for MIPS and ARM
Signed-off-by: Nathan Hjelm <hjelmn@me.com>
2016-05-30 12:13:20 -06:00
George Bosilca
d2abff583e Fix race condition during BTL TCP tear-down.
bot🏷️bug
bot:assign:@hjelmn
2016-05-30 10:47:14 -05:00
Jeff Squyres
e126d2cd18 Merge pull request #1584 from bgoglin/master
Update hwloc to v1.11.3
2016-05-28 11:01:54 -04:00
Ralph Castain
55923eacd3 Stealing some pieces of Josh Hursey's PR #1583 and modifying a bit, allow the opal/pmix external component to handle both PMIx 1.1.4 and PMIx 2.0 versions. Automatically detect the version of the target external library and adjust the only two APIs that changed (PMIx_Init and PMIx_Finalize)
Rename temp vars in .m4 to avoid conflict with Travis
2016-05-27 08:06:31 -07:00
Nathan Hjelm
28dfa36a3f btl/ugni: fix bug when attempting unaligned get on aries
This commit fixes a programming error when using an aries nic. The
documentation of ugni shows that only the local alignment restriction
for get was lifted on aries. There is still a remote address alignment
restriction.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-05-27 08:22:13 -06:00
Nathan Hjelm
c19426ac1b btl/ugni: add support for additional atomic operations
This commit adds support for Cray Aries atomic operations. This
includes 32-bit and floating point support.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-05-27 08:22:13 -06:00
Nathan Hjelm
23fe19a956 btl: add support for more atomics
This commit add support for more atomic operations and type. The
operations added are logical and, logical or, logical xor, swap, min,
and max. New types are 32-bit int by using the
MCA_BTL_ATOMIC_FLAG_32BIT flag, 64-bit float by using the
MCA_BTL_ATOMIC_FLAG_FLOAT flag, and 32-bit float by using both
flags. Floating point numbers are supported by packing the number in
as an int64_t or int32_t. We will update the btl interface in the
future to make this less confusing.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-05-27 08:22:13 -06:00
Nathan Hjelm
d25b846c01 Merge pull request #1704 from hpcraink/pr/configure_framework
Fix configure for FreePGI on OSX
2016-05-26 17:01:08 -06:00
Nathan Hjelm
8c9292d5d1 Merge pull request #1721 from hjelmn/xrc_fix
btl/openib: fix XRC WQE calculation
2016-05-26 17:00:31 -06:00
Nathan Hjelm
56bdcd0888 btl/openib: fix XRC WQE calculation
Before dynamic add_procs support was committed to master we called
add_procs with every proc in the job. The XRC code in the openib btl
was taking advantage of this and setting the number of work queue
entries (WQE) based on all the procs on a remote node. Since that is
no longer the case we can not simply increment the sd_wqe field on the
queue pair. To fix the issue a new field has been added to the xrc
queue pair structure to keep track of how many wqes there are total on
the queue pair. If a new endpoint is added that increases the number
of wqes and the xrc queue pair is already connected the code will
attempt to modify the number of wqes on the queue pair. A failure is
ignored because all that will happen is the number of active send work
requests on an XRC queue pair will be more limited.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-05-26 15:58:31 -06:00
Aurelien Bouteiller
49bd28d0ac Merge pull request #1714 from hjelmn/scif_exclusivity
btl/scif: reduce default exclusivity
2016-05-26 17:53:11 -04:00