Commit graph

4302 commits

Author SHA1 Message Date
Ralph Castain
044c561cba Roll to latest PMIx master 2016-06-16 17:30:30 -07:00
Nathan Hjelm
9c709966f7 opal/asm: fix syntax of timer code for ia32
Thanks to Paul Hargrove for pointing this out.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-06-16 16:55:01 -06:00
rhc54
702a982271 Merge pull request #1767 from rhc54/topic/pmix2
Enable the PMIx event notification capability
2016-06-16 15:27:43 -07:00
Nathan Hjelm
7349ddc937 patcher/overwrite: use OPAL_ASSEMBLY_ARCH to determine architecture
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-06-16 10:00:00 -06:00
Nathan Hjelm
dbd8369485 opal/progress: use 32-bit atomics for call counter
This commit fixes a compile error on 32-bit platforms. The
low-priority call counter was always using 64-bit atomics which will
not work if 64-bit atomic math is not available. Updated to use 32-bit
instead.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-06-16 09:01:19 -06:00
Thananon Patinyasakdikul
7bd18214a7 Fix btl/usnic deadlock when the connectivity check is turned off. 2016-06-15 07:42:55 -07:00
Jeff Squyres
b7e937fea5 Merge pull request #1778 from thananon/usnic_thread_safe
Added MPI_THREAD_MULTIPLE support for btl/usnic.
2016-06-14 18:43:04 -04:00
Ralph Castain
5d330d5220 Enable the PMIx event notification capability and use that for all error notifications, including debugger release. This capability requires use of PMIx 2.0 or above as the features are not available with earlier PMIx releases. When OMPI master is built against an earlier external version, it will fall back to the prior behavior - i.e., debugger will be released via RML and all notifications will go strictly to the default error handler.
Add PMIx 2.0

Remove PMIx 1.1.4

Cleanup copying of component

Add missing file

Touchup a typo in the Makefile.am

Update the pmix ext114 component

Minor cleanups and resync to master

Update to latest PMIx 2.x

Update to the PMIx event notification branch latest changes
2016-06-14 13:08:41 -07:00
Jeff Squyres
5071602c59 PSM/PSM2: Disable signal handler hijacking by default
Per discussion on https://github.com/open-mpi/ompi/pull/1767 (and some
subsequent phone calls and off-issue email discussions), the PSM
library is hijacking signal handlers by default.  Specifically: unless
the environment variable `IPATH_NO_BACKTRACE=1` (for PSM / Intel
TrueScale) is set, the library constructor for this library will
hijack various signal handlers for the purpose of invoking its own
error reporting mechanisms.

This may be a bit *surprising*, but is not a *problem*, per se.  The
real problem is that older versions of at least the PSM library do not
unregister these signal handlers upon being unloaded from memory.
Hence, a segv can actually result in a double segv (i.e., the original
segv and then another segv when the now-non-existent signal handler is
invoked).

This PSM signal hijacking subverts Open MPI's own signal reporting
mechanism, which may be a bit surprising for some users (particularly
those who do not have Intel TrueScale).  As such, we disable it by
default so that Open MPI's own error-reporting mechanisms are used.

Additionally, there is a typo in the library destructor for the PSM2
library that may cause problems in the unloading of its signal
handlers.  This problem can be avoided by setting `HFI_NO_BACKTRACE=1`
(for PSM2 / Intel OmniPath).

This is further compounded by the fact that the PSM / PSM2 libraries
can be loaded by the OFI MTL and the usNIC BTL (because they are
loaded by libfabric), even when there is no Intel networking hardware
present.  Having the PSM/PSM2 libraries behave this way when no Intel
hardware is present is clearly undesirable (and is likely to be fixed
in future releases of the PSM/PSM2 libraries).

This commit sets the following two environment variables to disable
this behavior from the PSM/PSM2 libraries (if they are not already
set):

* IPATH_NO_BACKTRACE=1
* HFI_NO_BACKTRACE=1

If the user has set these variables before invoking Open MPI, we will
not override their values (i.e., their preferences will be honored).

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-06-14 11:45:23 -07:00
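The "set only if not already set" behavior described in 5071602c59 can be sketched as follows. This is a minimal illustration assuming POSIX `setenv`, not the code from the commit; the helper names `set_env_default` and `disable_psm_backtrace_hooks` are hypothetical.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stdlib.h>

/* Give an environment variable a default value without clobbering a value
 * the user already exported. Returns 0 on success, -1 on failure. */
static int set_env_default(const char *name, const char *value)
{
    assert(name != NULL && value != NULL);
    /* overwrite == 0 makes setenv() leave an existing value untouched,
     * so the user's preference is honored. */
    return setenv(name, value, 0);
}

/* Hypothetical helper: apply the two defaults named in the commit message. */
static int disable_psm_backtrace_hooks(void)
{
    if (set_env_default("IPATH_NO_BACKTRACE", "1") != 0) {
        return -1;
    }
    return set_env_default("HFI_NO_BACKTRACE", "1");
}
```

Such a helper would have to run before the PSM/PSM2 library constructor reads the environment, since the hijacking happens when the library is loaded.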
Thananon Patinyasakdikul
ee85204c12 Added MPI_THREAD_MULTIPLE support for btl/usnic. 2016-06-13 13:47:06 -07:00
Nathan Hjelm
253c91972e arm64: add atomic swap function
This commit adds the opal_atomic_swap_32 and opal_atomic_swap_64
functions. This should improve the performance of btl/vader.

Signed-off-by: Nathan Hjelm <hjelmn@me.com>
2016-06-11 09:46:29 -06:00
Nathan Hjelm
109389dce2 Merge pull request #1634 from hjelmn/cma
cma: add support for MIPS and ARM
2016-06-11 09:20:28 -06:00
Ralph Castain
d58da99dbc Shift to memcpy to avoid Solaris issues 2016-06-09 12:07:17 -07:00
Gilles Gouaillardet
1f651d17c1 opal/util/ethtool: fix (infamous) strncpy usage
the infamous strncpy does not NUL-terminate the destination when the source is truncated to fit the buffer,
so terminate it ourselves!

fix CID 1362576
2016-06-09 09:54:50 +09:00
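The `strncpy` pitfall named in 1f651d17c1 can be illustrated with a small wrapper. This is a sketch of the general technique, not the code from the commit; `copy_terminated` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy src into dst (capacity dst_size bytes) and always NUL-terminate.
 * Plain strncpy() leaves dst unterminated when strlen(src) >= dst_size. */
static void copy_terminated(char *dst, const char *src, size_t dst_size)
{
    assert(dst != NULL && src != NULL && dst_size > 0);
    strncpy(dst, src, dst_size - 1);   /* leave room for the terminator */
    dst[dst_size - 1] = '\0';          /* terminate even after truncation */
}
```

With a 4-byte buffer, copying `"eth0:1"` this way leaves the terminated string `"eth"` rather than four unterminated bytes.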
Ralph Castain
8fa935534b Abstract the strnlen function for environments that do not have it (e.g., Solaris 10) 2016-06-08 10:12:43 -07:00
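A portable stand-in for `strnlen` of the kind 8fa935534b alludes to might look like the sketch below. The name `portable_strnlen` is hypothetical, and how the commit actually selects between the system function and a fallback is not shown in the log.

```c
#include <assert.h>
#include <stddef.h>

/* Length of s, examining at most maxlen bytes. Behaves like strnlen()
 * on platforms that lack it. */
static size_t portable_strnlen(const char *s, size_t maxlen)
{
    size_t len = 0;
    assert(s != NULL);
    while (len < maxlen && s[len] != '\0') {
        ++len;
    }
    return len;
}
```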
Nathan Hjelm
f8957f24af Merge pull request #1768 from hjelmn/cq_fix
btl/openib: fix cq resize calculation
2016-06-07 21:34:36 -06:00
Nathan Hjelm
17ae1aceeb btl/openib: fix rdmacm
The rdma_disconnect function specifies that both the server and client
should call rdma_disconnect. The code was not calling rdma_disconnect
on an endpoint if the event came before the endpoint finalization.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-06-07 17:53:58 -06:00
Nathan Hjelm
dd519c55b1 btl/openib: fix cq resize calculation
Before dynamic add_procs the openib_btl_size_queues was called exactly
once for non-dynamic jobs. Now the function is called on each new
connection so the calculation was wrong. Re-wrote the function to
correctly calculate the CQ size and only attempt to adjust the CQ if
the requested size has changed. This fixes a bug when using the openib
btl on psm2 hardware that is caused by the time needed to resize a
CQ. The overhead was causing udcm to time out and fail.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-06-07 16:05:56 -06:00
Nathan Hjelm
e082ed752a opal/progress: fix warnings
This commit fixes several warnings introduced by
open-mpi/ompi@fc26d9c69f .

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-06-06 22:18:24 -06:00
Nathan Hjelm
4a2bd83302 opal/cma: improve Linux CMA detection
This commit improves the CMA detection when the installed glibc doesn't
have support for CMA. In this case we need to verify that the syscall
numbers in opal/include/opal/sys/cma.h are valid for the architecture.
This verification is done by attempting to use CMA while including the
internal header.

Signed-off-by: Nathan Hjelm <hjelmn@me.com>
2016-06-05 22:29:07 -06:00
Gilles Gouaillardet
b707d138fe pmix114/pmix1_client: fix misc memory leaks
Fixes CID 1325146-1325149
2016-06-06 09:33:35 +09:00
Nathan Hjelm
0084ad0d1b opal: add armv8 support
This commit adds assembly support for aarch64.

Signed-off-by: Nathan Hjelm <hjelmn@me.com>
2016-06-03 10:32:21 -06:00
Nathan Hjelm
6169d03ea3 btl: adjust values of new atomic flags
Signed-off-by: Nathan Hjelm <hjelmn@me.com>
2016-06-02 19:21:34 -06:00
Nathan Hjelm
9f43b23725 Merge pull request #1710 from hjelmn/ugni_atomics
Additional ugni atomics
2016-06-02 18:25:49 -06:00
George Bosilca
e1c6b0e4a7 Some compilers are more than picky. 2016-06-03 09:04:34 +09:00
Nathan Hjelm
d9fc855955 Merge pull request #1743 from hjelmn/gcc_atomics_fix
atomic/gcc: add check for 128-bit CAS being lock-free
2016-06-02 16:55:31 -06:00
Nathan Hjelm
d86e41ea13 atomic/gcc: add check for 128-bit CAS being lock-free
Compiler implementations are free to include support for atomics that
use locks. Unfortunately lock-free and lock atomics do not mix. Older
versions of llvm on OS X use locks to provide
__atomic_compare_exchange on 128-bit values but are lock-free on
64-bit values. This screws up our lifo implementation which mixes
64-bit and 128-bit atomics on the same values to improve
performance. This commit adds a configure-time check if 128-bit
atomics are lock free. If they are not then the 128-bit __atomic CAS
is disabled and we check for the __sync version as a fallback.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-06-02 15:59:05 -06:00
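The lock-free question raised in d86e41ea13 can be asked of the compiler directly. The sketch below uses the GCC/Clang builtin `__atomic_always_lock_free`; it illustrates the idea only and is not the configure-time test the commit adds. The function names are hypothetical.

```c
/* 1 if the compiler guarantees lock-free atomics on 16-byte objects. */
static int cas128_is_lock_free(void)
{
    return __atomic_always_lock_free(16, 0) ? 1 : 0;
}

/* 1 if the compiler guarantees lock-free atomics on 8-byte objects. */
static int cas64_is_lock_free(void)
{
    return __atomic_always_lock_free(8, 0) ? 1 : 0;
}

/* Mixing 64-bit and 128-bit atomics on the same memory is only sound
 * when both widths are lock-free; a lock-based 128-bit CAS would not
 * exclude a concurrent lock-free 64-bit update. */
static int can_mix_64_and_128(void)
{
    return cas64_is_lock_free() && cas128_is_lock_free();
}
```

The result depends on the target and compiler flags (for example, whether a 16-byte compare-and-swap instruction is enabled), which is why a build-time check is the natural place for it.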
Nathan Hjelm
5aab4b2d51 Merge pull request #1662 from ggouaillardet/topic/amd64_atomic
amd64/atomic: silence warnings
2016-06-02 14:10:20 -06:00
George Bosilca
87b1d17e7e Remove warnings.
clang 7.0 with the picky option on is extremely verbose, and complains
about almost everything. Trying to make it happy, at least regarding
the datatype engine.
2016-06-03 00:56:24 +09:00
rhc54
483b9c370a Merge pull request #1741 from rhc54/topic/pmix114
Update to 1.1.4rc3
2016-06-02 06:57:37 -07:00
Nathan Hjelm
fc26d9c69f Merge pull request #1734 from hjelmn/progress_threading
opal/progress: make progress function registration mt safe
2016-06-02 06:35:59 -06:00
Ralph Castain
ecea1e3bb5 Update to 1.1.4rc3 2016-06-01 20:56:07 -07:00
Nathan Hjelm
2fad3b9bc6 opal/progress: make progress function registration mt safe
This commit fixes a bug in opal progress registration that can cause
crashes when a progress function is registered while another thread is
in opal_progress(). Before this commit realloc is used to allocate
more space for progress functions but it is possible for a thread in
opal_progress() to try to read from the array that is freed by realloc
before the array is re-assigned when realloc returns. To prevent this
race use malloc + memcpy to fill the new array and atomically swap out
the old and new array pointers.

Per suggestion we now allocate a default of 8 slots for callbacks and
double the current number when we run out of space.

This commit also fixes leaking the callbacks_lp array.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-06-01 20:57:19 -06:00
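The approach described in 2fad3b9bc6 (build a new array with malloc + memcpy, then swap pointers atomically, starting at 8 slots and doubling) can be sketched as below. This is an illustration under stated assumptions, not the Open MPI code: all names are hypothetical, writers are assumed to be serialized by the caller, and the old table is deliberately leaked because safe reclamation needs a guarantee that no reader still holds it.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

typedef int (*progress_cb_t)(void);

enum { INITIAL_SLOTS = 8 };   /* default slot count from the commit message */

/* Length and capacity travel with the array so a reader always sees a
 * consistent (table, len) pair. */
struct cb_table {
    size_t len;
    size_t cap;
    progress_cb_t fn[];
};

static struct cb_table *table = NULL;

/* Register a callback. Assumes callers serialize registrations. */
static int register_progress(progress_cb_t cb)
{
    struct cb_table *cur = table;

    if (NULL == cur || cur->len == cur->cap) {
        size_t cap = cur ? 2 * cur->cap : INITIAL_SLOTS;
        struct cb_table *fresh = malloc(sizeof(*fresh) + cap * sizeof(fresh->fn[0]));
        if (NULL == fresh) {
            return -1;
        }
        fresh->len = cur ? cur->len : 0;
        fresh->cap = cap;
        if (cur) {
            memcpy(fresh->fn, cur->fn, cur->len * sizeof(cur->fn[0]));
        }
        fresh->fn[fresh->len++] = cb;
        /* Publish the fully built table in one step; readers see either
         * the old table or the new one, never a freed one. */
        (void) __atomic_exchange_n(&table, fresh, __ATOMIC_ACQ_REL);
        return 0;   /* old table intentionally not freed in this sketch */
    }

    cur->fn[cur->len] = cb;
    __atomic_store_n(&cur->len, cur->len + 1, __ATOMIC_RELEASE);
    return 0;
}

/* Reader side: take a snapshot and call every registered callback. */
static int run_progress(void)
{
    struct cb_table *snap = __atomic_load_n(&table, __ATOMIC_ACQUIRE);
    int events = 0;
    if (NULL == snap) {
        return 0;
    }
    size_t n = __atomic_load_n(&snap->len, __ATOMIC_ACQUIRE);
    for (size_t i = 0; i < n; ++i) {
        events += snap->fn[i]();
    }
    return events;
}

/* Demo callback used only to exercise the sketch. */
static int one_event(void)
{
    return 1;
}
```

With `realloc`, by contrast, a reader could dereference the old block after it has been freed but before the new pointer is stored, which is the race the commit message describes.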
George Bosilca
d9fb59bea5 Update the synchronization primitive
Add comments and make sure we correctly return the status of the
synchronization primitive, especially if it was completed with error.
2016-06-02 11:53:56 +09:00
Nathan Hjelm
f33bbfd381 atomic: add support for __atomic builtins (#1735)
* atomic: add support for __atomic builtins

This commit adds support for the gcc __atomic builtins. The __sync
builtins are deprecated and have been replaced by these atomics. In
addition, the new atomics support atomic exchange which was not
supported by __sync.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>

* atomic: add support for transactional memory

This commit adds support for using transactional memory when using
opal atomic locks. This feature is enabled if the __HLE__ feature is
available and the gcc builtin atomics are in use.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-06-01 21:23:47 -04:00
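The atomic exchange that f33bbfd381 says the `__atomic` builtins make available can be shown with the GCC/Clang builtin `__atomic_exchange_n`. The wrappers below are hypothetical names for illustration and are not the `opal_atomic_*` implementations.

```c
#include <stdint.h>

/* Atomically store newval at addr and return the previous value. */
static inline int32_t swap_32(int32_t *addr, int32_t newval)
{
    return __atomic_exchange_n(addr, newval, __ATOMIC_ACQ_REL);
}

/* A try-lock built on swap: the caller owns the lock only if the
 * previous value was 0. */
static inline int try_lock(int32_t *lock)
{
    return 0 == swap_32(lock, 1);
}

static inline void unlock(int32_t *lock)
{
    __atomic_store_n(lock, 0, __ATOMIC_RELEASE);
}
```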
rhc54
b85a5e62ab Merge pull request #1739 from rhc54/topic/pmix
Split the pmix external component into one for the 1.1.4 release, and…
2016-06-01 16:24:44 -07:00
Ralph Castain
12ecf972af Split the pmix external component into one for the 1.1.4 release, and another for the upcoming 2.0 release. Clean up the configury so the components look for a series-specific function instead of running a program.
NOTE: the changes for the 2.0 series are not yet in the PMIx master.
2016-06-01 14:15:24 -07:00
Nathan Hjelm
ceb2912838 Merge pull request #1736 from hjelmn/ugni_fixes
ugni BTL fixes
2016-06-01 14:59:55 -06:00
Jeff Squyres
d175fd692d README.ompi: track patches added to hwloc
Track post-v1.11.3-release patches applied to the hwloc copy embedded
in Open MPI.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-06-01 07:17:05 -07:00
Jeff Squyres
3867bd3640 hwloc.m4: only check for valgrind in non-embedded mode
This fixes https://github.com/open-mpi/ompi/issues/1732: i.e., the
case where the outer project has its own check for
<valgrind/valgrind.h>, but also supplements CPPFLAGS (to find
Valgrind's header files) before doing that check.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>

Ideally, we would tell OMPI to disable autoconf's caching of our
valgrind check result so that its check gets the right result after
adding CPPFLAGS. Not sure if we can do that.

For now, just disable our Valgrind code in embedded mode.
This will keep the x86 backend enabled under Valgrind but
it will auto-disable itself when finding identical APIC ids anyway
(because CPUID returns same outputs for all PUs).

Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>

Fixes open-mpi/ompi#1732

(cherry picked from commit open-mpi/hwloc@8b44fb1c81)
2016-06-01 06:58:53 -07:00
Gilles Gouaillardet
57978a75d0 Merge pull request #1717 from ggouaillardet/topic/lex_cleanup
configury: clean the flex generated .c files
2016-06-01 13:06:21 +09:00
Nathan Hjelm
5d4bcce042 Merge pull request #1700 from shamisp/topic/cma_config
CMA: Fixing logic for CMA system call detection
2016-05-31 20:33:48 -06:00
Nathan Hjelm
340152a635 Merge pull request #1720 from shamisp/topic/vader/max_addr
VADER: Adjusting VADER_MAX_ADDRESS for non x86 platforms.
2016-05-31 20:33:28 -06:00
Gilles Gouaillardet
5f565dfec3 configury: clean the flex generated .c files 2016-06-01 11:13:31 +09:00
Nathan Hjelm
bf10d79914 btl/ugni: remove erroneous unlock
The endpoint lock was being released twice in mca_btl_ugni_get_ep.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-05-31 16:52:53 -06:00
Nathan Hjelm
cc96097873 btl/ugni: fix bug when attempting unaligned get on aries
This commit fixes a programming error when using an aries nic. The
documentation of ugni shows that only the local alignment restriction
for get was lifted on aries. There is still a remote address alignment
restriction.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-05-31 16:52:09 -06:00
Jeff Squyres
5cfee95ea4 hwloc1113: add missing file to Makefile.am
Lack of this file causes a failure when you run autogen.pl on a
distribution tarball.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-05-31 09:57:50 -07:00
Nathan Hjelm
60519c2b4e cma: add support for MIPS and ARM
Signed-off-by: Nathan Hjelm <hjelmn@me.com>
2016-05-30 12:13:20 -06:00
George Bosilca
d2abff583e Fix race condition during BTL TCP tear-down.
bot:label:bug
bot:assign:@hjelmn
2016-05-30 10:47:14 -05:00
Jeff Squyres
e126d2cd18 Merge pull request #1584 from bgoglin/master
Update hwloc to v1.11.3
2016-05-28 11:01:54 -04:00
Ralph Castain
55923eacd3 Stealing some pieces of Josh Hursey's PR #1583 and modifying a bit, allow the opal/pmix external component to handle both PMIx 1.1.4 and PMIx 2.0 versions. Automatically detect the version of the target external library and adjust the only two APIs that changed (PMIx_Init and PMIx_Finalize)
Rename temp vars in .m4 to avoid conflict with Travis
2016-05-27 08:06:31 -07:00
Nathan Hjelm
28dfa36a3f btl/ugni: fix bug when attempting unaligned get on aries
This commit fixes a programming error when using an aries nic. The
documentation of ugni shows that only the local alignment restriction
for get was lifted on aries. There is still a remote address alignment
restriction.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-05-27 08:22:13 -06:00
Nathan Hjelm
c19426ac1b btl/ugni: add support for additional atomic operations
This commit adds support for Cray Aries atomic operations. This
includes 32-bit and floating point support.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-05-27 08:22:13 -06:00
Nathan Hjelm
23fe19a956 btl: add support for more atomics
This commit adds support for more atomic operations and types. The
operations added are logical and, logical or, logical xor, swap, min,
and max. New types are 32-bit int by using the
MCA_BTL_ATOMIC_FLAG_32BIT flag, 64-bit float by using the
MCA_BTL_ATOMIC_FLAG_FLOAT flag, and 32-bit float by using both
flags. Floating point numbers are supported by packing the number in
as an int64_t or int32_t. We will update the btl interface in the
future to make this less confusing.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-05-27 08:22:13 -06:00
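Packing a floating-point operand into an integer of the same width, as 23fe19a956 describes, is usually done by copying the bit pattern. The helpers below are a hypothetical sketch of that idea; they do not show how the BTL interface or its flags consume the packed value.

```c
#include <stdint.h>
#include <string.h>

_Static_assert(sizeof(double) == sizeof(int64_t), "double must be 64 bits");
_Static_assert(sizeof(float) == sizeof(int32_t), "float must be 32 bits");

/* Copy the bit pattern; memcpy avoids strict-aliasing problems that a
 * pointer cast would introduce. */
static int64_t pack_double(double v)
{
    int64_t bits;
    memcpy(&bits, &v, sizeof(bits));
    return bits;
}

static double unpack_double(int64_t bits)
{
    double v;
    memcpy(&v, &bits, sizeof(v));
    return v;
}

static int32_t pack_float(float v)
{
    int32_t bits;
    memcpy(&bits, &v, sizeof(bits));
    return bits;
}

static float unpack_float(int32_t bits)
{
    float v;
    memcpy(&v, &bits, sizeof(v));
    return v;
}
```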
Nathan Hjelm
d25b846c01 Merge pull request #1704 from hpcraink/pr/configure_framework
Fix configure for FreePGI on OSX
2016-05-26 17:01:08 -06:00
Nathan Hjelm
8c9292d5d1 Merge pull request #1721 from hjelmn/xrc_fix
btl/openib: fix XRC WQE calculation
2016-05-26 17:00:31 -06:00
Nathan Hjelm
56bdcd0888 btl/openib: fix XRC WQE calculation
Before dynamic add_procs support was committed to master we called
add_procs with every proc in the job. The XRC code in the openib btl
was taking advantage of this and setting the number of work queue
entries (WQE) based on all the procs on a remote node. Since that is
no longer the case we can not simply increment the sd_wqe field on the
queue pair. To fix the issue a new field has been added to the xrc
queue pair structure to keep track of how many wqes there are total on
the queue pair. If a new endpoint is added that increases the number
of wqes and the xrc queue pair is already connected the code will
attempt to modify the number of wqes on the queue pair. A failure is
ignored because all that will happen is the number of active send work
requests on an XRC queue pair will be more limited.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-05-26 15:58:31 -06:00
Aurelien Bouteiller
49bd28d0ac Merge pull request #1714 from hjelmn/scif_exclusivity
btl/scif: reduce default exclusivity
2016-05-26 17:53:11 -04:00
Pavel Shamis (Pasha)
60fd25f3fb VADER: Adjusting VADER_MAX_ADDRESS for non x86 platforms.
The original VADER_MAX_ADDRESS was tuned for x86_64 platforms only.
For non x86_64 platforms we can use XPMEM_MAXADDR_SIZE.

Signed-off-by: Pavel Shamis (Pasha) <pasharesearch@gmail.com>
2016-05-26 16:38:04 -05:00
Nathan Hjelm
99627319f0 btl/ugni: reduce overhead of progress function
This commit reduces the overhead of calling the ugni progress
function. It does the following:

 - Check for new connections once every eight calls.

 - Do not call remote smsg progress unless we are connected to at
   least one remote peer.

 - Do not call rdma progress unless at least one rdma fragment is
   outstanding.

 - Check endpoint wait list size before obtaining a lock.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-05-25 14:27:34 -06:00
Nathan Hjelm
5caf12cd9b btl/scif: reduce default exclusivity
This commit reduces the default exclusivity so that btl/scif is not
used for send/recv over other shared memory transports.

Fixes open-mpi/ompi#1712

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-05-25 14:25:07 -06:00
Rainer Keller
3727cba9bb Fix compilation for FreePGI on OSX
Our checks and the ones of libevent are somewhat flawed.
If adding multiple "-framework" to CXXFLAGS or CFLAGS, we strip
the keyword from the command-line, not good.
libevent however assumes plain gcc without testing properly
that the compiler supports -Wno-deprecated-declarations.
2016-05-25 09:12:39 +02:00
Nathan Hjelm
461ca1203b Merge pull request #1703 from hjelmn/grdma_cuda_fix
rcache/grdma: fix typo in cuda code
2016-05-24 18:51:22 -06:00
bosilca
b90c83840f Refactor the request completion (#1422)
* Remodel the request.
Added the wait sync primitive and integrate it into the PML and MTL
infrastructure. The multi-threaded requests are now significantly
less heavy and less noisy (only the threads associated with completed
requests are signaled).

* Fix the condition to release the request.
2016-05-24 18:20:51 -05:00
Nathan Hjelm
af52dad8f8 rcache/grdma: fix typo in cuda code
Fixes open-mpi/ompi#1702

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-05-24 15:56:39 -06:00
Pavel Shamis (Pasha)
d984b4b3f9 CMA: Fixing logic for CMA system call detection
The OPAL_CMA_NEED_SYSCALL_DEFS is always defined/set to 0 or 1.  Therefore
instead of checking if the macro is defined, we have to look at the value
itself.

Signed-off-by: Pavel Shamis (Pasha) <pasharesearch@gmail.com>
2016-05-24 14:53:25 -05:00
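The difference d984b4b3f9 relies on, between testing whether a macro is defined and testing its value, can be seen in a few lines. `NEED_SYSCALL_DEFS` below is a hypothetical stand-in for a macro that configure always defines as 0 or 1.

```c
#define NEED_SYSCALL_DEFS 0   /* always defined; the value carries the answer */

static int check_with_ifdef(void)
{
#ifdef NEED_SYSCALL_DEFS
    return 1;   /* taken even though the value is 0 */
#else
    return 0;
#endif
}

static int check_with_if(void)
{
#if NEED_SYSCALL_DEFS
    return 1;
#else
    return 0;   /* taken, because the value is 0 */
#endif
}
```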
Ralph Castain
80f4e3b872 Fix the --tune problem by searching the argv for MCA params in advance of opal_init_util. Only search the first app_context as we historically have done - we can debate whether or not to search all app_contexts 2016-05-23 21:09:44 -07:00
Nathan Hjelm
37e9e2c660 mca/base: fix typo in flag enumeration
This commit fixes a typo in flag enumeration that can cause the parser
to miss valid flags or crash.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-05-23 12:21:34 -06:00
Artem Polyakov
725eea2819 Fix base64 implementation in pmix framework.
In the commit 80f07b65f1 setting of '-' marker used
as the string termination sign was moved from base64 code:
from: 80f07b65f1 (diff-1b10896c267d2591dc2c08fd0542ab67L491)
to: 80f07b65f1 (diff-1b10896c267d2591dc2c08fd0542ab67R189)

However the decoding function wasn't fixed and still expects one extra
byte at the end of the encoded string which leads to data truncation
during extraction (was noticed on standalone code that was using base64
from OMPI).
2016-05-23 23:30:31 +06:00
Gilles Gouaillardet
d5a2ac6f2f btl/openib: fix #if vs #ifdef 2016-05-23 14:27:33 +09:00
Gilles Gouaillardet
5a8cbe5a8f btl/openib: remove obsolete reference to MEMORY_LINUX_MALLOC_ALIGN_ENABLED macro 2016-05-23 14:12:21 +09:00
Gilles Gouaillardet
8466a3daf3 pmix: update .gitignore
git ignore opal/mca/pmix/pmix114/pmix/include/pmix/autogen/config.h.in
git rm opal/mca/pmix/pmix114/pmix/include/pmix/autogen/config.h.in
git ignore opal/mca/pmix/pmix*/...
2016-05-23 11:58:07 +09:00
Nathan Hjelm
31bfeede82 bml/r2: always add btl progress function
This commit changes the behavior of bml/r2 from conditionally
registering btl progress functions to always registering progress
functions. Any progress function belonging to a btl that is not yet in
use is registered as low-priority. As soon as a proc is added that
will make use of the btl it is re-registered normally.

This works around an issue with some btls. In order to progress a
first message from an unknown peer both ugni and openib need to have
their progress functions called. If either btl is not in use after the
first call to add_procs the callback was never happening. This commit
ensures the btl progress function is called at some point but the
number of progress callbacks is reduced from normal to ensure lower
overhead when a btl is not used. The current ratio is 1 low priority
progress callback for every 8 calls to opal_progress().

Fixes open-mpi/ompi#1676

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-05-21 15:54:04 -04:00
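The 1-in-8 ratio mentioned in 31bfeede82 can be sketched with a call counter and a mask. This is an illustration only: the names are hypothetical, and the 32-bit counter is an assumption chosen so the atomic add is available on 32-bit platforms.

```c
#include <stdint.h>

typedef int (*progress_cb_t)(void);

static uint32_t call_count = 0;

/* Run the normal callback on every call and the low-priority callback
 * on every eighth call. Returns the number of events reported. */
static int progress_once(progress_cb_t normal, progress_cb_t low_priority)
{
    int events = normal();
    uint32_t n = __atomic_add_fetch(&call_count, 1, __ATOMIC_RELAXED);
    if (0 == (n & 0x7u)) {
        events += low_priority();
    }
    return events;
}

/* Demo callbacks used only to exercise the sketch. */
static int normal_calls = 0;
static int low_calls = 0;

static int demo_normal(void)
{
    ++normal_calls;
    return 0;
}

static int demo_low(void)
{
    ++low_calls;
    return 0;
}
```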
Ralph Castain
4e0749f03d Remove verbose error messages 2016-05-20 10:04:26 -07:00
Ralph Castain
42ecffb6d0 Move the registration of MCA params out of the init of the var system - put them in with the rest of the OPAL MCA param registrations
Take another shot at untangling the spaghetti

orterun: fix for command line parsing

orte-submit calls opal_init_util () before parsing out MCA command line
options (-mca, -am, etc). This prevents mpirun from setting opal MCA
variables for some frameworks as well as the MCA base. This is because
when a framework is opened all of its variables are set to read-only.
Eventually we want to lift this restriction on some MCA variables but
since -mca is affected we must parse out the MCA command line options
before opal_init_util(). This commit fixes the bug by adding a new
option to opal_cmd_line_parse (ignore unknown option) so orte-submit
can pre-parse the command line for MCA options.

Signed-off-by: Nathan Hjelm <hjelmn@me.com>

Minor cleanups to avoid releasing/recreating the cmd line
2016-05-20 09:59:50 -07:00
Brice Goglin
ca621330a6 Update hwloc to v1.11.3
Remove contrib/windows/
Merge hwlocXYZ/hwloc/README-ompi.txt back into hwlocXYZ/README-ompi.txt instead of having both.
Add README.txt in new automake-required directory contrib/systemd/

Keep the following patches applied since they are not in 1.11.3
    linux: actually enable libudev based on the result of AC_CHECK_LIB
    (cherry picked from open-mpi/hwloc@9549fd59af)
    configure: check the actual may_alias syntax that we use
    (cherry picked from open-mpi/hwloc@0ab7af5e90)
2016-05-20 07:20:16 +02:00
Gilles Gouaillardet
5ec1eedbae Merge pull request #1682 from ggouaillardet/topic/fix-ethtool-again
opal/util/ethtool: use system ethtool_cmd_speed when available
2016-05-20 10:30:43 +09:00
Gilles Gouaillardet
cbbdce05b1 pmix/pmix114: silence a warning 2016-05-20 09:35:26 +09:00
Gilles Gouaillardet
ed3fd1775f rcache/grdma: silence a warning 2016-05-20 09:30:29 +09:00
Gilles Gouaillardet
a01a5487a8 opal/util/ethtool: use system ethtool_cmd_speed when available
Refs: open-mpi/ompi#1679
2016-05-20 09:05:09 +09:00
rhc54
99d3c283f5 Merge pull request #1681 from rhc54/topic/pmixupdate
Update PMIx 114 to current release candidate
2016-05-19 13:50:16 -07:00
Ralph Castain
6f743f81b6 Update PMIx 114 to current release candidate 2016-05-19 12:55:05 -07:00
Jeff Squyres
87233aae49 ethtool: better handle portability
Be sure to handle the case where we don't have ethtool support at all.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-05-19 10:57:14 -07:00
Gilles Gouaillardet
fd93d236b1 opal/util/ethtool: fix compilation on older Linux when struct ethtool_cmd has no speed_hi field
Refs: open-mpi/ompi#1628
2016-05-19 11:58:04 +09:00
Jeff Squyres
66f53ec29a Merge pull request #1628 from kmroz/wip-btl-tcp-ethtool-speed
btl/tcp: autodetect bandwidth and latency if unset by the user
2016-05-18 12:12:55 -04:00
Nathan Hjelm
9371a6a52d Merge pull request #1673 from hjelmn/fix_rcache_deadlock
rcache: fix deadlock in multi-threaded environments
2016-05-18 08:32:21 -07:00
Karol Mroz
ca6ddf3270 btl/tcp: autodetect bandwidth and latency if unset
Fixes open-mpi/ompi#120

Signed-off-by: Karol Mroz <mroz.karol@gmail.com>
2016-05-18 16:25:52 +02:00
Karol Mroz
b9c6c43c6b btl/tcp: add default defines for bandwidth and latency
Signed-off-by: Karol Mroz <mroz.karol@gmail.com>
2016-05-18 16:25:52 +02:00
Karol Mroz
31e33a64f9 opal/util: add function to obtain interface speed
If kernel ethtool_cmd_speed() is not available, use copies if possible.

Signed-off-by: Karol Mroz <mroz.karol@gmail.com>
2016-05-18 16:25:51 +02:00
Nathan Hjelm
ab8ed177f5 rcache: fix deadlock in multi-threaded environments
This commit fixes several bugs in the registration cache code:

 - Fix a programming error in the grdma invalidation function that can
   cause an infinite loop if more than 100 registrations are
   associated with a munmapped region. This happens because the
   mca_rcache_base_vma_find_all function returns the same 100
   registrations on each call. This has been fixed by adding an
   iterate function to the vma tree interface.

 - Always obtain the vma lock when needed. This is required because
   there may be other threads in the system even if
   opal_using_threads() is false. Additionally, since it is safe to do
   so (the vma lock is recursive) the vma interface has been made
   thread safe.

 - Avoid calling free() while holding a lock. This avoids race
   conditions with locks held outside the Open MPI code.

Fixes open-mpi/ompi#1654.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-05-17 09:02:40 -06:00
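The "avoid calling free() while holding a lock" item in ab8ed177f5 follows a common pattern: unlink under the lock, release the lock, then free. The sketch below shows that pattern on a plain linked list with a pthread mutex. It is not the rcache code, and every name in it is hypothetical.

```c
#include <pthread.h>
#include <stdlib.h>

struct reg {
    struct reg *next;
    int invalid;
};

static pthread_mutex_t cache_lock = PTHREAD_MUTEX_INITIALIZER;
static struct reg *cache_head = NULL;

/* Insert a registration at the head of the cache. */
static int cache_add(int invalid)
{
    struct reg *r = malloc(sizeof(*r));   /* allocate outside the lock too */
    if (NULL == r) {
        return -1;
    }
    r->invalid = invalid;
    pthread_mutex_lock(&cache_lock);
    r->next = cache_head;
    cache_head = r;
    pthread_mutex_unlock(&cache_lock);
    return 0;
}

/* Unlink every invalid registration while holding the lock, then free
 * them after the lock is released. Returns the number freed. */
static int cache_purge_invalid(void)
{
    struct reg *garbage = NULL;
    int freed = 0;

    pthread_mutex_lock(&cache_lock);
    struct reg **link = &cache_head;
    while (*link != NULL) {
        struct reg *cur = *link;
        if (cur->invalid) {
            *link = cur->next;     /* unlink from the cache */
            cur->next = garbage;   /* park on a private list */
            garbage = cur;
        } else {
            link = &cur->next;
        }
    }
    pthread_mutex_unlock(&cache_lock);

    while (garbage != NULL) {
        struct reg *next = garbage->next;
        free(garbage);             /* no lock held here */
        garbage = next;
        ++freed;
    }
    return freed;
}
```

Keeping `free()` outside the critical section matters when the allocator can call back into code that takes the same lock, or takes a lock of its own in a conflicting order.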
Nathan Hjelm
f6938868bd Merge pull request #1659 from hjelmn/sync_64
sync_builtin: check for 64-bit atomic support
2016-05-17 05:40:04 -07:00
rhc54
8b534e9897 Merge pull request #1668 from rhc54/topic/slurm
When direct launching applications, we must allow the MPI layer to pr…
2016-05-16 12:23:19 -07:00
Howard Pritchard
1a676e5b35 pmix/cray: fix some breakage
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2016-05-16 12:45:05 -05:00
Gilles Gouaillardet
4e21933a74 memory/patcher: declare __curbrk as extern in order not to generate an (uninitialized) common symbol 2016-05-16 09:30:11 +09:00
Ralph Castain
01ba861f2a When direct launching applications, we must allow the MPI layer to progress during RTE-level barriers. Neither SLURM nor Cray provide non-blocking fence functions, so push those calls into a separate event thread (use the OPAL async thread for this purpose so we don't create another one) and let the MPI thread spin in wait_for_completion. This also restores the "lazy" completion during MPI_Finalize to minimize cpu utilization.
Update external as well

Revise the change: we still need the MPI_Barrier in MPI_Finalize when we use a blocking fence, but do use the "lazy" wait for completion. Replace the direct logic in MPI_Init with a cleaner macro
2016-05-14 16:37:00 -07:00
Gilles Gouaillardet
456b73da69 btl/openib: fix error path in init_one_device()
do not explicitly release ib verbs components since they will
be released in the object destructor

Thanks Durga for the report
2016-05-13 09:03:48 +09:00
Gilles Gouaillardet
5dae7a47ff amd64/atomic: silence warnings
Solaris Studio compilers issue (tons of) warnings because one argument of several __asm__ __volatile__ sections is not needed
2016-05-11 11:26:50 +09:00
Jeff Squyres
30f913f217 Merge pull request #1652 from jsquyres/pr/remove-aix-timer
timer/aix: remove stale code
2016-05-10 15:47:02 -04:00
Jeff Squyres
eccf0ff4cd hwloc/external: set WRAPPER_EXTRA_* vars in proper location
WRAPPER_EXTRA flags are checked *before* the POST_CONFIG macro is
invoked.  So set them in the main CONFIG macro.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-05-10 07:34:56 -07:00
Josh Hursey
44d95cb610 Merge pull request #1657 from bgoglin/hwloc-for-2.0
configure: check the actual may_alias syntax that we use
2016-05-09 13:37:08 -05:00
Ralph Castain
7767882346 Per user request, add some missing data and definitions:
OPAL_PMIX_UNIV_RANK - synonym for OPAL_PMIX_GLOBAL_RANK
OPAL_PMIX_APP_SIZE - #ranks in the application of this proc
2016-05-09 08:39:01 -07:00
Nathan Hjelm
d99a9786b6 sync_builtin: check for 64-bit atomic support
This commit adds an additional check for 64-bit atomic support for __sync
builtins. If 64-bit support is not available the opal_atomic_*_64 atomics
are disabled.

Signed-off-by: Nathan Hjelm <hjelmn@me.com>
2016-05-09 03:17:51 -06:00
Brice Goglin
6839d928c2 configure: check the actual may_alias syntax that we use
xlc 13.1.0 crashes because of our may_alias attributes in nolibxml.c
on Power7. libxml.c and nolibxml.c are the only may_alias users for now,
so change our configure check to match the actual code using it.

Thanks to Paul Hargrove for reporting and debugging the issue,
and providing the patch.

https://www.open-mpi.org/community/lists/devel/2016/05/18918.php

(cherry picked from open-mpi/hwloc@0ab7af5e90)
2016-05-08 22:22:30 +02:00
Ralph Castain
7594b95e4b Ensure the hwloc external header is included when --with-devel-headers is given 2016-05-08 10:18:14 -07:00
Jeff Squyres
acbd2c608d memory/patcher: check for <sys/syscall.h>
Thanks to Paul Hargrove for reporting.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-05-07 09:48:14 -07:00
Jeff Squyres
b4982d7725 timer/aix: remove stale code
Per discussion on the mailing list and with IBM, remove the AIX timer
code (since AIX is no longer supported).

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-05-07 09:31:34 -07:00
Ralph Castain
7e5ef6a240 Fix the env_list support - the MCA param was being set way too early, so provide a "backdoor" way of providing the value 2016-05-06 15:38:39 -07:00
Ralph Castain
58dd41facf Repair the processing of cmd line options that mapped to MCA params. This was responsible for breaking things like map-by <foo>.
Remove debug, let orterun send terminate cmd to DVM

Recover the DVM support
2016-05-06 13:14:03 -07:00
Josh Hursey
35ae7e33d7 Merge pull request #1639 from jjhursey/topic/dl-open-null-fname
dl/dlopen/libltdl: Allow opal_dl_open to take a NULL filename.
2016-05-05 22:15:46 -05:00
Ralph Castain
8ec1891d11 Silence warning 2016-05-05 20:04:10 -07:00
Ralph Castain
08022d7af1 Some minor cleanups of warnings from gcc 6.0.0. Update s1/s2 pmix to get max_procs as required. 2016-05-05 15:28:13 -07:00
Joshua Hursey
677178f206 dl/dlopen/libltdl: Allow opal_dl_open to take a NULL filename. 2016-05-05 17:07:26 -04:00
Nathan Hjelm
80f45925bc Merge pull request #1629 from hjelmn/new_hooks_update
New hooks update
2016-05-04 18:53:25 -06:00
Joshua Hursey
788cf1a9fe asm/powerpc: Fix empty colon list in asm for XL compiler on power
Thanks to Paul Hargrove for reporting the problem and submitting a patch.
 * https://www.open-mpi.org/community/lists/devel/2016/05/18886.php
2016-05-04 14:14:33 -05:00
Nathan Hjelm
ff2a54bd37 patcher/linux: code cleanup
Update based on cleanup made to the upstream version on OpenUCX.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-05-04 12:53:45 -06:00
Nathan Hjelm
6c9a0e1c55 patcher/overwrite: disable ia64 support for now
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-05-04 12:53:24 -06:00
Nathan Hjelm
6ad68da407 patcher/linux: disable the linux patcher component
This commit disables the linux patcher component due to a limitation
in loader patching. While this component is effective in patching
calls made within Open MPI and by the application it fails to hook
calls made within glibc. This means the munmap call made by free is
not correctly hooked. Until this problem can be resolved this
component will remain disabled. If it can't be resolved this component
should probably be removed.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-05-04 12:50:51 -06:00
Nathan Hjelm
71be36d380 patcher: fix ppc32 support
The table of contents (TOC) code appears to apply only to
ppc64. The code was incorrectly assuming the existence of the TOC on
ppc32. This commit updates the necessary code to only apply to ppc64.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-05-04 12:50:32 -06:00
Nathan Hjelm
41f00b7465 memory/patcher: initialize patcher framework when needed
This commit moves the patcher framework initialization to the
memory/patcher component.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-05-04 12:46:42 -06:00
Nathan Hjelm
0f54a95408 Merge pull request #1626 from hjelmn/vader_32
btl/vader: fix compilation on 32-bit systems
2016-05-03 16:39:46 -06:00
Nathan Hjelm
4a740e9f27 Merge pull request #1619 from hjelmn/ext_verbs_fix
btl/openib: fix check for exp verbs struct members
2016-05-03 14:16:17 -06:00
Nathan Hjelm
e7ccbdee27 btl/vader: fix compilation on 32-bit systems
This commit fixes a compile/link issue caused by vader. The vader btl
was using OPAL_THREAD_ADD64 to increment a counter which may not be
available on 32-bit systems. Changed to use OPAL_THREAD_ADD_SIZE_T
which will be 64-bit or 32-bit depending on the system.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-05-03 10:14:44 -06:00
Nathan Hjelm
2d0e2b6233 patcher: do not clobber ebx
ebx cannot be clobbered when using -fPIC so save and restore the
register instead of allowing it to be clobbered.

Signed-off-by: Nathan Hjelm <hjelmn@me.com>
2016-05-03 08:24:33 -06:00
Brice Goglin
a2a721f961 linux: actually enable libudev based on the result of AC_CHECK_LIB
instead of doing AC_CHECK_HEADERS+AC_CHECK_LIB and only using the result of the former.

Thanks to Paul Hargrove for reporting the issue (OMPI build with -m32).

(cherry picked from open-mpi/hwloc@9549fd59af)
2016-05-03 10:00:40 +02:00
Nathan Hjelm
da695a6ce6 Merge pull request #1618 from hjelmn/new_hooks_update
More hook updates
2016-05-02 18:12:50 -06:00
Nathan Hjelm
a65af6d079 btl/openib: fix check for exp verbs struct members
This commit fixes a compilation issue with some versions of exp
verbs. In some cases struct ibv_exp_device_attr does not have either
the exp_atom or exp_atomic_cap fields. It is fine to drop one check
and fall back to the non-exp attribute check on the other.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-05-02 17:13:33 -06:00
Nathan Hjelm
1ff79656dd patcher: remove debug fprintf
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-05-02 17:11:00 -06:00
Nathan Hjelm
581e47c271 patcher: check for clflush
Add a feature check for clflush before trying to use the clflush
instruction. As far as I can tell there is no equivalent before the
SSE2 instruction set.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-05-02 17:10:42 -06:00
Nathan Hjelm
67fd6fa6eb Merge pull request #1615 from hjelmn/new_hooks_update
memory/patcher: add #if check for MREMAP_FIXED
2016-05-02 16:26:58 -06:00
Nathan Hjelm
eb14b34f04 memory/patcher: fix compilation on BSDs
The function signature of mremap on BSD (NetBSD, FreeBSD) differs from
the linux version. Added support for the BSD style of mremap.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-05-02 14:54:08 -06:00
Nathan Hjelm
52edb43bdc memory/patcher: check for linux/mman.h
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-05-02 14:29:46 -06:00
Nathan Hjelm
f8b3be6236 patcher/overwrite: fix ia64 compilation
Fixed a couple of typos in ia64 code.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-05-02 14:10:34 -06:00
Nathan Hjelm
14c34ae9f0 memory/patcher: add #if check for MREMAP_FIXED
This commit fixes a compile error when the system has mremap but not
MREMAP_FIXED. In this case we do not care about the value of
new_address as the argument does not exist.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-05-02 13:58:51 -06:00
rhc54
648043597a Merge pull request #1612 from ggouaillardet/poc/pmix_external_configury
pmix/external: revamp external pmix package detection
2016-05-02 09:46:05 -07:00
Jeff Squyres
265e5b9795 Merge pull request #1552 from kmroz/wip-hostname-len-cleanup-1
ompi/opal/orte/oshmem/test: max hostname length cleanup
2016-05-02 09:44:18 -04:00
Gilles Gouaillardet
45f9a47d77 pmix/external: fix typo and silence a warning 2016-05-02 17:15:52 +09:00
Gilles Gouaillardet
08d91b9a03 pmix/external: revamp external pmix package detection 2016-05-02 16:23:31 +09:00
Ralph Castain
6ac7929bd0 Extend the schizo framework to allow definition of CLI options by environment. Refactor orterun to mesh with the orted_submit code, thus improving code reuse. Eliminate the orte-submit tool as orterun can now meet that need.
Cleanups per @jjhursey review
2016-05-01 11:30:25 -07:00
George Bosilca
3445577f4c Avoid race conditions during BTL TCP handshake.
In some rare cases when a process receives the connect ack while
locally updating the peer endpoint structure, we could drop the
incoming connect ack due to the fact that the send handler is
protected with a try lock (on the endpoint) and our initial send
event was not persistent. Making the send event persistent solves
all issues.
2016-05-01 14:19:29 -04:00
George Bosilca
702f80ad7e Remove "signed vs. unsigned" warnings. 2016-05-01 11:45:48 -04:00
Ralph Castain
42d9d861fc Fix minor typo in PMIx packing of pmix_app_t - thanks to Gilles for pointing it out 2016-04-29 08:55:46 -07:00
Howard Pritchard
f52dd511d4 Merge pull request #1600 from hppritcha/topic/pmix_fix_for_finalize
pmix/cray: set fence_nb to NULL
2016-04-28 13:50:15 -06:00
hppritcha
aa1d7b9c50 pmix/cray: set fence_nb to NULL
Rather than have a stub function for the pmix fence_nb
operation, just set to NULL.  Causes fewer problems.

Fixes #1597
Fixes #1527

Signed-off-by: hppritcha <howardp@lanl.gov>
2016-04-28 13:48:54 -05:00
Nysal Jan K.A
18cf65dc24 Remove a stray print statement 2016-04-28 18:00:52 +05:30
Nathan Hjelm
03f4a854cb btl/tcp: fix add_procs race condition
This commit fixes a race between a thread calling the tcp btl's
add_procs and a thread processing an incoming connection. The race
occurred because the add_procs thread adds a newly created proc object
to the hash table *before* the object is fully initialized. The
connection thread then attempts to use the object before the endpoints
array on the object has been allocated. The fix is to only add the
proc to the hash table after it has been completely initialized.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-04-27 10:24:39 -06:00
Nathan Hjelm
8f93b15e90 Merge pull request #1580 from hjelmn/new_hooks_update
memory/patcher: cast away const in shmdt hook
2016-04-26 17:48:01 -06:00
Nathan Hjelm
df194087c7 Merge pull request #1591 from hjelmn/rcache_update
rcache: fix leave_pinned failure path
2016-04-26 16:50:06 -06:00
Nathan Hjelm
25a97af695 rcache: fix leave_pinned failure path
This commit fixes an error in the failure path of leave_pinned. When
the rcache tries to enable leave_pinned but leave_pinned was not
specifically requested (opal_leave_pinned == -1) the code was
erroneously printing an error and returning NULL.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-04-26 14:39:23 -06:00
Ralph Castain
02876564d4 Silence warning of zero-byte malloc 2016-04-26 11:55:59 -07:00
Nathan Hjelm
5612998d21 memory/patcher: cast away const in shmdt hook
The opal_mem_hooks_release_hook does not have const on the pointer
(though it probably should). This commit eliminates a warning by
casting away the const until opal_mem_hooks_release_hook is updated to
use const.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-04-25 15:32:11 -06:00