1
1

4581 Коммитов

Автор SHA1 Сообщение Дата
Gilles Gouaillardet
7601e783cc pmix3x: sec/munge: add a missing include file
(cherry picked from upstream pmix/master@f7cfb11f6b)
2016-10-03 16:09:10 +09:00
Ralph Castain
e773c17cf3 Put show_help thru the PMIx "log" API. This pushes the show_help output from apps into the pmix thread, thus avoiding conflicts in the RML thread, which should help with thread lock situations. 2016-10-02 16:02:23 -07:00
Jeff Squyres
545d8f2e66 usnic cagent: correctly compute the "large" ping message size
The (effective) "+42" computation was, in fact, the incorrect answer
in this case (gasp!).

We should just take the max_msg_size from the command (which came from
the libfabric endpoint max_msg_size attribute in the client) and
subtract off the max header size: 68 (which is explained in the
comment).  This will result in a "large" message size which is likely
slightly smaller than the MTU, but still right up near the MTU, and
therefore good enough.

Note: the old computation (i.e., -(68-42)) worked fine when we asked
for Libfabric API v1.1 because the usnic provider would return a
max_msg_size that was already less than the MTU due to FI_PREFIX
behavior shenanigans.  Once we started asking for Libfabric API v1.4,
the usnic Libfabric provider started returning (MTU + prefix_size),
and the -(68-42) computation started giving a value that was over the
MTU.  This caused sendto() on the connectivity checker UDP socket
to fail.

This commit also removes an old/misleading comment.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-09-30 17:01:05 -07:00
Joshua Hursey
f6f24a4f67 build: Custom libmpi(_FOO) name option in configure
* Add a configure time option to rename libmpi(_FOO).*
   - `--with-libmpi-name=STRING`
 * This commit only impacts the installed libraries.
   Internal, temporary libraries have not been renamed to limit the
   scope of the patch to only what is needed.

For example:
```shell
shell$ ./configure --with-libmpi-name=wookie
...
shell$ find . -name "libmpi*"
shell$ find . -name "libwookie*"
./lib/libwookie.so.0.0.0
./lib/libwookie.so.0
./lib/libwookie.so
./lib/libwookie.la
./lib/libwookie_mpifh.so.0.0.0
./lib/libwookie_mpifh.so.0
./lib/libwookie_mpifh.so
./lib/libwookie_mpifh.la
./lib/libwookie_usempi.so.0.0.0
./lib/libwookie_usempi.so.0
./lib/libwookie_usempi.so
./lib/libwookie_usempi.la
shell$
```
2016-09-29 21:47:24 -05:00
Gilles Gouaillardet
871ade9231 pmix/{cray,s1,s2}: make pmi_opcaddy_t class static
theses three pmix components use the same class name,
declare it as static so Open MPI can be built with --disable-dlopen

Thanks Limin Gu for the report
2016-09-28 09:18:36 +09:00
Jeff Squyres
1a5a5fb400 Merge pull request #1861 from bharatpotnuri/master
btl/openib: Disqualify rdmacm CPC if MPI_THREAD_MULTIPLE
2016-09-27 13:03:35 -04:00
Potnuri Bharat Teja
740b636dbe btl/openib: Disqualify rdmacm CPC if MPI_THREAD_MULTIPLE
The rdmacm CPC in the openib BTL is not thread safe. The rdmacm CPC
should disqualify itself (instead of failing in random ways) if
MPI_THREAD_MULTIPLE is the thread level.

Signed-off-by: Potnuri Bharat Teja <bharat@chelsio.com>
2016-09-27 14:20:59 +05:30
Gilles Gouaillardet
1fbc9a5431 pmix3x: dstore/pmix: flock portability
Using the fcntl-locking instead of the flock

(back-ported from upstream pmix/master@3030a0cca1)
2016-09-27 13:21:03 +09:00
George Bosilca
066370202d Support non-monotonic assembly timers.
If monotonic support has been required by the runtime and the
assembly timers are unable to provide it, fall back to clock_gettime.
2016-09-23 21:51:34 -04:00
George Bosilca
45dcf1f5d7 Always use the best timer available
If we have better timer than clock_gettime use it, even if it an
assembly timer.
2016-09-23 19:32:58 -04:00
George Bosilca
93fa94f96f Re-enable support for local addresses.
This patch is based on the "RFC: Reenabling the TCP BTL over local
interfaces (when specifically requested)". It removes the hardcoded
exception for the local devices that has been enforced by the
TCP BTL. Instead, we exclude the local interface only via the
exclude MCA (both IPv4 and IPv6 local addresses are already in the
default if_exclude), which is also the behavior currently described in
our README file.
2016-09-23 13:04:33 -04:00
Gilles Gouaillardet
362a5886de pmix3x: client: fix PMIx_Finalize() sequence
pmix_progress_thread_finalize() invokes libevent event_base_free,
so all libevent stuff cannot be used after.
Hence, pmix_client_globals.myserver must be PMIX_DESTRUCT'ed
before invoking pmix_progress_thread_finalize()
2016-09-24 00:01:23 +09:00
Gilles Gouaillardet
5479c6cca7 pmix3x: add missing #include
and get Open MPI build on OpenBSD 6.0
2016-09-23 11:23:18 +09:00
Gilles Gouaillardet
eaee1332e1 opal/util/ethtool: add missing headers
and get Open MPI build on OpenBSD 6.0
2016-09-23 11:22:19 +09:00
Ralph Castain
a14ec3bdbc Mucho thanks to Gilles - his patch to reorder the CPPFLAGS solves the problem of inadvertently picking up hwloc and libevent headers from locations in CPPFLAGS while continuing to build the embedded versions. Also silence a minor warning about an uninitialized var. 2016-09-22 07:39:22 -07:00
George Bosilca
131fe42db8 Fix MT wait-sync.
Prevent a race condition between a thread checking count and then
going in cond_wait, and another thread setting the count to 0 and
signaling the condition.
Thanks to Pascal Deveze for catching up the bug and for
the initial patch.
2016-09-21 07:42:48 -04:00
Gilles Gouaillardet
fbf03299c3 Merge pull request #2079 from ggouaillardet/topic/pmix_configury_dlopen
pmix3x: configury: correctly handle --disable-dlopen
2016-09-21 10:59:33 +09:00
Gilles Gouaillardet
6c1e25b76e pmix/ext11: fix pmix1_value_unload() prototype and call
pmix1_value_unload() was added a "key" argument which is unused,
and pmix1_value_unload() was sometimes invoked with two arguments instead of three.

since the "key" argument is unused, simply remove it from the
subroutine prototype and calls.
2016-09-20 14:34:41 +09:00
Gilles Gouaillardet
e6f7facd7d opal/util: improve error message in opal_os_dirpath_create() 2016-09-18 17:10:47 +09:00
Gilles Gouaillardet
4b47daeeb0 opal/util: improve return status of opal_os_dirpath_create() 2016-09-18 12:32:42 +09:00
George Bosilca
295eec7059 Small fix for persistence receives.
A minor optimization, few typos and extra comments
2016-09-16 10:27:32 -04:00
Nathan Hjelm
2edc77b27b asm/ppc: work around apparent PGI 16.9 bug
The add_64, sub_64, and cmpset_64 atomics used "+m" (*addr) to
indicate the asm also writes the memory location. This is better than
using a memory clobber. PGI 16.9 introduced a bug that causes a
compiler failure on the "+m" constraint (input/output). It seems to
work with "=m" (output) which matches the 32-bit atomics.

Fixes open-mpi/ompi#2086

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-09-15 12:43:31 -06:00
Gilles Gouaillardet
041a431966 pmix3x: configury: correctly handle --disable-dlopen
the LT_* macros do overwrite the enable_dlopen variable,
so it must be tested and saved before invoking LT_INIT.
delay the invokation of the LT_* macros and use the
PMIX_ENABLE_DLOPEN_SUPPORT variable to figure out whether
--disable-dlopen was invoked
2016-09-15 13:26:20 +09:00
Nathan Hjelm
4c9e38e8e0 Merge pull request #2077 from hjelmn/tcp_fix
btl/tcp: fix double list remove
2016-09-13 12:21:52 -06:00
Nathan Hjelm
a681837ba8 btl/tcp: fix double list remove
This commit fixes an abort during finalize because pending events were
removed from the list twice.

References #2030

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-09-13 09:23:12 -06:00
Gilles Gouaillardet
628c730196 pkgconfig: define the pkgincludedir variable in *.pc files
this has been made necesarry with open-mpi/ompi@12e796dcaf

Refs open-mpi/ompi#2069
2016-09-13 09:50:14 +09:00
Artem Polyakov
9eba1b0b75 Merge pull request #2042 from artpol84/pmix_sdirs
Several fixes related to session directories:
2016-09-07 14:15:47 +07:00
Gilles Gouaillardet
cd2b5a82ed hwloc: plug memory leak
as reported by Coverity with CID 1270441
2016-09-07 10:08:44 +09:00
Gilles Gouaillardet
44a66e208c threads: fix WAIT_SYNC_INIT with a zero count
WAIT_SYNC_INIT(sync,0); WAIT_SYNC_RELEASE(sync);
hanged because sync->signaled was initialised to true, and
there is no reason to invoke WAIT_SYNC_SIGNALED(sync) before
WAIT_SYNC_RELEASE(sync)
this commit initializes sync->signaled to true unless the count is zero.

Thanks George for the review and guidance.
2016-09-07 10:03:40 +09:00
Nathan Hjelm
27a2509fec Merge pull request #2051 from hjelmn/ppc_asm
opal/asm: updates to powerpc assembly
2016-09-06 15:13:28 -06:00
Jeff Squyres
527efec4fb Merge pull request #2050 from jsquyres/pr/btl-tcp-help-messages
Add a show_help message to TCP BTL when peer unexpectedly disconnects
2016-09-06 09:40:31 -04:00
Jeff Squyres
1953e3406f btl/tcp: add show_help message when peer hangs up
We commonly see messages on the users list where a peer has hung up
because it has crashed.  Instead of having just a BTL_ERROR message,
make this a real opal_show_help() message that tells the user that the
peer unexpectedly hung up, and they should look into *why* that peer
hung up.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-09-06 09:40:03 -04:00
Gilles Gouaillardet
894be7860a gcc_builtin/atomic: Silence numerous warnings from Studio compilers
This commit adds selective use of a compiler-specific pragma to
silence the numerous warnings the Sun/Oracle/Studio compilers emit for
the GNU-style inline asm used in atomic.h.

Thanks Paul Hargrove for the initial patch and the guidance.
2016-09-06 09:07:16 +09:00
Gilles Gouaillardet
4b208e4463 btl/tcp: make mca_btl_tcp_proc_insert re-entrant
otherwise bad things happen with
 --mca btl_tcp_progress_thread 1 (non default)
and
 --mca mpi_add_procs_cutoff 0 (default)
2016-09-05 15:57:34 +09:00
Artem Polyakov
dc0ab674de Add PMIx key to provide RM with ability to indicate that it will cleanup
session directories provided at through OPAL_PMIX_TMPDIR,
OPAL_PMIX_NSDIR, OPAL_PMIX_PROCDIR
2016-09-05 07:48:44 +03:00
Nathan Hjelm
a36bdfe69f opal/asm: updates to powerpc assembly
This commit contains the following changes:

 - There is a bug in the PGI 16.x betas for ppc64 that causes them to
   emit the incorrect instruction for loading 64-bit operands. If not
   cast to void * the operands are loaded with lwz (load word and
   zero) instead of ld. This does not affect optimized mode. The work
   around is to cast to void * and was implemented similar to a
   work-around for a xlc bug.

 - Actually implement 64-bit add/sub. These functions were missing and
   fell back to the less efficient compare-and-swap implementations.

Thanks to @PHHargrove for helping to track this down. With this update
the GCC inline assembly works as expected with pgi and ppc64.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-09-02 23:47:47 -06:00
Jeff Squyres
95c6f6cfc0 btl/tcp: fix help message
It looks like one help message was accidentally pasted in the middle
of another.  Disentangle the two messages from each other, and
slightly tweak the one message to say that the job may also crash (in
addition to hanging).

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-09-02 17:14:22 -04:00
Nathan Hjelm
f93c1f2106 btl/ugni: fix erroneous warning message
This commit prevents the connection code from trying to connect an
endpoint if the directed datagram has been posted but not received.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-09-02 09:17:44 -06:00
Ralph Castain
34f04a7924 Remove spurious Makefile.am line 2016-09-01 15:31:09 -07:00
Ralph Castain
0ea1cff733 Implement notification of completion on comm_spawn'd child jobs. Add a configure flag to enable PMIx 3's shared memory datastore, and set it disable by default so that comm_spawn functions again. Will reverse the default once that feature is fully functional 2016-09-01 13:10:10 -07:00
rhc54
39d086e000 Merge pull request #2035 from rhc54/topic/memprofile
Provide a mechanism for obtaining memory profiles of daemons and application profiles for use in studying our memory footprint
2016-08-31 14:06:48 -05:00
Ralph Castain
39992d1ad7 Silence trivial Coverity warnings 2016-08-31 09:42:33 -07:00
Ralph Castain
c1050bc01e Provide a mechanism for obtaining memory profiles of daemons and application profiles for use in studying our memory footprint. Setting OMPI_MEMPROFILE=N causes mpirun to set a timer for N seconds. When the timer fires, mpirun will query each daemon in the job to report its own memory usage plus the average memory usage of its child processes. The Proportional Set Size (PSS) is used for this purpose. 2016-08-31 09:32:07 -07:00
Ralph Castain
cfa784c9a6 Since we changed storage to pointers in pmix_value_t, we need to allocate space for those values when unpacking 2016-08-29 20:22:24 -07:00
George Bosilca
a6d515ba9e Fixes opal_atomic_ll_64. Thanks to Paul Hardgrove for
the report and his patch.

This is an addition to #1140 and should go in 2.x
2016-08-27 12:43:48 -04:00
Nathan Hjelm
d33204b0dc Merge pull request #2021 from hjelmn/xlc_fix
opal/patcher: fix xlc support
2016-08-26 18:15:41 -06:00
rhc54
b90a64e734 Merge pull request #2022 from rhc54/topic/nnodes
Provide the number of nodes in the job
2016-08-26 18:15:24 -05:00
Ralph Castain
2f6e0fec90 Provide the number of nodes in the job 2016-08-26 14:50:41 -07:00
Jeff Squyres
09ad7e81eb Merge pull request #2007 from jsquyres/pr/usnic-show-local-udp-ports
usnic: show the local UDP ports
2016-08-26 17:03:16 -04:00
Nathan Hjelm
a9bc692d99 opal/patcher: fix xlc support
The xlc compiler seems to behave in a different way that gcc when it
comes the inline asm. There were two problems with the code with xlc:

 - The TOC read in mca_patcher_base_patch_hook used the syntax
   register unsigned long toc asm("r2") to read $r2 (the TOC
   pointer). With gcc this seems to behave as expected but with xlc
   the result in toc is not the same as $r2. I updated the code to use
   asm volatile ("std 2, %0" : "=m" (toc)) to load the TOC pointer.

 - The OPAL_PATCHER_BEGIN macro is meant to be the first thing in a
   hook. On PPC64 it loads the correct TOC pointer (thanks to
   mca_patcher_base_patch_hook) and saves the old one. The
   OPAL_PATCHER_END macro restores the TOC pointer. Because we *need*
   the TOC to be correct before it is accessed in the hook the
   OPAL_PATCHER_BEGIN macro MUST come first. We did this and all was
   well with gcc. With xlc on the other hand there was a TOC access
   before the assembly inserted by OPAL_PATCHER_BEGIN. To fix this
   quickly I broke each hook into a pair of function with the
   OPAL_PATCHER_* macros on the top level functions. This works around
   the issue but is not a clean way to fix this. In the future we
   should 1) either update overwrite to not need this, or 2) figure
   out why xlc is not inserting the asm before the first TOC read.

This fixes open-mpi/ompi#1854

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-08-26 14:43:03 -06:00