1
1
Граф коммитов

4452 Коммитов

Автор SHA1 Сообщение Дата
Nathan Hjelm
27a2509fec Merge pull request #2051 from hjelmn/ppc_asm
opal/asm: updates to powerpc assembly
2016-09-06 15:13:28 -06:00
Jeff Squyres
527efec4fb Merge pull request #2050 from jsquyres/pr/btl-tcp-help-messages
Add a show_help message to TCP BTL when peer unexpectedly disconnects
2016-09-06 09:40:31 -04:00
Jeff Squyres
1953e3406f btl/tcp: add show_help message when peer hangs up
We commonly see messages on the users list where a peer has hung up
because it has crashed.  Instead of having just a BTL_ERROR message,
make this a real opal_show_help() message that tells the user that the
peer unexpectedly hung up, and they should look into *why* that peer
hung up.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-09-06 09:40:03 -04:00
Gilles Gouaillardet
894be7860a gcc_builtin/atomic: Silence numerous warnings from Studio compilers
This commit adds selective use of a compiler-specific pragma to
silence the numerous warnings the Sun/Oracle/Studio compilers emit for
the GNU-style inline asm used in atomic.h.

Thanks Paul Hargrove for the initial patch and the guidance.
2016-09-06 09:07:16 +09:00
Gilles Gouaillardet
4b208e4463 btl/tcp: make mca_btl_tcp_proc_insert re-entrant
otherwise bad things happen with
 --mca btl_tcp_progress_thread 1 (non default)
and
 --mca mpi_add_procs_cutoff 0 (default)
2016-09-05 15:57:34 +09:00
Artem Polyakov
dc0ab674de Add PMIx key to provide RM with ability to indicate that it will cleanup
session directories provided at through OPAL_PMIX_TMPDIR,
OPAL_PMIX_NSDIR, OPAL_PMIX_PROCDIR
2016-09-05 07:48:44 +03:00
Nathan Hjelm
a36bdfe69f opal/asm: updates to powerpc assembly
This commit contains the following changes:

 - There is a bug in the PGI 16.x betas for ppc64 that causes them to
   emit the incorrect instruction for loading 64-bit operands. If not
   cast to void * the operands are loaded with lwz (load word and
   zero) instead of ld. This does not affect optimized mode. The work
   around is to cast to void * and was implemented similar to a
   work-around for a xlc bug.

 - Actually implement 64-bit add/sub. These functions were missing and
   fell back to the less efficient compare-and-swap implementations.

Thanks to @PHHargrove for helping to track this down. With this update
the GCC inline assembly works as expected with pgi and ppc64.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-09-02 23:47:47 -06:00
Jeff Squyres
95c6f6cfc0 btl/tcp: fix help message
It looks like one help message was accidentally pasted in the middle
of another.  Disentangle the two messages from each other, and
slightly tweak the one message to say that the job may also crash (in
addition to hanging).

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-09-02 17:14:22 -04:00
Nathan Hjelm
f93c1f2106 btl/ugni: fix erroneous warning message
This commit prevents the connection code from trying to connect an
endpoint if the directed datagram has been posted but not received.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-09-02 09:17:44 -06:00
Ralph Castain
34f04a7924 Remove spurious Makefile.am line 2016-09-01 15:31:09 -07:00
Ralph Castain
0ea1cff733 Implement notification of completion on comm_spawn'd child jobs. Add a configure flag to enable PMIx 3's shared memory datastore, and set it disable by default so that comm_spawn functions again. Will reverse the default once that feature is fully functional 2016-09-01 13:10:10 -07:00
rhc54
39d086e000 Merge pull request #2035 from rhc54/topic/memprofile
Provide a mechanism for obtaining memory profiles of daemons and application profiles for use in studying our memory footprint
2016-08-31 14:06:48 -05:00
Ralph Castain
39992d1ad7 Silence trivial Coverity warnings 2016-08-31 09:42:33 -07:00
Ralph Castain
c1050bc01e Provide a mechanism for obtaining memory profiles of daemons and application profiles for use in studying our memory footprint. Setting OMPI_MEMPROFILE=N causes mpirun to set a timer for N seconds. When the timer fires, mpirun will query each daemon in the job to report its own memory usage plus the average memory usage of its child processes. The Proportional Set Size (PSS) is used for this purpose. 2016-08-31 09:32:07 -07:00
Ralph Castain
cfa784c9a6 Since we changed storage to pointers in pmix_value_t, we need to allocate space for those values when unpacking 2016-08-29 20:22:24 -07:00
George Bosilca
a6d515ba9e Fixes opal_atomic_ll_64. Thanks to Paul Hardgrove for
the report and his patch.

This is an addition to #1140 and should go in 2.x
2016-08-27 12:43:48 -04:00
Nathan Hjelm
d33204b0dc Merge pull request #2021 from hjelmn/xlc_fix
opal/patcher: fix xlc support
2016-08-26 18:15:41 -06:00
rhc54
b90a64e734 Merge pull request #2022 from rhc54/topic/nnodes
Provide the number of nodes in the job
2016-08-26 18:15:24 -05:00
Ralph Castain
2f6e0fec90 Provide the number of nodes in the job 2016-08-26 14:50:41 -07:00
Jeff Squyres
09ad7e81eb Merge pull request #2007 from jsquyres/pr/usnic-show-local-udp-ports
usnic: show the local UDP ports
2016-08-26 17:03:16 -04:00
Nathan Hjelm
a9bc692d99 opal/patcher: fix xlc support
The xlc compiler seems to behave in a different way that gcc when it
comes the inline asm. There were two problems with the code with xlc:

 - The TOC read in mca_patcher_base_patch_hook used the syntax
   register unsigned long toc asm("r2") to read $r2 (the TOC
   pointer). With gcc this seems to behave as expected but with xlc
   the result in toc is not the same as $r2. I updated the code to use
   asm volatile ("std 2, %0" : "=m" (toc)) to load the TOC pointer.

 - The OPAL_PATCHER_BEGIN macro is meant to be the first thing in a
   hook. On PPC64 it loads the correct TOC pointer (thanks to
   mca_patcher_base_patch_hook) and saves the old one. The
   OPAL_PATCHER_END macro restores the TOC pointer. Because we *need*
   the TOC to be correct before it is accessed in the hook the
   OPAL_PATCHER_BEGIN macro MUST come first. We did this and all was
   well with gcc. With xlc on the other hand there was a TOC access
   before the assembly inserted by OPAL_PATCHER_BEGIN. To fix this
   quickly I broke each hook into a pair of function with the
   OPAL_PATCHER_* macros on the top level functions. This works around
   the issue but is not a clean way to fix this. In the future we
   should 1) either update overwrite to not need this, or 2) figure
   out why xlc is not inserting the asm before the first TOC read.

This fixes open-mpi/ompi#1854

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-08-26 14:43:03 -06:00
Jeff Squyres
87a5ccc060 usnic: show the local UDP ports
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-08-26 12:25:18 -07:00
Jeff Squyres
e03a40a0e9 pmix3x: remove generated file
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-08-26 10:30:47 -07:00
Jeff Squyres
9ae51a09f2 Merge pull request #1989 from jsquyres/pr/update-usnic-to-libfabric-v1.4
Update usnic BTL to libfabric v1.4
2016-08-26 09:53:07 -04:00
Gilles Gouaillardet
e4bf915e75 pmix3x: remove auto-generated file
remove opal/mca/pmix/pmix3x/pmix/src/include/pmix_config.h.in
.gitignore is correct, so it seems this file was added before .gitignore was updated
2016-08-26 15:00:18 +09:00
Ralph Castain
af67f16422 Update configury to support multiple PMIx versions, rename pmix2x component to pmix3x for support of PMIx master
Update support for external v1.1.x and v2.x libraries. Minor corrections to the v3.x component
2016-08-25 18:19:05 -07:00
Gilles Gouaillardet
277c319389 opal/util: fix (again and again) incorrect type casting in opal_path_df
and silence CID 1371767

this fixes previous commits :
 - open-mpi/ompi@2eec8970ff
 - open-mpi/ompi@a439afce5b
2016-08-26 09:42:45 +09:00
Nathan Hjelm
de32c779e2 opal/wait_sync: add #if protection on header
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-08-25 14:31:52 -06:00
Jeff Squyres
f56b16f079 usnic: remove unused variable
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-08-25 03:53:18 -07:00
Jeff Squyres
9717bcb7e6 btl/usnic: remove stale comment
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-08-25 03:53:18 -07:00
Jeff Squyres
6f5e377fe0 btl/usnic: update for libfabric v1.4
With libfabric v1.4, the usnic provider changed the values of its
fabric and domain name strings (compared to libfabric <v1.4).  Update
the Open MPI usNIC BTL to handle both pre-v1.4 and v1.4 fabric/domain
names.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-08-25 03:53:17 -07:00
George Bosilca
3adff9d323 Fixes #1793.
Reshape the tearing down process (connection close) to prevent race
conditions between the main thread and the progress thread.

Minor cleanups.
2016-08-24 22:45:19 -04:00
Nathan Hjelm
83062db7cb btl/ugni: actually make the endpoint lock recursive
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-08-24 10:36:08 -06:00
Gilles Gouaillardet
2eec8970ff opal/util: fix (again) incorrect type casting in opal_path_df
this fixes previous commit open-mpi/ompi@a439afce5b
2016-08-24 12:50:15 +09:00
Gilles Gouaillardet
02847d9e7b pmix2x: dstore: add missing <fcntl.h> include file in pmix_esh.c
(back-ported from upstream pmix/master@5c66ffe0f0)
2016-08-24 11:18:46 +09:00
Gilles Gouaillardet
c11e8163f8 pmix2x: sec/native: fix the pmix_native module under solaris by using getpeerucred()
and fail with a user friendly message if no method is available:
"sec: native cannot validate_cred on this system"

(back-ported from upstream pmix/master@c474a1fc60)
2016-08-24 11:18:40 +09:00
Gilles Gouaillardet
e91292aa41 pmix2x: configury: add missing check for <netdb.h> header file
(back-ported from upstream pmix/master@e54ce6d423)
2016-08-24 11:18:32 +09:00
Gilles Gouaillardet
a439afce5b opal/util: fix incorrect type casting in opal_path_df 2016-08-24 10:26:13 +09:00
Potnuri Bharat Teja
9b7f9ece20 Add Chelsio T6 adapter device parameters.
Signed-off-by: Potnuri Bharat Teja <bharat@chelsio.com>
2016-08-23 10:38:13 +05:30
Ralph Castain
639dbdb7ea For maintainability, fold the external PMIx 2.x integration into the internal PMIx 2.x library component. This ensures that we always stay in sync with the two as that is becoming a problem. 2016-08-22 13:28:55 -07:00
George Bosilca
fd57f5bccd Remove some of the clang warnings. 2016-08-20 14:21:42 -04:00
Ralph Castain
61ffba668b Roll in the latest PMIx version - includes shared memory datastore and reduced memory footprint 2016-08-20 07:53:06 -07:00
Artem Polyakov
6ea8cccdab Merge pull request #1969 from artpol84/pmix_jobid_fix
Pmix jobid fix
2016-08-18 17:24:58 +07:00
Ralph Castain
7da9793fef Support the PMIX_TIMEOUT key at the PMIx server when timeout=0 - this indicates that the user doesn't want a lookup of any data from the host RM. 2016-08-17 16:26:58 -05:00
Gilles Gouaillardet
3126ff77e2 pmix2x: common syms: whitelist bison-generated common symbols
Bison generates some common symbols that we can't do anything about,
so whitelist them.
2016-08-16 11:29:06 +09:00
Artem Polyakov
c5a91c5c9d opal/pmix: fix pmix jobid calculation if external PMIx server is used. 2016-08-15 21:13:51 +03:00
Ralph Castain
ecbedee8bb Fix typo 2016-08-15 07:32:00 -07:00
Artem Polyakov
f3c816b52e opal/pmix: fix indentation in some files. 2016-08-15 18:21:50 +07:00
Gilles Gouaillardet
483685eb6a update .gitignore
remove autogenerated opal/mca/pmix/pmix2x/pmix/src/include/pmix_config.h.in
2016-08-15 17:00:20 +09:00
Ralph Castain
be8424b691 Provide backward compatible keys so that the non-PMIx components in the opal/pmix framework don't have to adjust as we continue to work on finalizing the PMIx reference scheme. Activate and utilize the new PMIx show_help capability to provide more meaningful error output when the server cannot start.
Add a contrib script to cleanup permissions incorrectly modified due to things like smb mounts

dd
2016-08-13 12:13:04 -07:00
rhc54
ddde154d28 Merge pull request #1962 from rhc54/topic/notify
Ensure we properly convert pmix status to ORTE state before activatin…
2016-08-13 06:59:50 -07:00
Ralph Castain
48d35a9627 Ensure we properly convert pmix status to ORTE state before activating an error state upon notification. Cleanup some conversion issues on notification info. Add a new orte_notify.c test program 2016-08-12 21:14:29 -07:00
Ralph Castain
4a4c9703a9 Setup the job list in the PMIx integration so that static ports can run 2016-08-12 13:27:10 -07:00
Ralph Castain
0e58609327 Fix a bug where we were requiring that all paths in $PATH be absolute. Some users provide relative paths in their environment, and we should respect those. 2016-08-12 11:28:57 -07:00
Ralph Castain
1d44f0c0e2 Silence Coverity warnings 2016-08-11 21:22:01 -07:00
Ralph Castain
73544d2e00 Rename symbol 2016-08-11 13:06:46 -07:00
Ralph Castain
b0cc9b0bc8 Update to latest PMIx toolext branch
Fix indentations

Update the ext20 component to match latest PMIx master.

Cleanup name conflicts and uninit vars
2016-08-11 12:29:48 -07:00
Gilles Gouaillardet
dfbf2b7be4 opal/threads: add OPAL_THREAD_SUB_SIZE_T macro
-1 is not a valid size_t, so instead of OPAL_THREAD_ADD_SIZE_T(..., -1),
simply OPAL_THREAD_SUB_SIZE_T(..., 1) and keep picky compilers happy
2016-08-10 13:37:36 +09:00
rhc54
60f789dca1 Merge pull request #1948 from rhc54/topic/pmixtool
Update to include extended tool support, new datatypes
2016-08-09 16:17:28 -07:00
Nathan Hjelm
19be439998 Merge pull request #1949 from hjelmn/ugni_fix
btl/ugni: fix another connection race
2016-08-09 08:32:40 -06:00
Nathan Hjelm
38f18eed22 Merge pull request #1941 from ggouaillardet/topic/memory_patcher_configury
configury: make memory/patcher symbol detection more robust
2016-08-09 07:06:38 -06:00
Gilles Gouaillardet
13009aa290 opal/alfg: have opal_random() wrapper always return a positive int 2016-08-09 17:12:30 +09:00
Gilles Gouaillardet
6f6b3ac68a configury: standardize memory/patcher symbol detection and make it more robust
by default, Sun compilers optimize out the original test, and hence fail detecting a symbol is missing.
2016-08-09 09:35:52 +09:00
Nathan Hjelm
adb668209b btl/ugni: fix another connection race
This commit fixes a race that can occur when two threads are in the
ugni progress function at the same time. This race occurs when one
thread calls GNI_PostDataProbeById then goes to sleep then another
thread calls GNI_PostDataProbeById then GNI_EpPostDataWaitById before
the other thread wakes up. If this happens the first thread will print
a warning on GNI_EpPostDataWaitById about no matching post.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-08-08 15:38:11 -06:00
Ralph Castain
527b5c692a Update to include extended tool support, new datatypes 2016-08-08 13:39:46 -07:00
Todd Kordenbrock
b90da992c8 Merge pull request #1895 from PDeveze/Patchs-on-btl-portals4
btl/portals4: Take into account the limitation of portals4 (max_msg_s…
2016-08-08 15:12:50 -05:00
Nathan Hjelm
5ced037488 Merge pull request #1939 from hjelmn/ugni_fix
btl/ugni: protect against re-entry and races in connections
2016-08-08 08:55:30 -06:00
Artem Polyakov
b24ec3e3b9 pmix/s2: fix indentation (only) 2016-08-06 16:31:19 +06:00
Artem Polyakov
2cb923a413 pmix/s1: fix indentation (only) 2016-08-06 16:30:45 +06:00
Artem Polyakov
8aa3ef7799 pmix/s2: fix s2 component data placement
Use wildcard for the information related to the job-level data.
Fixes s2 component with regard to PR https://github.com/open-mpi/ompi/pull/1897.
2016-08-06 15:49:16 +06:00
Artem Polyakov
81063f1717 pmix/s1: fix s1 component data placement
Use wildcard for the information related to the job-level data.
Fixes s1 component with regard to PR https://github.com/open-mpi/ompi/pull/1897.
2016-08-06 15:45:46 +06:00
Nathan Hjelm
14b36d4503 btl/ugni: protect against re-entry and races in connections
This commit fixes two issues that can occur during a connection:

 - Re-entry to connection progress from modex lookup. Added an
   additional endpoint state that will keep the code from re-entering
   the common endpoint create.

 - Fixed a race between a process posting a directed datagram through
   a send and a connection being progressed through opal_progress().
   The progress code was not obtaining the endpoint lock before
   attempting to update the endpoint. To limit the amount of code
   changed for 2.0.1 this commit makes the endpoint lock recursive. In
   a future update this may be changed.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-08-04 16:08:01 -06:00
Jeff Squyres
c42d8867e6 Merge pull request #1925 from jsquyres/pr/warnings-fixes
hwloc: fix Valgrind warning
2016-08-04 08:48:50 -07:00
Jeff Squyres
36555b7a1d Merge pull request #1933 from thananon/fix_random
Make libevent use internal random
2016-08-04 08:27:56 -07:00
Boris Karasev
9d6a4b3b2d configury/libevent: fix incorrect drop of OPAL_HAVE_WORKING_EVENTOPS
Fixes PR https://github.com/open-mpi/ompi/pull/1687
The code that sets OPAL_HAVE_WORKING_EVENTOPS for internal libevent
was executed even if the external libevent component was configured.

As the result libevent progress wasn't called in opal_progress which
for example caused ring_c to hang when pml/ob1 was used.
2016-08-04 16:37:37 +06:00
Gilles Gouaillardet
30f98cd9d0 pmix: redefine OPAL_PMIX_ARCH macro
Architecture is set by the ompi layer *after* job startup, so the key cannot
have the "pmix" prefix since optimizations in open-mpi/ompi@01a653d50a
otherwise architecture cannot be retrieved
2016-08-04 13:31:28 +09:00
Thananon Patinyasakdikul
b3e9dadff2 libevent: use opal_random() instead of rand(3)
This commits changed rand(3) and family in libevent to use internal
random function provided in opal to prevent pertubing user's random seed.

Fixes open-mpi/ompi#1877
2016-08-03 09:18:12 -07:00
Howard Pritchard
08266a1a56 mpool/hugepage mntent intro fallout
On Cray, PR #1846 introduced a double free
situation which led to all kinds of random memory
corruption problems.

This commit fixes this problem.

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2016-08-02 05:52:31 -05:00
Jeff Squyres
7bea563e02 hwloc: fix Valgrind warning
Cherry picked from open-mpi/hwloc@d4565c351e

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-08-01 18:50:40 -07:00
Gilles Gouaillardet
21e7f31dbe pmix2x: fix unpack sequence in PMIx_Get callback
first unpack the nspace (PMIX_STRING) before unpacking the various keys (PMIX_KVAL)
2016-08-01 14:21:22 +09:00
Howard Pritchard
477f6cb6a8 Merge pull request #1846 from ggouaillardet/topic/mntent
mpool/hugepage: set mntent API instead of manually parsing /proc/mounts
2016-07-31 20:17:37 -06:00
Gilles Gouaillardet
1778e5b586 atomic/sparcv9: fix a typo in the comment, no code change 2016-08-01 10:34:02 +09:00
Ralph Castain
16fccd4964 Establish a way for ORTE to tell PMIx the base tmpdir to use, and update PMIx to understand such directives 2016-07-29 09:52:36 -07:00
Nathan Hjelm
325c9ba4cc opal/thread: fix warnings
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-07-29 07:04:19 -06:00
Nathan Hjelm
1da558407c Merge pull request #1911 from hjelmn/threads
opal/thread: clean up and add additional OPAL_THREAD macros
2016-07-29 06:44:11 -06:00
Gilles Gouaillardet
273e56096b configury: capture configury command line
configury command line is quoted and made available via the OPAL_CONFIGURE_CLI macro.
it can be retrieved via {orte-info,ompi_info,oshmem_info} -c, or
{orte-info,ompi_info,oshmem_info} --all --parseable | grep ^config:cli:
2016-07-29 09:14:09 +09:00
Ralph Castain
cacb582ecd Support timeout values when performing connect/accept operations. Bump default timeout to 10 minutes so folks have time to start the partnering application 2016-07-28 14:09:06 -07:00
Nathan Hjelm
c281bd3c7f Merge pull request #1908 from hjelmn/udreg_fix
rcache/udreg: make reference count thread safe
2016-07-28 09:27:16 -06:00
Nathan Hjelm
aac611237b opal/thread: clean up and add additional OPAL_THREAD macros
This commit expands the OPAL_THREAD macros to include 32- and 64-bit
atomic swap. Additionally, macro declararations have been updated to
include both OPAL_THREAD_* and OPAL_ATOMIC_*. Before this commit the
former was used with add and the later with cmpset.

Signed-off-by: Nathan Hjelm <hjelmn@me.com>
2016-07-28 09:23:14 -06:00
Nathan Hjelm
a8c3699484 Fix performance regression caused by enabling opal thread support
This commit adds opal_using_threads() protection around the atomic
operation in OBJ_RETAIN/OBJ_RELEASE. This resolves the performance
issues seen when running psm with MPI_THREAD_SINGLE.

To avoid issues with header dependencies opal_using_threads() has been
moved to a new header (thread_usage.h). The OPAL_THREAD_ADD* and
OPAL_THREAD_CMPSET* macros have also been relocated to this header.

This commit is cherry-picked off a fix that was submitted for the v1.8
release series but never applied to master. This fixes part of the
problem reported by @nysal in #1902.

(cherry picked from commit open-mpi/ompi-release@ce91307918)

Signed-off-by: Nathan Hjelm <hjelmn@me.com>
2016-07-28 07:01:27 -06:00
Nathan Hjelm
4658b761e4 rcache/udreg: make reference count thread safe
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-07-27 13:40:35 -06:00
Nathan Hjelm
1eb4ef438e Merge pull request #1903 from hjelmn/openib_fixes
btl/openib: set send flags only after endpoint is connected
2016-07-27 09:01:49 -06:00
Howard Pritchard
1dc7e9ed8f Merge pull request #1904 from hppritcha/topic/fix_cray_srun_native_launch
pmix/cray: switch to using wildcards for some
2016-07-27 07:12:02 -06:00
Howard Pritchard
b65bbe017f pmix/cray: switch to using wildcards for some
items so that at least srun native launch on
cray works again.

More issues to fix when using alps.

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2016-07-26 17:07:58 -05:00
Nathan Hjelm
5e13e1ab7d btl/openib: set send flags only after endpoint is connected
The max inline send size on a queue pair is not available until after
the endpoint is connected. Before this commit the send flags
(including the inline flag) were set before this value was
initialized. This commit moves setting the send_flags down to
mca_btl_openib_put_internal which is only called after the endpoint is
connected. This fixes a bug when using osc/rdma.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-07-26 16:01:11 -06:00
Gilles Gouaillardet
91ccec342c btl/openib: remove some dead code
remove useless call to opal_mem_hooks_support_level() and the value local variable.
2016-07-22 09:26:33 +09:00
Gilles Gouaillardet
1b3be0ac8c configury + btl/openib: fix a typo
test for existence of struct ibv_exp_device_attr.exp_atomic_cap.
That was previously mistyped struct ibv_exp_device_attr.ext_atomic_cap
2016-07-22 09:26:33 +09:00
Ralph Castain
71de03fc67 Cleanup the new naming requirements to ensure that info is correctly retrieved
Cleanup permissions

Restore singleton operations
2016-07-21 09:46:03 -07:00
Ralph Castain
2b55ee8118 Cleanup Coverity warnings 2016-07-20 20:31:58 -07:00
Ralph Castain
01a653d50a Remove a debug print in comm_cid.c. Update PMIx2 to include the revised PMIx_Get logic for higher performance by reducing the number of hash table lookups. Fix a bug where requests for data from a proc in another nspace could hang, or result in "not found".
Remove stale file reference

Restore autogen pass thru pmix

Remove generated file
2016-07-20 00:58:19 -07:00
Pascal Deveze
6d6ec66705 btl/portals4: Take into account the limitation of portals4 (max_msg_size) 2016-07-19 15:19:29 +02:00
Nathan Hjelm
03bce91de8 pmix/pmix2x: add missing increment in loop
This commit fixes a bug in the pmix2x client code where a loop
variable is not correctly incremented. This was leading to hangs and
crashes when creating intercommunicators. Also fixed two double
increments in other loops.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-07-18 10:35:05 -06:00
Jeff Squyres
72f41d4490 pmix: replace all tabs with spaces
No code or logic changes

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-07-17 15:08:33 -04:00
Jeff Squyres
1c32742c66 pmix_ext20: fix syntax error
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-07-17 15:04:12 -04:00
Ralph Castain
99f7096031 Fix permissions 2016-07-16 21:03:55 -07:00
Ralph Castain
d4071fbd1c Fix dynamic operations by ensuring that we only fire the debugger release if the debugger is attached, and that the OPAL pmix key for directing events to non-default handlers matches the PMIx spelling 2016-07-16 13:20:41 -07:00
Ralph Castain
1ceb35ba5c Fix singletons - do not include the PMIx tool URI in the environment provided to child processes 2016-07-13 17:33:34 -07:00
Ralph Castain
20a91c2baf Add a new --continuous flag to mpirun that directs ORTE to let a job continue running as app procs terminate. Don't attempt to restart them. Add event notification of abnormally terminating procs, and demonstrate that in the mpi_spin test program.
Cleanup debug message
2016-07-13 15:28:33 -07:00
Artem Polyakov
72585a905f opal/pmix: add blocking Fence to SLURM components.
Blocking fence is used in yalla del proc. Native pmix exposes this functionality.
We need to expose it for SLURM's s1/s2 components as well.

Also this commit fixes uninitialized `rc` in fencenb's of both
components.
2016-07-11 09:43:15 +03:00
Artem Polyakov
8e16f47492 Merge pull request #1688 from artpol84/fix_base64
Fix base64 implementation in pmix framework.
2016-07-07 10:47:50 +06:00
Gilles Gouaillardet
1ba7e2b20b mpool/hugepage: set mntent API instead of manually parsing /proc/mounts
Refs open-mpi/ompi#1822
2016-07-06 15:00:19 +09:00
Gilles Gouaillardet
acda07472a configury: revamp and re-ident sub configure.m4 after open-mpi/ompi@846360fd4c 2016-07-06 11:59:51 +09:00
Gilles Gouaillardet
846360fd4c configury: correctly perform make distclean when {libevent,hwloc,pmix} are external components
Thanks Jeff for the guidance

Fixes open-mpi/ompi#1683

note:
in order to keep this commit easy to review, some AS_IF([...]) were replaced with
AS_IF([false], ...) or AS_IF_([true], ...)
these will be removed and re-idented in a subsequent commit
2016-07-06 11:57:24 +09:00
Ralph Castain
ee56d9dc1a Shorten the session directory name as some OS's are now providing unusually long temp directory names, causing us to overflow the sockaddr field 2016-07-05 14:59:50 -07:00
Ralph Castain
7e0af3f4f0 Update pmix2x to track upstream changes 2016-07-05 11:54:22 -07:00
Gilles Gouaillardet
267821f0dd pmix2x/pmix: fix a typo in PMIx_tool_init()
and remove now useless local variable i
2016-07-05 13:47:50 +09:00
Gilles Gouaillardet
efce8cc734 pmix2x/pmix: add missing include files
pmix cannot be built on alpine linux because of some missing includes.
uid_t and gid_t are defined in unistd.h or sys/types.h, and unistd.h
is not indirectly pulled under alpine linux, so do it manually.

Thanks N.L.K Nguyen for the report

(back-ported from upstream pmix/master@c8d55350a9)
2016-07-05 09:03:14 +09:00
Ralph Castain
c9ada8e095 Silence Coverity warnings 2016-07-03 20:45:08 -07:00
Ralph Castain
673f82e2b6 Update the PMIx listener to avoid leaking sockets into children, and better handle race condition errors 2016-07-03 08:23:33 -07:00
Nathan Hjelm
01d6da31af btl/openib: fix rdmacm locking bug
This commit fixes a long standing bug in rdmacm. It is required that
the thread that calls mca_btl_openib_endpoint_cpc_complete holds the
endpoint lock. This was not the case for rdmacm. This causes debug
builds to abort. This change also required changing
mca_btl_openib_endpoint_send_cts to require the endpoint lock to be
held when calling.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-06-30 15:50:07 -06:00
Nathan Hjelm
cc2b3e0c3f Merge pull request #1830 from hjelmn/rdmacm_test
Test for rdmacm hang fix
2016-06-30 10:41:46 -06:00
Nathan Hjelm
960fcd292c btl/openib: fix rdma hang
This commit is an attempt to fix a hang in finalize of rdmacm. This fixes
a path where no rdmacm client is found for an endpoint.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-06-29 20:31:26 -06:00
Ralph Castain
6e434d6785 Add support for PMIx tool connections and queries. Initially only support a request to list all known namespaces (jobids) from ORTE, but other folks will extend that support to include additional information
Update to match PMIx RFC

Fix configury to point to correct libevent and hwloc locations
2016-06-29 19:19:19 -07:00
Jeff Squyres
f18d6606da Merge pull request #1824 from hjelmn/rdmacm_fix
btl/openib: fix segmentation fault
2016-06-28 18:10:35 -04:00
Nathan Hjelm
8128c8eb29 btl/openib: fix segmentation fault
This commit fixes a segmentation fault that occurs if a device can be
initialized but not used. In this case the devices_count is not equal
to the number of usable devices in the devices pointer array.

Thanks to @artpol84 for tracking this down.

Fixes open-mpi/ompi#1823

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-06-28 10:31:32 -06:00
Nathan Hjelm
955269b4f1 Merge pull request #1816 from hjelmn/request_perfm_regression
opal/sync: fix race condition
2016-06-28 09:12:00 -06:00
Artem Polyakov
541715572f Fix MPI_Waitany and MPI_Waitsome
(request handling related)
2016-06-28 16:40:00 +03:00
Artem Polyakov
8d011ea403 Fix Mellanox copyright. 2016-06-26 21:01:19 -06:00
Nathan Hjelm
fb455f0802 opal/sync: fix race condition
This commit fixes a race condition discovered by @artpol84. The race
happens when a signalling thread decrements the sync count to 0 then
goes to sleep. If the waiting thread runs and detects the count == 0
before going to sleep on the condition variable it will destroy the
condition variable while the signalling thread is potentially still
processing the completion. The fix is to add a non-atomic member to
the sync structure that indicates another process is handling
completion. Since the member will only be set to false by the
initiating thread and the completing thread the variable does not need
to be protected. When destoying a condition variable the waiting
thread needs to wait until the singalling thread is finished.

Thanks to @artpol84 for tracking this down.

Fixes #1813

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-06-26 20:14:01 -06:00
Nathan Hjelm
dac9201f3b Merge pull request #1770 from hjelmn/rdma_wth
btl/openib: fix rdmacm
2016-06-24 22:46:53 -06:00
Nathan Hjelm
2992d6d238 Merge pull request #1808 from abjoshi-brcm/timer_arm64
arm64: add timer support
2016-06-23 07:10:56 -06:00
Abhishek Joshi
f06f7eb3e6 arm64: add timer support
Signed-off-by: Sreenidhi Bharathkar Ramesh <sreenidhi-bharathkar.ramesh@broadcom.com>
2016-06-23 11:01:00 +00:00
Ralph Castain
08b1438f15 Add missing PMIx range value so OPAL and PMIx align again 2016-06-22 22:03:25 -07:00
Nathan Hjelm
55d1933a89 opal/sync: fix warnings
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-06-22 15:03:21 -06:00
Nathan Hjelm
e4f920f6f9 opal/progress: improve performance when there are no LP callbacks
This commit adds another check to the low-priority callback
conditional that short-circuits the atomic-add if there are no
low-priority callbacks. This should improve performance in the common
case.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-06-22 09:52:37 -06:00
Nathan Hjelm
143a93f379 opal/sync: remove usage of OPAL_ENABLE_MULTI_THREADS
The OPAL_ENABLE_MULTI_THREADS macro is always defined as 1. This was
causing us to always use the multi-thread path for synchronization
objects. The code has been updated to use the opal_using_threads()
function. When MPI_THREAD_MULTIPLE support is disabled at build time
(2.x only) this function is a macro evaluating to false so the
compiler will optimize out the MT-path in this case. The
OPAL_ATOMIC_ADD_32 macro has been removed and replaced by the existing
OPAL_THREAD_ADD32 macro.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-06-22 09:52:37 -06:00
Gilles Gouaillardet
bf133c401e pmix2x: fix a typo in dereg_event_hdlr()
This bug has been fixed when open-mpi/ompi@dde69e1be2 was backported into upstream pmix in pmix/master@5e5577778c
but it was not fixed in open-mpi/ompi
2016-06-22 13:45:29 +09:00
Jeff Squyres
af614afedf Merge pull request #1800 from thananon/common_sym_fix
Fixed common symbol error in btl/usnic.
2016-06-21 20:11:52 -04:00
Ralph Castain
441739b5a4 Cleanup a lagging message that generates an annoying (but seemingly harmless) warning 2016-06-20 12:23:27 -07:00
Thananon Patinyasakdikul
afe07cd5d5 Fixed common symbol in btl/usnic
- This commit fixes the accidental common symbol btl_usnic_lock
- It also moves the btl_usnic_lock declaration to btl_usnic.h
2016-06-20 10:05:44 -07:00
Howard Pritchard
1bed9fdb59 Merge pull request #1799 from hppritcha/topic/help_aries_with_knl
common/ugni: help out knl with aries
2016-06-20 08:09:24 -06:00
Ralph Castain
0ba02821e6 Add requested key and job-level info 2016-06-19 18:22:31 -07:00
Ralph Castain
0a29f5cb77 Sigh - missed two typos 2016-06-18 20:57:53 -07:00
Ralph Castain
dd38cf1fed Fix typo 2016-06-18 20:56:43 -07:00
Howard Pritchard
8b53487977 common/ugni: help out knl with aries
The way the gni btl is currently coded,
it will run completely out of gas on KNL at
123 processes/node.  Since there are bound to be
those who try to run a MPI process/hyperthread
on KNL nodes, the fma sharing mode needs to be requested.

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2016-06-18 15:09:05 -05:00
Ralph Castain
dde69e1be2 Cleanup CIDs 1362763, 1362762, 1362760, 1362759, 1362758, 1362757, 1362756, 1362755, 1362754. Unsure how to resolve 1362761.
Fixes #1792
2016-06-18 12:28:46 -07:00
Jeff Squyres
7a8d7fb948 openib: fix compiler warnings
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-06-18 07:15:11 -07:00
Jeff Squyres
c332ee5884 Merge pull request #1784 from thananon/fix_usnic_thread
Fix btl/usnic deadlock when the connectivity check is turned off.
2016-06-17 11:15:14 -04:00
Nathan Hjelm
f59c2fce6b Merge pull request #1786 from hjelmn/32_fix
opal/progress: use 32-bit atomics for call counter
2016-06-17 08:54:41 -06:00
Nathan Hjelm
2e4141f20a Merge pull request #1787 from hjelmn/asm_fix
opal/asm: fix syntax of timer code for ia32
2016-06-17 08:50:57 -06:00