Jeff Squyres
95c6f6cfc0
btl/tcp: fix help message
...
It looks like one help message was accidentally pasted in the middle
of another. Disentangle the two messages from each other, and
slightly tweak the one message to say that the job may also crash (in
addition to hanging).
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-09-02 17:14:22 -04:00
Nathan Hjelm
f93c1f2106
btl/ugni: fix erroneous warning message
...
This commit prevents the connection code from trying to connect an
endpoint if the directed datagram has been posted but not received.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-09-02 09:17:44 -06:00
Ralph Castain
34f04a7924
Remove spurious Makefile.am line
2016-09-01 15:31:09 -07:00
Ralph Castain
0ea1cff733
Implement notification of completion on comm_spawn'd child jobs. Add a configure flag to enable PMIx 3's shared memory datastore, and disable it by default so that comm_spawn functions again. Will reverse the default once that feature is fully functional.
2016-09-01 13:10:10 -07:00
rhc54
39d086e000
Merge pull request #2035 from rhc54/topic/memprofile
...
Provide a mechanism for obtaining memory profiles of daemons and application profiles for use in studying our memory footprint
2016-08-31 14:06:48 -05:00
Ralph Castain
39992d1ad7
Silence trivial Coverity warnings
2016-08-31 09:42:33 -07:00
Ralph Castain
c1050bc01e
Provide a mechanism for obtaining memory profiles of daemons and application profiles for use in studying our memory footprint. Setting OMPI_MEMPROFILE=N causes mpirun to set a timer for N seconds. When the timer fires, mpirun will query each daemon in the job to report its own memory usage plus the average memory usage of its child processes. The Proportional Set Size (PSS) is used for this purpose.
2016-08-31 09:32:07 -07:00
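As a rough illustration of the PSS-based report described above, the sketch below sums the `Pss:` fields of a Linux `smaps` file and pairs a daemon's own figure with the average of its children. It assumes a Linux `/proc/<pid>/smaps` layout. The function names are hypothetical and are not taken from the Open MPI sources, which may collect the numbers differently.

```python
import re

# Matches only the plain "Pss:" field, not variants such as "Pss_Anon:".
_PSS_LINE = re.compile(r"^Pss:\s+(\d+)\s+kB$", re.MULTILINE)


def pss_kb_from_smaps(smaps_text):
    """Sum every 'Pss:' value (in kB) found in the text of a smaps file."""
    return sum(int(kb) for kb in _PSS_LINE.findall(smaps_text))


def read_pss_kb(pid="self"):
    """Read the total PSS of a process from /proc/<pid>/smaps (Linux only)."""
    with open(f"/proc/{pid}/smaps") as smaps:
        return pss_kb_from_smaps(smaps.read())


def daemon_report(daemon_pss_kb, child_pss_kb):
    """Return (daemon PSS, average child PSS); hypothetical report shape."""
    if child_pss_kb:
        average = sum(child_pss_kb) / len(child_pss_kb)
    else:
        average = 0.0
    return daemon_pss_kb, average
```

PSS charges each shared page to a process in proportion to the number of processes sharing it, so per-process PSS values can be added or averaged without counting shared libraries several times.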
Ralph Castain
cfa784c9a6
Since we changed storage to pointers in pmix_value_t, we need to allocate space for those values when unpacking
2016-08-29 20:22:24 -07:00
Nathan Hjelm
d33204b0dc
Merge pull request #2021 from hjelmn/xlc_fix
...
opal/patcher: fix xlc support
2016-08-26 18:15:41 -06:00
rhc54
b90a64e734
Merge pull request #2022 from rhc54/topic/nnodes
...
Provide the number of nodes in the job
2016-08-26 18:15:24 -05:00
Ralph Castain
2f6e0fec90
Provide the number of nodes in the job
2016-08-26 14:50:41 -07:00
Jeff Squyres
09ad7e81eb
Merge pull request #2007 from jsquyres/pr/usnic-show-local-udp-ports
...
usnic: show the local UDP ports
2016-08-26 17:03:16 -04:00
Nathan Hjelm
a9bc692d99
opal/patcher: fix xlc support
...
The xlc compiler seems to behave differently than gcc when it
comes to inline asm. There were two problems with the code with xlc:
- The TOC read in mca_patcher_base_patch_hook used the syntax
register unsigned long toc asm("r2") to read $r2 (the TOC
pointer). With gcc this seems to behave as expected but with xlc
the result in toc is not the same as $r2. I updated the code to use
asm volatile ("std 2, %0" : "=m" (toc)) to load the TOC pointer.
- The OPAL_PATCHER_BEGIN macro is meant to be the first thing in a
hook. On PPC64 it loads the correct TOC pointer (thanks to
mca_patcher_base_patch_hook) and saves the old one. The
OPAL_PATCHER_END macro restores the TOC pointer. Because we *need*
the TOC to be correct before it is accessed in the hook the
OPAL_PATCHER_BEGIN macro MUST come first. We did this and all was
well with gcc. With xlc on the other hand there was a TOC access
before the assembly inserted by OPAL_PATCHER_BEGIN. To fix this
quickly I broke each hook into a pair of functions with the
OPAL_PATCHER_* macros on the top level functions. This works around
the issue but is not a clean way to fix this. In the future we
should either 1) update overwrite to not need this, or 2) figure
out why xlc is not inserting the asm before the first TOC read.
This fixes open-mpi/ompi#1854
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-08-26 14:43:03 -06:00
Jeff Squyres
87a5ccc060
usnic: show the local UDP ports
...
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-08-26 12:25:18 -07:00
Jeff Squyres
e03a40a0e9
pmix3x: remove generated file
...
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-08-26 10:30:47 -07:00
Jeff Squyres
9ae51a09f2
Merge pull request #1989 from jsquyres/pr/update-usnic-to-libfabric-v1.4
...
Update usnic BTL to libfabric v1.4
2016-08-26 09:53:07 -04:00
Gilles Gouaillardet
e4bf915e75
pmix3x: remove auto-generated file
...
remove opal/mca/pmix/pmix3x/pmix/src/include/pmix_config.h.in
.gitignore is correct, so it seems this file was added before .gitignore was updated
2016-08-26 15:00:18 +09:00
Ralph Castain
af67f16422
Update configury to support multiple PMIx versions, rename pmix2x component to pmix3x for support of PMIx master
...
Update support for external v1.1.x and v2.x libraries. Minor corrections to the v3.x component
2016-08-25 18:19:05 -07:00
Jeff Squyres
f56b16f079
usnic: remove unused variable
...
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-08-25 03:53:18 -07:00
Jeff Squyres
9717bcb7e6
btl/usnic: remove stale comment
...
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-08-25 03:53:18 -07:00
Jeff Squyres
6f5e377fe0
btl/usnic: update for libfabric v1.4
...
With libfabric v1.4, the usnic provider changed the values of its
fabric and domain name strings (compared to libfabric <v1.4). Update
the Open MPI usNIC BTL to handle both pre-v1.4 and v1.4 fabric/domain
names.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-08-25 03:53:17 -07:00
George Bosilca
3adff9d323
Fixes #1793.
...
Reshape the tearing down process (connection close) to prevent race
conditions between the main thread and the progress thread.
Minor cleanups.
2016-08-24 22:45:19 -04:00
Nathan Hjelm
83062db7cb
btl/ugni: actually make the endpoint lock recursive
...
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-08-24 10:36:08 -06:00
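For readers unfamiliar with the term, a recursive lock can be re-acquired by the thread that already holds it, whereas a plain lock cannot. The sketch below shows the difference using Python's standard `threading` module. The `Endpoint` class and `can_reacquire` helper are hypothetical illustrations and do not reflect the ugni BTL code.

```python
import threading


def can_reacquire(lock):
    """Report whether the holding thread can take the same lock again."""
    lock.acquire()
    try:
        # Non-blocking attempt, so a plain lock fails instead of deadlocking.
        got_it = lock.acquire(blocking=False)
        if got_it:
            lock.release()
        return got_it
    finally:
        lock.release()


class Endpoint:
    """Hypothetical endpoint whose send path re-enters the locked connect path."""

    def __init__(self, lock):
        self.lock = lock
        self.connected = False

    def connect(self):
        with self.lock:
            self.connected = True

    def send(self):
        with self.lock:
            if not self.connected:
                self.connect()  # second acquisition by the same thread
            return "sent"
```

With a non-recursive lock, the nested `connect()` call inside `send()` would block forever, which is why the lock type matters when one locked code path calls another.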
Gilles Gouaillardet
02847d9e7b
pmix2x: dstore: add missing <fcntl.h> include file in pmix_esh.c
...
(back-ported from upstream pmix/master@5c66ffe0f0)
2016-08-24 11:18:46 +09:00
Gilles Gouaillardet
c11e8163f8
pmix2x: sec/native: fix the pmix_native module under solaris by using getpeerucred()
...
and fail with a user friendly message if no method is available:
"sec: native cannot validate_cred on this system"
(back-ported from upstream pmix/master@c474a1fc60)
2016-08-24 11:18:40 +09:00
Gilles Gouaillardet
e91292aa41
pmix2x: configury: add missing check for <netdb.h> header file
...
(back-ported from upstream pmix/master@e54ce6d423)
2016-08-24 11:18:32 +09:00
Potnuri Bharat Teja
9b7f9ece20
Add Chelsio T6 adapter device parameters.
...
Signed-off-by: Potnuri Bharat Teja <bharat@chelsio.com>
2016-08-23 10:38:13 +05:30
Ralph Castain
639dbdb7ea
For maintainability, fold the external PMIx 2.x integration into the internal PMIx 2.x library component. This keeps the two in sync, which was becoming a problem.
2016-08-22 13:28:55 -07:00
George Bosilca
fd57f5bccd
Remove some of the clang warnings.
2016-08-20 14:21:42 -04:00
Ralph Castain
61ffba668b
Roll in the latest PMIx version - includes shared memory datastore and reduced memory footprint
2016-08-20 07:53:06 -07:00
Artem Polyakov
6ea8cccdab
Merge pull request #1969 from artpol84/pmix_jobid_fix
...
Pmix jobid fix
2016-08-18 17:24:58 +07:00
Ralph Castain
7da9793fef
Support the PMIX_TIMEOUT key at the PMIx server when timeout=0 - this indicates that the user doesn't want a lookup of any data from the host RM.
2016-08-17 16:26:58 -05:00
Gilles Gouaillardet
3126ff77e2
pmix2x: common syms: whitelist bison-generated common symbols
...
Bison generates some common symbols that we can't do anything about,
so whitelist them.
2016-08-16 11:29:06 +09:00
Artem Polyakov
c5a91c5c9d
opal/pmix: fix pmix jobid calculation if external PMIx server is used.
2016-08-15 21:13:51 +03:00
Ralph Castain
ecbedee8bb
Fix typo
2016-08-15 07:32:00 -07:00
Artem Polyakov
f3c816b52e
opal/pmix: fix indentation in some files.
2016-08-15 18:21:50 +07:00
Gilles Gouaillardet
483685eb6a
update .gitignore
...
remove autogenerated opal/mca/pmix/pmix2x/pmix/src/include/pmix_config.h.in
2016-08-15 17:00:20 +09:00
Ralph Castain
be8424b691
Provide backward compatible keys so that the non-PMIx components in the opal/pmix framework don't have to adjust as we continue to work on finalizing the PMIx reference scheme. Activate and utilize the new PMIx show_help capability to provide more meaningful error output when the server cannot start.
...
Add a contrib script to cleanup permissions incorrectly modified due to things like smb mounts
2016-08-13 12:13:04 -07:00
rhc54
ddde154d28
Merge pull request #1962 from rhc54/topic/notify
...
Ensure we properly convert pmix status to ORTE state before activatin…
2016-08-13 06:59:50 -07:00
Ralph Castain
48d35a9627
Ensure we properly convert pmix status to ORTE state before activating an error state upon notification. Cleanup some conversion issues on notification info. Add a new orte_notify.c test program
2016-08-12 21:14:29 -07:00
Ralph Castain
4a4c9703a9
Setup the job list in the PMIx integration so that static ports can run
2016-08-12 13:27:10 -07:00
Ralph Castain
1d44f0c0e2
Silence Coverity warnings
2016-08-11 21:22:01 -07:00
Ralph Castain
73544d2e00
Rename symbol
2016-08-11 13:06:46 -07:00
Ralph Castain
b0cc9b0bc8
Update to latest PMIx toolext branch
...
Fix indentations
Update the ext20 component to match latest PMIx master.
Cleanup name conflicts and uninit vars
2016-08-11 12:29:48 -07:00
rhc54
60f789dca1
Merge pull request #1948 from rhc54/topic/pmixtool
...
Update to include extended tool support, new datatypes
2016-08-09 16:17:28 -07:00
Nathan Hjelm
19be439998
Merge pull request #1949 from hjelmn/ugni_fix
...
btl/ugni: fix another connection race
2016-08-09 08:32:40 -06:00
Gilles Gouaillardet
6f6b3ac68a
configury: standardize memory/patcher symbol detection and make it more robust
...
By default, Sun compilers optimize out the original test and hence fail to detect that a symbol is missing.
2016-08-09 09:35:52 +09:00
Nathan Hjelm
adb668209b
btl/ugni: fix another connection race
...
This commit fixes a race that can occur when two threads are in the
ugni progress function at the same time. The race occurs when one
thread calls GNI_PostDataProbeById and then goes to sleep, and another
thread calls GNI_PostDataProbeById and then GNI_EpPostDataWaitById
before the first thread wakes up. If this happens, the first thread
will print a warning on GNI_EpPostDataWaitById about no matching post.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-08-08 15:38:11 -06:00
Ralph Castain
527b5c692a
Update to include extended tool support, new datatypes
2016-08-08 13:39:46 -07:00
Todd Kordenbrock
b90da992c8
Merge pull request #1895 from PDeveze/Patchs-on-btl-portals4
...
btl/portals4: Take into account the limitation of portals4 (max_msg_s…
2016-08-08 15:12:50 -05:00