Ralph Castain
840d6c9a1d
Merge pull request #3284 from rhc54/topic/hotel
...
Resolve the direct modex race condition.
2017-04-05 03:09:27 -07:00
Gilles Gouaillardet
8d1369db49
Merge pull request #3283 from ggouaillardet/topic/nvml
...
build nvml support only with CUDA support
2017-04-05 16:14:23 +09:00
Ralph Castain
b7e9711f45
Resolve the direct modex race condition. The request hotel was running out of rooms, thereby returning an error upon checkin - and we had missed error_logging a couple of those places. Hence no error message and things just hung.
...
Output a (hopefully) helpful message when we time out an operation
Thanks to Nathan for tracking it down.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-04-04 21:32:44 -07:00
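The failure mode described in commit b7e9711f45 can be sketched as follows. This is a minimal Python illustration, not the ORTE code: the `Hotel` class and the `checkin`, `checkout`, and `track_request` names are hypothetical stand-ins for a fixed-capacity request tracker whose check-in fails once every room is taken. The point is only that the caller reports the failure instead of dropping it silently.

```python
import logging

logger = logging.getLogger("modex_sketch")  # hypothetical logger name


class Hotel:
    """Fixed-capacity tracker for outstanding requests (illustrative only)."""

    def __init__(self, num_rooms):
        # None marks an empty room.
        self.rooms = [None] * num_rooms

    def checkin(self, occupant):
        """Return the room number, or None if every room is occupied."""
        for room, current in enumerate(self.rooms):
            if current is None:
                self.rooms[room] = occupant
                return room
        return None

    def checkout(self, room):
        """Free a room and return whoever occupied it."""
        occupant = self.rooms[room]
        self.rooms[room] = None
        return occupant


def track_request(hotel, request):
    """Check a request in; log a full hotel instead of failing silently."""
    room = hotel.checkin(request)
    if room is None:
        logger.error("no room available for request %r", request)
        return False
    return True
```

With a two-room hotel, a third `track_request` call returns `False` and logs an error; after a `checkout`, the same request can be checked in.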
Ralph Castain
9a69b20d09
Merge pull request #3282 from rhc54/topic/direct
...
Set the PARENT vpid for direct routed module
2017-04-04 20:55:12 -07:00
Gilles Gouaillardet
10ea991d0a
hwloc: add CUDA include dir to CPPFLAGS
...
so hwloc configury can find nvml.h when CUDA support is built
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2017-04-05 11:46:22 +09:00
Gilles Gouaillardet
8d7541f766
hwloc: disable nvml if CUDA support is not built in Open MPI
...
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2017-04-05 11:07:34 +09:00
Ralph Castain
9132bb26fe
Merge pull request #3281 from rhc54/topic/dmx
...
Adjust the timeout for direct modex requests to reflect the size of t…
2017-04-04 19:04:33 -07:00
Ralph Castain
40ca43e157
Set the PARENT vpid for direct routed module
...
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-04-04 19:03:28 -07:00
Ralph Castain
734b90aa6b
Adjust the timeout for direct modex requests to reflect the size of the job. It can take several seconds to start all the procs, and we don't want to time out due to differences in start times of the various procs.
...
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-04-04 18:20:51 -07:00
Ralph Castain
9cb18b8348
Merge pull request #3280 from rhc54/topic/dvm
...
Fix the DVM by ensuring that all nodes, even those that didn't partic…
2017-04-04 18:15:33 -07:00
Ralph Castain
74863a0ea4
Fix the DVM by ensuring that all nodes, even those that didn't participate (i.e., didn't have any local children) in a job, clean up all resources associated with that job upon its completion. With the advent of backend distributed mapping, nodes that weren't part of the job would still allocate resources on other nodes - and then start from that point when mapping the next job. This change ensures that all daemons start from the same point each time.
...
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-04-04 17:31:38 -07:00
Aurelien Bouteiller
308d33dca7
Merge pull request #3266 from abouteiller/bugfix/f90mpiext
...
Fix the top_ompi_srcdir in Fortran mpiext building system
2017-04-04 16:25:30 -04:00
Nathaniel Graham
7063f3021f
Merge pull request #3231 from nrgraham23/revamp_mpirun_help
...
mpirun --help output revamp
2017-04-04 12:32:20 -06:00
Nathaniel Graham
19e5d15491
mpirun --help output revamp
...
This commit modifies the output from the mpirun --help
command. The options have been split into groups, to
make the output smaller and more readable. The groups
are: general, debug, output, input, mapping, ranking,
binding, devel, compatibility, launch, dvm, and
unsupported. There is also a special "full" command
that can be used to get the old behaviour of printing
out all of the options. Unsupported options may only
be seen with this full output.
This commit also adds a special case for the help
argument. It makes it possible for the user to
enter 0 or 1 arguments instead of having to always
enter an argument. This defaults to printing out
the "general" help options so the user can then
see what help arguments there are.
Signed-off-by: Nathaniel Graham <ngraham@lanl.gov>
2017-04-04 10:59:32 -06:00
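The "0 or 1 arguments" behaviour described in commit 19e5d15491 can be sketched with Python's `argparse`. This is an illustration only and says nothing about how mpirun's own command-line parser is implemented. The group names come from the commit message; `build_parser` and `topics_to_print` are hypothetical names, and the sketch assumes that "unsupported" is reachable only through "full".

```python
import argparse

# Group names as listed in the commit message.
GROUPS = (
    "general", "debug", "output", "input", "mapping", "ranking",
    "binding", "devel", "compatibility", "launch", "dvm", "unsupported",
)

# Assumption: "unsupported" cannot be requested directly, only via "full".
SELECTABLE = tuple(g for g in GROUPS if g != "unsupported") + ("full",)


def build_parser():
    """Build a parser whose --help option takes zero or one topic."""
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument(
        "--help",
        nargs="?",          # the topic is optional
        const="general",    # value used when --help has no argument
        default=None,       # value used when --help is absent
        choices=SELECTABLE,
        metavar="TOPIC",
    )
    return parser


def topics_to_print(argv):
    """Return the help groups that should be printed for argv."""
    args = build_parser().parse_args(argv)
    if args.help is None:
        return []
    if args.help == "full":
        return list(GROUPS)
    return [args.help]
```

Here `topics_to_print(["--help"])` yields only the "general" group, while `topics_to_print(["--help", "full"])` yields every group, including "unsupported".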
Ralph Castain
a605bd4265
Merge pull request #3278 from rhc54/topic/tm
...
Remove stale code line
2017-04-04 09:04:52 -07:00
Ralph Castain
393c4536eb
Remove stale code line
...
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-04-04 08:13:15 -07:00
Nathan Hjelm
1322e5dee8
Merge pull request #3274 from hjelmn/osc_rdma_fix
...
osc/rdma: fix typo in atomic code
2017-04-04 00:20:42 -06:00
Gilles Gouaillardet
5dfd4ab6ca
coll/tuned: remove set-but-not-used variables
...
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2017-04-04 13:18:11 +09:00
Ralph Castain
fe64144892
Merge pull request #3259 from rhc54/topic/launch
...
Update how we pass the node regex so we pass _all_ nodes, even those without daemons.
2017-04-03 20:40:48 -07:00
Ralph Castain
92c996487c
Update how we pass the node regex so we pass _all_ nodes, even those without daemons. This allows the backend daemons to form a complete picture of the allocation. Include info on which nodes have daemons on them, and populate that info on the backend as well.
...
Set the daemons' state to "running" and mark them as "alive" by default when constructing the nidmap
Get the DVM running again
Fix direct modex by eliminating race condition caused by releasing data while sending it
Up the size limit before compressing
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-04-03 19:25:15 -07:00
Nathan Hjelm
fad0803920
osc/rdma: fix typo in atomic code
...
Fixes #3267
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2017-04-03 15:54:28 -06:00
Ralph Castain
9850832dbd
Merge pull request #3273 from rhc54/topic/pmix2.0
...
Update to PMIx v2.0alpha
2017-04-03 11:07:08 -07:00
Aurélien Bouteiller
6ef6a3fb18
Fix the Fortran mpiext building system
...
Signed-off-by: Aurélien Bouteiller <bouteill@icl.utk.edu>
2017-04-03 13:46:32 -04:00
Ralph Castain
2cc5fea8be
Update to PMIx v2.0alpha
...
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-04-03 10:02:29 -07:00
Yossi
b3736701c4
Merge pull request #3239 from xinzhao3/topic/ucx-num-eps
...
Passing estimated_num_procs to UCX init in PML and SPML.
2017-04-03 18:31:27 +03:00
Ralph Castain
5f6ba81f11
Merge pull request #3263 from ggouaillardet/topic/hwloc1116
...
hwloc: update hwloc to 1.11.6
2017-03-31 08:03:04 -07:00
Gilles Gouaillardet
81062b7cd2
hwloc: update hwloc to 1.11.6
...
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2017-03-31 13:35:16 +09:00
Jeff Squyres
7e57075f0d
Merge pull request #3248 from jsquyres/pr/remove-macosx-pkg-support
...
dist: remove OS X package script
2017-03-29 18:46:14 -04:00
Jeff Squyres
f0a8a0af51
dist: remove OS X package script
...
We stopped supporting this long ago.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2017-03-29 10:13:01 -04:00
Jeff Squyres
f980cba6b3
Merge pull request #3249 from rhc54/topic/abort
...
Use the correct callback data - the callback function was expecting a…
2017-03-28 20:54:39 -04:00
Jeff Squyres
c7c13253ea
Merge pull request #3250 from jsquyres/pr/modulefile-in-srpm
...
openmpi.spec: also put the modulefile in /opt if install_in_opt==1
2017-03-28 20:46:26 -04:00
Kevin Buckley
9e23c5e3f6
openmpi.spec: also put the modulefile in /opt if install_in_opt==1
...
Thanks to Kevin Buckley for noticing the issue and supplying the
patch.
[skip ci]
bot:notest
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2017-03-28 20:45:09 -04:00
Ralph Castain
7dd34d0c9a
Use the correct callback data - the callback function was expecting a bool*, not a pmix_ptl_sr_t*.
...
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-03-28 17:21:47 -07:00
Xin Zhao
ee952fcccd
Passing estimated_num_procs to UCX init in PML and SPML.
...
Signed-off-by: Xin Zhao <xinz@mellanox.com>
2017-03-27 20:36:52 +03:00
Ralph Castain
d782542c5c
Merge pull request #3241 from rhc54/topic/cov
...
Silence coverity dead-code warning
2017-03-27 08:33:33 -07:00
Ralph Castain
71c9bc1f7e
Merge pull request #3242 from jsquyres/pr/orterun-run-as-root-minor-tweaks
...
orte: minor tweaks to run-as-root message
2017-03-27 08:29:43 -07:00
Jeff Squyres
a333cf691a
orte: minor tweaks to run-as-root message
...
Two updates:
1. Remove the "run as root" error message from orterun.c, because that
functionality is now in orted_submit.c (although it is still
required in orte-dvm.c -- so sync the message in orted_submit.c and
orte-dvm.c to be identical).
2. Slightly tweak the text of the "run as root" error message to
explicitly state that we (strongly) suggest running as a non-root
user (and add a little whitespace).
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2017-03-27 04:50:21 -07:00
Ralph Castain
583dbe954c
Silence coverity dead-code warning
...
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-03-26 20:36:43 -07:00
Ralph Castain
b398d721d5
Merge pull request #3236 from rhc54/topic/craycleanups
...
Silence a flood of warnings when compiling with gcc on Cray
2017-03-24 13:33:46 -07:00
Ralph Castain
6ef079cdab
Merge pull request #3235 from rhc54/topic/tcp
...
Stop segfault in BTL/TCP
2017-03-24 12:38:31 -07:00
Ralph Castain
ecc8000136
Silence a flood of warnings when compiling with gcc on Cray
...
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-03-24 13:37:11 -06:00
Ralph Castain
470452cba0
Correctly check the sa_family and cast the data correctly before passing it to inet_ntop, and don't be quite as fancy with the pointer arithmetic, as the combination was causing us to segfault every time this debug message was called.
...
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-03-24 11:42:57 -07:00
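The check described in commit 470452cba0 (look at the address family before formatting the address) can be sketched as follows. The fix itself is in the C code of the TCP BTL; this Python analogue only shows the idea using the standard `socket.inet_ntop` call. The `format_peer_address` function and its placeholder strings are hypothetical.

```python
import socket

# Packed address length, in bytes, for each family we know how to print.
_ADDR_LEN = {socket.AF_INET: 4, socket.AF_INET6: 16}


def format_peer_address(family, packed):
    """Format a packed address for a debug message without ever raising."""
    expected = _ADDR_LEN.get(family)
    if expected is None:
        # Unknown family: do not guess at the layout of the data.
        return "<unsupported address family %d>" % family
    if len(packed) != expected:
        # Wrong size for this family: refuse to interpret the bytes.
        return "<malformed address>"
    return socket.inet_ntop(family, packed)
```

For example, `format_peer_address(socket.AF_INET, bytes([127, 0, 0, 1]))` returns `"127.0.0.1"`, and a buffer of the wrong length produces a placeholder instead of an exception.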
Jeff Squyres
88a4a163ae
Merge pull request #3229 from hjelmn/osc_pt2pt
...
osc/pt2pt: fix typo
2017-03-24 14:38:16 -04:00
Ralph Castain
0694d0bfbe
Merge pull request #3234 from rhc54/topic/cov
...
Fix coverity issues
2017-03-24 08:56:45 -07:00
Ralph Castain
35f817911e
Fix coverity issues
...
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-03-24 08:09:46 -07:00
Ralph Castain
c0bcd11bcf
Fix permissions - no CI required
...
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-03-23 08:05:52 -07:00
Nathan Hjelm
c72fb30eb5
osc/pt2pt: fix typo
...
Signed-off-by: Nathan Hjelm <hjelmn@me.com>
2017-03-23 09:00:21 -06:00
Ralph Castain
b6d1e24ec9
Merge pull request #3227 from rhc54/topic/abort
...
If we lose connection to the server after initiating a send/recv in P…
2017-03-23 07:55:21 -07:00
Ralph Castain
55e4fba5f5
If we lose connection to the server after initiating a send/recv in PMIx (e.g., in PMIx_Abort), then we need to "resolve" all pending recvs to avoid hanging.
...
Fixes #3225
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-03-23 02:53:21 -07:00
Ralph Castain
ea84a53faa
Merge pull request #3218 from rhc54/topic/pmix2
...
Update to include the PMIx 2.0 APIs for monitoring and job control.
2017-03-21 20:11:10 -07:00