Ralph Castain
fcb7a2f29b
Minor cleanups for when using external pmix
...
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-09-24 09:53:04 -07:00
Ralph Castain
fe9b584c05
Fully support OMPI spawn options. Fix a bug in the round-robin mappers where we weren't adding nodes to the job map node array, and so resources were not released
...
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit 285d8cfef74ffc899e9c51e1d9c597b7fb2ceb89)
2017-09-21 10:29:27 -07:00
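Note on the round-robin fix above: the following is a minimal sketch of the idea, not the ORTE implementation. All names (`Node`, `map_round_robin`, `release`) are hypothetical. It illustrates why every node a mapper uses has to be recorded in the job's node list: the release step only walks that list, so a node that was never recorded would keep its slots marked as in use.

```python
from dataclasses import dataclass


@dataclass(eq=False)  # eq=False: nodes are compared by identity, not by field values
class Node:
    name: str
    slots: int
    slots_inuse: int = 0


def map_round_robin(nodes, nprocs):
    """Place nprocs processes one at a time across nodes.

    Returns (placements, job_nodes): the node name chosen for each rank,
    and the list of nodes this job actually used.
    """
    placements = []
    job_nodes = []
    while len(placements) < nprocs:
        progressed = False
        for node in nodes:
            if len(placements) == nprocs:
                break
            if node.slots_inuse < node.slots:
                node.slots_inuse += 1
                placements.append(node.name)
                # The step the fix is about: remember the node in the job map.
                if node not in job_nodes:
                    job_nodes.append(node)
                progressed = True
        if not progressed:
            raise RuntimeError("not enough slots for the job")
    return placements, job_nodes


def release(placements, job_nodes):
    """Give back one slot per placed process, using only the job's node list."""
    by_name = {node.name: node for node in job_nodes}
    for name in placements:
        node = by_name.get(name)
        if node is not None:  # a node missing from job_nodes would leak its slot
            node.slots_inuse -= 1
```

For example, mapping three processes onto two two-slot nodes alternates between them, and releasing the job returns every slot because both nodes were recorded.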
Ralph Castain
e575c4d6f9
Fix tool connection logic so we properly search for default session server, perform specified number of retries, etc.
...
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit 7c755e01004f8b86c71f1729662979ea45ab1adb)
2017-09-19 13:35:46 -07:00
Ralph Castain
16de607607
Merge pull request #4234 from rhc54/topic/upstream
...
Ensure we update the total_slots_alloc field on each job. Correct the client example
2017-09-19 09:03:04 -07:00
Ralph Castain
658c3d1d51
Ensure we update the total_slots_alloc field on each job. Correct the client example
...
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit bcedd12a8a24dd246f04ff13b4fd2f1bbac6ce5a)
2017-09-19 08:14:14 -07:00
Jeff Squyres
7cccee9d92
rmaps/base: remove debugging "DONE" message
...
Thanks to Ben Menadue for reporting and supplying the patch.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2017-09-19 07:10:00 -07:00
Ralph Castain
5708872112
Implement support for "local" range when publishing data
...
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit 2d54f7e0dd3a47260b0b2634aae3361316005933)
2017-09-18 19:34:08 -07:00
Ralph Castain
08c93091f7
Merge pull request #4223 from rhc54/topic/stale
...
Remove stale tools
2017-09-18 09:43:06 -07:00
Josh Hursey
252be7ffb0
Merge pull request #4215 from jjhursey/fix/plm-lsf-rc
...
plm/lsf: Improve error message if lsb_launch fails
2017-09-18 11:14:25 -05:00
Ralph Castain
ed508010b4
Remove stale tools
...
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-09-18 07:30:47 -07:00
Ralph Castain
3c914a7a97
Complete the fix of the ORTE DVM. We will now use "prun" instead of "orterun -hnp foo" to execute jobs. This provides automatic discovery of the orte-dvm, so you don't need to manually enter URIs or contact file locations. All IO is forwarded to prun.
...
Still in the "needs to be done" category:
* mapping/ranking/binding options aren't correctly supported
* if the DVM encounters some errors (e.g., not enough resources for the job), the resulting error is globally set and impacts any subsequent job submission
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-09-16 13:13:07 -07:00
Joshua Hursey
89c1aaf646
plm/lsf: Improve error message if lsb_launch fails
...
Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
2017-09-15 09:45:58 -05:00
Ralph Castain
7c7d8a69a0
Backport changes from PMIx reference server
...
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-09-14 11:48:56 -07:00
Ralph Castain
8d336ddcc0
Merge pull request #4209 from rhc54/topic/foobar
...
Only build prun if building --with-devel-headers
2017-09-13 13:07:29 -07:00
Ralph Castain
3f8908871b
Since the DVM is now tied to prun, don't build the DVM either unless prun can be built
...
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-09-13 11:55:10 -07:00
Ralph Castain
589cc03d8e
Only build prun if building --with-devel-headers
...
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-09-13 11:38:11 -07:00
Ralph Castain
0a3d8af4c2
Merge pull request #4202 from anandhis/master
...
Choosing provider when user requests generic transport "fabric"
2017-09-13 11:21:24 -07:00
Ralph Castain
bbd83fd4c0
Add a new launcher "prun" for starting applications against the ORTE DVM.
...
Unlike "orterun", "prun" is a PMIx-only program that discovers the DVM connection instead of requiring that we explicitly provide it. Only build "prun" if PMIx v2.x is available.
This gets the DVM working again, but it still shows problems with multiple executions. I'll detail those in a separate issue. Thus, the DVM should still be considered "broken".
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-09-12 21:40:41 -07:00
anandhi
4d7de8882f
Checking for generic transport "fabric" in mca parameter rml_ofi_transports
...
to choose the first available non-socket provider.
modified: orte/mca/rml/ofi/rml_ofi_component.c
modified: orte/mca/rml/ofi/rml_ofi_send.c
Signed-off-by: Anandhi Jayakumar <anandhi.s.jayakumar@intel.com>
2017-09-12 15:39:55 -07:00
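Note on the "fabric" selection above: a minimal sketch of the selection rule described in this commit, assuming the available providers are known as an ordered list of names and that the socket-based provider is called "sockets". The function name and the handling of non-"fabric" requests are hypothetical; the actual logic lives in orte/mca/rml/ofi/rml_ofi_component.c.

```python
def choose_provider(requested, available):
    """Pick a provider name for the requested transport.

    requested: value the user supplied (for example via rml_ofi_transports)
    available: provider names in the order they were discovered
    Returns the chosen provider name, or None if nothing matches.
    """
    if requested == "fabric":
        # Generic request: take the first provider that is not socket-based.
        for name in available:
            if name != "sockets":
                return name
        return None
    # Assumption for this sketch: any other value names a provider directly.
    return requested if requested in available else None
```

With `available = ["sockets", "psm2", "verbs"]`, a request for "fabric" would select "psm2" under these assumptions.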
Ralph Castain
3477079804
Repair the ORTE DVM
...
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-09-11 17:38:21 -07:00
Joshua Hursey
420ca65f4f
orte/pmix: Always seed environment with global rank
...
* Even if we are only launching one app context, we might call spawn
later and the remote groups might want their global rank information.
Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
2017-09-08 08:53:49 -05:00
Howard Pritchard
5db9416724
rml/ofi: swat a compiler warning
...
On the path to -Werror passing builds!
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2017-08-30 09:16:49 -06:00
Josh Hursey
ad87aa2674
Merge pull request #4121 from jjhursey/explore/dlopen-local
...
mca: Dynamic components link against project lib
2017-08-25 13:15:51 -05:00
Joshua Hursey
e1d079544b
mca: Dynamic components link against project lib
...
* Resolves #3705
* Components should link against the project level library to better
support `dlopen` with `RTLD_LOCAL`.
* Extend the `mca_FRAMEWORK_COMPONENT_la_LIBADD` in the `Makefile.am`
with the appropriate project level library:
```
MCA components in ompi/
$(top_builddir)/ompi/lib@OMPI_LIBMPI_NAME@.la
MCA components in orte/
$(top_builddir)/orte/lib@ORTE_LIB_PREFIX@open-rte.la
MCA components in opal/
$(top_builddir)/opal/lib@OPAL_LIB_PREFIX@open-pal.la
MCA components in oshmem/
$(top_builddir)/oshmem/liboshmem.la
```
Note: The changes in this commit were generated automatically by the
`libadd_mca_comp_update.py` script added in the commit that precedes it.
Some components were not included in this change because
they are statically built only.
Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
2017-08-24 11:56:16 -04:00
Ralph Castain
68029b27e4
Fix the orte-dvm operations so that orterun can connect and execute an application. There is a lingering problem, though. The first invocation of orterun succeeds every time. However, subsequent invocations have a high probability of hanging in the OOB connection handshake.
...
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-08-23 17:31:08 -07:00
Ralph Castain
d80b0c7990
If the HWLOC shared memory system is unable to connect, then fall back to providing the topology via XML. Do not automatically provide the XML to every process as that defeats the purpose of the shared memory system. Instead, use PMIx_Query_info_nb to get the info from the server when required.
...
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-08-22 18:12:26 -07:00
Brice Goglin
046d870124
rtc/hwloc/shmem: add Inria copyrights
...
The code for finding the hole for the shmem region actually came from me.
Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
2017-08-21 23:09:57 +02:00
Brice Goglin
baf762d99d
rtc/hwloc/shmem: dump /proc/self/maps if failed to find a hole and verbosity > 4
...
Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
2017-08-21 19:57:38 +02:00
Brice Goglin
8f6afbb641
rtc/hwloc/shmem: fix "heap" hole search kind
...
There can be multiple consecutive [heap] entries in /proc/<pid>/maps,
and there's no room between them.
Don't use a hole after the first [heap] if there's another [heap]
immediately after it.
This code would fail to find the last [heap] if there were multiple
[heap] interleaved with non-heap VMA, but our kind "after heap"
wouldn't be meaningful anymore anyway.
Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
2017-08-21 15:42:38 +02:00
Brice Goglin
b8b46b253b
rtc/hwloc/shmem: fix "libs" hole search kind
...
We want the biggest hole *between* heap and stack, not outside.
Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
2017-08-21 15:40:36 +02:00
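Note on the two hole-search fixes above: a minimal sketch, not the rtc/hwloc code, of the two rules as described in the commit messages. It assumes the standard Linux /proc/<pid>/maps line format; the function names and the sample data are hypothetical. `hole_after_heap` skips a run of consecutive [heap] entries before taking the following gap, and `biggest_hole_between_heap_and_stack` only considers gaps that lie between [heap] and [stack].

```python
def parse_maps(text):
    """Parse /proc/<pid>/maps text into a list of (start, end, name) tuples."""
    vmas = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        start, end = (int(x, 16) for x in fields[0].split("-"))
        name = fields[5] if len(fields) > 5 else ""  # anonymous mappings have no name
        vmas.append((start, end, name))
    return vmas


def hole_after_heap(vmas):
    """Return (start, end) of the gap after the first run of [heap] entries."""
    for i, (_, _, name) in enumerate(vmas):
        if name != "[heap]":
            continue
        j = i
        # Skip any [heap] entries that immediately follow the first one.
        while j + 1 < len(vmas) and vmas[j + 1][2] == "[heap]":
            j += 1
        if j + 1 < len(vmas):
            return (vmas[j][1], vmas[j + 1][0])
        return None
    return None


def biggest_hole_between_heap_and_stack(vmas):
    """Return the largest gap located after [heap] and before [stack]."""
    inside = False
    best = None
    for (_, end0, name0), (start1, _, _) in zip(vmas, vmas[1:]):
        if name0 == "[heap]":
            inside = True
        if name0 == "[stack]":
            inside = False  # gaps beyond the stack are outside the region
        if inside and start1 > end0:
            if best is None or start1 - end0 > best[1] - best[0]:
                best = (end0, start1)
    return best


# Hypothetical sample with two adjacent [heap] entries.
SAMPLE_MAPS = """\
00400000-00401000 r-xp 00000000 08:01 1 /bin/app
01000000-01002000 rw-p 00000000 00:00 0 [heap]
01002000-01004000 rw-p 00000000 00:00 0 [heap]
7f0000000000-7f0000100000 r-xp 00000000 08:01 2 /lib/libc.so
7f0000200000-7f0000300000 r-xp 00000000 08:01 3 /lib/libm.so
7ffc00000000-7ffc00021000 rw-p 00000000 00:00 0 [stack]
"""
```

As the commit message itself points out, taking only the first run of [heap] entries misses later ones when heap and non-heap mappings are interleaved; the sketch shares that limitation.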
Ralph Castain
d515f48885
The local PMIx server is notifying its clients of all events, but for some reason I don't recall, the broadcast notification was marked for delivery only to non-default event handlers. This creates a discrepancy between the two behaviors, so don't restrict the broadcast notifications.
...
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-08-18 17:26:11 -07:00
Ralph Castain
b67b1e88a5
Merge pull request #4111 from rhc54/topic/multiconnect
...
Cleanup some issues in connect/accept support across jobs started by …
2017-08-17 12:49:01 -07:00
Ralph Castain
d85239e052
Cleanup some issues in connect/accept support across jobs started by different mpirun commands. Still not fully operational, but someone else will have to finish debugging it
...
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-08-17 11:58:48 -07:00
Ralph Castain
088b6cdeee
Silence coverity warnings
...
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-08-17 09:49:35 -07:00
Ralph Castain
41df973359
Add diagnostics for hwloc get_topology
...
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-08-16 14:21:27 -07:00
Ralph Castain
65fb6070d9
Update tool support by adding MCA params to direct orted's to drop
...
session and/or system-level tool rendezvous files. Ensure PMIx is
enabled for tools
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-08-15 17:49:47 -07:00
Ralph Castain
2fbce9d93c
Fix hostfile filtering in allocated environments to preserve slot assignments
...
Refs #3984
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-08-15 14:41:12 -07:00
Artem Polyakov
10d6e90bf5
Revert "plm/rsh: Propagate PMIx prefix to orted's"
...
This reverts commit 71da0fcbef.
(per https://github.com/open-mpi/ompi/pull/4052 ).
Refs: https://github.com/open-mpi/ompi/issues/3980
Signed-off-by: Artem Polyakov <artpol84@gmail.com>
2017-08-14 21:37:57 +07:00
Ralph Castain
edccb258cb
Provide the mapping, ranking, binding patterns
...
Apps might want to make use of the relative patterns used to place/assign their procs
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-08-09 11:34:43 -07:00
Nathan Hjelm
76320a8ba5
opal: rename opal_atomic_init to opal_atomic_lock_init
...
This function is used to initialize an opal atomic lock. The old name
was confusing.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2017-08-07 14:15:11 -06:00
Ralph Castain
9921237f99
Merge pull request #4012 from rhc54/topic/p3
...
Cover the use-cases for OPAL_PREFIX and PMIX_INSTALL_PREFIX options
2017-08-07 11:42:53 -07:00
Ralph Castain
d1b7c3d8d5
Silence some compile-time warnings. Update scripts now that AUTHORS is gone
...
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-08-04 20:08:31 -07:00
Ralph Castain
a239b4c3c3
Per discussion on the PMIx side, do a better job of detecting mismatches between location directives for OPAL and PMIx. Provide a more helpful error message and error out if we find a mismatch. If any OPAL values are set and the PMIx equivalent is not, then transfer it.
...
Do not clear PMIX_INSTALL_PREFIX from the daemon's launch environment
Fixes #3980
Closes #4007
Refs #3985
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-08-04 19:36:00 -07:00
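Note on the prefix handling above: a minimal sketch of the rule described in this commit, limited to the OPAL_PREFIX / PMIX_INSTALL_PREFIX pair. It assumes that "mismatch" means both variables are set to different values and that "transfer" means copying the OPAL value into the unset PMIx variable; the function name is hypothetical and the real code may cover more location directives.

```python
def reconcile_prefixes(env):
    """Return a copy of env with the PMIx prefix reconciled against OPAL's.

    Raises ValueError when both prefixes are set but disagree.
    """
    opal = env.get("OPAL_PREFIX")
    pmix = env.get("PMIX_INSTALL_PREFIX")
    result = dict(env)
    if opal is not None and pmix is None:
        # OPAL value set, PMIx equivalent missing: transfer it.
        result["PMIX_INSTALL_PREFIX"] = opal
    elif opal is not None and pmix is not None and opal != pmix:
        raise ValueError(
            f"OPAL_PREFIX ({opal}) and PMIX_INSTALL_PREFIX ({pmix}) do not match"
        )
    return result
```

An environment that sets only PMIX_INSTALL_PREFIX is left untouched in this sketch, since the commit message describes the transfer in the OPAL-to-PMIx direction only.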
Ralph Castain
88a7c9dca3
Merge pull request #4013 from rhc54/topic/hwloc
...
Silence warning on Mac - we know Mac doesn't support hwloc, and so it…
2017-08-03 15:52:44 -06:00
Howard Pritchard
897c62756b
Merge pull request #3999 from hppritcha/topic/slurmd_controls_them_all
...
SLURM: launch all processes via slurmd
2017-08-03 15:33:44 -06:00
Gilles Gouaillardet
6b6e65a5bc
rtc/hwloc: fix MCA parameter handling
...
always re-initialize vmhole *before* mca_base_component_var_register()
otherwise the vmhole gets NULL'ified if orte is initialized a second time.
that typically occurs when Open MPI is configure'd with --disable-dlopen
and the app does MPI_T_init_thread(); MPI_T_finalize(); MPI_T_init_thread();
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2017-08-03 14:45:43 +09:00
Ralph Castain
9f926b8083
Silence warning on Mac - we know Mac doesn't support hwloc, and so it doesn't matter if a VM hole isn't found. It also doesn't matter in general as all it really means is that we have to turn the hwloc shmem support "off".
...
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-08-02 20:20:45 -06:00
Howard Pritchard
d08be74573
SLURM: launch all processes via slurmd
...
It turns out that the approach of having the HNP do the
fork/exec of MPI ranks on the head node in a SLURM environment
introduces problems when users/sysadmins want to use the SLURM
scancel tool or sbatch --signal option to signal a job.
This commit disables use of the HNP fork/exec procedure when
a job is launched into a SLURM controlled allocation.
update NEWS with a blurb about new ras framework mca parameter.
related to #3998
Signed-off-by: Howard Pritchard <hppritcha@gmail.com>
2017-08-02 14:56:55 -06:00
Artem Polyakov
71da0fcbef
plm/rsh: Propagate PMIx prefix to orted's
...
Signed-off-by: Artem Polyakov <artpol84@gmail.com>
2017-08-02 08:06:13 +03:00
Ralph Castain
f39ce67982
Merge pull request #3951 from rhc54/topic/hwloc2
...
Update to hwloc 2.0.0a
2017-08-01 15:18:31 -06:00