1
1
Граф коммитов

5867 Коммитов

Автор SHA1 Сообщение Дата
Jeff Squyres
20d231defa odls_base_default_fns.c: put the free() in the right place
Fixes CID 1441826.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit f96c04244d)
2018-12-22 06:41:33 -08:00
Howard Pritchard
af7a7f5da1
Merge pull request #6216 from abouteiller/export4x/overspawn
v4.0.x: Correctly propagate the oversubscribe flag to the spawnees
2018-12-21 15:33:05 -07:00
Aurélien Bouteiller
d9b0dad828
Correctly propagate the oversubscribe flag to the spawnees
This is a cherry-pick of master (2820aef). The propagation is intended to resolve issue #6130

Signed-off-by: Aurélien Bouteiller <bouteill@icl.utk.edu>
2018-12-21 14:53:25 -05:00
Ralph Castain
2d9c936082 If job is fully described, there will be no ppn string to unpack
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit d728380741)
2018-12-18 08:27:39 -08:00
Ralph Castain
98c8492057 Fix typo for rmaps_base_oversubscribe
Causes the MCA param to be ignored, while the cmd line option still
works.

Thanks to @iassiour for the report!

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-11-29 07:40:51 -08:00
Howard Pritchard
24dae8609e
Merge pull request #5926 from hjelmn/v4.0.x_need_to_unblock_sigchld_in_some_cases
v4.0.x: Ensure SIGCHLD is unblocked
2018-11-19 13:13:10 -07:00
Jeff Squyres
8be14b9b07 orte-rmaps-base: slightly amend help message
Follow on to 430c659908: clarify the help message and fix one typo.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit e9bf318dcb)
2018-11-08 18:20:28 -05:00
Jeff Squyres
76d4c1843e orte-rmaps-base: update out-of-slots show_help message
Update the show_help message for when there are not enough slots to
run an application.

Also, remove a bunch of copies of this message in various show_help
text files that aren't used/referred to anywhere in the code.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit 430c659908)
2018-11-08 16:03:28 -05:00
Ralph Castain
ba6ad9fe42 Remove stale defunct tools
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit 05ac8fa71c)
2018-10-30 08:51:25 -07:00
Ralph Castain
712ddd326f Remove the stale orte-dvm code
Users should migrate to https://github.com/pmix/prrte

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit 1bd772e8eb)
2018-10-30 07:54:35 -07:00
Ralph Castain
8aae7943ab Provide deprecation warning of MPIR debugger
If we detect that we are being debugged by an MPIR-based debugger, then
print a warning that OMPI's MPIR support has been deprecated and will be
removed in a subsequent release.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit 2cb271716b)
2018-10-25 08:49:00 -07:00
Howard Pritchard
d2fb9949a5
Merge pull request #5785 from jsquyres/pr/v4.0.x/more-compiler-warnings-fixes
v4.0.x: Squash a bunch of harmless compiler warnings.
2018-10-16 16:30:21 -06:00
Ralph Castain
05e0545581 Ensure SIGCHLD is unblocked
Thanks to @hjelmn for debugging it and providing the patch

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit efa8bcc17078c89f1c9d6aabed35c90973a469bf)
(cherry picked from commit 647a760b7e)
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-10-16 15:21:18 -06:00
Gilles Gouaillardet
2e2366d193 util/hostfile: fix a double free error
As reported at https://stackoverflow.com/questions/52707242/mpirun-segmentation-fault-whenever-i-use-a-hostfile
mpirun crashes when the hostfile contains a "user@host" line.
The root cause is username was not strdup'ed and free'd twice by opal_argv_free() and free()

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>

(cherry picked from commit open-mpi/ompi@5803385d44)
2018-10-09 13:06:55 +09:00
Jeff Squyres
37a9cf5c82 Squash a bunch of harmless compiler warnings.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit 6bb356ab87)
2018-09-26 14:42:13 -07:00
Jeff Squyres
2e37f97a38 Miscellaneous compiler warning stomps.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit fe0852bcb4)
2018-09-21 14:35:51 -05:00
Gilles Gouaillardet
229ec82cf0 orte: send error messages to stderr.
When a job terminates normally but with a non zero exit code,
display the error message to stderr.

Thanks Emre Brookes for the bug report.

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>

(cherry picked from commit open-mpi/ompi@893270caee)
2018-09-13 10:39:57 +09:00
Gilles Gouaillardet
959aeab5d9 ess/hnp: plug a memory leak in rte_finalize()
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>

(cherry picked from commit open-mpi/ompi@d234caef74)
2018-09-10 09:20:16 +09:00
Geoff Paulsen
954692f06e
Merge pull request #5614 from karasevb/v4.0.x_fix_hwloc_numa_obj
v4.0.x: Fixed the NUMA obj detection for hwloc ver >= 2.0.0
2018-09-07 14:49:26 -05:00
Geoff Paulsen
325e8d26e8
Merge pull request #5608 from hppritcha/topic/pr5600_to_v4.0.x
Deal with special case during cleanup
2018-08-31 14:00:04 -05:00
Geoff Paulsen
b2daa0001f
Merge pull request #5565 from rhc54/cmr40/pmix301
Update to PMIx 3.0.1
2018-08-31 13:58:41 -05:00
Nathan Hjelm
4eeb41506c odls/alps: resolve hang when launching with mpirun on Crays
This commit removes some code that protected the odls/alps component
from closing alps file descriptors. For some unknown reason leaving
these file descriptors open causes can cause an orted to hang when
launching apps.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
(cherry picked from commit 98172163e6)
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-08-29 09:22:37 -06:00
Boris Karasev
31ca3842da Fixed copyrights of prev commit.
Signed-off-by: Boris Karasev <karasev.b@gmail.com>
(cherry picked from commit beb0697f24)
2018-08-28 12:29:16 +03:00
Boris Karasev
d995fb1b3f Fixed the NUMA obj detection for hwloc ver >= 2.0.0
Since version hwloc 2.0.0 has a new organization of NUMA nodes on the
topology tree. This commit adds the detection of local NUMA object for
hwloc => 2.0.0, which fixes the procs bindings policy for rmaps mindist
component.

Signed-off-by: Boris Karasev <karasev.b@gmail.com>
(cherry picked from commit e5291ccc34)
2018-08-28 12:29:08 +03:00
Ralph Castain
5725e91c1a Deal with special case during cleanup
In some scenarios, we can have a daemon sharing the node with mpirun. In
those cases, we need to avoid race conditions in cleanup

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit 8d1be27a1e)
2018-08-27 13:20:11 -06:00
Ralph Castain
b4ae5d005f Allow run-as-root if 2 envars are set
Per suggestion by @bangerth, allow mpirun to execute as root if two
envars are set to specific values

Per conversation with @jsquyres, name the envars OMPI_ALLOW_RUN_AS_ROOT
and OMPI_ALLOW_RUN_AS_ROOT_CONFIRM

Fixes #4451

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit 7f1444d5f9)
2018-08-24 20:24:22 -07:00
Howard Pritchard
c2cc336135
Merge pull request #5486 from rhc54/cmr40/maps
Cleanup pmix selection and map-by modifiers
2018-08-22 04:40:00 -04:00
Ralph Castain
e27e945d9a Complete job control integration
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-08-20 16:08:54 -07:00
Boris Karasev
8873d901e8 pmix: added check for pmix fence status
Signed-off-by: Boris Karasev <karasev.b@gmail.com>
(cherry picked from commit 57683366ca)

Conflicts:
	opal/mca/common/ucx/common_ucx.c
	opal/mca/common/ucx/common_ucx.h

Modified:
	ompi/mca/pml/ucx/pml_ucx.c
	oshmem/mca/spml/ucx/spml_ucx.c
2018-08-17 21:33:50 +06:00
Ralph Castain
511319c316 Fix the multiple pe/proc option
Things got a little out of whack and we weren't actually processing the map-by modifiers, plus an error crept into the display of the binding report. So clean those up.

Thanks to @tonyreina for the error report

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit bcdb1f45ac)
2018-07-25 19:55:28 -07:00
Ralph Castain
50e6e14020 Protect against infinite loops
Flag that we provided a notification and ignore it if it attempts to come back up.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit ea0d70bc9396def61545e2ce492a55c4c3aa7772)
2018-07-19 13:57:52 -07:00
Ralph Castain
6b6e63a346 Control inheritance of launch directives by child jobs
Do not have child jobs inherit launch directives unless requested to do so. This affects the map-by, rank-by, bind-to, npernode, pernode, npersocket, persocket, and cpus-per-rank directives. Values provided in the spawn call always take precedence - if a particular value isn't specified, then the ORTE defaults will be used if inheritance is not requested, and the values specified by MCA param will be used if inheritance is set.

Always inherit oversubscribe for now as otherwise MTT will break

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-07-10 15:12:05 -07:00
Ralph Castain
c6ee8d2f5c
Merge pull request #5300 from hjelmn/goodbye_oobud
oob/ud: remove as it has bitrotted
2018-07-09 12:40:04 -07:00
Ralph Castain
0ddbc75ce5
Merge pull request #4930 from kizill/fix-ipv6
fixed ipv6 OOB connection problems (fix issue #1585)
2018-06-26 09:13:53 -07:00
Ralph Castain
3b2390e5d5 Silence coverity warnings, remove/ignore build product
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-06-25 08:01:28 -07:00
Jeff Squyres
4603852740 orterun: use consistent CLI option name for --bind-to
Since the new binding option is tied to the --cpu-list orterun CLI
option, make the --bind-to option reflect the same name (vs. the
--cpu-set CLI option, which is entirely different).  For example:

    mpirun --bind-to cpu-list:ordered ...

Note that "--bind-to cpulist:ordered" is accepted as a synonym,
because people will be lazy.

Also add some minor updates to the orterun.1in man page for
clarification.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-06-21 08:22:00 -07:00
Ralph Castain
d2838139e4 Update man and help output for new binding option
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-06-21 06:36:11 -07:00
Ralph Castain
f17d47087a Define a new binding method and qualifier
Allow users to request that procs be bound to a cpu in a given cpu-list based on their corresponding local rank

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-06-20 21:26:09 -07:00
Ralph Castain
97665d44cd Prevent thread log when show_help msgs are emitted
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-06-19 21:07:03 -07:00
Ralph Castain
98b4ed9a3a Fix the no-disconnect test
A race condition exists based on whether or not the userdata object attached to a hwloc_obj_t has been initialized. These objects are setup whenever we scan for resources under that location. You therefore must not set a variable to the pointer to the userdata object and then call a function that will initialize the data in it - you need to set the variable after the function call, and protect against a NULL pointer

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-06-19 13:52:34 -07:00
Nathan Hjelm
845516ca11 oob/ud: remove as it has bitrotted
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-06-19 13:06:26 -06:00
Ralph Castain
cdb3d798f0 Silence Coverity warnings
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-06-19 07:47:50 -07:00
Ralph Castain
f0a0d606a0 Correct accounting for tools
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit 1be080f7b92bad39745f42628a8cb6afefad2d2a)
2018-06-18 13:24:25 -07:00
Ralph Castain
795140e590 Make use of "instant-on" feature optional
The PMIx support for "instant on" remains experimental, so disable it by default. Provide an MCA param and corresponding command line option to enable it at runtime.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-06-17 02:42:00 -07:00
Ralph Castain
fa18ba395d Sync to latest PMIx v3.0rc
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-06-17 02:41:46 -07:00
Ralph Castain
ea21f7175a Silence warnings and remove unused code
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-06-16 17:42:48 -07:00
Ralph Castain
06eb51161f Rename prun ompi-prun
This is a minor abstraction break in naming, but hopefully acceptable for now. I will update the contents of the program a little later. This resolves the immediate issue of naming conflict with the PRRTE binary.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-06-06 07:40:03 -07:00
Ralph Castain
7c0ec7e851 Cleanup warnings in binding code
This still leaves two unresolved warnings:

base/rmaps_base_binding.c:577:22: warning: variable ‘clvm’ set but not used [-Wunused-but-set-variable]
     unsigned clvl=0, clvm=0;
                      ^~~~
base/rmaps_base_binding.c:576:27: warning: variable ‘hwm’ set but not used [-Wunused-but-set-variable]
     hwloc_obj_type_t hwb, hwm;
                           ^~~

The problem is that these values are used in the OPAL_HWLOC_MAKE_OBJ_CACHE macro to form a variable name. Thus, the compiler doesn't recognize the values as being "used". I'm not entirely sure how to resolve it cleanly.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-06-03 11:47:14 -07:00
Brice Goglin
c4dffa1d0f rmaps: simplify the lookup for the binding object and fix for hwloc 2.0
Don't bother doing a lookup upwards or downwards for the target object type.
Just use the target depth, iterate over the level until we find the min_bound
object that intersects the locale cpuset.

Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
2018-05-24 11:53:07 +02:00
Brice Goglin
bd08a6ead9 hwloc: fix hwloc/shmem.h in the external case
Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
2018-05-24 11:53:07 +02:00
Jeff Squyres
af4299ebc5 hwloc: updates for hwloc 2.0.x API
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-05-24 11:53:07 +02:00
Ralph Castain
d2040497b8 Silence Coverity warning
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-04-27 07:30:00 -07:00
Ralph Castain
1e8add52d7 Silence warning
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-04-26 11:45:25 -07:00
Ralph Castain
9ae80596f6 Fix rank-by option and improve npernode/skt
This fixes a problem reported by @bgoglin where rank-by was incorrectly generating values when ranking by a type of object (e.g., socket). It also corrects the handling of the pernode, npernode, and npersocket options - these should only set the #procs and the default mapping pattern. They specifically should not prohibit the user from requesting a different mapping.

Thus, the following should be valid:

mpirun -npernode 2 --map-by socket ...

should put 2 procs on each node, mapping them by-socket on each node.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-04-25 20:35:43 -07:00
Gilles Gouaillardet
a05456ab5e orte: only set the ORTE_NODE_ALIAS attribute when required
When there is no alias for a given node, do not set the
ORTE_NODE_ALIAS attribute to an empty string any more.

Thanks Erico for reporting this issue.
Thanks Ralph for the guidance.

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2018-04-25 11:43:46 +09:00
Gilles Gouaillardet
49d6658f60 Revert "orte_dt_print_node: correctly handle nodes with no alias(es)"
This reverts commit 866f449cff.

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2018-04-25 11:43:35 +09:00
Gilles Gouaillardet
866f449cff orte_dt_print_node: correctly handle nodes with no alias(es)
Thanks Erico for reporting this issue

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2018-04-24 15:44:26 +09:00
Jeff Squyres
3c8d08a9f3
Merge pull request #5078 from jsquyres/pr/slurm-plm-minor-warning-message-update
plm/slurm: slightly improve verbose warning message
2018-04-17 07:59:50 -07:00
Gilles Gouaillardet
4f1cb4747c odls/base: fix support for PMIx < v2.1
wrap opal_pmix.get() around opal_pmix.legacy_get() to support previous PMIx releases.

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2018-04-17 09:59:47 +09:00
Jeff Squyres
5394845ce6 plm/slurm: slightly improve verbose warning message
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-04-16 17:04:12 -07:00
Nathan Hjelm
664ba32435 plm/base: fix typo in variable name
An incorrectly named variable caused all pml variables to disappear
from ompi_info. This commit fixes the typo. We may add some logic into
the MCA base to catch these sorts of things in the future.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-04-10 17:53:16 -06:00
Jeff Squyres
7a6e8cac58 orterun.1in: fix typo
Found via https://github.com/open-mpi/ompi-www/pull/61.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-04-09 14:13:24 -04:00
Ralph Castain
d644f7ee26 Correctly fix the ranking policy
Shorten the loops as much as possible - if someone wants to further optimize, they are welcome to do so.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-03-26 16:06:46 -07:00
Ralph Castain
322f6c5056 Fix a breakage in the ranking system
While it may be faster to reverse the order of the assignment loops, it also results in the wrong answer

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-03-25 15:55:56 -07:00
Ralph Castain
8454fc8a65 Allow oversubscription on managed allocations
Fixes https://github.com/pmix/pmix-reference-server/issues/42

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-03-25 08:37:51 -07:00
Artem Polyakov
77ff99e9ee
Merge pull request #4933 from karasevb/timings_update
timings: added new timing points
2018-03-25 00:10:49 -07:00
Boris Karasev
6afc7099a0 plm/base: fixed the hosts filtering
Reseting the `ORTE_NODE_FLAG_MAPPED` flag after hosts filtering, this
flag is used subsequently and can be affect to the node mapping logic

Signed-off-by: Boris Karasev <karasev.b@gmail.com>
2018-03-23 09:41:16 +03:00
Jeff Squyres
0f8077ace6 oob/tcp: add show_help message about version mismatch
Be more explicit about version mismatch between ORTE processes.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-03-21 20:18:28 -07:00
Boris Karasev
3796307a57 timings: added new timing points
Signed-off-by: Boris Karasev <karasev.b@gmail.com>
2018-03-21 05:16:25 +02:00
Stanislav Kirillov
86061fbf8d
fixed ipv6 OOB connection problems
Signed-off-by: Stanislav Kirillov <staskirillof@yandex.ru>
2018-03-20 16:07:53 +00:00
Aurelien Bouteiller
e08e580e27
Merge pull request #4916 from abouteiller/topic/scaling.pl-m
Scaling.pl: Fix Srun options and wait for DVM launch
2018-03-15 22:06:01 -04:00
Aurélien Bouteiller
9e23d24bb4
Scaling.pl: Fix Srun options and wait for DVM launch
Flush out the DVM ready notice on stdout

Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>
2018-03-15 00:00:49 -04:00
Joshua Hursey
ccb4f43c9b Fix MPIR_proctable structure visibility
* The `MPIR_PROCDESC` structure needs to be visible even in optimized
   builds so that debuggers can attach to `mpirun` and properly read the
   `MPIR_proctable`.
 * In the v2.0.x and v2.x series this structure resided in the `orterun`
   directory and included the `CFLAGS` fix included here. This code
   moved in the v3.x series and the `CFLAGS` did not move causing this
   issue.
   - Instead of applying the debug `CFLAGS` globally to libopen-rte,
     only apply them to the `orted_submit.c` compile which contains the
     MPIR symbols.

Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
2018-03-09 21:15:28 -05:00
Ralph Castain
2f85db9791 Always register the nspace for jobs
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-03-02 02:00:31 -08:00
Ralph Castain
7241043809 Modify the internal logic for resolve nodes/peers
The current code path for PMIx_Resolve_peers and PMIx_Resolve_nodes executes a threadshift in the preg components themselves. This is done to ensure thread safety when called from the user level. However, it causes thread-stall when someone attempts to call the regex functions from _inside_ the PMIx code base should the call occur from within an event.

Accordingly, move the threadshift to the client-level functions and make the preg components just execute their algorithms. Create a new pnet/test component to verify that the prge code can be safely accessed - set that component to be selected only when the user directly specifies it. The new component will be used to validate various logical extensions during development, and can then be discarded.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit 456ac7f7af3d9ba09888e3c899eb001daaa24aef)
2018-03-02 02:00:31 -08:00
Ralph Castain
17c40f4cea Implement support for proctable queries
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-03-02 02:00:31 -08:00
Ralph Castain
0434b615b5 Update ORTE to support PMIx v3
This is a point-in-time update that includes support for several new PMIx features, mostly focused on debuggers and "instant on":

* initial prototype support for PMIx-based debuggers. For the moment, this is restricted to using the DVM. Supports direct launch of apps under debugger control, and indirect launch using prun as the intermediate launcher. Includes ability for debuggers to control the environment of both the launcher and the spawned app procs. Work continues on completing support for indirect launch

* IO forwarding for tools. Output of apps launched under tool control is directed to the tool and output there - includes support for XML formatting and output to files. Stdin can be forwarded from the tool to apps, but this hasn't been implemented in ORTE yet.

* Fabric integration for "instant on". Enable collection of network "blobs" to be delivered to network libraries on compute nodes prior to local proc spawn. Infrastructure is in place - implementation will come later.

* Harvesting and forwarding of envars. Enable network plugins to harvest envars and include them in the launch msg for setting the environment prior to local proc spawn. Currently, only OmniPath is supported. PMIx MCA params control which envars are included, and also allows envars to be excluded.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-03-02 02:00:31 -08:00
Scott Miller
d7e594fcff Fix PATH and LD_LIBRARY_PATH prefixing to use first app context value for ORTE_APP_PREFIX_DIR
Signed-off-by: Scott Miller <scott.miller1@ibm.com>
2018-02-28 18:41:47 -05:00
Gilles Gouaillardet
02b97146de orted_submit: fix the --oversubscribe option
do set the ORTE_MAPPING_SUBSCRIBE_GIVEN directive when --oversubscribe is used

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2018-02-20 17:15:12 +09:00
Artem Polyakov
7333f128f6
Merge pull request #4815 from artpol84/slurm/plm_fix
plm/slurm:
2018-02-15 12:45:43 -08:00
Gilles Gouaillardet
dd24c746dc output-filename: cleanup obsolete code.
Since output-filename has been moved to a per-job attribute,
remove the orte_output_filename global variable, and stop passing
this option to orted.

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2018-02-15 10:40:44 +09:00
Artem Polyakov
ab8bb4b0a3 plm/slurm:
Sync command line output for Slurm with RSH launcher.
Currently Slurm launch cmdline will only be visible in debug mode, while for RSH
it is enabled always.
cmdline makes sense for troubleshooting and should be enabled.

Signed-off-by: Artem Polyakov <artpol84@gmail.com>
2018-02-15 03:09:18 +07:00
Ralph Castain
af07b3df89 Update help and man pages for output-filename
Warn that relative path will be converted to absolute path, meaning that the file system on remote nodes must be the same as on the node where mpirun is executed.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-02-13 15:33:33 -08:00
Ralph Castain
f5c3239290 Ensure that output-filename is passed as an absolute path
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-02-13 07:28:42 -08:00
Ralph Castain
cb221b6f6f Correct mapping errors
Since we now support the dynamic addition of hosts to the orte_node_pool, there is no longer any reason to require advanced specification of all possible nodes. Instead, use a precedence method to initially allocate only those hosts that were specified in the cmd line:

* rankfile, if given, as that will specify the nodes

* -host, aggregated across all app_contexts

* -hostfile, aggregated across all app_contexts

* default hostfile

* assign local node

Fix slots_inuse accounting so that the nodes are correctly reset upon error termination - e.g., when oversubscribed without permission.

Ensure we accurately track the user's specified desires for oversubscribe and no-use-local when dynamically spawning jobs.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit c9b3e68ce596a68a2ed2fbf73f211b3334b0a6a8)
2018-02-07 11:29:21 -08:00
Ralph Castain
ce901ba247 Ensure we fail if remote nodes cannot find executable
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-02-05 19:31:43 -08:00
Ralph Castain
10be1df1d3 Remove debug and add target/probe programs
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit 9a03007115fc8978f4eb5fd938c05b26adbd433e)
2018-02-03 20:06:18 -08:00
Ralph Castain
9fe8153d38 Sync to IOF branch and continue fix of request for job info from unknown nspace
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit 02400d30d79ce3c7e7e28f9a08f7062a5b6f4c51)
2018-02-03 19:56:35 -08:00
Ralph Castain
73ef976ead Silence warnings
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-02-03 00:29:06 -08:00
Boris Karasev
52e81ee4b1 rmaps: fixed the ordering of mpirun target nodes
Fixed the desync of job-nodelists between mpirun and orted
daemons. The issue was observed when using RSH launching because user
can provide arbitrary order of nodes regarding HNP placement.
The mpirun process propagate the daemon's nodelist order to nodes.
The problem was that HNP itself is assembling the nodelist based on
user provided order. As the result ranks assignment was calculated
differently on orted and mpirun.

Consider following example:
* User launches mpirun on node cn2.
* Hostlist is cn1,cn2,cn3,cn4; ppn=1
* mpirun is passing hostlist cn[2:2,1,3-4]@0(4) to orteds
So as result mpirun will assing rank 0 on cn1 while orted will assign
rank 0 on cn2 (because orted sees cn2 as the first element in the node
list)

Signed-off-by: Boris Karasev <karasev.b@gmail.com>
2018-02-01 17:16:05 +02:00
Ralph Castain
e284a3e98b
Merge branch 'master' into topic/iof_hnp 2018-01-26 13:55:49 -08:00
Ralph Castain
b643852d8a Properly terminate the job when executable not found
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-01-26 12:09:24 -08:00
Ralph Castain
c166e26265
Merge branch 'master' into topic/iof_hnp 2018-01-26 06:15:58 -08:00
Ralph Castain
e9cd7fd7e6 Update orte
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-01-25 08:53:43 -08:00
Ralph Castain
d1071397ac Update the orte/ess framework
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-01-25 08:43:44 -08:00
Gilles Gouaillardet
54fb8ac5d5 iof: do not release a sink before all read data is written.
When too much data is available on stdin, it might not be
forwarded immediatly to the task (write() might fail with -EAGAIN),
so when stdin is terminated, there might be some remaining data
to be pushed to the task. In this case, delay the release of the sink
so no data is discarded.

Refs open-mpi/ompi#4744

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2018-01-25 16:29:22 +09:00
Gilles Gouaillardet
ebffaded5d iof/base: remove the unused iof_base_input_files MCA parameter
this option was only used by the iof/mr_hnp (aka Map/Reduce)
component that is no more part of master nor v3 branches.

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2018-01-25 11:29:14 +09:00
Ralph Castain
75eb56522c Continue resolving add_host behavior
Fix a problem in packing/unpacking job updates. There remains a race condition that causes messages to attempt to be sent to the second new daemon before it is completely ready. Not entirely sure where it is coming from.

Refs #4665

Rebase to master. Reset orte_nidmap_communicated if hosts are added. Check for duplicate hostnames in an add_host command. Turn off tree_spawn for dynamic launch of additional daemons.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-01-15 08:21:01 -08:00
Ralph Castain
64ba33cb32
Merge pull request #4708 from rhc54/topic/cl4
Restrict MPI apps to cleaning up job-level dirs
2018-01-12 18:47:19 -08:00
Ralph Castain
1cd8e34765 Restrict MPI apps to cleaning up job-level dirs
MPI apps should only cleanup the session directory to the level of their
own job.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2018-01-12 17:14:24 -08:00