
27 Commits

Author SHA1 Message Date
Jason Williams
3d8ddbc136 Adding changes for issue #6303 for branch v4.0.x.
Signed-off-by: Jason Williams <uberlinuxguy@gmail.com>

(cherry picked from commit 98d81a5f7a619d5a19615297a6fe8a18d8e3781c)
2019-01-29 08:46:59 -05:00
Nathan Hjelm
4eeb41506c odls/alps: resolve hang when launching with mpirun on Crays
This commit removes some code that protected the odls/alps component
from closing alps file descriptors. For some unknown reason leaving
these file descriptors open can cause an orted to hang when
launching apps.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
(cherry picked from commit 98172163e6af7ae1bf510b8f31ec97fdb497eaf1)
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2018-08-29 09:22:37 -06:00
Ralph Castain
e3c308dfc8 Update the odls/alps component
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-11-25 19:51:07 -08:00
Gilles Gouaillardet
e88767866e iof: optimize handling of stderr when iof_base_redirect_app_stderr_to_stdout is set
avoid creating a pipe for a task stderr when we know it will be redirected to stdout

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2017-11-24 13:20:03 +09:00
Gilles Gouaillardet
84e96522f2 iof: optimize handling of stdin
since some tasks might end up having /dev/null as their stdin,
simply avoid pipe creation and destruction for these tasks.

From a pragmatic and MPI point of view, and unless explicitly required
otherwise, all MPI tasks but (the first) one end up with /dev/null
as their stdin.

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2017-11-22 13:18:32 +09:00
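
The stdin optimization described in this commit can be illustrated with a small sketch. This is not the ORTE iof code (which is C); it is a Python illustration, and the names `launch_task`, `run_tasks`, and `stdin_rank` are hypothetical. Only the task chosen to receive stdin gets a pipe; every other task is attached directly to /dev/null, so no pipe is created or torn down for it.

```python
import subprocess
import sys


def launch_task(rank, argv, stdin_rank=0):
    """Launch one task; only `stdin_rank` gets a stdin pipe (hypothetical helper)."""
    if rank == stdin_rank:
        stdin = subprocess.PIPE      # this task will be fed real input
    else:
        stdin = subprocess.DEVNULL   # no pipe: stdin is /dev/null from the start
    return subprocess.Popen(argv, stdin=stdin, stdout=subprocess.PIPE)


def run_tasks(ntasks, data):
    """Start `ntasks` echo tasks, feed `data` to rank 0, return each task's output."""
    echo = [sys.executable, "-c",
            "import sys; sys.stdout.write(sys.stdin.read())"]
    procs = [launch_task(rank, echo) for rank in range(ntasks)]
    outputs = []
    for rank, proc in enumerate(procs):
        out, _ = proc.communicate(data if rank == 0 else None)
        outputs.append(out)
    return outputs
```

Tasks other than rank 0 see end-of-file immediately when they read stdin, which matches the behavior of reading from /dev/null.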
Gilles Gouaillardet
16fc0996e6 odls: fix handling of the orte fork agent
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2017-05-08 16:07:13 +09:00
Ralph Castain
1585854335 Minor coverity cleanups
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-04-12 19:31:35 -07:00
Mark Santcroos
27fa8aabd6 Hardcode basename to "orted" for error reporting.
Signed-off-by: Mark Santcroos <mark.santcroos@rutgers.edu>
2017-04-11 18:59:23 -04:00
Mark Santcroos
af3a6e1a29 Verify that the chdir(2) succeeds.
Signed-off-by: Mark Santcroos <mark.santcroos@rutgers.edu>
2017-04-12 00:37:37 +02:00
Mark Santcroos
36ac54b5d8 Bring ALPS ODLS up to par regarding wdir.
Signed-off-by: Mark Santcroos <mark.santcroos@rutgers.edu>
2017-04-10 08:15:07 -04:00
Ralph Castain
74fd2c30af Cleanup alps odls module
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-03-21 17:41:11 -06:00
Ralph Castain
75684dc260 Resolve a race condition for setting our working directory when fork/exec'ing application procs. We have to ensure we do it after the fork occurs since we want to use multiple threads in the odls. Otherwise, the different threads are bouncing the entire process around.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-03-21 13:54:03 -07:00
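
The race described in this commit comes from the working directory being a per-process property: if a launcher thread calls chdir before forking, every other thread in the launcher is moved as well. A minimal Python sketch of the pattern (not the ORTE code; `run_in_dir` is a hypothetical name) changes directory only inside the child and reports failure through the exit status, which also reflects the later "Verify that the chdir(2) succeeds" commit:

```python
import os


def run_in_dir(wdir):
    """Fork a child, chdir only in the child, and report the child's cwd.

    Returns (child_cwd, exit_code). The parent's working directory is
    never touched, so other launcher threads are unaffected.
    """
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: change directory after the fork and check the result.
        status = 0
        try:
            os.close(read_fd)
            os.chdir(wdir)
            os.write(write_fd, os.getcwd().encode())
        except OSError:
            status = 1  # chdir failed; a real launcher would report this
        finally:
            os._exit(status)
    # Parent: collect what the child reported, then reap it.
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as reader:
        child_cwd = reader.read().decode()
    _, wait_status = os.waitpid(pid, 0)
    return child_cwd, os.WEXITSTATUS(wait_status)
```

A real launcher would exec the application after the chdir; the sketch stops at reporting the directory so the effect is easy to observe.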
Ralph Castain
6d6bc9bd07 Update alps module to new APIs
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-03-12 09:43:07 -07:00
Jeff Squyres
fec519a793 hwloc: rename opal/mca/hwloc/hwloc.h -> hwloc-internal.h
Per a prior commit, the presence of "hwloc.h" can cause ambiguity when
using --with-hwloc=external (i.e., whether to include
opal/mca/hwloc/hwloc.h or whether to include the system-installed
hwloc.h).

This commit:

1. Renames opal/mca/hwloc/hwloc.h to hwloc-internal.h.
2. Adds opal/mca/hwloc/autogen.options to tell autogen.pl to expect to
   find hwloc-internal.h (instead of hwloc.h) in opal/mca/hwloc.
3. s@opal/mca/hwloc/hwloc.h@opal/mca/hwloc/hwloc-internal.h@g in the
   rest of the code base.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2017-02-28 07:48:42 -08:00
Ralph Castain
0ba9572f9f Cleanup the forced termination a bit by restoring the delay before issuing the sigkill, and eliminating the large time loss spent checking if the proc died. The latter is responsible for a large number of test timeouts in MTT
Update alps component
2016-06-02 17:48:21 -07:00
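
The general pattern behind this commit, asking a process to stop, allowing a bounded delay, and only then sending SIGKILL, can be sketched as follows. This is a Python illustration of the idea, not ORTE's implementation; `terminate_with_delay` and its `delay` parameter are hypothetical.

```python
import signal
import subprocess
import sys


def terminate_with_delay(proc, delay=1.0):
    """Send SIGTERM, allow up to `delay` seconds, then SIGKILL if still alive.

    Returns the process return code (negative signal number if it was
    killed by a signal).
    """
    proc.send_signal(signal.SIGTERM)
    try:
        # Returns as soon as the child exits, so no time is lost when the
        # process honors SIGTERM promptly.
        proc.wait(timeout=delay)
    except subprocess.TimeoutExpired:
        proc.kill()   # SIGKILL cannot be caught or ignored
        proc.wait()
    return proc.returncode
```

For example, `terminate_with_delay(subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"]))` stops the sleeping child with SIGTERM alone; SIGKILL is only sent to a process that ignores the first signal.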
Ralph Castain
d72c1c72ff Do not push child processes into separate process groups so that any host RM can still "see" them, and ensure that any signal sent to the orted's themselves will be provided to all child processes. Forward all signals from mpirun to the child processes, removing the old MCA parameter required to turn that behavior "on".
2016-03-06 17:55:09 -08:00
Ralph Castain
06c3dfc052 Refactor the ORTE DVM code so that external codes can submit multiple jobs using only a single connection to the HNP.
* Clean up the DVM so it continues to run even when applications error out and we would ordinarily abort the daemons.
* Create a new errmgr component for the DVM to handle the differences.
* Cleanup the DVM state component.
* Add ORTE bindings directory and brief README
* Pass a local tool index around to match jobs.
* Pass the jobid on job completion.
* Fix initialization logic.
* Add framework for python wrapper.
* Fix terminate-with-non-zero-exit behavior so it properly terminates only the indicated procs, notifies orte-submit, and orte-dvm continues executing.
* Add some missing options to orte-dvm
* Fix a bug in -host processing that caused us to ignore the #slots designator. Add a new attribute to indicate "do not expand the DVM" when submitting job spawn requests.
* It actually makes no sense that we treat the termination of all children differently than terminating the children of a specific job - it only creates confusion over the difference in behavior. So terminate children the same way regardless.

Extend the cmd_line utility to easily allow layering of command line definitions

Catch up with ORTE interface change and make build more generic.

Disable "fixed dvm" logic for now.

Add another cmd_line function to merge a table of cmd line options with another one, reporting as errors any duplicate entries. Use this to allow orterun to reuse the orted_submit code

Fix the "fixed_dvm" logic by ensuring we reset num_new_daemons to zero. Also ensure that the nidmap is sent with the first job so the downstream daemons get the node info. Remove a duplicate cmd line entry in orterun.

Revise the DVM startup procedure to pass the nidmap only once, at the startup of the DVM. This reduces the overhead on each job launch and ensures that the nidmap doesn't get overwritten.

Add new commands to get_orted_comm_cmd_str().

Move ORTE command line options to orte_globals.[ch].

Catch up with extra orte_submit_init parameter.

Add example code.

Add documentation.

Bump version.

The nidmap and routing data must be updated prior to propagating the xcast or else the xcast will fail.

Fix the return code so it is something more expected when an error occurs. Ensure we get an error returned to us when we fail to launch for some reason. In this case, we will always get a launch_cb as we did indeed attempt to spawn it. The error code will be returned in the complete_cb.

Fix the return code from orte_submit_job - it was returning the tracker index instead of "success". Take advantage of ORTE's pretty-print capabilities to provide a nice error output explaining why we failed to launch. Ensure we always get a launch_cb when we fail to launch, but no complete_cb as the job never launched.

Extend the error reporting capability to job completion as well.

Add index parameter to orte_submit_job().

Add orte_job_cancel and implement ORTE_DAEMON_TERMINATE_JOB_CMD.

Factor out dvm termination.

Parse the terminate option at tool level.

Add error string for ORTE_ERR_JOB_CANCELLED.

Add some safeguards.

Cleanup and/of comments.

Enable the return.

Properly ORTE_DECLSPEC orte_submit_halt.

Add orte_submit_halt and orte_submit_cancel to interface.

Use the plm interface to terminate the job
2016-02-13 08:10:44 -08:00
Mark Santcroos
5ec2b4d98c Fix some messages in the process.
2015-11-09 18:03:26 -05:00
Howard Pritchard
d899320574 odls/alps: close the directory
Close the /proc/self/fd dir after checking for open fds.

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2015-10-06 11:13:44 -07:00
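
The fd handling touched by this commit and the one listed after it (scan /proc/self/fd, close the directory, then close only the descriptors that are actually open) can be sketched in Python. This is an illustration under assumptions, not the component's C code: it is Linux-specific, and `list_open_fds`, `close_fds_except`, and the `keep` set are hypothetical names.

```python
import os


def list_open_fds():
    """Return the fds open in this process, read from /proc/self/fd (Linux)."""
    with os.scandir("/proc/self/fd") as entries:
        fds = [int(entry.name) for entry in entries]
    # The directory handle is closed when the `with` block ends. Its own fd
    # was listed above but no longer exists once this function returns.
    return fds


def close_fds_except(keep):
    """Close every open fd that is not in `keep`; return the fds closed."""
    closed = []
    for fd in list_open_fds():
        if fd in keep:
            continue
        try:
            os.close(fd)
            closed.append(fd)
        except OSError:
            # Already closed, e.g. the fd that was used to scan the directory.
            pass
    return closed
```

A child process would typically call `close_fds_except({0, 1, 2, ...})` after the fork and before exec, passing any descriptors it must keep. Scanning the directory avoids looping over every possible descriptor number up to the process limit.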
Howard Pritchard
a31cc21bea odls/alps: do smarter close of fds in child
Use a modified variant of #907.  Thanks to plesn
for noticing this.

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2015-09-19 14:17:05 -07:00
Nathan Hjelm
4d92c9989e more c99 updates
This commit does two things. It removes checks for C99 required
headers (stdlib.h, string.h, signal.h, etc). Additionally it removes
definitions for required C99 types (intptr_t, int64_t, int32_t, etc).

Signed-off-by: Nathan Hjelm <hjelmn@me.com>
2015-06-25 10:14:13 -06:00
Ralph Castain
869041f770 Purge whitespace from the repo
2015-06-23 20:59:57 -07:00
Howard Pritchard
05325b113e odls/alps: fix busted build for cray.
This commit fixes things broken by commit
ea35e47.

Fixes #616

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2015-06-02 05:10:38 -07:00
Howard Pritchard
d749077e1e odls/alps: make sure PMI env. variables set up
Add call to orte_odls_alps_get_rdma_creds in the
local proc launch step to obtain the Cray RDMA
credentials from the apshepherd, and to set
the PMI env. variables expected by uGNI BTL, etc.
2014-12-03 09:44:17 -07:00
Howard Pritchard
9425ebefae Be more selective about closing fd's for alps/odls
Be more selective about closing fd's for the alps odls
component.  Don't close fd's of pipes set up by the
apshepherd for providing RDMA credentials, etc.

Add an entry to the help file in case
alps_app_lli_pipes returns an error.
2014-11-19 11:21:30 -07:00
Howard Pritchard
ff362c16ce add/update copyrights for alps odls component
2014-11-18 10:16:11 -07:00
Howard Pritchard
dc98b62070 add initial support for an alps odls component
It turns out that the support for Open MPI apps on
Cray was hanging on a thin thread of support when
using the mpirun job launcher.  It just happened that
with a certain set of configuration options things would
work.  This is bound to backfire at some point.

To fix this weakness, as well as to allow for mpirun launched
jobs to benefit from many of the advanced placement features
provided by the Cray Linux Environment (as opposed to the hwloc
only default env of orte), a new odls alps component is introduced.
2014-11-17 14:00:09 -07:00