1
1
Граф коммитов

595 Коммитов

Автор SHA1 Сообщение Дата
Ralph Castain
1911d74095 Prevent segfault when -debug given to mpirun 2016-05-08 10:19:05 -07:00
Ralph Castain
58dd41facf Repair the processing of cmd line options that mapped to MCA params. This was responsible for breaking things like map-by <foo>.
Remove debug, let orterun send terminate cmd to DVM

Recover the DVM support
2016-05-06 13:14:03 -07:00
Ralph Castain
6ac7929bd0 Extend the schizo framework to allow definition of CLI options by environment. Refactor orterun to mesh with the orted_submit code, thus improving code reuse. Eliminate the orte-submit tool as orterun can now meet that need.
Cleanups per @jjhursey review
2016-05-01 11:30:25 -07:00
Ralph Castain
fac409d094 Ensure the personality gets set for the debugger job launch when attaching 2016-04-28 15:28:55 -07:00
Rainer Keller
ad690a4bc0 Move the help into the proper file: all orte_show_help in
orte/orted/pmix/pmix_server.c reference orterun.
2016-04-07 22:52:23 +02:00
Ralph Castain
4a623778a9 Fix the debugger attach - previous commit had fixed one instance of a check prior to sending the release message, but there was a second code path that included a similar check that was missed. Thanks to John DelSignore for spotting it! 2016-03-23 08:25:25 -07:00
Ralph Castain
c146c4969b Revert part of open-mpi/ompi@c1bbbb5e2f to restore the usock component, thus fixing show_help aggregation.
Fixes #1467

Restore debugger attach operations

Fixes #1225
2016-03-18 21:49:04 -07:00
Ralph Castain
8f410d7897 Revert one part of open-mpi/ompi@4d0cc27eb7 2016-03-18 07:23:30 -07:00
Ralph Castain
d72c1c72ff Do not push child processes into separate process groups so that any host RM can still "see" them, and ensure that any signal sent to the orted's themselves will be provided to all child processes. Forward all signals from mpirun to the child processes, removing the old MCA parameter required to turn that behavior "on". 2016-03-06 17:55:09 -08:00
Ralph Castain
4d0cc27eb7 Update the singularity support to match that of the latest singularity master. Remove the restriction on shared memory components by instructing singularity to not isolate the PID space. Add a new schizo API to allow setting up the original app_context. Ensure the container is installed prior to execution. 2016-03-05 21:47:42 -08:00
Ralph Castain
4a55fba414 Fix registration of error handlers thru the pmix120 component. A thread-shift operation was hanging on the sync_event_base, which made it dependent on someone calling opal_progress. Unfortunately, a process in "sleep" or spinning outside the MPI library won't do that, and so we never complete errhandler registration. 2016-03-02 15:01:01 -08:00
Ralph Castain
011403c04a Fix a number of issues, some of which have lingered for a long time:
* provide a more reliable way of determining that a process is a singleton by leveraging the schizo framework. Add new components for slurm, alps, and orte to detect when we are in a managed environment, and if we have been launched by mpirun or a native launcher. Set the correct envars to control ess and pmix selection in each case.

* change the relative priority of the pmix120 and pmix112 components to make pmix120 the default

* fix singleton comm-spawn by correctly setting the num_apps field of the orte_job_t created by the daemon - this fixes a segfault in register_nspace on newly created daemons

* ensure orterun doesn't propagate any ess or pmix directives in its environment

* Cleanup a few valgrind issues and memory leaks

* Fix a race condition that prevented the client from completing notification registrations (missing thread shift)

* Ensure the shizo/alps component detects launch by mpirun
2016-03-01 06:53:00 -08:00
Ralph Castain
d653cf2847 Convert the orte_job_data pointer array to a hash table so it doesn't grow forever as we run lots and lots of jobs in the persistent DVM. 2016-02-21 11:55:49 -08:00
Nathan Hjelm
69de442136 orterun: allow DDT if options contain :'s
There is a bug in MPMD detection that disables totalview if a : is
found anywhere on the command line. This includes inside an argument
option or MCA variable value. This commit changes the check to look
for the string " : " instead of the character : which should eliminate
the issue in most cases.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-02-18 16:56:08 -07:00
Ralph Castain
60a7bc2e50 Enable the PMIx notification callback system. This currently is only supported by the pmix120 component, which is not selected by default. All other components will ignore error registration requests, and thus do not support debugger attach when launched via mpirun. Note that direct launched applications will support such attachment, but may not do so in a scalable fashion.
Fixes ##1225
2016-02-18 09:29:12 -08:00
Ralph Castain
8f9508cace Further enhance the support for Singularity containers. Extend the "personality" command-line option to allow specifying both model (e.g., "ompi") and container (e.g., "singularity"), and add the necessary logic to support multiple options. Add a new pmix "isolated" component to handle singletons where no HNP is available since containers cannot launch the HNP. 2016-02-17 13:33:06 -08:00
Ralph Castain
50431001a3 Modify the IOF subsystem to handle per-job directives for redirecting IO to files, tagging IO, and timestamping IO.
Fix stdin reader
2016-02-16 18:54:38 -08:00
Ralph Castain
06c3dfc052 Refactor the ORTE DVM code so that external codes can submit multiple jobs using only a single connection to the HNP.
* Clean up the DVM so it continues to run even when applications error out and we would ordinarily abort the daemons.
* Create a new errmgr component for the DVM to handle the differences.
* Cleanup the DVM state component.
* Add ORTE bindings directory and brief README
* Pass a local tool index around to match jobs.
* Pass the jobid on job completion.
* Fix initialization logic.
* Add framework for python wrapper.
* Fix terminate-with-non-zero-exit behavior so it properly terminates only the indicated procs, notifies orte-submit, and orte-dvm continues executing.
* Add some missing options to orte-dvm
* Fix a bug in -host processing that caused us to ignore the #slots designator. Add a new attribute to indicate "do not expand the DVM" when submitting job spawn requests.
* It actually makes no sense that we treat the termination of all children differently than terminating the children of a specific job - it only creates confusion over the difference in behavior. So terminate children the same way regardless.

Extend the cmd_line utility to easily allow layering of command line definitions

Catch up with ORTE interface change and make build more generic.

Disable "fixed dvm" logic for now.

Add another cmd_line function to merge a table of cmd line options with another one, reporting as errors any duplicate entries. Use this to allow orterun to reuse the orted_submit code

Fix the "fixed_dvm" logic by ensuring we reset num_new_daemons to zero. Also ensure that the nidmap is sent with the first job so the downstream daemons get the node info. Remove a duplicate cmd line entry in orterun.

Revise the DVM startup procedure to pass the nidmap only once, at the startup of the DVM. This reduces the overhead on each job launch and ensures that the nidmap doesn't get overwritten.

Add new commands to get_orted_comm_cmd_str().

Move ORTE command line options to orte_globals.[ch].

Catch up with extra orte_submit_init parameter.

Add example code.

Add documentation.

Bump version.

The nidmap and routing data must be updated prior to propagating the xcast or else the xcast will fail.

Fix the return code so it is something more expected when an error occurs. Ensure we get an error returned to us when we fail to launch for some reason. In this case, we will always get a launch_cb as we did indeed attempt to spawn it. The error code will be returned in the complete_cb.

Fix the return code from orte_submit_job - it was returning the tracker index instead of "success". Take advantage of ORTE's pretty-print capabilities to provide a nice error output explaining why we failed to launch. Ensure we always get a launch_cb when we fail to launch, but no complete_cb as the job never launched.

Extend the error reporting capability to job completion as well.

Add index parameter to orte_submit_job().

Add orte_job_cancel and implement ORTE_DAEMON_TERMINATE_JOB_CMD.

Factor out dvm termination.

Parse the terminate option at tool level.

Add error string for ORTE_ERR_JOB_CANCELLED.

Add some safeguards.

Cleanup and/of comments.

Enable the return.

Properly ORTE_DECLSPEC orte_submit_halt.

Add orte_submit_halt and orte_submit_cancel to interface.

Use the plm interface to terminate the job
2016-02-13 08:10:44 -08:00
Ralph Castain
5e5adebf8e Port the changes from #782 to the master. Not everything applies here as the code in the 1.10 series is a little different. In addition, we asked for a few changes (e.g., using MPI_ERR_ARG instead of "13") that are incorporated here.
Thanks to @jsharpe for the PR
2015-12-12 12:40:34 -08:00
Dimitar Pashov
9f6e306064 change -0bind-to and -bind-to to --bind-to in the manpages 2015-11-10 17:44:53 +00:00
Ralph Castain
186c18be0e Add missing cmd line options to mpirun man page, update NEWS to contain that change 2015-11-01 09:19:08 -08:00
Ralph Castain
8bfbe7f16c Add a new MCA parameter for default_dash_host to offer a mirror of the default_hostfile 2015-10-31 19:09:54 -07:00
Igor Ivanov
d379873443 oshmem: Add man.1 pages for oshmem tools
This changes add man pages for oshrun, oshcc and oshfort as well as
depricated shmemrun, shmemcc and shmemfort.
2015-10-05 15:41:28 +03:00
Ralph Castain
f872e99315 Fix orte-submit so it allows application procs to select the correct ess component. Protect orte_data_server from multiple calls to finalize. 2015-09-21 20:31:57 -07:00
rhc54
665b30376a Merge pull request #868 from rhc54/topic/hwloc
Remove OPAL_HAVE_HWLOC qualifier and error out if --without-hwloc is given
2015-09-04 17:58:07 -07:00
Ralph Castain
d97bc29102 Remove OPAL_HAVE_HWLOC qualifier and error out if --without-hwloc is given 2015-09-04 16:54:40 -07:00
Ralph Castain
f6948c2bb4 Sync with PMIx master 43e45c3. Get multi-node publish/lookup/unpublish working 2015-09-04 10:07:17 -07:00
Nathan Hjelm
4d92c9989e more c99 updates
This commit does two things. It removes checks for C99 required
headers (stdlib.h, string.h, signal.h, etc). Additionally it removes
definitions for required C99 types (intptr_t, int64_t, int32_t, etc).

Signed-off-by: Nathan Hjelm <hjelmn@me.com>
2015-06-25 10:14:13 -06:00
Ralph Castain
869041f770 Purge whitespace from the repo 2015-06-23 20:59:57 -07:00
Gilles Gouaillardet
2e384a3b65 initialize common symbols from orte
A few uninitialized common symbols are remaining (generated by flex) :
 * orte/mca/rmaps/rank_file/rmaps_rank_file_lex.c: orte_rmaps_rank_file_leng
 * orte/mca/rmaps/rank_file/rmaps_rank_file_lex.c: orte_rmaps_rank_file_text
 * orte/util/hostfile/hostfile_lex.c: orte_util_hostfile_leng
 * orte/util/hostfile/hostfile_lex.c: orte_util_hostfile_text
2015-05-08 10:11:58 +09:00
Ralph Castain
0e23f76eee Fix comment 2015-04-15 20:09:14 -06:00
Jeff Squyres
99754afd25 orterun.c: re-justify the output message text
The type-A personality / english lit major in me compells me to
re-justify the text.  :-)
2015-04-01 10:57:23 -07:00
Mike Dubman
58d002098b Merge pull request #474 from elenash/master
Introduce -tune command line option to set env vars and mca params from ...
2015-04-01 08:23:34 +03:00
Ralph Castain
41dd65d6cd Per Jeff's request, tone down the comments and "standardize" the warning 2015-03-31 17:54:54 -07:00
Ralph Castain
f863147b05 Per the telecon and chat with Jeff, let root only do the version option without warning. Otherwise, require that the user specifically indicate allow-use-as-root 2015-03-31 10:34:35 -07:00
Ralph Castain
2f365720b0 Allow root to request the version and help from mpirun without having to override the run-as-root protection.
Thanks to Robert McLay for pointing this out
2015-03-28 08:17:44 -07:00
Elena
90f5b2bb84 Introduce -tune command line option to set env vars and mca params from file 2015-03-26 18:33:53 +02:00
Gilles Gouaillardet
4c0eb11e08 orterun: fix misc errors
as reported by Coverity with CIDs 70700, 71039, 710651
2015-03-09 11:57:18 +09:00
Gilles Gouaillardet
4e7b5240e4 orte/tools: fix misc memory leaks
as reported by Coverity with CIDs 70700, 71039, 71854, 72384 and 710651
2015-03-05 14:06:18 +09:00
Jeff Squyres
4f54fedf05 orterun: ensure to set used_num_procs=true after finding that token
This was CID 71687.
2015-02-24 15:25:39 -05:00
Ralph Castain
ec5ccb76cf Enable persistent ORTE DVM so users can execute multiple OMPI jobs within an allocation without restarting the DVM every time. 2015-01-30 11:00:43 -08:00
Ralph Castain
028b00154d Complete implementation of the schizo framework to support OMPI component 2015-01-27 09:29:42 -06:00
Mike Dubman
f83d6045aa ORTE: undeprecate -x var=val in mpirun
mpirun -x var=val is back, actually it is useful alias for -mca mca_base_env_list "var=val"
2014-11-12 10:51:15 +02:00
Ralph Castain
d0704ef118 Restore handling of physical processors in rankfiles. Note that the prior implementation was likely incorrect as it falsely assumed that physical core indices were unique, which isn't always true. Stipulate that physical rankfiles can only include PU numbers, and bind the result to the core that contains that physical PU. Update the mpirun man page to cover the new use-case. 2014-11-10 14:00:40 -08:00
Ralph Castain
738c3e1d72 Ensure that mpirun correctly selects the HNP ess component without attempting to init the PMI subsystem as mpirun won't be supported anyway, so let's avoid the error message. Also, daemons launched by the plm/slurm component must use the ess/slurm module as we cannot trust the Slurm PMI_Init functions to correctly tell us when PMI support is available. 2014-11-03 21:35:42 -08:00
Ralph Castain
894acb0aa8 configury: new OPAL_SET_MCA_PREFIX/ORTE_SET_MCA_CMD_LINE_ID macros
These two macros set the MCA prefix and MCA cmd line id,
   respectively.  Specifically, MCA parameters will be named
   PREFIX<foo> in the environment, and the cmd line will use
   -ID foo bar.

   These macros must be called during configure.ac and a value
   supplied. In the case of Open MPI, the values given are
   PREFIX=OMPI_MCA_ and ID=mca.

   Other projects (such as ORCM) will call these macros with
   their own unique values.  For example, ORCM uses PREFIX=ORCM_MCA_
   and ID=omca

   This scheme is necessary to allow running Open MPI applications under
   systems that use their own versions of ORTE and OPAL.  For example,
   when running OMPI applications under ORCM, we need the MCA params passed
   to the ORCM daemons to be separated from those recognized by the OMPI application.
2014-10-22 18:57:40 -07:00
Jeff Squyres
c22e1ae33b configury: new OPAL_SET_LIB_PREFIX/ORTE_SET_LIB_PREFIX macros
These two macros set the prefix for the OPAL and ORTE libraries,
respectively.  Specifically, the OPAL library will be named
libPREFIXopen-pal.la and the ORTE library will be named
libPREFIXopen-rte.la.

These macros must be called, even if the prefix argument is empty.

The intent is that Open MPI will call these macros with an empty
prefix, but other projects (such as ORCM) will call these macros with
a non-empty prefix.  For example, ORCM libraries can be named
liborcm-open-pal.la and liborcm-open-rte.la.

This scheme is necessary to allow running Open MPI applications under
systems that use their own versions of ORTE and OPAL.  For example,
when running MPI applications under ORTE, if the ORTE and OPAL
libraries between OMPI and ORCM are not identical (which, because they
are released at different times, are likely to be different), we need
to ensure that the OMPI applications link against their ORTE and OPAL
libraries, but the ORCM executables link against their ORTE and OPAL
libraries.
2014-10-22 10:32:19 -07:00
Jeff Squyres
01fd96bfa5 Revert "Provide a mechanism by which an upstream project can rename
the OPAL and ORTE libraries. This is required by projects such as ORCM
that have their own ORTE and OPAL libraries in order to avoid library
confusion. By renaming their version of the libraries, the OMPI
applications can correctly dynamically load the correct one for their
build."

This reverts commit 63f619f871.
2014-10-22 10:32:11 -07:00
Jeff Squyres
206eade32c mpirun.1in: whitespace cleanup
Whitespace cleanup only; no content changes.
2014-10-20 05:18:25 -07:00
Jeff Squyres
9529289319 mpirun.1in: more updates about binding/etc.
Follow on to 91e9686 and f9d620e.
2014-10-20 05:17:49 -07:00