Ralph Castain
d5fd635efe
Bring forward the debugger-related changes
...
Refs https://github.com/open-mpi/ompi/pull/2425
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-11-29 13:15:20 -08:00
Ralph Castain
9c6c2fa61d
Bring the v2.0.x debugger patch up to the master branch
...
Ensure the personality gets set as specified by user, or defaults to
"ompi"
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-11-18 12:45:45 -08:00
Ralph Castain
649301a3a2
Revise the routed framework to be multi-select so it can support the new conduit system. Update all calls to rml.send* to the new syntax. Define an orte_mgmt_conduit for admin and IOF messages, and an orte_coll_conduit for all collective operations (e.g., xcast, modex, and barrier).
...
Still not completely done as we need a better way of tracking the routed module being used down in the OOB - e.g., when a peer drops connection, we want to remove that route from all conduits that (a) use the OOB and (b) are routed, but we don't want to remove it from an OFI conduit.
2016-10-23 21:52:39 -07:00
Ralph Castain
57114a09ae
Pick up the npernode and npersocket options and include them in the job object
2016-10-17 12:26:21 -07:00
Ralph Castain
5b1484a836
Implement the backend support for process-generated event notification
2016-10-08 09:24:28 -07:00
Ralph Castain
e773c17cf3
Put show_help through the PMIx "log" API. This pushes the show_help output from apps into the pmix thread, thus avoiding conflicts in the RML thread, which should help with thread lock situations.
2016-10-02 16:02:23 -07:00
Gilles Gouaillardet
c7bf9a0ec9
ess/singleton: fix read on the pipe to spawn'ed orted
...
and close the pipe on both ends when it is no longer needed
2016-09-22 14:21:52 +09:00
Gilles Gouaillardet
83399adb3f
singleton: "safe" read/write to the pipe between (spawn'ed) orted and singleton
2016-09-20 14:56:58 +09:00
Gilles Gouaillardet
e7ae6975d0
orted: fix spawn in singleton mode
...
in singleton mode, have the spawn'ed orted invoke orte_pre_condition_transports()
and send the transport key back to the singleton
2016-09-20 14:39:22 +09:00
Gilles Gouaillardet
d84ac9bdc5
orted: remove debug
...
remove debug code that was added by mistake in open-mpi/ompi@eae9d31784
2016-09-19 19:15:42 +09:00
Gilles Gouaillardet
eae9d31784
pre_condition_transports: code cleanup
...
replace hard coded "OMPI_MCA_orte_precondition_transports" environment variable name
with macro'ed OPAL_MCA_PREFIX"orte_precondition_transports"
2016-09-19 13:31:47 +09:00
Artem Polyakov
9eba1b0b75
Merge pull request #2042 from artpol84/pmix_sdirs
...
Several fixes related to session directories:
2016-09-07 14:15:47 +07:00
Gilles Gouaillardet
be41b120d0
orted: plug misc memory leaks
...
as reported by Coverity with CID 1362603 and 1362606
2016-09-07 10:08:44 +09:00
Artem Polyakov
81195ab724
Several fixes related to session directories:
...
* enable OMPI to retrieve paths from RM through PMIx
* cleanups related to tempdirs.
2016-09-05 07:48:44 +03:00
Ralph Castain
4e0788e9ad
Enable PSM to support dynamic processes
...
Fix comm_spawn to correctly reference the actual parent process that requested the spawn when looking for the parent job object
2016-09-02 10:22:04 -07:00
Ralph Castain
0ea1cff733
Implement notification of completion on comm_spawn'd child jobs. Add a configure flag to enable PMIx 3's shared memory datastore, and set it to disabled by default so that comm_spawn functions again. Will reverse the default once that feature is fully functional.
2016-09-01 13:10:10 -07:00
Ralph Castain
c1050bc01e
Provide a mechanism for obtaining memory profiles of daemons and application processes for use in studying our memory footprint. Setting OMPI_MEMPROFILE=N causes mpirun to set a timer for N seconds. When the timer fires, mpirun will query each daemon in the job to report its own memory usage plus the average memory usage of its child processes. The Proportional Set Size (PSS) is used for this purpose.
2016-08-31 09:32:07 -07:00
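On Linux, the PSS figure referred to in the commit above can be obtained from /proc/<pid>/smaps, which reports one "Pss:" line per mapping, in kB. The following is a minimal Python sketch of that summation. It is not the ORTE code; the function name total_pss_kb and the sample text are hypothetical and only illustrate the file format.

```python
def total_pss_kb(smaps_text: str) -> int:
    """Sum the per-mapping 'Pss:' values (in kB) from smaps-formatted text."""
    total = 0
    for line in smaps_text.splitlines():
        # Matching lines look like: "Pss:                  12 kB".
        # Related fields such as "SwapPss:" or "Pss_Anon:" are not matched.
        if line.startswith("Pss:"):
            fields = line.split()
            total += int(fields[1])
    return total


# Hypothetical excerpt in the /proc/<pid>/smaps format, for illustration only.
SAMPLE_SMAPS = """\
00400000-0040b000 r-xp 00000000 08:01 1234 /usr/bin/example
Size:                 44 kB
Rss:                  40 kB
Pss:                  12 kB
7f0000000000-7f0000021000 rw-p 00000000 00:00 0
Size:                132 kB
Rss:                   8 kB
Pss:                   8 kB
"""

print(total_pss_kb(SAMPLE_SMAPS))
```

To inspect a live process, the same function could be applied to the contents of /proc/self/smaps, assuming a Linux kernel that exposes that file.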
Ralph Castain
2f6e0fec90
Provide the number of nodes in the job
2016-08-26 14:50:41 -07:00
Ralph Castain
ae2af61ee3
Update the session dir structure. Restore the creation of a top-level dir based on userid so that everything is contained under the user's top-level dir. Make the next level down (the "job family" level) be either the pid (indicated by a name of "pid.N") or the job family if not launched by mpirun. This allows for proper rendezvous by direct-launched procs.
2016-08-15 22:46:46 -05:00
Ralph Castain
be8424b691
Provide backward compatible keys so that the non-PMIx components in the opal/pmix framework don't have to adjust as we continue to work on finalizing the PMIx reference scheme. Activate and utilize the new PMIx show_help capability to provide more meaningful error output when the server cannot start.
...
Add a contrib script to clean up permissions incorrectly modified due to things like smb mounts
2016-08-13 12:13:04 -07:00
Ralph Castain
08a0644df5
Fix shared memory rendezvous
2016-08-13 08:14:50 -07:00
Ralph Castain
d4327fd973
The node index isn't normally passed with the packed node object, so we need to set it on the remote end as the orted needs to pass it down to the procs. Refactor the registration code to better package proc-level info - we will separate out the node and app levels in a subsequent change.
2016-08-12 12:06:23 -07:00
Ralph Castain
527b5c692a
Update to include extended tool support, new datatypes
2016-08-08 13:39:46 -07:00
Ralph Castain
16fccd4964
Establish a way for ORTE to tell PMIx the base tmpdir to use, and update PMIx to understand such directives
2016-07-29 09:52:36 -07:00
Ralph Castain
b748afceb1
Fix copy/paste error
2016-07-29 06:41:30 -07:00
Gilles Gouaillardet
e67c3d0a14
orted/pmix: protect against NULL node in orte_pmix_server_register_nspace()
2016-07-29 16:20:31 +09:00
Ralph Castain
cacb582ecd
Support timeout values when performing connect/accept operations. Bump default timeout to 10 minutes so folks have time to start the partnering application
2016-07-28 14:09:06 -07:00
Ralph Castain
9ab20cafe3
Pass the nodeid for each proc in the job. Fix a mistaken error output message
2016-07-25 15:41:15 -07:00
Ralph Castain
99f7096031
Fix permissions
2016-07-16 21:03:55 -07:00
Ralph Castain
d4071fbd1c
Fix dynamic operations by ensuring that we only fire the debugger release if the debugger is attached, and that the OPAL pmix key for directing events to non-default handlers matches the PMIx spelling
2016-07-16 13:20:41 -07:00
Ralph Castain
20a91c2baf
Add a new --continuous flag to mpirun that directs ORTE to let a job continue running as app procs terminate. Don't attempt to restart them. Add event notification of abnormally terminating procs, and demonstrate that in the mpi_spin test program.
...
Cleanup debug message
2016-07-13 15:28:33 -07:00
Ralph Castain
ee56d9dc1a
Shorten the session directory name, as some OSes now provide unusually long temp directory names, causing us to overflow the sockaddr field
2016-07-05 14:59:50 -07:00
Ralph Castain
c9ada8e095
Silence Coverity warnings
2016-07-03 20:45:08 -07:00
Ralph Castain
6e434d6785
Add support for PMIx tool connections and queries. Initially only support a request to list all known namespaces (jobids) from ORTE, but other folks will extend that support to include additional information
...
Update to match PMIx RFC
Fix configury to point to correct libevent and hwloc locations
2016-06-29 19:19:19 -07:00
Gilles Gouaillardet
5d32282230
orted/pmix_server_pub: fix packing type in pmix_server_lookup_fn()
...
and make it match the one used when unpacking in orte_data_server()
2016-06-27 14:37:08 +09:00
Ralph Castain
e3e4d73986
Need to be a little more careful when checking the range on a publish/lookup operation. If the range was constrained at publish, then we need to check that the lookup fits within that constraint. Otherwise, we should provide the data. More detailed constraint checking will be provided later.
2016-06-24 17:01:49 -07:00
Ralph Castain
0ba02821e6
Add requested key and job-level info
2016-06-19 18:22:31 -07:00
Ralph Castain
5d330d5220
Enable the PMIx event notification capability and use that for all error notifications, including debugger release. This capability requires use of PMIx 2.0 or above as the features are not available with earlier PMIx releases. When OMPI master is built against an earlier external version, it will fall back to the prior behavior - i.e., debugger will be released via RML and all notifications will go strictly to the default error handler.
...
Add PMIx 2.0
Remove PMIx 1.1.4
Cleanup copying of component
Add missing file
Touchup a typo in the Makefile.am
Update the pmix ext114 component
Minor cleanups and resync to master
Update to latest PMIx 2.x
Update to the PMIx event notification branch latest changes
2016-06-14 13:08:41 -07:00
Ralph Castain
dd0f843843
Fix rare hangs observed on OS-X by properly thread-shifting upcalls from the PMIx server into ORTE
2016-06-05 21:39:44 -07:00
Ralph Castain
0cd0ccb7fd
Provide ETIMEDOUT as the mpirun exit code if the timeout limit was hit
2016-05-31 07:45:31 -07:00
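The exit-code convention in the commit above can be illustrated with a small Python sketch. It assumes a generic launcher built on the standard subprocess module, not mpirun itself; the function name run_with_timeout and the sleeping child command are hypothetical.

```python
import errno
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd; return its exit code, or errno.ETIMEDOUT if the limit is hit."""
    try:
        return subprocess.run(cmd, timeout=timeout_s).returncode
    except subprocess.TimeoutExpired:
        # subprocess.run() has already killed the child at this point.
        return errno.ETIMEDOUT


# Hypothetical child that sleeps longer than the allowed limit.
child = [sys.executable, "-c", "import time; time.sleep(5)"]
code = run_with_timeout(child, 0.5)
print(code == errno.ETIMEDOUT)
```

A caller or test harness can then tell a timeout apart from an ordinary failure by comparing the exit status against ETIMEDOUT, whose numeric value differs between platforms.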
Ralph Castain
ebe159acef
Add a timeout cmd line option and an option to report state info upon timeout to assist with debugging Jenkins tests
...
If requested, obtain stacktraces for each application process and report it to stderr upon timeout
stack traces: minor improvements
- Also include the hostname and PID of each process for which
we're sending the stack traces (vs. just including the ORTE process
name)
- Send a specific error message if we couldn't find "gstack" in the
$PATH (e.g., on OS X)
- Send a specific error message if gstack fails to run
- Print a message that obtaining the stack traces may take a few
seconds so that users don't wonder what's happening
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
help-orterun.txt: minor tweaks
Trivial update: show "--timeout" (instead of "-timeout") in the help
message, just to encourage the use of double-dash options.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
trivial: stacktrace -> stack trace
Trivial wordsmithing.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-05-28 08:36:25 -07:00
rhc54
927d3f4c3c
Merge pull request #1692 from rhc54/topic/eval2
...
Fix the --tune problem by searching the argv for MCA params in advance of opal_init_util
2016-05-23 22:19:09 -07:00
Ralph Castain
80f4e3b872
Fix the --tune problem by searching the argv for MCA params in advance of opal_init_util. Only search the first app_context as we historically have done - we can debate whether or not to search all app_contexts
2016-05-23 21:09:44 -07:00
Ralph Castain
2da0210de3
Fix command line usage when a Java user provides the -Djava.library.path=foo option
2016-05-23 15:29:36 -07:00
Ralph Castain
42ecffb6d0
Move the registration of MCA params out of the init of the var system - put them in with the rest of the OPAL MCA param registrations
...
Take another shot at untangling the spaghetti
orterun: fix for command line parsing
orte-submit calls opal_init_util () before parsing out MCA command line
options (-mca, -am, etc). This prevents mpirun from setting opal MCA
variables for some frameworks as well as the MCA base. This is because
when a framework is opened all of its variables are set to read-only.
Eventually we want to lift this restriction on some MCA variables but
since -mca is affected we must parse out the MCA command line options
before opal_init_util(). This commit fixes the bug by adding a new
option to opal_cmd_line_parse (ignore unknown option) so orte-submit
can pre-parse the command line for MCA options.
Signed-off-by: Nathan Hjelm <hjelmn@me.com>
Minor cleanups to avoid releasing/recreating the cmd line
2016-05-20 09:59:50 -07:00
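The pre-parse step described in the commit above (collect the MCA options first, ignore everything unknown, and only then initialize) can be sketched as follows. This is a simplified Python illustration, not the opal_cmd_line_parse code: it handles only the two-argument -mca/--mca form, and the function name preparse_mca_params and the parameter example_param are hypothetical.

```python
def preparse_mca_params(argv):
    """Collect (name, value) pairs given via -mca/--mca, skipping all other args."""
    params = []
    i = 0
    while i < len(argv):
        if argv[i] in ("-mca", "--mca"):
            # The option takes exactly two arguments: a name and a value.
            if i + 2 >= len(argv):
                break  # Malformed tail; leave the error to the full parser.
            params.append((argv[i + 1], argv[i + 2]))
            i += 3
        else:
            # Unknown or unrelated option: ignore it in this first pass.
            i += 1
    return params


# Hypothetical command line, for illustration only.
ARGV = ["mpirun", "-np", "4", "--mca", "btl", "tcp,self",
        "--some-unknown-flag", "-mca", "example_param", "1", "./a.out"]

print(preparse_mca_params(ARGV))
```

Because unknown options are skipped rather than rejected, this first pass can run before the real option table exists, and the full parse can still report errors later.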
Ralph Castain
7767882346
Per user request, add some missing data and definitions:
...
OPAL_PMIX_UNIV_RANK - synonym for OPAL_PMIX_GLOBAL_RANK
OPAL_PMIX_APP_SIZE - #ranks in the application of this proc
2016-05-09 08:39:01 -07:00
Ralph Castain
58dd41facf
Repair the processing of cmd line options that mapped to MCA params. This was responsible for breaking things like map-by <foo>.
...
Remove debug, let orterun send terminate cmd to DVM
Recover the DVM support
2016-05-06 13:14:03 -07:00
Jeff Squyres
265e5b9795
Merge pull request #1552 from kmroz/wip-hostname-len-cleanup-1
...
ompi/opal/orte/oshmem/test: max hostname length cleanup
2016-05-02 09:44:18 -04:00
rhc54
2fa8b6c6ac
Merge pull request #1525 from rhc54/topic/schizo
...
Extend the schizo framework
2016-05-01 15:09:08 -07:00
Ralph Castain
6ac7929bd0
Extend the schizo framework to allow definition of CLI options by environment. Refactor orterun to mesh with the orted_submit code, thus improving code reuse. Eliminate the orte-submit tool as orterun can now meet that need.
...
Cleanups per @jjhursey review
2016-05-01 11:30:25 -07:00