Gilles Gouaillardet
93e73841f9
ess/singleton: push all PMIX_* environment variables, regardless of how many there are
2016-08-23 09:46:55 +09:00
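To illustrate the behavior the entry above describes (forwarding every PMIX_* variable instead of expecting a fixed count), here is a minimal Python sketch. The function name `collect_pmix_env` is hypothetical and is not part of Open MPI; the real change is in ORTE's C code.

```python
import os


def collect_pmix_env(environ=None):
    """Return every PMIX_* variable in the environment, however many exist.

    Hypothetical helper for illustration only.
    """
    env = os.environ if environ is None else environ
    # No assumption about the number of matches: zero, four, or forty all work.
    return {key: value for key, value in env.items() if key.startswith("PMIX_")}
```

Filtering by prefix means no variable is dropped when a newer PMIx exports more variables than an older one did.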
Gilles Gouaillardet
a1e8e58a8a
ess/singleton: expects 4 PMIX_* environment variables or more
2016-08-23 09:34:03 +09:00
Jeff Squyres
71ec5cfb43
rsh: robustify the check for plm_rsh_agent default value
...
Don't strcmp against the default value -- the default value may change
over time. Instead, check to see if the MCA var source is not
DEFAULT.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-08-16 06:58:20 -05:00
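The idea in the entry above is to ask where a parameter's value came from rather than compare it against a hard-coded default string. The Python model below is only a sketch: `VarSource`, `McaVar`, and `user_set_agent` are hypothetical names, and the real check is done in C against the MCA variable system.

```python
from dataclasses import dataclass
from enum import Enum, auto


class VarSource(Enum):
    """Hypothetical stand-in for the origin of a parameter value."""
    DEFAULT = auto()
    ENV = auto()
    COMMAND_LINE = auto()
    FILE = auto()


@dataclass
class McaVar:
    value: str
    source: VarSource


def user_set_agent(var):
    """True only if the value came from somewhere other than the built-in default."""
    # Comparing the source keeps working even if the default string changes later.
    return var.source is not VarSource.DEFAULT
```

With this check, a user who explicitly passes a value identical to the default is still treated as having set it, which a string comparison cannot detect.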
rhc54
d7cd802426
Merge pull request #1971 from rhc54/topic/sesdir
...
Update the session dir structure. Restore the creation of a top-level…
2016-08-16 03:14:08 -05:00
Ralph Castain
ae2af61ee3
Update the session dir structure. Restore the creation of a top-level dir based on userid so that everything is contained under the user's top-level dir. Make the next level down (the "job family" level) be either the pid (indicated by a name of "pid.N") or the job family if not launched by mpirun. This allows for proper rendezvous by direct-launched procs.
2016-08-15 22:46:46 -05:00
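One way to picture the layout described above is the following Python sketch. It reads the message as "use pid.N when launched by mpirun, otherwise the job family". Only the "pid.N" name comes from the commit message; the `ompi.<uid>` top-level name and the function signature are assumptions made for illustration.

```python
import posixpath


def session_dir(tmpdir, uid, mpirun_pid=None, job_family=None):
    """Build a two-level session directory path (illustrative only)."""
    # Top level is keyed on the user id so everything sits under one per-user dir.
    top = posixpath.join(tmpdir, f"ompi.{uid}")  # hypothetical naming
    if mpirun_pid is not None:
        second = f"pid.{mpirun_pid}"  # "pid.N" form named in the commit message
    elif job_family is not None:
        second = str(job_family)  # direct-launched procs rendezvous on the job family
    else:
        raise ValueError("need either mpirun_pid or job_family")
    return posixpath.join(top, second)
```

Because direct-launched processes can all compute the job family independently, they arrive at the same directory without mpirun telling them where it is.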
Ralph Castain
9f43db7303
Further cleanup of getpwuid usage - try it first (unless completely disabled), and then silently fall back to other methods.
2016-08-15 07:51:36 -07:00
Ralph Castain
be8424b691
Provide backward compatible keys so that the non-PMIx components in the opal/pmix framework don't have to adjust as we continue to work on finalizing the PMIx reference scheme. Activate and utilize the new PMIx show_help capability to provide more meaningful error output when the server cannot start.
...
Add a contrib script to cleanup permissions incorrectly modified due to things like smb mounts
2016-08-13 12:13:04 -07:00
Ralph Castain
08a0644df5
Fix shared memory rendezvous
2016-08-13 08:14:50 -07:00
rhc54
ddde154d28
Merge pull request #1962 from rhc54/topic/notify
...
Ensure we properly convert pmix status to ORTE state before activatin…
2016-08-13 06:59:50 -07:00
Ralph Castain
48d35a9627
Ensure we properly convert pmix status to ORTE state before activating an error state upon notification. Cleanup some conversion issues on notification info. Add a new orte_notify.c test program
2016-08-12 21:14:29 -07:00
rhc54
9eed451916
Merge pull request #1960 from rhc54/topic/rsh
...
Restore the rsh template creation code
2016-08-12 13:38:43 -07:00
rhc54
1ef3c86d44
Merge pull request #1931 from hjelmn/ess_fix
...
ess/base: set up nidmap after pmix
2016-08-12 13:10:30 -07:00
Ralph Castain
5717b75b45
Restore the rsh template creation code
2016-08-12 12:43:40 -07:00
Ralph Castain
1c44543854
If the ssh agent hasn't been given, then check for qrsh and friends
2016-08-12 07:46:39 -07:00
Artem Polyakov
1351a7065c
ess/pmi: minor code readability cleanup.
...
Split the process name variable "name" into
- "wildcard_rank" for the cases where wildcard is used.
- "pname" for the case where reference to particular process is needed.
2016-08-06 15:45:19 +06:00
Nathan Hjelm
3c23502dfe
ess/base: set up nidmap after pmix
...
This fixes a SEGV when the nidmap code attempts to use
opal_pmix.store_local before pmix is set up.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-08-02 09:50:00 -06:00
Ralph Castain
71de03fc67
Cleanup the new naming requirements to ensure that info is correctly retrieved
...
Cleanup permissions
Restore singleton operations
2016-07-21 09:46:03 -07:00
Ralph Castain
01a653d50a
Remove a debug print in comm_cid.c. Update PMIx2 to include the revised PMIx_Get logic for higher performance by reducing the number of hash table lookups. Fix a bug where requests for data from a proc in another nspace could hang, or result in "not found".
...
Remove stale file reference
Restore autogen pass thru pmix
Remove generated file
2016-07-20 00:58:19 -07:00
rhc54
2414244171
Merge pull request #1872 from rhc54/topic/continuous
...
Add support for continuously operating applications
2016-07-13 15:29:31 -07:00
Ralph Castain
20a91c2baf
Add a new --continuous flag to mpirun that directs ORTE to let a job continue running as app procs terminate. Don't attempt to restart them. Add event notification of abnormally terminating procs, and demonstrate that in the mpi_spin test program.
...
Cleanup debug message
2016-07-13 15:28:33 -07:00
Ralph Castain
ddd0d05de3
Fix a bug in the handling of nper<foo> when -host or -hostfile was given. Correctly mark slots as "given" when we auto-assign them. Ensure we don't set the number of procs when using nper<foo> so the PPR mapper can correctly assign them.
2016-07-12 09:27:02 -07:00
Ralph Castain
ee56d9dc1a
Shorten the session directory name, as some OSes now provide unusually long temp directory names, causing us to overflow the sockaddr field
2016-07-05 14:59:50 -07:00
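The overflow mentioned above can be checked for up front. The sketch below assumes the field in question is the `sun_path` member of a Unix-domain socket address, which holds 108 bytes on Linux and 104 on macOS and the BSDs. The constant and function names are hypothetical.

```python
import os

# Conservative limit: 104 bytes on macOS/BSD, 108 on Linux, NUL terminator included.
SUN_PATH_MAX = 104


def fits_in_sun_path(path, limit=SUN_PATH_MAX):
    """Return True if the path (plus a trailing NUL) fits in a sun_path buffer."""
    # Measure encoded bytes rather than characters, since the buffer is a byte array.
    return len(os.fsencode(path)) + 1 <= limit
```

A long system-provided temp directory uses up most of that budget before any session directory components are appended, which is why shortening the name helps.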
Ralph Castain
5d330d5220
Enable the PMIx event notification capability and use that for all error notifications, including debugger release. This capability requires use of PMIx 2.0 or above as the features are not available with earlier PMIx releases. When OMPI master is built against an earlier external version, it will fall back to the prior behavior - i.e., debugger will be released via RML and all notifications will go strictly to the default error handler.
...
Add PMIx 2.0
Remove PMIx 1.1.4
Cleanup copying of component
Add missing file
Touchup a typo in the Makefile.am
Update the pmix ext114 component
Minor cleanups and resync to master
Update to latest PMIx 2.x
Update to the PMIx event notification branch latest changes
2016-06-14 13:08:41 -07:00
Ralph Castain
a6e6c37484
Remove stale map-reduce support
2016-06-12 07:41:57 -07:00
Ralph Castain
dd0f843843
Fix rare hangs observed on OS-X by properly thread-shifting upcalls from the PMIx server into ORTE
2016-06-05 21:39:44 -07:00
Ralph Castain
0ba9572f9f
Cleanup the forced termination a bit by restoring the delay before issuing the sigkill, and eliminating the large time loss spent checking if the proc died. The latter is responsible for a large number of test timeouts in MTT
...
Update alps component
2016-06-02 17:48:21 -07:00
Gilles Gouaillardet
5f565dfec3
configury: clean the flex generated .c files
2016-06-01 11:13:31 +09:00
Ralph Castain
3913595e10
Enable simulation of large-scale clusters by allowing multiple daemons/node. Specifying the ras_base_multiplier parameter to be greater than 1 will cause ORTE to replicate each allocated node by that factor. A daemon will be spawned for each replica, thus letting ORTE function as if it were on a much larger cluster.
...
Note that this cannot be used for MPI performance testing. It is really only useful for ORTE scaling tests. It also only works with the rsh/ssh launcher.
2016-05-29 18:56:18 -07:00
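The replication described above can be sketched in a few lines of Python. This is a simplified model and not ORTE's implementation: `replicate_nodes` and the `(node, replica)` tuple representation are hypothetical, and only the multiplier idea comes from the commit message.

```python
def replicate_nodes(nodes, multiplier=1):
    """Expand each allocated node into `multiplier` logical replicas.

    Each (node, replica) pair stands for one daemon that would be spawned.
    """
    if multiplier < 1:
        raise ValueError("multiplier must be at least 1")
    return [(node, replica) for node in nodes for replica in range(multiplier)]
```

For example, two allocated nodes with a multiplier of 2 yield four logical nodes, so the runtime behaves as if the cluster were twice as large.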
Ralph Castain
ebe159acef
Add a timeout cmd line option and an option to report state info upon timeout to assist with debugging Jenkins tests
...
If requested, obtain stack traces for each application process and report them to stderr upon timeout
stack traces: minor improvements
- Also include the hostname and PID of each process for which
we're sending the stack traces (vs. just including the ORTE process
name)
- Send a specific error message if we couldn't find "gstack" in the
$PATH (e.g., on OS X)
- Send a specific error message if gstack fails to run
- Print a message that obtaining the stack traces may take a few
seconds so that users don't wonder what's happening
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
help-orterun.txt: minor tweaks
Trivial update: show "--timeout" (instead of "-timeout") in the help
message, just to encourage the use of double-dash options.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
trivial: stacktrace -> stack trace
Trivial wordsmithing.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-05-28 08:36:25 -07:00
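The "could not find gstack in $PATH" case above amounts to looking the tool up before trying to run it. Below is a minimal Python sketch of that lookup. The function name and the error wording are hypothetical; the actual messages are in ORTE's help files.

```python
import shutil


def stack_trace_tool(tool="gstack"):
    """Locate the stack trace tool, returning (path, error_message)."""
    path = shutil.which(tool)
    if path is None:
        # Report a specific reason instead of failing silently.
        return None, f'Could not find "{tool}" in $PATH; no stack traces available'
    return path, None
```

Returning a distinct message for the lookup failure lets the caller tell "tool missing" apart from "tool ran and failed".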
Jeff Squyres
dd9a819a1c
odls_default: do not opal_output() while creating a process!
...
It is verboten to use opal_output() after the fork() but before the
exec()! It results in all manner of undefined behavior. For example,
on some OS X systems, if you run a trivial "hello world" MPI program
with a high level of ODLS verbosity:
```sh
$ mpirun -np 3 --mca odls_base_verbose 100 ./hello_c
```
You will see a bunch of output from the mpirun ODLS base, but then it
*may* hang in odls_default_module.c:do_child() -- after the fork() but
before the exec() -- while trying to opal_output() some debugging
statements.
The solution is to remove these extraneous opal_output() statements.
Indeed, the ODLS base is already outputting the same information that
these opal_output() statements are trying to emit, anyway.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-05-24 21:28:57 -04:00
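The rule in the entry above is that the child does nothing between fork() and exec() except the exec itself. The Python sketch below is only an analogy for the C code path in odls_default_module.c. `spawn` is a hypothetical helper, and it assumes a POSIX system.

```python
import os


def spawn(argv):
    """Fork, exec argv in the child, and return the child's exit status."""
    pid = os.fork()
    if pid == 0:
        # Child: go straight to exec. No logging or buffered output here.
        try:
            os.execvp(argv[0], argv)
        except OSError:
            # exec failed; leave immediately without running parent cleanup code.
            os._exit(127)
    # Parent: diagnostics are safe on this side of the fork.
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1
```

Any diagnostics belong in the parent, either before the fork or after the wait. As the commit message notes, the ODLS base already emits the same information there.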
Ralph Castain
30aaf785a8
Fix the dist mapper option
2016-05-23 23:20:33 -07:00
George Bosilca
50b37758d4
Don't overwrite the function argument.
...
In a MPMD setup the app in the jdata can be NULL, so make sure we
don't leave the main argument with an inconsistent value.
2016-05-19 10:35:23 -04:00
Ralph Castain
7e5ef6a240
Fix the env_list support - the MCA param was being set way too early, so provide a "backdoor" way of providing the value
2016-05-06 15:38:39 -07:00
Ralph Castain
58dd41facf
Repair the processing of cmd line options that mapped to MCA params. This was responsible for breaking things like map-by <foo>.
...
Remove debug, let orterun send terminate cmd to DVM
Recover the DVM support
2016-05-06 13:14:03 -07:00
rhc54
ff8518853e
Merge pull request #1604 from rhc54/topic/psm2
...
Improve the transport key print statement to ensure that we don't get…
2016-05-03 13:43:10 -07:00
Jeff Squyres
265e5b9795
Merge pull request #1552 from kmroz/wip-hostname-len-cleanup-1
...
ompi/opal/orte/oshmem/test: max hostname length cleanup
2016-05-02 09:44:18 -04:00
rhc54
2fa8b6c6ac
Merge pull request #1525 from rhc54/topic/schizo
...
Extend the schizo framework
2016-05-01 15:09:08 -07:00
Ralph Castain
6ac7929bd0
Extend the schizo framework to allow definition of CLI options by environment. Refactor orterun to mesh with the orted_submit code, thus improving code reuse. Eliminate the orte-submit tool as orterun can now meet that need.
...
Cleanups per @jjhursey review
2016-05-01 11:30:25 -07:00
Ralph Castain
0f05893952
Ensure consistency between max_procs and univ_size values - since orte wants max_procs, have the proc get that value instead of univ_size
...
Make the singleton module consistent as well
2016-05-01 11:13:33 -07:00
Ralph Castain
29bc24bdd5
Improve the transport key print statement to ensure that we don't get zero fields as this can be a problem for PSM
2016-04-28 20:11:12 -07:00
Ralph Castain
e6ad1ad621
Up-port of change for 2.x: if user directs oversubscribe, then do not bind as we will otherwise overload resources
2016-04-28 13:21:10 -07:00
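The policy in the entry above can be written as a single decision. The following Python sketch uses hypothetical names and policy strings; the real logic is in ORTE's mapping and binding code.

```python
def choose_binding(oversubscribe_requested, default_binding="core"):
    """Pick a binding policy: disable binding when the user asked to oversubscribe."""
    # Binding several procs to the same resources would overload them.
    return "none" if oversubscribe_requested else default_binding
```

Without this rule, oversubscribed processes could end up pinned to the same cores and compete for them.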
Ralph Castain
75dc4c305a
Correctly set the #procs in the job to "job_size", and the max_procs to "univ_size"
2016-04-27 12:00:19 -07:00
Gilles Gouaillardet
6bf57c799f
orte/rml: ORTE_RML_SEND_COMPLETE handles messages with both NULL iov and cbfunc.buffer
2016-04-26 09:19:31 +09:00
Karol Mroz
5c11bdb251
orte: fixup hostname max length usage
...
Also removes orte specific max hostname value.
Signed-off-by: Karol Mroz <mroz.karol@gmail.com>
2016-04-25 07:08:23 +02:00
Joshua Hursey
29b49351af
ras/lsf: Fix affinity for MPMD jobs running under LSF
2016-04-22 11:18:34 -05:00
Jeff Squyres
68c1a5eb6c
Merge pull request #1567 from jsquyres/pr/fix-ompi-to-opal-name-conversion
...
m4: rename OMPI_SUMMARY_* macros to OPAL_SUMMARY_*
2016-04-20 13:10:06 -04:00
Jeff Squyres
6800ef9ec0
m4: rename OMPI_SUMMARY_* macros to OPAL_SUMMARY_*
...
These macros should really be named OPAL_SUMMARY_*; they're used in
all projects, and therefore should be in the lowest layer project (OPAL).
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-04-20 08:40:00 -07:00
Ralph Castain
449ec41532
Roll to PMIx 1.1.4rc1 and remove the PMIx 1.2.0 directory as the community has decided to not do that release version. This incorporates a number of bug fixes that have been identified and repaired in the PMIx and OMPI code bases. Also includes several minor corrections to the PMIx code so it now supports run-thru without hanging on collectives involving a process that exits
2016-04-15 10:11:11 -07:00
Ralph Castain
1fa236b26c
Ensure that we exit with a non-zero status when oversubscribe fails
2016-04-14 05:51:10 -07:00
Ralph Castain
437f5b4289
Fix map-by node and do-not-launch
2016-04-13 09:21:19 -07:00