Gilles Gouaillardet
0931d09afa
ess/singleton: silence a valgrind warning
...
initialize a pointer and keep valgrind happy about it
2016-09-27 15:22:39 +09:00
Gilles Gouaillardet
f9ebba4668
ess/singleton: only realloc() when required in fork_hnp()
2016-09-23 16:35:59 +09:00
Gilles Gouaillardet
c7bf9a0ec9
ess/singleton: fix read on the pipe to spawn'ed orted
...
and close the pipe on both ends when it is no longer needed
2016-09-22 14:21:52 +09:00
Gilles Gouaillardet
83399adb3f
singleton: "safe" read/write to the pipe between (spawn'ed) orted and singleton
2016-09-20 14:56:58 +09:00
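The "safe" pipe I/O named in the commit above is not shown in this log. A minimal sketch of the usual technique, assuming it means looping until the whole buffer has been transferred and retrying when a call is interrupted by a signal, could look like this. The names `read_full` and `write_full` are hypothetical and do not come from the Open MPI sources.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: read exactly len bytes from fd.
 * Retries on EINTR and on short reads.
 * Returns 0 on success, -1 on error or if the writer closed early. */
int read_full(int fd, void *buf, size_t len)
{
    char *p = buf;
    assert(buf != NULL || len == 0);
    while (len > 0) {
        ssize_t n = read(fd, p, len);
        if (n < 0) {
            if (errno == EINTR) {
                continue;       /* interrupted by a signal: try again */
            }
            return -1;          /* real I/O error */
        }
        if (n == 0) {
            return -1;          /* EOF before all bytes arrived */
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Hypothetical helper: write exactly len bytes to fd.
 * Retries on EINTR and on short writes. Returns 0 on success, -1 on error. */
int write_full(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    assert(buf != NULL || len == 0);
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR) {
                continue;
            }
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}
```

A caller on either end of the pipe would treat a -1 return as a failed handshake and close its descriptor.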
Gilles Gouaillardet
e7ae6975d0
orted: fix spawn in singleton mode
...
in singleton mode, have the spawn'ed orted invoke orte_pre_condition_transports()
and send the transport key back to the singleton
2016-09-20 14:39:22 +09:00
Ralph Castain
a16b3cc33d
Fix some minor complaints - missing "void" in function parameters
2016-09-15 15:18:42 -07:00
Ralph Castain
6f086189e6
Fix trivial typo
2016-09-15 13:10:55 -07:00
Gilles Gouaillardet
11ebf3ab23
ess/singleton: when forking hnp, use the PMIX_NAMESPACE sent by the hnp
...
as the jobid
2016-09-15 13:57:23 +09:00
Artem Polyakov
a9a7f39773
ess/pmi: fix the comments about MCA/PMIx setting conflict resolution.
2016-09-07 07:47:35 +03:00
Artem Polyakov
74a11d7832
Fix session dir cleanup code.
2016-09-05 07:53:55 +03:00
Artem Polyakov
dc0ab674de
Add PMIx key to provide RM with ability to indicate that it will cleanup
...
session directories provided through OPAL_PMIX_TMPDIR,
OPAL_PMIX_NSDIR, OPAL_PMIX_PROCDIR
2016-09-05 07:48:44 +03:00
Artem Polyakov
81195ab724
Several fixes related to session directories:
...
* enable OMPI to retrieve paths from RM through PMIx
* cleanups related to tempdirs.
2016-09-05 07:48:44 +03:00
Ralph Castain
2f6e0fec90
Provide the number of nodes in the job
2016-08-26 14:50:41 -07:00
Gilles Gouaillardet
93e73841f9
ess/singleton: push all PMIX_* environment variables, regardless of how many there are
2016-08-23 09:46:55 +09:00
Gilles Gouaillardet
a1e8e58a8a
ess/singleton: expects 4 PMIX_* environment variables or more
2016-08-23 09:34:03 +09:00
Ralph Castain
be8424b691
Provide backward compatible keys so that the non-PMIx components in the opal/pmix framework don't have to adjust as we continue to work on finalizing the PMIx reference scheme. Activate and utilize the new PMIx show_help capability to provide more meaningful error output when the server cannot start.
...
Add a contrib script to cleanup permissions incorrectly modified due to things like smb mounts
dd
2016-08-13 12:13:04 -07:00
Ralph Castain
08a0644df5
Fix shared memory rendezvous
2016-08-13 08:14:50 -07:00
rhc54
1ef3c86d44
Merge pull request #1931 from hjelmn/ess_fix
...
ess/base: set up nidmap after pmix
2016-08-12 13:10:30 -07:00
Artem Polyakov
1351a7065c
ess/pmi: minor code readability cleanup.
...
Split process name variable "name" to
- "wildcard_rank" for the cases where wildcard is used.
- "pname" for the case where reference to particular process is needed.
2016-08-06 15:45:19 +06:00
Nathan Hjelm
3c23502dfe
ess/base: set up nidmap after pmix
...
This fixes a SEGV when the nidmap code attempts to use
opal_pmix.store_local before pmix is set up.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-08-02 09:50:00 -06:00
Ralph Castain
71de03fc67
Cleanup the new naming requirements to ensure that info is correctly retrieved
...
Cleanup permissions
Restore singleton operations
2016-07-21 09:46:03 -07:00
Ralph Castain
01a653d50a
Remove a debug print in comm_cid.c. Update PMIx2 to include the revised PMIx_Get logic for higher performance by reducing the number of hash table lookups. Fix a bug where requests for data from a proc in another nspace could hang, or result in "not found".
...
Remove stale file reference
Restore autogen pass thru pmix
Remove generated file
2016-07-20 00:58:19 -07:00
Ralph Castain
ee56d9dc1a
Shorten the session directory name as some OS's are now providing unusually long temp directory names, causing us to overflow the sockaddr field
2016-07-05 14:59:50 -07:00
Ralph Castain
3913595e10
Enable simulation of large-scale clusters by allowing multiple daemons/node. Specifying the ras_base_multiplier parameter to be greater than 1 will cause ORTE to replicate each allocated node by that factor. A daemon will be spawned for each replica, thus letting ORTE function as if it were on a much larger cluster.
...
Note that this cannot be used for MPI performance testing. It is really only useful for ORTE scaling tests. It also only works with the rsh/ssh launcher.
2016-05-29 18:56:18 -07:00
rhc54
ff8518853e
Merge pull request #1604 from rhc54/topic/psm2
...
Improve the transport key print statement to ensure that we don't get…
2016-05-03 13:43:10 -07:00
rhc54
2fa8b6c6ac
Merge pull request #1525 from rhc54/topic/schizo
...
Extend the schizo framework
2016-05-01 15:09:08 -07:00
Ralph Castain
6ac7929bd0
Extend the schizo framework to allow definition of CLI options by environment. Refactor orterun to mesh with the orted_submit code, thus improving code reuse. Eliminate the orte-submit tool as orterun can now meet that need.
...
Cleanups per @jjhursey review
2016-05-01 11:30:25 -07:00
Ralph Castain
0f05893952
Ensure consistency between max_procs and univ_size values - since orte wants max_procs, have the proc get that value instead of univ_size
...
Make the singleton module consistent as well
2016-05-01 11:13:33 -07:00
Ralph Castain
29bc24bdd5
Improve the transport key print statement to ensure that we don't get zero fields as this can be a problem for PSM
2016-04-28 20:11:12 -07:00
Ralph Castain
75dc4c305a
Correctly set the #procs in the job to "job_size", and the max_procs to "univ_size"
2016-04-27 12:00:19 -07:00
Ralph Castain
2432daf065
Some minor cleanups of a memory leak and error output
2016-04-08 07:46:18 -07:00
Rainer Keller
52080a5736
As per the pull request to pmix/master:
...
https://github.com/pmix/master/pull/71
Have OMPI's current version of pmix120 nicely fail in case of
too long sun_path (longer than 108 or in case of OSX 103 chars).
And have OMPI return proper error messages with hints how to
amend.
2016-04-07 22:12:53 +02:00
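The commit above describes failing cleanly when a rendezvous path does not fit in `sun_path`. The following is only an illustrative sketch of such a guard, not the pmix120 code. The function name `set_unix_addr` is hypothetical, and the limit is taken from `sizeof` rather than hard-coded, since the commit notes that it differs between platforms.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>

/* Hypothetical helper: fill addr for a Unix-domain socket at path.
 * Returns 0 on success, or -1 if path (plus its terminating NUL)
 * does not fit in sun_path, so the caller can report a useful error
 * instead of silently truncating the address. */
int set_unix_addr(struct sockaddr_un *addr, const char *path)
{
    size_t n;

    assert(addr != NULL && path != NULL);
    n = strlen(path);
    if (n >= sizeof(addr->sun_path)) {
        return -1;              /* too long for this platform */
    }
    memset(addr, 0, sizeof(*addr));
    addr->sun_family = AF_UNIX;
    memcpy(addr->sun_path, path, n + 1);   /* copy including the NUL */
    return 0;
}
```

On a -1 return the caller could print a hint such as choosing a shorter temporary directory, which is the kind of message the commit refers to.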
Ralph Castain
a3fea58d1c
Minor cleanups to prior PR commit
2016-03-24 15:55:14 -07:00
rhc54
6756e19aa2
Merge pull request #1457 from anandhis/master
...
rml changes
2016-03-24 15:17:29 -07:00
Ralph Castain
8c14df2328
Revert "Modify singularity support per patch from Greg Kurtzer"
...
This reverts commit open-mpi/ompi@f7257a8310 .
Ensure that we properly cleanup the session directory tree. Prior code had issues with symlinks, especially if the file that the link points to was already removed as we traverse the tree. Also found that the dirent checks for directory type weren't fully portable, and so fall back to the stat-based approach which is known to be portable.
Fix singularity singletons by detecting we are in a container and properly setting the pmix selection to pick the isolated component. Remove a stale restriction blocking use of the sm btl
2016-03-24 11:27:18 -07:00
Ralph Castain
c146c4969b
Revert part of open-mpi/ompi@c1bbbb5e2f to restore the usock component, thus fixing show_help aggregation.
...
Fixes #1467
Restore debugger attach operations
Fixes #1225
2016-03-18 21:49:04 -07:00
Ralph Castain
a4c8e8c28a
Cleanup the proposed change:
...
* qos framework is moving to the scon layer and is no longer required in ORTE
* remove the rml/ftrm component as we now have multiple active components, and so the wrapper needs to be rethought
* no need for separating the "base" from "API" module definition. The two are identical
* move the "stub" functions into their own file for cleanliness
* general cleanup to meet coding standards
* cleanup some logic in the stubs
2016-03-10 13:14:17 -08:00
Ralph Castain
f3ae30ff39
Fix singletons yet again...
2016-03-08 10:33:35 -08:00
Ralph Castain
d72c1c72ff
Do not push child processes into separate process groups so that any host RM can still "see" them, and ensure that any signal sent to the orted's themselves will be provided to all child processes. Forward all signals from mpirun to the child processes, removing the old MCA parameter required to turn that behavior "on".
2016-03-06 17:55:09 -08:00
Gilles Gouaillardet
80bdbfd9e7
add missing include file
2016-03-03 13:46:28 +09:00
Ralph Castain
4a55fba414
Fix registration of error handlers thru the pmix120 component. A thread-shift operation was hanging on the sync_event_base, which made it dependent on someone calling opal_progress. Unfortunately, a process in "sleep" or spinning outside the MPI library won't do that, and so we never complete errhandler registration.
2016-03-02 15:01:01 -08:00
Ralph Castain
011403c04a
Fix a number of issues, some of which have lingered for a long time:
...
* provide a more reliable way of determining that a process is a singleton by leveraging the schizo framework. Add new components for slurm, alps, and orte to detect when we are in a managed environment, and if we have been launched by mpirun or a native launcher. Set the correct envars to control ess and pmix selection in each case.
* change the relative priority of the pmix120 and pmix112 components to make pmix120 the default
* fix singleton comm-spawn by correctly setting the num_apps field of the orte_job_t created by the daemon - this fixes a segfault in register_nspace on newly created daemons
* ensure orterun doesn't propagate any ess or pmix directives in its environment
* Cleanup a few valgrind issues and memory leaks
* Fix a race condition that prevented the client from completing notification registrations (missing thread shift)
* Ensure the schizo/alps component detects launch by mpirun
2016-03-01 06:53:00 -08:00
Ralph Castain
cdb494566d
Provide an option to allow isolated singletons
2016-02-25 11:33:26 -06:00
Ralph Castain
d653cf2847
Convert the orte_job_data pointer array to a hash table so it doesn't grow forever as we run lots and lots of jobs in the persistent DVM.
2016-02-21 11:55:49 -08:00
Ralph Castain
8f9508cace
Further enhance the support for Singularity containers. Extend the "personality" command-line option to allow specifying both model (e.g., "ompi") and container (e.g., "singularity"), and add the necessary logic to support multiple options. Add a new pmix "isolated" component to handle singletons where no HNP is available since containers cannot launch the HNP.
2016-02-17 13:33:06 -08:00
Ralph Castain
8ab28cdc82
Fix a typo that causes segfaults on multi-node executions
2015-12-24 08:43:47 -08:00
annu13
43f44f31c1
moved code to job setup first before enabling comm
2015-12-22 14:30:59 -08:00
Ralph Castain
94ffe10808
Do not override any external settings for PMIx component selection
2015-12-21 08:36:12 -08:00
Ralph Castain
1db3db022a
Don't be so prescriptive about the ess component to be used - we just need to protect against the proc incorrectly taking the singleton component, so rule that one out. Ensure that the other components understand that they are only for use by daemons.
2015-12-09 19:54:44 -08:00
Ralph Castain
8823069fe9
Provide a mechanism by which a tool can request async progress thread support for ORTE
2015-12-04 08:26:57 -08:00