Nathan Hjelm
3c23502dfe
ess/base: set up nidmap after pmix
...
This fixes a SEGV when the nidmap code attempts to use
opal_pmix.store_local before pmix is set up.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-08-02 09:50:00 -06:00
Ralph Castain
71de03fc67
Cleanup the new naming requirements to ensure that info is correctly retrieved
...
Cleanup permissions
Restore singleton operations
2016-07-21 09:46:03 -07:00
Ralph Castain
01a653d50a
Remove a debug print in comm_cid.c. Update PMIx2 to include the revised PMIx_Get logic for higher performance by reducing the number of hash table lookups. Fix a bug where requests for data from a proc in another nspace could hang, or result in "not found".
...
Remove stale file reference
Restore autogen pass thru pmix
Remove generated file
2016-07-20 00:58:19 -07:00
Ralph Castain
ee56d9dc1a
Shorten the session directory name as some OS's are now providing unusually long temp directory names, causing us to overflow the sockaddr field
2016-07-05 14:59:50 -07:00
Ralph Castain
3913595e10
Enable simulation of large-scale clusters by allowing multiple daemons/node. Specifying the ras_base_multiplier parameter to be greater than 1 will cause ORTE to replicate each allocated node by that factor. A daemon will be spawned for each replica, thus letting ORTE function as if it were on a much larger cluster.
...
Note that this cannot be used for MPI performance testing. It is really only useful for ORTE scaling tests. It also only works with the rsh/ssh launcher.
2016-05-29 18:56:18 -07:00
rhc54
ff8518853e
Merge pull request #1604 from rhc54/topic/psm2
...
Improve the transport key print statement to ensure that we don't get…
2016-05-03 13:43:10 -07:00
rhc54
2fa8b6c6ac
Merge pull request #1525 from rhc54/topic/schizo
...
Extend the schizo framework
2016-05-01 15:09:08 -07:00
Ralph Castain
6ac7929bd0
Extend the schizo framework to allow definition of CLI options by environment. Refactor orterun to mesh with the orted_submit code, thus improving code reuse. Eliminate the orte-submit tool as orterun can now meet that need.
...
Cleanups per @jjhursey review
2016-05-01 11:30:25 -07:00
Ralph Castain
0f05893952
Ensure consistency between max_procs and univ_size values - since orte wants max_procs, have the proc get that value instead of univ_size
...
Make the singleton module consistent as well
2016-05-01 11:13:33 -07:00
Ralph Castain
29bc24bdd5
Improve the transport key print statement to ensure that we don't get zero fields as this can be a problem for PSM
2016-04-28 20:11:12 -07:00
Ralph Castain
75dc4c305a
Correctly set the #procs in the job to "job_size", and the max_procs to "univ_size"
2016-04-27 12:00:19 -07:00
Ralph Castain
2432daf065
Some minor cleanups of a memory leak and error output
2016-04-08 07:46:18 -07:00
Rainer Keller
52080a5736
As per the pull request to pmix/master:
...
https://github.com/pmix/master/pull/71
Have OMPI's current version of pmix120 nicely fail in case of
too long sun_path (longer than 108 or in case of OSX 103 chars).
And have OMPI return proper error messages with hints how to
amend.
2016-04-07 22:12:53 +02:00
Ralph Castain
a3fea58d1c
Minor cleanups to prior PR commit
2016-03-24 15:55:14 -07:00
rhc54
6756e19aa2
Merge pull request #1457 from anandhis/master
...
rml changes
2016-03-24 15:17:29 -07:00
Ralph Castain
8c14df2328
Revert "Modify singularity support per patch from Greg Kurtzer"
...
This reverts commit open-mpi/ompi@f7257a8310 .
Ensure that we properly cleanup the session directory tree. Prior code had issues with symlinks, especially if the file that the link points to was already removed as we traverse the tree. Also found that the dirent checks for directory type weren't fully portable, and so fall back to the stat-based approach which is known to be portable.
Fix singularity singletons by detecting we are in a container and properly setting the pmix selection to pick the isolated component. Remove a stale restriction blocking use of the sm btl
2016-03-24 11:27:18 -07:00
Ralph Castain
c146c4969b
Revert part of open-mpi/ompi@c1bbbb5e2f to restore the usock component, thus fixing show_help aggregation.
...
Fixes #1467
Restore debugger attach operations
Fixes #1225
2016-03-18 21:49:04 -07:00
Ralph Castain
a4c8e8c28a
Cleanup the proposed change:
...
* qos framework is moving to the scon layer and is no longer required in ORTE
* remove the rml/ftrm component as we now have multiple active components, and so the wrapper needs to be rethought
* no need for separating the "base" from "API" module definition. The two are identical
* move the "stub" functions into their own file for cleanliness
* general cleanup to meet coding standards
* cleanup some logic in the stubs
2016-03-10 13:14:17 -08:00
Ralph Castain
f3ae30ff39
Fix singletons yet again...
2016-03-08 10:33:35 -08:00
Ralph Castain
d72c1c72ff
Do not push child processes into separate process groups so that any host RM can still "see" them, and ensure that any signal sent to the orted's themselves will be provided to all child processes. Forward all signals from mpirun to the child processes, removing the old MCA parameter required to turn that behavior "on".
2016-03-06 17:55:09 -08:00
Gilles Gouaillardet
80bdbfd9e7
add missing include file
2016-03-03 13:46:28 +09:00
Ralph Castain
4a55fba414
Fix registration of error handlers thru the pmix120 component. A thread-shift operation was hanging on the sync_event_base, which made it dependent on someone calling opal_progress. Unfortunately, a process in "sleep" or spinning outside the MPI library won't do that, and so we never complete errhandler registration.
2016-03-02 15:01:01 -08:00
Ralph Castain
011403c04a
Fix a number of issues, some of which have lingered for a long time:
...
* provide a more reliable way of determining that a process is a singleton by leveraging the schizo framework. Add new components for slurm, alps, and orte to detect when we are in a managed environment, and if we have been launched by mpirun or a native launcher. Set the correct envars to control ess and pmix selection in each case.
* change the relative priority of the pmix120 and pmix112 components to make pmix120 the default
* fix singleton comm-spawn by correctly setting the num_apps field of the orte_job_t created by the daemon - this fixes a segfault in register_nspace on newly created daemons
* ensure orterun doesn't propagate any ess or pmix directives in its environment
* Cleanup a few valgrind issues and memory leaks
* Fix a race condition that prevented the client from completing notification registrations (missing thread shift)
* Ensure the shizo/alps component detects launch by mpirun
2016-03-01 06:53:00 -08:00
Ralph Castain
cdb494566d
Provide an option to allow isolated singletons
2016-02-25 11:33:26 -06:00
Ralph Castain
d653cf2847
Convert the orte_job_data pointer array to a hash table so it doesn't grow forever as we run lots and lots of jobs in the persistent DVM.
2016-02-21 11:55:49 -08:00
Ralph Castain
8f9508cace
Further enhance the support for Singularity containers. Extend the "personality" command-line option to allow specifying both model (e.g., "ompi") and container (e.g., "singularity"), and add the necessary logic to support multiple options. Add a new pmix "isolated" component to handle singletons where no HNP is available since containers cannot launch the HNP.
2016-02-17 13:33:06 -08:00
Ralph Castain
8ab28cdc82
Fix a typo that causes segfaults on multi-node executions
2015-12-24 08:43:47 -08:00
annu13
43f44f31c1
moved code to job setup first before enabling comm
2015-12-22 14:30:59 -08:00
Ralph Castain
94ffe10808
Do not override any external settings for PMIx component selection
2015-12-21 08:36:12 -08:00
Ralph Castain
1db3db022a
Don't be so prescriptive about the ess component to be used - we just need to protect against the proc incorrectly taking the singleton component, so rule that one out. Ensure that the other components understand that they are only for use by daemons.
2015-12-09 19:54:44 -08:00
Ralph Castain
8823069fe9
Provide a mechanism by which a tool can request async progress thread support for ORTE
2015-12-04 08:26:57 -08:00
Nathan Hjelm
9602484568
Merge pull request #1040 from hjelmn/mtl_priority
...
Change how cm's priority is calculated
2015-10-19 14:18:36 -06:00
Nathan Hjelm
8b5810f7f7
mca/base: add priority output to mca_base_select
...
The mca_base_select function uses returned priorities to select the
best component/module. This priority may be of use to the caller so
pass that information back in an optional argument. If the priority is
not needed pass NULL.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-10-19 12:32:41 -06:00
Ralph Castain
363f62a506
Fix singleton operations when running under a SLURM allocation. Sadly, SLURM's PMI will return success even if the PMI server isn't actually available. This leads to erroneous selection of pmix and ess components. So add a further requirement (namely, that we see a job_step envar) to the SLURM pmix components along with some modification of ess selection code to avoid the problem
2015-10-17 20:24:03 -07:00
Ralph Castain
0140ff048d
Now that we have an "isolated" PLM component, we cannot just let rsh silently decline to run when it cannot find a launch agent - if we do, then we will -always- run on the local node. So if the user specifies a launch agent and we can't find it, then generate a pretty error message, report a fatal error back to the component select, and exit out.
...
This required modifying the mca_component_select function to actually check the return code on a component query - it was blissfully ignoring it.
Also do a little cleanup to avoid bombarding the user with multiple error messages.
Thanks to Patrick Begou for reporting the problem
2015-09-24 07:16:48 -07:00
Ralph Castain
749bd4e6fe
Plug a few memory leaks identified by valgrind
2015-09-23 15:21:04 -07:00
Ralph Castain
f872e99315
Fix orte-submit so it allows application procs to select the correct ess component. Protect orte_data_server from multiple calls to finalize.
2015-09-21 20:31:57 -07:00
Howard Pritchard
8d7e759b85
oob/alps: swat compiler warning
...
swat some alps related compiler warnings when using --enable-picky
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2015-09-21 14:24:26 -07:00
Ralph Castain
8b88ea9b13
Fix singletons by removing stale code
2015-09-16 00:58:05 -07:00
Ralph Castain
c1bbbb5e2f
Remove the last involvement of the OOB system from the MPI layer, remove the no-longer-needed usock/oob component, and have procs no longer open the RML, OOB, ROUTED, and GRPCOMM frameworks as PMIx now provides all required app-mpirun cmds
2015-09-15 13:08:35 -07:00
Ralph Castain
dc5796b8a1
Revert "Revert "Fix the handling of cpusets so we get the correct cpuset for each local peer. Add the ability to indicate that a modex request is "optional" so we don't call the server if we don't find the value. Take advantage of that to allow the MPI layer to decide that the lack of locality info indicates non-local""
...
Fix the locality computation by correctly computing the vpid of the local peer
This reverts commit open-mpi/ompi@6a8fad49e5 .
2015-09-11 08:29:51 -07:00
Ralph Castain
6a8fad49e5
Revert "Fix the handling of cpusets so we get the correct cpuset for each local peer. Add the ability to indicate that a modex request is "optional" so we don't call the server if we don't find the value. Take advantage of that to allow the MPI layer to decide that the lack of locality info indicates non-local"
...
This reverts commit f94f3cda214ab937c46802896fb53b84bec6cc3a.
2015-09-11 02:01:25 -07:00
Ralph Castain
f94f3cda21
Fix the handling of cpusets so we get the correct cpuset for each local peer. Add the ability to indicate that a modex request is "optional" so we don't call the server if we don't find the value. Take advantage of that to allow the MPI layer to decide that the lack of locality info indicates non-local
2015-09-10 10:25:30 -07:00
Ralph Castain
1cdb86b8c7
Cleanup s1 and s2 components, and ensure that mpirun and orteds only use non-direct-launch pmix components.
2015-09-08 18:37:09 -07:00
Ralph Castain
37c3ed68e7
Cleanup connect/disconnect and bring comm_spawn back online!
2015-09-06 10:27:39 -07:00
rhc54
665b30376a
Merge pull request #868 from rhc54/topic/hwloc
...
Remove OPAL_HAVE_HWLOC qualifier and error out if --without-hwloc is given
2015-09-04 17:58:07 -07:00
Ralph Castain
d97bc29102
Remove OPAL_HAVE_HWLOC qualifier and error out if --without-hwloc is given
2015-09-04 16:54:40 -07:00
Ralph Castain
f6948c2bb4
Sync with PMIx master 43e45c3. Get multi-node publish/lookup/unpublish working
2015-09-04 10:07:17 -07:00
Ralph Castain
38ba54366c
Fix shared memory operations by resolving local peers
2015-08-30 12:07:14 -07:00
Ralph Castain
0d5814b5ca
Cleanup Coverity issues
2015-08-29 21:19:27 -07:00