Ralph Castain
6e6bbfda91
Very minor typo
2016-03-23 08:31:47 -07:00
Ralph Castain
4a623778a9
Fix the debugger attach - previous commit had fixed one instance of a check prior to sending the release message, but there was a second code path that included a similar check that was missed. Thanks to John DelSignore for spotting it!
2016-03-23 08:25:25 -07:00
Howard Pritchard
69200e6229
plm/alps: fix usage of cray wlm_detect methods
...
It turns out that in some cases the Cray
wlm_detect_get_active method may return NULL, in which
case falling back to the wlm_detect_get_default method
is suggested. Make use of the fallback to
avoid segfaults under some circumstances in the
ALPS plm selection method.
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2016-03-22 11:40:56 -07:00
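The fallback described in the plm/alps commit above can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: the helper select_wlm_name and the two stub functions are hypothetical, and the function-pointer signatures are only assumed to mirror wlm_detect_get_active (assumed to return a heap string or NULL) and wlm_detect_get_default (assumed to return a constant string).

```c
#include <stdlib.h>
#include <string.h>

/* Assumed shapes of the two Cray wlm_detect calls (not verified here). */
typedef char *(*wlm_active_fn)(void);
typedef const char *(*wlm_default_fn)(void);

/*
 * Hypothetical helper: return a heap-allocated workload manager name.
 * Prefer the "active" answer; if it is NULL, copy the "default" answer
 * so the caller can always free() the result. Returns NULL only if
 * neither source yields a name (or on allocation failure).
 */
static char *select_wlm_name(wlm_active_fn get_active,
                             wlm_default_fn get_default)
{
    char *name = get_active();
    if (name != NULL) {
        return name;
    }

    const char *fallback = get_default();
    if (fallback == NULL) {
        return NULL;
    }

    size_t n = strlen(fallback) + 1;
    char *copy = malloc(n);
    if (copy != NULL) {
        memcpy(copy, fallback, n);
    }
    return copy;
}

/* Stubs standing in for the real library, for illustration only. */
static char *stub_active_null(void) { return NULL; }
static const char *stub_default_alps(void) { return "ALPS"; }
```

With the stubs above, select_wlm_name(stub_active_null, stub_default_alps) yields a freshly allocated copy of "ALPS" instead of a NULL pointer that a later string comparison would dereference.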
Ralph Castain
c146c4969b
Revert part of open-mpi/ompi@c1bbbb5e2f to restore the usock component, thus fixing show_help aggregation.
...
Fixes #1467
Restore debugger attach operations
Fixes #1225
2016-03-18 21:49:04 -07:00
Ralph Castain
8f410d7897
Revert one part of open-mpi/ompi@4d0cc27eb7
2016-03-18 07:23:30 -07:00
Ralph Castain
2970becd6b
Revert "Merge pull request #1451 from ggouaillardet/topic/orte_fork_wrapper_fullname"
...
This reverts commit efafd62d38bb12c161330d5a6e4f338e9b560a7e, reversing
changes made to a93b849f13b12a7b1c1cdde71a9e491ddc220e17.
2016-03-18 07:18:36 -07:00
Ralph Castain
a67ff065ae
Silence coverity warnings
2016-03-16 08:43:16 -07:00
Nysal Jan K.A
f6e932c864
Fix memory corruption in orte-ps
...
orte-ps ends up free'ing the same pointer multiple times
2016-03-15 16:03:31 +05:30
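A common way to guard against the kind of repeated free described in the orte-ps commit above is to clear the pointer once it has been released. The sketch below is generic C and is not the fix applied in that commit; free_and_clear is a hypothetical helper name.

```c
#include <stdlib.h>

/*
 * Hypothetical helper: free the buffer that *p points to and reset *p
 * to NULL, so a second call on the same variable is a harmless no-op
 * rather than a double free.
 */
static void free_and_clear(char **p)
{
    if (p != NULL && *p != NULL) {
        free(*p);
        *p = NULL;
    }
}
```

This only protects call sites that share the same pointer variable; two separate copies of the pointer would still need an ownership rule.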
Ralph Castain
6d7ada9675
Silence Coverity warning
2016-03-14 09:42:43 -07:00
Gilles Gouaillardet
589924c4aa
odls/base: use the full app name when using an orte fork agent
2016-03-14 11:18:21 +09:00
Anandhi S Jayakumar
a31292abc7
fixes to ud for removing qos channel
2016-03-10 18:03:17 -08:00
Ralph Castain
a4c8e8c28a
Cleanup the proposed change:
...
* qos framework is moving to the scon layer and is no longer required in ORTE
* remove the rml/ftrm component as we now have multiple active components, and so the wrapper needs to be rethought
* no need for separating the "base" from "API" module definition. The two are identical
* move the "stub" functions into their own file for cleanliness
* general cleanup to meet coding standards
* cleanup some logic in the stubs
2016-03-10 13:14:17 -08:00
Jeff Squyres
48c650c47a
configury: minor updates to config summary output
2016-03-10 13:02:52 -08:00
Anandhi S Jayakumar
0188c3cf81
Adding commit for multiple plugin loading support in RML
2016-03-09 18:13:48 -08:00
Ralph Castain
f7257a8310
Modify singularity support per patch from Greg Kurtzer
2016-03-09 07:52:11 -08:00
Ralph Castain
f3ae30ff39
Fix singletons yet again...
2016-03-08 10:33:35 -08:00
Ralph Castain
d72c1c72ff
Do not push child processes into separate process groups so that any host RM can still "see" them, and ensure that any signal sent to the orted's themselves will be provided to all child processes. Forward all signals from mpirun to the child processes, removing the old MCA parameter required to turn that behavior "on".
2016-03-06 17:55:09 -08:00
Ralph Castain
4d0cc27eb7
Update the singularity support to match that of the latest singularity master. Remove the restriction on shared memory components by instructing singularity to not isolate the PID space. Add a new schizo API to allow setting up the original app_context. Ensure the container is installed prior to execution.
2016-03-05 21:47:42 -08:00
Ralph Castain
ce0a05d7d1
Minor cleanup - Singularity now has an internal check for installed, so we no longer need to do so.
2016-03-04 19:07:53 -08:00
Gilles Gouaillardet
80bdbfd9e7
add missing include file
2016-03-03 13:46:28 +09:00
Ralph Castain
4a55fba414
Fix registration of error handlers thru the pmix120 component. A thread-shift operation was hanging on the sync_event_base, which made it dependent on someone calling opal_progress. Unfortunately, a process in "sleep" or spinning outside the MPI library won't do that, and so we never complete errhandler registration.
2016-03-02 15:01:01 -08:00
Ralph Castain
f0680008d1
Add test file for singularity
2016-03-02 05:40:41 -08:00
Ralph Castain
06e811c5a6
Properly use the OPAL_MCA_PREFIX in orte_submit
2016-03-01 18:16:40 -08:00
Ralph Castain
1b81d90eaa
Minor cleanups required for orte-dvm operation
2016-03-01 18:12:53 -08:00
Ralph Castain
c9f7bb6751
Add the include file to all the schizo components
2016-03-01 13:18:23 -08:00
Ralph Castain
625083fe18
Add include file
2016-03-01 13:04:20 -08:00
Ralph Castain
011403c04a
Fix a number of issues, some of which have lingered for a long time:
...
* provide a more reliable way of determining that a process is a singleton by leveraging the schizo framework. Add new components for slurm, alps, and orte to detect when we are in a managed environment, and if we have been launched by mpirun or a native launcher. Set the correct envars to control ess and pmix selection in each case.
* change the relative priority of the pmix120 and pmix112 components to make pmix120 the default
* fix singleton comm-spawn by correctly setting the num_apps field of the orte_job_t created by the daemon - this fixes a segfault in register_nspace on newly created daemons
* ensure orterun doesn't propagate any ess or pmix directives in its environment
* Cleanup a few valgrind issues and memory leaks
* Fix a race condition that prevented the client from completing notification registrations (missing thread shift)
* Ensure the schizo/alps component detects launch by mpirun
2016-03-01 06:53:00 -08:00
Ralph Castain
263b0c95a8
Fix a segfault that can occur when very short-lived, non-ORTE procs are run
2016-02-28 12:30:20 -08:00
Ralph Castain
cdb494566d
Provide an option to allow isolated singletons
2016-02-25 11:33:26 -06:00
Ralph Castain
e8d347d7bd
Add missing includes
2016-02-24 08:56:02 -06:00
Ralph Castain
77f800b7e8
Tools don't create the orte_job_data table, so don't remove jobs from it
2016-02-21 16:29:00 -08:00
Ralph Castain
64b7728f33
Fix typo - do not look at daemon job when considering completion of launch
2016-02-21 14:44:51 -08:00
Ralph Castain
d653cf2847
Convert the orte_job_data pointer array to a hash table so it doesn't grow forever as we run lots and lots of jobs in the persistent DVM.
2016-02-21 11:55:49 -08:00
Ralph Castain
309e23ab3a
Fix minor typo
2016-02-20 01:33:10 -08:00
Ralph Castain
0c72ba89b9
Cleanup the output-filename options so they work as expected. Have the remote nodes output locally to the files instead of sending it all back to the HNP.
...
Fix Solaris issues by renaming struct field
2016-02-19 12:41:46 -08:00
rhc54
bfd4254a7b
Merge pull request #1382 from rhc54/topic/cleanup
...
Cleanup some valgrind complaints about jumps with uninitialized values.
2016-02-18 17:29:37 -08:00
Nathan Hjelm
27e7b6e466
Merge pull request #1381 from hjelmn/ddt_colon_fix
...
orterun: allow DDT if options contain :'s
2016-02-18 17:48:21 -07:00
Ralph Castain
6e68d758b9
Cleanup some valgrind complaints about jumps with uninitialized values. Fix a few IOF issues reported by Mark Santcroos when submitting jobs from tools. Add the ability to pass directives to the --output-filename option that tell ORTE to (a) not include the jobid in the path to the output files, and (b) not to copy the output to the tool (i.e., just store it in the files).
...
Remove stale debug
Fix a segfault if no subscribers are present
2016-02-18 16:30:37 -08:00
Nathan Hjelm
69de442136
orterun: allow DDT if options contain :'s
...
There is a bug in MPMD detection that disables totalview if a : is
found anywhere on the command line. This includes inside an argument
option or MCA variable value. This commit changes the check to look
for the string " : " instead of the character : which should eliminate
the issue in most cases.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-02-18 16:56:08 -07:00
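The difference between the two checks described in the orterun commit above can be illustrated with a small sketch. The function names are hypothetical and the real orterun code may operate on argv rather than on a single string; only the ':' versus " : " distinction is taken from the commit message.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Old-style check: any ':' anywhere on the command line counts as MPMD. */
static bool looks_mpmd_any_colon(const char *cmdline)
{
    return cmdline != NULL && strchr(cmdline, ':') != NULL;
}

/* New-style check: only a space-delimited " : " counts as an MPMD
 * separator, so a ':' inside an option or MCA value is ignored. */
static bool looks_mpmd_delimited(const char *cmdline)
{
    return cmdline != NULL && strstr(cmdline, " : ") != NULL;
}
```

For a command line such as "mpirun --mca btl_tcp_if_include eth0:1 ./a.out", the first check reports MPMD while the second does not; for "mpirun -n 1 ./a : -n 2 ./b", both do. As the commit message notes, this removes the problem in most cases, not all: a quoted argument that itself contains " : " would still match.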
Ralph Castain
1748f44147
Stop a segfault that results in zombied processes by checking for NULL prior to object release
2016-02-18 13:48:41 -08:00
Ralph Castain
60a7bc2e50
Enable the PMIx notification callback system. This currently is only supported by the pmix120 component, which is not selected by default. All other components will ignore error registration requests, and thus do not support debugger attach when launched via mpirun. Note that direct launched applications will support such attachment, but may not do so in a scalable fashion.
...
Fixes #1225
2016-02-18 09:29:12 -08:00
Nysal Jan K.A
cc9b1316a4
Make UD OOB memory registrations a multiple of page size
...
If ibv_fork_init() has been invoked the pages are marked MADV_DONTFORK.
If we only partially use a page, any data allocated on the remainder of
the page will be inaccessible to the child process.
Fixes open-mpi/ompi#1363
2016-02-17 22:19:49 -05:00
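Rounding a registration length up to a whole number of pages, as the UD OOB commit above describes, can be sketched as follows. This is an illustration only: round_up_to_page is a hypothetical helper, and the 4096-byte fallback used when sysconf fails is an assumption.

```c
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical helper: round len up to the next multiple of the system
 * page size, so a registered region never shares a page with unrelated
 * data that a forked child might need.
 */
static size_t round_up_to_page(size_t len)
{
    long ps = sysconf(_SC_PAGESIZE);
    size_t page = (ps > 0) ? (size_t)ps : 4096; /* assumed fallback */
    return ((len + page - 1) / page) * page;
}
```

Rounding the length alone is not sufficient if the buffer does not start on a page boundary; an allocator such as posix_memalign would be one way to obtain an aligned start address.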
rhc54
dc4d3edc06
Merge pull request #1372 from rhc54/topic/sing
...
Further enhance the support for Singularity containers.
2016-02-17 16:39:23 -08:00
Ralph Castain
8f9508cace
Further enhance the support for Singularity containers. Extend the "personality" command-line option to allow specifying both model (e.g., "ompi") and container (e.g., "singularity"), and add the necessary logic to support multiple options. Add a new pmix "isolated" component to handle singletons where no HNP is available since containers cannot launch the HNP.
2016-02-17 13:33:06 -08:00
Howard Pritchard
31841b4367
ras/alps: squelch common symbol warnings
...
squelch a couple of warnings from the common symbols
script.
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2016-02-17 13:27:29 -06:00
Ralph Castain
e0de4423ba
Remove debug
2016-02-16 20:58:53 -08:00
Ralph Castain
50431001a3
Modify the IOF subsystem to handle per-job directives for redirecting IO to files, tagging IO, and timestamping IO.
...
Fix stdin reader
2016-02-16 18:54:38 -08:00
Mark Santcroos
14f0390b7d
Release child object when we are recording someone's relatives.
...
(Thanks to Mark Santcroos!)
Release routing list entries.
(Thanks to Mark Santcroos!)
Address some Coverity concerns
2016-02-15 20:50:42 -08:00
Ralph Castain
351070659e
Correct ordering when checking for privileged ports
2016-02-14 09:43:01 -08:00
rhc54
59cc1f0a96
Merge pull request #1357 from rhc54/topic/oob
...
Protect against a non-privileged port connecting to us when we are running as root
2016-02-13 08:12:29 -08:00
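The privileged-port protection mentioned in the merge above can be illustrated with a small predicate. This is a sketch under assumptions, not the OOB code: accept_peer_port and PRIVILEGED_PORT_LIMIT are hypothetical names, and the limit of 1024 reflects the conventional boundary of privileged ports on Unix-like systems.

```c
#include <stdbool.h>
#include <stdint.h>

/* Ports below this value conventionally require root to bind. */
#define PRIVILEGED_PORT_LIMIT 1024

/*
 * Hypothetical predicate: when we run as root, only accept peers whose
 * source port (host byte order) is privileged; otherwise accept any
 * port.
 */
static bool accept_peer_port(bool running_as_root, uint16_t peer_port)
{
    if (!running_as_root) {
        return true;
    }
    return peer_port < PRIVILEGED_PORT_LIMIT;
}
```

A caller would typically pass geteuid() == 0 as the first argument and the peer's port converted with ntohs() as the second.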