Commit graph

5273 commits

Author SHA1 Message Date
Ralph Castain
bd18d9c9d5 Ensure the compiler knows that a critical variable is volatile 2016-03-29 09:18:25 -07:00
Howard Pritchard
e7433fcb44 Merge pull request #1486 from hppritcha/topic/fix_wlm_detect_code
plm/alps: fix usage of cray wlm_detect methods
2016-03-26 13:22:50 -06:00
Ralph Castain
0e1350f5b7 Add missing header files 2016-03-25 09:06:51 -07:00
Ralph Castain
a3fea58d1c Minor cleanups to prior PR commit 2016-03-24 15:55:14 -07:00
rhc54
6756e19aa2 Merge pull request #1457 from anandhis/master
rml changes
2016-03-24 15:17:29 -07:00
rhc54
ba8c8700aa Merge pull request #1493 from rhc54/topic/sing
Update singularity support to track changes in upstream Singularity code
2016-03-24 15:16:38 -07:00
Ralph Castain
8c14df2328 Revert "Modify singularity support per patch from Greg Kurtzer"
This reverts commit open-mpi/ompi@f7257a8310.

Ensure that we properly cleanup the session directory tree. Prior code had issues with symlinks, especially if the file that the link points to was already removed as we traverse the tree. Also found that the dirent checks for directory type weren't fully portable, and so fall back to the stat-based approach which is known to be portable.

Fix singularity singletons by detecting we are in a container and properly setting the pmix selection to pick the isolated component. Remove a stale restriction blocking use of the sm btl
2016-03-24 11:27:18 -07:00
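The session-directory cleanup described in 8c14df2328 can be sketched roughly as follows. This is a minimal Python illustration, not the ORTE C code: the helper name `remove_tree` is hypothetical, and it assumes the intent is to classify each entry with an lstat-style call (rather than the directory-entry type field) and to unlink symlinks without following them, so a link whose target is already gone does no harm.

```python
import os
import stat


def remove_tree(path):
    """Recursively delete `path`, classifying entries via lstat.

    Hypothetical helper for illustration only. lstat does not follow
    symlinks, so a link whose target was already removed is simply unlinked.
    """
    for name in os.listdir(path):
        entry = os.path.join(path, name)
        try:
            mode = os.lstat(entry).st_mode
        except FileNotFoundError:
            # Entry vanished while we were traversing; nothing left to do.
            continue
        if stat.S_ISDIR(mode):
            remove_tree(entry)   # a real directory: recurse into it
        else:
            os.unlink(entry)     # regular file or symlink (dangling or not)
    os.rmdir(path)
```

Because the link itself is inspected rather than its target, a symlink to a directory is unlinked instead of being descended into.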
Ralph Castain
378d9cbb5e Extend the abort on non zero status flag to apply to processes which die as the result of signals. 2016-03-24 08:33:55 -07:00
Ralph Castain
cdd3dc99ca Correct the binding for the --map-by node case - we should still use our default binding algorithms 2016-03-23 09:55:24 -07:00
Ralph Castain
6e6bbfda91 Very minor typo 2016-03-23 08:31:47 -07:00
Ralph Castain
4a623778a9 Fix the debugger attach - previous commit had fixed one instance of a check prior to sending the release message, but there was a second code path that included a similar check that was missed. Thanks to John DelSignore for spotting it! 2016-03-23 08:25:25 -07:00
Howard Pritchard
69200e6229 plm/alps: fix usage of cray wlm_detect methods
Turns out there are some cases where the Cray
wlm_detect_get_active may return NULL, in which
case fallback to wlm_detect_get_default method
is suggested.  Make use of the fallback to
avoid segfaults under some circumstances in the
ALPS plm selection method.

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2016-03-22 11:40:56 -07:00
Ralph Castain
c146c4969b Revert part of open-mpi/ompi@c1bbbb5e2f to restore the usock component, thus fixing show_help aggregation.
Fixes #1467

Restore debugger attach operations

Fixes #1225
2016-03-18 21:49:04 -07:00
Ralph Castain
8f410d7897 Revert one part of open-mpi/ompi@4d0cc27eb7 2016-03-18 07:23:30 -07:00
Ralph Castain
2970becd6b Revert "Merge pull request #1451 from ggouaillardet/topic/orte_fork_wrapper_fullname"
This reverts commit efafd62d38, reversing
changes made to a93b849f13.
2016-03-18 07:18:36 -07:00
Ralph Castain
a67ff065ae Silence coverity warnings 2016-03-16 08:43:16 -07:00
Nysal Jan K.A
f6e932c864 Fix memory corruption in orte-ps
orte-ps ends up free'ing the same pointer multiple times
2016-03-15 16:03:31 +05:30
Ralph Castain
6d7ada9675 Silence Coverity warning 2016-03-14 09:42:43 -07:00
Gilles Gouaillardet
589924c4aa odls/base: use the full app name when using an orte fork agent 2016-03-14 11:18:21 +09:00
Anandhi S Jayakumar
a31292abc7 fixes to ud for removing qos channel 2016-03-10 18:03:17 -08:00
Ralph Castain
a4c8e8c28a Cleanup the proposed change:
* qos framework is moving to the scon layer and is no longer required in ORTE

* remove the rml/ftrm component as we now have multiple active components, and so the wrapper needs to be rethought

* no need for separating the "base" from "API" module definition. The two are identical

* move the "stub" functions into their own file for cleanliness

* general cleanup to meet coding standards

* cleanup some logic in the stubs
2016-03-10 13:14:17 -08:00
Jeff Squyres
48c650c47a configury: minor updates to config summary output 2016-03-10 13:02:52 -08:00
Anandhi S Jayakumar
0188c3cf81 Adding commit for multiple plugin loading support in RML 2016-03-09 18:13:48 -08:00
Ralph Castain
f7257a8310 Modify singularity support per patch from Greg Kurtzer 2016-03-09 07:52:11 -08:00
Ralph Castain
f3ae30ff39 Fix singletons yet again... 2016-03-08 10:33:35 -08:00
Ralph Castain
d72c1c72ff Do not push child processes into separate process groups so that any host RM can still "see" them, and ensure that any signal sent to the orted's themselves will be provided to all child processes. Forward all signals from mpirun to the child processes, removing the old MCA parameter required to turn that behavior "on". 2016-03-06 17:55:09 -08:00
Ralph Castain
4d0cc27eb7 Update the singularity support to match that of the latest singularity master. Remove the restriction on shared memory components by instructing singularity to not isolate the PID space. Add a new schizo API to allow setting up the original app_context. Ensure the container is installed prior to execution. 2016-03-05 21:47:42 -08:00
Ralph Castain
ce0a05d7d1 Minor cleanup - Singularity now has an internal check for installed, so we no longer need to do so. 2016-03-04 19:07:53 -08:00
Gilles Gouaillardet
80bdbfd9e7 add missing include file 2016-03-03 13:46:28 +09:00
Ralph Castain
4a55fba414 Fix registration of error handlers thru the pmix120 component. A thread-shift operation was hanging on the sync_event_base, which made it dependent on someone calling opal_progress. Unfortunately, a process in "sleep" or spinning outside the MPI library won't do that, and so we never complete errhandler registration. 2016-03-02 15:01:01 -08:00
Ralph Castain
f0680008d1 Add test file for singularity 2016-03-02 05:40:41 -08:00
Ralph Castain
06e811c5a6 Properly use the OPAL_MCA_PREFIX in orte_submit 2016-03-01 18:16:40 -08:00
Ralph Castain
1b81d90eaa Minor cleanups required for orte-dvm operation 2016-03-01 18:12:53 -08:00
Ralph Castain
c9f7bb6751 Add the include file to all the schizo components 2016-03-01 13:18:23 -08:00
Ralph Castain
625083fe18 Add include file 2016-03-01 13:04:20 -08:00
Ralph Castain
011403c04a Fix a number of issues, some of which have lingered for a long time:
* provide a more reliable way of determining that a process is a singleton by leveraging the schizo framework. Add new components for slurm, alps, and orte to detect when we are in a managed environment, and if we have been launched by mpirun or a native launcher. Set the correct envars to control ess and pmix selection in each case.

* change the relative priority of the pmix120 and pmix112 components to make pmix120 the default

* fix singleton comm-spawn by correctly setting the num_apps field of the orte_job_t created by the daemon - this fixes a segfault in register_nspace on newly created daemons

* ensure orterun doesn't propagate any ess or pmix directives in its environment

* Cleanup a few valgrind issues and memory leaks

* Fix a race condition that prevented the client from completing notification registrations (missing thread shift)

* Ensure the schizo/alps component detects launch by mpirun
2016-03-01 06:53:00 -08:00
Ralph Castain
263b0c95a8 Fix a segfault that can occur when very short-lived, non-ORTE procs are run 2016-02-28 12:30:20 -08:00
Ralph Castain
cdb494566d Provide an option to allow isolated singletons 2016-02-25 11:33:26 -06:00
Ralph Castain
e8d347d7bd Add missing includes 2016-02-24 08:56:02 -06:00
Ralph Castain
77f800b7e8 Tools don't create the orte_job_data table, so don't remove jobs from it 2016-02-21 16:29:00 -08:00
Ralph Castain
64b7728f33 Fix typo - do not look at daemon job when considering completion of launch 2016-02-21 14:44:51 -08:00
Ralph Castain
d653cf2847 Convert the orte_job_data pointer array to a hash table so it doesn't grow forever as we run lots and lots of jobs in the persistent DVM. 2016-02-21 11:55:49 -08:00
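The motivation for d653cf2847 can be illustrated with a small Python sketch. It is not the ORTE data structure: `JobTable` and its methods are hypothetical names, and the sketch assumes only what the message states, namely that an index-addressed array keeps growing as jobs come and go while a keyed table can drop finished entries.

```python
class JobTable:
    """Hypothetical job registry keyed by jobid (illustration only)."""

    def __init__(self):
        self._jobs = {}

    def add(self, jobid, job):
        self._jobs[jobid] = job

    def remove(self, jobid):
        # Removing a finished job frees its slot entirely.
        self._jobs.pop(jobid, None)

    def __len__(self):
        return len(self._jobs)


def run_jobs_with_array(njobs):
    """Index-addressed storage: a cleared slot still occupies space."""
    slots = []
    for jobid in range(njobs):
        slots.append(("job", jobid))
        slots[jobid] = None      # job finished, but the slot remains
    return len(slots)


def run_jobs_with_table(njobs):
    """Keyed storage: the table shrinks back as jobs finish."""
    table = JobTable()
    for jobid in range(njobs):
        table.add(jobid, ("job", jobid))
        table.remove(jobid)
    return len(table)
```

In a long-running DVM the first pattern grows with the total number of jobs ever run, while the second stays proportional to the number of jobs currently alive.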
Ralph Castain
309e23ab3a Fix minor typo 2016-02-20 01:33:10 -08:00
Ralph Castain
0c72ba89b9 Cleanup the output-filename options so they work as expected. Have the remote nodes output locally to the files instead of sending it all back to the HNP.
Fix Solaris issues by renaming struct field
2016-02-19 12:41:46 -08:00
rhc54
bfd4254a7b Merge pull request #1382 from rhc54/topic/cleanup
Cleanup some valgrind complaints about jumps with uninitialized values.
2016-02-18 17:29:37 -08:00
Nathan Hjelm
27e7b6e466 Merge pull request #1381 from hjelmn/ddt_colon_fix
orterun: allow DDT if options contain :'s
2016-02-18 17:48:21 -07:00
Ralph Castain
6e68d758b9 Cleanup some valgrind complaints about jumps with uninitialized values. Fix a few IOF issues reported by Mark Santcroos when submitting jobs from tools. Add the ability to pass directives to the --output-filename option that tell ORTE to (a) not include the jobid in the path to the output files, and (b) not to copy the output to the tool (i.e., just store it in the files).
ck

Remove stale debug

Fix a segfault if no subscribers are present
2016-02-18 16:30:37 -08:00
Nathan Hjelm
69de442136 orterun: allow DDT if options contain :'s
There is a bug in MPMD detection that disables totalview if a : is
found anywhere on the command line. This includes inside an argument
option or MCA variable value. This commit changes the check to look
for the string " : " instead of the character : which should eliminate
the issue in most cases.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-02-18 16:56:08 -07:00
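The difference between the two checks described in 69de442136 can be sketched as below. This is a Python illustration of the idea only, not the orterun source; the function names and the sample command lines (including the MCA parameter name `some_param`) are made up for the example.

```python
def naive_is_mpmd(cmdline):
    """Earlier behaviour as described: any ':' is taken as an MPMD separator."""
    return ":" in cmdline


def is_mpmd(cmdline):
    """Revised check: only a colon with a space on each side counts."""
    return " : " in cmdline


# Hypothetical command lines used only to exercise the two checks.
single_app = "mpirun -np 2 --mca some_param a:b ./app"
two_apps = "mpirun -np 1 ./app_a : -np 1 ./app_b"
```

With these inputs the naive check flags both command lines as MPMD, while the revised check flags only `two_apps`. An argument value that itself contains " : " would still be misdetected, which is consistent with the message saying the issue is eliminated "in most cases".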
Ralph Castain
1748f44147 Stop a segfault that results in zombied processes by checking for NULL prior to object release 2016-02-18 13:48:41 -08:00
Ralph Castain
60a7bc2e50 Enable the PMIx notification callback system. This currently is only supported by the pmix120 component, which is not selected by default. All other components will ignore error registration requests, and thus do not support debugger attach when launched via mpirun. Note that direct launched applications will support such attachment, but may not do so in a scalable fashion.
Fixes #1225
2016-02-18 09:29:12 -08:00
Nysal Jan K.A
cc9b1316a4 Make UD OOB memory registrations a multiple of page size
If ibv_fork_init() has been invoked the pages are marked MADV_DONTFORK.
If we only partially use a page, any data allocated on the remainder of
the page will be inaccessible to the child process.

Fixes open-mpi/ompi#1363
2016-02-17 22:19:49 -05:00
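The sizing rule behind cc9b1316a4 amounts to rounding each registration length up to a whole number of pages, so that no unrelated data shares a page with a registered buffer. A minimal Python sketch of that arithmetic follows; `round_up_to_page` is a hypothetical name, and the real change is in the C code of the UD OOB component.

```python
import mmap


def round_up_to_page(nbytes, page_size=mmap.PAGESIZE):
    """Return the smallest multiple of page_size that is >= nbytes."""
    if nbytes < 0 or page_size <= 0:
        raise ValueError("nbytes must be >= 0 and page_size must be > 0")
    return ((nbytes + page_size - 1) // page_size) * page_size
```

For example, with a 4096-byte page a 4097-byte request becomes 8192 bytes, while a request that is already page-aligned is left unchanged.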
rhc54
dc4d3edc06 Merge pull request #1372 from rhc54/topic/sing
Further enhance the support for Singularity containers.
2016-02-17 16:39:23 -08:00
Ralph Castain
8f9508cace Further enhance the support for Singularity containers. Extend the "personality" command-line option to allow specifying both model (e.g., "ompi") and container (e.g., "singularity"), and add the necessary logic to support multiple options. Add a new pmix "isolated" component to handle singletons where no HNP is available since containers cannot launch the HNP. 2016-02-17 13:33:06 -08:00
Howard Pritchard
31841b4367 ras/alps: squelch common symbol warnings
squelch a couple of warnings from the common symbols
script.

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2016-02-17 13:27:29 -06:00
Ralph Castain
e0de4423ba Remove debug 2016-02-16 20:58:53 -08:00
Ralph Castain
50431001a3 Modify the IOF subsystem to handle per-job directives for redirecting IO to files, tagging IO, and timestamping IO.
Fix stdin reader
2016-02-16 18:54:38 -08:00
Mark Santcroos
14f0390b7d Release child object when we are recording someone's relatives.
(Thanks to Mark Santcroos!)

Release routing list entries.
(Thanks to Mark Santcroos!)

Address some Coverity concerns
2016-02-15 20:50:42 -08:00
Ralph Castain
351070659e Correct ordering when checking for privileged ports 2016-02-14 09:43:01 -08:00
rhc54
59cc1f0a96 Merge pull request #1357 from rhc54/topic/oob
Protect against a non-privileged port connecting to us when we are running as root
2016-02-13 08:12:29 -08:00
Ralph Castain
06c3dfc052 Refactor the ORTE DVM code so that external codes can submit multiple jobs using only a single connection to the HNP.
* Clean up the DVM so it continues to run even when applications error out and we would ordinarily abort the daemons.
* Create a new errmgr component for the DVM to handle the differences.
* Cleanup the DVM state component.
* Add ORTE bindings directory and brief README
* Pass a local tool index around to match jobs.
* Pass the jobid on job completion.
* Fix initialization logic.
* Add framework for python wrapper.
* Fix terminate-with-non-zero-exit behavior so it properly terminates only the indicated procs, notifies orte-submit, and orte-dvm continues executing.
* Add some missing options to orte-dvm
* Fix a bug in -host processing that caused us to ignore the #slots designator. Add a new attribute to indicate "do not expand the DVM" when submitting job spawn requests.
* It actually makes no sense that we treat the termination of all children differently than terminating the children of a specific job - it only creates confusion over the difference in behavior. So terminate children the same way regardless.

Extend the cmd_line utility to easily allow layering of command line definitions

Catch up with ORTE interface change and make build more generic.

Disable "fixed dvm" logic for now.

Add another cmd_line function to merge a table of cmd line options with another one, reporting as errors any duplicate entries. Use this to allow orterun to reuse the orted_submit code

Fix the "fixed_dvm" logic by ensuring we reset num_new_daemons to zero. Also ensure that the nidmap is sent with the first job so the downstream daemons get the node info. Remove a duplicate cmd line entry in orterun.

Revise the DVM startup procedure to pass the nidmap only once, at the startup of the DVM. This reduces the overhead on each job launch and ensures that the nidmap doesn't get overwritten.

Add new commands to get_orted_comm_cmd_str().

Move ORTE command line options to orte_globals.[ch].

Catch up with extra orte_submit_init parameter.

Add example code.

Add documentation.

Bump version.

The nidmap and routing data must be updated prior to propagating the xcast or else the xcast will fail.

Fix the return code so it is something more expected when an error occurs. Ensure we get an error returned to us when we fail to launch for some reason. In this case, we will always get a launch_cb as we did indeed attempt to spawn it. The error code will be returned in the complete_cb.

Fix the return code from orte_submit_job - it was returning the tracker index instead of "success". Take advantage of ORTE's pretty-print capabilities to provide a nice error output explaining why we failed to launch. Ensure we always get a launch_cb when we fail to launch, but no complete_cb as the job never launched.

Extend the error reporting capability to job completion as well.

Add index parameter to orte_submit_job().

Add orte_job_cancel and implement ORTE_DAEMON_TERMINATE_JOB_CMD.

Factor out dvm termination.

Parse the terminate option at tool level.

Add error string for ORTE_ERR_JOB_CANCELLED.

Add some safeguards.

Cleanup and/of comments.

Enable the return.

Properly ORTE_DECLSPEC orte_submit_halt.

Add orte_submit_halt and orte_submit_cancel to interface.

Use the plm interface to terminate the job
2016-02-13 08:10:44 -08:00
Ralph Castain
233bd085ca Protect against a non-privileged port connecting to us when we are running as root
Don't close the listener socket upon error unless we are giving up

Cleanup the incoming socket
2016-02-13 08:07:27 -08:00
Ralph Castain
aa9e5a1a27 Add support for Singularity containers, including a .m4 file for checking if Singularity is available and an orte/schizo component for setting the proper support if a container was given as the executable
Cleanup the configury so we properly check for Singularity under the various typical use-cases

Bring the Singularity support online. We have to turn "off" the sm BTL as it segfaults from inside the container - root cause remains unclear. Also turned "off" the various OPAL shmem components in case they are involved and someone else tries to use them. Happily, the vader BTL works just fine!
2016-02-13 04:40:22 -08:00
Gilles Gouaillardet
b55b9e6aee sentinel: fix sentinel to proc_name conversion
converting an opal_process_name_t means the loss of one bit,
it was decided to restrict the local job id to 15 bits, so the
useful information of an opal_process_name_t can fit in 63 bits.
2016-02-10 15:44:07 +09:00
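The bit budget mentioned in b55b9e6aee can be checked with a short Python sketch. The layout here is an assumption made for illustration, not taken from the source: a 32-bit vpid, a jobid made of a 16-bit job family plus a local job id restricted to 15 bits, and the lowest bit of a 64-bit word used as the sentinel tag. The function names are hypothetical.

```python
VPID_BITS = 32
LOCAL_JOBID_BITS = 15   # restricted to 15 bits, per the commit message
JOB_FAMILY_BITS = 16    # assumption: jobid = job family + local job id


def name_to_sentinel(job_family, local_jobid, vpid):
    """Pack a process name into 63 bits and set the low bit as a tag."""
    if not 0 <= job_family < (1 << JOB_FAMILY_BITS):
        raise ValueError("job_family out of range")
    if not 0 <= local_jobid < (1 << LOCAL_JOBID_BITS):
        raise ValueError("local_jobid does not fit in 15 bits")
    if not 0 <= vpid < (1 << VPID_BITS):
        raise ValueError("vpid out of range")
    jobid = (job_family << LOCAL_JOBID_BITS) | local_jobid
    payload = (jobid << VPID_BITS) | vpid      # 16 + 15 + 32 = 63 bits
    return (payload << 1) | 1                  # low bit marks a sentinel


def sentinel_to_name(sentinel):
    """Inverse of name_to_sentinel; requires the tag bit to be set."""
    if not sentinel & 1:
        raise ValueError("not a sentinel")
    payload = sentinel >> 1
    vpid = payload & ((1 << VPID_BITS) - 1)
    jobid = payload >> VPID_BITS
    local_jobid = jobid & ((1 << LOCAL_JOBID_BITS) - 1)
    job_family = jobid >> LOCAL_JOBID_BITS
    return job_family, local_jobid, vpid
```

Under these assumptions the largest possible name still fits in a 64-bit word together with the tag bit, and the conversion round-trips without losing information.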
Jeff Squyres
7850517215 brucks: rename the "brks" component to be "brucks"
After hearing the 3rd person ask what "brks" stood for, I'm renaming
this component to be "brucks" (because it uses a Bruck-based algorithm).
2016-02-09 13:17:11 -08:00
Ralph Castain
3fbad2e2bd Transfer across the -host number of slots 2016-02-08 10:38:03 -08:00
Ralph Castain
68912d04a8 Fix the grpcomm operations at scale. Restore the direct component to be the default, and to execute a rollup collective. This may in fact be faster than the alternatives, and something appears broken at scale when using brks in particular. Turn off the rcd and brks components as they don't work at scale right now - they can be restored at some future point when someone can debug them.
Adjust to Jeff's quibbles

Fixes open-mpi/mpi#1215
2016-02-04 05:42:29 -08:00
Igor Ivanov
34d861dfe9 orte/oob: Fix issue #1301
Signed-off-by: Igor Ivanov <Igor.Ivanov@itseez.com>
2016-01-20 12:08:00 +02:00
Gilles Gouaillardet
7d6b75f3b2 orte_util_snprintf_jobid: return ORTE_SUCCESS or ORTE_ERROR 2016-01-18 09:44:33 +09:00
Ralph Castain
fc6b260146 Protect against PMIx-based requests that don't come thru the MPI comm_spawn interface 2016-01-16 13:36:06 -08:00
Ralph Castain
4dad5de8ff Silence a couple of warnings - strncpy returns a char*, not an int 2016-01-16 09:44:52 -08:00
Jeff Squyres
60ffe713b8 common syms: whitelist bison-generated common symbols
Bison generates some common symbols that we can't do anything about,
so whitelist them.
2016-01-16 03:53:14 -08:00
Gilles Gouaillardet
1d38430e43 opal: replace opal_convert_jobid_to_string with opal_snprintf_jobid 2016-01-14 10:39:03 +09:00
Gilles Gouaillardet
4c43fb2a50 orte_rmaps_base_map_job: set OPAL_BIND_ALLOW_OVERLOAD when needed 2016-01-13 17:13:36 +09:00
Ralph Castain
332019b43a Silence warning 2016-01-10 09:59:36 -08:00
Nathan Hjelm
fab1eca536 grpcomm: fix bugs in grpcomm algorithms
This commit fixes multiple issues in the bruck's and recursive
doubling grpcomm algorithms. The following changes are included:

 - Use the existing bitmap implementation instead of implementing a
   new one. There were bugs in the implementation that caused an
   overrun of the bitmap array.

 - Clean up the algorithms to eliminate errors.

 - Send as little extra data as possible in the bruck's
   algorithm.

The changes were tested with various numbers of ortes varying from 1
to 4096.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-01-07 10:12:08 -07:00
Ralph Castain
f53d3c7a18 Silence warning 2015-12-30 10:16:58 -08:00
Ralph Castain
0a6b8d2c14 Correctly handle connection terminations during finalize so mpirun doesn't hang. Cleanup some corner cases in the error notification system 2015-12-30 07:16:43 -08:00
Ralph Castain
1cdc1c121c Revert "Standardize the handling of shutdown in the OOB TCP component"
This reverts commit open-mpi/ompi@12dccaa911.
2015-12-30 07:05:40 -08:00
Ralph Castain
12dccaa911 Standardize the handling of shutdown in the OOB TCP component 2015-12-29 07:57:22 -08:00
rhc54
5dfb7ac396 Merge pull request #1266 from ggouaillardet/topic/misc_pmix_fixes
Topic/misc pmix fixes
2015-12-29 07:02:44 -08:00
Ralph Castain
810f2446b7 Add pmix120 component, update the error handling functions in the PMIx API.
Update the configure logic for the new pmix120 component

ckpt

Get the pmix120 component to work - still not really registering or handling notifications, but infrastructure now operates

Cleanup some of the symbol scopes, and provide a more comprehensive rename.h file. Will pretty it up later - let's see how this works

Cleanup the rename files to use the pretty macros
2015-12-28 23:15:44 +09:00
Gilles Gouaillardet
352b05a552 rmaps: warn if oversubscribing when manually setting the number of hosts
This is a port of the v1.10 series one-off open-mpi/ompi-release@8c5ce45ab6
2015-12-28 10:38:57 +09:00
Ralph Castain
8ab28cdc82 Fix a typo that causes segfaults on multi-node executions 2015-12-24 08:43:47 -08:00
rhc54
d7199dc75b Merge pull request #1255 from annu13/fixup
Fixup
2015-12-22 20:54:48 -08:00
annu13
43f44f31c1 moved code to job setup first before enabling comm 2015-12-22 14:30:59 -08:00
Howard Pritchard
39367ca0bf plm/alps: only use srun for Native SLURM
Turns out that the way the SLURM plm works
is not compatible with the way MPI processes
on Cray XC obtain RDMA credentials to use
the high speed network.  Unlike with ALPS,
the mpirun process is on the first compute
node in the job.  With the current PLM launch
system, mpirun (HNP daemon) launches the MPI
ranks on that node rather than relying on
srun.

This will probably require a significant amount
of effort to rework to support Native SLURM
on Cray XC's.  As a short term alternative,
have the alps plm (which gets selected by default
again on Cray systems regardless of the launch system)
check whether or not srun or alps is being used on the
system.  If alps is not being used, print a helpful
message for the user and abort the job launch.

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2015-12-22 11:03:42 -08:00
rhc54
d9cd451a16 Merge pull request #1250 from rhc54/topic/rf
Fix the default slot mapping in rank file mapper
2015-12-21 10:57:52 -08:00
Ralph Castain
7cc5879bdd Fix the default slot mapping in rank file mapper 2015-12-21 09:47:27 -08:00
Ralph Castain
94ffe10808 Do not override any external settings for PMIx component selection 2015-12-21 08:36:12 -08:00
Jeff Squyres
53ca721ff4 configury: clean up .so version numbers
Move .so version numbers to their appropriate project in the top-level
VERSION file.  Also add the project name to all .so version number
names.  Remove no-longer-used .so names.
2015-12-18 12:50:23 -05:00
Ralph Castain
64b695669a Cleanup warnings in opal and orte layers when building optimized on Mac 2015-12-17 07:51:24 -08:00
Ralph Castain
3a56f0d34b Create the pmix external component. Fix a few places where opal/util/argv.h were required when building with an external pmix (go figure).
NOTE: Building with external pmix *requires* that you also build with external libevent and hwloc libraries. Detect this at configure and error out with large message if this requirement is violated.

Closes #1204  (replaces it)
Fixes #1064
2015-12-15 15:26:13 -08:00
Howard Pritchard
7a82174747 Merge pull request #1195 from hppritcha/topic/wlm_detect
support Cray nativized slurm environment
2015-12-15 07:58:53 -07:00
Jeff Squyres
3e308f41f7 rmaps base help: update binding error messages
Due to user confusion, update the show-help messages displayed when
processor and/or memory binding fails.  Thanks to Dave Love
(@loveshack) for the initial suggestion.

Fixes open-mpi/ompi#1087
2015-12-14 13:02:41 -05:00
Ralph Castain
03eb1a80bf Update the PMIx native component to release v1.1.1, with addition of one bug-fix commit beyond the official release
Rename the pmix1xx component to pmix111 so it reflects the actual release it includes

Resolve the problem of PMIx being passed a bogus --with-platform argument when configuring the PMIx tarball code. There is no reason we should be passing --with-platform arguments to any internal subdirectory, so just leave that out when constructing the opal_subdir_args variable.

Update the PMIx code and continue attempting to debug direct modex

Fix a problem in the ORTE PMIx server - there was an early intent to optimize the direct modex by fetching data for all procs from the target job on the remote node, instead of fetching the data one proc at a time. However, this was never completely implemented, and so we would hang if we had multiple overlapping requests for data from more than one proc on the node.

Update PMIx to v1.1.2
2015-12-12 18:46:38 -08:00
Ralph Castain
5e5adebf8e Port the changes from #782 to the master. Not everything applies here as the code in the 1.10 series is a little different. In addition, we asked for a few changes (e.g., using MPI_ERR_ARG instead of "13") that are incorporated here.
Thanks to @jsharpe for the PR
2015-12-12 12:40:34 -08:00
Ralph Castain
1db3db022a Don't be so prescriptive about the ess component to be used - we just need to protect against the proc incorrectly taking the singleton component, so rule that one out. Ensure that the other components understand that they are only for use by daemons. 2015-12-09 19:54:44 -08:00
Jeff Squyres
00c5dc9449 rml oob: C99-ification of structure member assignment 2015-12-08 17:05:16 -08:00
Howard Pritchard
cb7c26ce96 plm/slurm: add support for cray native slurm
Cray has added plugins to slurm to support
the Cray programming env (alpslli, cray pmi, etc).
Some of the workarounds needed with plm/alps
to avoid issues with Cray PMI getting mixed up
with orte launch system are also required in
a cray native slurm environment.

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2015-12-08 13:47:20 -06:00
Howard Pritchard
9548b8a9e8 plm/alps: add wlm detect infrastructure
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2015-12-07 07:43:20 -08:00
Ralph Castain
8823069fe9 Provide a mechanism by which a tool can request async progress thread support for ORTE 2015-12-04 08:26:57 -08:00
Mark Santcroos
3119bc14b2 Merge branch 'master' into fix/alpsinfov3 2015-11-13 08:53:06 -05:00
Ralph Castain
986a8c1d48 If an executable isn't found, it's possible for the state machine to hit the grpcomm with a zero-node map before we actually terminate with error. Silence the annoying malloc warning about zero-byte requests.
In a novm operation that only has the HNP, ensure the #nodes gets set

Clean up the error reporting
2015-11-11 14:24:13 -08:00
Jeff Squyres
8bd356549a orte proc_info.h: use symbolic names
This fix was actually applied in the v2.x branch first (as commit
open-mpi/ompi-release@a9b22afc1a).
2015-11-10 13:39:21 -08:00
Mark Santcroos
299fd69c6d Merge branch 'master' into fix/alpsinfov3 2015-11-10 15:40:19 -05:00
rhc54
474a869b8d Merge pull request #1121 from dmt4/orterun-manpage-typos
change -0bind-to and -bind-to to --bind-to in the manpages
2015-11-10 11:24:08 -08:00
Dimitar Pashov
9f6e306064 change -0bind-to and -bind-to to --bind-to in the manpages 2015-11-10 17:44:53 +00:00
Ralph Castain
6a607d42a6 Prevent a segfault on tools if a connection attempt fails - tools don't open the opal/pmix framework and thus have no way of looking up a proc hostname 2015-11-10 09:11:34 -08:00
Mark Santcroos
5ec2b4d98c Fix some messages in the process. 2015-11-09 18:03:26 -05:00
Mark Santcroos
8ec89001b3 Merge branch 'master' into fix/alpsinfov3 2015-11-09 02:45:23 -05:00
Ralph Castain
9b0cdc0de2 Add support for -pernode and -npernode options to orte-submit 2015-11-08 11:34:18 -08:00
Ralph Castain
f1483eb2dc Need to delay registration of the waitpid callback until after the fork/exec of the child process. Fix the bit testing of process type so that the proper state component gets selected for HNP. 2015-11-06 21:35:24 -08:00
Ralph Castain
5f446570d8 Work on cleaning up memory leaks that are causing orte-dvm to eventually run out of memory. Still don't have everything plugged, but getting better. Sync to the PMIx master that includes removal of the pmix_common.h.in file that really didn't need to be generated, and update to the PMIx_server_init API. 2015-11-06 14:15:30 -08:00
Mark Santcroos
a40b4eb2ee Support ALPS_APPINFO_VERSION 3. 2015-11-06 09:53:41 -05:00
Ralph Castain
ec0cc4bf21 Ensure that we completely register an nspace prior to launching local procs as otherwise we may attempt to send it down before it is registered, leading to data corruption 2015-11-05 20:51:56 -08:00
Ralph Castain
68996d6858 Move the argv_free back to the correct place - I blame Jeff for suggesting it was wrong to begin with 2015-11-05 07:57:54 -08:00
Ralph Castain
169c44258d Fix missing check 2015-11-03 19:00:28 -08:00
Ralph Castain
fe0c995f6b Fix a couple of minor issues identified by Jeff 2015-11-03 17:30:51 -08:00
Ralph Castain
186c18be0e Add missing cmd line options to mpirun man page, update NEWS to contain that change 2015-11-01 09:19:08 -08:00
Ralph Castain
0523f60479 Remove debug from orte-submit help output 2015-11-01 09:19:07 -08:00
rhc54
1fe27bf1dd Merge pull request #1084 from rhc54/topic/dashhost
Fix relative node syntax for dash-host option
2015-10-31 21:24:39 -07:00
Ralph Castain
8bfbe7f16c Add a new MCA parameter for default_dash_host to offer a mirror of the default_hostfile 2015-10-31 19:09:54 -07:00
Ralph Castain
24419b6523 Fix relative node syntax for dash-host option 2015-10-31 19:00:46 -07:00
rhc54
b23f1f3578 Merge pull request #1080 from federeghe/bugfixes
oob_tcp: fix peer->state wrong check
2015-10-31 16:09:23 -07:00
Ralph Castain
22dc05194e Minor cleanup - explicitly NULL the last member of a function pointer module. Should default to that anyway, but this is cosmetically nicer. 2015-10-30 08:19:55 -07:00
Federico Reghenzani
6536a6a9f5 oob_tcp: fix peer->state wrong check 2015-10-29 16:43:58 +01:00
Ralph Castain
267ca8fcd3 Cleanup the PMIx direct modex support. Add an MCA parameter pmix_base_async_modex that will cause the async modex to be used when set to 1. Default it to 0 for now to continue current default behavior.

Also add an MCA param pmix_base_collect_data to direct that the blocking fence shall return all data to each process. Obviously, this param has no effect if async_modex is used.
2015-10-27 17:31:56 -07:00
rhc54
3ffbf08283 Merge pull request #1068 from marksantcroos/master
Make odls debug message consistent.
2015-10-24 08:11:11 -07:00
Mark Santcroos
30aab75b86 Make message consistent. 2015-10-24 13:40:03 +02:00
Ralph Castain
6506b0a5e5 Resolve a race condition that prevented the sigchild callback from being registered before short-lived apps terminated
Thanks to Mark Santcroos for the assistance in tracking it down.
2015-10-23 21:02:31 -07:00
Nathan Hjelm
9602484568 Merge pull request #1040 from hjelmn/mtl_priority
Change how cm's priority is calculated
2015-10-19 14:18:36 -06:00
Nathan Hjelm
8b5810f7f7 mca/base: add priority output to mca_base_select
The mca_base_select function uses returned priorities to select the
best component/module. This priority may be of use to the caller so
pass that information back in an optional argument. If the priority is
not needed pass NULL.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-10-19 12:32:41 -06:00
Ralph Castain
363f62a506 Fix singleton operations when running under a SLURM allocation. Sadly, SLURM's PMI will return success even if the PMI server isn't actually available. This leads to erroneous selection of pmix and ess components. So add a further requirement (namely, that we see a job_step envar) to the SLURM pmix components along with some modification of ess selection code to avoid the problem 2015-10-17 20:24:03 -07:00
Jeff Squyres
62351f442a help: remove stale help messages and files
Found by contrib/check-help-strings.pl.
2015-10-13 16:50:20 -04:00
Jeff Squyres
f9e9b69d93 Merge pull request #1001 from igor-ivanov/master
orte/mca/rmaps: Improve orte_rmaps_dist_device help message
2015-10-09 14:07:47 -04:00
Igor Ivanov
489f27f8e9 orte/mca/rmaps: Improve orte_rmaps_dist_device help message
See: https://github.com/open-mpi/ompi/issues/953
2015-10-09 17:58:07 +03:00
rhc54
232f97a80c Merge pull request #968 from JohnWestlund/master
simplify use of sockaddr* structs to work around buffer overflow warning
2015-10-07 17:42:19 -07:00
Howard Pritchard
d899320574 odls/alps: close the directory
Close the /proc/self/fd dir after checking for open fds.

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2015-10-06 11:13:44 -07:00
Igor Ivanov
d379873443 oshmem: Add man.1 pages for oshmem tools
This change adds man pages for oshrun, oshcc and oshfort as well as the
deprecated shmemrun, shmemcc and shmemfort.
2015-10-05 15:41:28 +03:00
John Westlund
044fea8df7 re-order != comparison, OBJ_RELEASE mca_oob_tcp_addr_t on failure 2015-10-02 15:59:48 -07:00
John Westlund
6bfaa925ec simplify use of sockaddr* structs to work around buffer overflow warning 2015-10-02 14:26:52 -07:00
Ralph Castain
8f6855459d Cleanup some coverity warnings 2015-09-30 10:33:53 -07:00
Gilles Gouaillardet
0445484820 ras: remove orte_ras_proc_t and associated code 2015-09-30 08:52:52 +09:00
Gilles Gouaillardet
7cc14ee6f6 orte/rmaps: silence warning 2015-09-29 16:05:52 +09:00
Ralph Castain
fad5638596 Resolve the naming issue when direct-launched by PMIx-enabled RMs using a minimal-impact approach. Detect if we were launched via ORTE - if so, then use our standard methods for computing the jobid. If not, then just hash the nspace to create the jobid, and track the jobid <-> nspace correspondence down in the opal/mca/pmix/pmix1xx component. We then do the translation any time a function that passes process names is invoked. 2015-09-27 09:57:59 -07:00
Ralph Castain
0140ff048d Now that we have an "isolated" PLM component, we cannot just let rsh silently decline to run when it cannot find a launch agent - if we do, then we will -always- run on the local node. So if the user specifies a launch agent and we can't find it, then generate a pretty error message, report a fatal error back to the component select, and exit out.
This required modifying the mca_component_select function to actually check the return code on a component query - it was blissfully ignoring it.

Also do a little cleanup to avoid bombarding the user with multiple error messages.

Thanks to Patrick Begou for reporting the problem
2015-09-24 07:16:48 -07:00
Ralph Castain
749bd4e6fe Plug a few memory leaks identified by valgrind 2015-09-23 15:21:04 -07:00
Ralph Castain
f28448702a Eliminate malloc by utilizing /proc/self/fd - optimization 2015-09-22 07:24:54 -07:00
Ralph Castain
f872e99315 Fix orte-submit so it allows application procs to select the correct ess component. Protect orte_data_server from multiple calls to finalize. 2015-09-21 20:31:57 -07:00
Howard Pritchard
ef6cf50687 Merge pull request #917 from hppritcha/topic/alps_warning_swat
oob/alps: swat compiler warning
2015-09-21 16:17:30 -06:00
Howard Pritchard
8d7e759b85 oob/alps: swat compiler warning
swat some alps related compiler warnings when using --enable-picky

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2015-09-21 14:24:26 -07:00
Ralph Castain
92ae386a34 As Jeff proposed, change the check to looking for the filename's first character to be a digit 2015-09-21 08:22:58 -07:00
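
The descriptor-closing commits in this stretch of the log (walk /proc/self/fd, and treat an entry as a descriptor only if its name starts with a digit) can be sketched roughly as follows. This is a minimal illustration assuming a Linux-style /proc filesystem, not the actual odls code; `example_close_open_fds` is a hypothetical name.

```c
#include <ctype.h>
#include <dirent.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Close every open descriptor >= min_fd by walking /proc/self/fd.
 * Returns the number of descriptors closed, or -1 if the directory
 * cannot be opened (a caller would then fall back to looping over
 * all possible descriptor numbers).
 * (Hypothetical helper; assumes a Linux-style /proc.)
 */
static int example_close_open_fds(int min_fd)
{
    DIR *dir = opendir("/proc/self/fd");
    if (NULL == dir) {
        return -1;
    }

    int dir_fd = dirfd(dir);   /* the descriptor used for the walk itself */
    int closed = 0;
    struct dirent *entry;

    while (NULL != (entry = readdir(dir))) {
        /* Skip ".", ".." and anything else whose name does not start with a digit. */
        if (!isdigit((unsigned char)entry->d_name[0])) {
            continue;
        }
        int fd = (int)strtol(entry->d_name, NULL, 10);
        if (fd >= min_fd && fd != dir_fd) {
            if (0 == close(fd)) {
                ++closed;
            }
        }
    }

    /* Close the directory when done; this also releases dir_fd. */
    closedir(dir);
    return closed;
}
```

Reading the directory visits only descriptors that are actually open, and no list of descriptors has to be allocated along the way.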
rhc54
13def2a69b Merge pull request #911 from rhc54/topic/cleanup
Cleanup the odls "close file descriptor" commit to conform to OMPI co…
2015-09-20 07:01:39 -07:00
Howard Pritchard
1367a442b6 Merge pull request #910 from hppritcha/topic/odls_alps_use_907_stuff
odls/alps: do smarter close of fds in child
2015-09-20 07:37:55 -06:00
Ralph Castain
c167acc5a7 Cleanup the odls "close file descriptor" commit to conform to OMPI coding standards and remove memory leaks 2015-09-19 20:46:36 -07:00
Howard Pritchard
a31cc21bea odls/alps: do smarter close of fds in child
Use a modified variant of #907.  Thanks to plesn
for noticing this.

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2015-09-19 14:17:05 -07:00
Piotr Lesnicki
1dd5487fae odls: close only used file descriptors at fork/exec 2015-09-18 16:44:57 +02:00
Ralph Castain
1b7930ad52 Silence some warnings and address Coverity issues 2015-09-16 07:58:22 -07:00
Ralph Castain
8b88ea9b13 Fix singletons by removing stale code 2015-09-16 00:58:05 -07:00
Ralph Castain
c1bbbb5e2f Remove the last involvement of the OOB system from the MPI layer, remove the no-longer-needed usock/oob component, and have procs no longer open the RML, OOB, ROUTED, and GRPCOMM frameworks as PMIx now provides all required app-mpirun cmds 2015-09-15 13:08:35 -07:00
Ralph Castain
22d7c0081a Fix the no-disconnect test by resolving a segfault on free - opal_dss.unload will return the remaining unpacked portion of a buffer. As such, it cannot return the pointer to that info as it might be partway inside of a malloc'd region. So copy the data out of the buffer. 2015-09-11 13:01:35 -07:00
Ralph Castain
dc5796b8a1 Revert "Revert "Fix the handling of cpusets so we get the correct cpuset for each local peer. Add the ability to indicate that a modex request is "optional" so we don't call the server if we don't find the value. Take advantage of that to allow the MPI layer to decide that the lack of locality info indicates non-local""
Fix the locality computation by correctly computing the vpid of the local peer

This reverts commit open-mpi/ompi@6a8fad49e5.
2015-09-11 08:29:51 -07:00
Ralph Castain
6a8fad49e5 Revert "Fix the handling of cpusets so we get the correct cpuset for each local peer. Add the ability to indicate that a modex request is "optional" so we don't call the server if we don't find the value. Take advantage of that to allow the MPI layer to decide that the lack of locality info indicates non-local"
This reverts commit f94f3cda21.
2015-09-11 02:01:25 -07:00
Ralph Castain
f94f3cda21 Fix the handling of cpusets so we get the correct cpuset for each local peer. Add the ability to indicate that a modex request is "optional" so we don't call the server if we don't find the value. Take advantage of that to allow the MPI layer to decide that the lack of locality info indicates non-local 2015-09-10 10:25:30 -07:00
rhc54
f6b6b9a9ca Merge pull request #877 from rhc54/topic/s1s2
Cleanup s1 and s2 components
2015-09-08 19:20:59 -07:00
Ralph Castain
1cdb86b8c7 Cleanup s1 and s2 components, and ensure that mpirun and orteds only use non-direct-launch pmix components. 2015-09-08 18:37:09 -07:00
Ralph Castain
459f169e06 Fix segfault upon job error
Silence some unnecessary error-logs
2015-09-08 14:03:06 -07:00
Jeff Squyres
bc9e5652ff whitespace: purge whitespace at end of lines
Generated by running "./contrib/whitespace-purge.sh".
2015-09-08 09:47:17 -07:00
Ralph Castain
e6add86e4f Deal with connect/accept between two jobs from different mpirun's. Somewhat optimize connect/accept by using MPI bcast to distribute the participants instead of another PMIx lookup. Cleanup some Coverity issues. 2015-09-07 09:19:24 -07:00
Ralph Castain
37c3ed68e7 Cleanup connect/disconnect and bring comm_spawn back online! 2015-09-06 10:27:39 -07:00
rhc54
665b30376a Merge pull request #868 from rhc54/topic/hwloc
Remove OPAL_HAVE_HWLOC qualifier and error out if --without-hwloc is given
2015-09-04 17:58:07 -07:00
Ralph Castain
d97bc29102 Remove OPAL_HAVE_HWLOC qualifier and error out if --without-hwloc is given 2015-09-04 16:54:40 -07:00
Ralph Castain
f6948c2bb4 Sync with PMIx master 43e45c3. Get multi-node publish/lookup/unpublish working 2015-09-04 10:07:17 -07:00
Ralph Castain
a772b46c15 Bring the MPI_Publish and friends online 2015-09-02 12:04:07 -07:00
Ralph Castain
38ba54366c Fix shared memory operations by resolving local peers 2015-08-30 12:07:14 -07:00
Ralph Castain
0d5814b5ca Cleanup Coverity issues 2015-08-29 21:19:27 -07:00
Ralph Castain
cf6137b530 Integrate PMIx 1.0 with OMPI.
Bring Slurm PMI-1 component online
Bring the s2 component online

Little cleanup - let the various PMIx modules set the process name during init, and then just raise it up to the ORTE level. Required as the different PMI environments all pass the jobid in different ways.

Bring the OMPI pubsub/pmi component online

Get comm_spawn working again

Ensure we always provide a cpuset, even if it is NULL

pmix/cray: adjust cray pmix component for pmix

Make changes so cray pmix can work within the integrated
ompi/pmix framework.

Bring singletons back online. Implement the comm_spawn operation using pmix - not tested yet

Cleanup comm_spawn - procs now starting, error in connect_accept

Complete integration
2015-08-29 16:04:10 -07:00
Ralph Castain
89c80b2294 Only start a listener for processes that will actually receive connection requests. Tools such as orte-submit always initiate connections and thus do not need to start a listener. 2015-08-27 16:41:00 -07:00
Nathan Hjelm
156ce6af21 periodic whitespace purge
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-08-24 09:32:33 -06:00
Ralph Castain
bc7815e178 Adjust the process type flags to remove confusion between orted and dvm state machines 2015-08-21 07:50:08 -07:00
Ralph Castain
5040f47ef3 Use the correct verbosity in an output_verbose 2015-08-13 22:33:25 -07:00
Ralph Castain
a2a049a612 Update test to match the one in MTT 2015-08-13 11:12:34 -07:00
Ralph Castain
0b1d4b62be Cleanup some cruft and update to coordinate with CM operations:
* don't pass --tree-spawn to the orted cmd line. If someone doesn't want tree-spawn, it shows up as an MCA param anyway
* ensure state/orted component disqualifies itself from CM operations
* clarify the DVM proc_type definitions
* ensure we stop littering the tmp dir with session directories
2015-08-12 10:32:14 -07:00
Jeff Squyres
31b329e585 odls default: ensure to initialize opts
This fixes CID 71127.
2015-08-12 05:27:37 -07:00
Howard Pritchard
8e7e4ca7f4 Merge pull request #780 from hppritcha/topic/plm_alps_minor_cleanup
plm/alps: remove unneeded env. variable setting
2015-08-07 15:03:45 -06:00
Jeff Squyres
09f7434491 ORTE: update for the new opal_progress_thread API 2015-08-07 10:13:40 -07:00
Howard Pritchard
1b55d14dff plm/alps: remove unneeded env. variable setting
In order to address issue #741, the orted's now are
always launched with the Cray PMI environment variables

PMI_NO_FORK
PMI_NO_PREINITIALIZE

set to disable running of the library's ctor.
So there's no longer a need to set these for the
application(s) being launched by the orted's.

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2015-08-05 13:27:18 -07:00
Ralph Castain
9bc384282a Fix an annoying segfault caused by incorrect indentation in a loop that causes the buffer to not be created prior to packing. 2015-08-01 10:01:47 -07:00
Ralph Castain
023936e84b Silence coverity warnings 2015-07-29 07:28:08 -07:00
Gilles Gouaillardet
429bdf1af7 oob/tcp: fix a race condition when finalizing the oob/tcp component 2015-07-28 09:16:13 +09:00
Ralph Castain
93f7a51275 Update the orte/system/opal_hotel test 2015-07-24 07:34:59 -07:00
Howard Pritchard
70096d3753 plm/alps: fix orted based launch failures.
Turns out that when one builds Open MPI with --disable-dlopen
for Cray, a whole bunch of cray specific libraries get linked
in to the orted executable.  One of these is Cray PMI.  The
Cray PMI has a ctor which, if run, causes job launches using
mpirun to fail.  This commit suppresses the running of the
ctor and thus prevents failure to launch.

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2015-07-23 15:07:57 -07:00
Jeff Squyres
60609cbb79 orte/test/system: fix compiler warnings
Note that the opal_hotel test still doesn't compile; it looks like it
needs to be updated to the new requirement to pass an event base.
2015-07-23 06:19:33 -07:00
Ralph Castain
4853457b93 The RML posted recvs are controlled by the async progress thread when in an application process. The call to finalize and close the RML is done from the main thread, and so we need to shift the actual destruct of the posted recv list to the async thread for handling or else we encounter a race condition when accessing the posted recvs.
Thanks to Gilles for providing the required debug info
2015-07-21 08:44:23 -07:00
Ralph Castain
219c4dfba5 Create a new opal_async_event_base and have the pmix/native and ORTE level use it. This reduces our thread count by one. 2015-07-12 08:23:34 -07:00
rhc54
bd91225cb5 Merge pull request #716 from rhc54/topic/alloc
Default allocated nodes to the UP state
2015-07-11 12:30:32 -07:00
Ralph Castain
2c896c5a2d Default allocated nodes to the UP state 2015-07-11 10:43:11 -07:00
Ralph Castain
683efcb850 Rename the current opal_event_base to opal_sync_event_base in preparation for adding an async progress thread to opal. No functional changes made here - just a simple rename. 2015-07-11 10:08:19 -07:00
rhc54
053d9b2a7c Merge pull request #713 from rhc54/topic/errhandler
Add an opal/errhandler so opal-level errors can be up-leveled
2015-07-11 07:58:57 -07:00
Ralph Castain
a2243dcddd Add an opal/errhandler so opal-level errors can be up-leveled 2015-07-11 07:09:11 -07:00
Ralph Castain
61fb067f14 Update the opal_hotel class to support a given event base instead of defaulting to using opal_event_base 2015-07-11 06:42:23 -07:00
rhc54
c6bb227073 Merge pull request #692 from rhc54/topic/mapper
Fix hetero operations. An error in the hwloc utilities only allocated…
2015-07-07 13:33:42 -07:00
Ralph Castain
ed93154e43 Fix hetero operations. An error in the hwloc utilities only allocated memory for the first display of a binding map, and then assumed that all nodes had the same number of cores in them. This resulted in memory corruption whenever someone displayed a binding pattern for a hetero cluster, and a smaller node was first in line. 2015-07-07 12:52:16 -07:00
rhc54
a4aff5e3d9 Merge pull request #691 from rhc54/topic/mapper
Add a bunch of debug, and correct an error that caused us to use the …
2015-07-07 11:08:01 -07:00
Ralph Castain
7455802a36 Add a bunch of debug, and correct an error that caused us to use the wrong mapping policy when determining the default binding policy 2015-07-07 10:13:10 -07:00
Gilles Gouaillardet
409874eb47 remove trigraph '??)' from comment
Fujitsu compilers issue way too many warnings because of this trigraph
2015-07-07 11:00:13 +09:00
Ralph Castain
eb582b8276 Minor whitespace cleanups 2015-07-06 09:38:33 -07:00
Ralph Castain
836f49597d There is no reason for tools to have an async progress thread as they can loop the event library themselves. This has the added benefit of causing the tool to "block" while waiting for events so they don't use cpu.
Also, fix orte-submit so it appropriately handles --help option
2015-07-05 10:45:28 -07:00
Ralph Castain
6829e192ad Okay, that's it - trash it 2015-07-01 05:27:30 -05:00
Ralph Castain
6cd3ccd305 Update the OMP support per request from IBM and LLNL 2015-06-30 10:24:34 -05:00
Ralph Castain
a58171a974 Add some debug 2015-06-29 14:51:41 -05:00
Ralph Castain
a4557d4ed2 Add new component to support OpenMP envars per request from IBM and LLNL 2015-06-27 17:57:04 -07:00
Ralph Castain
4352123c26 Protect the oob/tcp component from port scanners 2015-06-26 01:40:57 -07:00
Nathan Hjelm
ee36d813dc Merge pull request #657 from hjelmn/c99
more c99 updates
2015-06-25 11:21:09 -06:00
Nathan Hjelm
4d92c9989e more c99 updates
This commit does two things. It removes checks for C99 required
headers (stdlib.h, string.h, signal.h, etc). Additionally it removes
definitions for required C99 types (intptr_t, int64_t, int32_t, etc).

Signed-off-by: Nathan Hjelm <hjelmn@me.com>
2015-06-25 10:14:13 -06:00
Howard Pritchard
e49a37c034 ownership: update ownership files
per discussions at OMPI devel workshop

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2015-06-25 10:04:42 -06:00
Ralph Castain
014a6a5969 Initialize variable to make clang happy 2015-06-24 22:01:09 -07:00
Ralph Castain
869041f770 Purge whitespace from the repo 2015-06-23 20:59:57 -07:00
Ralph Castain
db3c59b943 Silence a warning by converting the bitmap to a string prior to printing the error 2015-06-23 11:49:11 -07:00
Ralph Castain
706884652f Silence Coverity warning about failing to check return code 2015-06-17 19:24:51 -07:00
Ralph Castain
869b2891c4 When doing comm-spawn, track the last object we bound to and ensure that we start the next job on the next object so we avoid overload situations when they aren't necessary 2015-06-17 09:20:08 -07:00
Gilles Gouaillardet
ec679b3fc2 orte/orted: fix misc memory leaks 2015-06-17 11:17:55 +09:00
Gilles Gouaillardet
b72e9288bc rmaps: fix a misc memory leak
as reported by Coverity with CID 1269887
2015-06-17 11:17:55 +09:00
Gilles Gouaillardet
27b4727fcf orte/orted: fix misc memory leak
as reported by Coverity with CID 743448
2015-06-17 11:17:55 +09:00
Gilles Gouaillardet
ac5921d7da orte/util: fix misc memory leak
as reported by Coverity with CID 1196738-1196739
2015-06-17 11:17:55 +09:00
Gilles Gouaillardet
e77d3057d6 orte-submit: fix a misc memory leak
as reported by Coverity with CID 710651
2015-06-17 11:17:54 +09:00
Gilles Gouaillardet
67638690ea orte/util: fix a misc memory leak
as reported by Coverity with CID 710652
2015-06-17 11:17:54 +09:00
Gilles Gouaillardet
a43abceb88 fix dfs misc memory leaks
as reported by Coverity with CIDs 739887, 747706, 1196707-1196709
2015-06-17 11:17:54 +09:00
rhc54
adbff46a13 Merge pull request #642 from rhc54/topic/hwloc
Update hwloc to 1.11.0
2015-06-13 12:09:58 -07:00
Ralph Castain
ff92781ec4 Replace hwloc191 with hwloc1110
Fix hwloc compile. Ignore LAMA mapper due to deprecated hwloc functions
2015-06-13 10:11:45 -07:00
Ralph Castain
cebdf0b7c0 Add missing include 2015-06-09 22:08:05 -07:00
Howard Pritchard
05325b113e odls/alps: fix busted build for cray.
This commit fixes things broken by commit
ea35e47.

Fixes #616

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2015-06-02 05:10:38 -07:00
Ralph Castain
6b93db6a9a Grrr...not sure how this slipped thru 2015-05-29 19:37:24 -07:00
Ralph Castain
bac308b184 Remove stale header 2015-05-29 19:24:51 -07:00
Ralph Castain
ea35e47228 Fat SMPs (i.e., systems with nodes containing large numbers of cpus) were failing to start due to connection failures of the opal/pmix support. Root cause was that (a) we were setting the client socket to non-blocking before calling connect, and (b) the server was using the event library to harvest the accepts, and also did the handshake while in that event. So the server would back up beyond the connection backlog limit, and we would fail.
Changing the client to leave its socket as blocking during the connect doesn't solve the problem by itself - you also have to introduce a sleep delay once the backlog is hit to avoid simply machine-gunning your way thru retries. This gets somewhat difficult to adjust as you don't want to unnecessarily prolong startup time.

We've solved this before by adding a listening thread that simply reaps accepts and shoves them into the event library for subsequent processing. This would resolve the problem, but meant yet another daemon-level thread. So I centralized the listening thread support and let multiple elements register listeners on it. Thus, each daemon now has a single listening thread that reaps accepts from multiple sources - for now, the orte/pmix server and the oob/usock support are using it. I'll add in the oob/tcp component later.

This still didn't fully resolve the SMP problem, especially on coprocessor cards (e.g., KNC). Removing the shared memory dstore support helped further improve the behavior - it looks like there is some kind of memory paging issue there that needs further understanding. Given that the shared memory support was about to be lost when I bring over the PMIx integration (until it is restored in that library), it seemed like a reasonable thing to just remove it at this point.
2015-05-29 14:37:14 -07:00
Nathan Hjelm
7db48c581d orte_quit: Remove logically dead code
CID 71993 Logically dead code (DEADCODE)

As indicated by coverity proc can not be NULL at any point after the
continue. Removed dead code.

CID 1269682 Unchecked return value (CHECKED_RETURN)

Check the return code of orte_get_attribute. I assume we still need to
check for a NULL proc in case the aborted proc attribute is set to
NULL. This might be better as an assert ().

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-05-26 12:16:12 -06:00
Ralph Castain
c21cd1c91e Ensure the ssh session is dead 2015-05-23 08:14:29 -07:00
Ralph Castain
920562d9b4 Ensure that all ssh sessions are terminated when abnormally terminating the job 2015-05-23 08:14:29 -07:00
Jeff Squyres
5e52ce26b5 help-errmgr-base.txt: remove trailing newline
Removed spurious newline at end of file so that the emitted help
message doesn't contain a blank line before the final "-----" output.
2015-05-23 03:33:23 -07:00
Ralph Castain
55cd2a07f6 Update exit code 2015-05-22 21:06:43 -07:00
Ralph Castain
3510bb4ced Set the exit code when a daemon fails 2015-05-22 21:05:23 -07:00
Ralph Castain
bc7a7f3de5 Fix abnormal shutdown when a node dies 2015-05-22 17:29:06 -07:00
Ralph Castain
96cd42699e Cleanup warnings for uninitialized vars and convert bare debug output to verbose 2015-05-21 07:41:26 -07:00
Jeff Squyres
3069daa015 oob_tcp_listener: slightly refactor EAGAIN/EWOULDBLOCK
Have only a single level of "if" conditionals.  Also, slightly change
the logic such that we only die/break out of the loop if we get EMFILE
-- all other errors are ok to go on to the next fd.

Finally, use a real show_help() message to warn when other errors occur.
2015-05-20 21:10:11 -04:00
Jeff Squyres
e43c8dc291 oob tcp: label a few #endif's
Only bother labeling the ones that are a little far away from their
corresponding #if statements.
2015-05-20 21:10:11 -04:00
Jeff Squyres
4b2f0d4827 oob tcp: reset MCA params from level 9
Set various MCA param levels
2015-05-20 21:10:11 -04:00
Jeff Squyres
1a4c9960e1 oob tcp: set KEEPALIVE timeout 60s, retry interval 5s
The timeout is the frequency at which to send keepalive pings; the retry
interval is how often to send successive pings once a keepalive has
not replied.

Also update comments and MCA param help strings.

60 seconds -- squashme
2015-05-20 21:08:37 -04:00
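
The keepalive settings named in the commit above (60 s timeout, 5 s retry interval) map naturally onto standard socket options. The sketch below is an assumption about how such values could be applied, not the oob/tcp implementation: `example_set_keepalive` is a hypothetical helper, and mapping the timeout to `TCP_KEEPIDLE` (or `TCP_KEEPALIVE` on platforms that use that name) and the retry interval to `TCP_KEEPINTVL` is this sketch's own reading of the commit message.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/*
 * Enable TCP keepalive on a connected (not listening) socket.
 * idle_secs:     idle time before the first keepalive probe is sent
 * interval_secs: delay between successive probes once one goes unanswered
 * Returns 0 on success, -1 if any setsockopt call fails.
 * (Hypothetical helper; the option mapping is an assumption.)
 */
static int example_set_keepalive(int sd, int idle_secs, int interval_secs)
{
    int on = 1;

    if (setsockopt(sd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0) {
        return -1;
    }

#if defined(TCP_KEEPIDLE)
    /* Linux and most BSDs */
    if (setsockopt(sd, IPPROTO_TCP, TCP_KEEPIDLE,
                   &idle_secs, sizeof(idle_secs)) < 0) {
        return -1;
    }
#elif defined(TCP_KEEPALIVE)
    /* Platforms that expose the idle time under this name instead */
    if (setsockopt(sd, IPPROTO_TCP, TCP_KEEPALIVE,
                   &idle_secs, sizeof(idle_secs)) < 0) {
        return -1;
    }
#else
    (void)idle_secs;       /* no per-socket idle setting available */
#endif

#if defined(TCP_KEEPINTVL)
    if (setsockopt(sd, IPPROTO_TCP, TCP_KEEPINTVL,
                   &interval_secs, sizeof(interval_secs)) < 0) {
        return -1;
    }
#else
    (void)interval_secs;   /* no per-socket interval setting available */
#endif

    return 0;
}
```

With the values from the commit message the call would be `example_set_keepalive(sd, 60, 5)`. The preprocessor guards matter because the option names differ between platforms.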
Jeff Squyres
c95215dfc2 oob_tcp: do not set KEEPALIVE on listening sockets 2015-05-20 17:28:45 -04:00
Jeff Squyres
32d81af35f oob tcp: re-enable keepalive option for Mac
Plus very minor #if/#endif reduction.
2015-05-20 17:28:45 -04:00
rhc54
95c40e64b9 Merge pull request #584 from nkogteva/oob_ud_stress_test
oob ud: fixed a bug that prevented working with the QoS framework
2015-05-20 09:56:08 -06:00