openmpi

Author	SHA1	Message	Date
Howard Pritchard	69200e6229	plm/alps: fix usage of cray wlm_detect methods Turns out there are some cases where the Cray wlm_detect_get_active may return NULL, in which case fallback to wlm_detect_get_default method is suggested. Make use of the fallback to avoid segfaults under some circumstances in the ALPS plm selection method. Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2016-03-22 11:40:56 -07:00
Ralph Castain	c146c4969b	Revert part of open-mpi/ompi@c1bbbb5e2f to restore the usock component, thus fixing show_help aggregation. Fixes #1467 Restore debugger attach operations Fixes #1225	2016-03-18 21:49:04 -07:00
Ralph Castain	8f410d7897	Revert one part of open-mpi/ompi@4d0cc27eb7	2016-03-18 07:23:30 -07:00
Ralph Castain	2970becd6b	Revert "Merge pull request #1451 from ggouaillardet/topic/orte_fork_wrapper_fullname" This reverts commit `efafd62d38`, reversing changes made to `a93b849f13`.	2016-03-18 07:18:36 -07:00
Ralph Castain	a67ff065ae	Silence coverity warnings	2016-03-16 08:43:16 -07:00
Nysal Jan K.A	f6e932c864	Fix memory corruption in orte-ps orte-ps ends up free'ing the same pointer multiple times	2016-03-15 16:03:31 +05:30
Ralph Castain	6d7ada9675	Silence Coverity warning	2016-03-14 09:42:43 -07:00
Gilles Gouaillardet	589924c4aa	odls/base: use the full app name when using an orte fork agent	2016-03-14 11:18:21 +09:00
Anandhi S Jayakumar	a31292abc7	fixes to ud for removing qos channel	2016-03-10 18:03:17 -08:00
Ralph Castain	a4c8e8c28a	Cleanup the proposed change: * qos framework is moving to the scon layer and is no longer required in ORTE * remove the rml/ftrm component as we now have multiple active components, and so the wrapper needs to be rethought * no need for separating the "base" from "API" module definition. The two are identical * move the "stub" functions into their own file for cleanliness * general cleanup to meet coding standards * cleanup some logic in the stubs	2016-03-10 13:14:17 -08:00
Jeff Squyres	48c650c47a	configury: minor updates to config summary output	2016-03-10 13:02:52 -08:00
Anandhi S Jayakumar	0188c3cf81	Adding commit for multiple plugin loading support in RML	2016-03-09 18:13:48 -08:00
Ralph Castain	f7257a8310	Modify singularity support per patch from Greg Kurtzer	2016-03-09 07:52:11 -08:00
Ralph Castain	f3ae30ff39	Fix singletons yet again...	2016-03-08 10:33:35 -08:00
Ralph Castain	d72c1c72ff	Do not push child processes into separate process groups so that any host RM can still "see" them, and ensure that any signal sent to the orted's themselves will be provided to all child processes. Forward all signals from mpirun to the child processes, removing the old MCA parameter required to turn that behavior "on".	2016-03-06 17:55:09 -08:00
Ralph Castain	4d0cc27eb7	Update the singularity support to match that of the latest singularity master. Remove the restriction on shared memory components by instructing singularity to not isolate the PID space. Add a new schizo API to allow setting up the original app_context. Ensure the container is installed prior to execution.	2016-03-05 21:47:42 -08:00
Ralph Castain	ce0a05d7d1	Minor cleanup - Singularity now has an internal check for installed, so we no longer need to do so.	2016-03-04 19:07:53 -08:00
Gilles Gouaillardet	80bdbfd9e7	add missing include file	2016-03-03 13:46:28 +09:00
Ralph Castain	4a55fba414	Fix registration of error handlers thru the pmix120 component. A thread-shift operation was hanging on the sync_event_base, which made it dependent on someone calling opal_progress. Unfortunately, a process in "sleep" or spinning outside the MPI library won't do that, and so we never complete errhandler registration.	2016-03-02 15:01:01 -08:00
Ralph Castain	f0680008d1	Add test file for singularity	2016-03-02 05:40:41 -08:00
Ralph Castain	06e811c5a6	Properly use the OPAL_MCA_PREFIX in orte_submit	2016-03-01 18:16:40 -08:00
Ralph Castain	1b81d90eaa	Minor cleanups required for orte-dvm operation	2016-03-01 18:12:53 -08:00
Ralph Castain	c9f7bb6751	Add the include file to all the schizo components	2016-03-01 13:18:23 -08:00
Ralph Castain	625083fe18	Add include file	2016-03-01 13:04:20 -08:00
Ralph Castain	011403c04a	Fix a number of issues, some of which have lingered for a long time: * provide a more reliable way of determining that a process is a singleton by leveraging the schizo framework. Add new components for slurm, alps, and orte to detect when we are in a managed environment, and if we have been launched by mpirun or a native launcher. Set the correct envars to control ess and pmix selection in each case. * change the relative priority of the pmix120 and pmix112 components to make pmix120 the default * fix singleton comm-spawn by correctly setting the num_apps field of the orte_job_t created by the daemon - this fixes a segfault in register_nspace on newly created daemons * ensure orterun doesn't propagate any ess or pmix directives in its environment * Cleanup a few valgrind issues and memory leaks * Fix a race condition that prevented the client from completing notification registrations (missing thread shift) * Ensure the schizo/alps component detects launch by mpirun	2016-03-01 06:53:00 -08:00
Ralph Castain	263b0c95a8	Fix a segfault that can occur when very short-lived, non-ORTE procs are run	2016-02-28 12:30:20 -08:00
Ralph Castain	cdb494566d	Provide an option to allow isolated singletons	2016-02-25 11:33:26 -06:00
Ralph Castain	e8d347d7bd	Add missing includes	2016-02-24 08:56:02 -06:00
Ralph Castain	77f800b7e8	Tools don't create the orte_job_data table, so don't remove jobs from it	2016-02-21 16:29:00 -08:00
Ralph Castain	64b7728f33	Fix typo - do not look at daemon job when considering completion of launch	2016-02-21 14:44:51 -08:00
Ralph Castain	d653cf2847	Convert the orte_job_data pointer array to a hash table so it doesn't grow forever as we run lots and lots of jobs in the persistent DVM.	2016-02-21 11:55:49 -08:00
Ralph Castain	309e23ab3a	Fix minor typo	2016-02-20 01:33:10 -08:00
Ralph Castain	0c72ba89b9	Cleanup the output-filename options so they work as expected. Have the remote nodes output locally to the files instead of sending it all back to the HNP. Fix Solaris issues by renaming struct field	2016-02-19 12:41:46 -08:00
rhc54	bfd4254a7b	Merge pull request #1382 from rhc54/topic/cleanup Cleanup some valgrind complaints about jumps with uninitialized values.	2016-02-18 17:29:37 -08:00
Nathan Hjelm	27e7b6e466	Merge pull request #1381 from hjelmn/ddt_colon_fix orterun: allow DDT if options contain :'s	2016-02-18 17:48:21 -07:00
Ralph Castain	6e68d758b9	Cleanup some valgrind complaints about jumps with uninitialized values. Fix a few IOF issues reported by Mark Santcroos when submitting jobs from tools. Add the ability to pass directives to the --output-filename option that tell ORTE to (a) not include the jobid in the path to the output files, and (b) not to copy the output to the tool (i.e., just store it in the files). ck Remove stale debug Fix a segfault if no subscribers are present	2016-02-18 16:30:37 -08:00
Nathan Hjelm	69de442136	orterun: allow DDT if options contain :'s There is a bug in MPMD detection that disables totalview if a : is found anywhere on the command line. This includes inside an argument option or MCA variable value. This commit changes the check to look for the string " : " instead of the character : which should eliminate the issue in most cases. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2016-02-18 16:56:08 -07:00
Ralph Castain	1748f44147	Stop a segfault that results in zombied processes by checking for NULL prior to object release	2016-02-18 13:48:41 -08:00
Ralph Castain	60a7bc2e50	Enable the PMIx notification callback system. This currently is only supported by the pmix120 component, which is not selected by default. All other components will ignore error registration requests, and thus do not support debugger attach when launched via mpirun. Note that direct launched applications will support such attachment, but may not do so in a scalable fashion. Fixes #1225	2016-02-18 09:29:12 -08:00
Nysal Jan K.A	cc9b1316a4	Make UD OOB memory registrations a multiple of page size If ibv_fork_init() has been invoked the pages are marked MADV_DONTFORK. If we only partially use a page, any data allocated on the remainder of the page will be inaccessible to the child process. Fixes open-mpi/ompi#1363	2016-02-17 22:19:49 -05:00
rhc54	dc4d3edc06	Merge pull request #1372 from rhc54/topic/sing Further enhance the support for Singularity containers.	2016-02-17 16:39:23 -08:00
Ralph Castain	8f9508cace	Further enhance the support for Singularity containers. Extend the "personality" command-line option to allow specifying both model (e.g., "ompi") and container (e.g., "singularity"), and add the necessary logic to support multiple options. Add a new pmix "isolated" component to handle singletons where no HNP is available since containers cannot launch the HNP.	2016-02-17 13:33:06 -08:00
Howard Pritchard	31841b4367	ras/alps: squelch common symbol warnings squelch a couple of warnings from the common symbols script. Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2016-02-17 13:27:29 -06:00
Ralph Castain	e0de4423ba	Remove debug	2016-02-16 20:58:53 -08:00
Ralph Castain	50431001a3	Modify the IOF subsystem to handle per-job directives for redirecting IO to files, tagging IO, and timestamping IO. Fix stdin reader	2016-02-16 18:54:38 -08:00
Mark Santcroos	14f0390b7d	Release child object when we are recording someone's relatives. (Thanks to Mark Santcroos!) Release routing list entries. (Thanks to Mark Santcroos!) Address some Coverity concerns	2016-02-15 20:50:42 -08:00
Ralph Castain	351070659e	Correct ordering when checking for privileged ports	2016-02-14 09:43:01 -08:00
rhc54	59cc1f0a96	Merge pull request #1357 from rhc54/topic/oob Protect against a non-privileged port connecting to us when we are running as root	2016-02-13 08:12:29 -08:00
Ralph Castain	06c3dfc052	Refactor the ORTE DVM code so that external codes can submit multiple jobs using only a single connection to the HNP. * Clean up the DVM so it continues to run even when applications error out and we would ordinarily abort the daemons. * Create a new errmgr component for the DVM to handle the differences. * Cleanup the DVM state component. * Add ORTE bindings directory and brief README * Pass a local tool index around to match jobs. * Pass the jobid on job completion. * Fix initialization logic. * Add framework for python wrapper. * Fix terminate-with-non-zero-exit behavior so it properly terminates only the indicated procs, notifies orte-submit, and orte-dvm continues executing. * Add some missing options to orte-dvm * Fix a bug in -host processing that caused us to ignore the #slots designator. Add a new attribute to indicate "do not expand the DVM" when submitting job spawn requests. * It actually makes no sense that we treat the termination of all children differently than terminating the children of a specific job - it only creates confusion over the difference in behavior. So terminate children the same way regardless. Extend the cmd_line utility to easily allow layering of command line definitions Catch up with ORTE interface change and make build more generic. Disable "fixed dvm" logic for now. Add another cmd_line function to merge a table of cmd line options with another one, reporting as errors any duplicate entries. Use this to allow orterun to reuse the orted_submit code Fix the "fixed_dvm" logic by ensuring we reset num_new_daemons to zero. Also ensure that the nidmap is sent with the first job so the downstream daemons get the node info. Remove a duplicate cmd line entry in orterun. Revise the DVM startup procedure to pass the nidmap only once, at the startup of the DVM. This reduces the overhead on each job launch and ensures that the nidmap doesn't get overwritten. Add new commands to get_orted_comm_cmd_str(). 
Move ORTE command line options to orte_globals.[ch]. Catch up with extra orte_submit_init parameter. Add example code. Add documentation. Bump version. The nidmap and routing data must be updated prior to propagating the xcast or else the xcast will fail. Fix the return code so it is something more expected when an error occurs. Ensure we get an error returned to us when we fail to launch for some reason. In this case, we will always get a launch_cb as we did indeed attempt to spawn it. The error code will be returned in the complete_cb. Fix the return code from orte_submit_job - it was returning the tracker index instead of "success". Take advantage of ORTE's pretty-print capabilities to provide a nice error output explaining why we failed to launch. Ensure we always get a launch_cb when we fail to launch, but no complete_cb as the job never launched. Extend the error reporting capability to job completion as well. Add index parameter to orte_submit_job(). Add orte_job_cancel and implement ORTE_DAEMON_TERMINATE_JOB_CMD. Factor out dvm termination. Parse the terminate option at tool level. Add error string for ORTE_ERR_JOB_CANCELLED. Add some safeguards. Cleanup and/of comments. Enable the return. Properly ORTE_DECLSPEC orte_submit_halt. Add orte_submit_halt and orte_submit_cancel to interface. Use the plm interface to terminate the job	2016-02-13 08:10:44 -08:00
Ralph Castain	233bd085ca	Protect against a non-privileged port connecting to us when we are running as root Don't close the listener socket upon error unless we are giving up Cleanup the incoming socket	2016-02-13 08:07:27 -08:00
Ralph Castain	aa9e5a1a27	Add support for Singularity containers, including a .m4 file for checking if Singularity is available and an orte/schizo component for setting the proper support if a container was given as the executable Cleanup the configury so we properly check for Singularity under the various typical use-cases Bring the Singularity support online. We have to turn "off" the sm BTL as it segfaults from inside the container - root cause remains unclear. Also turned "off" the various OPAL shmem components in case they are involved and someone else tries to use them. Happily, the vader BTL works just fine!	2016-02-13 04:40:22 -08:00
Gilles Gouaillardet	b55b9e6aee	sentinel: fix sentinel to proc_name conversion converting an opal_process_name_t means the loss of one bit, it was decided to restrict the local job id to 15 bits, so the useful information of an opal_process_name_t can fit in 63 bits.	2016-02-10 15:44:07 +09:00
Jeff Squyres	7850517215	brucks: rename the "brks" component to be "brucks" After hearing the 3rd person ask what "brks" stood for, I'm renaming this component to be "brucks" (because it uses a Bruck-based algorithm).	2016-02-09 13:17:11 -08:00
Ralph Castain	3fbad2e2bd	Transfer across the -host number of slots	2016-02-08 10:38:03 -08:00
Ralph Castain	68912d04a8	Fix the grpcomm operations at scale. Restore the direct component to be the default, and to execute a rollup collective. This may in fact be faster than the alternatives, and something appears broken at scale when using brks in particular. Turn off the rcd and brks components as they don't work at scale right now - they can be restored at some future point when someone can debug them. Adjust to Jeff's quibbles Fixes open-mpi/mpi#1215	2016-02-04 05:42:29 -08:00
Igor Ivanov	34d861dfe9	orte/oob: Fix issue #1301 Signed-off-by: Igor Ivanov <Igor.Ivanov@itseez.com>	2016-01-20 12:08:00 +02:00
Gilles Gouaillardet	7d6b75f3b2	orte_util_snprintf_jobid: return ORTE_SUCCESS or ORTE_ERROR	2016-01-18 09:44:33 +09:00
Ralph Castain	fc6b260146	Protect against PMIx-based requests that don't come thru the MPI comm_spawn interface	2016-01-16 13:36:06 -08:00
Ralph Castain	4dad5de8ff	Silence a couple of warnings - strncpy returns a char*, not an int	2016-01-16 09:44:52 -08:00
Jeff Squyres	60ffe713b8	common syms: whitelist bison-generated common symbols Bison generates some common symbols that we can't do anything about, so whitelist them.	2016-01-16 03:53:14 -08:00
Gilles Gouaillardet	1d38430e43	opal: replace opal_convert_jobid_to_string with opal_snprintf_jobid	2016-01-14 10:39:03 +09:00
Gilles Gouaillardet	4c43fb2a50	orte_rmaps_base_map_job: set OPAL_BIND_ALLOW_OVERLOAD when needed	2016-01-13 17:13:36 +09:00
Ralph Castain	332019b43a	Silence warning	2016-01-10 09:59:36 -08:00
Nathan Hjelm	fab1eca536	grpcomm: fix bugs in grpcomm algorithms This commit fixes multiple issues in the bruck's and recursive doubling grpcomm algorithms. The following changes are included: - Use the existing bitmap implementation instead of implementing a new one. There were bugs in the implementation that caused an overrun of the bitmap array. - Clean up the algorithms to eliminate errors. - Send as little extra data as possible in the bruck's algorithm. The changes were tested with various numbers of ortes varying from 1 to 4096. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2016-01-07 10:12:08 -07:00
Ralph Castain	f53d3c7a18	Silence warning	2015-12-30 10:16:58 -08:00
Ralph Castain	0a6b8d2c14	Correctly handle connection terminations during finalize so mpirun doesn't hang. Cleanup some corner cases in the error notification system	2015-12-30 07:16:43 -08:00
Ralph Castain	1cdc1c121c	Revert "Standardize the handling of shutdown in the OOB TCP component" This reverts commit open-mpi/ompi@12dccaa911.	2015-12-30 07:05:40 -08:00
Ralph Castain	12dccaa911	Standardize the handling of shutdown in the OOB TCP component	2015-12-29 07:57:22 -08:00
rhc54	5dfb7ac396	Merge pull request #1266 from ggouaillardet/topic/misc_pmix_fixes Topic/misc pmix fixes	2015-12-29 07:02:44 -08:00
Ralph Castain	810f2446b7	Add pmix120 component, update the error handling functions in the PMIx API. Update the configure logic for the new pmix120 component ckpt Get the pmix120 component to work - still not really registering or handling notifications, but infrastructure now operates Cleanup some of the symbol scopes, and provide a more comprehensive rename.h file. Will pretty it up later - let's see how this works Cleanup the rename files to use the pretty macros	2015-12-28 23:15:44 +09:00
Gilles Gouaillardet	352b05a552	rmaps: warn if oversubscribing when manually setting the number of hosts This is a port of the v1.10 series one-off open-mpi/ompi-release@8c5ce45ab6	2015-12-28 10:38:57 +09:00
Ralph Castain	8ab28cdc82	Fix a typo that causes segfaults on multi-node executions	2015-12-24 08:43:47 -08:00
rhc54	d7199dc75b	Merge pull request #1255 from annu13/fixup Fixup	2015-12-22 20:54:48 -08:00
annu13	43f44f31c1	moved code to job setup first before enabling comm	2015-12-22 14:30:59 -08:00
Howard Pritchard	39367ca0bf	plm/alps: only use srun for Native SLURM Turns out that the way the SLURM plm works is not compatible with the way MPI processes on Cray XC obtain RDMA credentials to use the high speed network. Unlike with ALPS, the mpirun process is on the first compute node in the job. With the current PLM launch system, mpirun (HNP daemon) launches the MPI ranks on that node rather than relying on srun. This will probably require a significant amount of effort to rework to support Native SLURM on Cray XC's. As a short term alternative, have the alps plm (which gets selected by default again on Cray systems regardless of the launch system) check whether or not srun or alps is being used on the system. If alps is not being used, print a helpful message for the user and abort the job launch. Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2015-12-22 11:03:42 -08:00
rhc54	d9cd451a16	Merge pull request #1250 from rhc54/topic/rf Fix the default slot mapping in rank file mapper	2015-12-21 10:57:52 -08:00
Ralph Castain	7cc5879bdd	Fix the default slot mapping in rank file mapper	2015-12-21 09:47:27 -08:00
Ralph Castain	94ffe10808	Do not override any external settings for PMIx component selection	2015-12-21 08:36:12 -08:00
Jeff Squyres	53ca721ff4	configury: clean up .so version numbers Move .so version numbers to their appropriate project in the top-level VERSION file. Also add the project name to all .so version number names. Remove no-longer-used .so names.	2015-12-18 12:50:23 -05:00
Ralph Castain	64b695669a	Cleanup warnings in opal and orte layers when building optimized on Mac	2015-12-17 07:51:24 -08:00
Ralph Castain	3a56f0d34b	Create the pmix external component. Fix a few places where opal/util/argv.h were required when building with an external pmix (go figure). NOTE: Building with external pmix requires that you also build with external libevent and hwloc libraries. Detect this at configure and error out with large message if this requirement is violated. Closes #1204 (replaces it) Fixes #1064	2015-12-15 15:26:13 -08:00
Howard Pritchard	7a82174747	Merge pull request #1195 from hppritcha/topic/wlm_detect support Cray nativized slurm environment	2015-12-15 07:58:53 -07:00
Jeff Squyres	3e308f41f7	rmaps base help: update binding error messages Due to user confusion, update the show-help messages displayed when processor and/or memory binding fails. Thanks to Dave Love (@loveshack) for the initial suggestion. Fixes open-mpi/ompi#1087	2015-12-14 13:02:41 -05:00
Ralph Castain	03eb1a80bf	Update the PMIx native component to release v1.1.1, with addition of one bug-fix commit beyond the official release Rename the pmix1xx component to pmix111 so it reflects the actual release it includes Resolve the problem of PMIx being passed a bogus --with-platform argument when configuring the PMIx tarball code. There is no reason we should be passing --with-platform arguments to any internal subdirectory, so just leave that out when constructing the opal_subdir_args variable. Update the PMIx code and continue attempting to debug direct modex Fix a problem in the ORTE PMIx server - there was an early intent to optimize the direct modex by fetching data for all procs from the target job on the remote node, instead of fetching the data one proc at a time. However, this was never completely implemented, and so we would hang if we had multiple overlapping requests for data from more than one proc on the node. Update PMIx to v1.1.2	2015-12-12 18:46:38 -08:00
Ralph Castain	5e5adebf8e	Port the changes from #782 to the master. Not everything applies here as the code in the 1.10 series is a little different. In addition, we asked for a few changes (e.g., using MPI_ERR_ARG instead of "13") that are incorporated here. Thanks to @jsharpe for the PR	2015-12-12 12:40:34 -08:00
Ralph Castain	1db3db022a	Don't be so prescriptive about the ess component to be used - we just need to protect against the proc incorrectly taking the singleton component, so rule that one out. Ensure that the other components understand that they are only for use by daemons.	2015-12-09 19:54:44 -08:00
Jeff Squyres	00c5dc9449	rml oob: C99-ification of structure member assignment	2015-12-08 17:05:16 -08:00
Howard Pritchard	cb7c26ce96	plm/slurm: add support for cray native slurm Cray has added plugins to slurm to support the Cray programming env (alpslli, cray pmi, etc). Some of the workarounds needed with plm/alps to avoid issues with Cray PMI getting mixed up with orte launch system are also required in a cray native slurm environment. Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2015-12-08 13:47:20 -06:00
Howard Pritchard	9548b8a9e8	plm/alps: add wlm detect infrastructure Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2015-12-07 07:43:20 -08:00
Ralph Castain	8823069fe9	Provide a mechanism by which a tool can request async progress thread support for ORTE	2015-12-04 08:26:57 -08:00
Mark Santcroos	3119bc14b2	Merge branch 'master' into fix/alpsinfov3	2015-11-13 08:53:06 -05:00
Ralph Castain	986a8c1d48	If an executable isn't found, it's possible for the state machine to hit the grpcomm with a zero-node map before we actually terminate with error. Silence the annoying malloc warning about zero-byte requests. In a novm operation that only has the HNP, ensure the #nodes gets set Clean up the error reporting	2015-11-11 14:24:13 -08:00
Jeff Squyres	8bd356549a	orte proc_info.h: use symbolic names This fix was actually applied in the v2.x branch first (as commit open-mpi/ompi-release@a9b22afc1a).	2015-11-10 13:39:21 -08:00
Mark Santcroos	299fd69c6d	Merge branch 'master' into fix/alpsinfov3	2015-11-10 15:40:19 -05:00
rhc54	474a869b8d	Merge pull request #1121 from dmt4/orterun-manpage-typos change -0bind-to and -bind-to to --bind-to in the manpages	2015-11-10 11:24:08 -08:00
Dimitar Pashov	9f6e306064	change -0bind-to and -bind-to to --bind-to in the manpages	2015-11-10 17:44:53 +00:00
Ralph Castain	6a607d42a6	Prevent a segfault on tools if a connection attempt fails - tools don't open the opal/pmix framework and thus have no way of looking up a proc hostname	2015-11-10 09:11:34 -08:00
Mark Santcroos	5ec2b4d98c	Fix some messages in the process.	2015-11-09 18:03:26 -05:00
Mark Santcroos	8ec89001b3	Merge branch 'master' into fix/alpsinfov3	2015-11-09 02:45:23 -05:00
Ralph Castain	9b0cdc0de2	Add support for -pernode and -npernode options to orte-submit	2015-11-08 11:34:18 -08:00
Ralph Castain	f1483eb2dc	Need to delay registration of the waitpid callback until after the fork/exec of the child process. Fix the bit testing of process type so that the proper state component gets selected for HNP.	2015-11-06 21:35:24 -08:00
Ralph Castain	5f446570d8	Work on cleaning up memory leaks that are causing orte-dvm to eventually run out of memory. Still don't have everything plugged, but getting better. Sync to the PMIx master that includes removal of the pmix_common.h.in file that really didn't need to be generated, and update to the PMIx_server_init API.	2015-11-06 14:15:30 -08:00
Mark Santcroos	a40b4eb2ee	Support ALPS_APPINFO_VERSION 3.	2015-11-06 09:53:41 -05:00
Ralph Castain	ec0cc4bf21	Ensure that we completely register an nspace prior to launching local procs as otherwise we may attempt to send it down before it is registered, leading to data corruption	2015-11-05 20:51:56 -08:00
Ralph Castain	68996d6858	Move the argv_free back to the correct place - I blame Jeff for suggesting it was wrong to begin with	2015-11-05 07:57:54 -08:00
Ralph Castain	169c44258d	Fix missing check	2015-11-03 19:00:28 -08:00
Ralph Castain	fe0c995f6b	Fix a couple of minor issues identified by Jeff	2015-11-03 17:30:51 -08:00
Ralph Castain	186c18be0e	Add missing cmd line options to mpirun man page, update NEWS to contain that change	2015-11-01 09:19:08 -08:00
Ralph Castain	0523f60479	Remove debug from orte-submit help output	2015-11-01 09:19:07 -08:00
rhc54	1fe27bf1dd	Merge pull request #1084 from rhc54/topic/dashhost Fix relative node syntax for dash-host option	2015-10-31 21:24:39 -07:00
Ralph Castain	8bfbe7f16c	Add a new MCA parameter for default_dash_host to offer a mirror of the default_hostfile	2015-10-31 19:09:54 -07:00
Ralph Castain	24419b6523	Fix relative node syntax for dash-host option	2015-10-31 19:00:46 -07:00
rhc54	b23f1f3578	Merge pull request #1080 from federeghe/bugfixes oob_tcp: fix peer->state wrong check	2015-10-31 16:09:23 -07:00
Ralph Castain	22dc05194e	Minor cleanup - explicitly NULL the last member of a function pointer module. Should default to that anyway, but this is cosmetically nicer.	2015-10-30 08:19:55 -07:00
Federico Reghenzani	6536a6a9f5	oob_tcp: fix peer->state wrong check	2015-10-29 16:43:58 +01:00
Ralph Castain	267ca8fcd3	Cleanup the PMIx direct modex support. Add an MCA parameter pmix_base_async_modex that will cause the async modex to be used when set to 1. Default it to 0 for now to continue current default behavior. Also add an MCA param pmix_base_collect_data to direct that the blocking fence shall return all data to each process. Obviously, this param has no effect if async_modex is used.	2015-10-27 17:31:56 -07:00
rhc54	3ffbf08283	Merge pull request #1068 from marksantcroos/master Make odls debug message consistent.	2015-10-24 08:11:11 -07:00
Mark Santcroos	30aab75b86	Make message consistent.	2015-10-24 13:40:03 +02:00
Ralph Castain	6506b0a5e5	Resolve a race condition that prevented the sigchild callback from being registered before short-lived apps terminated Thanks to Mark Santcroos for the assistance in tracking it down.	2015-10-23 21:02:31 -07:00
Nathan Hjelm	9602484568	Merge pull request #1040 from hjelmn/mtl_priority Change how cm's priority is calculated	2015-10-19 14:18:36 -06:00
Nathan Hjelm	8b5810f7f7	mca/base: add priority output to mca_base_select The mca_base_select function uses returned priorities to select the best component/module. This priority may be of use to the caller so pass that information back in an optional argument. If the priority is not needed pass NULL. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2015-10-19 12:32:41 -06:00
Ralph Castain	363f62a506	Fix singleton operations when running under a SLURM allocation. Sadly, SLURM's PMI will return success even if the PMI server isn't actually available. This leads to erroneous selection of pmix and ess components. So add a further requirement (namely, that we see a job_step envar) to the SLURM pmix components along with some modification of ess selection code to avoid the problem	2015-10-17 20:24:03 -07:00
Jeff Squyres	62351f442a	help: remove stale help messages and files Found by contrib/check-help-strings.pl.	2015-10-13 16:50:20 -04:00
Jeff Squyres	f9e9b69d93	Merge pull request #1001 from igor-ivanov/master orte/mca/rmaps: Improve orte_rmaps_dist_device help message	2015-10-09 14:07:47 -04:00
Igor Ivanov	489f27f8e9	orte/mca/rmaps: Improve orte_rmaps_dist_device help message See: https://github.com/open-mpi/ompi/issues/953	2015-10-09 17:58:07 +03:00
rhc54	232f97a80c	Merge pull request #968 from JohnWestlund/master simplify use of sockaddr* structs to work around buffer overflow warning	2015-10-07 17:42:19 -07:00
Howard Pritchard	d899320574	odls/alps: close the directory Close the /proc/self/fd dir after checking for open fds. Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2015-10-06 11:13:44 -07:00
Igor Ivanov	d379873443	oshmem: Add man.1 pages for oshmem tools This change adds man pages for oshrun, oshcc and oshfort as well as deprecated shmemrun, shmemcc and shmemfort.	2015-10-05 15:41:28 +03:00
John Westlund	044fea8df7	re-order != comparison, OBJ_RELEASE mca_oob_tcp_addr_t on failure	2015-10-02 15:59:48 -07:00
John Westlund	6bfaa925ec	simplify use of sockaddr* structs to work around buffer overflow warning	2015-10-02 14:26:52 -07:00
Ralph Castain	8f6855459d	Cleanup some coverity warnings	2015-09-30 10:33:53 -07:00
Gilles Gouaillardet	0445484820	ras: remove orte_ras_proc_t and associated code	2015-09-30 08:52:52 +09:00
Gilles Gouaillardet	7cc14ee6f6	orte/rmaps: silence warning	2015-09-29 16:05:52 +09:00
Ralph Castain	fad5638596	Resolve the naming issue when direct-launched by PMIx-enabled RMs using a minimal-impact approach. Detect if we were launched via ORTE - if so, then use our standard methods for computing the jobid. If not, then just hash the nspace to create the jobid, and track the jobid <-> nspace correspondence down in the opal/mca/pmix/pmix1xx component. We then do the translation any time a function that passes process names is invoked.	2015-09-27 09:57:59 -07:00
Ralph Castain	0140ff048d	Now that we have an "isolated" PLM component, we cannot just let rsh silently decline to run when it cannot find a launch agent - if we do, then we will -always- run on the local node. So if the user specifies a launch agent and we can't find it, then generate a pretty error message, report a fatal error back to the component select, and exit out. This required modifying the mca_component_select function to actually check the return code on a component query - it was blissfully ignoring it. Also do a little cleanup to avoid bombarding the user with multiple error messages. Thanks to Patrick Begou for reporting the problem	2015-09-24 07:16:48 -07:00
Ralph Castain	749bd4e6fe	Plug a few memory leaks identified by valgrind	2015-09-23 15:21:04 -07:00
Ralph Castain	f28448702a	Eliminate malloc by utilizing /proc/self/fd - optimization	2015-09-22 07:24:54 -07:00
Ralph Castain	f872e99315	Fix orte-submit so it allows application procs to select the correct ess component. Protect orte_data_server from multiple calls to finalize.	2015-09-21 20:31:57 -07:00
Howard Pritchard	ef6cf50687	Merge pull request #917 from hppritcha/topic/alps_warning_swat oob/alps: swat compiler warning	2015-09-21 16:17:30 -06:00
Howard Pritchard	8d7e759b85	oob/alps: swat compiler warning swat some alps related compiler warnings when using --enable-picky Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2015-09-21 14:24:26 -07:00
Ralph Castain	92ae386a34	As Jeff proposed, change the check to looking for the filename's first character to be a digit	2015-09-21 08:22:58 -07:00
rhc54	13def2a69b	Merge pull request #911 from rhc54/topic/cleanup Cleanup the odls "close file descriptor" commit to conform to OMPI co…	2015-09-20 07:01:39 -07:00
Howard Pritchard	1367a442b6	Merge pull request #910 from hppritcha/topic/odls_alps_use_907_stuff odls/alps: do smarter close of fds in child	2015-09-20 07:37:55 -06:00
Ralph Castain	c167acc5a7	Cleanup the odls "close file descriptor" commit to conform to OMPI coding standards and remove memory leaks	2015-09-19 20:46:36 -07:00
Howard Pritchard	a31cc21bea	odls/alps: do smarter close of fds in child Use a modified variant of #907. Thanks to plesn for noticing this. Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2015-09-19 14:17:05 -07:00
Piotr Lesnicki	1dd5487fae	odls: close only used file descriptors at fork/exec	2015-09-18 16:44:57 +02:00
Ralph Castain	1b7930ad52	Silence some warnings and address Coverity issues	2015-09-16 07:58:22 -07:00
Ralph Castain	8b88ea9b13	Fix singletons by removing stale code	2015-09-16 00:58:05 -07:00
Ralph Castain	c1bbbb5e2f	Remove the last involvement of the OOB system from the MPI layer, remove the no-longer-needed usock/oob component, and have procs no longer open the RML, OOB, ROUTED, and GRPCOMM frameworks as PMIx now provides all required app-mpirun cmds	2015-09-15 13:08:35 -07:00
Ralph Castain	22d7c0081a	Fix the no-disconnect test by resolving a segfault on free - opal_dss.unload will return the remaining unpacked portion of a buffer. As such, it cannot return the pointer to that info as it might be partway inside of a malloc'd region. So copy the data out of the buffer.	2015-09-11 13:01:35 -07:00
Ralph Castain	dc5796b8a1	Revert "Revert "Fix the handling of cpusets so we get the correct cpuset for each local peer. Add the ability to indicate that a modex request is "optional" so we don't call the server if we don't find the value. Take advantage of that to allow the MPI layer to decide that the lack of locality info indicates non-local"" Fix the locality computation by correctly computing the vpid of the local peer This reverts commit open-mpi/ompi@6a8fad49e5.	2015-09-11 08:29:51 -07:00
Ralph Castain	6a8fad49e5	Revert "Fix the handling of cpusets so we get the correct cpuset for each local peer. Add the ability to indicate that a modex request is "optional" so we don't call the server if we don't find the value. Take advantage of that to allow the MPI layer to decide that the lack of locality info indicates non-local" This reverts commit `f94f3cda21`.	2015-09-11 02:01:25 -07:00
Ralph Castain	f94f3cda21	Fix the handling of cpusets so we get the correct cpuset for each local peer. Add the ability to indicate that a modex request is "optional" so we don't call the server if we don't find the value. Take advantage of that to allow the MPI layer to decide that the lack of locality info indicates non-local	2015-09-10 10:25:30 -07:00
rhc54	f6b6b9a9ca	Merge pull request #877 from rhc54/topic/s1s2 Cleanup s1 and s2 components	2015-09-08 19:20:59 -07:00
Ralph Castain	1cdb86b8c7	Cleanup s1 and s2 components, and ensure that mpirun and orteds only use non-direct-launch pmix components.	2015-09-08 18:37:09 -07:00
Ralph Castain	459f169e06	Fix segfault upon job error Silence some unnecessary error-logs	2015-09-08 14:03:06 -07:00
Jeff Squyres	bc9e5652ff	whitespace: purge whitespace at end of lines Generated by running "./contrib/whitespace-purge.sh".	2015-09-08 09:47:17 -07:00
Ralph Castain	e6add86e4f	Deal with connect/accept between two jobs from different mpirun's. Somewhat optimize connect/accept by using MPI bcast to distribute the participants instead of another PMIx lookup. Cleanup some Coverity issues.	2015-09-07 09:19:24 -07:00
Ralph Castain	37c3ed68e7	Cleanup connect/disconnect and bring comm_spawn back online!	2015-09-06 10:27:39 -07:00
rhc54	665b30376a	Merge pull request #868 from rhc54/topic/hwloc Remove OPAL_HAVE_HWLOC qualifier and error out if --without-hwloc is given	2015-09-04 17:58:07 -07:00
Ralph Castain	d97bc29102	Remove OPAL_HAVE_HWLOC qualifier and error out if --without-hwloc is given	2015-09-04 16:54:40 -07:00
Ralph Castain	f6948c2bb4	Sync with PMIx master 43e45c3. Get multi-node publish/lookup/unpublish working	2015-09-04 10:07:17 -07:00
Ralph Castain	a772b46c15	Bring the MPI_Publish and friends online	2015-09-02 12:04:07 -07:00
Ralph Castain	38ba54366c	Fix shared memory operations by resolving local peers	2015-08-30 12:07:14 -07:00
Ralph Castain	0d5814b5ca	Cleanup Coverity issues	2015-08-29 21:19:27 -07:00
Ralph Castain	cf6137b530	Integrate PMIx 1.0 with OMPI. Bring Slurm PMI-1 component online Bring the s2 component online Little cleanup - let the various PMIx modules set the process name during init, and then just raise it up to the ORTE level. Required as the different PMI environments all pass the jobid in different ways. Bring the OMPI pubsub/pmi component online Get comm_spawn working again Ensure we always provide a cpuset, even if it is NULL pmix/cray: adjust cray pmix component for pmix Make changes so cray pmix can work within the integrated ompi/pmix framework. Bring singletons back online. Implement the comm_spawn operation using pmix - not tested yet Cleanup comm_spawn - procs now starting, error in connect_accept Complete integration	2015-08-29 16:04:10 -07:00
Ralph Castain	89c80b2294	Only start a listener for processes that will actually receive connection requests. Tools such as orte-submit always initiate connections and thus do not need to start a listener.	2015-08-27 16:41:00 -07:00
Nathan Hjelm	156ce6af21	periodic whitespace purge Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2015-08-24 09:32:33 -06:00
Ralph Castain	bc7815e178	Adjust the process type flags to remove confusion between orted and dvm state machines	2015-08-21 07:50:08 -07:00
Ralph Castain	5040f47ef3	Use the correct verbosity in an output_verbose	2015-08-13 22:33:25 -07:00
Ralph Castain	a2a049a612	Update test to match the one in MTT	2015-08-13 11:12:34 -07:00
Ralph Castain	0b1d4b62be	Cleanup some cruft and update to coordinate with CM operations: * don't pass --tree-spawn to the orted cmd line. If someone doesn't want tree-spawn, it shows up as an MCA param anyway * ensure state/orted component disqualifies itself from CM operations * clarify the DVM proc_type definitions * ensure we stop littering the tmp dir with session directories	2015-08-12 10:32:14 -07:00
Jeff Squyres	31b329e585	odls default: ensure to initialize opts This fixes CID 71127.	2015-08-12 05:27:37 -07:00
Howard Pritchard	8e7e4ca7f4	Merge pull request #780 from hppritcha/topic/plm_alps_minor_cleanup plm/alps: remove unneeded env. variable setting	2015-08-07 15:03:45 -06:00
Jeff Squyres	09f7434491	ORTE: update for the new opal_progress_thread API	2015-08-07 10:13:40 -07:00
Howard Pritchard	1b55d14dff	plm/alps: remove unneeded env. variable setting In order to address issue #741, the orted's now are always launched with the Cray PMI environment variables PMI_NO_FORK PMI_NO_PREINITIALIZE set to disable running of the library's ctor. So there's no longer a need to set these for the application(s) being launched by the orted's. Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2015-08-05 13:27:18 -07:00
Ralph Castain	9bc384282a	Fix an annoying segfault caused by incorrect indentation in a loop that causes the buffer to not be created prior to packing.	2015-08-01 10:01:47 -07:00
Ralph Castain	023936e84b	Silence coverity warnings	2015-07-29 07:28:08 -07:00
Gilles Gouaillardet	429bdf1af7	oob/tcp: fix a race condition when finalizing the oob/tcp component	2015-07-28 09:16:13 +09:00
Ralph Castain	93f7a51275	Update the orte/system/opal_hotel test	2015-07-24 07:34:59 -07:00
Howard Pritchard	70096d3753	plm/alps: fix orted based launch failures. Turns out that when one builds Open MPI with --disable-dlopen for Cray, a whole bunch of cray specific libraries get linked in to the orted executable. One of these is Cray PMI. The Cray PMI has a ctor which, if run, causes job launches using mpirun to fail. This commit suppresses the running of the ctor and thus prevents failure to launch. Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2015-07-23 15:07:57 -07:00
Jeff Squyres	60609cbb79	orte/test/system: fix compiler warnings Note that the opal_hotel test still doesn't compile; it looks like it needs to be updated to the new requirement to pass an event base.	2015-07-23 06:19:33 -07:00
Ralph Castain	4853457b93	The RML posted recvs are controlled by the async progress thread when in an application process. The call to finalize and close the RML is done from the main thread, and so we need to shift the actual destruct of the posted recv list to the async thread for handling or else we encounter a race condition when accessing the posted recvs. Thanks to Gilles for providing the required debug info	2015-07-21 08:44:23 -07:00
Ralph Castain	219c4dfba5	Create a new opal_async_event_base and have the pmix/native and ORTE level use it. This reduces our thread count by one.	2015-07-12 08:23:34 -07:00
rhc54	bd91225cb5	Merge pull request #716 from rhc54/topic/alloc Default allocated nodes to the UP state	2015-07-11 12:30:32 -07:00
Ralph Castain	2c896c5a2d	Default allocated nodes to the UP state	2015-07-11 10:43:11 -07:00
Ralph Castain	683efcb850	Rename the current opal_event_base to opal_sync_event_base in preparation for adding an async progress thread to opal. No functional changes made here - just a simple rename.	2015-07-11 10:08:19 -07:00
rhc54	053d9b2a7c	Merge pull request #713 from rhc54/topic/errhandler Add an opal/errhandler so opal-level errors can be up-leveled	2015-07-11 07:58:57 -07:00
Ralph Castain	a2243dcddd	Add an opal/errhandler so opal-level errors can be up-leveled	2015-07-11 07:09:11 -07:00
Ralph Castain	61fb067f14	Update the opal_hotel class to support a given event base instead of defaulting to using opal_event_base	2015-07-11 06:42:23 -07:00
rhc54	c6bb227073	Merge pull request #692 from rhc54/topic/mapper Fix hetero operations. An error in the hwloc utilities only allocated…	2015-07-07 13:33:42 -07:00
Ralph Castain	ed93154e43	Fix hetero operations. An error in the hwloc utilities only allocated memory for the first display of a binding map, and then assumed that all nodes had the same number of cores in them. This resulted in memory corruption whenever someone displayed a binding pattern for a hetero cluster, and a smaller node was first in line.	2015-07-07 12:52:16 -07:00
rhc54	a4aff5e3d9	Merge pull request #691 from rhc54/topic/mapper Add a bunch of debug, and correct an error that caused us to use the …	2015-07-07 11:08:01 -07:00
Ralph Castain	7455802a36	Add a bunch of debug, and correct an error that caused us to use the wrong mapping policy when determining the default binding policy	2015-07-07 10:13:10 -07:00
Gilles Gouaillardet	409874eb47	remove trigraph '??)' from comment Fujitsu compilers issue way too many warnings because of this trigraph	2015-07-07 11:00:13 +09:00
Ralph Castain	eb582b8276	Minor whitespace cleanups	2015-07-06 09:38:33 -07:00
Ralph Castain	836f49597d	There is no reason for tools to have an async progress thread as they can loop the event library themselves. This has the added benefit of causing the tool to "block" while waiting for events so they don't use cpu. Also, fix orte-submit so it appropriately handles --help option	2015-07-05 10:45:28 -07:00
Ralph Castain	6829e192ad	Okay, that's it - trash it	2015-07-01 05:27:30 -05:00
Ralph Castain	6cd3ccd305	Update the OMP support per request from IBM and LLNL	2015-06-30 10:24:34 -05:00
Ralph Castain	a58171a974	Add some debug	2015-06-29 14:51:41 -05:00
Ralph Castain	a4557d4ed2	Add new component to support OpenMP envars per request from IBM and LLNL	2015-06-27 17:57:04 -07:00
Ralph Castain	4352123c26	Protect the oob/tcp component from port scanners	2015-06-26 01:40:57 -07:00
Nathan Hjelm	ee36d813dc	Merge pull request #657 from hjelmn/c99 more c99 updates	2015-06-25 11:21:09 -06:00
Nathan Hjelm	4d92c9989e	more c99 updates This commit does two things. It removes checks for C99 required headers (stdlib.h, string.h, signal.h, etc). Additionally it removes definitions for required C99 types (intptr_t, int64_t, int32_t, etc). Signed-off-by: Nathan Hjelm <hjelmn@me.com>	2015-06-25 10:14:13 -06:00
Howard Pritchard	e49a37c034	ownership: update ownership files per discussions at OMPI devel workshop Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2015-06-25 10:04:42 -06:00
Ralph Castain	014a6a5969	Initialize variable to make clang happy	2015-06-24 22:01:09 -07:00
Ralph Castain	869041f770	Purge whitespace from the repo	2015-06-23 20:59:57 -07:00
Ralph Castain	db3c59b943	Silence a warning by converting the bitmap to a string prior to printing the error	2015-06-23 11:49:11 -07:00
Ralph Castain	706884652f	Silence Coverity warning about failing to check return code	2015-06-17 19:24:51 -07:00
Ralph Castain	869b2891c4	When doing comm-spawn, track the last object we bound to and ensure that we start the next job on the next object so we avoid overload situations when they aren't necessary	2015-06-17 09:20:08 -07:00
Gilles Gouaillardet	ec679b3fc2	orte/orted: fix misc memory leaks	2015-06-17 11:17:55 +09:00
Gilles Gouaillardet	b72e9288bc	rmaps: fix a misc memory leak as reported by Coverity with CID 1269887	2015-06-17 11:17:55 +09:00
Gilles Gouaillardet	27b4727fcf	orte/orted: fix misc memory leak as reported by Coverity with CID 743448	2015-06-17 11:17:55 +09:00
Gilles Gouaillardet	ac5921d7da	orte/util: fix misc memory leak as reported by Coverity with CID 1196738-1196739	2015-06-17 11:17:55 +09:00
Gilles Gouaillardet	e77d3057d6	orte-submit: fix a misc memory leak as reported by Coverity with CID 710651	2015-06-17 11:17:54 +09:00
Gilles Gouaillardet	67638690ea	orte/util: fix a misc memory leak as reported by Coverity with CID 710652	2015-06-17 11:17:54 +09:00
Gilles Gouaillardet	a43abceb88	fix dfs misc memory leaks as reported by Coverity with CIDs 739887, 747706, 1196707-1196709	2015-06-17 11:17:54 +09:00
rhc54	adbff46a13	Merge pull request #642 from rhc54/topic/hwloc Update hwloc to 1.11.0	2015-06-13 12:09:58 -07:00
Ralph Castain	ff92781ec4	Replace hwloc191 with hwloc1110 Fix hwloc compile. Ignore LAMA mapper due to deprecated hwloc functions	2015-06-13 10:11:45 -07:00
Ralph Castain	cebdf0b7c0	Add missing include	2015-06-09 22:08:05 -07:00
Howard Pritchard	05325b113e	odls/alps: fix busted build for cray. This commit fixes things broken by commit `ea35e47`. Fixes #616 Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2015-06-02 05:10:38 -07:00
Ralph Castain	6b93db6a9a	Grrr...not sure how this slipped thru	2015-05-29 19:37:24 -07:00
Ralph Castain	bac308b184	Remove stale header	2015-05-29 19:24:51 -07:00
Ralph Castain	ea35e47228	Fat SMPs (i.e., systems with nodes containing large numbers of cpus) were failing to start due to connection failures of the opal/pmix support. Root cause was that (a) we were setting the client socket to non-blocking before calling connect, and (b) the server was using the event library to harvest the accepts, and also did the handshake while in that event. So the server would back up beyond the connection backlog limit, and we would fail. Changing the client to leave its socket as blocking during the connect doesn't solve the problem by itself - you also have to introduce a sleep delay once the backlog is hit to avoid simply machine-gunning your way thru retries. This gets somewhat difficult to adjust as you don't want to unnecessarily prolong startup time. We've solved this before by adding a listening thread that simply reaps accepts and shoves them into the event library for subsequent processing. This would resolve the problem, but meant yet another daemon-level thread. So I centralized the listening thread support and let multiple elements register listeners on it. Thus, each daemon now has a single listening thread that reaps accepts from multiple sources - for now, the orte/pmix server and the oob/usock support are using it. I'll add in the oob/tcp component later. This still didn't fully resolve the SMP problem, especially on coprocessor cards (e.g., KNC). Removing the shared memory dstore support helped further improve the behavior - it looks like there is some kind of memory paging issue there that needs further understanding. Given that the shared memory support was about to be lost when I bring over the PMIx integration (until it is restored in that library), it seemed like a reasonable thing to just remove it at this point.	2015-05-29 14:37:14 -07:00
Nathan Hjelm	7db48c581d	orte_quit: Remove logically dead code CID 71993 Logically dead code (DEADCODE) As indicated by coverity proc can not be NULL at any point after the continue. Removed dead code. CID 1269682 Unchecked return value (CHECKED_RETURN) Check the return code of orte_get_attribute. I assume we still need to check for a NULL proc in case the aborted proc attribute is set to NULL. This might be better as an assert (). Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2015-05-26 12:16:12 -06:00
Ralph Castain	c21cd1c91e	Ensure the ssh session is dead	2015-05-23 08:14:29 -07:00
Ralph Castain	920562d9b4	Ensure that all ssh sessions are terminated when abnormally terminating the job	2015-05-23 08:14:29 -07:00
Jeff Squyres	5e52ce26b5	help-errmgr-base.txt: remove trailing newline Removed spurious newline at end of file so that the emitted help message doesn't contain a blank line before the final "-----" output.	2015-05-23 03:33:23 -07:00
Ralph Castain	55cd2a07f6	Update exit code	2015-05-22 21:06:43 -07:00
Ralph Castain	3510bb4ced	Set the exit code when a daemon fails	2015-05-22 21:05:23 -07:00
Ralph Castain	bc7a7f3de5	Fix abnormal shutdown when a node dies	2015-05-22 17:29:06 -07:00
Ralph Castain	96cd42699e	Cleanup warnings for uninitialized vars and convert bare debug output to verbose	2015-05-21 07:41:26 -07:00
Jeff Squyres	3069daa015	oob_tcp_listener: slightly refactor EAGAIN/EWOULDBLOCK Have only a single level of "if" conditionals. Also, slightly change the logic such that we only die/break out of the loop if we get EMFILE -- all other errors are ok to go on to the next fd. Finally, use a real show_help() message to warn when other errors occur.	2015-05-20 21:10:11 -04:00
Jeff Squyres	e43c8dc291	oob tcp: label a few #endif's Only bother labeling the ones that are a little far away from their corresponding #if statements.	2015-05-20 21:10:11 -04:00
Jeff Squyres	4b2f0d4827	oob tcp: reset MCA params from level 9 Set various MCA param levels	2015-05-20 21:10:11 -04:00
Jeff Squyres	1a4c9960e1	oob tcp: set KEEPALIVE timeout 60s, retry interval 5s The timeout is the frequency at which to send keepalive pings; the retry interval is how often to send successive pings once a keepalive has not replied. Also update comments and MCA param help strings. 60 seconds -- squashme	2015-05-20 21:08:37 -04:00
Jeff Squyres	c95215dfc2	oob_tcp: do not set KEEPALIVE on listening sockets	2015-05-20 17:28:45 -04:00
Jeff Squyres	32d81af35f	oob tcp: re-enable keepalive option for Mac Plus very minor #if/#endif reduction.	2015-05-20 17:28:45 -04:00
rhc54	95c40e64b9	Merge pull request #584 from nkogteva/oob_ud_stress_test oob ud: fixed a bug that prevented the work with QoS framework	2015-05-20 09:56:08 -06:00
Gilles Gouaillardet	dd28b1f680	orted/dfs: fix misc memory leaks as reported by Coverity with CIDs 739887, 747706, 1196707-1196709 and 1269849	2015-05-20 13:09:46 +09:00
Ralph Castain	d3d3e73099	Per request from George, use defined(__APPLE__) instead of OPAL_HAVE_MAC. Don't try to close a negative socket	2015-05-15 07:13:42 -06:00
Ralph Castain	0a345d34e6	Plug the memory leak identified by George	2015-05-14 21:33:48 -06:00
Howard Pritchard	578430c36d	oob/alps: remove comment with personal reference Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2015-05-14 20:06:21 -07:00
Ralph Castain	8e30579e6e	The Mac appears to have problems with the keepalive support - once keepalive starts, the memory footprint soars. So disable keepalive on the Mac	2015-05-14 18:09:13 -06:00
Nadezhda Kogteva	d9dcf8352e	oob ud: fixed a bug that prevented the work with QoS framework (oob_stress_channel test)	2015-05-13 11:40:01 +03:00
Jeff Squyres	8e8d104520	oob ud: ibv_get_device_list()==NULL can mean no devices present ...which is not an error. Don't complain about it.	2015-05-12 10:54:39 -07:00
Jeff Squyres	8f941a6613	oob ud: better error msgs, tolerate systems without UD devices It is perfectly ok to be on a system without UD devices. Also, make some of the error messages better -- so that the user has a clue about where the error messages are coming from, and what they should do.	2015-05-11 13:11:51 -07:00
Mike Dubman	894ba28390	Merge pull request #559 from nkogteva/oob_ud oob ud: made component more user adaptive; opal outputs were replaced by...	2015-05-11 21:09:28 +03:00
Ralph Castain	3cee4152fc	Fix the intercommunicator issue reported by Gilles. Instead of directly checking the reachability bitmap, ask the component if the proc is reachable when doing a send as the component is the final arbiter in such cases. Recirculate any messages that a daemon is trying to send to avoid race conditions. Cleanup listener sockets so we don't leak them	2015-05-11 09:16:25 -07:00
Howard Pritchard	3382d3ce61	ess/alps: remove unnecessary vpid calc There was a redundant computation of the vpid for orted's happening in ess/alps rte_init method. Keep the more efficient alps based method. Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2015-05-09 20:07:38 -07:00
Ralph Castain	b5382c9bf9	Rework the OOB selection logic to allow a component (e.g., usock) to direct that it be the sole active component. Remove prior disqualifying code in the oob/tcp component as it was too restrictive - if usock wasn't able to run, it left apps with no way to communicate to their daemon. Have the local daemon check the global modex for the RML URI info of the local procs so it can route messages between them when tcp is the primary channel. A few other minor cleanups included.	2015-05-08 11:15:21 -07:00
Ralph Castain	6e95bcd583	Fix typo in oob_tcp.c when IPV6 enabled. Cleanup a few other warnings, including a typo in coll_sm that prevented that component from registering its MCA params!	2015-05-07 21:05:08 -07:00
Gilles Gouaillardet	a80fda25d8	orte: rename the global variable component_map into orte_component_map Thanks @goodell for pointing this out!	2015-05-08 10:11:59 +09:00
Gilles Gouaillardet	2e384a3b65	initialize common symbols from orte A few uninitialized common symbols are remaining (generated by flex) : * orte/mca/rmaps/rank_file/rmaps_rank_file_lex.c: orte_rmaps_rank_file_leng * orte/mca/rmaps/rank_file/rmaps_rank_file_lex.c: orte_rmaps_rank_file_text * orte/util/hostfile/hostfile_lex.c: orte_util_hostfile_leng * orte/util/hostfile/hostfile_lex.c: orte_util_hostfile_text	2015-05-08 10:11:58 +09:00
Ralph Castain	9cb2fcfa5c	Cleanup the qos code when --enable-timings is given	2015-05-06 20:24:27 -07:00
Ralph Castain	01a9bdf4cf	Cleanup of ud/oob component	2015-05-06 19:48:42 -07:00
Ralph Castain	1f8de276de	Consolidate all the QOS changes into one clean commit	2015-05-06 19:48:42 -07:00
Ralph Castain	8e3f0b1d33	Ensure the --tree-spawn option is inside any parens from the sh and ksh shell support	2015-05-06 15:18:15 -07:00
Ralph Castain	0bb73645f0	Silence Coverity warning	2015-04-30 20:49:28 -07:00
Ralph Castain	7d1980ba83	Add the ability to specify the number of desired slots in the --host option. Just giving a host name => one slot (multiple copies of the name yield one slot per copy). Giving "foo:3" indicates you want three slots - a shorthand notation for saying "foo" three times. Giving "foo:*" indicates you want the topology to set the number of slots based on the orte_set_slots param.	2015-04-30 20:35:23 -07:00
Ralph Castain	e26e7ad736	Better support automated tests for map, rank, and bind options	2015-04-30 14:01:13 -07:00
Ralph Castain	7d4f9970d8	Minor cleanup	2015-04-29 17:49:35 -07:00
Nadezhda Kogteva	01ce58391e	oob ud: made component more user adaptive; opal outputs were replaced by help messages.	2015-04-28 15:36:32 +03:00
Jeff Squyres	8fbf34b196	oob ud: put call to ibv_fork_init() before all ibv calls Move the call to opal_common_verbs_fork_test() to up before the call to ibv_get_device_list() (just curious -- why not use opal_ibv_get_device_list()?). This ensures that the call to ibv_fork_init() is before all other ibv_* calls.	2015-04-24 14:19:06 -07:00
Ralph Castain	9104e81958	When --map-by node, we should be unbound. Also remove dead code due to copy/paste error.	2015-04-23 20:35:54 -07:00
Ralph Castain	5003be5c5c	If the user specifies a --map-by <foo> option, then default to bind-to <foo> unless they specify a bind-to option. If they map-by slot/node, then use the default policy based on num_procs.	2015-04-23 13:30:21 -07:00
Ralph Castain	d5e4fd059f	Ensure the binding and locale strings are always defined	2015-04-23 07:43:37 -07:00
Ralph Castain	cb7330a543	Get the output to lineup properly	2015-04-23 07:38:51 -07:00
Jeff Squyres	79243aca4e	display-devel-map: minor output tweak hwloc output can get fairly long, especially on machines with lots of cores and/or hyperthreads. So put the Locale and Binding output on separate lines.	2015-04-23 06:14:57 -07:00
Ralph Castain	58e646ccfd	Reduce confusion by having the devel-map display in the same format as report-bindings	2015-04-23 04:30:00 -07:00
Ralph Castain	43229d056e	Protect one more place from a NULL object	2015-04-20 18:45:57 -07:00
Jeff Squyres	11e8c2096b	plm rsh: assign some levels to the rsh PLM MCA params	2015-04-20 16:18:57 -07:00
Nathan Hjelm	359a282e7d	ess/singleton: MCA variable synonyms can not currently have NULL for both framework and component Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2015-04-20 16:50:52 -06:00
Ralph Castain	e8387fcf88	Protect tools that can never run in distributed mode from getting confused by PMI.	2015-04-20 15:42:57 -07:00
Nathan Hjelm	45e053dbce	orte: use C99 subobject naming for component initialization This commit helps future-proof orte components by initializing each component member by name. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2015-04-18 10:29:58 -06:00
Ralph Castain	34b53ac3dc	Silence Coverity warnings	2015-04-18 07:48:22 -07:00
Ralph Castain	12bfb27161	Redo in cleaner form: Per request from Andy Rieb, add ability to pass PATH and LD_LIBRARY_PATH elements to ssh command	2015-04-17 16:11:37 -07:00
Nadezhda Kogteva - nadezhda.kogteva@itseez.com	c2678b0cc9	oob ud: fixes and parameter adjustment	2015-04-17 16:22:43 +03:00
Nathan Hjelm	3436f2917d	Merge pull request #449 from hjelmn/mca_base_update mca/base update	2015-04-16 08:41:48 -06:00
Ralph Castain	d9c555b547	Revert "Per request from Andy Rieb, add ability to pass PATH and LD_LIBRARY_PATH elements to ssh command" This reverts commit open-mpi/ompi@278324c52a. Revert "Add the ability to pass args to the rsh/ssh command line" This reverts commit open-mpi/ompi@6f227f8564.	2015-04-16 08:03:14 -06:00
rhc54	79b9c50717	Merge pull request #535 from rhc54/topic/rsh Add the ability to pass args to the rsh/ssh command line	2015-04-15 21:11:46 -06:00
Ralph Castain	278324c52a	Per request from Andy Rieb, add ability to pass PATH and LD_LIBRARY_PATH elements to ssh command	2015-04-15 20:30:04 -06:00
Ralph Castain	0e23f76eee	Fix comment	2015-04-15 20:09:14 -06:00
Ralph Castain	6f227f8564	Add the ability to pass args to the rsh/ssh command line	2015-04-15 20:07:13 -06:00
Howard Pritchard	283ef4c05d	oob/config: if --with-verbs=no, no ud The oob/ud configure was not honoring the case if the ompi is configured with --with-verbs=no. This fixes that problem. Fixes #522 Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2015-04-14 06:31:18 -07:00
Nathan Hjelm	113c890ccf	Merge pull request #520 from hjelmn/valgrind_cleanness fix memory leaks and valgrind errors	2015-04-13 10:09:34 -06:00
Ralph Castain	9c6d452d6b	If we are using HT cpus and have <= 2 procs, then map-by hwthread by default	2015-04-11 21:18:05 -07:00
Ralph Castain	cd686057f6	If the HNP is on a coprocessor, record it so we don't get an error log later	2015-04-11 15:30:15 -07:00
Nathan Hjelm	a7b0c00ab6	fix memory leaks and valgrind errors This commit fixes several valgrind errors. Included: - installdirs did not correctly reinitialize all pointers to NULL at close. This causes valgrind errors on a subsequent call to opal_init_tool. - several opal strings were leaked by opal_deregister_params which was setting them to NULL instead of letting them be freed by the MCA variable system. - move opal_net_init to AFTER the variable system is initialized and opal's MCA variables have been registered. opal_net_init uses a variable registered by opal_register_params! - do not leak ompi_mpi_main_thread when it is allocated by MPI_T_init_thread. - do not overwrite ompi_mpi_main_thread if it is already set (by MPI_T_init_thread). - mca_base_var: read_files was overwriting mca_base_var_file_list even if it was non-NULL. - mca_base_var: set all file global variables to initial states on finalize. - btl/vader: decrement enumerator reference count to ensure that it is freed. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2015-04-11 09:28:35 -06:00
Ralph Castain	91e1cbf284	Init variable	2015-04-11 07:44:57 -07:00
Ralph Castain	033418f62a	Correct a typo that reversed the default binding pattern. Ensure we default bind to hwthread if user specified --use-hwthread-cpus if nprocs <= 2, and bind to hwthread if told to do so.	2015-04-10 15:58:35 -07:00
Ralph Castain	3e44d3c9e3	Enable singletons to run without any active OOB module until they attempt to comm_spawn	2015-04-10 14:06:42 -07:00
Ralph Castain	e4f6f83b9d	Attempt to silence new Coverity complaint by ensuring the string read from file is NULL terminated.	2015-04-10 07:54:37 -07:00
Ralph Castain	396700ad8b	Protect the notifier macros against NULL job objects	2015-04-09 16:04:43 -07:00
Nathan Hjelm	c416c423bb	ess/singleton: do not put component strings into the environment putenv requires that any string put into the environment is not changed or freed. That is not the case with constant strings as they will go away when dlclose is called on the component. Instead, just use opal_setenv which does not have this restriction. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2015-04-09 11:00:47 -06:00
Nathan Hjelm	9cd955badf	opal: fix multiple bugs in MCA and opal This commit fixes the following bugs: - opal_output_finalize did not properly set internal state. This caused problems when calling the sequence opal_output_init (), opal_output_finalize (), opal_output_init (). - opal_info support called mca_base_open () but never called the matching mca_base_close (). mca_base_open () and mca_base_close () have been updated to use an open count instead of an open flag to allow mca_base_open to be called through multiple paths (as may be the case when MPI_T is in use). - orte_info support did not register opal variables. This can cause orte-info to not return opal variables. - opal_info, orte_info, and ompi_info support have been updated to use a register count. - When opening the dl framework the reference count was added to ensure the framework stuck around. The framework being closed prematurely was a bug in the MCA base that has since been corrected. The increment (and associated decrement) have been removed. - dl/dlopen did not set the value of mca_dl_dlopen_component.filename_suffixes_mca_storage on each call to register. Instead the value was set in the component structure. This caused the value to be lost when re-loading the component. Fixed by setting the default value in register. - Reset shmem framework state on close to avoid returning a stale component after reloading opal/shmem. - MCA base parameters were not properly deregistered when the MCA base was closed. This commit may fix #374. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2015-04-07 19:13:20 -06:00
Ralph Castain	0c043dbdc9	Fix typo in var name	2015-04-02 02:32:42 -07:00
Ralph Castain	a4b466efc4	Support attempts to connect async processes by allowing the oob/tcp connection to retry the attempt to connect to a peer. Off by default, operates if someone specifies how long to wait between retry attempts.	2015-04-01 20:21:23 -07:00
Ralph Castain	9f8ae59162	Properly enclose the different && clauses	2015-04-01 18:48:25 -07:00
Ralph Castain	57c21d5209	Ensure the DVM flows thru the "daemons reported" state	2015-04-01 16:47:34 -07:00
Jeff Squyres	99754afd25	orterun.c: re-justify the output message text The type-A personality / English lit major in me compels me to re-justify the text. :-)	2015-04-01 10:57:23 -07:00
Mike Dubman	8914a9c070	Merge pull request #494 from elenash/modifiers changed mindist mapping policy specifier	2015-04-01 16:31:46 +03:00
Elena	1e913c76c4	changed mindist mapping policy specifier from --map-by dist:device,modifiers to --map-by dist:modifiers -mca rmaps_dist_device device	2015-04-01 15:07:35 +03:00
Nadezhda Kogteva	2d49d9bd45	grpcomm rcd: remove unnecessary malloc warning for case when number of daemons == 1	2015-04-01 11:07:44 +03:00
Mike Dubman	58d002098b	Merge pull request #474 from elenash/master Introduce -tune command line option to set env vars and mca params from ...	2015-04-01 08:23:34 +03:00
Ralph Castain	b468f6a503	Okay, Jeff - use opal_setenv	2015-03-31 20:34:02 -07:00
Ralph Castain	6f9140a341	Add a little more debug to launch	2015-03-31 20:10:21 -07:00
Ralph Castain	e5d96417e7	Update warnings for run-as-root	2015-03-31 17:55:28 -07:00
Ralph Castain	41dd65d6cd	Per Jeff's request, tone down the comments and "standardize" the warning	2015-03-31 17:54:54 -07:00
Ralph Castain	f04eb6a9c0	Extend the root-user protection to some more ORTE tools	2015-03-31 10:34:35 -07:00
Ralph Castain	f863147b05	Per the telecon and chat with Jeff, let root only do the version option without warning. Otherwise, require that the user specifically indicate allow-use-as-root	2015-03-31 10:34:35 -07:00
Ralph Castain	b209c9efa5	Move the "dvm ready" message to stdout so it is easier to trap	2015-03-30 20:12:56 -07:00
Ralph Castain	6d205a3c80	Ensure that singletons pickup the oob/tcp component	2015-03-30 18:10:08 -07:00
Ralph Castain	2fa56fb329	Ensure that orte-submit picks the correct ess module as it is -never- allowed to be used as a distributed tool Thanks to Mark Santcroos for diagnosing this one.	2015-03-30 18:08:34 -07:00
rhc54	bc016617a0	Merge pull request #501 from rhc54/topic/sec2 Support authentication across security domains	2015-03-30 09:59:43 -07:00
Nadezhda Kogteva	a828eada98	sm dstore: set pmix segment size to proper value	2015-03-30 13:34:25 +03:00
Ralph Castain	d07dc362d5	Ensure we can authenticate when crossing security domains by including all available credentials, and letting the receiver use the highest priority one they have in common.	2015-03-28 20:34:26 -07:00
Ralph Castain	b67b3619fc	If we are using the default bindings, and one or more nodes are not set up to support binding, then don't error out - just don't bind. Thanks to Annu Desari for pointing out the problem.	2015-03-28 08:20:24 -07:00
Ralph Castain	2f365720b0	Allow root to request the version and help from mpirun without having to override the run-as-root protection. Thanks to Robert McLay for pointing this out	2015-03-28 08:17:44 -07:00
Ralph Castain	d2d02a1642	ckpt	2015-03-28 07:59:20 -07:00
Nathan Hjelm	b68d66bb9b	MCA: Add the project/project version to the MCA base component This commit adds support for project_framework_component_* parameter matching. This is the first step in allowing the same framework name in multiple projects. This change also bumps the MCA component version to 2.1.0. All master frameworks have been updated to use the new component versioning macro. An mca.h has been added to each project to add a project specific versioning macro of the form PROJECT_MCA_VERSION_2_1_0. Signed-off-by: Nathan Hjelm <hjelmn@me.com>	2015-03-27 10:59:04 -06:00
Elena	90f5b2bb84	Introduce -tune command line option to set env vars and mca params from file	2015-03-26 18:33:53 +02:00
rhc54	2ff7575dde	Merge pull request #497 from rhc54/topic/sec Allow for different security domains.	2015-03-25 21:01:29 -07:00
Ralph Castain	6aa33deafb	Remove debug	2015-03-25 19:58:51 -07:00
Ralph Castain	10cf455080	Tools need to use the TCP OOB component	2015-03-25 19:56:49 -07:00
Ralph Castain	1b24536941	Allow for different security domains. Let the initiator of the connection determine the method to be used - if the receiver cannot support it, then that's an error that will cause the connection attempt to fail.	2015-03-25 13:22:01 -07:00
Ralph Castain	6ba76ed8d8	Per user request, we allow -host to specify a host that is not included in a hostfile (however, we reject it if we were given an allocation by a resource manager). Since we cannot know if an IP addr form references the same node that was previously given as a string name, we have no choice but to assume they are different. Get the topology from the right place in that situation so mpirun can succeed.	2015-03-25 06:16:01 -07:00
rhc54	df24816d64	Merge pull request #488 from lrrajesh/master Notification msg add severity to the message header.	2015-03-20 09:45:46 -07:00
Ralph Castain	095a8fa684	We don't need to know about non-fatal errors from setting socket options	2015-03-20 07:16:31 -07:00
Ralph Castain	a013f3059f	For scalability reasons, and to make life easier for the poor Cray-ites, don't bang on the system for the username - we'll just use the uid.	2015-03-19 21:24:13 -07:00
Howard Pritchard	990e9b47e0	Merge pull request #486 from hppritcha/topic/issue_484 orte/oob: implement alps oob component	2015-03-19 19:40:40 -06:00
Ralph Castain	43a3baad5e	Ensure we use the first compute node's topology for mapping Don't filter the topology by cpuset if you are mpirun until you know that no other compute nodes are involved. This deals with the corner case where mpirun is executing on a node of different topology from the compute nodes. Simplify - don't mandate that all cpus in the given cpuset be present on every node. We can then run everything thru the filter as before, which ensures that any procs run on mpirun are also contained within the specified cpuset. Correctly count the number of available PUs under each object when given a cpuset Fix the default binding settings, and correctly count PUs when no cpuset is given Ensure the binding policy gets set in all cases	2015-03-19 16:30:36 -07:00
Howard Pritchard	6054975913	oob/alps: add configure file for alps oob Have to have alps rpms installed on a system for alps component to build, even if separated by a level of indirection. Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2015-03-19 15:38:14 -07:00
Howard Pritchard	b1f31a4364	orte/oob: implement alps oob component Implement an almost-do-nothing alps oob component. When using aprun to launch a job on Cray system, there is no reason to need an oob system, since ompi relies on Cray PMI for oob communication. Fixes #484	2015-03-19 14:11:40 -07:00
lrrajesh	4dc75687e2	Notification msg add severity to the output	2015-03-18 13:55:03 -07:00
Nadezhda Kogteva	7c25b4cea6	grpcomm: fixed brks and rcd algorithms - added enough space for masks so that they work at large scale.	2015-03-18 14:33:04 +02:00
Ralph Castain	50277fec76	Adjust MCA param	2015-03-17 19:46:31 -07:00
rhc54	b41d2ad6c4	Merge pull request #481 from rhc54/topic/slurm Add new MCA parameter to support edge case with debugger at LLNL	2015-03-17 07:40:55 -07:00
Ralph Castain	b01e8c1063	Include the FQDN version and non-stripped version of the hostname in our list of aliases as these (plus localhost) are the most common aliases we see.	2015-03-17 06:26:26 -07:00
Ralph Castain	d7d8ae46ed	We no longer pass the RML URI for procs launched via mpirun as the daemon has no need for that info.	2015-03-17 06:10:20 -07:00
Ralph Castain	3e32c360c7	Add new MCA parameter to support edge case with debugger at LLNL	2015-03-16 20:04:05 -07:00
Ralph Castain	a0487e014c	Further reduce the RARP load by removing getaddrinfo for IPv6 connections. Correct typo when checking return on inet_pton. Don't consider the TCP component for apps that are launched via mpirun as it will never be used.	2015-03-16 19:42:05 -07:00
Ralph Castain	5ae42c816e	Attempt to reduce the RARP traffic during definition of allocations	2015-03-16 16:26:40 -07:00
Ralph Castain	64d11f170a	Adjust the default keepalive interval. Refactor the code when setting keepalive options	2015-03-16 12:32:58 -07:00
Ralph Castain	4ded049cbc	Modify MCA param description	2015-03-16 11:57:32 -07:00
Ralph Castain	019bba5caf	Clean up a bit - don't need to look up the protocol number if we just use the right define	2015-03-16 11:54:51 -07:00
Ralph Castain	69ac25bf55	Add support for TCP keepalive on inter-node sockets	2015-03-16 09:59:44 -07:00
adrianreber	714d9aa67e	Merge pull request #348 from adrianreber/topic/orte_cr_continue_like_restart Topic/orte cr continue like restart	2015-03-12 14:54:02 +01:00
Nathan Hjelm	695dcd5a28	oob/ud: fix compiler warning	2015-03-11 10:53:32 -06:00
Adrian Reber	c08e234af7	FT: fix compilation using --with-ft (5/5) Enabling the FT code breaks compilation (again). This series tries to fix the compiler errors. This is again only fixing the compiler errors without any warranty that the result might actually support FT again. With the changes introduced in the previous patches in this series some goto constructs for cleanup are no longer necessary and removed.	2015-03-11 14:23:33 +01:00
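
Commit 9cd955badf above describes replacing an open flag with an open count so that an open routine can be reached through several paths and torn down only by the last matching close. The following is a minimal C sketch of that general pattern, not the actual MCA base code: the names `demo_open_count`, `demo_resources_live`, `demo_base_open`, and `demo_base_close` are hypothetical and chosen only for illustration.

```c
#include <stdbool.h>

/* Hypothetical state; these are NOT the real MCA base symbols. */
static int  demo_open_count     = 0;     /* number of outstanding opens */
static bool demo_resources_live = false; /* stands in for real setup    */

/* Open: only the first caller performs the one-time setup. */
int demo_base_open(void)
{
    if (demo_open_count++ > 0) {
        return 0;               /* already opened through another path */
    }
    demo_resources_live = true; /* one-time initialization */
    return 0;
}

/* Close: only the last matching close performs the teardown. */
int demo_base_close(void)
{
    if (demo_open_count == 0) {
        return -1;              /* unbalanced close, nothing to do */
    }
    if (--demo_open_count > 0) {
        return 0;               /* other users still hold it open */
    }
    demo_resources_live = false; /* final teardown */
    return 0;
}
```

With a plain boolean flag, the first close would tear everything down even though a second caller still expects the resources to exist; with a count, two opens followed by one close leave `demo_resources_live` set to true.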

5362 commits