openmpi

Author	SHA1	Message	Date
Nathan Hjelm	fe1c6bd881	Merge pull request #2840 from hjelmn/event_fix verbs: remove extra event user increment/decrement operation	2017-01-26 07:30:24 -08:00
Ralph Castain	399de0738e	Cleanup launch Given that we only set OOB contact info from inside of events, or before we begin threaded operations (e.g., in the ess), allow set_contact_info to directly update the oob/base framework globals. Correct the nidmap regex decompression routine. Ensure that rank=1 daemon always sends back its topology as this is the most common use-case. Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-25 22:06:09 -08:00
Nathan Hjelm	9f28c0af39	verbs: remove extra event user increment/decrement operation Since the oob and connections systems do not work the same way they did in older versions of Open MPI these operations are no longer necessary. At best they do nothing and at worst they hurt performance by making us enter the event library more often in opal_progress(). Fixes #2839 Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2017-01-25 18:37:06 -07:00
Ralph Castain	2f4e87eae9	Have rank=1 daemon always send its topology back as this is the most common use-case Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-25 09:33:11 -08:00
Jeff Squyres	230bbc597d	plm base: make sure to assign "node" early enough Make sure to assign "node" before using it in ORTE_FLAG_SET. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2017-01-25 08:02:59 -08:00
Ralph Castain	184ccc8e91	Cleanup some code so it is clear that it is executing in an event. Ensure that peer event base is properly set on incoming connections Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-25 06:55:11 -08:00
Gilles Gouaillardet	ef10d3fd7b	orte: add missing include file Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-01-25 16:15:20 +09:00
Joshua Hursey	0e9a06d2c3	orte/iof: Add app stderr to stdout redirection at source * Add an MCA parameter to combine stdout and stderr at the source - `iof_base_redirect_app_stderr_to_stdout` * Aids in user debugging when using libraries that mix stderr with stdout Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>	2017-01-24 16:23:48 -06:00
Joshua Hursey	dcd9801f7c	orte/iof: Add orte_map_stddiag_to_stdout option * Similar to `orte_map_stddiag_to_stderr` except it redirects `stddiag` to `stdout` instead of `stderr`. * Add protection so that the user cannot supply both: - `orte_map_stddiag_to_stderr` - `orte_map_stddiag_to_stdout` Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>	2017-01-24 16:22:59 -06:00
Ralph Castain	ef86707fbe	Deprecate the --slot-list parameter in favor of --cpu-list. Remove the --cpu-set param (mark it as deprecated) and use --cpu-list instead as it was confusing having the two params. The --cpu-list param defines the cpus to be used by procs of this job, and the binding policy will be overlaid on top of it. Note: since the discovered cpus are filtered against this list, #slots will be set to the #cpus in the list if no slot values are given in a -host or -hostname specification. Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-24 13:33:22 -08:00
Ralph Castain	4e9364b9a4	Merge pull request #2794 from rhc54/topic/regs Next step in reducing launch time	2017-01-24 03:19:57 -08:00
Ralph Castain	86ab751c5e	Next step in reducing launch time: begin reducing the size of the launch message itself. Start by expressing the daemon map as a set of three regular expression strings. On an 8k cluster, this reduces the nidmap contribution from over 200kBytes to 21 bytes in size. Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-23 19:54:47 -08:00
Gilles Gouaillardet	0bdc594b2e	rml/base: plug a memory leak in orte_rml_API_recv_cancel() simply return when the orte event thread has gone Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-01-24 09:12:47 +09:00
Ralph Castain	a61f7bdb26	Merge pull request #2780 from rhc54/topic/conn Ensure we properly set the "shutting down" flag so connection drops by downstream peers are properly handled.	2017-01-23 06:40:28 -08:00
Ralph Castain	e7b12913b4	Ensure we properly set the "shutting down" flag so connection drops by downstream peers are properly handled. Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-23 04:00:24 -08:00
Nathan Hjelm	954a4b7be3	oob/base: fix num_threads registration type This commit fixes a bug in the registration of the num_threads MCA variable. The variable is of type int and was being registered as a boolean. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2017-01-22 14:02:34 -07:00
Ralph Castain	ac4fcd3f97	Ensure that oob/base level data is always accessed in the oob/base event thread. Make debruijn the default routed component Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-22 10:33:32 -08:00
Ralph Castain	6560617c04	Fix comm_spawn and orte-dvm by resetting all used "node mapped" flags after building the child list Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-22 05:55:53 -08:00
Ralph Castain	639cdd4f9d	Add missing flag set to ensure nodes do not get double-added to job map. Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-21 20:06:50 -08:00
Ralph Castain	be3ef77739	Improve packing efficiency by raising the initial buffer size and modifying the extension code. Flag if a job map has had its nodes added so we don't have to loop repeatedly to check it. Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-21 14:03:19 -08:00
Ralph Castain	466cbd4d29	Rework the threading in oob/tcp so that daemons (including mpirun) use multiple progress threads to get messages out to their children, and so that the oob/base uses a separate one to setup sends. This allows the daemon cmd processor to execute in parallel with relay of messages, which significantly reduces launch times at scale Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-21 13:26:19 -08:00
Ralph Castain	668421b6ec	Compress the xcast message if bigger than a defined size to further improve launch performance at scale Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-19 22:08:02 -08:00
Ralph Castain	1f46e48b94	Have mpirun and orteds activate the oob/tcp progress thread by default, leaving a way to turn it off via MCA param. Provide a method by which the add_procs command can be processed in parallel with relaying the cmd message to the next daemons down the tree. Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-19 18:52:58 -08:00
Ralph Castain	bb132f6d03	Merge pull request #2764 from rhc54/topic/dvm If a tool sees the HNP it is attached to die (thereby losing connecti…	2017-01-19 15:39:30 -08:00
Ralph Castain	ca50b31de1	Merge pull request #2762 from rhc54/topic/oobfast Speed-up the OOB/TCP communications by using writev instead of writing the header, and then separately write the body	2017-01-19 15:39:06 -08:00
Ralph Castain	19bb64cfb8	If a tool sees the HNP it is attached to die (thereby losing connection), then stop the event loop instead of going through the abort code path. This will allow the tool to cleanup before exiting Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-19 14:04:06 -08:00
Ralph Castain	e5f687f896	Speed-up the OOB/TCP communications by using writev instead of writing the header, and then separately write the body Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-19 13:03:44 -08:00
Ralph Castain	368684bd63	Revert `e9bc293` and try a different approach for scalably dealing with hetero clusters. Have each orted send back its topo "signature". If mpirun detects that this signature has not been seen before, then ask for that daemon to send back its full topology description. This allows the system to only get the topology once for each unique topo in the cluster. Cleanup a typo, and remove no longer needed MCA params for hetero nodes and hetero apps. Hetero nodes will always be automatically detected. We don't support a mix of 32 and 64 bit apps Modify the orte_node_t to use orte_topology_t instead of hwloc_topology_t, updating all the places that use it. Ensure that we properly update topology when we see a different one on a compute node. Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-18 10:22:15 -08:00
Ralph Castain	e9bc2934be	Add an MCA param "hnp_on_smgmt_node" that mpirun can use to tell the orteds to ignore its topology signature as mpirun is executing on a system mgmt node, and hence a different topology than the compute nodes Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-16 19:32:01 -08:00
Ralph Castain	74a285be83	Cancel the waitpid callback once the waitpid on a process has fired to avoid multiple notifications Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-16 14:32:02 -08:00
Ralph Castain	9e8c7d6295	Silence Coverity warning Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-15 07:51:37 -08:00
Ralph Castain	6b34cc67d6	Correct typo Fixes #2691 Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-15 07:48:31 -08:00
Ralph Castain	3a157f0496	One more time - we "push" IOF for stdout, stderr, and stddiag with separate calls. However, we were creating the sinks for all three of them each time, which caused them to leak. Create the sinks only once for each channel. Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-14 17:40:36 -08:00
Ralph Castain	b55c03255a	Strange - I had created a new IOF API "complete" for cleaning up at the end of jobs, but somehow the implementation is missing. It also appears that the orted's never actually cleaned up their job-related information. These things are fine for normal mpirun-based operations, but cause significant resource leaks for the DVM. Complete the implementation and seal the leaks Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-12 19:54:18 -08:00
Ralph Castain	0e2df3be3e	Missed one spot - plug fd leaks in orteds Fixes #2691 Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-12 13:45:46 -08:00
Ralph Castain	9ad02b5d13	Merge pull request #2718 from rhc54/topic/leaks Don't remove the IOF framework's tracking info for a proc until the state machine tells it to do so.	2017-01-12 09:57:17 -08:00
Nathan Hjelm	110840fc87	ess/hnp: add support for forwarding additional signals (#2712) * ess/hnp: add support for forwarding additional signals This commit adds support to the hnp ess module to forward additional signals beyond the default SIGUSR1, SIGUSR2, SIGTSTP, and SIGCONT. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov> * Generalize this a bit to allow a broader range of signals to be forwarded. Turns out that SIGURG is now a "standard" signal, though the value differs across systems. So setup to forward it (and some friends) if they are defined. Allow users to provide the signal name (instead of the integer value) as the value of even the more common signals does vary across systems. Don't limit the number that can be supported. Signed-off-by: Ralph Castain <rhc@open-mpi.org> * ess/hnp: fix some bugs in the signal forwarding code This commit fixes two bugs: - signals_set needs to be set even if no signals are being forwarded. If it is not set we will SEGV in libevent if ess_hnp_forward_signals == none. - SIGTERM and SIGHUP are handled with a different type of handler. Do not allow the user to specify these to be forwarded. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov> * We are sure to get "dinged" if error messages aren't nicely output via show_help, so do so here Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-12 10:09:41 -07:00
Ralph Castain	fa419d3c0d	Don't remove the IOF framework's tracking info for a proc until the state machine tells it to do so. This plugs leaked file descriptors as we were losing track prior to destructing the resources. Fixes #2691 Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-12 08:34:29 -08:00
Ralph Castain	aff3a00059	Protect default mapping/binding options for cases where no NUMA or SOCKET objects exist - like VMs Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-11 09:44:44 -08:00
Ralph Castain	93e4935902	Be a tad more cautious before releasing objects when running in DVM mode Fixes #2700 Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-10 14:04:27 -08:00
Gilles Gouaillardet	44c1ff60f1	Merge pull request #2672 from ggouaillardet/topic/misc_memory_leaks Plug misc memory leaks	2017-01-10 13:16:04 +09:00
Joshua Ladd	3e23380bba	Merge pull request #2675 from artpol84/orte/state/exit_1_fix orte/odls: Fix ORTE state machine for the non-zero exit case	2017-01-09 12:32:37 -05:00
Joshua Ladd	7fc9f9bbac	Merge pull request #2620 from karasevb/fix_rmaps_mindist rmaps/mindist: fix pmix errors	2017-01-06 17:26:48 -05:00
Ralph Castain	684e69695f	Minor cleanups to eliminate warnings Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-06 08:44:10 -08:00
Artem Polyakov	3eb6c98542	orte/odls: Fix ORTE state machine for the non-zero exit case This commit fixes a rare race condition that occurs when the process that is calling `exit(-1)` has a delay between fd cleanup and actual OS-level exit. This may happen if the process has some work to do in `on_exit()`. Problem description: Consider an application process that has called `exit(nonzero)`: its fds were closed but its actual termination at the OS level is delayed by some cleanups (e.g. in callbacks registered via `on_exit()`). The observed sequence of events was the following: * orted gets the stdio disconnection and activates the `IOF COMPLETE` state. * a parallel OOB disconnection causes the `COMMUNICATION FAILURE` state to be activated. * during `COMMUNICATION FAILURE` processing `odls_base_default_wait_local_proc` is called even though the real waitpid wasn't yet called (the code mentions that waitpid might not be called for an unspecified reason). Because of that the real exit code is unknown and set to 0. The `odls_base_default_wait_local_proc` callback sees the `IOF COMPLETE` flag and, in conjunction with the 0 exit code, activates the `WAITPID FIRED` state. * processing of `WAITPID FIRED` leads to `NORMALLY TERMINATED` being activated. * the `NORMALLY TERMINATED` state in particular causes the `ORTE_PROC_FLAG_ALIVE` flag for this proc to be dropped. * when the application process finally exits, `wait_signal_callback` is launched. It sets the real exit code and calls `odls_base_default_wait_local_proc` again, but this time, since the process has its `ORTE_PROC_FLAG_ALIVE` flag dropped, the `WAITPID FIRED` state is activated (instead of `EXITED WITH NON-ZERO`), leading to the hang that was observed. Signed-off-by: Artem Polyakov <artpol84@gmail.com>	2017-01-06 11:12:55 +02:00
Gilles Gouaillardet	6b9343a966	plm/rsh: plug a memory leak Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-01-06 15:38:45 +09:00
Gilles Gouaillardet	8ba92d7516	iof/base: plug a memory leak in orte_iof_base_close() Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-01-06 15:38:45 +09:00
Gilles Gouaillardet	7fe6840232	state/hnp: plug a memory leak Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-01-06 15:38:45 +09:00
Gilles Gouaillardet	4d58b8dcae	ess/pmi: plug a memory leak Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-01-06 15:38:45 +09:00
Gilles Gouaillardet	c0c5dd8ccc	orte: plug a memory leak in orte_rml.recv_cancel do not invoke orte_rml.recv_cancel after the orte progress thread has gone Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-01-06 15:38:44 +09:00
Gilles Gouaillardet	17fac4bfd1	grpcomm/base: get rid of the seq_num field of the orte_grpcomm_signature_t struct Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-01-06 15:38:44 +09:00
Gilles Gouaillardet	fe25f50871	grpcomm/base: plug a memory leak on finalize manually allocate sequence numbers to be stored into the orte_grpcomm_base.sig_table hash table, and manually release them on orte_grpcomm_base_close() Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-01-06 15:38:44 +09:00
Gilles Gouaillardet	0ee5d56ab1	grpcomm/direct: plug a memory leak in barrier_release() Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-01-06 13:46:35 +09:00
Gilles Gouaillardet	f2d6584189	grpcomm/base: plug misc memory leaks - add a destructor to orte_grpcomm_caddy_t in order to plug a memory leak - plug a memory leak in barrier_release() Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-01-06 13:46:21 +09:00
Gilles Gouaillardet	58f2a764f9	ess/hnp: plug memory leaks Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-01-06 11:35:59 +09:00
Gilles Gouaillardet	24c61b0625	oob/tcp: plug a memory leak in mca_oob_tcp_component_lost_connection() Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-01-06 11:35:59 +09:00
Gilles Gouaillardet	c7d9e62d47	rml/base: plug a memory leak add a destructor to orte_rml_send_request_t in order to plug a memory leak Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-01-06 11:35:59 +09:00
Ralph Castain	6509f60929	Complete the memprobe support. This provides a new scaling tool called "mpi_memprobe" that samples the memory footprint of the local daemon and the client procs, and then reports the results. The output contains the footprint of the daemon on each node, plus the average footprint of the client procs on that node. Samples are taken after MPI_Init, and then again after MPI_Barrier. This allows the user to see memory consumption caused by add_procs, as well as any modex contribution from forming connections if pmix_base_async_modex is given. Using the probe simply involves executing it via mpirun, with however many copies you want per node. Example: $ mpirun -npernode 2 ./mpi_memprobe Sampling memory usage after MPI_Init Data for node rhc001 Daemon: 12.483398 Client: 6.514648 Data for node rhc002 Daemon: 11.865234 Client: 4.643555 Sampling memory usage after MPI_Barrier Data for node rhc001 Daemon: 12.520508 Client: 6.576660 Data for node rhc002 Daemon: 11.879883 Client: 4.703125 Note that the client value on node rhc001 is larger - this is where rank=0 is housed, and apparently it gets a larger footprint for some reason. Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-05 10:32:17 -08:00
Ralph Castain	9eab9a1ed3	Remove stale global variables Revamp the event notification integration to rely on the PMIx event chaining and remove the duplicate chaining in OPAL. This ensures we get system-level events that target non-default handlers. Restore the hostname entries for MPI-level error messages, but provide an MCA param (orte_hostname_cutoff) to remove them for large clusters where the memory footprint is problematic. Set the default at 1000 nodes in the job (not the allocation). Begin first cut at memory profiler Some minor cleanups of memprobe Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-02 14:04:24 -08:00
Ralph Castain	fe68f23099	Only instantiate the HWLOC topology in an MPI process if it actually will be used. There are only five places in the non-daemon code paths where opal_hwloc_topology is currently referenced: * shared memory BTLs (sm, smcuda). I have added a code path to those components that uses the location string instead of the topology itself, if available, thus avoiding instantiating the topology * openib BTL. This uses the distance matrix. At present, I haven't developed a method for replacing that reference. Thus, this component will instantiate the topology * usnic BTL. Uses the distance matrix. * treematch TOPO component. Does some complex tree-based algorithm, so it will instantiate the topology * ess base functions. If a process is direct launched and not bound at launch, this code attempts to bind it. Thus, procs in this scenario will instantiate the topology Note that instantiating the topology on complex chips such as KNL can consume megabytes of memory. Fix pernode binding policy Properly handle the unbound case Correct pointer usage Do not free static error messages! Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-29 10:33:29 -08:00
Ralph Castain	3a2d6a5ab6	Begin to reduce reliance of application procs on the topology tree itself by having the daemon provide more detailed info. In this case, provide the topology description string so that procs can readily determine the number of types of objects on the node, and a "locality" string that describes which objects this process is executing upon. The latter allows a process to compute the objects of overlap between itself and another proc without consulting the topology tree. Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-28 09:14:26 -08:00
Ralph Castain	7866bb1119	Add debug, cleanup cpus/rank Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-27 21:25:52 -08:00
Ralph Castain	1e4bffd937	Fix mapping directive checks Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-27 20:42:47 -08:00
Ralph Castain	791f4f1ce3	Adjust debug output for clarity Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-26 14:04:20 -08:00
Ralph Castain	ef3f748d0d	Transfer some minor cleanups back from the PMIx reference server Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-23 08:46:04 -08:00
Boris Karasev	5fb3e0a9b6	rmaps/mindist: fix pmix errors Fixed the case where only part of the nodes in the allocation are used by the application processes. Force PMIx nodemap key to only contain nodes that are actually used by the application processes. Signed-off-by: Boris Karasev <karasev.b@gmail.com>	2016-12-21 06:42:04 +02:00
Ralph Castain	ea133206ec	Sync the internal OMPI component to PMIx master Update external PMIx v2.x component Add missing Makefile Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-19 19:14:16 -08:00
Ralph Castain	256b5adac5	Transfer across final fixes from debugger attach work Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-19 00:34:27 -08:00
Ralph Castain	c6f6f40529	Transfer debugger support changes Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-17 18:14:46 -08:00
Ralph Castain	269753f5c1	Transfer back changes from debugger attach work Silence warning Remove debug Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-17 10:00:52 -08:00
Ralph Castain	215d6290e0	Add a flux component for LLNL Fine tuning of flux component Fix a few minor issues with the initial cut: * Job id could be obtained from the PMI kvsname like SLURM, but simpler to getenv (FLUX_JOB_ID) * Flux pmi-1 doesn't define PMI_BOOL, PMI_TRUE, PMI_FALSE * Flux pmi-1 maps the deprecated PMI_Get_kvs_domain_id() to PMI_KVS_Get_my_name() internally, so just call that instead. * Drop residual slurm references. Add wrappers for PMI functions so that if HAVE_FLUX_PMI_LIBRARY is not defined, the component can dlopen libpmi.so at location specified by the FLUX_PMI_LIBRARY_PATH env variable, which adds flexibility. If HAVE_FLUX_PMI_LIBRARY is defined, link with libpmi.so at build time in the usual way. Update configury for flux component Update m4 so the configure options work as follows: --with-flux-pmi Build Flux PMI support (default: yes) --with-flux-pmi-library Link Flux PMI support with PMI library at build time. Otherwise the library is opened at runtime at location specified by FLUX_PMI_LIBRARY_PATH environment variable. Use this option to enable Flux support when building statically or without dlopen support (default: no) If the latter option is provided, the library/header is located at build time using the pkg-config module 'flux-pmi'. Otherwise there is no library/header dependency. Handle the case where ompi is configured with --disable-dlopen or --enable-static. In those cases, don't build the component unless --with-flux-pmi-library is provided. It is fatal if the user explicitly requests --with-flux-pmi but it cannot be built (e.g. due to --disable-dlopen). Add a schizo/flux component Update schizo/flux component Eliminate slurm-specific usage cases. Since the module is only loaded if FLUX_JOB_ID is set, there are only two cases to handle: 1) App was launched indirectly through mpirun. This is not yet supported with Flux, but hook remains in case this mode is supported in the future. 
2) App was launched directly by Flux, with Flux providing CPU binding, if any. Fix up white space in pmix/flux component Drop non-blocking fence from pmix:flux component The flux PMI-1 library is not thread safe, therefore register a regular blocking fence callback instead of the thread-shifting fencenb(). pmix/flux component avoids extra PMI_KVS_Gets Keys stored into the base cache under the wildcard rank are not intended to be part of the global key namespace. These keys therefore should not trigger a PMI_KVS_Get() if they are not found in the cache. Minor pmix/flux component cleanup pmix/flux: drop code for fetching unused pmix_id pmix/flux: err_exit must return error Problem: in flux_init(), although 'ret' (variable holding err_exit return code) is initialized to OPAL_ERROR, the variable is reused as a temporary result code, so if there are some successes followed by a failure that doesn't set 'ret', flux_init() could return success with PMI not initialized. Ensure that a "goto err_exit" returns OPAL_ERROR if 'ret' is not set to some other error code. pmix/flux: don't mix OPAL_ and PMI_ return codes Problem: flux_init() can return both PMI_ and OPAL_ return codes. Although OPAL_SUCCESS and PMI_SUCCESS are both defined as 0, other codes are not compatible. Ensure that flux_init() consistently uses 'rc' for PMI_ return codes and 'ret' for OPAL_ return codes. pmix/flux: factor out repeated code for cache put Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-16 18:26:38 -08:00
Ralph Castain	2af677b1cf	Ensure that we don't bind-by-default in an oversubscribed condition Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-15 07:58:52 -08:00
Ralph Castain	884fb7fcf2	Update the PMIx2 support to include the latest shared memory optimizations Update ORTE support for dynamic PMIx operations e.g., PMIx_Spawn Update to track master Ensure that --disable-pmix-dstore actually disables the dstore. Sync to a few debugger updates Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-14 15:00:10 -08:00
Ralph Castain	9f69b0183f	Ensure jobs that fail always return a non-zero exit code. Thanks to Ashley Pittman for the report. Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-14 09:41:06 -08:00
rhc54	341ab683de	Merge pull request #2532 from rhc54/topic/pmixptl Update to latest PMIx master + PTL branch	2016-12-07 17:28:22 -08:00
Ralph Castain	e1aa7939ef	Correctly cleanup the local children and node map info on remote orteds upon job completion. Ensure that register_nspace only includes procs from that job in the proc map Thanks to Ashley Pittman for the report Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-07 13:53:00 -08:00
Gilles Gouaillardet	123036dbf8	ess/base: invoke orte_routed.update_routing_plan() earlier fix an issue that can be evidenced with two nodes n0$ mpirun --host n1:1 --mca oob_tcp_static_ipv4_ports 1234 -np 1 --mca routed radix --mca oob tcp true Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2016-12-07 17:19:25 +09:00
Ralph Castain	fbed2d794a	Update to latest PMIx master + PTL branch Update the usock component to disable it Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-06 20:47:44 -08:00
Ralph Castain	85a634926b	Update signal handling to introduce a pause between SIGCONT and SIGTERM, followed by another pause before SIGKILL. Do this within the odls/kill_local_procs function while we know we are blocked in an event, and before the daemon shuts down the event progress loop Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-06 12:34:42 -08:00
Ralph Castain	d8f262e39b	Resolve a duplicate symbol issue when the rml/ofi component is enabled Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-05 13:41:38 -08:00
Ralph Castain	79cde184ad	Allow a PMIx tool to spawn a job Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-03 16:00:47 -08:00
Ralph Castain	1a0bccb536	Now that PMIx has settled on its release strategy and numbering, update the OPAL pmix framework to track Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-02 15:44:43 -08:00
Ralph Castain	88313debc2	Per discussion on email thread, restore placement of child procs in their own process group so that any signal sent to one of our children is automatically propagated to any child process they might have spawned. Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-02 03:36:22 -08:00
Ralph Castain	dd491db21f	Fix IOF when outputting to files - the remote orteds were failing to output stdout/err from their procs. Silence a warning in orted_submit Protect against a free'd value in an error path when forming oob tcp connections Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-01 14:12:47 -08:00
Artem Polyakov	58300afff2	orte/oob/tcp: Plug the memory leak. Plug coverity defect CID 1396541. Signed-off-by: Artem Polyakov <artpol84@gmail.com>	2016-12-01 06:48:25 +07:00
Ralph Castain	47ed214458	Do not resend if max_retries is exceeded. Make a verbose output available to tell us where the intended message was to go. Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-11-29 19:21:16 -08:00
rhc54	d31f173744	Merge pull request #2476 from rhc54/topic/dbgupdate Bring forward the debugger-related changes	2016-11-29 19:10:32 -08:00
Ralph Castain	d5fd635efe	Bring forward the debugger-related changes Refs https://github.com/open-mpi/ompi/pull/2425 Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-11-29 13:15:20 -08:00
Ralph Castain	30ff8be9c9	Silence minor warnings Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-11-29 08:33:22 -08:00
Jeff Squyres	a6d390fe7b	Merge pull request #2461 from artpol84/oob/msg_drop orte/oob/tcp: Fix message dropping in case of concurrent connection.	2016-11-29 11:23:15 -05:00
Ralph Castain	f7699a7eeb	Silence warnings in a .opal_ignore'd component Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-11-28 13:18:25 -08:00
Artem Polyakov	ada93e0c02	orte/oob/tcp: Fix message dropping in case of concurrent connection. The problem was observed for direct modex used with the recursive doubling algorithm (used for collective ID calculation prior to d52a2d081e9598a9ac9a50fb4b013a6d2a72375b), which has a pairwise nature, so counter-connections are highly likely. The following scenario uncovered the issue: * ranks `x` and `y` want to communicate with each other, `x` < `y`; * rank `x` initiates the connection and sends the ack; * rank `y` starts to `connect()` and gets the ack from `x`; * `y` identifies that it has already started connecting and that `y` > `x`, so it rejects the incoming connection. * `x` sees that its connection was rejected in `mca_oob_tcp_peer_recv_connect_ack()` when trying to read the message header using `tcp_peer_recv_blocking()`, which calls `mca_oob_tcp_peer_close()`, which effectively flushes all the messages in the peer->send_queue. * `y` sends the ack to `x` and the connection is established; however, all the messages for the peer at `x` have vanished (except the front one in peer->send_msg). This commit introduces a "nack" function that will be used on the `y` side to tell `x` that `y` has the priority and `x`'s connection should be closed. This avoids "guessing" on the unexpectedly closed connection. Signed-off-by: Artem Polyakov <artpol84@gmail.com>	2016-11-27 04:58:34 +07:00
Howard Pritchard	2cbc0e8472	pmix/cray: fix disable-dlopen problem PR open-mpi/ompi#2432 introduced a regression where configure and build with --disable-dlopen caused build failure owing to unresolved alps lli symbols in the libopal-pal shared library. This commit fixes this problem. Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2016-11-21 13:45:10 -06:00
Ralph Castain	eb67c2fd44	Update OFI/rml component - still .opal_ignore'd Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-11-18 14:54:26 -08:00
Ralph Castain	9c6c2fa61d	Bring the v2.0.x debugger patch up to the master branch Ensure the personality gets set as specified by user, or defaults to "ompi" Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-11-18 12:45:45 -08:00
Ralph Castain	188880be3f	Since static ports are only used by ORTE if the runtime option is given, there is no need for a configure option as well - so remove the --enable-orte-static-ports configure option. When decoding the daemon nidmap, mark new daemons as ALIVE by default - we will discover dead ones as we go. Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-11-04 05:01:42 -07:00
Gilles Gouaillardet	da0c873e14	oob/tcp: enhance debugging output display the hop node used to send a message (if the message is sent directly, then the hop is the destination) Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2016-11-04 14:16:06 +09:00
Josh Hursey	b18598f6c7	Merge pull request #2329 from jjhursey/topic/short-hostname-lsf-fix ras/*: Fix !orte_keep_fqdn_hostnames for RAS components	2016-11-02 10:49:08 -05:00
Ralph Castain	435d771e76	Fix the radix routed component to correctly handle connected tools - in such cases, the route must be direct to the tool. Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-11-01 19:03:26 -07:00
Ralph Castain	64873487b4	Remove the max_connections parameter from the radix component as it is confusing. Modify PMIx client init so that it simply returns the nspace/rank if called by a server - this allows the server to retrieve its assigned ID. Register the server's nspace so client-side operations can succeed Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-11-01 12:17:11 -07:00
Joshua Hursey	ed5268a96a	ras/slurm: Fix !orte_keep_fqdn_hostnames for Slurm Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>	2016-11-01 13:21:30 -05:00
Joshua Hursey	5a4c52d9cb	ras/loadleveler: Fix !orte_keep_fqdn_hostnames for Loadleveler Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>	2016-11-01 13:21:30 -05:00
Joshua Hursey	8230201ad1	ras/gridengine: Fix !orte_keep_fqdn_hostnames for GridEngine Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>	2016-11-01 13:21:30 -05:00
Joshua Hursey	9643175e40	ras/tm: Fix !orte_keep_fqdn_hostnames for TORQUE Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>	2016-11-01 13:21:24 -05:00
Joshua Hursey	8d02a33639	ras/lsf: Fix !orte_keep_fqdn_hostnames for LSF * By default, make sure that we are using the short hostnames and not the fully qualified hostnames when running under LSF. * Related to commit open-mpi/ompi@d26dd2c20e Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>	2016-11-01 13:04:52 -05:00
rhc54	6074c2a2a9	Merge pull request #2322 from rhc54/topic/routed Update the routed components as we no longer need to init_routes.	2016-10-31 13:37:07 -07:00
Ralph Castain	b8c5d1ad88	Update the routed components as we no longer need to init_routes. Fixes case of direct launch via srun Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-10-31 12:38:13 -07:00
Jeff Squyres	773d6039e7	Merge pull request #2306 from hjelmn/alps_cores ras/alps: use cpuCnt if using hwthreads as cores	2016-10-31 15:22:13 -04:00
Gilles Gouaillardet	30298cc83c	oob/tcp: remove debug that should have never been committed Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2016-10-31 16:41:14 +09:00
Gilles Gouaillardet	75e96004a4	oob/tcp: fix a typo in mca_oob_tcp_component_no_route() Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2016-10-31 16:30:24 +09:00
Gilles Gouaillardet	fb5bcc47ce	ess/singleton: use opal_setenv instead of putenv so it fixes a memory leak on finalize Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2016-10-28 09:32:30 +09:00
Gilles Gouaillardet	ef2b3ac8d2	rml/oob: fix misc memory leaks in open_conduit() Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2016-10-28 09:28:42 +09:00
Gilles Gouaillardet	831f7d9c9d	rml/base: plug misc memory leaks plug leaks in orte_rml_API_get_contact_info() and orte_rml_base_close() Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2016-10-28 09:28:05 +09:00
Nathan Hjelm	c3614d30fa	ras/alps: use cpuCnt if using hwthreads as cores This commit updates the alps ras component to allow the use of hyperthreads on compute nodes. In this case we need to use the cpuCnt value from the node structure instead of numPEs. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2016-10-27 09:51:17 -06:00
Gilles Gouaillardet	3d4285b04d	oob/tcp: silence valgrind warning fully initialize allocated memory to keep valgrind happy Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2016-10-27 17:12:46 +09:00
rhc54	2b18044051	Merge pull request #2301 from rhc54/topic/update Update PMIx to latest master tarball. Ensure we set the HNP name for …	2016-10-26 16:42:15 -07:00
Ralph Castain	f298f294e1	Update PMIx to latest master tarball. Ensure we set the HNP name for orted's so that PMIx_Lookup can find the server Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-10-26 15:48:56 -07:00
Anandhi S Jayakumar	94593ca20b	Adding ofi plugin to allow for opening a conduit to use ethernet/fabric. modified: ../orte/mca/rml/base/rml_base_frame.c modified: ../orte/mca/rml/base/rml_base_stubs.c deleted: ../orte/mca/rml/ofi/.opal_ignore modified: ../orte/mca/rml/ofi/Makefile.am modified: ../orte/mca/rml/ofi/rml_ofi.h modified: ../orte/mca/rml/ofi/rml_ofi_component.c modified: ../orte/mca/rml/ofi/rml_ofi_send.c modified: ../orte/test/system/ofi_conduit_stress.c Removed stale include directive modified: ../orte/mca/rml/ofi/Makefile.am The ofi plugin supports multiple providers, and identifies them by ofi_prov_id, changed the previous name conduit_id to ofi_prov_id modified: ../orte/mca/rml/base/base.h modified: ../orte/mca/rml/ofi/rml_ofi.h modified: ../orte/mca/rml/ofi/rml_ofi_component.c modified: ../orte/mca/rml/ofi/rml_ofi_request.h modified: ../orte/mca/rml/ofi/rml_ofi_send.c Fixed merge issues, and minor pull-request comments modified: ../orte/mca/rml/base/base.h modified: ../orte/mca/rml/base/rml_base_frame.c modified: ../orte/mca/rml/ofi/rml_ofi.h modified: ../orte/mca/rml/ofi/rml_ofi_component.c Removed trailing space modified: ../orte/mca/rml/ofi/rml_ofi_component.c Cleaned up test- ofi_conduit_stress.c modified: ../orte/test/system/ofi_conduit_stress.c cleaned up printing the provider info during initialisation modified: ../orte/mca/rml/ofi/rml_ofi.h modified: ../orte/mca/rml/ofi/rml_ofi_component.c Signed-off-by: Anandhi S Jayakumar <anandhi.s.jayakumar@intel.com> Fixing warnings modified: ../orte/mca/rml/ofi/rml_ofi.h modified: ../orte/mca/rml/ofi/rml_ofi_component.c modified: ../orte/mca/rml/ofi/rml_ofi_send.c Signed-off-by: Anandhi S Jayakumar <anandhi.s.jayakumar@intel.com> minor cleanup modified: ../orte/mca/rml/ofi/rml_ofi_component.c modified: ../orte/mca/rml/ofi/rml_ofi_send.c Signed-off-by: Anandhi S Jayakumar <anandhi.s.jayakumar@intel.com> more cleanup modified: ../orte/mca/rml/ofi/rml_ofi_component.c Signed-off-by: Anandhi S Jayakumar <anandhi.s.jayakumar@intel.com> Sending the ethernet address only in the get_contact_info, rest will be sent through modex modified: ../orte/mca/rml/ofi/rml_ofi.h modified: ../orte/mca/rml/ofi/rml_ofi_component.c Signed-off-by: Anandhi S Jayakumar <anandhi.s.jayakumar@intel.com> Adding error logging on failures modified: ../orte/mca/rml/ofi/rml_ofi_component.c Signed-off-by: Anandhi S Jayakumar <anandhi.s.jayakumar@intel.com> Handling the OPAL_MODEX_SEND/RECV generically for all ofi providers. modified: ../orte/mca/rml/ofi/rml_ofi.h modified: ../orte/mca/rml/ofi/rml_ofi_component.c modified: ../orte/mca/rml/ofi/rml_ofi_send.c Signed-off-by: Anandhi S Jayakumar <anandhi.s.jayakumar@intel.com> Adding to build ofi for limited people new file: ../orte/mca/rml/ofi/.opal_ignore new file: ../orte/mca/rml/ofi/.opal_unignore Signed-off-by: Anandhi S Jayakumar <anandhi.s.jayakumar@intel.com> Removing the error logging for now modified: ../orte/mca/rml/ofi/rml_ofi_component.c	2016-10-26 13:11:07 -07:00
Ralph Castain	d031946c46	When mpirun operates in --continuous mode, we won't terminate the job when a remote process dies. In that case, we have to activate both the waitpid _and_ the IOF complete states to ensure we properly mark the proc as dead and perform any required notifications Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-10-25 12:18:14 -07:00
Ralph Castain	227d4d9609	Open the conduits for application procs - we probably can remove all the RML-related frameworks from MPI applications now, but let's wait a bit to ensure we have cleaned up all the points where messaging might occur.	2016-10-24 16:53:19 -07:00
Ralph Castain	649301a3a2	Revise the routed framework to be multi-select so it can support the new conduit system. Update all calls to rml.send* to the new syntax. Define an orte_mgmt_conduit for admin and IOF messages, and an orte_coll_conduit for all collective operations (e.g., xcast, modex, and barrier). Still not completely done as we need a better way of tracking the routed module being used down in the OOB - e.g., when a peer drops connection, we want to remove that route from all conduits that (a) use the OOB and (b) are routed, but we don't want to remove it from an OFI conduit.	2016-10-23 21:52:39 -07:00
Ralph Castain	df8ac7b747	Properly mark a node as down and decrease the number of daemons so any subsequent grpcomm collectives can correctly operate. Note that only the direct grpcomm component knows how to deal with down nodes.	2016-10-21 09:53:37 -07:00
Gilles Gouaillardet	1846c2d8ad	plm/rsh: use an alternate port if the ORTE_NODE_PORT attribute is set	2016-10-19 16:18:52 +09:00
Ralph Castain	16540c7422	Properly report failure to launch when someone mis-types the name of the application Fixes #2233	2016-10-18 10:09:30 -07:00
Ralph Castain	7be607582e	ORTE applications need to commit any modex send's prior to calling fence	2016-10-18 09:22:56 -07:00
Ralph Castain	57114a09ae	Pickup the npernode and npersocket options and include them in the job object	2016-10-17 12:26:21 -07:00
Gilles Gouaillardet	bd1b6fe661	rml/oob: add a missing include file	2016-10-16 10:25:00 +09:00
Gilles Gouaillardet	451b9dc467	ess: tear down pmix (if any) before oob	2016-10-13 14:08:02 +09:00
Ralph Castain	fca1556787	Some compilers apparently complain about this, so modify the typedef statements	2016-10-12 08:44:03 -07:00
Ralph Castain	a2919174d0	Bring the RML modifications across. This is the first step in a revamp of the ORTE messaging subsystem to support fabric-based communications during launch and wireup phases. When completed, the grpcomm and plm frameworks will each have their own "conduit" for communication - each conduit corresponds to a particular RML messaging transport. This can be the active OOB-based component, or a provider from within the RML/OFI component. Messages sent down the conduit will flow across the associated transport. Multiple conduits can exist at the same time, and can even point to the same base transport. Each conduit can have its own characteristics (e.g., flow control) based on the info keys provided to the "open_conduit" call. For ease during the transition period, the "legacy" RML interfaces remain as wrappers over the new conduit-based APIs using a default conduit opened during orte_init - this default conduit is tied to the OOB framework so that current behaviors are preserved. Once the transition has been completed, a one-time cleanup will be done to update all RML calls to the new APIs and the "legacy" interfaces will be deleted. While we are at it: Remove oob/usock component to eliminate the TMPDIR length problem - get all working, including oob_stress	2016-10-11 16:01:02 -07:00
Gilles Gouaillardet	c92e9a5406	use the new OPAL_HASH_TABLE_FOREACH convenience macro	2016-10-08 16:58:20 +09:00
Gilles Gouaillardet	0931d09afa	ess/singleton: silence a valgrind warning initialize a pointer and keep valgrind happy about it	2016-09-27 15:22:39 +09:00
Gilles Gouaillardet	f9ebba4668	ess/singleton: only realloc() when required in fork_hnp()	2016-09-23 16:35:59 +09:00
Gilles Gouaillardet	c7bf9a0ec9	ess/singleton: fix read on the pipe to spawn'ed orted and close the pipe on both ends when it is no longer needed	2016-09-22 14:21:52 +09:00
Ralph Castain	de7b1494d9	Clean out old cruft from the ORCM project	2016-09-21 00:13:30 -07:00
Gilles Gouaillardet	83399adb3f	singleton: "safe" read/write to the pipe between (spawn'ed) orted and singleton	2016-09-20 14:56:58 +09:00
Gilles Gouaillardet	e7ae6975d0	orted: fix spawn in singleton mode in singleton mode, have the spawn'ed orted invoke orte_pre_condition_transports() and send the transport key back to the singleton	2016-09-20 14:39:22 +09:00
Ralph Castain	a16b3cc33d	Fix some minor complaints - missing "void" in function parameters	2016-09-15 15:18:42 -07:00
Ralph Castain	6f086189e6	Fix trivial typo	2016-09-15 13:10:55 -07:00
Gregory M. Kurtzer	16794cc260	Updates to support Singularity containers v2.2	2016-09-15 09:52:06 -07:00
Gilles Gouaillardet	11ebf3ab23	ess/singleton: when forking hnp, use the PMIX_NAMESPACE sent by the hnp as the jobid	2016-09-15 13:57:23 +09:00
Gilles Gouaillardet	e84b35217f	oob/tcp: plug a memory leak as reported by Coverity with CID 1196711	2016-09-08 18:50:18 +09:00
Gilles Gouaillardet	b2a2be0e5a	odls: fix memory leak plug This fixes commit open-mpi/ompi@e2c343cdfc.	2016-09-08 10:02:52 +09:00
Artem Polyakov	9eba1b0b75	Merge pull request #2042 from artpol84/pmix_sdirs Several fixes related to session directories:	2016-09-07 14:15:47 +07:00
Artem Polyakov	a9a7f39773	ess/pmi: fix the comments about MCA/PMIx setting conflict resolution.	2016-09-07 07:47:35 +03:00
Gilles Gouaillardet	e2c343cdfc	odls: plug memory leak as reported by Coverity with CID 710645	2016-09-07 10:08:44 +09:00
Gilles Gouaillardet	c09899f6af	plm: plug resource leaks as reported by Coverity with CIDs 72274 and 1196733	2016-09-07 10:08:44 +09:00
Josh Hursey	f6337f9eae	Merge pull request #2047 from jjhursey/topic/mixed-host2 orte: !FQDN implementation to use opal_net_isaddr	2016-09-06 13:08:54 -05:00
Ralph Castain	f85dcaee2a	Fixes CID 1369067 and CID 1196684 Fixes CID 1369648 Fixes CID 1372409	2016-09-06 08:43:15 -07:00
Artem Polyakov	74a11d7832	Fix session dir cleanup code.	2016-09-05 07:53:55 +03:00
Artem Polyakov	dc0ab674de	Add PMIx key to provide RM with ability to indicate that it will clean up session directories provided through OPAL_PMIX_TMPDIR, OPAL_PMIX_NSDIR, OPAL_PMIX_PROCDIR	2016-09-05 07:48:44 +03:00
Artem Polyakov	81195ab724	Several fixes related to session directories: * enable OMPI to retrieve paths from RM through PMIx * cleanups related to tempdirs.	2016-09-05 07:48:44 +03:00
Ralph Castain	fb51d65049	Minor change: check for NULL before using the job map to avoid segfault when erroring out prior to creating the map	2016-09-04 07:53:12 -07:00
Joshua Hursey	fe937d1e82	orte: !FQDN implementation to use opal_net_isaddr * Switch to use opal_net_isaddr() for checking if a name is an IP address - as it is a bit cleaner, and uses common functionality.	2016-09-02 13:31:49 -05:00
Ralph Castain	4e0788e9ad	Enable PSM to support dynamic processes Fix comm_spawn to correctly reference the actual parent process that requested the spawn when looking for the parent job object	2016-09-02 10:22:04 -07:00
Ralph Castain	0ea1cff733	Implement notification of completion on comm_spawn'd child jobs. Add a configure flag to enable PMIx 3's shared memory datastore, and set it disabled by default so that comm_spawn functions again. Will reverse the default once that feature is fully functional	2016-09-01 13:10:10 -07:00
Gilles Gouaillardet	0b8c58298d	oob/usock: fix handling of orte_process_name_t * orte_process_name_t is aligned on 32 bits, so it cannot simply be cast into an int64_t. use memcpy() instead Thanks Paul Hargrove for the report	2016-09-01 13:18:02 +09:00
Ralph Castain	c1050bc01e	Provide a mechanism for obtaining memory profiles of daemons and application profiles for use in studying our memory footprint. Setting OMPI_MEMPROFILE=N causes mpirun to set a timer for N seconds. When the timer fires, mpirun will query each daemon in the job to report its own memory usage plus the average memory usage of its child processes. The Proportional Set Size (PSS) is used for this purpose.	2016-08-31 09:32:07 -07:00
Ralph Castain	9b991bd1f5	Ensure that the "running" state is correctly updated It is possible that one or more procs could get thru PMIx_Init, and thus be marked as in state "registered", before all local procs have been started. If that happens, then we would report some of the procs in state "running", and the others in state "registered" - which means that the HNP would miss the "running" stage of the state machine. Thanks to Jingchao Zhang for his patience in tracking this down on the 2.0 branch	2016-08-30 19:24:39 -07:00
Josh Hursey	b0d8638824	Merge pull request #2015 from jjhursey/topic/mixed-hostnames orte: Expand use of !orte_keep_fqdn_hostnames MCA parameter	2016-08-29 09:14:54 -05:00
Ralph Castain	2f6e0fec90	Provide the number of nodes in the job	2016-08-26 14:50:41 -07:00
Joshua Hursey	d26dd2c20e	orte: Expand the application of !orte_keep_fqdn_hostnames * Expand the use of the `orte_keep_fqdn_hostnames` MCA parameter when it is set to false. * If that parameter is set to false (default) then short hostnames (e.g., `node01`) will match with the long hostnames (e.g., `node01.mycluster.org`). This allows a user (or resource manager) to mix the use of short and long hostnames. - Note that this mechanism does _not_ perform a DNS lookup, but instead strips off the FQDN by truncating the hostname string at the first `.` character (when not an IP address). - By default (`false`) the following is true: `node01 == node01.mycluster.org == node01.bogus.com` since we use `node01` as the hostname.	2016-08-26 16:09:04 -05:00
Artem Polyakov	55ac3b0be3	orte/schizo: fix binding detection in slurm component in SLURM 16.05 the SLURM_CPU_BIND_TYPE is equal to "mask_cpu:" instead of "mask_cpu". Account for that.	2016-08-26 09:55:52 +03:00
rhc54	19b0f4db9f	Merge pull request #1995 from rhc54/topic/pe-per-rank Change the behavior of cpus-per-rank.	2016-08-25 14:38:12 -05:00
Ralph Castain	440eae90ec	Correct the binding algorithm to decouple it from oversubscribe. Oversubscribe stipulates that we allow more procs on the node than assigned slots - it has nothing to do with the number of available pe's. Let overload directives handle the pe situation.	2016-08-24 21:17:22 -07:00
Gilles Gouaillardet	93e73841f9	ess/singleton: push all PMIX_* environment variables, regardless how many there are	2016-08-23 09:46:55 +09:00
Gilles Gouaillardet	a1e8e58a8a	ess/singleton: expects 4 PMIX_* environment variables or more	2016-08-23 09:34:03 +09:00
Ralph Castain	7de4d6922b	Change the behavior of cpus-per-rank. We previously counted each cpu against the #slots. However, IBM has pointed out that "slot" is equated to the number of processes allowed to run on each node, and not the number of cpus on the node. This has been a continuing source of confusion, so make the distinction a "hard" one. Each process occupies a "slot". We automatically set #slots = #cpus if nothing else is told to us. If you want to run more procs than slots, you must tell us to allow oversubscription. A process can utilize multiple pe's if that option is given. If you try to bind more than one proc to a given pe, then we will error out unless you tell us to allow overloading.	2016-08-22 15:54:41 -07:00
Jeff Squyres	71ec5cfb43	rsh: robustify the check for plm_rsh_agent default value Don't strcmp against the default value -- the default value may change over time. Instead, check to see if the MCA var source is not DEFAULT. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2016-08-16 06:58:20 -05:00
rhc54	d7cd802426	Merge pull request #1971 from rhc54/topic/sesdir Update the session dir structure. Restore the creation of a top-level…	2016-08-16 03:14:08 -05:00
Ralph Castain	ae2af61ee3	Update the session dir structure. Restore the creation of a top-level dir based on userid so that everything is contained under the user's top-level dir. Make the next level down (the "job family" level) be either the pid (indicated by a name of "pid.N") or the job family if not launched by mpirun. This allows for proper rendezvous by direct-launched procs.	2016-08-15 22:46:46 -05:00
Ralph Castain	9f43db7303	Further cleanup getpwuid usage - try it first (unless completely disabled), and then silently failover to try other methods.	2016-08-15 07:51:36 -07:00
Ralph Castain	be8424b691	Provide backward compatible keys so that the non-PMIx components in the opal/pmix framework don't have to adjust as we continue to work on finalizing the PMIx reference scheme. Activate and utilize the new PMIx show_help capability to provide more meaningful error output when the server cannot start. Add a contrib script to cleanup permissions incorrectly modified due to things like smb mounts	2016-08-13 12:13:04 -07:00
Ralph Castain	08a0644df5	Fix shared memory rendezvous	2016-08-13 08:14:50 -07:00
rhc54	ddde154d28	Merge pull request #1962 from rhc54/topic/notify Ensure we properly convert pmix status to ORTE state before activatin…	2016-08-13 06:59:50 -07:00
Ralph Castain	48d35a9627	Ensure we properly convert pmix status to ORTE state before activating an error state upon notification. Cleanup some conversion issues on notification info. Add a new orte_notify.c test program	2016-08-12 21:14:29 -07:00
rhc54	9eed451916	Merge pull request #1960 from rhc54/topic/rsh Restore the rsh template creation code	2016-08-12 13:38:43 -07:00
rhc54	1ef3c86d44	Merge pull request #1931 from hjelmn/ess_fix ess/base: set up nidmap after pmix	2016-08-12 13:10:30 -07:00
Ralph Castain	5717b75b45	Restore the rsh template creation code	2016-08-12 12:43:40 -07:00
Ralph Castain	1c44543854	If the ssh agent hasn't been given, then check for qrsh and friends	2016-08-12 07:46:39 -07:00
Artem Polyakov	1351a7065c	ess/pmi: minor code readability cleanup. Split process name variable "name" to - "wildcard_rank" for the cases where wildcard is used. - "pname" for the case where reference to particular process is needed.	2016-08-06 15:45:19 +06:00
Nathan Hjelm	3c23502dfe	ess/base: set up nidmap after pmix This fixes a SEGV when the nidmap code attempts to use opal_pmix.store_local before pmix is set up. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2016-08-02 09:50:00 -06:00
Ralph Castain	71de03fc67	Cleanup the new naming requirements to ensure that info is correctly retrieved Cleanup permissions Restore singleton operations	2016-07-21 09:46:03 -07:00
Ralph Castain	01a653d50a	Remove a debug print in comm_cid.c. Update PMIx2 to include the revised PMIx_Get logic for higher performance by reducing the number of hash table lookups. Fix a bug where requests for data from a proc in another nspace could hang, or result in "not found". Remove stale file reference Restore autogen pass thru pmix Remove generated file	2016-07-20 00:58:19 -07:00
rhc54	2414244171	Merge pull request #1872 from rhc54/topic/continuous Add support for continuously operating applications	2016-07-13 15:29:31 -07:00
Ralph Castain	20a91c2baf	Add a new --continuous flag to mpirun that directs ORTE to let a job continue running as app procs terminate. Don't attempt to restart them. Add event notification of abnormally terminating procs, and demonstrate that in the mpi_spin test program. Cleanup debug message	2016-07-13 15:28:33 -07:00
Ralph Castain	ddd0d05de3	Fix a bug in the handling of nper<foo> when -host or -hostfile was given. Correctly mark slots as "given" when we auto-assign them. Ensure we don't set the number of procs when using nper<foo> so the PPR mapper can correctly assign them.	2016-07-12 09:27:02 -07:00
Ralph Castain	ee56d9dc1a	Shorten the session directory name as some OS's are now providing unusually long temp directory names, causing us to overflow the sockaddr field	2016-07-05 14:59:50 -07:00
Ralph Castain	5d330d5220	Enable the PMIx event notification capability and use that for all error notifications, including debugger release. This capability requires use of PMIx 2.0 or above as the features are not available with earlier PMIx releases. When OMPI master is built against an earlier external version, it will fallback to the prior behavior - i.e., debugger will be released via RML and all notifications will go strictly to the default error handler. Add PMIx 2.0 Remove PMIx 1.1.4 Cleanup copying of component Add missing file Touchup a typo in the Makefile.am Update the pmix ext114 component Minor cleanups and resync to master Update to latest PMIx 2.x Update to the PMIx event notification branch latest changes	2016-06-14 13:08:41 -07:00
Ralph Castain	a6e6c37484	Remove stale map-reduce support	2016-06-12 07:41:57 -07:00
Ralph Castain	dd0f843843	Fix rare hangs observed on OS-X by properly thread-shifting upcalls from the PMIx server into ORTE	2016-06-05 21:39:44 -07:00
Ralph Castain	0ba9572f9f	Cleanup the forced termination a bit by restoring the delay before issuing the sigkill, and eliminating the large time loss spent checking if the proc died. The latter is responsible for a large number of test timeouts in MTT Update alps component	2016-06-02 17:48:21 -07:00
Gilles Gouaillardet	5f565dfec3	configury: clean the flex generated .c files	2016-06-01 11:13:31 +09:00
Ralph Castain	3913595e10	Enable simulation of large-scale clusters by allowing multiple daemons/node. Specifying the ras_base_multiplier parameter to be greater than 1 will cause ORTE to replicate each allocated node by that factor. A daemon will be spawned for each replica, thus letting ORTE function as if it were on a much larger cluster. Note that this cannot be used for MPI performance testing. It is really only useful for ORTE scaling tests. It also only works with the rsh/ssh launcher.	2016-05-29 18:56:18 -07:00
Ralph Castain	ebe159acef	Add a timeout cmd line option and an option to report state info upon timeout to assist with debugging Jenkins tests If requested, obtain stacktraces for each application process and report them to stderr upon timeout stack traces: minor improvements - Also include the hostname and PID of each process for which we're sending the stack traces (vs. just including the ORTE process name) - Send a specific error message if we couldn't find "gstack" in the $PATH (e.g., on OS X) - Send a specific error message if gstack fails to run - Print a message that obtaining the stack traces may take a few seconds so that users don't wonder what's happening Signed-off-by: Jeff Squyres <jsquyres@cisco.com> help-orterun.txt: minor tweaks Trivial update: show "--timeout" (instead of "-timeout") in the help message, just to encourage the use of double-dash options. Signed-off-by: Jeff Squyres <jsquyres@cisco.com> trivial: stacktrace -> stack trace Trivial wordsmithing. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2016-05-28 08:36:25 -07:00
Jeff Squyres	dd9a819a1c	odls_default: do not opal_output() while creating a process! It is verboten to use opal_output() after the fork() but before the exec()! It results in all manner of undefined behavior. For example, on some OS X systems, if you run a trivial "hello world" MPI program with a high level of ODLS verbosity: ```sh $ mpirun -np 3 --mca odls_base_verbose 100 ./hello_c ``` You will see a bunch of output from the mpirun ODLS base, but then it may hang in odls_default_module.c:do_child() -- after the fork() but before the exec() -- while trying to opal_output() some debugging statements. The solution is to remove these extraneous opal_output() statements. Indeed, the ODLS base is already outputting the same information that these opal_output() statements are trying to emit, anyway. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2016-05-24 21:28:57 -04:00
Ralph Castain	30aaf785a8	Fix the dist mapper option	2016-05-23 23:20:33 -07:00
George Bosilca	50b37758d4	Don't overwrite the function argument. In a MPMD setup the app in the jdata can be NULL, so make sure we don't leave the main argument to an inconsistent value.	2016-05-19 10:35:23 -04:00
Ralph Castain	7e5ef6a240	Fix the env_list support - the MCA param was being set way too early, so provide a "backdoor" way of providing the value	2016-05-06 15:38:39 -07:00
Ralph Castain	58dd41facf	Repair the processing of cmd line options that mapped to MCA params. This was responsible for breaking things like map-by <foo>. Remove debug, let orterun send terminate cmd to DVM Recover the DVM support	2016-05-06 13:14:03 -07:00
rhc54	ff8518853e	Merge pull request #1604 from rhc54/topic/psm2 Improve the transport key print statement to ensure that we don't get…	2016-05-03 13:43:10 -07:00
Jeff Squyres	265e5b9795	Merge pull request #1552 from kmroz/wip-hostname-len-cleanup-1 ompi/opal/orte/oshmem/test: max hostname length cleanup	2016-05-02 09:44:18 -04:00
rhc54	2fa8b6c6ac	Merge pull request #1525 from rhc54/topic/schizo Extend the schizo framework	2016-05-01 15:09:08 -07:00
Ralph Castain	6ac7929bd0	Extend the schizo framework to allow definition of CLI options by environment. Refactor orterun to mesh with the orted_submit code, thus improving code reuse. Eliminate the orte-submit tool as orterun can now meet that need. Cleanups per @jjhursey review	2016-05-01 11:30:25 -07:00
Ralph Castain	0f05893952	Ensure consistency between max_procs and univ_size values - since orte wants max_procs, have the proc get that value instead of univ_size Make the singleton module consistent as well	2016-05-01 11:13:33 -07:00
Ralph Castain	29bc24bdd5	Improve the transport key print statement to ensure that we don't get zero fields as this can be a problem for PSM	2016-04-28 20:11:12 -07:00
Ralph Castain	e6ad1ad621	Up-port of change for 2.x: if user directs oversubscribe, then do not bind as we will otherwise overload resources	2016-04-28 13:21:10 -07:00
Ralph Castain	75dc4c305a	Correctly set the #procs in the job to "job_size", and the max_procs to "univ_size"	2016-04-27 12:00:19 -07:00
Gilles Gouaillardet	6bf57c799f	orte/rml: ORTE_RML_SEND_COMPLETE handles messages with both NULL iov and cbfunc.buffer	2016-04-26 09:19:31 +09:00
Karol Mroz	5c11bdb251	orte: fixup hostname max length usage Also removes orte specific max hostname value. Signed-off-by: Karol Mroz <mroz.karol@gmail.com>	2016-04-25 07:08:23 +02:00
Joshua Hursey	29b49351af	ras/lsf: Fix affinity for MPMD jobs running under LSF	2016-04-22 11:18:34 -05:00
Jeff Squyres	68c1a5eb6c	Merge pull request #1567 from jsquyres/pr/fix-ompi-to-opal-name-conversion m4: rename OMPI_SUMMARY_* macros to OPAL_SUMMARY_*	2016-04-20 13:10:06 -04:00
Jeff Squyres	6800ef9ec0	m4: rename OMPI_SUMMARY_* macros to OPAL_SUMMARY_* These macros should really be named OPAL_SUMMARY_*; they're used in all projects, and therefore should be in the lowest layer project (OPAL). Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2016-04-20 08:40:00 -07:00
Ralph Castain	449ec41532	Roll to PMIx 1.1.4rc1 and remove the PMIx 1.2.0 directory as the community has decided to not do that release version. This incorporates a number of bug fixes that have been identified and repaired in the PMIx and OMPI code bases. Also includes several minor corrections to the PMIx code so it now supports run-thru without hanging on collectives involving a process that exits	2016-04-15 10:11:11 -07:00
Ralph Castain	1fa236b26c	Ensure that we exit with a non-zero status when oversubscribe fails	2016-04-14 05:51:10 -07:00
Ralph Castain	437f5b4289	Fix map-by node and do-not-launch	2016-04-13 09:21:19 -07:00
Ralph Castain	2432daf065	Some minor cleanups of a memory leak and error output	2016-04-08 07:46:18 -07:00
Rainer Keller	52080a5736	As per the pull request to pmix/master: https://github.com/pmix/master/pull/71 Have OMPI's current version of pmix120 nicely fail in case of too long sun_path (longer than 108 or in case of OSX 103 chars). And have OMPI return proper error messages with hints how to amend.	2016-04-07 22:12:53 +02:00
rhc54	a95de6e8ef	Merge pull request #1353 from rhc54/topic/host Per the discussion on the telecon, change the -host behavior yet again	2016-04-04 10:30:36 -07:00
Gilles Gouaillardet	d757fbba5d	oob/usock: drop message to be sent in process_send()	2016-04-04 16:04:54 +09:00
Gilles Gouaillardet	170734182b	oob/usock: mca_oob_usock_peer_close() sets peer->sd = -1 after close() so usock_peer_create_socket knows it must re-create the socket /* assuming it is ever supposed to occur */ also fix a typo (peer->sd >= 0) in usock_peer_create_socket	2016-04-04 16:02:05 +09:00
Ralph Castain	503e1274a9	Per the discussion on the telecon, change the -host behavior so we only run one instance if no slots were provided and the user didn't specify #procs to run. However, if no slots are given and the user does specify #procs, then let the number of slots default to the #found processing elements Ensure the returned exit status is non-zero if we fail to map If no -np is given, but either -host and/or -hostfile was given, then error out with a message telling the user that this combination is not supported. If -np is given, and -host is given with only one instance of each host, then default the #slots to the detected #pe's and enforce oversubscription rules. If -np is given, and -host is given with more than one instance of a given host, then set the #slots for that host to the number of times it was given and enforce oversubscription rules. Alternatively, the #slots can be specified via "-host foo:N". I therefore believe that row #7 on Jeff's spreadsheet is incorrect. With that one correction, this now passes all the given use-cases on that spreadsheet. Make things behave under unmanaged allocations more like their managed cousins - if the #slots is given, then no-np shall fill things up. Fixes #1344	2016-03-29 11:21:57 -07:00
Ralph Castain	bd18d9c9d5	Ensure the compiler knows that a critical variable is volatile	2016-03-29 09:18:25 -07:00
Howard Pritchard	e7433fcb44	Merge pull request #1486 from hppritcha/topic/fix_wlm_detect_code plm/alps: fix usage of cray wlm_detect methods	2016-03-26 13:22:50 -06:00
Ralph Castain	0e1350f5b7	Add missing header files	2016-03-25 09:06:51 -07:00
Ralph Castain	a3fea58d1c	Minor cleanups to prior PR commit	2016-03-24 15:55:14 -07:00
rhc54	6756e19aa2	Merge pull request #1457 from anandhis/master rml changes	2016-03-24 15:17:29 -07:00
rhc54	ba8c8700aa	Merge pull request #1493 from rhc54/topic/sing Update singularity support to track changes in upstream Singularity code	2016-03-24 15:16:38 -07:00
Ralph Castain	8c14df2328	Revert "Modify singularity support per patch from Greg Kurtzer" This reverts commit open-mpi/ompi@f7257a8310. Ensure that we properly cleanup the session directory tree. Prior code had issues with symlinks, especially if the file that the link points to was already removed as we traverse the tree. Also found that the dirent checks for directory type weren't fully portable, and so fall back to the stat-based approach which is known to be portable. Fix singularity singletons by detecting we are in a container and properly setting the pmix selection to pick the isolated component. Remove a stale restriction blocking use of the sm btl	2016-03-24 11:27:18 -07:00
Ralph Castain	378d9cbb5e	Extend the abort on non zero status flag to apply to processes which die as the result of signals.	2016-03-24 08:33:55 -07:00
Ralph Castain	cdd3dc99ca	Correct the binding for the --map-by node case - we should still use our default binding algorithms	2016-03-23 09:55:24 -07:00
Ralph Castain	6e6bbfda91	Very minor typo	2016-03-23 08:31:47 -07:00
Howard Pritchard	69200e6229	plm/alps: fix usage of cray wlm_detect methods Turns out there are some cases where the Cray wlm_detect_get_active may return NULL, in which case fallback to wlm_detect_get_default method is suggested. Make use of the fallback to avoid segfaults under some circumstances in the ALPS plm selection method. Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2016-03-22 11:40:56 -07:00
Ralph Castain	c146c4969b	Revert part of open-mpi/ompi@c1bbbb5e2f to restore the usock component, thus fixing show_help aggregation. Fixes #1467 Restore debugger attach operations Fixes #1225	2016-03-18 21:49:04 -07:00
Ralph Castain	2970becd6b	Revert "Merge pull request #1451 from ggouaillardet/topic/orte_fork_wrapper_fullname" This reverts commit `efafd62d38`, reversing changes made to `a93b849f13`.	2016-03-18 07:18:36 -07:00
Gilles Gouaillardet	589924c4aa	odls/base: use the full app name when using an orte fork agent	2016-03-14 11:18:21 +09:00
Anandhi S Jayakumar	a31292abc7	fixes to ud for removing qos channel	2016-03-10 18:03:17 -08:00
Ralph Castain	a4c8e8c28a	Cleanup the proposed change: * qos framework is moving to the scon layer and is no longer required in ORTE * remove the rml/ftrm component as we now have multiple active components, and so the wrapper needs to be rethought * no need for separating the "base" from "API" module definition. The two are identical * move the "stub" functions into their own file for cleanliness * general cleanup to meet coding standards * cleanup some logic in the stubs	2016-03-10 13:14:17 -08:00
Jeff Squyres	48c650c47a	configury: minor updates to config summary output	2016-03-10 13:02:52 -08:00
Anandhi S Jayakumar	0188c3cf81	Adding commit for multiple plugin loading support in RML	2016-03-09 18:13:48 -08:00
Ralph Castain	f7257a8310	Modify singularity support per patch from Greg Kurtzer	2016-03-09 07:52:11 -08:00
Ralph Castain	f3ae30ff39	Fix singletons yet again...	2016-03-08 10:33:35 -08:00
Ralph Castain	d72c1c72ff	Do not push child processes into separate process groups so that any host RM can still "see" them, and ensure that any signal sent to the orted's themselves will be provided to all child processes. Forward all signals from mpirun to the child processes, removing the old MCA parameter required to turn that behavior "on".	2016-03-06 17:55:09 -08:00
Ralph Castain	4d0cc27eb7	Update the singularity support to match that of the latest singularity master. Remove the restriction on shared memory components by instructing singularity to not isolate the PID space. Add a new schizo API to allow setting up the original app_context. Ensure the container is installed prior to execution.	2016-03-05 21:47:42 -08:00
Ralph Castain	ce0a05d7d1	Minor cleanup - Singularity now has an internal check for installed, so we no longer need to do so.	2016-03-04 19:07:53 -08:00
Gilles Gouaillardet	80bdbfd9e7	add missing include file	2016-03-03 13:46:28 +09:00
Ralph Castain	4a55fba414	Fix registration of error handlers thru the pmix120 component. A thread-shift operation was hanging on the sync_event_base, which made it dependent on someone calling opal_progress. Unfortunately, a process in "sleep" or spinning outside the MPI library won't do that, and so we never complete errhandler registration.	2016-03-02 15:01:01 -08:00
Ralph Castain	1b81d90eaa	Minor cleanups required for orte-dvm operation	2016-03-01 18:12:53 -08:00
Ralph Castain	c9f7bb6751	Add the include file to all the schizo components	2016-03-01 13:18:23 -08:00
Ralph Castain	625083fe18	Add include file	2016-03-01 13:04:20 -08:00


4257 commits