openmpi

Автор	SHA1	Сообщение	Дата
Ralph Castain	105fb152e1	Silence Coverity warnings Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-03-13 08:38:51 -07:00
Ralph Castain	48fc339718	Create an alternative mapping method that pushes responsibility onto the backend daemons. By default, let mpirun only pack the app_context info and send that to the backend daemons where the mapping will be done. This significantly reduces the computational time on mpirun as it isn't running up/down the topology tree computing thousands of binding locations, and it reduces the launch message to a very small number of bytes. When running -novm, fall back to the old way of doing things where mpirun computes the entire map and binding, and then sends the full info to the backend daemon. Add a new cmd line option/mca param --fwd-mpirun-port that allows mpirun to dynamically select a port, but then passes that back to all the other daemons so they will use that port as a static port for their own wireup. In this mode, we no longer "phone home" directly to mpirun, but instead use the static port to wireup at daemon start. We then use the routing tree to rollup the initial launch report, and limit the number of open sockets on mpirun's node. Update ras simulator to track the new nidmap code Cleanup some bugs in the nidmap regex code, and enhance the error message for not enough slots to include the host on which the problem is found. Update gadget platform file Initialize the range count when starting a new range Fix the no-np case in managed allocation Ensure DVM node usage gets cleaned up after each job Update scaling.pl script to use --fwd-mpirun-port. Pre-connect the daemon to its parent during launch while we are otherwise waiting for the daemon's children to send their "phone home" rollup messages Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-03-07 20:43:12 -08:00
Jeff Squyres	fec519a793	hwloc: rename opal/mca/hwloc/hwloc.h -> hwloc-internal.h Per a prior commit, the presence of "hwloc.h" can cause ambiguity when using --with-hwloc=external (i.e., whether to include opal/mca/hwloc/hwloc.h or whether to include the system-installed hwloc.h). This commit: 1. Renames opal/mca/hwloc/hwloc.h to hwloc-internal.h. 2. Adds opal/mca/hwloc/autogen.options to tell autogen.pl to expect to find hwloc-internal.h (instead of hwloc.h) in opal/mca/hwloc. 3. s@opal/mca/hwloc/hwloc.h@opal/mca/hwloc/hwloc-internal.h@g in the rest of the code base. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2017-02-28 07:48:42 -08:00
Josh Hursey	a17b547430	Merge pull request #2957 from jjhursey/topic/ibm/rsh-sigint-fix plm/rsh: Fix signal handling for rsh launcher	2017-02-14 15:29:00 -06:00
Joshua Hursey	843fcca03c	plm/rsh: Fix signal handling for rsh launcher * Similar to the other launchers (i.e., slurm, alps) we need to put the children in a separate process group to prevent SIGINT (from a CTRL-C) from being delivered to the whole process group and prematurely killing the rsh/ssh connections to the remote daemons. Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>	2017-02-14 08:54:17 -06:00
Ralph Castain	dee2d8646d	Fix plm/rsh runtime check Fix the check for rsh/ssh so we allow the check for SGE and LoadLeveler to occur if user doesn't specify their own launch agent. Fix a Coverity warning Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-02-13 16:54:03 -08:00
Ralph Castain	230d15f0d9	Cleanup the ras simulator capability, and the relay route thru grpcomm direct. Don't resend wireup info if nothing has changed Fix release of buffer Correct the unpacking order Fix the DVM - now minimized data transfer to it Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-02-01 15:01:58 -08:00
Ralph Castain	b59ae14a2a	Fix static port and partial allocation operations Fix static port wireup by recording the TCP port mpirun is using and correctly passing the regex of hosts to the daemons. Do a better job of closing sockets on failed connection attempts. Correctly identify the remote host in the associated error message. Fix partial allocation operations by not attempting to set #slots on nodes that were not used, and thus don't have a daemon or topology assigned to them Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-28 10:09:44 -08:00
Ralph Castain	c803af5d3d	Minor change to allow qrsh to tree spawn, if supported Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-27 16:34:08 -08:00
Ralph Castain	7c795f4416	If the HNP is going to request topology info, it cannot do so via a routed OOB message as the intervening daemons may not be ready. So disable routing until the VM is ready, and have daemons start routing as they receive the xcast launch msg (which includes the data they need to talk to their peers). Do a little optimization and minimize recomputation of the routing plan. Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-27 15:37:16 -08:00
Ralph Castain	d672fad849	Repair rsh/ssh tree spawn Repair rsh/ssh tree spawn by unpacking and updating the nidmap in remote_spawn. Add more specific error messages so the cause of a messaging problem is a little clearer. Remove some stale code. Ensure we stop trying to send a message after a few times. Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-27 11:35:00 -08:00
Josh Hursey	2e64bf42fb	Merge pull request #2810 from jjhursey/fix/ibm/stdiag-to-stdout Extend options for stddiag routing	2017-01-26 14:29:16 -06:00
Ralph Castain	399de0738e	Cleanup launch Given that we only set OOB contact info from inside of events, or before we begin threaded operations (e.g., in the ess), allow set_contact_info to directly update the oob/base framework globals. Correct the nidmap regex decompression routine. Ensure that rank=1 daemon always sends back its topology as this is the most common use-case. Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-25 22:06:09 -08:00
Ralph Castain	2f4e87eae9	Have rank=1 daemon always send its topology back as this is the most common use-case Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-25 09:33:11 -08:00
Jeff Squyres	230bbc597d	plm base: make sure to assign "node" early enough Make sure to assign "node" before using it in ORTE_FLAG_SET. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2017-01-25 08:02:59 -08:00
Joshua Hursey	dcd9801f7c	orte/iof: Add orte_map_stddiag_to_stdout option * Similar to `orte_map_stddiag_to_stderr` except it redirects `stddiag` to `stdout` instead of `stderr`. * Add protection so that the user canot supply both: - `orte_map_stddiag_to_stderr` - `orte_map_stddiag_to_stdout` Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>	2017-01-24 16:22:59 -06:00
Ralph Castain	86ab751c5e	Next step in reducing launch time: begin reducing the size of the launch message itself. Start by expressing the daemon map as a set of three regular expression strings. On an 8k cluster, this reduces the nidmap contribution from over 200kBytes to 21 bytes in size. Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-23 19:54:47 -08:00
Ralph Castain	368684bd63	Revert `e9bc293` and try a different approach for scalably dealing with hetero clusters. Have each orted send back its topo "signature". If mpirun detects that this signature has not been seen before, then ask for that daemon to send back its full topology description. This allows the system to only get the topology once for each unique topo in the cluster. Cleanup a typo, and remove no longer needed MCA params for hetero nodes and hetero apps. Hetero nodes will always be automatically detected. We don't support a mix of 32 and 64 bit apps Modify the orte_node_t to use orte_topology_t instead of hwloc_topology_t, updating all the places that use it. Ensure that we properly update topology when we see a different one on a compute node. Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-18 10:22:15 -08:00
Ralph Castain	e9bc2934be	Add an MCA param "hnp_on_smgmt_node" that mpirun can use to tell the orteds to ignore its topology signature as mpirun is executing on a system mgmt node, and hence a different topology than the compute nodes Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-16 19:32:01 -08:00
Gilles Gouaillardet	6b9343a966	plm/rsh: plug a memory leak Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-01-06 15:38:45 +09:00
Gilles Gouaillardet	c0c5dd8ccc	orte: plug a memory leak in orte_rml.recv_cancel do not invoke orte_rml.recv_cancel after the orte progress thread has gone Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>	2017-01-06 15:38:44 +09:00
Ralph Castain	9eab9a1ed3	Remove stale global variables Revamp the event notification integration to rely on the PMIx event chaining and remove the duplicate chaining in OPAL. This ensures we get system-level events that target non-default handlers. Restore the hostname entries for MPI-level error messages, but provide an MCA param (orte_hostname_cutoff) to remove them for large clusters where the memory footprint is problematic. Set the default at 1000 nodes in the job (not the allocation). Begin first cut at memory profiler Some minor cleanups of memprobe Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2017-01-02 14:04:24 -08:00
Ralph Castain	791f4f1ce3	Adjust debug output for clarity Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-26 14:04:20 -08:00
Ralph Castain	79cde184ad	Allow a PMIx tool to spawn a job Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-12-03 16:00:47 -08:00
Ralph Castain	d5fd635efe	Bring forward the debugger-related changes Refs https://github.com/open-mpi/ompi/pull/2425 Signed-off-by: Ralph Castain <rhc@open-mpi.org>	2016-11-29 13:15:20 -08:00
Howard Pritchard	2cbc0e8472	pmix/cray: fix disable-dlopen problem PR open-mpi/ompi#2432 introduced a regression where configure and build with --disable-dlopn caused build failure owing to unresolved alps lli symbols in the libopal-pal shared library. This commit fixes this problem. Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2016-11-21 13:45:10 -06:00
Ralph Castain	649301a3a2	Revise the routed framework to be multi-select so it can support the new conduit system. Update all calls to rml.send* to the new syntax. Define an orte_mgmt_conduit for admin and IOF messages, and an orte_coll_conduit for all collective operations (e.g., xcast, modex, and barrier). Still not completely done as we need a better way of tracking the routed module being used down in the OOB - e.g., when a peer drops connection, we want to remove that route from all conduits that (a) use the OOB and (b) are routed, but we don't want to remove it from an OFI conduit.	2016-10-23 21:52:39 -07:00
Gilles Gouaillardet	1846c2d8ad	plm/rsh: use an alternate port if the ORTE_NODE_PORT attribute is set	2016-10-19 16:18:52 +09:00
Ralph Castain	de7b1494d9	Clean out old cruft from the ORCM project	2016-09-21 00:13:30 -07:00
Gilles Gouaillardet	c09899f6af	plm: plus resource leaks as reported by Coverity with CIDs 72274 and 1196733	2016-09-07 10:08:44 +09:00
Ralph Castain	4e0788e9ad	Enable PSM to support dynamic processes Fix comm_spawn to correctly reference the actual parent process that requested the spawn when looking for the parent job object	2016-09-02 10:22:04 -07:00
Joshua Hursey	d26dd2c20e	orte: Expand the application of !orte_keep_fqdn_hostnames * Expand the use of the `orte_keep_fqdn_hostnames` MCA parameter when it is set to false. * If that parameter is set to false (default) then short hostnames (e.g., `node01`) will match with the long hostnames (e.g., `node01.mycluster.org`). This allows a user (or resource manager) to mix the use of short and long hostnames. - Note that this mechanism does _not_ perform a DNS lookup, but instead strips off the FQDN by truncating the hostname string at the first `.` character (when not an IP address). - By default (`false`) the following is true: `node01 == node01.mycluster.org == node01.bogus.com` since we use `node01` as the hostname.	2016-08-26 16:09:04 -05:00
Jeff Squyres	71ec5cfb43	rsh: robustify the check for plm_rsh_agent default value Don't strcmp against the default value -- the default value may change over time. Instead, check to see if the MCA var source is not DEFAULT. Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2016-08-16 06:58:20 -05:00
Ralph Castain	9f43db7303	Further cleanup getpwuid usage - try it first (unless completely disabled), and then silently failover to try other methods.	2016-08-15 07:51:36 -07:00
Ralph Castain	5717b75b45	Restore the rsh template creation code	2016-08-12 12:43:40 -07:00
Ralph Castain	1c44543854	If the ssh agent hasn't been given, then check for qrsh and friends	2016-08-12 07:46:39 -07:00
Ralph Castain	ddd0d05de3	Fix a bug in the handling of nper<foo> when -host or -hostfile was given. Correctly mark slots as "given" when we auto-assign them. Ensure we don't set the number of procs when using nper<foo> so the PPR mapper can correctly assing them.	2016-07-12 09:27:02 -07:00
Ralph Castain	dd0f843843	Fix rare hangs observed on OS-X by properly thread-shifting upcalls from the PMIx server into ORTE	2016-06-05 21:39:44 -07:00
Ralph Castain	3913595e10	Enable simulation of large-scale clusters by allowing multiple daemons/node. Specifying the ras_base_multiplier parameter to be greater than 1 will cause ORTE to replicate each allocated node by that factor. A daemon will be spawned for each replica, thus letting ORTE function as if it were on a much larger cluster. Note that this cannot be used for MPI performance testing. It is really only useful for ORTE scaling tests. It also only works with the rsh/ssh launcher.	2016-05-29 18:56:18 -07:00
Ralph Castain	6ac7929bd0	Extend the schizo framework to allow definition of CLI options by environment. Refactor orterun to mesh with the orted_submit code, thus improving code reuse. Eliminate the orte-submit tool as orterun can now meet that need. Cleanups per @jjhursey review	2016-05-01 11:30:25 -07:00
Jeff Squyres	6800ef9ec0	m4: rename OMPI_SUMMARY_* macros to OPAL_SUMMARY_* These macros should really be named OPAL_SUMMARY_*; they're used in all projects, and therefore should be in the lowest later project (OPAL). Signed-off-by: Jeff Squyres <jsquyres@cisco.com>	2016-04-20 08:40:00 -07:00
Ralph Castain	437f5b4289	Fix map-by node and do-not-launch	2016-04-13 09:21:19 -07:00
Ralph Castain	503e1274a9	Per the discussion on the telecon, change the -host behavior so we only run one instance if no slots were provided and the user didn't specify #procs to run. However, if no slots are given and the user does specify #procs, then let the number of slots default to the #found processing elements Ensure the returned exit status is non-zero if we fail to map If no -np is given, but either -host and/or -hostfile was given, then error out with a message telling the user that this combination is not supported. If -np is given, and -host is given with only one instance of each host, then default the #slots to the detected #pe's and enforce oversubscription rules. If -np is given, and -host is given with more than one instance of a given host, then set the #slots for that host to the number of times it was given and enforce oversubscription rules. Alternatively, the #slots can be specified via "-host foo:N". I therefore believe that row #7 on Jeff's spreadsheet is incorrect. With that one correction, this now passes all the given use-cases on that spreadsheet. Make things behave under unmanaged allocations more like their managed cousins - if the #slots is given, then no-np shall fill things up. Fixes #1344	2016-03-29 11:21:57 -07:00
Howard Pritchard	69200e6229	plm/alps: fix usage of cray wlm_detect methods Turns out there are some cases where the Cray wlm_detect_get_active may return NULL, in which case fallback to wlm_detect_get_default method is suggested. Make use of the fallback to avoid segfaults under some circumstances in the ALPS plm selection method. Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2016-03-22 11:40:56 -07:00
Ralph Castain	c146c4969b	Revert part of open-mpi/ompi@c1bbbb5e2f to restore the usock component, thus fixing show_help aggregation. Fixes #1467 Restore debugger attach operations Fixes #1225	2016-03-18 21:49:04 -07:00
Jeff Squyres	48c650c47a	configury: minor updates to config summary output	2016-03-10 13:02:52 -08:00
Ralph Castain	4a55fba414	Fix registration of error handlers thru the pmix120 component. A thread-shift operation was hanging on the sync_event_base, which made it dependent on someone calling opal_progress. Unfortunately, a process in "sleep" or spinning outside the MPI library won't do that, and so we never complete errhandler registration.	2016-03-02 15:01:01 -08:00
Ralph Castain	64b7728f33	Fix typo - do not look at daemon job when considering completion of launch	2016-02-21 14:44:51 -08:00
Ralph Castain	d653cf2847	Convert the orte_job_data pointer array to a hash table so it doesn't grow forever as we run lots and lots of jobs in the persistent DVM.	2016-02-21 11:55:49 -08:00
Ralph Castain	6e68d758b9	Cleanup some valgrind complaints about jumps with uninitialized values. Fix a few IOF issues reported by Mark Santcroos when submitting jobs from tools. Add the ability to pass directives to the --output-filename option that tell ORTE to (a) not include the jobid in the path to the output files, and (b) not to copy the output to the tool (i.e., just store it in the files). ck Remove stale debug Fix a segfault if no subscribers are present	2016-02-18 16:30:37 -08:00
Ralph Castain	60a7bc2e50	Enable the PMIx notification callback system. This currently is only supported by the pmix120 component, which is not selected by default. All other components will ignore error registration requests, and thus do not support debugger attach when launched via mpirun. Note that direct launched applications will support such attachment, but may not do so in a scalable fashion. Fixes ##1225	2016-02-18 09:29:12 -08:00
Ralph Castain	e0de4423ba	Remove debug	2016-02-16 20:58:53 -08:00
Mark Santcroos	14f0390b7d	Release child object when we are recording someone's relatives. (Thanks to Mark Santcroos!) Release routing list entries. (Thanks to Mark Santcroos!) Address some Coverity concerns	2016-02-15 20:50:42 -08:00
Ralph Castain	06c3dfc052	Refactor the ORTE DVM code so that external codes can submit multiple jobs using only a single connection to the HNP. * Clean up the DVM so it continues to run even when applications error out and we would ordinarily abort the daemons. * Create a new errmgr component for the DVM to handle the differences. * Cleanup the DVM state component. * Add ORTE bindings directory and brief README * Pass a local tool index around to match jobs. * Pass the jobid on job completion. * Fix initialization logic. * Add framework for python wrapper. * Fix terminate-with-non-zero-exit behavior so it properly terminates only the indicated procs, notifies orte-submit, and orte-dvm continues executing. * Add some missing options to orte-dvm * Fix a bug in -host processing that caused us to ignore the #slots designator. Add a new attribute to indicate "do not expand the DVM" when submitting job spawn requests. * It actually makes no sense that we treat the termination of all children differently than terminating the children of a specific job - it only creates confusion over the difference in behavior. So terminate children the same way regardless. Extend the cmd_line utility to easily allow layering of command line definitions Catch up with ORTE interface change and make build more generic. Disable "fixed dvm" logic for now. Add another cmd_line function to merge a table of cmd line options with another one, reporting as errors any duplicate entries. Use this to allow orterun to reuse the orted_submit code Fix the "fixed_dvm" logic by ensuring we reset num_new_daemons to zero. Also ensure that the nidmap is sent with the first job so the downstream daemons get the node info. Remove a duplicate cmd line entry in orterun. Revise the DVM startup procedure to pass the nidmap only once, at the startup of the DVM. This reduces the overhead on each job launch and ensures that the nidmap doesn't get overwritten. Add new commands to get_orted_comm_cmd_str(). Move ORTE command line options to orte_globals.[ch]. Catch up with extra orte_submit_init parameter. Add example code. Add documentation. Bump version. The nidmap and routing data must be updated prior to propagating the xcast or else the xcast will fail. Fix the return code so it is something more expected when an error occurs. Ensure we get an error returned to us when we fail to launch for some reason. In this case, we will always get a launch_cb as we did indeed attempt to spawn it. The error code will be returned in the complete_cb. Fix the return code from orte_submit_job - it was returning the tracker index instead of "success". Take advantage of ORTE's pretty-print capabilities to provide a nice error output explaining why we failed to launch. Ensure we always get a launch_cb when we fail to launch, but no complete_cb as the job never launched. Extend the error reporting capability to job completion as well. Add index parameter to orte_submit_job(). Add orte_job_cancel and implement ORTE_DAEMON_TERMINATE_JOB_CMD. Factor out dvm termination. Parse the terminate option at tool level. Add error string for ORTE_ERR_JOB_CANCELLED. Add some safeguards. Cleanup and/of comments. Enable the return. Properly ORTE_DECLSPEC orte_submit_halt. Add orte_submit_halt and orte_submit_cancel to interface. Use the plm interface to terminate the job	2016-02-13 08:10:44 -08:00
Howard Pritchard	39367ca0bf	plm/alps: only use srun for Native SLURM Turns out that the way the SLURM plm works is not compatible with the way MPI processes on Cray XC obtain RDMA credentials to use the high speed network. Unlike with ALPS, the mpirun process is on the first compute node in the job. With the current PLM launch system, mpirun (HNP daemon) launches the MPI ranks on that node rather than relying on srun. This will probably require a significant amount of effort to rework to support Native SLURM on Cray XC's. As a short term alternative, have the alps plm (which gets selected by default again on Cray systems regardless of the launch system) check whether or not srun or alps is being used on the system. If alps is not being used, print a helpful message for the user and abort the job launch. Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2015-12-22 11:03:42 -08:00
Ralph Castain	64b695669a	Cleanup warnings in opal and orte layers when building optimized on Mac	2015-12-17 07:51:24 -08:00
Howard Pritchard	cb7c26ce96	plm/slurm: add support for cray native slurm Cray has added plugins to slurm to support the Cray programming env (alpslli, cray pmi, etc). Some of the workarounds needed with plm/alps to avoid issues with Cray PMI getting mixed up with orte launch system are also required in a cray native slurm environment. Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2015-12-08 13:47:20 -06:00
Howard Pritchard	9548b8a9e8	plm/alps: add wlm detect infrastructure Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2015-12-07 07:43:20 -08:00
Ralph Castain	986a8c1d48	If an executable isn't found, it's possible for the state machine to hit the grpcomm with a zero-node map before we actually terminate with error. Silence the annoying malloc warning about zero-byte requests. In a novm operation that only has the HNP, ensure the #nodes gets set Clean up the error reporting	2015-11-11 14:24:13 -08:00
Ralph Castain	24419b6523	Fix relative node syntax for dash-host option	2015-10-31 19:00:46 -07:00
Nathan Hjelm	8b5810f7f7	mca/base: add priority output to mca_base_select The mca_base_select function uses returned priorities to select the best component/module. This priority may be of use to the caller so pass that information back in an optional argument. If the priority is not needed pass NULL. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2015-10-19 12:32:41 -06:00
Ralph Castain	0140ff048d	Now that we have an "isolated" PLM component, we cannot just let rsh silently decline to run when it cannot find a launch agent - if we do, then we will -always- run on the local node. So if the user specifies a launch agent and we can't find it, then generate a pretty error message, report a fatal error back to the component select, and exit out. This required modifying the mca_component_select function to actually check the return code on a component query - it was blissfully ignoring it. Also do a little cleanup to avoid bombarding the user with multiple error messages. Thanks to Patrick Begou for reporting the problem	2015-09-24 07:16:48 -07:00
Ralph Castain	c1bbbb5e2f	Remove the last involvement of the OOB system from the MPI layer, remove the no-longer-needed usock/oob component, and have procs no longer open the RML, OOB, ROUTED, and GRPCOMM frameworks as PMIx now provides all required app-mpirun cmds	2015-09-15 13:08:35 -07:00
Ralph Castain	d97bc29102	Remove OPAL_HAVE_HWLOC qualifier and error out if --without-hwloc is given	2015-09-04 16:54:40 -07:00
Ralph Castain	cf6137b530	Integrate PMIx 1.0 with OMPI. Bring Slurm PMI-1 component online Bring the s2 component online Little cleanup - let the various PMIx modules set the process name during init, and then just raise it up to the ORTE level. Required as the different PMI environments all pass the jobid in different ways. Bring the OMPI pubsub/pmi component online Get comm_spawn working again Ensure we always provide a cpuset, even if it is NULL pmix/cray: adjust cray pmix component for pmix Make changes so cray pmix can work within the integrated ompi/pmix framework. Bring singletons back online. Implement the comm_spawn operation using pmix - not tested yet Cleanup comm_spawn - procs now starting, error in connect_accept Complete integration	2015-08-29 16:04:10 -07:00
Ralph Castain	0b1d4b62be	Cleanup some cruft and update to coordinate with CM operations: * don't pass --tree-spawn to the orted cmd line. If someone doesn't want tree-spawn, it shows up as an MCA param anyway * ensure state/orted component disqualifies itself from CM operations * clarify the DVM proc_type definitions * ensure we stop littering the tmp dir with session directories	2015-08-12 10:32:14 -07:00
Howard Pritchard	1b55d14dff	plm/alps: remove unneded env. variable setting In order to address issue #741, the orted's now are always launched with the Cray PMI environment variables PMI_NO_FORK PMI_NO_PREINITIALIZE set to disable running of the library's ctor. So there's no longer a need to set these for the application(s) being launched by the orted's. Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2015-08-05 13:27:18 -07:00
Ralph Castain	023936e84b	Silence coverity warnings	2015-07-29 07:28:08 -07:00
Howard Pritchard	70096d3753	plm/alps: fix orted based launch failures. Turns out that when one builds Open MPI with --disable-dlopen for Cray, a whole bunch of cray specific libraries get linked in to the orted executable. One of these is Cray PMI. The Cray PMI has a ctor which, if run, causes job launches using mpirun to fail. This commit suppresses the running of the ctor and thus prevents failure to launch. Signed-off-by: Howard Pritchard <howardp@lanl.gov>	2015-07-23 15:07:57 -07:00
Nathan Hjelm	4d92c9989e	more c99 updates This commit does two things. It removes checks for C99 required headers (stdlib.h, string.h, signal.h, etc). Additionally it removes definitions for required C99 types (intptr_t, int64_t, int32_t, etc). Signed-off-by: Nathan Hjelm <hjelmn@me.com>	2015-06-25 10:14:13 -06:00
Ralph Castain	869041f770	Purge whitespace from the repo	2015-06-23 20:59:57 -07:00
Ralph Castain	869b2891c4	When doing comm-spawn, track the last object we bound to and ensure that we start the next job on the next object so we avoid overload situations when they aren't necessary	2015-06-17 09:20:08 -07:00
Ralph Castain	c21cd1c91e	Ensure the ssh session is dead	2015-05-23 08:14:29 -07:00
Ralph Castain	920562d9b4	Ensure that all ssh sessions are terminated when abnormally terminating the job	2015-05-23 08:14:29 -07:00
Gilles Gouaillardet	2e384a3b65	initialize common symbols from orte A few uninitialized common symbols are remaining (generated by flex) : * orte/mca/rmaps/rank_file/rmaps_rank_file_lex.c: orte_rmaps_rank_file_leng * orte/mca/rmaps/rank_file/rmaps_rank_file_lex.c: orte_rmaps_rank_file_text * orte/util/hostfile/hostfile_lex.c: orte_util_hostfile_leng * orte/util/hostfile/hostfile_lex.c: orte_util_hostfile_text	2015-05-08 10:11:58 +09:00
Ralph Castain	8e3f0b1d33	Ensure the --tree-spawn option is inside any parens from the sh and ksh shell support	2015-05-06 15:18:15 -07:00
Ralph Castain	7d1980ba83	Add the ability to specify the number of desired slots in the --host option. Just giving a host name => one slot (multiple copies of the name yield one slot per copy). Giving "foo:3" indicates you want three slots - a shorthand notation for saying "foo" three times. Giving "foo:*" indicates you want the topology to set the number of slots based on the orte_set_slots param.	2015-04-30 20:35:23 -07:00
Jeff Squyres	11e8c2096b	plm rsh: assign some levels to the rsh PLM MCA params	2015-04-20 16:18:57 -07:00
Nathan Hjelm	45e053dbce	orte: use C99 subobject naming for component initialization This commit helps future-proof orte components by initializing each component member by name. Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>	2015-04-18 10:29:58 -06:00
Ralph Castain	34b53ac3dc	Silence Coverity warnings	2015-04-18 07:48:22 -07:00
Ralph Castain	12bfb27161	Redo in cleaner form: Per request from Andy Rieb, add ability to pass PATH and LD_LIBRARY_PATH elements to ssh command	2015-04-17 16:11:37 -07:00
Nathan Hjelm	3436f2917d	Merge pull request #449 from hjelmn/mca_base_update mca/base update	2015-04-16 08:41:48 -06:00
Ralph Castain	d9c555b547	Revert "Per request from Andy Rieb, add ability to pass PATH and LD_LIBRARY_PATH elements to ssh command" This reverts commit open-mpi/ompi@278324c52a. Revert "Add the ability to pass args to the rsh/ssh command line" This reverts commit open-mpi/ompi@6f227f8564.	2015-04-16 08:03:14 -06:00
Ralph Castain	278324c52a	Per request from Andy Rieb, add ability to pass PATH and LD_LIBRARY_PATH elements to ssh command	2015-04-15 20:30:04 -06:00
Ralph Castain	6f227f8564	Add the ability to pass args to the rsh/ssh command line	2015-04-15 20:07:13 -06:00
Ralph Castain	91e1cbf284	Init variable	2015-04-11 07:44:57 -07:00
Ralph Castain	3e44d3c9e3	Enable singletons to run without any active OOB module until they attempt to comm_spawn	2015-04-10 14:06:42 -07:00
Ralph Castain	9f8ae59162	Properly enclose the different && clauses	2015-04-01 18:48:25 -07:00
Ralph Castain	57c21d5209	Ensure the DVM flows thru the "daemons reported" state	2015-04-01 16:47:34 -07:00
Mike Dubman	58d002098b	Merge pull request #474 from elenash/master Introduce -tune command line option to set env vars and mca params from ...	2015-04-01 08:23:34 +03:00
Ralph Castain	6f9140a341	Add a little more debug to launch	2015-03-31 20:10:21 -07:00
Nathan Hjelm	b68d66bb9b	MCA: Add the project/project version to the MCA base component This commit adds support for project_framework_component_* parameter matching. This is the first step in allowing the same framework name in multiple projects. This change also bumps the MCA component version to 2.1.0. All master frameworks have been updated to use the new component versioning macro. An mca.h has been added to each project to add a project specific versioning macro of the form PROJECT_MCA_VERSION_2_1_0. Signed-off-by: Nathan Hjelm <hjelmn@me.com>	2015-03-27 10:59:04 -06:00
Elena	90f5b2bb84	Introduce -tune command line option to set env vars and mca params from file	2015-03-26 18:33:53 +02:00
Ralph Castain	6aa33deafb	Remove debug	2015-03-25 19:58:51 -07:00
Ralph Castain	6ba76ed8d8	Per user request, we allow -host to specify a host that is not included in a hostfile (however, we reject it if we were given an allocation by a resource manager). Since we cannot know if an IP addr form references the same node that was previously given as a string name, we have no choice but to assume they are different. Get the topology from the right place in that situation so mpirun can succeed.	2015-03-25 06:16:01 -07:00
Ralph Castain	43a3baad5e	Ensure we use the first compute node's topology for mapping Don't filter the topology by cpuset if you are mpirun until you know that no other compute nodes are involved. This deals with the corner case where mpirun is executing on a node of different topology from the compute nodes. Simplify - don't mandate that all cpus in the given cpuset be present on every node. We can then run everything thru the filter as before, which ensures that any procs run on mpirun are also contained within the specified cpuset. Correctly count the number of available PUs under each object when given a cpuset Fix the default binding settings, and correctly count PUs when no cpuset is given Ensure the binding policy gets set in all cases	2015-03-19 16:30:36 -07:00
Gilles Gouaillardet	2ab9a411f8	plm/base: fix misc memory leaks as reported by Coverity with CIDs 1196733 and 1196745	2015-03-09 16:25:07 +09:00
Gilles Gouaillardet	7de3f35b90	pml/rsh: fix misc memory leaks as reported by Coverity with CIDs 71091, 71230, 71231, 72274, 72389, 1196718 and 1196719	2015-03-05 20:03:37 +09:00
Jeff Squyres	05f00aface	plm base: ensure mca_base_var_get_value() and mca_base_var_find() succeed This was CID 993712	2015-02-24 15:48:50 -05:00
Jeff Squyres	e2223cd9bf	plm_rsh: ensure cwd array is \0-terminated This was CID 72257	2015-02-24 15:24:08 -05:00

1 2 3 4 5 ...

747 Коммитов