1
1

199 Коммитов

Автор SHA1 Сообщение Дата
Ralph Castain
dee2d8646d Fix plm/rsh runtime check
Fix the check for rsh/ssh so we allow the check for SGE and LoadLeveler to occur if user doesn't specify their own launch agent. Fix a Coverity warning

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-02-13 16:54:03 -08:00
Ralph Castain
b59ae14a2a Fix static port and partial allocation operations
Fix static port wireup by recording the TCP port mpirun is using and correctly passing the regex of hosts to the daemons. Do a better job of closing sockets on failed connection attempts. Correctly identify the remote host in the associated error message.

Fix partial allocation operations by not attempting to set #slots on nodes that were not used, and thus don't have a daemon or topology assigned to them

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-01-28 10:09:44 -08:00
Ralph Castain
c803af5d3d Minor change to allow qrsh to tree spawn, if supported
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-01-27 16:34:08 -08:00
Ralph Castain
7c795f4416 If the HNP is going to request topology info, it cannot do so via a routed OOB message as the intervening daemons may not be ready. So disable routing until the VM is ready, and have daemons start routing as they receive the xcast launch msg (which includes the data they need to talk to their peers).
Do a little optimization and minimize recomputation of the routing plan.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-01-27 15:37:16 -08:00
Ralph Castain
d672fad849 Repair rsh/ssh tree spawn
Repair rsh/ssh tree spawn by unpacking and updating the nidmap in remote_spawn.

Add more specific error messages so the cause of a messaging problem is a little clearer. Remove some stale code. Ensure we stop trying to send a message after a few times.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-01-27 11:35:00 -08:00
Ralph Castain
86ab751c5e Next step in reducing launch time: begin reducing the size of the launch message itself. Start by expressing the daemon map as a set of three regular expression strings. On an 8k cluster, this reduces the nidmap contribution from over 200kBytes to 21 bytes in size.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-01-23 19:54:47 -08:00
Gilles Gouaillardet
6b9343a966 plm/rsh: plug a memory leak
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2017-01-06 15:38:45 +09:00
Ralph Castain
649301a3a2 Revise the routed framework to be multi-select so it can support the new conduit system. Update all calls to rml.send* to the new syntax. Define an orte_mgmt_conduit for admin and IOF messages, and an orte_coll_conduit for all collective operations (e.g., xcast, modex, and barrier).
Still not completely done as we need a better way of tracking the routed module being used down in the OOB - e.g., when a peer drops connection, we want to remove that route from all conduits that (a) use the OOB and (b) are routed, but we don't want to remove it from an OFI conduit.
2016-10-23 21:52:39 -07:00
Gilles Gouaillardet
1846c2d8ad plm/rsh: use an alternate port if the ORTE_NODE_PORT attribute is set 2016-10-19 16:18:52 +09:00
Ralph Castain
de7b1494d9 Clean out old cruft from the ORCM project 2016-09-21 00:13:30 -07:00
Gilles Gouaillardet
c09899f6af plm: plus resource leaks
as reported by Coverity with CIDs 72274 and 1196733
2016-09-07 10:08:44 +09:00
Jeff Squyres
71ec5cfb43 rsh: robustify the check for plm_rsh_agent default value
Don't strcmp against the default value -- the default value may change
over time.  Instead, check to see if the MCA var source is not
DEFAULT.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-08-16 06:58:20 -05:00
Ralph Castain
9f43db7303 Further cleanup getpwuid usage - try it first (unless completely disabled), and then silently failover to try other methods. 2016-08-15 07:51:36 -07:00
Ralph Castain
5717b75b45 Restore the rsh template creation code 2016-08-12 12:43:40 -07:00
Ralph Castain
1c44543854 If the ssh agent hasn't been given, then check for qrsh and friends 2016-08-12 07:46:39 -07:00
Jeff Squyres
6800ef9ec0 m4: rename OMPI_SUMMARY_* macros to OPAL_SUMMARY_*
These macros should really be named OPAL_SUMMARY_*; they're used in
all projects, and therefore should be in the lowest later project (OPAL).

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-04-20 08:40:00 -07:00
Jeff Squyres
48c650c47a configury: minor updates to config summary output 2016-03-10 13:02:52 -08:00
Ralph Castain
0140ff048d Now that we have an "isolated" PLM component, we cannot just let rsh silently decline to run when it cannot find a launch agent - if we do, then we will -always- run on the local node. So if the user specifies a launch agent and we can't find it, then generate a pretty error message, report a fatal error back to the component select, and exit out.
This required modifying the mca_component_select function to actually check the return code on a component query - it was blissfully ignoring it.

Also do a little cleanup to avoid bombarding the user with multiple error messages.

Thanks to Patrick Begou for reporting the problem
2015-09-24 07:16:48 -07:00
Ralph Castain
0b1d4b62be Cleanup some cruft and update to coordinate with CM operations:
* don't pass --tree-spawn to the orted cmd line. If someone doesn't want tree-spawn, it shows up as an MCA param anyway
* ensure state/orted component disqualifies itself from CM operations
* clarify the DVM proc_type definitions
* ensure we stop littering the tmp dir with session directories
2015-08-12 10:32:14 -07:00
Nathan Hjelm
4d92c9989e more c99 updates
This commit does two things. It removes checks for C99 required
headers (stdlib.h, string.h, signal.h, etc). Additionally it removes
definitions for required C99 types (intptr_t, int64_t, int32_t, etc).

Signed-off-by: Nathan Hjelm <hjelmn@me.com>
2015-06-25 10:14:13 -06:00
Ralph Castain
869041f770 Purge whitespace from the repo 2015-06-23 20:59:57 -07:00
Ralph Castain
c21cd1c91e Ensure the ssh session is dead 2015-05-23 08:14:29 -07:00
Ralph Castain
920562d9b4 Ensure that all ssh sessions are terminated when abnormally terminating the job 2015-05-23 08:14:29 -07:00
Ralph Castain
8e3f0b1d33 Ensure the --tree-spawn option is inside any parens from the sh and ksh shell support 2015-05-06 15:18:15 -07:00
Jeff Squyres
11e8c2096b plm rsh: assign some levels to the rsh PLM MCA params 2015-04-20 16:18:57 -07:00
Nathan Hjelm
45e053dbce orte: use C99 subobject naming for component initialization
This commit helps future-proof orte components by initializing each
component member by name.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-04-18 10:29:58 -06:00
Ralph Castain
34b53ac3dc Silence Coverity warnings 2015-04-18 07:48:22 -07:00
Ralph Castain
12bfb27161 Redo in cleaner form: Per request from Andy Rieb, add ability to pass PATH and LD_LIBRARY_PATH elements to ssh command 2015-04-17 16:11:37 -07:00
Nathan Hjelm
b68d66bb9b MCA: Add the project/project version to the MCA base component
This commit adds support for project_framework_component_* parameter
matching. This is the first step in allowing the same framework name
in multiple projects. This change also bumps the MCA component version
to 2.1.0.

All master frameworks have been updated to use the new component
versioning macro. An mca.h has been added to each project to add a
project specific versioning macro of the form
PROJECT_MCA_VERSION_2_1_0.

Signed-off-by: Nathan Hjelm <hjelmn@me.com>
2015-03-27 10:59:04 -06:00
Gilles Gouaillardet
7de3f35b90 pml/rsh: fix misc memory leaks
as reported by Coverity with CIDs 71091, 71230, 71231, 72274, 72389,
1196718 and 1196719
2015-03-05 20:03:37 +09:00
Jeff Squyres
e2223cd9bf plm_rsh: ensure cwd array is \0-terminated
This was CID 72257
2015-02-24 15:24:08 -05:00
Howard Pritchard
bf89131f9e add owner files to opa/ompi/orte mca directories
This commit adds an owner file in each of the component directories
for each framework.  This allows for a simple script to parse
the contents of the files and generate, among other things, tables
to be used on the project's wiki page.  Currently there are two
"fields" in the file, an owner and a status.  A tool to parse
the files and generate tables for the wiki page will be added
in a subsequent commit.
2015-02-22 15:10:23 -07:00
Ralph Castain
894acb0aa8 configury: new OPAL_SET_MCA_PREFIX/ORTE_SET_MCA_CMD_LINE_ID macros
These two macros set the MCA prefix and MCA cmd line id,
   respectively.  Specifically, MCA parameters will be named
   PREFIX<foo> in the environment, and the cmd line will use
   -ID foo bar.

   These macros must be called during configure.ac and a value
   supplied. In the case of Open MPI, the values given are
   PREFIX=OMPI_MCA_ and ID=mca.

   Other projects (such as ORCM) will call these macros with
   their own unique values.  For example, ORCM uses PREFIX=ORCM_MCA_
   and ID=omca

   This scheme is necessary to allow running Open MPI applications under
   systems that use their own versions of ORTE and OPAL.  For example,
   when running OMPI applications under ORCM, we need the MCA params passed
   to the ORCM daemons to be separated from those recognized by the OMPI application.
2014-10-22 18:57:40 -07:00
Ralph Castain
b6aa691e0a Fix incorrect implementation of new MCA param mca_base_env_list - it was not picking up envars and forwarding them, but only worked if you explicitly set a value for the envar. Ensure it works for both direct and indirect launch modes. Remove stale code as this replaced orte_forward_envars. Ensure it doesn't get passed to the ORTE daemons. 2014-10-16 12:58:56 -07:00
Jeff Squyres
e95ed94a94 plm_rsh_module.c: output to the framework output
Trivial fix from r32686: don't output to stream 0, but rather to
orte_plm_base_framework.framework_output (this is the way it was
before r32686).  In reality, this is going to end up being stream 0,
anyway, but we might as well be pedantically correct...

Refs trac:4897.

This commit was SVN r32726.

The following SVN revision numbers were found above:
  r32686 --> open-mpi/ompi@4df1aa63f7

The following Trac tickets were found above:
  Ticket 4897 --> https://svn.open-mpi.org/trac/ompi/ticket/4897
2014-09-13 00:46:35 +00:00
Ralph Castain
4df1aa63f7 Since we've run into the situation where someone puts a script wrapper around a launcher such as srun, we need to always protect MCA cmd line params with quotes. This means we also need to protect the backend from quotes coming into the system as part of a value, or else the parser gets confused.
So add a new function for wrapping MCA arguments, and tell the backend parser to ignore/remove leading/trailing quotes.

cmr=v1.8.3:reviewer=jsquyres

This commit was SVN r32686.
2014-09-08 20:38:46 +00:00
Ralph Castain
039b7acfb5 Fix the quoting algorithm so only rsh command lines get quoted values
cmr=v1.8.2:reviewer=jsquyres

This commit was SVN r32586.
2014-08-22 22:47:38 +00:00
Ralph Castain
aec5cd08bd Per the PMIx RFC:
WHAT:    Merge the PMIx branch into the devel repo, creating a new
               OPAL “lmix” framework to abstract PMI support for all RTEs.
               Replace the ORTE daemon-level collectives with a new PMIx
               server and update the ORTE grpcomm framework to support
               server-to-server collectives

WHY:      We’ve had problems dealing with variations in PMI implementations,
               and need to extend the existing PMI definitions to meet exascale
               requirements.

WHEN:   Mon, Aug 25

WHERE:  https://github.com/rhc54/ompi-svn-mirror.git

Several community members have been working on a refactoring of the current PMI support within OMPI. Although the APIs are common, Slurm and Cray implement a different range of capabilities, and package them differently. For example, Cray provides an integrated PMI-1/2 library, while Slurm separates the two and requires the user to specify the one to be used at runtime. In addition, several bugs in the Slurm implementations have caused problems requiring extra coding.

All this has led to a slew of #if’s in the PMI code and bugs when the corner-case logic for one implementation accidentally traps the other. Extending this support to other implementations would have increased this complexity to an unacceptable level.

Accordingly, we have:

* created a new OPAL “pmix” framework to abstract the PMI support, with separate components for Cray, Slurm PMI-1, and Slurm PMI-2 implementations.

* Replaced the current ORTE grpcomm daemon-based collective operation with an integrated PMIx server, and updated the grpcomm APIs to provide more flexible, multi-algorithm support for collective operations. At this time, only the xcast and allgather operations are supported.

* Replaced the current global collective id with a signature based on the names of the participating procs. The allows an unlimited number of collectives to be executed by any group of processes, subject to the requirement that only one collective can be active at a time for a unique combination of procs. Note that a proc can be involved in any number of simultaneous collectives - it is the specific combination of procs that is subject to the constraint

* removed the prior OMPI/OPAL modex code

* added new macros for executing modex send/recv to simplify use of the new APIs. The send macros allow the caller to specify whether or not the BTL supports async modex operations - if so, then the non-blocking “fence” operation is used, if the active PMIx component supports it. Otherwise, the default is a full blocking modex exchange as we currently perform.

* retained the current flag that directs us to use a blocking fence operation, but only to retrieve data upon demand

This commit was SVN r32570.
2014-08-21 18:56:47 +00:00
Jeff Squyres
1551339eba rsh: revert part of r32517: keep the quoting
As part of reviewing CMR #4860, I talked through r32517 with Ralph.

In attempt to fix various rsh quoting problems, r32517 removed all the
quoting from the main code path and then only added it back in at the
end in some cases.

This commit puts back the quoting parts that were removed in r32517
(r32517 fixed 2 other important bugs: a) change "--<foo>" to "--mca
<foo_equivalent> 1" so that de-duplication works, and b) change a !=
to ==).

refs trac:4860

This commit was SVN r32524.

The following SVN revision numbers were found above:
  r32517 --> open-mpi/ompi@7342bce58f

The following Trac tickets were found above:
  Ticket 4860 --> https://svn.open-mpi.org/trac/ompi/ticket/4860
2014-08-13 19:27:10 +00:00
Ralph Castain
7342bce58f Cleanup the over-aggressive quoting of params on the orted cmd line. Remove duplicates caused by passing on both cmd line shortcuts and the mca param version of the same thing.
Fixes trac:4857

cmr=v1.8.2:reviewer=jsquyres

This commit was SVN r32517.

The following Trac tickets were found above:
  Ticket 4857 --> https://svn.open-mpi.org/trac/ompi/ticket/4857
2014-08-13 03:51:04 +00:00
Ralph Castain
0cad281a92 Single-word cmd line values for orted are dealt with in orte_plm_base_orted_append_basic_args, so protect against special characters there. Have the rsh module only deal with multi-word arguments as those were skipped by orte_plm_base_orted_append_basic_args.
Refs trac:4802

This commit was SVN r32293.

The following Trac tickets were found above:
  Ticket 4802 --> https://svn.open-mpi.org/trac/ompi/ticket/4802
2014-07-23 17:06:51 +00:00
Ralph Castain
a94a97bd50 Cleanup the passing of MCA params on the orted cmd line in ssh by ensuring that we quote all values since they could be multi-word and/or contain special characters. Thanks to Dirk Schubert for pointing it out.
cmr=v1.8.2:reviewer=jsquyres

This commit was SVN r32280.
2014-07-22 18:22:06 +00:00
Ralph Castain
5216bd5558 Multiple sigchld reports can occur within a single event callback, so have to reap them until none remain. Also, need to ensure the daemon is flagged as alive prior to calling wait_cb
Refs trac:4717

This commit was SVN r32020.

The following Trac tickets were found above:
  Ticket 4717 --> https://svn.open-mpi.org/trac/ompi/ticket/4717
2014-06-17 18:46:40 +00:00
Ralph Castain
42bf7466fc This isn't as big a change as it appears - a change in one place caused a whole bunch of files to require updated #include's due to some arcane linkage. Rework the orte_wait code to reflect the introduction of the state machine. If we are in cleanup mode and just want to kill all our local children, then there is no reason to be polite about it as that introduces *very* long delays at scale. Just kill the procs and move on.
Refs trac:4717

This commit was SVN r32019.

The following Trac tickets were found above:
  Ticket 4717 --> https://svn.open-mpi.org/trac/ompi/ticket/4717
2014-06-17 17:57:51 +00:00
Ralph Castain
f1978fba7c Cleanup a set of typos on the orte_get_attribute call
This commit was SVN r31942.
2014-06-03 20:36:38 +00:00
Ralph Castain
5668f085a3 Silence some useless warnings, and fix a missed updated in the tm plm
This commit was SVN r31930.
2014-06-02 17:57:56 +00:00
Ralph Castain
8736a1c138 Per RFC:
http://www.open-mpi.org/community/lists/devel/2014/05/14822.php

Revamp the ORTE global data structures to reduce memory footprint and add new features. Add ability to control/set cpu frequency, though this can only be done if the sys admin has setup the system to support it (or you run as root).

This commit was SVN r31916.
2014-06-01 16:14:10 +00:00
Ralph Castain
5602156a1c Use the correct abstraction layer name for the data dirs
This commit was SVN r31684.
2014-05-08 14:32:24 +00:00
Ralph Castain
11faab1091 The final step of the RFC: convert the <foo>libdir and friends to fit their respective code areas, and equate them all at the top. Note that we can't entirely separate things as the opal_install_dirs framework can't handle separated locations for the various trees.
This commit was SVN r31679.
2014-05-08 02:01:35 +00:00
Ralph Castain
0aa23cdc35 Cleanup copy/paste errors to ensure we progress the launch
cmr=v1.7.5:reviewer=rhc

This commit was SVN r31102.
2014-03-18 01:24:49 +00:00