1
1

747 Коммитов

Автор SHA1 Сообщение Дата
Ralph Castain
105fb152e1 Silence Coverity warnings
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-03-13 08:38:51 -07:00
Ralph Castain
48fc339718 Create an alternative mapping method that pushes responsibility
onto the backend daemons. By default, let mpirun only pack the app_context
info and send that to the backend daemons where the mapping will
be done. This significantly reduces the computational time on mpirun as it isn't
running up/down the topology tree computing thousands of binding
locations, and it reduces the launch message to a very small number of
bytes.

When running -novm, fall back to the old way of doing things
where mpirun computes the entire map and binding, and then sends
the full info to the backend daemon.

Add a new cmd line option/mca param --fwd-mpirun-port that allows
mpirun to dynamically select a port, but then passes that back to
all the other daemons so they will use that port as a static port
for their own wireup. In this mode, we no longer "phone home" directly
to mpirun, but instead use the static port to wireup at daemon
start. We then use the routing tree to rollup the initial
launch report, and limit the number of open sockets on mpirun's node.

Update ras simulator to track the new nidmap code

Cleanup some bugs in the nidmap regex code, and enhance the error message for not enough slots to include the host on which the problem is found.

Update gadget platform file

Initialize the range count when starting a new range

Fix the no-np case in managed allocation

Ensure DVM node usage gets cleaned up after each job

Update scaling.pl script to use --fwd-mpirun-port. Pre-connect the daemon to its parent during launch while we are otherwise waiting for the daemon's children to send their "phone home" rollup messages

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-03-07 20:43:12 -08:00
Jeff Squyres
fec519a793 hwloc: rename opal/mca/hwloc/hwloc.h -> hwloc-internal.h
Per a prior commit, the presence of "hwloc.h" can cause ambiguity when
using --with-hwloc=external (i.e., whether to include
opal/mca/hwloc/hwloc.h or whether to include the system-installed
hwloc.h).

This commit:

1. Renames opal/mca/hwloc/hwloc.h to hwloc-internal.h.
2. Adds opal/mca/hwloc/autogen.options to tell autogen.pl to expect to
   find hwloc-internal.h (instead of hwloc.h) in opal/mca/hwloc.
3. s@opal/mca/hwloc/hwloc.h@opal/mca/hwloc/hwloc-internal.h@g in the
   rest of the code base.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2017-02-28 07:48:42 -08:00
Josh Hursey
a17b547430 Merge pull request #2957 from jjhursey/topic/ibm/rsh-sigint-fix
plm/rsh: Fix signal handling for rsh launcher
2017-02-14 15:29:00 -06:00
Joshua Hursey
843fcca03c plm/rsh: Fix signal handling for rsh launcher
* Similar to the other launchers (i.e., slurm, alps) we need to put the
   children in a separate process group to prevent SIGINT (from a CTRL-C)
   from being delivered to the whole process group and prematurely
   killing the rsh/ssh connections to the remote daemons.

Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
2017-02-14 08:54:17 -06:00
Ralph Castain
dee2d8646d Fix plm/rsh runtime check
Fix the check for rsh/ssh so we allow the check for SGE and LoadLeveler to occur if user doesn't specify their own launch agent. Fix a Coverity warning

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-02-13 16:54:03 -08:00
Ralph Castain
230d15f0d9 Cleanup the ras simulator capability, and the relay route thru grpcomm
direct. Don't resend wireup info if nothing has changed

Fix release of buffer

Correct the unpacking order

Fix the DVM - now minimized data transfer to it

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-02-01 15:01:58 -08:00
Ralph Castain
b59ae14a2a Fix static port and partial allocation operations
Fix static port wireup by recording the TCP port mpirun is using and correctly passing the regex of hosts to the daemons. Do a better job of closing sockets on failed connection attempts. Correctly identify the remote host in the associated error message.

Fix partial allocation operations by not attempting to set #slots on nodes that were not used, and thus don't have a daemon or topology assigned to them

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-01-28 10:09:44 -08:00
Ralph Castain
c803af5d3d Minor change to allow qrsh to tree spawn, if supported
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-01-27 16:34:08 -08:00
Ralph Castain
7c795f4416 If the HNP is going to request topology info, it cannot do so via a routed OOB message as the intervening daemons may not be ready. So disable routing until the VM is ready, and have daemons start routing as they receive the xcast launch msg (which includes the data they need to talk to their peers).
Do a little optimization and minimize recomputation of the routing plan.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-01-27 15:37:16 -08:00
Ralph Castain
d672fad849 Repair rsh/ssh tree spawn
Repair rsh/ssh tree spawn by unpacking and updating the nidmap in remote_spawn.

Add more specific error messages so the cause of a messaging problem is a little clearer. Remove some stale code. Ensure we stop trying to send a message after a few times.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-01-27 11:35:00 -08:00
Josh Hursey
2e64bf42fb Merge pull request #2810 from jjhursey/fix/ibm/stdiag-to-stdout
Extend options for stddiag routing
2017-01-26 14:29:16 -06:00
Ralph Castain
399de0738e Cleanup launch
Given that we only set OOB contact info from inside of events, or before we begin threaded operations (e.g., in the ess), allow set_contact_info to directly update the oob/base framework globals.

Correct the nidmap regex decompression routine.

Ensure that rank=1 daemon always sends back its topology as this is the most common use-case.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-01-25 22:06:09 -08:00
Ralph Castain
2f4e87eae9 Have rank=1 daemon always send its topology back as this is the most common use-case
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-01-25 09:33:11 -08:00
Jeff Squyres
230bbc597d plm base: make sure to assign "node" early enough
Make sure to assign "node" before using it in ORTE_FLAG_SET.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2017-01-25 08:02:59 -08:00
Joshua Hursey
dcd9801f7c orte/iof: Add orte_map_stddiag_to_stdout option
* Similar to `orte_map_stddiag_to_stderr` except it redirects `stddiag`
   to `stdout` instead of `stderr`.
 * Add protection so that the user canot supply both:
   - `orte_map_stddiag_to_stderr`
   - `orte_map_stddiag_to_stdout`

Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
2017-01-24 16:22:59 -06:00
Ralph Castain
86ab751c5e Next step in reducing launch time: begin reducing the size of the launch message itself. Start by expressing the daemon map as a set of three regular expression strings. On an 8k cluster, this reduces the nidmap contribution from over 200kBytes to 21 bytes in size.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-01-23 19:54:47 -08:00
Ralph Castain
368684bd63 Revert e9bc293 and try a different approach for scalably dealing with hetero clusters. Have each orted send back its topo "signature". If mpirun detects that this signature has not been seen before, then ask for that daemon to send back its full topology description. This allows the system to only get the topology once for each unique topo in the cluster.
Cleanup a typo, and remove no longer needed MCA params for hetero nodes and hetero apps. Hetero nodes will always be automatically detected. We don't support a mix of 32 and 64 bit apps

Modify the orte_node_t to use orte_topology_t instead of hwloc_topology_t, updating all the places that use it. Ensure that we properly update topology when we see a different one on a compute node.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-01-18 10:22:15 -08:00
Ralph Castain
e9bc2934be Add an MCA param "hnp_on_smgmt_node" that mpirun can use to tell the orteds to ignore its topology signature as mpirun is executing on a system mgmt node, and hence a different topology than the compute nodes
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-01-16 19:32:01 -08:00
Gilles Gouaillardet
6b9343a966 plm/rsh: plug a memory leak
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2017-01-06 15:38:45 +09:00
Gilles Gouaillardet
c0c5dd8ccc orte: plug a memory leak in orte_rml.recv_cancel
do not invoke orte_rml.recv_cancel after the orte progress thread has gone

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2017-01-06 15:38:44 +09:00
Ralph Castain
9eab9a1ed3 Remove stale global variables
Revamp the event notification integration to rely on the PMIx event chaining and remove the duplicate chaining in OPAL. This ensures we get system-level events that target non-default handlers.

Restore the hostname entries for MPI-level error messages, but provide an MCA param (orte_hostname_cutoff) to remove them for large clusters where the memory footprint is problematic. Set the default at 1000 nodes in the job (not the allocation).

Begin first cut at memory profiler

Some minor cleanups of memprobe

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-01-02 14:04:24 -08:00
Ralph Castain
791f4f1ce3 Adjust debug output for clarity
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-12-26 14:04:20 -08:00
Ralph Castain
79cde184ad Allow a PMIx tool to spawn a job
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-12-03 16:00:47 -08:00
Ralph Castain
d5fd635efe Bring forward the debugger-related changes
Refs https://github.com/open-mpi/ompi/pull/2425

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-11-29 13:15:20 -08:00
Howard Pritchard
2cbc0e8472 pmix/cray: fix disable-dlopen problem
PR open-mpi/ompi#2432 introduced a regression where configure
and build with --disable-dlopn caused build failure owing
to unresolved alps lli symbols in the libopal-pal shared library.

This commit fixes this problem.

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2016-11-21 13:45:10 -06:00
Ralph Castain
649301a3a2 Revise the routed framework to be multi-select so it can support the new conduit system. Update all calls to rml.send* to the new syntax. Define an orte_mgmt_conduit for admin and IOF messages, and an orte_coll_conduit for all collective operations (e.g., xcast, modex, and barrier).
Still not completely done as we need a better way of tracking the routed module being used down in the OOB - e.g., when a peer drops connection, we want to remove that route from all conduits that (a) use the OOB and (b) are routed, but we don't want to remove it from an OFI conduit.
2016-10-23 21:52:39 -07:00
Gilles Gouaillardet
1846c2d8ad plm/rsh: use an alternate port if the ORTE_NODE_PORT attribute is set 2016-10-19 16:18:52 +09:00
Ralph Castain
de7b1494d9 Clean out old cruft from the ORCM project 2016-09-21 00:13:30 -07:00
Gilles Gouaillardet
c09899f6af plm: plus resource leaks
as reported by Coverity with CIDs 72274 and 1196733
2016-09-07 10:08:44 +09:00
Ralph Castain
4e0788e9ad Enable PSM to support dynamic processes
Fix comm_spawn to correctly reference the actual parent process that requested the spawn when looking for the parent job object
2016-09-02 10:22:04 -07:00
Joshua Hursey
d26dd2c20e orte: Expand the application of !orte_keep_fqdn_hostnames
* Expand the use of the `orte_keep_fqdn_hostnames` MCA parameter when
   it is set to false.
 * If that parameter is set to false (default) then short hostnames
   (e.g., `node01`) will match with the long hostnames (e.g.,
   `node01.mycluster.org`). This allows a user (or resource manager)
    to mix the use of short and long hostnames.
  - Note that this mechanism does _not_ perform a DNS lookup, but
    instead strips off the FQDN by truncating the hostname string at
    the first `.` character (when not an IP address).
     - By default (`false`) the following is true:
       `node01 == node01.mycluster.org == node01.bogus.com`
       since we use `node01` as the hostname.
2016-08-26 16:09:04 -05:00
Jeff Squyres
71ec5cfb43 rsh: robustify the check for plm_rsh_agent default value
Don't strcmp against the default value -- the default value may change
over time.  Instead, check to see if the MCA var source is not
DEFAULT.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-08-16 06:58:20 -05:00
Ralph Castain
9f43db7303 Further cleanup getpwuid usage - try it first (unless completely disabled), and then silently failover to try other methods. 2016-08-15 07:51:36 -07:00
Ralph Castain
5717b75b45 Restore the rsh template creation code 2016-08-12 12:43:40 -07:00
Ralph Castain
1c44543854 If the ssh agent hasn't been given, then check for qrsh and friends 2016-08-12 07:46:39 -07:00
Ralph Castain
ddd0d05de3 Fix a bug in the handling of nper<foo> when -host or -hostfile was given. Correctly mark slots as "given" when we auto-assign them. Ensure we don't set the number of procs when using nper<foo> so the PPR mapper can correctly assing them. 2016-07-12 09:27:02 -07:00
Ralph Castain
dd0f843843 Fix rare hangs observed on OS-X by properly thread-shifting upcalls from the PMIx server into ORTE 2016-06-05 21:39:44 -07:00
Ralph Castain
3913595e10 Enable simulation of large-scale clusters by allowing multiple daemons/node. Specifying the ras_base_multiplier parameter to be greater than 1 will cause ORTE to replicate each allocated node by that factor. A daemon will be spawned for each replica, thus letting ORTE function as if it were on a much larger cluster.
Note that this cannot be used for MPI performance testing. It is really only useful for ORTE scaling tests. It also only works with the rsh/ssh launcher.
2016-05-29 18:56:18 -07:00
Ralph Castain
6ac7929bd0 Extend the schizo framework to allow definition of CLI options by environment. Refactor orterun to mesh with the orted_submit code, thus improving code reuse. Eliminate the orte-submit tool as orterun can now meet that need.
Cleanups per @jjhursey review
2016-05-01 11:30:25 -07:00
Jeff Squyres
6800ef9ec0 m4: rename OMPI_SUMMARY_* macros to OPAL_SUMMARY_*
These macros should really be named OPAL_SUMMARY_*; they're used in
all projects, and therefore should be in the lowest later project (OPAL).

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-04-20 08:40:00 -07:00
Ralph Castain
437f5b4289 Fix map-by node and do-not-launch 2016-04-13 09:21:19 -07:00
Ralph Castain
503e1274a9 Per the discussion on the telecon, change the -host behavior so we only run one instance if no slots were provided and the user didn't specify #procs to run. However, if no slots are given and the user does specify #procs, then let the number of slots default to the #found processing elements
Ensure the returned exit status is non-zero if we fail to map

If no -np is given, but either -host and/or -hostfile was given, then error out with a message telling the user that this combination is not supported.

If -np is given, and -host is given with only one instance of each host, then default the #slots to the detected #pe's and enforce oversubscription rules.

If -np is given, and -host is given with more than one instance of a given host, then set the #slots for that host to the number of times it was given and enforce oversubscription rules. Alternatively, the #slots can be specified via "-host foo:N". I therefore believe that row #7 on Jeff's spreadsheet is incorrect.

With that one correction, this now passes all the given use-cases on that spreadsheet.

Make things behave under unmanaged allocations more like their managed cousins - if the #slots is given, then no-np shall fill things up.

Fixes #1344
2016-03-29 11:21:57 -07:00
Howard Pritchard
69200e6229 plm/alps: fix usage of cray wlm_detect methods
Turns out there are some cases where the Cray
wlm_detect_get_active may return NULL, in which
case fallback to wlm_detect_get_default method
is suggested.  Make use of the fallback to
avoid segfaults under some circumstances in the
ALPS plm selection method.

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2016-03-22 11:40:56 -07:00
Ralph Castain
c146c4969b Revert part of open-mpi/ompi@c1bbbb5e2f to restore the usock component, thus fixing show_help aggregation.
Fixes #1467

Restore debugger attach operations

Fixes #1225
2016-03-18 21:49:04 -07:00
Jeff Squyres
48c650c47a configury: minor updates to config summary output 2016-03-10 13:02:52 -08:00
Ralph Castain
4a55fba414 Fix registration of error handlers thru the pmix120 component. A thread-shift operation was hanging on the sync_event_base, which made it dependent on someone calling opal_progress. Unfortunately, a process in "sleep" or spinning outside the MPI library won't do that, and so we never complete errhandler registration. 2016-03-02 15:01:01 -08:00
Ralph Castain
64b7728f33 Fix typo - do not look at daemon job when considering completion of launch 2016-02-21 14:44:51 -08:00
Ralph Castain
d653cf2847 Convert the orte_job_data pointer array to a hash table so it doesn't grow forever as we run lots and lots of jobs in the persistent DVM. 2016-02-21 11:55:49 -08:00
Ralph Castain
6e68d758b9 Cleanup some valgrind complaints about jumps with uninitialized values. Fix a few IOF issues reported by Mark Santcroos when submitting jobs from tools. Add the ability to pass directives to the --output-filename option that tell ORTE to (a) not include the jobid in the path to the output files, and (b) not to copy the output to the tool (i.e., just store it in the files).
ck

Remove stale debug

Fix a segfault if no subscribers are present
2016-02-18 16:30:37 -08:00