1
1
Граф коммитов

634 Коммитов

Автор SHA1 Сообщение Дата
Gilles Gouaillardet
799152e7fb plm/base: add the orte_plm_base_node_regex_threshold MCA parameter
This parameter can be used to set the node regex max length that can
be passed to the orted command line.
For testing purpose, it can be set to zero in order to force the node regex
being retrieved by orted from its parent.

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2018-01-04 09:33:46 +09:00
Ralph Castain
3269f2de66 Detect/warn of illegal node names
If we detect that someone has given us an incorrect node name, provide a helpful message telling them as it is almost certainly a typo.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-12-19 12:55:04 -08:00
Gilles Gouaillardet
f3e2a313af orte/nidmap: correctly handle '-' as a valid hostname character
'-' is not an alpha character nor a digit, but it is a valid hostname
character and should be handled as an alpha character, otherwise, nodes
such as node-001 do not get "compressed" in the regex.

Refs open-mpi/ompi#4621

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2017-12-15 15:28:50 +09:00
Ralph Castain
7c7d8a69a0 Backport changes from PMIx reference server
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-09-14 11:48:56 -07:00
Ralph Castain
2fbce9d93c Fix hostfile filtering in allocated environments to preserve slot assignments
Refs #3984

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-08-15 14:41:12 -07:00
Ralph Castain
b225366012 Bring the ofi/rml component online by completing the wireup protocol for the daemons. Cleanup the current confusion over how connection info gets created and
passed to make it all flow thru the opal/pmix "put/get" operations. Update the PMIx code to latest master to pickup some required behaviors.

Remove the no-longer-required get_contact_info and set_contact_info from the RML layer.

Add an MCA param to allow the ofi/rml component to route messages if desired. This is mainly for experimentation at this point as we aren't sure if routing wi
ll be beneficial at large scales. Leave it "off" by default.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-07-20 21:01:57 -07:00
Ralph Castain
543c16b28d Fix the isolated pmix component. Cleanup the ess/singleton component - we shouldn't be automatically discovering the local topology as that is now done on-demand.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-07-19 12:14:29 -07:00
Mark Allen
efc25168cd symbol name pollution: making some vars static
As part of addressing symbol name pollution, I'm switching a few
vars/functions to static.

Signed-off-by: Mark Allen <markalle@us.ibm.com>
2017-07-11 02:13:22 -04:00
Ralph Castain
168e50bc13 Also need to avoid calling destruct on the opal_process_info struct after finalize
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-06-23 07:49:14 -07:00
Ralph Castain
3af9344764 Remove stale field
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-06-23 06:22:31 -07:00
Ralph Castain
38636f4f0a Ensure we properly cleanup on termination, including when terminating due to ctrl-c
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-06-21 06:33:37 -07:00
Ralph Castain
952726c121 Update to latest PMIx master - equivalent to 2.0rc2. Update the thread support in the opal/pmix framework to protect the framework-level structures.
This now passes the loop test, and so we believe it resolves the random hangs in finalize.

Changes in PMIx master that are included here:

* Fixed a bug in the PMIx_Get logic
* Fixed self-notification procedure
* Made pmix_output functions thread safe
* Fixed a number of thread safety issues
* Updated configury to use 'uname -n' when hostname is unavailable

Work on cleaning up the event handler thread safety problem
Rarely used functions, but protect them anyway
Fix the last part of the intercomm problem
Ensure we don't cover any PMIx calls with the framework-level lock.
Protect against NULL argv comm_spawn

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-06-20 09:02:15 -07:00
Ralph Castain
00ba6a1be6 Protect against NULL topology
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-06-08 20:56:44 -07:00
Ralph Castain
93cf3c7203 Update OPAL and ORTE for thread safety
(I swear, if I look this over one more time, I'll puke)

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-06-06 12:30:57 -07:00
Ralph Castain
6b3bbd30c5 Clean up the conduit open code so we return detectable errors when conduit not opened.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-06-02 10:40:51 -07:00
Ralph Castain
87201a80ff Silence coverity warnings
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-05-27 11:45:53 -07:00
Ralph Castain
657e701c65 Add debug verbosity to the orte data server and pmix pub/lookup functions
Start updating the various mappers to the new procedure. Remove the stale lama component as it is now very out-of-date. Bring round_robin and PPR online, and modify the mindist component (but cannot test/debug it).

Remove unneeded test

Fix memory corruption by re-initializing variable to NULL in loop

Resolve the race condition identified by @ggouaillardet by resetting the
mapped flag within the same event where it was set. There is no need to
retain the flag beyond that point as it isn't used again.

Add a new job attribute ORTE_JOB_FULLY_DESCRIBED to indicate that all the job information (including locations and binding) is included in the launch message. Thus, the backend daemons do not need to do any map computation for the job. Use this for the seq, rankfile, and mindist mappers until someone decides to update them.

Note that this will maintain functionality, but means that users of those three mappers will see large launch messages and less performant scaling than those using the other mappers.

Have the mindist module add procs to the job's proc array as it is a fully described module

Protect the hnp-not-in-allocation case

Per path suggested by Gilles - protect the HNP node when it gets added in the absence of any other allocation or hostfile

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-05-25 18:41:27 -07:00
Ralph Castain
f47124e4d3 Finally fix the problem - the key was knowing there were more than 2 topologies involved, and that the HNP is not allocated. Give up on being cute and just search the darned list of topologies - there won't be that many, and if there are (so the scan takes awhile), then too bad.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-05-10 16:44:19 -07:00
Ralph Castain
55f4b825af Add verbose output to nidmap code for debugging as this is a new, and sometimes fragile, feature
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-05-10 12:40:02 -07:00
Ralph Castain
442e307a6e Fix the nidmap computation to deal with hetero nodes
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-05-10 08:43:28 -07:00
Ralph Castain
42d31454a5 Merge pull request #3469 from rhc54/topic/nidmap
Do not pass topologies during tree spawn of daemons as there is no wa…
2017-05-08 06:22:50 -07:00
Gilles Gouaillardet
e101f2b3f9 orte/util: fix vpids parsing in orte_util_nidmap_parse()
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2017-05-08 16:46:13 +09:00
Ralph Castain
180809f2ef Do not pass topologies during tree spawn of daemons as there is no way the HNP can know the backend topologies at that point. Any needed topologies will be sent along with the launch_apps command
Do not pass param file MCA params if the user has requested that no param files be read - required when trying to avoid launch time penalties from large numbers of processes reading default param files. The daemon picks them up and passes them along anyway, so it isn't clear what value we gain from having them all read the defaults

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-05-07 21:14:43 -07:00
Ralph Castain
b526bca56c Fix a potential segfault by avoiding NULL topologies prior to launching the VM.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-04-06 20:51:19 -07:00
Ralph Castain
a29ca2bb0d Enable slurm operations on Cray with constraints
Cleanup some errors in the nidmap code that caused us to send unnecessary topologies

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-04-06 08:58:06 -07:00
Ralph Castain
92c996487c Update how we pass the node regex so we pass _all_ nodes, even those without daemons. This allows the backend daemons to form a complete picture of the allocation. Include info on which nodes have daemons on them, and populate that info on the backend as well.
Set the daemons' state to "running" and mark them as "alive" by default when constructing the nidmap

Get the DVM running again

Fix direct modex by eliminating race condition caused by releasing data while sending it

Up the size limit before compressing

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-04-03 19:25:15 -07:00
Ralph Castain
d645557fa0 Update to include the PMIx 2.0 APIs for monitoring and job control. Include required integration, but leave the monitors off for now. Move the sensor framework out of ORTE as it is being absorbed into PMIx
Fix typo and silence warnings

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-03-21 17:47:08 -07:00
Ralph Castain
dc85e7fde7 Provide a little more help on the error messages when an executable isn't found so we have some better idea where we were looking for it. Don't double-report such errors. Ensure the ORTE_ERROR_NAME doesn't get a NULL back for the string name of an error code as that might cause some systems to segfault
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-03-17 09:54:37 -07:00
Ralph Castain
61a71e25ef Ensure the backend daemons know if we are in a managed allocation and if
the HNP was included in the allocation

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-03-14 10:06:43 -07:00
Ralph Castain
c6bc3ccb76 Sync to latest PMIx master and PMIx reference server
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-03-11 12:50:38 -08:00
Ralph Castain
48fc339718 Create an alternative mapping method that pushes responsibility
onto the backend daemons. By default, let mpirun only pack the app_context
info and send that to the backend daemons where the mapping will
be done. This significantly reduces the computational time on mpirun as it isn't
running up/down the topology tree computing thousands of binding
locations, and it reduces the launch message to a very small number of
bytes.

When running -novm, fall back to the old way of doing things
where mpirun computes the entire map and binding, and then sends
the full info to the backend daemon.

Add a new cmd line option/mca param --fwd-mpirun-port that allows
mpirun to dynamically select a port, but then passes that back to
all the other daemons so they will use that port as a static port
for their own wireup. In this mode, we no longer "phone home" directly
to mpirun, but instead use the static port to wireup at daemon
start. We then use the routing tree to rollup the initial
launch report, and limit the number of open sockets on mpirun's node.

Update ras simulator to track the new nidmap code

Cleanup some bugs in the nidmap regex code, and enhance the error message for not enough slots to include the host on which the problem is found.

Update gadget platform file

Initialize the range count when starting a new range

Fix the no-np case in managed allocation

Ensure DVM node usage gets cleaned up after each job

Update scaling.pl script to use --fwd-mpirun-port. Pre-connect the daemon to its parent during launch while we are otherwise waiting for the daemon's children to send their "phone home" rollup messages

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-03-07 20:43:12 -08:00
Jeff Squyres
fec519a793 hwloc: rename opal/mca/hwloc/hwloc.h -> hwloc-internal.h
Per a prior commit, the presence of "hwloc.h" can cause ambiguity when
using --with-hwloc=external (i.e., whether to include
opal/mca/hwloc/hwloc.h or whether to include the system-installed
hwloc.h).

This commit:

1. Renames opal/mca/hwloc/hwloc.h to hwloc-internal.h.
2. Adds opal/mca/hwloc/autogen.options to tell autogen.pl to expect to
   find hwloc-internal.h (instead of hwloc.h) in opal/mca/hwloc.
3. s@opal/mca/hwloc/hwloc.h@opal/mca/hwloc/hwloc-internal.h@g in the
   rest of the code base.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2017-02-28 07:48:42 -08:00
Ralph Castain
22c88f5ab5 Fix launch_id matching of -hosts
Need to check the entire value instead of just the last N digits. Otherwise, "-host 15" will match nid0015, nid0115, and any other launch id ending in 15

It appears strtol can return either a NULL or a zero-length string, so check for both cases

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-02-20 07:03:53 -08:00
Ralph Castain
bf0f274f06 Allow -host to look for the number of a host when running in a managed environment that supports launch id's. For example, this will allow someone who has been allocated a node of "nid0015" to refer to it with "-host 15".
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-02-17 18:35:54 -08:00
Gilles Gouaillardet
d4d4cab5bf orte/util: fix OPAL_HAVE_ZLIB usage
use #if instead of #ifdef

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2017-02-05 16:24:10 +01:00
Ralph Castain
b59ae14a2a Fix static port and partial allocation operations
Fix static port wireup by recording the TCP port mpirun is using and correctly passing the regex of hosts to the daemons. Do a better job of closing sockets on failed connection attempts. Correctly identify the remote host in the associated error message.

Fix partial allocation operations by not attempting to set #slots on nodes that were not used, and thus don't have a daemon or topology assigned to them

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-01-28 10:09:44 -08:00
Ralph Castain
7c795f4416 If the HNP is going to request topology info, it cannot do so via a routed OOB message as the intervening daemons may not be ready. So disable routing until the VM is ready, and have daemons start routing as they receive the xcast launch msg (which includes the data they need to talk to their peers).
Do a little optimization and minimize recomputation of the routing plan.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-01-27 15:37:16 -08:00
Ralph Castain
d672fad849 Repair rsh/ssh tree spawn
Repair rsh/ssh tree spawn by unpacking and updating the nidmap in remote_spawn.

Add more specific error messages so the cause of a messaging problem is a little clearer. Remove some stale code. Ensure we stop trying to send a message after a few times.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-01-27 11:35:00 -08:00
Ralph Castain
399de0738e Cleanup launch
Given that we only set OOB contact info from inside of events, or before we begin threaded operations (e.g., in the ess), allow set_contact_info to directly update the oob/base framework globals.

Correct the nidmap regex decompression routine.

Ensure that rank=1 daemon always sends back its topology as this is the most common use-case.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-01-25 22:06:09 -08:00
Ralph Castain
ef86707fbe Deprecate the --slot-list paramaeter in favor of --cpu-list. Remove the --cpu-set param (mark it as deprecated) and use --cpu-list instead as it was confusing having the two params. The --cpu-list param defines the cpus to be used by procs of this job, and the binding policy will be overlayed on top of it.
Note: since the discovered cpus are filtered against this list, #slots will be set to the #cpus in the list if no slot values are given in a -host or -hostname specification.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-01-24 13:33:22 -08:00
Ralph Castain
0bfdc0057a Extend the -host:N syntax to accept "*" or "auto" to indicate "auto-detect the #cpus and set #slots to that value"
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-01-24 10:21:01 -08:00
Ralph Castain
d3907dec98 Make master continue the -host behavior of prior releases: use of -host <foo> specifies a single slot. Requests to run more than one process will require either specifying slots using the "-host foo:N" syntax, or adding --oversubscribe to the cmd line.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-01-24 10:11:56 -08:00
Ralph Castain
86ab751c5e Next step in reducing launch time: begin reducing the size of the launch message itself. Start by expressing the daemon map as a set of three regular expression strings. On an 8k cluster, this reduces the nidmap contribution from over 200kBytes to 21 bytes in size.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-01-23 19:54:47 -08:00
Ralph Castain
6560617c04 Fix comm_spawn and orte-dvm by resetting all used "node mapped" flags after building the child list
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-01-22 05:55:53 -08:00
Ralph Castain
be3ef77739 Improve packing efficiency by raising the initial buffer size and modifying the extension code. Flag if a job map has had its nodes added so we don't have to loop repeatedly to check it.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-01-21 14:03:19 -08:00
Ralph Castain
668421b6ec Compress the xcast message if bigger than a defined size to further improve launch performance at scale
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-01-19 22:08:02 -08:00
Gilles Gouaillardet
a1a0e324b3 util/hostfile: plug a memory leak
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2017-01-06 15:38:45 +09:00
Gilles Gouaillardet
88535b6200 orte/util: revamp orte_attr_unload() to keep valgrind happy
reorder tests to avoid valgrind complaining about uninitialized variables

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2017-01-06 11:35:59 +09:00
Ralph Castain
6509f60929 Complete the memprobe support. This provides a new scaling tool called "mpi_memprobe" that samples the memory footprint of the local daemon and the client procs, and then reports the results. The output contains the footprint of the daemon on each node, plus the average footprint of the client procs on that node.
Samples are taken after MPI_Init, and then again after MPI_Barrier. This allows the user to see memory consumption caused by add_procs, as well as any modex contribution from forming connections if pmix_base_async_modex is given.

Using the probe simply involves executing it via mpirun, with however many copies you want per node. Example:

$ mpirun -npernode 2 ./mpi_memprobe
Sampling memory usage after MPI_Init
Data for node rhc001
	Daemon: 12.483398
	Client: 6.514648

Data for node rhc002
	Daemon: 11.865234
	Client: 4.643555

Sampling memory usage after MPI_Barrier
Data for node rhc001
	Daemon: 12.520508
	Client: 6.576660

Data for node rhc002
	Daemon: 11.879883
	Client: 4.703125

Note that the client value on node rhc001 is larger - this is where rank=0 is housed, and apparently it gets a larger footprint for some reason.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-01-05 10:32:17 -08:00
Ralph Castain
2af677b1cf Ensure that we don't bind-by-default in an oversubscribed condition
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-12-15 07:58:52 -08:00