1
1

153 Коммитов

Автор SHA1 Сообщение Дата
Ralph Castain
a63904d47f Updates to support cross-version operations with OMPI v2.x
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-10-22 08:38:33 -07:00
Ralph Castain
6ffb0d0507 Ensure that the pmix server system-level rendezvous file is only output by the HNP as (at least for slurm on cray) a daemon could be colocated with the HNP and overwrite the file. Update the scaling.pl script to only use the system-level rendezvous so it doesn't get rejected by a colocated daemon
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-10-14 10:16:49 -07:00
Ralph Castain
d5ce3c38e1 Begin cleaning up debugger support
Debugger daemons do not count against available slots. Clean up some leftover errors from the upgrade to HWLOC 2 in the mappers. Properly flag debugger jobs that come in via PMIx.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-09-27 16:18:43 -05:00
Ralph Castain
fe9b584c05 Fully support OMPI spawn options. Fix a bug in the round-robin mappers where we weren't adding nodes to the job map node array, and so resources were not released
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit 285d8cfef74ffc899e9c51e1d9c597b7fb2ceb89)
2017-09-21 10:29:27 -07:00
Ralph Castain
5708872112 Implement support for "local" range when publishing data
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
(cherry picked from commit 2d54f7e0dd3a47260b0b2634aae3361316005933)
2017-09-18 19:34:08 -07:00
Ralph Castain
3c914a7a97 Complete the fix of the ORTE DVM. We will now use "prun" instead of "orterun -hnp foo" to execute jobs. This provides the feature of automatic discovery of the orte-dvm so you don't need to manually enter URI's or contact file locations. All IO is forwarded to prun.
Still in the "needs to be done" category:

* mapping/ranking/binding options aren't correctly supported

* if the DVM encounters some errors (e.g., not enough resources for the job), the resulting error is globally set and impacts any subsequent job submission

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-09-16 13:13:07 -07:00
Ralph Castain
7c7d8a69a0 Backport changes from PMIx reference server
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-09-14 11:48:56 -07:00
Ralph Castain
3477079804 Repair the ORTE DVM
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-09-11 17:38:21 -07:00
Joshua Hursey
420ca65f4f orte/pmix: Always seed environment with global rank
* Even if we are only launching one app context, we might call spawn
   later and the remote groups might want their global rank information.

Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
2017-09-08 08:53:49 -05:00
Ralph Castain
d80b0c7990 If the HWLOC shared memory system is unable to connect, then fallback to providing the topology via XML. Do not automatically provide the XML to every process as that defeats the purpose of the shared memory system. Instead, use PMIx_Query_info_nb to get the info from the server when required.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-08-22 18:12:26 -07:00
Ralph Castain
d85239e052 Cleanup some issues in connect/accept support across jobs started by different mpirun commands. Still not fully operational, but someone else will have to finish debugging it
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-08-17 11:58:48 -07:00
Ralph Castain
65fb6070d9 Update tool support by adding MCA params to direct orted's to drop
session and/or system-level tool rendezous files. Ensure PMIx is
enabled for tools

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-08-15 17:49:47 -07:00
Ralph Castain
edccb258cb Provide the mapping, ranking, binding patterns
Apps might want to make use of the relative patterns used to place/assign their procs

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-08-09 11:34:43 -07:00
Ralph Castain
af85e48dd7 Silence Coverity warning, silence pmix_error_log of success
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-07-21 15:33:16 -07:00
Ralph Castain
b225366012 Bring the ofi/rml component online by completing the wireup protocol for the daemons. Cleanup the current confusion over how connection info gets created and
passed to make it all flow thru the opal/pmix "put/get" operations. Update the PMIx code to latest master to pickup some required behaviors.

Remove the no-longer-required get_contact_info and set_contact_info from the RML layer.

Add an MCA param to allow the ofi/rml component to route messages if desired. This is mainly for experimentation at this point as we aren't sure if routing wi
ll be beneficial at large scales. Leave it "off" by default.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-07-20 21:01:57 -07:00
Artem Polyakov
79c10c884d orte/pmix/server: Fix direct modex response with error status
`send_error()` is only packing status and peer info in the reply.
While remote counterpart in `pmix_server_dmdx_resp()` expects
the "hotel room number" to proceed correctly.

Signed-off-by: Artem Polyakov <artpol84@gmail.com>
2017-07-20 23:50:57 +07:00
Gilles Gouaillardet
60aa9cfcb6 hwloc: add support for hwloc v2 API
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2017-07-20 17:39:44 +09:00
Ralph Castain
f4411c4393 Enable use of OFI fabrics for launch and other collective operations. Update the PMIx repo to the latest master to get the required support for the server to "push" modex info, and to retrieve all its own "modex" values for sending back to mpirun. Have mpirun cache them in its local modex hash as OFI goes point-to-point direct and doesn't route - so the remote daemons don't need a copy of this connection info.
Remove the opal_ignore from the RML/OFI component, but disable that component unless the user specifically requests it via the "rml_ofi_desired=1" MCA param. This will let us test compile in various environments without interfering with operations while we continue to debug

Fix an error when computing the number of infos during server init

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-06-23 19:57:21 -07:00
Ralph Castain
952726c121 Update to latest PMIx master - equivalent to 2.0rc2. Update the thread support in the opal/pmix framework to protect the framework-level structures.
This now passes the loop test, and so we believe it resolves the random hangs in finalize.

Changes in PMIx master that are included here:

* Fixed a bug in the PMIx_Get logic
* Fixed self-notification procedure
* Made pmix_output functions thread safe
* Fixed a number of thread safety issues
* Updated configury to use 'uname -n' when hostname is unavailable

Work on cleaning up the event handler thread safety problem
Rarely used functions, but protect them anyway
Fix the last part of the intercomm problem
Ensure we don't cover any PMIx calls with the framework-level lock.
Protect against NULL argv comm_spawn

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-06-20 09:02:15 -07:00
Ralph Castain
93cf3c7203 Update OPAL and ORTE for thread safety
(I swear, if I look this over one more time, I'll puke)

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-06-06 12:30:57 -07:00
Ralph Castain
9a8811a246 Ensure that data from a job that was stored in ompi-server is purged once that job completes. Cleanup a few typos. Silence a Coverity warning
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-05-30 09:43:01 -07:00
Ralph Castain
9f60cd0fe7 Update the connect/accept support so we check to see if we have the proper infrastructure and RTE support, including whether we have ompi-server available if the connect/accept spans multiple applications. Print pretty help messages in all cases where we do not have support
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-05-27 10:47:08 -07:00
Ralph Castain
8c2a06477c Fix ompi-server operations
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-05-26 08:57:55 -07:00
Ralph Castain
657e701c65 Add debug verbosity to the orte data server and pmix pub/lookup functions
Start updating the various mappers to the new procedure. Remove the stale lama component as it is now very out-of-date. Bring round_robin and PPR online, and modify the mindist component (but cannot test/debug it).

Remove unneeded test

Fix memory corruption by re-initializing variable to NULL in loop

Resolve the race condition identified by @ggouaillardet by resetting the
mapped flag within the same event where it was set. There is no need to
retain the flag beyond that point as it isn't used again.

Add a new job attribute ORTE_JOB_FULLY_DESCRIBED to indicate that all the job information (including locations and binding) is included in the launch message. Thus, the backend daemons do not need to do any map computation for the job. Use this for the seq, rankfile, and mindist mappers until someone decides to update them.

Note that this will maintain functionality, but means that users of those three mappers will see large launch messages and less performant scaling than those using the other mappers.

Have the mindist module add procs to the job's proc array as it is a fully described module

Protect the hnp-not-in-allocation case

Per path suggested by Gilles - protect the HNP node when it gets added in the absence of any other allocation or hostfile

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-05-25 18:41:27 -07:00
Artem Polyakov
4af7a0827f orte/pmix: Do not set orted exit status to one from proc abort
The fact that application proc called Abort (read failed) doesn't
mean that ORTE subsystem has failed - vice versa it does it's work
to gracefuly exit the whole application.

orted exiting with non-zero status creates a problem for at least
plm/slurm environments where orteds are launched via `srun` with
"--kill-on-bad-exit" flag. If one of orteds has exited with non-
zero status slurm will immediately kill all other orteds. As the
result we see a lot of leftover in the `/tmp` directory.

Signed-off-by: Artem Polyakov <artpol84@gmail.com>
2017-04-13 01:37:36 +07:00
Ralph Castain
db8943cedd Provide further (hopefully) helpful messages about the hotel size
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-04-05 04:27:32 -07:00
Ralph Castain
b7e9711f45 Resolve the direct modex race condition. The request hotel was running out of rooms, thereby returning an error upon checkin - and we had missed error_logging a couple of those places. Hence no error message and things just hung.
Output a (hopefully) helpful message when we timeout an operation

Thanks to Nathan for tracking it down.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-04-04 21:32:44 -07:00
Ralph Castain
734b90aa6b Adjust the timeout for direct modex requests to reflect the size of the job. It can take several seconds to start all the procs, and we don't want to timeout due to differences in start times of the various procs
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-04-04 18:20:51 -07:00
Ralph Castain
d645557fa0 Update to include the PMIx 2.0 APIs for monitoring and job control. Include required integration, but leave the monitors off for now. Move the sensor framework out of ORTE as it is being absorbed into PMIx
Fix typo and silence warnings

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-03-21 17:47:08 -07:00
Ralph Castain
c6bc3ccb76 Sync to latest PMIx master and PMIx reference server
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-03-11 12:50:38 -08:00
Jeff Squyres
fec519a793 hwloc: rename opal/mca/hwloc/hwloc.h -> hwloc-internal.h
Per a prior commit, the presence of "hwloc.h" can cause ambiguity when
using --with-hwloc=external (i.e., whether to include
opal/mca/hwloc/hwloc.h or whether to include the system-installed
hwloc.h).

This commit:

1. Renames opal/mca/hwloc/hwloc.h to hwloc-internal.h.
2. Adds opal/mca/hwloc/autogen.options to tell autogen.pl to expect to
   find hwloc-internal.h (instead of hwloc.h) in opal/mca/hwloc.
3. s@opal/mca/hwloc/hwloc.h@opal/mca/hwloc/hwloc-internal.h@g in the
   rest of the code base.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2017-02-28 07:48:42 -08:00
Ralph Castain
68b53e2179 Fix comm_spawn by registering nspace info only when needed - either when we have local procs, or when job-level info is required by connecting jobs
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-02-14 19:47:56 -08:00
Ralph Castain
060cc09474 Revert "orte: Fix MPI_Spawn"
This reverts commit 9f7e2098ac639bc42ec0e02150d1d5d488b160c7.
2017-02-14 13:32:28 -08:00
Artem Polyakov
9f7e2098ac orte: Fix MPI_Spawn
Register namespace even if there is no node-local processes that
belongs to it. We need this for the MPI_Spawn case.

Addressing https://github.com/open-mpi/ompi/issues/2920.
Was introduced in be3ef777392347aa4560fb4eaa13075d2e77ed6e.

Signed-off-by: Artem Polyakov <artpol84@gmail.com>
2017-02-04 12:07:00 +07:00
Ralph Castain
b59ae14a2a Fix static port and partial allocation operations
Fix static port wireup by recording the TCP port mpirun is using and correctly passing the regex of hosts to the daemons. Do a better job of closing sockets on failed connection attempts. Correctly identify the remote host in the associated error message.

Fix partial allocation operations by not attempting to set #slots on nodes that were not used, and thus don't have a daemon or topology assigned to them

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-01-28 10:09:44 -08:00
Ralph Castain
be3ef77739 Improve packing efficiency by raising the initial buffer size and modifying the extension code. Flag if a job map has had its nodes added so we don't have to loop repeatedly to check it.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-01-21 14:03:19 -08:00
Gilles Gouaillardet
44c1ff60f1 Merge pull request #2672 from ggouaillardet/topic/misc_memory_leaks
Plug misc memory leaks
2017-01-10 13:16:04 +09:00
Ralph Castain
e25e69dc2f Resolve Coverity issues
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-01-07 10:45:52 -08:00
Gilles Gouaillardet
6b90b03c28 orted/pmix: plug a memoy leak in pmix_server_fencenb_fn()
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2017-01-06 15:38:45 +09:00
Ralph Castain
6509f60929 Complete the memprobe support. This provides a new scaling tool called "mpi_memprobe" that samples the memory footprint of the local daemon and the client procs, and then reports the results. The output contains the footprint of the daemon on each node, plus the average footprint of the client procs on that node.
Samples are taken after MPI_Init, and then again after MPI_Barrier. This allows the user to see memory consumption caused by add_procs, as well as any modex contribution from forming connections if pmix_base_async_modex is given.

Using the probe simply involves executing it via mpirun, with however many copies you want per node. Example:

$ mpirun -npernode 2 ./mpi_memprobe
Sampling memory usage after MPI_Init
Data for node rhc001
	Daemon: 12.483398
	Client: 6.514648

Data for node rhc002
	Daemon: 11.865234
	Client: 4.643555

Sampling memory usage after MPI_Barrier
Data for node rhc001
	Daemon: 12.520508
	Client: 6.576660

Data for node rhc002
	Daemon: 11.879883
	Client: 4.703125

Note that the client value on node rhc001 is larger - this is where rank=0 is housed, and apparently it gets a larger footprint for some reason.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-01-05 10:32:17 -08:00
Ralph Castain
91d714fe93 Add flags to direct PMIx to only use one listener, but without directing which one (tcp or usock) to use. This allows the user to set PMIX_MCA_ptl in their environment to select the transport method.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-01-04 09:16:44 -08:00
Ralph Castain
9eab9a1ed3 Remove stale global variables
Revamp the event notification integration to rely on the PMIx event chaining and remove the duplicate chaining in OPAL. This ensures we get system-level events that target non-default handlers.

Restore the hostname entries for MPI-level error messages, but provide an MCA param (orte_hostname_cutoff) to remove them for large clusters where the memory footprint is problematic. Set the default at 1000 nodes in the job (not the allocation).

Begin first cut at memory profiler

Some minor cleanups of memprobe

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-01-02 14:04:24 -08:00
Ralph Castain
fe68f23099 Only instantiate the HWLOC topology in an MPI process if it actually will be used.
There are only five places in the non-daemon code paths where opal_hwloc_topology is currently referenced:

* shared memory BTLs (sm, smcuda). I have added a code path to those components that uses the location string
  instead of the topology itself, if available, thus avoiding instantiating the topology

* openib BTL. This uses the distance matrix. At present, I haven't developed a method
  for replacing that reference. Thus, this component will instantiate the topology

* usnic BTL. Uses the distance matrix.

* treematch TOPO component. Does some complex tree-based algorithm, so it will instantiate
  the topology

* ess base functions. If a process is direct launched and not bound at launch, this
  code attempts to bind it. Thus, procs in this scenario will instantiate the
  topology

Note that instantiating the topology on complex chips such as KNL can consume
megabytes of memory.

Fix pernode binding policy

Properly handle the unbound case

Correct pointer usage

Do not free static error messages!

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-12-29 10:33:29 -08:00
Ralph Castain
52533f755e Remove debug
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-12-28 13:24:39 -08:00
Ralph Castain
3a2d6a5ab6 Begin to reduce reliance of application procs on the topology tree itself by having the daemon provide more detailed info. In this case, provide the topology description string so that procs can readily determine the number of types of objects on the node, and a "locality" string that describes which objects this process is executing upon. The latter allows a process to compute the objects of overlap between itself and another proc without consulting the topology tree.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-12-28 09:14:26 -08:00
Ralph Castain
ea133206ec Sync the internal OMPI component to PMIx master
Update external PMIx v2.x component
Add missing Makefile

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-12-19 19:14:16 -08:00
Ralph Castain
c6f6f40529 Transfer debugger support changes
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-12-17 18:14:46 -08:00
Ralph Castain
269753f5c1 Transfer back changes from debugger attach work
Silence warning

Remove debug

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-12-17 10:00:52 -08:00
Ralph Castain
884fb7fcf2 Update the PMIx2 support to include the latest shared memory optimizations
Update ORTE support for dynamic PMIx operations e.g., PMIx_Spawn
Update to track master
Ensure that --disable-pmix-dstore actually disables the dstore. Sync to a few debugger updates

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-12-14 15:00:10 -08:00
rhc54
341ab683de Merge pull request #2532 from rhc54/topic/pmixptl
Update to latest PMIx master + PTL branch
2016-12-07 17:28:22 -08:00