1
1

5497 Коммитов

Автор SHA1 Сообщение Дата
Ralph Castain
368684bd63 Revert e9bc293 and try a different approach for scalably dealing with hetero clusters. Have each orted send back its topo "signature". If mpirun detects that this signature has not been seen before, then ask for that daemon to send back its full topology description. This allows the system to only get the topology once for each unique topo in the cluster.
Cleanup a typo, and remove no longer needed MCA params for hetero nodes and hetero apps. Hetero nodes will always be automatically detected. We don't support a mix of 32 and 64 bit apps

Modify the orte_node_t to use orte_topology_t instead of hwloc_topology_t, updating all the places that use it. Ensure that we properly update topology when we see a different one on a compute node.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-01-18 10:22:15 -08:00
Ralph Castain
c8768e3dab Merge pull request #2740 from rhc54/topic/hnp
Add an MCA param "hnp_on_smgmt_node"
2017-01-17 05:49:51 -08:00
Ralph Castain
e9bc2934be Add an MCA param "hnp_on_smgmt_node" that mpirun can use to tell the orteds to ignore its topology signature as mpirun is executing on a system mgmt node, and hence a different topology than the compute nodes
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-01-16 19:32:01 -08:00
Ralph Castain
2fb9e7cc2b Add some missing qualifiers to the wrapper compilers for -lopen-rte and -lopen-pal
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-01-16 19:24:16 -08:00
Ralph Castain
74a285be83 Cancel the waitpid callback once the waitpid on a process has fired to avoid multiple notifications
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-01-16 14:32:02 -08:00
Ralph Castain
9e8c7d6295 Silence Coverity warning
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-01-15 07:51:37 -08:00
Ralph Castain
6b34cc67d6 Correct typo
Fixes #2691

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-01-15 07:48:31 -08:00
Ralph Castain
3a157f0496 One more time - we "push" IOF for stdout, stderr, and stddiag with separate calls. However, we were creating the sinks for all three of them each time, which caused them to leak. Create the sinks only once for each channel.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-01-14 17:40:36 -08:00
Ralph Castain
d9fc88c2c7 Merge pull request #2733 from rhc54/topic/dvm
Cleanup DVM leaks
2017-01-12 21:29:54 -08:00
Ralph Castain
5c87fc10dc Update the mpirun man page to accurately reflect the use of -host
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-01-12 20:28:53 -08:00
Ralph Castain
b55c03255a Strange - I had created a new IOF API "complete" for cleaning up at the end of jobs, but somehow the implementation is missing. It also appears that the orted's never actually cleaned up their job-related information. These things are fine for normal mpirun-based operations, but cause significant resource leaks for the DVM.
Complete the implementation and seal the leaks

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-01-12 19:54:18 -08:00
Ralph Castain
0e2df3be3e Missed one spot - plug fd leaks in orteds
Fixes #2691

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-01-12 13:45:46 -08:00
Ralph Castain
9ad02b5d13 Merge pull request #2718 from rhc54/topic/leaks
Don't remove the IOF framework's tracking info for a proc until the state machine tells it to do so.
2017-01-12 09:57:17 -08:00
Nathan Hjelm
110840fc87 ess/hnp: add support for forwarding additional signals (#2712)
* ess/hnp: add support for forwarding additional signals

This commit adds support to the hnp ess module to forward additional
signals beyond the default SIGUSR1, SIGUSR2, SIGSTP, and SIGCONT.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>

* Generalize this a bit to allow a broader range of signals to be forwarded. Turns out that SIGURG is now a "standard" signal, though the value differs across systems. So setup to forward it (and some friends) if they are defined. Allow users to provide the signal name (instead of the integer value) as the value of even the more common signals does vary across systems. Don't limit the number that can be supported.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>

* ess/hnp: fix some bugs in the signal forwarding code

This commit fixes two bugs:

 - signals_set needs to be set even if no signals are being
   forwarded. If it is not set we will SEGV in libevent if
   ess_hnp_forward_signals == none.

 - SIGTERM and SIGHUP are handled with a different type of handler. Do
   not allow the user to specify these to be forwarded.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>

* We are sure to get "dinged" if error messages aren't nicely output via show_help, so do so here

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-01-12 10:09:41 -07:00
Ralph Castain
fa419d3c0d Don't remove the IOF framework's tracking info for a proc until the
state machine tells it to do so. This plugs leaked file descriptors as
we were losing track prior to destructing the resources.

Fixes #2691

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-01-12 08:34:29 -08:00
Ralph Castain
aff3a00059 Protect default mapping/binding options for cases where no NUMA or
SOCKET objects exist - like VMs

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-01-11 09:44:44 -08:00
Ralph Castain
93e4935902 Be a tad more cautious before releasing objects when running in DVM mode
Fixes #2700

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-01-10 14:04:27 -08:00
Gilles Gouaillardet
44c1ff60f1 Merge pull request #2672 from ggouaillardet/topic/misc_memory_leaks
Plug misc memory leaks
2017-01-10 13:16:04 +09:00
Joshua Ladd
3e23380bba Merge pull request #2675 from artpol84/orte/state/exit_1_fix
orte/odls: Fix ORTE state machine for the non-zero exit case
2017-01-09 12:32:37 -05:00
Ralph Castain
e25e69dc2f Resolve Coverity issues
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-01-07 10:45:52 -08:00
Joshua Ladd
7fc9f9bbac Merge pull request #2620 from karasevb/fix_rmaps_mindist
rmaps/mindist: fix pmix errors
2017-01-06 17:26:48 -05:00
Ralph Castain
684e69695f Minor cleanups to eliminate warnings
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-01-06 08:44:10 -08:00
Artem Polyakov
3eb6c98542 orte/odls: Fix ORTE state machine for the non-zero exit case
This commit fixes rare race condition that occurs when the process
that is calling `exit(-1)` has delay between fd cleanup and actual
OS-level exit. This may happen if the process has some work to do
`on_exit()`.

**Problem description**:
Consider an application process that has called `exit(nonzero)`, it's
fd's was closed
but it's actual termination at OS level is delayed by some cleanups (eg.
in callbacks registered via `on_exit()`).
Observed sequence of events was the following:

* orted gets stdio disconnection and activating `IOF COMPLETE` state.
* parallel OOB disconnection causes `COMMUNICATION FAILURE` state to be
activated.
* during `COMMUNICATION FAILURE` processing `odls_base_default_wait_local_proc`
is called even though real waitpid wasn't yet called (code mentions that
waitpid might not be called for unspecified reason). Because of that real exit
code is unknown and set to 0. `odls_base_default_wait_local_proc` callback sees
`IOF COMPLETE` flag and in conjunction with 0-exit-code it activates
`WAITPID FIRED` state.
* processing of `WAITPID FIRED` leads to `NORMALLY TERMINATED` to be
activated.
* `NORMALLY TERMINATED` state in particular leads `ORTE_PROC_FLAG_ALIVE` flag
for this proc to be dropped.
* when application process finally exits and `wait_signal_callback` is
launched. It sets real exit code and calls `odls_base_default_wait_local_proc`
again but at this time since the process has `ORTE_PROC_FLAG_ALIVE` flag
dropped `WAITPID FIRED` state is activated (instead of `EXITED WITH NON-ZERO`)
leading to a hang that was observed.

Signed-off-by: Artem Polyakov <artpol84@gmail.com>
2017-01-06 11:12:55 +02:00
Gilles Gouaillardet
a1a0e324b3 util/hostfile: plug a memory leak
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2017-01-06 15:38:45 +09:00
Gilles Gouaillardet
6b9343a966 plm/rsh: plug a memory leak
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2017-01-06 15:38:45 +09:00
Gilles Gouaillardet
8ba92d7516 iof/base: plug a memory leak in orte_iof_base_close()
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2017-01-06 15:38:45 +09:00
Gilles Gouaillardet
e396b17a7f orte/orted: plug a memory leak
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2017-01-06 15:38:45 +09:00
Gilles Gouaillardet
6b90b03c28 orted/pmix: plug a memoy leak in pmix_server_fencenb_fn()
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2017-01-06 15:38:45 +09:00
Gilles Gouaillardet
7fe6840232 state/hnp: plug a memory leak
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2017-01-06 15:38:45 +09:00
Gilles Gouaillardet
4d58b8dcae ess/pmi: plug a memory leak
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2017-01-06 15:38:45 +09:00
Gilles Gouaillardet
c0c5dd8ccc orte: plug a memory leak in orte_rml.recv_cancel
do not invoke orte_rml.recv_cancel after the orte progress thread has gone

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2017-01-06 15:38:44 +09:00
Gilles Gouaillardet
17fac4bfd1 grpcomm/base: get rid of the seq_num field of the orte_grpcomm_signature_t struct
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2017-01-06 15:38:44 +09:00
Gilles Gouaillardet
fe25f50871 grpcomm/base: plug a memory leak on finalize
manually allocate sequence numbers to be stored into the
orte_grpcomm_base.sig_table hash table, and manually release
them on orte_grpcomm_base_close()

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2017-01-06 15:38:44 +09:00
Gilles Gouaillardet
a988ad24eb orte/runtime: plug a leak in orte_finalize()
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2017-01-06 15:38:44 +09:00
Gilles Gouaillardet
0ee5d56ab1 grpcomm/direct: plug a memory leak in barrier_release()
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2017-01-06 13:46:35 +09:00
Gilles Gouaillardet
f2d6584189 grpcomm/base: plug misc memory leaks
- add a destructor to orte_grpcomm_caddy_t in order to plug a memory leak

- plug a memory leak in barrier_release()

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2017-01-06 13:46:21 +09:00
Gilles Gouaillardet
c4a47ae9a9 orte/orted: plug misc memory leaks
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2017-01-06 11:35:59 +09:00
Gilles Gouaillardet
88535b6200 orte/util: revamp orte_attr_unload() to keep valgrind happy
reorder tests to avoid valgrind complaining about uninitialized variables

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2017-01-06 11:35:59 +09:00
Gilles Gouaillardet
58f2a764f9 ess/hnp: plug memory leaks
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2017-01-06 11:35:59 +09:00
Gilles Gouaillardet
24c61b0625 oob/tcp: plug a memory leak in mca_oob_tcp_component_lost_connection()
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2017-01-06 11:35:59 +09:00
Gilles Gouaillardet
c7d9e62d47 rml/base: plug a memory leak
add a destructor to orte_rml_send_request_t in order
to plug a memory leak

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2017-01-06 11:35:59 +09:00
Ralph Castain
6509f60929 Complete the memprobe support. This provides a new scaling tool called "mpi_memprobe" that samples the memory footprint of the local daemon and the client procs, and then reports the results. The output contains the footprint of the daemon on each node, plus the average footprint of the client procs on that node.
Samples are taken after MPI_Init, and then again after MPI_Barrier. This allows the user to see memory consumption caused by add_procs, as well as any modex contribution from forming connections if pmix_base_async_modex is given.

Using the probe simply involves executing it via mpirun, with however many copies you want per node. Example:

$ mpirun -npernode 2 ./mpi_memprobe
Sampling memory usage after MPI_Init
Data for node rhc001
	Daemon: 12.483398
	Client: 6.514648

Data for node rhc002
	Daemon: 11.865234
	Client: 4.643555

Sampling memory usage after MPI_Barrier
Data for node rhc001
	Daemon: 12.520508
	Client: 6.576660

Data for node rhc002
	Daemon: 11.879883
	Client: 4.703125

Note that the client value on node rhc001 is larger - this is where rank=0 is housed, and apparently it gets a larger footprint for some reason.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-01-05 10:32:17 -08:00
Ralph Castain
91d714fe93 Add flags to direct PMIx to only use one listener, but without directing which one (tcp or usock) to use. This allows the user to set PMIX_MCA_ptl in their environment to select the transport method.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-01-04 09:16:44 -08:00
Ralph Castain
9eab9a1ed3 Remove stale global variables
Revamp the event notification integration to rely on the PMIx event chaining and remove the duplicate chaining in OPAL. This ensures we get system-level events that target non-default handlers.

Restore the hostname entries for MPI-level error messages, but provide an MCA param (orte_hostname_cutoff) to remove them for large clusters where the memory footprint is problematic. Set the default at 1000 nodes in the job (not the allocation).

Begin first cut at memory profiler

Some minor cleanups of memprobe

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2017-01-02 14:04:24 -08:00
Ralph Castain
e8aea2ebfc Minor cleanups
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-12-30 16:19:42 -08:00
Ralph Castain
08c76a42bb Update to latest PMIx master
Signed-off-by: Ralph Castain <rhc@open-mpi.org>

Plug a minor memory leak. Tell the PMIx server not to create a dstore memory region for the daemon job as there is nobody to share it with.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>

Protect users of hwloc membind functions

Signed-off-by: Ralph Castain <rhc@open-mpi.org>

Update PMIx to include NULL string protection

Signed-off-by: Ralph Castain <rhc@open-mpi.org>

Update to PMIx master to include key overwrite protection

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-12-30 12:44:47 -08:00
Ralph Castain
fe68f23099 Only instantiate the HWLOC topology in an MPI process if it actually will be used.
There are only five places in the non-daemon code paths where opal_hwloc_topology is currently referenced:

* shared memory BTLs (sm, smcuda). I have added a code path to those components that uses the location string
  instead of the topology itself, if available, thus avoiding instantiating the topology

* openib BTL. This uses the distance matrix. At present, I haven't developed a method
  for replacing that reference. Thus, this component will instantiate the topology

* usnic BTL. Uses the distance matrix.

* treematch TOPO component. Does some complex tree-based algorithm, so it will instantiate
  the topology

* ess base functions. If a process is direct launched and not bound at launch, this
  code attempts to bind it. Thus, procs in this scenario will instantiate the
  topology

Note that instantiating the topology on complex chips such as KNL can consume
megabytes of memory.

Fix pernode binding policy

Properly handle the unbound case

Correct pointer usage

Do not free static error messages!

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-12-29 10:33:29 -08:00
Ralph Castain
52533f755e Remove debug
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-12-28 13:24:39 -08:00
Ralph Castain
3a2d6a5ab6 Begin to reduce reliance of application procs on the topology tree itself by having the daemon provide more detailed info. In this case, provide the topology description string so that procs can readily determine the number of types of objects on the node, and a "locality" string that describes which objects this process is executing upon. The latter allows a process to compute the objects of overlap between itself and another proc without consulting the topology tree.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-12-28 09:14:26 -08:00
Ralph Castain
7866bb1119 Add debug, cleanup cpus/rank
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
2016-12-27 21:25:52 -08:00