* ess/hnp: add support for forwarding additional signals
This commit adds support to the hnp ess module to forward additional
signals beyond the default SIGUSR1, SIGUSR2, SIGSTP, and SIGCONT.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
* Generalize this a bit to allow a broader range of signals to be forwarded. Turns out that SIGURG is now a "standard" signal, though the value differs across systems. So setup to forward it (and some friends) if they are defined. Allow users to provide the signal name (instead of the integer value) as the value of even the more common signals does vary across systems. Don't limit the number that can be supported.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
* ess/hnp: fix some bugs in the signal forwarding code
This commit fixes two bugs:
- signals_set needs to be set even if no signals are being
forwarded. If it is not set we will SEGV in libevent if
ess_hnp_forward_signals == none.
- SIGTERM and SIGHUP are handled with a different type of handler. Do
not allow the user to specify these to be forwarded.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
* We are sure to get "dinged" if error messages aren't nicely output via show_help, so do so here
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
Revamp the event notification integration to rely on the PMIx event chaining and remove the duplicate chaining in OPAL. This ensures we get system-level events that target non-default handlers.
Restore the hostname entries for MPI-level error messages, but provide an MCA param (orte_hostname_cutoff) to remove them for large clusters where the memory footprint is problematic. Set the default at 1000 nodes in the job (not the allocation).
Begin first cut at memory profiler
Some minor cleanups of memprobe
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
There are only five places in the non-daemon code paths where opal_hwloc_topology is currently referenced:
* shared memory BTLs (sm, smcuda). I have added a code path to those components that uses the location string
instead of the topology itself, if available, thus avoiding instantiating the topology
* openib BTL. This uses the distance matrix. At present, I haven't developed a method
for replacing that reference. Thus, this component will instantiate the topology
* usnic BTL. Uses the distance matrix.
* treematch TOPO component. Does some complex tree-based algorithm, so it will instantiate
the topology
* ess base functions. If a process is direct launched and not bound at launch, this
code attempts to bind it. Thus, procs in this scenario will instantiate the
topology
Note that instantiating the topology on complex chips such as KNL can consume
megabytes of memory.
Fix pernode binding policy
Properly handle the unbound case
Correct pointer usage
Do not free static error messages!
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
fix an issue that can be evidenced with two nodes
n0$ mpirun --host n1:1 --mca oob_tcp_static_ipv4_ports 1234 -np 1 --mca routed radix --mca oob tcp true
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
PR open-mpi/ompi#2432 introduced a regression where configure
and build with --disable-dlopn caused build failure owing
to unresolved alps lli symbols in the libopal-pal shared library.
This commit fixes this problem.
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
there is no need for a configure option as well - so remove the
--enable-orte-static-ports configure option. When decoding the daemon
nidmap, mark new daemons as ALIVE by default - we will discover dead
ones as we go.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
Still not completely done as we need a better way of tracking the routed module being used down in the OOB - e.g., when a peer drops connection, we want to remove that route from all conduits that (a) use the OOB and (b) are routed, but we don't want to remove it from an OFI conduit.
Multiple conduits can exist at the same time, and can even point to the same base transport. Each conduit can have its own characteristics (e.g., flow control) based on the info keys provided to the "open_conduit" call. For ease during the transition period, the "legacy" RML interfaces remain as wrappers over the new conduit-based APIs using a default conduit opened during orte_init - this default conduit is tied to the OOB framework so that current behaviors are preserved. Once the transition has been completed, a one-time cleanup will be done to update all RML calls to the new APIs and the "legacy" interfaces will be deleted.
While we are at it: Remove oob/usock component to eliminate the TMPDIR length problem - get all working, including oob_stress
Split process name variable "name" to
- "wildcard_rank" for the cases where wildcard is used.
- "pname" for the case where reference to particular process is needed.
Note that this cannot be used for MPI performance testing. It is really only useful for ORTE scaling tests. It also only works with the rsh/ssh launcher.