Now that the daemon calls remote_spawn itself, the "tree_spawn" command
and its associated command processing code are no longer needed: the HNP
no longer sends a tree-spawn message to the orted.
Thanks to Ralph for the guidance!
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
When the node regex is too long to be sent on the command line,
retrieve it from the parent first, and then spawn the remote orted.
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
This parameter sets the maximum length of a node regex that can
be passed on the orted command line.
For testing purposes, it can be set to zero to force orted to retrieve
the node regex from its parent.
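The decision described above can be sketched as follows. This is only an illustration: the helper name `regex_fits_on_cmdline` and the `max_len` argument are hypothetical stand-ins for the real parameter, and the strict `<` comparison is an assumption chosen so that a value of zero always forces retrieval from the parent.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not the actual ORTE code.
 * Returns true when the node regex is short enough to be passed on the
 * orted command line. max_len stands in for the parameter described above.
 * Assumption: the comparison is strict, so max_len == 0 never fits and the
 * orted always retrieves the regex from its parent. */
static bool regex_fits_on_cmdline(const char *regex, size_t max_len)
{
    if (NULL == regex) {
        return false;
    }
    return strlen(regex) < max_len;
}
```

With this sketch, `regex_fits_on_cmdline(regex, 0)` is false for every regex, which matches the testing use case of setting the parameter to zero.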
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Since open-mpi/ompi@8f496b01b7,
sstore_stage_local_compress_waitpid_cb is invoked with an orte_wait_tracker_t *,
which must be used to reach the orte_sstore_stage_local_app_snapshot_info_t *.
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Since open-mpi/ompi@8f496b01b7,
rsh_wait_daemon is invoked with an orte_wait_tracker_t *,
which must be used to reach the orte_plm_rsh_caddy_t *.
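A minimal sketch of the pattern, using simplified stand-in types rather than the real ORTE structures. It assumes the tracker carries the originally registered callback data in a `cbdata` member; the type names, field names, and helper below are hypothetical.

```c
#include <stddef.h>

/* Stand-in types: simplified, hypothetical versions of the ORTE structures. */
typedef struct {
    int   child_pid;   /* placeholder for the tracked child process */
    void *cbdata;      /* assumed: user data registered with the wait callback */
} wait_tracker_t;

typedef struct {
    int daemon_vpid;   /* placeholder field */
} rsh_caddy_t;

/* The callback argument is the tracker, not the caddy, so the caddy has to
 * be fetched through the tracker instead of casting the argument directly. */
static rsh_caddy_t *caddy_from_callback_arg(void *arg)
{
    wait_tracker_t *trk = (wait_tracker_t *)arg;
    if (NULL == trk) {
        return NULL;
    }
    return (rsh_caddy_t *)trk->cbdata;
}
```

Casting the callback argument straight to the caddy type would compile but read the wrong memory, which is the kind of bug this change addresses.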
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Set grp_local_rank to MPI_UNDEFINED before invoking
ompi_comm_nexcid() in order to benefit from the optimizations
introduced in open-mpi/ompi@68167ec879.
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
This change makes comparison of `mpi-f08-interfaces.F90` and
`pmpi-f08-interfaces.F90` easier.
Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
They were incorrectly changed to subroutines, in `pmpi` only,
in 258d1aa160.
Strictly speaking, this change breaks binary compatibility.
However, nobody used these subroutines, so nobody will be affected:
they were useless because they did not return the calculated value.
Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
This commit fixes an issue when a registration is created for a large
region and then invalidated while part of it is in use.
References #4509
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This fixes a regression in the sockets provider, which could return -EINTR
from fi_cq_read() when a syscall was interrupted. The error value is
currently interpreted as a fatal condition. Relax the rule so that we can retry
the fi_cq_read() operation.
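The retry idea can be sketched without libfabric as follows. The function-pointer type `cq_read_fn`, the wrapper `cq_read_retry`, the `max_retries` bound, and the `flaky_read` stub are all hypothetical; this is not the code of the actual fix, only an illustration of treating -EINTR as retryable rather than fatal.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical signature standing in for a completion-queue read that
 * returns a count on success or a negative errno value on failure. */
typedef ssize_t (*cq_read_fn)(void *cq, void *buf, size_t count);

/* Retry the read while it reports -EINTR, at most max_retries extra times.
 * Any other return value (success or a different error) is passed through. */
static ssize_t cq_read_retry(cq_read_fn read_fn, void *cq, void *buf,
                             size_t count, int max_retries)
{
    ssize_t rc;
    int attempts = 0;

    do {
        rc = read_fn(cq, buf, count);
    } while (-EINTR == rc && attempts++ < max_retries);

    return rc;
}

/* Stub used only to exercise the loop: cq points to an int counter; the
 * stub returns -EINTR until the counter reaches zero, then reports one
 * completion. */
static ssize_t flaky_read(void *cq, void *buf, size_t count)
{
    int *remaining = (int *)cq;
    (void)buf;
    (void)count;
    if (*remaining > 0) {
        (*remaining)--;
        return -EINTR;
    }
    return 1;
}
```

Bounding the number of retries is a design choice of this sketch; it keeps a persistently interrupted read from spinning forever while still letting the caller see -EINTR if the bound is exhausted.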
Signed-off-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@intel.com>
As documented in #4563 and #3697, there is an issue on ARM and
POWER platforms when the atomic fifo assembly isn't inlined,
which manifests as a hang. Document the issue and the
work-around until a proper fix is committed.
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
In both cases we were comparing with the wrong size: it should be either
the number of local processes or the number of nodes, not the size
of the communicator.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
If we detect that someone has given us an incorrect node name, provide a helpful message telling them so, as it is almost certainly a typo.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
This commit moves the backing files to /dev/shm to avoid limitations
that may be set on /tmp. The files are registered with pmix to ensure
they are cleaned up after an erroneous exit.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
(cherry picked from commit 48101278160672317ade352365592f56ef3b8977)
If available, have apps use the registration capability to clean up their session directories. Set up the capability for vader to register its shared memory file location, but let someone familiar with that code do so.
Final cleanup to track uid/gid and update the opal/pmix API to pass flags for ignore and for leaving the top directory alone.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
Somehow, the code for passing a daemon's parent was accidentally removed, thus breaking the tree-spawn callback sequence and causing all daemons to phone home directly. Note that this is noticeably slower than no-tree-spawn for small clusters, where direct ssh launch of the child daemons from the HNP doesn't overload the available file descriptors.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
'-' is neither an alpha character nor a digit, but it is a valid hostname
character and should be handled as an alpha character; otherwise, nodes
such as node-001 do not get "compressed" in the regex.
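A small sketch of the classification, assuming the regex generator first splits a hostname into a non-numeric prefix and a numeric suffix. The helpers `is_prefix_char` and `hostname_prefix_len` are hypothetical and are not the actual regex code.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: treat '-' like a letter when scanning a hostname. */
static bool is_prefix_char(char c)
{
    return isalpha((unsigned char)c) || '-' == c;
}

/* Length of the leading non-numeric prefix of a hostname,
 * e.g. the "node-" part of "node-001". */
static size_t hostname_prefix_len(const char *name)
{
    size_t i = 0;
    while ('\0' != name[i] && is_prefix_char(name[i])) {
        i++;
    }
    return i;
}
```

With '-' accepted as a prefix character, "node-001" splits into the prefix "node-" and the digits "001", so hosts that share the prefix can be grouped into a single range.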
Refs open-mpi/ompi#4621
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Pull in changes from the v2.0.x, v2.x, and v3.0.x release branches
so that master includes all items from published releases.
Signed-off-by: Brian Barrett <bbarrett@amazon.com>