onto the backend daemons. By default, let mpirun only pack the app_context
info and send that to the backend daemons where the mapping will
be done. This significantly reduces the computational time on mpirun as it isn't
running up/down the topology tree computing thousands of binding
locations, and it reduces the launch message to a very small number of
bytes.
When running -novm, fall back to the old way of doing things
where mpirun computes the entire map and binding, and then sends
the full info to the backend daemon.
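A toy sketch of the idea (all names and structures here are illustrative, not the actual ORTE code): because the mapping is deterministic, each daemon can derive its own bindings from the tiny app_context alone, so the launch message never needs to carry the full map.

    #include <stdio.h>

    #define NDAEMONS 4
    #define PROCS_PER_NODE 8

    /* all the small launch message needs to carry */
    typedef struct {
        int nprocs;          /* total procs requested */
        int procs_per_node;  /* mapping policy input */
    } app_context_t;

    /* each daemon derives its local bindings from the app_context alone */
    static void daemon_map_locally(int daemon_id, const app_context_t *app)
    {
        for (int i = 0; i < app->procs_per_node; i++) {
            int rank = daemon_id * app->procs_per_node + i;
            if (rank >= app->nprocs) {
                return;
            }
            printf("daemon %d: rank %d -> core %d\n", daemon_id, rank, i);
        }
    }

    int main(void)
    {
        app_context_t app = { NDAEMONS * PROCS_PER_NODE, PROCS_PER_NODE };
        /* default path: the "launch message" is just sizeof(app) bytes;
         * every daemon runs the same deterministic mapping function */
        for (int d = 0; d < NDAEMONS; d++) {
            daemon_map_locally(d, &app);
        }
        return 0;
    }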
Add a new cmd line option/mca param --fwd-mpirun-port that allows
mpirun to dynamically select a port and then pass it back to
all the other daemons so they will use that port as a static port
for their own wireup. In this mode, we no longer "phone home" directly
to mpirun, but instead use the static port to wire up at daemon
start. We then use the routing tree to roll up the initial
launch report, limiting the number of open sockets on mpirun's node.
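A minimal sketch of the daemon side, assuming a hypothetical ORTE_FWD_PORT environment variable as the way the forwarded port reaches the daemon (the real delivery mechanism may differ):

    #include <stdlib.h>
    #include <string.h>
    #include <stdint.h>
    #include <unistd.h>
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>

    int open_static_listener(void)
    {
        /* "ORTE_FWD_PORT" is an illustrative name for however the
         * forwarded port reaches the daemon */
        const char *p = getenv("ORTE_FWD_PORT");
        int port = (p != NULL) ? atoi(p) : 0;  /* 0 = OS picks (mpirun's case) */

        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0) return -1;

        int yes = 1;
        setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &yes, sizeof(yes));

        struct sockaddr_in sin;
        memset(&sin, 0, sizeof(sin));
        sin.sin_family = AF_INET;
        sin.sin_addr.s_addr = htonl(INADDR_ANY);
        sin.sin_port = htons((uint16_t)port);

        /* every daemon binds the same well-known port, so children can
         * connect to their routing-tree parent without phoning home */
        if (bind(fd, (struct sockaddr *)&sin, sizeof(sin)) < 0 ||
            listen(fd, 128) < 0) {
            close(fd);
            return -1;
        }
        return fd;
    }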
Update ras simulator to track the new nidmap code
Clean up some bugs in the nidmap regex code, and enhance the "not enough slots" error message to include the host on which the problem was found.
Update gadget platform file
Initialize the range count when starting a new range
Fix the no-np case in managed allocation
Ensure DVM node usage gets cleaned up after each job
Update scaling.pl script to use --fwd-mpirun-port. Pre-connect the daemon to its parent during launch while we are otherwise waiting for the daemon's children to send their "phone home" rollup messages
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
This commit fixes an error in teardown where the event bases are torn
down before the peer structures are released. This causes us to call
event_del on an invalid event base. At best this makes valgrind
complain and at worst this causes aborts or segvs.
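A sketch of the corrected ordering using the libevent calls involved (the peer structure here is illustrative):

    #include <event2/event.h>

    struct peer {
        struct event *recv_ev;
        struct peer *next;
    };

    void teardown(struct event_base *base, struct peer *peers)
    {
        /* 1. release the peers first: event_del() needs a live base */
        for (struct peer *p = peers; p != NULL; p = p->next) {
            event_del(p->recv_ev);
            event_free(p->recv_ev);
        }
        /* 2. only now is it safe to tear the base down; doing this
         * first made event_del() touch a freed base (valgrind noise
         * at best, aborts or segvs at worst) */
        event_base_free(base);
    }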
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Fix static port wireup by recording the TCP port mpirun is using and correctly passing the regex of hosts to the daemons. Do a better job of closing sockets on failed connection attempts. Correctly identify the remote host in the associated error message.
Fix partial allocation operations by not attempting to set #slots on nodes that were not used, and thus don't have a daemon or topology assigned to them
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
Repair rsh/ssh tree spawn by unpacking and updating the nidmap in remote_spawn.
Add more specific error messages so the cause of a messaging problem is a little clearer. Remove some stale code. Ensure we stop trying to send a message after a few times.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
Silence a warning in orted_submit
Protect against a free'd value in an error path when forming oob tcp connections
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
The problem was observed with direct modex used with the recursive doubling
algorithm (used for collective ID calculation prior to d52a2d081e9598a9ac9a50fb4b013a6d2a72375b),
which is pairwise in nature, so counter-connections (both peers initiating
a connection to each other at the same time) are highly likely.
The following scenario uncovered the issue:
* ranks `x` and `y` want to communicate with each other, `x` < `y`;
* rank `x` initiates the connection and sends the ack;
* rank `y` starts to `connect()` and gets the ack from `x`;
* `y` identifies that it already started connecting and `y` > `x`, so it rejects the incoming connection;
* `x` sees that its connection was rejected in `mca_oob_tcp_peer_recv_connect_ack()` when trying to
read the message header using `tcp_peer_recv_blocking()`, which calls `mca_oob_tcp_peer_close()`
and effectively flushes all the messages in the peer->send_queue;
* `y` sends the ack to `x` and the connection is established, however all the messages queued for the peer
at `x` have vanished (except the front one in peer->send_msg).
This commit introduces a "nack" function used on `y`'s side to tell `x` that `y` has
priority and that `x`'s connection should be closed. This avoids having to guess
about the reason for an unexpectedly closed connection.
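A rough, standalone sketch of the resolution logic (names are illustrative; the real code lives in the oob/tcp component):

    #include <stdbool.h>
    #include <stdio.h>

    enum handshake { HS_ACK = 1, HS_NACK = 2 };

    /* called on rank `me` when an incoming connection from `from`
     * arrives while our own connect() to `from` is still in flight */
    int resolve_race(int me, int from, bool connect_in_flight)
    {
        if (connect_in_flight && me > from) {
            /* we have priority: explicitly nack instead of silently
             * closing, so `from` does not mistake the reset for a
             * fatal error and flush its send queue */
            return HS_NACK;
        }
        return HS_ACK;  /* accept the incoming connection */
    }

    /* on the nacked side: retire the outbound attempt but keep the
     * queued messages - they flow over the peer's accepted connection */
    void handle_nack(int peer)
    {
        printf("peer %d has priority; keeping send queue intact\n", peer);
    }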
Signed-off-by: Artem Polyakov <artpol84@gmail.com>
There is no need for a configure option either, so remove the
--enable-orte-static-ports configure option. When decoding the daemon
nidmap, mark new daemons as ALIVE by default - we will discover dead
ones as we go.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
Display the hop node used to send a message
(if the message is sent directly, then the hop is the destination)
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Still not completely done, as we need a better way of tracking the routed module being used down in the OOB - e.g., when a peer drops its connection, we want to remove that route from all conduits that (a) use the OOB and (b) are routed, but we don't want to remove it from an OFI conduit.
Multiple conduits can exist at the same time, and can even point to the same base transport. Each conduit can have its own characteristics (e.g., flow control) based on the info keys provided to the "open_conduit" call. For ease during the transition period, the "legacy" RML interfaces remain as wrappers over the new conduit-based APIs using a default conduit opened during orte_init - this default conduit is tied to the OOB framework so that current behaviors are preserved. Once the transition has been completed, a one-time cleanup will be done to update all RML calls to the new APIs and the "legacy" interfaces will be deleted.
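An illustrative sketch of the conduit notion - the types and signatures below are assumptions for exposition only, not the actual RML API:

    #include <stdio.h>

    typedef int conduit_id_t;

    typedef struct {
        const char *transport;   /* e.g., "oob" or "ofi" */
        int flow_control;        /* per-conduit characteristic */
    } conduit_attr_t;

    /* open a conduit with the given characteristics */
    conduit_id_t open_conduit(const conduit_attr_t *attr)
    {
        static conduit_id_t next = 0;
        printf("conduit %d: transport=%s fc=%d\n", next,
               attr->transport, attr->flow_control);
        return next++;
    }

    int main(void)
    {
        /* the "legacy" RML calls wrap a default conduit opened at init */
        conduit_id_t dflt = open_conduit(
            &(conduit_attr_t){ .transport = "oob", .flow_control = 0 });
        /* a second conduit over the same transport, different behavior */
        conduit_id_t fc = open_conduit(
            &(conduit_attr_t){ .transport = "oob", .flow_control = 1 });
        (void)dflt; (void)fc;
        return 0;
    }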
While we are at it: Remove oob/usock component to eliminate the TMPDIR length problem - get all working, including oob_stress
* qos framework is moving to the scon layer and is no longer required in ORTE
* remove the rml/ftrm component as we now have multiple active components, and so the wrapper needs to be rethought
* no need to separate the "base" from the "API" module definition - the two are identical
* move the "stub" functions into their own file for cleanliness
* general cleanup to meet coding standards
* cleanup some logic in the stubs
Bring Slurm PMI-1 component online
Bring the s2 component online
Little cleanup - let the various PMIx modules set the process name during init, and then just raise it up to the ORTE level. Required as the different PMI environments all pass the jobid in different ways.
Bring the OMPI pubsub/pmi component online
Get comm_spawn working again
Ensure we always provide a cpuset, even if it is NULL
pmix/cray: adjust cray pmix component for pmix
Make changes so cray pmix can work within the integrated
ompi/pmix framework.
Bring singletons back online. Implement the comm_spawn operation using pmix - not tested yet
Cleanup comm_spawn - procs now starting, error in connect_accept
Complete integration
Have only a single level of "if" conditionals. Also, slightly change
the logic so that we only die/break out of the loop if we get EMFILE
-- for all other errors it is fine to go on to the next fd.
Finally, use a real show_help() message to warn when other errors occur.
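As an illustrative standalone sketch of that control flow (dup() stands in for the real per-fd work, and fprintf() stands in for the show_help() message):

    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    void process_fds(int maxfd)
    {
        for (int fd = 3; fd < maxfd; fd++) {
            int dupfd = dup(fd);        /* stand-in for the per-fd work */
            if (dupfd >= 0) {
                close(dupfd);
                continue;
            }
            if (errno == EMFILE) {
                /* the process is out of descriptors - later fds
                 * cannot succeed either, so give up on the loop */
                break;
            }
            /* all other errors: warn and go on to the next fd */
            fprintf(stderr, "fd %d: %s\n", fd, strerror(errno));
        }
    }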