onto the backend daemons. By default, mpirun now packs only the app_context
info and sends it to the backend daemons, where the mapping is
done. This significantly reduces the computation time on mpirun, since it no
longer runs up and down the topology tree computing thousands of binding
locations, and it shrinks the launch message to a very small number of
bytes.
When running -novm, fall back to the old way of doing things
where mpirun computes the entire map and binding, and then sends
the full info to the backend daemon.
Add a new cmd line option/mca param --fwd-mpirun-port that allows
mpirun to dynamically select a port and then pass it to
all the other daemons, which use that port as a static port
for their own wireup. In this mode, we no longer "phone home" directly
to mpirun, but instead use the static port to wire up at daemon
start. We then use the routing tree to roll up the initial
launch report, and limit the number of open sockets on mpirun's node.
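As a rough illustration of "dynamically select a port, then hand it to the daemons as a static port", the sketch below binds to port 0 so the operating system picks a free port, then copies that port into a per-daemon settings list. The function names, the loopback host, and the settings layout are assumptions made for this example; they are not taken from the mpirun or daemon code.

```python
import socket


def select_dynamic_port(host="127.0.0.1"):
    """Bind to port 0 so the OS assigns a free port; return (listener, port)."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind((host, 0))
    listener.listen()
    return listener, listener.getsockname()[1]


def build_daemon_config(port, num_daemons):
    """Hypothetical per-daemon settings: every daemon gets the same static port."""
    return [{"daemon": i, "static_port": port} for i in range(1, num_daemons + 1)]
```

The listener is returned together with the port so the caller keeps the socket open; closing it would let the OS hand the port to another process.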
Update ras simulator to track the new nidmap code
Clean up some bugs in the nidmap regex code, and enhance the "not enough
slots" error message to include the host on which the problem is found
Update gadget platform file
Initialize the range count when starting a new range
Fix the no-np case in managed allocation
Ensure DVM node usage gets cleaned up after each job
Update scaling.pl script to use --fwd-mpirun-port
Pre-connect the daemon to its parent during launch while we are otherwise
waiting for the daemon's children to send their "phone home" rollup messages
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
The README on master had grown very, very stale. This commit copies
the README from the tip of the v2.x branch (from
https://github.com/open-mpi/ompi/pull/3119) and preserves a few minor
differences between master and the v2.x branch.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
[skip ci]
bot:notest
This fixes the issue reported by Nicolas Joly on the mailing list: the
sharedfp/lockedfile component does not currently support multiple jobs
reading from the same input file, because the file names used for the
sharedfp handle collide. Although not part of the original report, the
same problem occurs in the sharedfp/sm component. Therefore, add the
jobid to the lockedfile and sm file names.
Use the OMPI_CAST_RTE_NAME macro to determine the jobid.
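The collision and its remedy can be illustrated with a small sketch. The naming scheme below (suffixes, separators, function names) is an assumption made for this example and is not the format used by the sharedfp components; it only shows why a name derived from the input file alone collides across jobs, and why adding the jobid separates them.

```python
def lockfile_name_without_jobid(input_path):
    """Hypothetical old scheme: the name depends only on the input file."""
    return f"{input_path}.lock"


def lockfile_name_with_jobid(input_path, jobid):
    """Hypothetical new scheme: the jobid makes the name unique per job."""
    return f"{input_path}-{jobid}.lock"


# Two jobs reading the same input file (the jobids are made-up values).
job_a, job_b = 1001, 1002
old_names = {lockfile_name_without_jobid("input.dat") for _ in (job_a, job_b)}
new_names = {lockfile_name_with_jobid("input.dat", j) for j in (job_a, job_b)}
```

With the old scheme both jobs map to a single name, so `old_names` has one element; with the jobid included, `new_names` has two.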
Fixes: #3098
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
hwloc v1.5 supports neither HWLOC_OBJ_OSDEV_COPROC
nor hwloc_topology_dup(), so for this version:
- do not search for coprocessors
- do not try hwloc_topology_dup() (note that this is not
used anywhere in the code base)
Thanks Jeff for helping with the wording
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
This commit brings over the behavior from the 2.x series to master,
mostly with the fork for the 3.x series in mind.
Also, use strncasecmp instead of two strncmp calls.
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
- Support MPI-2.2 and MPI-3.0 COLL features.
* `MPI_REDUCE_SCATTER_BLOCK`
* neighborhood collective communication
* nonblocking collective communication
- Add `*_BASE_ARGS` and `*_BASE_ARG_NAMES` for convenience.
- Use parameter names used in the MPI Standard.
Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
This is mostly for error cases, where we need to release the
newly created proc. Currently the code deadlocks because the endpoint
lock is held at the release and the lock is not recursive.
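The deadlock pattern described above can be reproduced in miniature. The sketch below is an analogy in Python, not the BTL code: the class, the function names, and the choice to drop the lock before releasing are assumptions for illustration, since the message does not spell out how the fix is implemented. A timed acquire stands in for the blocking acquire so the example returns instead of hanging.

```python
import threading


class Endpoint:
    def __init__(self):
        self.lock = threading.Lock()  # non-recursive, like the endpoint lock
        self.proc = object()          # stands in for the newly created proc


def release_proc(endpoint, timeout=0.1):
    """Release path that needs the endpoint lock; False means it could not get it."""
    if not endpoint.lock.acquire(timeout=timeout):
        return False  # a plain blocking acquire would hang here forever
    try:
        endpoint.proc = None
        return True
    finally:
        endpoint.lock.release()


def error_path_lock_held(endpoint):
    """Problem pattern: release the proc while still holding the lock."""
    with endpoint.lock:
        return release_proc(endpoint)


def error_path_unlock_first(endpoint):
    """One possible remedy (an assumption): drop the lock, then release."""
    endpoint.lock.acquire()
    # ... error detected while holding the lock ...
    endpoint.lock.release()
    return release_proc(endpoint)
```

Calling `error_path_lock_held(Endpoint())` returns `False` because the same thread cannot re-acquire a non-recursive lock, while `error_path_unlock_first(Endpoint())` returns `True`.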
Also added some code to print the IP addresses that don't match during
the TCP connection step.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
Memory hooks are now set up on demand. pml/yalla, mtl/mxm and
coll/hcoll need the memory hooks, so make sure those are installed.
Signed-off-by: Yossi Itigin <yosefe@mellanox.com>
Per a prior commit, the presence of "hwloc.h" can cause ambiguity when
using --with-hwloc=external (i.e., whether to include
opal/mca/hwloc/hwloc.h or whether to include the system-installed
hwloc.h).
This commit:
1. Renames opal/mca/hwloc/hwloc.h to hwloc-internal.h.
2. Adds opal/mca/hwloc/autogen.options to tell autogen.pl to expect to
find hwloc-internal.h (instead of hwloc.h) in opal/mca/hwloc.
3. s@opal/mca/hwloc/hwloc.h@opal/mca/hwloc/hwloc-internal.h@g in the
rest of the code base.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Frameworks are usually required to have a framework/framework.h file.
However, this is sometimes problematic (see the hwloc use case/problem
description, below).
This commit allows frameworks to have an "autogen.options" file (i.e.,
project/mca/framework/autogen.options) that specifies things that
autogen needs to know about the framework. Currently, the only option
recognized in autogen.options is "framework_header", which allows a
framework to specify that its header file is named something other
than "framework.h" (the framework header file must still be in the
project/mca/framework directory; it simply may be named something
other than framework.h). More options may be introduced over time.
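The lookup described above can be sketched as follows. This is an illustration in Python rather than the Perl used by autogen.pl, and the file syntax it assumes (one `key = value` pair per line, `#` starting a comment) is an assumption for this example; only the option name "framework_header" and the default "framework.h" naming come from the description.

```python
import os


def framework_header_name(framework_dir, framework_name):
    """Return the header file name autogen should expect for a framework.

    Defaults to "<framework>.h"; an autogen.options file in the framework
    directory may override it through the "framework_header" option.
    """
    header = f"{framework_name}.h"
    options_path = os.path.join(framework_dir, "autogen.options")
    if not os.path.isfile(options_path):
        return header
    with open(options_path) as fh:
        for raw in fh:
            line = raw.split("#", 1)[0].strip()  # assumed comment syntax
            if not line:
                continue
            key, sep, value = line.partition("=")
            if sep and key.strip() == "framework_header":
                header = value.strip()
    return header
```

Unknown keys are ignored in this sketch, which leaves room for further options without changing the lookup.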
The use case that motivated this is the hwloc framework
(https://github.com/open-mpi/ompi/issues/2616).
Per MCA framework rules, the hwloc framework is required to have an
opal/mca/hwloc/hwloc.h file. However, the hwloc library itself *also*
has an hwloc.h file. This causes a problem when configuring Open MPI
with --with-hwloc=external (meaning: do not use the hwloc embedded
within the Open MPI source code tree -- instead, use an hwloc
installation from outside the Open MPI source code tree).
Specifically, when in the opal/mca/hwloc directory, the presence of
"-I." in DEFAULT_INCLUDES (put there by Automake) causes confusion
between the hwloc.h in opal/mca/hwloc/hwloc.h and the system-installed
hwloc.h. Chaos ensues (see the GitHub issue for more detail).
The solution is to rename the opal/mca/hwloc/hwloc.h to something else
(e.g., hwloc-internal.h), and extend autogen.pl to allow frameworks to
have an alternate name for their framework header file.
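A toy model of include-path resolution shows why the rename helps. It assumes only that directories given with -I are searched in order and the first match wins; the directory names and the in-memory "file tree" are made up for this example.

```python
def resolve_include(header, search_dirs, tree):
    """Return the first directory in search_dirs that contains header.

    tree maps a directory name to the set of file names it holds.
    """
    for directory in search_dirs:
        if header in tree.get(directory, set()):
            return directory
    return None


# "-I." comes first, as in DEFAULT_INCLUDES; the system path is hypothetical.
search_dirs = [".", "/usr/include"]

before_rename = {".": {"hwloc.h"}, "/usr/include": {"hwloc.h"}}
after_rename = {".": {"hwloc-internal.h"}, "/usr/include": {"hwloc.h"}}

found_before = resolve_include("hwloc.h", search_dirs, before_rename)
found_after = resolve_include("hwloc.h", search_dirs, after_rename)
```

Before the rename, `found_before` is "." (the framework header shadows the external library's header); after it, `found_after` is "/usr/include".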
This commit introduces the autogen.pl mechanism to allow the alternate
header file name. A follow-on commit will effect this change in the
hwloc framework (and update all the places in the code base to use the
new filename).
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>