If PMIX_PACKAGE_RANK is available, uses this value to select between multiple
NIC of equal distance between the current process. If this value is not
available, try to calculate it by getting the locality string from each local
process and assign a package_rank. If everything fails, fall back to using
process_id.rank to select the NIC. This last case is not ideal, but has a small
chance of occuring, and causes an output to be displayed to notify that this is
occuring.
Signed-off-by: Nikola Dancejic <dancejic@amazon.com>
With Open MPI 5.0, the decision was made to stop building
3rd-party packages, such as Libevent, HWLOC, PMIx, and PRRTE as
MCA components and instead 1) start relying on external libraries
whenever possible and 2) Open MPI builds the 3rd party
libraries (if needed) as independent libraries, rather than
linked into libopen-pal.
This patch moves libevent from an MCA framework to a stand-alone
library built outside of OPAL. A wrapper in opal/util is provided
to minimize the unnecessary changes in the rest of the code. When
using the internal Libevent, it will be installed as a stand-alone
libevent.a, instead of bundled in OPAL. Any pre-installed version
of Libevent at or after 2.0.21 is preferred over the internal
version.
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
This PR removes the MCA_BTL_DES_FLAGS_PUT and MCA_BTL_DES_FLAGS_GET
descriptor flags. At some point these had some meaning but they were
replaced by the rcache access flags.
Signed-off-by: Nathan Hjelm <hjelmn@google.com>
The ofi_rxm provider is dependent upon the underlying hardware for its
implementation of FI_DELIVERY_COMPLETE. Since this can lead to early
completions, we disable the provider to avoid correctness issues.
This is not an issue in the mtl/ofi as it does not require
FI_DELIVERY_COMPLETE.
Signed-off-by: William Zhang <wilzhang@amazon.com>
The btl/ofi does not currently utilize the common ofi include/exclude
list. Added verification code similar to the mtl/ofi that will check if
the info object is in the include or exclude list. If it isn't in the
include list or is in the exclude list, validate_info will return
OPAL_ERROR. The btl/ofi will no longer pass a provider name as a hint
when calling getinfo, instead filtering the provider during
validate_info.
This patch also moves the is_in_list MTL function into common code and
adds additional debugging output to the BTL to match the MTL standard.
Signed-off-by: William Zhang <wilzhang@amazon.com>
EFA incorrectly implements FI_DELIVERY_COMPLETE in earlier libfabric
versions. While FI_DELIVERY_COMPLETE would be advertised by the
provider, completions would return too early by not accounting for
bounce buffers on the receive side. This would cause the BTL
to receive early completions that lead to correctness issues.
This is not an issue in the mtl/ofi as it does not require
FI_DELIVERY_COMPLETE.
Signed-off-by: William Zhang <wilzhang@amazon.com>
There are 2 reasons for this:
- pending CUDA events are not progressed by this BTL, so anything that becomes
asychronous will never be completed.
- we use the packed data on the shared memory backing file, and this will be
returned to the peer process upon return (thus if we copy asynchronously we
might not copy the right data).
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
It has been broken for months because of the lack of initialization of the
HWLOC library. The smcuda process creating the backing file (local rank 0)
uses opal_cache_line_size to align the objects in the backing file, and the
opal_cache_line_size is initialized by default to 128. Later on, when the rest
of the processes attach the same backing file, HWLOC has been called and the
cache size has now been updated to the correct value. If this value is
different than the default one (and they are as most cache sizes are 64 bytes
right now) the objects in the backing file will be misaligned.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
This commit updates the btl interface to change the parameters
passed to receive callbacks. The interface used to pass the tag,
a btl base descriptor, and the callback context. Most of the
values in the btl base descriptor were unused and only helped
simplify the callbacks from the self btl. All of the arguments
have now been replaced with a single receive callback descriptor.
This descriptor contains the incoming endpoint, data segment(s),
tag, and callback context. All btls have been updated to use
the new callback and the btl interface version has been bumped
to v3.2.0.
As part of this change the descriptor argument (and the segments
contained within it) have been marked as const. The were treated
as const before but this change could allow the compiler to make
better optimization decisions and will enforce that the callback
does not attempt to change the data in the descriptor.
Signed-off-by: Nathan Hjelm <hjelmn@google.com>
mtl_btl_ofi_rcache_init() initializes patcher which should only take
place things are single threaded. OFI providers may start spawn threads,
so initialize the rcache before creating OFI objects to prevent races.
Authored-by: John L. Byrne <john.l.byrne@hpe.com>
Signed-off-by: Harumi Kuno <harumi.kuno@hpe.com>
also add common verbose variable.
Note the verbosity thing is a little tricky owing to the way the MCA frameworks and components are registered and
and initialized. The BTL's are registered/initialized prior to the MTL components even getting registered.
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
Now that the old sm btl has been gone for some time there was a request
to rename vader to sm. This commit does just that (reluctantly).
An alias has been generated so specifying vader in the btl selection
variable or specifying vader parameters will continue to work.
Signed-off-by: Nathan Hjelm <hjelmn@google.com>
Adds the capability to select a NIC based on hardware locality.
Creates a list of NICs that share the same cpuset as the process,
then selects the NIC based on the (local rank) % (number of NICs).
If no NICs are available that share the same cpuset, the selection process
will create a list of all available NICs and make a selection based on
(local rank) % (number of NICs)
Signed-off-by: Nikola Dancejic <dancejic@amazon.com>
The configure script for the btl uct component reports an error for
the new UCX 1.8.0 versions as it was fixed up to UCX 1.7.
This fixes#7612
Signed-off-by: Christoph Niethammer <niethammer@hlrs.de>
Deprecate the current OMPI-specific MPI_Info key definitions for
MPI_Comm_spawn and replace them with their PMIx equivalents. Issue a
deprecation/conversion warning as this is done. Also issue deprecation
warnings for options such as "ompi_non_mpi" that are no longer used.
Handle both cases where the user might pass either the PMIx attribute
name itself (e.g., "PMIX_MAPBY") or the string value of the attribute
(e.g., PMIX_MAPBY, which translates to "pmix.mapby"). This can only be
done for PMIx v4 and above, so protect that code.
Silence a couple of Coverity warnings and add a test along the way.
Signed-off-by: Ralph Castain <rhc@pmix.org>
Consolidate the ompi_process_info and opal_process_info structs to
remove duplicate storage and conversion issues. Unwind some interweaving
of include files using opal.h. Silence a couple of warnings.
For now, set the arch to local if PMIX_ARCH is not found.
Signed-off-by: Ralph Castain <rhc@pmix.org>
Do some code cleanup in the connect/accept code. Ensure that the OMPI
layer has access to the PMIx identifier for the process. Add macros for
converting PMIx names to/from strings. Cleanup a few of the simple test
programs. Add a little more info to a btl/tcp error message.
Signed-off-by: Ralph Castain <rhc@pmix.org>
Found a handful of other URLs that weren't https-ized, so I updated
them, too (after verifying that they support https, of course).
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Add a framework to support different types of threading models including
user space thread packages such as Qthreads and argobot:
https://github.com/pmodels/argobotshttps://github.com/Qthreads/qthreads
The default threading model is pthreads. Alternate thread models are
specificed at configure time using the --with-threads=X option.
The framework is static. The theading model to use is selected at
Open MPI configure/build time.
mca/threads: implement Argobots threading layer
config: fix thread configury
- Add double quotations
- Change Argobot to Argobots
config: implement Argobots check
If the poll time is too long, MPI hangs.
This quick fix just sets it to 0, but it is not good for the
Pthreads version. Need to find a good way to abstract it.
Note that even 1 (= 1 millisecond) causes disastrous performance
degradation.
rework threads MCA framework configury
It now works more like the ompi/mca/rte configury,
modulo some edge items that are special for threading package
linking, etc.
qthreads module
some argobots cleanup
Signed-off-by: Noah Evans <noah.evans@gmail.com>
Signed-off-by: Shintaro Iwasaki <siwasaki@anl.gov>
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
We currently save the hostname of a proc when we create the ompi_proc_t for it. This was originally done because the only method we had for discovering the host of a proc was to include that info in the modex, and we had to therefore store it somewhere proc-local. Obviously, this ccarried a memory penalty for storing all those strings, and so we added a "cutoff" parameter so that we wouldn't collect hostnames above a certain number of procs.
Unfortunately, this still results in an 8-byte/proc memory cost as we have a char* pointer in the opal_proc_t that is contained in the ompi_proc_t so that we can store the hostname of the other procs if we fall below the cutoff. At scale, this can consume a fair amount of memory.
With the switch to relying on PMIx, there is no longer a need to cache the proc hostnames. Using the "optional" feature of PMIx_Get, we restrict the retrieval to be purely proc-local - i.e., we retrieve the info either via shared memory or from within the proc-internal hash storage (depending upon the active PMIx components). Thus, the retrieval of a hostname is purely a local operation involving no communication.
All RM's are required to provide a complete hostname map of all procs at startup. Thus, we have full access to all hostnames without including them in a modex or having to cache them on each proc. This allows us to remove the char* pointer from the opal_proc_t, saving us 8-bytes/proc.
Unfortunately, PMIx_Get does not currently support the return of a static pointer to memory. Thus, even though PMIx has the hostname in its memory, it can only return a malloc'd version of it. I have therefore ensured that the return from opal_get_proc_hostname is consistently malloc'd and free'd wherever used. This shouldn't be a burden as the hostname is only used in one of two circumstances:
(a) in an error message
(b) in a verbose output for debugging purposes
Thus, there should be no performance penalty associated with the malloc/free requirement. PMIx will eventually be returning static pointers, and so we can eventually simplify this method and return a "const char*" - but as noted, this really isn't an issue even today.
Signed-off-by: Ralph Castain <rhc@pmix.org>
Extend the PMIx modex recv macros to cover the full set of
immediate/optional combinations. If PMIx_Init cannot reach a server,
then declare the MPI proc to be a singleton.
Provide full support for info values via PMIx
Catch all the values used in the "info" area of OMPI using data
available from PMIx instead of via envars. Update PMIx and PRRTE to sync
with their capabilities.
PMIx
- ensure cleanup of fork/exec children
- fix bug in gds/hash that left app info off of list
PRRTE
- fix multi-app bugs
- port setup_child logic from orte
- OMPI env changes
- set app->first_rank
- ensure common hostname across prun, prte, and pmix
- Fix "nolocal" support
Silence a warning from btl/vader
Signed-off-by: Ralph Castain <rhc@pmix.org>
This commit fixes an issue with the include usage in some
ompi source files. These source files are using the <> form
of include when the "" form is correct (as these are internal,
**not** system headers).
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Signed-off-by: Nathan Hjelm <hjelmn@google.com>
Per suggestion of @awlauria, added some comments about
the need to free ep before resources it points to.
Signed-off-by: Harumi Kuno <harumi.kuno@hpe.com>
This fix is from John L. Byrne (john.l.byrne@hpe.com).
When OFI Libfabric binds objects to endpoints, before the object can
be successfully closed, the endpoint must first be freed. For scalable
endpoints, objects can also be bound to transmit and receive contexts,
and for objects that are bound to contexts, we need to first free the
contexts before freeing the endpoint. We also need to clear the memory
registration cache.
If we don't clean up properly, then fi\_close may not be able to close
the domain because the dom will have a non-zero ref count.
Signed-off-by: harumi kuno <harumi.kuno@hpe.com>
Keep track of the connected procs in vader_add_procs().
Otherwise, the same rank will reconnect the same shmem
segment (rank 0+...) multiple times instead of the next
one as intended.
Signed-off-by: Austen Lauria <awlauria@us.ibm.com>
Restrict the search to the "immediate" range so at worst we check with
our local server and don't go up to the host daemon.
Signed-off-by: Ralph Castain <rhc@pmix.org>
The SM BTL was effectively removed a long time ago. All that was left
was a shell that warned people if they tried to use the SM BTL. For
v5.0, we plan to finally remove this ancient shell (and possibly
replace it with vader).
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Will be replaced by PRRTE. Ensure that OMPI and OPAL layers build
without reference to ORTE. Setup opal/pmix framework to be static.
Remove support for all PMI-1 and PMI-2 libraries. Add support for
"external" pmix component as well as internal v4 one.
remove orte: misc fixes
- UCX fixes
- VPATH issue
- oshmem fixes
- remove useless definition
- Add PRRTE submodule
- Get autogen.pl to traverse PRRTE submodule
- Remove stale orcm reference
- Configure embedded PRRTE
- Correctly pass the prefix to PRRTE
- Correctly set the OMPI_WANT_PRRTE am_conditional
- Move prrte configuration to the end of OMPI's configure.ac
- Make mpirun a symlink to prun, when available
- Fix makedist with --no-orte/--no-prrte option
- Add a `--no-prrte` option which is the same as the legacy
`--no-orte` option.
- Remove embedded PMIx tarball. Replace it with new submodule
pointing to OpenPMIx master repo's master branch
- Some cleanup in PRRTE integration and add config summary entry
- Correctly set the hostname
- Fix locality
- Fix singleton operations
- Fix support for "tune" and "am" options
Signed-off-by: Ralph Castain <rhc@pmix.org>
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>