The usNIC BTL does not use more than 1 iov, so be sure to set it to 1
so that we don't allocate cq/rq/sq entries based on a default (i.e.,
>1) number of iovs per entry.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
This PR renames the common library for OFI libfabric from
libfabric to ofi. There are a number of reasons this
is good to do:
1) it's shorter, replacing nine characters with three in
function names for what may eventually be a fairly extensive interface
2) OFI is the term used for MTL and RML components that use
the OFI libfabric interface
3) A planned OSC component will also use the OFI term.
4) Other HPC libraries that can use OFI libfabric tend to use
the term "ofi" internally and also in their configure options
relevant to OFI libfabric (e.g., MPICH/CH4, Intel MPI, Sandia SHMEM)
There seem to be comments in places in the Open MPI source
code that indicate that this common library will be going away.
Far from it: we will want to be able to share things like
AV objects between OMPI and possibly OSHMEM components that
use the OFI libfabric interface.
This PR also adds synonyms for the --with-libfabric(-libdir)
configury options: --with-ofi and --with-ofi-libdir.
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
This commit recategorizes several mpirun arguments,
and moves the information for mpirun --help arguments
to the bottom of the general help message. I also
added the OPAL_CMD_LINE_OTYPE field to two commands
that were missed initially because they were not
in the same area as the others.
Signed-off-by: Nathaniel Graham <ngraham@lanl.gov>
With this, libs (e.g., "-ldl") are not added to the wrapper LIBS
flags. This may work on some platforms, but on at least RHEL 7.3, it
does not (i.e., compiling MPI applications fails because it can't find
dlopen).
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
* Add a path for failed component load information to be reported up.
* This allows ompi_info to display this information inline to make it
easier for folks to see if the component is present but failed for
some reason. Most likely a missing library, but could be a libnl
conflict.
* Add MCA parameter to enable this feature:
- `mca_base_component_track_load_errors` takes a boolean
- Default: `false`
Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
Set the daemons' state to "running" and mark them as "alive" by default when constructing the nidmap
Get the DVM running again
Fix direct modex by eliminating race condition caused by releasing data while sending it
Up the size limit before compressing
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
Commit fec519a793 broke the ability to
run autogen.pl in a distribution tarball. This commit restores that
ability by also distributing opal/mca/hwloc/autogen.options in the
tarball.
Skipping CI because CI does not test this functionality:
[skip ci]
bot:notest
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Several component-specific functions were named with a prefix of
"opal_timer_base", which was quite confusing. Rename them to have a
prefix "opal_timer_linux" to make it clear that they are here in this
component (and different than *actual* opal_timer_base symbols).
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
This commit makes datagram checks time based and reduces their
frequency when only the wildcard datagram is posted. This change
improves latency on knl systems.
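A minimal sketch of the time-based check described above, not the btl/ugni code itself. The structure, function name, and both interval values are hypothetical; the caller is assumed to supply the current time in nanoseconds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical polling intervals in nanoseconds (not the values used by btl/ugni). */
#define DG_INTERVAL_ACTIVE_NS   1000ULL    /* directed datagrams are posted */
#define DG_INTERVAL_WILDCARD_NS 100000ULL  /* only the wildcard datagram is posted */

/* Hypothetical bookkeeping for datagram polling. */
typedef struct {
    uint64_t last_check_ns;  /* time of the most recent datagram check */
    int directed_posted;     /* number of non-wildcard datagrams currently posted */
} dg_poll_state_t;

/* Return true if the datagram queue should be polled now. The timestamp is
 * only updated when a check is actually due. */
bool dg_check_due(dg_poll_state_t *state, uint64_t now_ns)
{
    assert(state != NULL);

    /* Poll less often when nothing but the wildcard datagram is posted. */
    uint64_t interval = (state->directed_posted > 0) ? DG_INTERVAL_ACTIVE_NS
                                                     : DG_INTERVAL_WILDCARD_NS;

    if (now_ns - state->last_check_ns < interval) {
        return false;  /* too soon: skip the check */
    }

    state->last_check_ns = now_ns;
    return true;
}
```

A progress loop would call dg_check_due() on every pass and poll for datagrams only when it returns true.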
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit removes the local operation count check from the local SMSG
completion queue. This check was leading to hangs due to an undocumented
feature of the ugni library. The local SMSG CQ is used to send credit
return messages back to the sender. The ugni library never checks for
the completion itself but relies on the SMSG user to periodically
check the CQ.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit updates the ugni btl to make use of multiple device
contexts to improve the multi-threaded RMA performance. This commit
contains the following:
- Clean up the endpoint structure by removing unnecessary fields. The
structure now also contains all the fields originally handled by the
common/ugni endpoint.
- Clean up the fragment allocation code to remove the need to
initialize the my_list member of the fragment structure. This
member is not initialized by the free list initializer function.
- Remove the (now unused) common/ugni component. btl/ugni no longer
needs the component. common/ugni was originally split out of
btl/ugni to support bcol/ugni. As that component no longer exists
there is no reason to keep this component.
- Create wrappers for the ugni functionality required by
btl/ugni. This was done to ease supporting multiple device
contexts. The wrappers are thread safe and currently use a spin
lock instead of a mutex. This produces better performance when
using multiple threads spread over multiple cores. In the future
this lock may be replaced by another serialization mechanism. The
wrappers are located in a new file: btl_ugni_device.h.
- Remove unnecessary device locking from serial parts of the ugni
btl. This includes the first add-procs and module finalize.
- Clean up fragment wait list code by moving enqueue into common
function.
- Expose the communication domain flags as an MCA variable. The
defaults have been updated to reflect the recommended setting for
knl and haswell.
- Avoid allocating fragments for communication with already
overloaded peers.
- Allocate RDMA endpoints dynamically. This is needed to support
spreading RMA operations across multiple contexts.
- Add support for spreading RMA communication over multiple ugni
device contexts. This should greatly improve the threading
performance when communicating with multiple peers. By default the
number of virtual devices depends on 1) whether
opal_using_threads() is set, 2) how many local processes are in the
job, and 3) how many bits are available in the pid. The last is
used to ensure that each CDM is created with a unique id.
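As an illustration of the spin-lock wrappers mentioned in the list above, here is a minimal sketch using C11 atomics. It is not the content of btl_ugni_device.h: the demo_device_t type and every function name are hypothetical, and the "device call" is reduced to a counter update.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for one device context. */
typedef struct {
    atomic_flag lock;  /* spin lock guarding this device */
    int posted_ops;    /* example state that must not be updated concurrently */
} demo_device_t;

void demo_device_init(demo_device_t *dev)
{
    assert(dev != NULL);
    atomic_flag_clear(&dev->lock);
    dev->posted_ops = 0;
}

static inline void demo_device_lock(demo_device_t *dev)
{
    /* Spin until the flag was previously clear. Acquire ordering makes the
     * previous holder's writes visible to this thread. */
    while (atomic_flag_test_and_set_explicit(&dev->lock, memory_order_acquire)) {
        /* busy wait */
    }
}

static inline void demo_device_unlock(demo_device_t *dev)
{
    atomic_flag_clear_explicit(&dev->lock, memory_order_release);
}

/* Wrapper: the (hypothetical) device operation is serialized per device. */
int demo_device_post(demo_device_t *dev)
{
    int count;

    demo_device_lock(dev);
    count = ++dev->posted_ops;
    demo_device_unlock(dev);

    return count;
}
```

With one lock per device, threads that use different device contexts never contend with each other.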
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit exposes ugni statistics for use with MPI_T. There is
no overhead to providing these counters.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
hwloc v1.5 does not support HWLOC_OBJ_OSDEV_COPROC
nor hwloc_topology_dup(), so for this version:
- do not search for coprocessors
- do not try hwloc_topology_dup(), note this is not
used anywhere in the code base
Thanks Jeff for helping with the wording
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
This is mostly for error cases, where we need to release the
newly created proc. Currently the code deadlocks because the endpoint
lock is held at the release and the lock is not recursive.
Also added some code to print the IP addresses that don't match during
the TCP connection step.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
Per a prior commit, the presence of "hwloc.h" can cause ambiguity when
using --with-hwloc=external (i.e., whether to include
opal/mca/hwloc/hwloc.h or whether to include the system-installed
hwloc.h).
This commit:
1. Renames opal/mca/hwloc/hwloc.h to hwloc-internal.h.
2. Adds opal/mca/hwloc/autogen.options to tell autogen.pl to expect to
find hwloc-internal.h (instead of hwloc.h) in opal/mca/hwloc.
3. s@opal/mca/hwloc/hwloc.h@opal/mca/hwloc/hwloc-internal.h@g in the
rest of the code base.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
* Include a 'demo' component that shows some of the features.
* Currently has hooks for:
- MPI_Initialized
- top, bottom
- MPI_Init_thread
- top, bottom
- MPI_Finalized
- top, bottom
- MPI_Init
- top (pre-opal_init), top (post-opal_init), error, bottom
- MPI_Finalize
- top, bottom
* Other places in ompi can 'register' to hook into any one of these places
by passing back a component structure filled with function pointers.
* Add a `MCA_BASE_COMPONENT_FLAG_REQUIRED` flag to the MCA structure that
is checked by the `hook` framework. If a required, static component has
been excluded then the `hook` framework will fail to initialize.
- See note in `opal/mca/mca.h` as to why this is checked in the `hook`
framework and not in `opal/mca/base/mca_base_component_find.c`
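The registration mechanism described above ("passing back a component structure filled with function pointers") can be sketched roughly as follows. This is an illustration only: the structure layout, the fixed-size table, and all demo_* names are hypothetical and are not the `hook` framework's actual interface.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical hook table; a NULL entry means "not interested". */
typedef struct {
    void (*mpi_init_top)(int argc, char **argv);
    void (*mpi_init_bottom)(int argc, char **argv);
    void (*mpi_finalize_top)(void);
    void (*mpi_finalize_bottom)(void);
} demo_hook_component_t;

#define DEMO_MAX_HOOKS 8

static const demo_hook_component_t *demo_hooks[DEMO_MAX_HOOKS];
static size_t demo_num_hooks = 0;

/* Register a component; returns 0 on success, -1 if the table is full. */
int demo_hook_register(const demo_hook_component_t *comp)
{
    assert(comp != NULL);
    if (demo_num_hooks == DEMO_MAX_HOOKS) {
        return -1;
    }
    demo_hooks[demo_num_hooks++] = comp;
    return 0;
}

/* Would be called at the top of a (hypothetical) MPI_Init implementation. */
void demo_hook_call_mpi_init_top(int argc, char **argv)
{
    for (size_t i = 0; i < demo_num_hooks; ++i) {
        if (NULL != demo_hooks[i]->mpi_init_top) {
            demo_hooks[i]->mpi_init_top(argc, argv);
        }
    }
}

/* Example hook that only counts how often it ran. */
int demo_init_top_calls = 0;

void demo_count_init_top(int argc, char **argv)
{
    (void) argc;
    (void) argv;
    ++demo_init_top_calls;
}
```

A registrant fills in only the entries it cares about and leaves the rest NULL, so adding a new hook point does not force changes in existing registrants.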
Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
in this context, AMD64 really means amd64 or em64t, so let's
rename this into X86_64 in order to avoid any confusion
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
This commit makes the vma tree garbage collection list a lifo. This
way we can avoid having to hold any lock when releasing vmas. In
theory this should finally fix the hold-and-wait deadlock detailed
in #1654.
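To show why a LIFO removes the need for a lock on the release path, here is a minimal sketch of a lock-free push with C11 atomics. It is an assumption-laden illustration, not the vma tree code: the gc_item_t and gc_lifo_t types and all function names are hypothetical, and the consumer drains the whole list at once, which sidesteps the ABA problem of a single-item pop.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical item queued for garbage collection. */
typedef struct gc_item {
    struct gc_item *next;
    int id;
} gc_item_t;

typedef struct {
    _Atomic(gc_item_t *) head;
} gc_lifo_t;

void gc_lifo_init(gc_lifo_t *lifo)
{
    assert(lifo != NULL);
    atomic_init(&lifo->head, (gc_item_t *) NULL);
}

/* Push without taking any lock: retry the compare-and-swap until the head
 * has not changed between reading it and replacing it. */
void gc_lifo_push(gc_lifo_t *lifo, gc_item_t *item)
{
    gc_item_t *old = atomic_load_explicit(&lifo->head, memory_order_relaxed);

    do {
        item->next = old;  /* old is refreshed by a failed compare-exchange */
    } while (!atomic_compare_exchange_weak_explicit(&lifo->head, &old, item,
                                                    memory_order_release,
                                                    memory_order_relaxed));
}

/* Detach the entire list in one atomic exchange; the caller then owns it. */
gc_item_t *gc_lifo_take_all(gc_lifo_t *lifo)
{
    return atomic_exchange_explicit(&lifo->head, (gc_item_t *) NULL,
                                    memory_order_acquire);
}
```

The releasing thread only calls gc_lifo_push(), so it never waits on a lock that the garbage collector might hold.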
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Allow someone to specify the "pe=N" modifier to a mapping policy when N=1. This equates to just "bind-to core", but helps people who use a script to set the PE policy. Fix a bug where setting the binding policy left a lingering "if-supported" flag that shouldn't be there.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
Clean up a race condition segfault during finalize by ensuring the PMIx progress thread is stopped prior to starting to tear down the messaging components
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
Do not use opal_output_verbose inside O(n) loops. This was causing us
to make O(n) calls to snprintf which was greatly slowing launch at
scale.
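One way to keep formatting work out of an O(n) loop is to emit a single summary after the loop. The sketch below assumes that approach; it does not use opal_output_verbose, and the verbosity variable, the threshold of 10, and all demo_* names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical verbosity setting and counter; neither is an Open MPI symbol. */
int demo_verbosity = 0;
int demo_format_calls = 0;

/* Format and emit one summary message, counting the formatting work done. */
void demo_log_summary(int launched)
{
    char buf[64];

    assert(launched >= 0);
    snprintf(buf, sizeof(buf), "launched %d peers", launched);
    ++demo_format_calls;
    fprintf(stderr, "%s\n", buf);
}

int demo_launch_all(int npeers)
{
    int launched = 0;

    for (int i = 0; i < npeers; ++i) {
        ++launched;  /* per-peer work goes here, with no logging */
    }

    /* One formatting call in total instead of one per peer. */
    if (demo_verbosity >= 10) {
        demo_log_summary(launched);
    }

    return launched;
}
```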
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit adds a new base enumerator type for variables that take one
of the values -1, 0, and 1. These values are mapped to the strings auto,
false, true. This commit updates the mpi_leave_pinned MCA variable to
use the new enumerator.
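The value/string mapping described above can be sketched as a small lookup table. This is only an illustration of the mapping; the type and function names are hypothetical and are not the MCA base enumerator interface.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical value/string pair. */
typedef struct {
    int value;
    const char *string;
} demo_enum_value_t;

static const demo_enum_value_t demo_auto_bool_values[] = {
    { -1, "auto"  },
    {  0, "false" },
    {  1, "true"  },
};

#define DEMO_AUTO_BOOL_COUNT \
    (sizeof(demo_auto_bool_values) / sizeof(demo_auto_bool_values[0]))

/* Look up the value for a string; returns 0 on success, -1 if unknown. */
int demo_auto_bool_value_from_string(const char *str, int *value)
{
    assert(str != NULL && value != NULL);

    for (size_t i = 0; i < DEMO_AUTO_BOOL_COUNT; ++i) {
        if (0 == strcmp(str, demo_auto_bool_values[i].string)) {
            *value = demo_auto_bool_values[i].value;
            return 0;
        }
    }
    return -1;
}

/* Look up the string for a value; returns NULL if the value is not valid. */
const char *demo_auto_bool_string_from_value(int value)
{
    for (size_t i = 0; i < DEMO_AUTO_BOOL_COUNT; ++i) {
        if (value == demo_auto_bool_values[i].value) {
            return demo_auto_bool_values[i].string;
        }
    }
    return NULL;
}
```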
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Add verbose output to show all the failed attempts to match the
remote interfaces based on the modex info.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
It's possible that we can have zlib.h but still not have zlib support.
Use the correct macro to protect the usage of calling zlib functions.
This fixes 32-bit MTT builds at Cisco (e.g.,
https://mtt.open-mpi.org/index.php?do_redir=2389).
Submitted upstream to PMIX: https://github.com/pmix/master/pull/290
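The distinction between "the header exists" and "zlib is usable" can be sketched as follows. The macro DEMO_HAVE_ZLIB is hypothetical and stands for whatever the configure script defines once both the header and the library have been verified; it is not the macro used in the actual fix.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* DEMO_HAVE_ZLIB is a hypothetical macro meaning "header AND library work".
 * Guarding on a header-only check would still compile calls that cannot link. */
#if defined(DEMO_HAVE_ZLIB)
#include <zlib.h>
#endif

/* Compress inlen bytes from in to out. On entry *outlen is the capacity of
 * out; on success it is the compressed size. Returns false when zlib support
 * is not available or compression fails. */
bool demo_compress(const unsigned char *in, size_t inlen,
                   unsigned char *out, size_t *outlen)
{
    assert(outlen != NULL);
#if defined(DEMO_HAVE_ZLIB)
    uLongf len = (uLongf) *outlen;

    if (Z_OK != compress(out, &len, in, (uLong) inlen)) {
        return false;
    }
    *outlen = (size_t) len;
    return true;
#else
    (void) in;
    (void) inlen;
    (void) out;
    return false;  /* caller falls back to sending uncompressed data */
#endif
}
```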
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
PMIx_server_register_nspace() is an asynchronous operation, so
the pmix glue must wait for it to complete before returning.
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
the argc field from the opal_pmix_app_t struct was removed,
so adjust the pmix/ext11 glue accordingly.
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
- New MCA option: opal_stacktrace_output
- Specifies where the stack trace output stream goes.
- Accepts: none, stdout, stderr, file[:filename]
- Default filename 'stacktrace'
- Filename will be `stacktrace.PID`, or if VPID is available,
then the filename will be `stacktrace.VPID.PID`
- Update util/stacktrace to allow for different output avenues
including files. Previously this was hardcoded to 'stderr'.
- Since opal_backtrace_print needs to be signal safe, passing it a
FILE object that actually represents a file stream is difficult. This
is because we cannot open the file in the signal handler using
`fopen` (not safe), but have to use `open` (safe). Additionally, we
cannot use `fdopen` to convert the `int fd` to a `FILE *fh` since it
is also not signal safe.
- I did not want to break the backtrace.h API so I introduced a new
rule (documented in `backtrace.c`) that if the `FILE *file`
argument is `NULL` then look for the `opal_stacktrace_output_fileno`
variable to tell you which file descriptor to use for output.
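The signal-safety constraint above can be illustrated with a short sketch: the filename is built during normal start-up, and the handler path uses only `open`, `write`, and `close`, which POSIX lists as async-signal-safe. All demo_* names are hypothetical and this is not the util/stacktrace implementation.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical globals prepared before any signal can arrive. */
char demo_stacktrace_filename[256];

/* Normal start-up code (NOT a signal handler), so snprintf is fine here.
 * A negative vpid means "VPID not available". */
void demo_stacktrace_setup(const char *base, int vpid)
{
    assert(base != NULL);
    if (vpid >= 0) {
        snprintf(demo_stacktrace_filename, sizeof(demo_stacktrace_filename),
                 "%s.%d.%ld", base, vpid, (long) getpid());
    } else {
        snprintf(demo_stacktrace_filename, sizeof(demo_stacktrace_filename),
                 "%s.%ld", base, (long) getpid());
    }
}

/* Safe to call from a signal handler: no fopen, no fdopen, no FILE objects.
 * Returns 0 if the whole message was written, -1 otherwise. */
int demo_stacktrace_emit(const char *msg, size_t len)
{
    int fd = open(demo_stacktrace_filename, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    ssize_t rc;

    if (fd < 0) {
        return -1;
    }
    rc = write(fd, msg, len);
    close(fd);

    return (rc == (ssize_t) len) ? 0 : -1;
}
```

A real handler would pass the file descriptor on to the backtrace printer rather than writing a fixed message, but the open/write split is the same.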
Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>