This component is a workaround to a bug in libibverbs that prints a
dire warning that usNIC devices are not supported (of course not --
usNIC devices provide functionality through libfabric, not
libibverbs). This component was written before a better workaround
was created: a "no op" libibverbs plugin for usNIC devices
(https://github.com/cisco/libusnic_verbs, and is also available in
binary form on cisco.com).
Hence, this component no longer builds by default. It's still
available if a user specifically asks for it (e.g., if they do not
want to install the "no op" libibverbs plugin), but it's not the
default. This component also has the side-effect of making
libopen-pal.so depend on libibverbs.so, which can be annoying for
packagers (which is another reason it isn't built by default any
more).
The send code in the ugni btl has an optimization that enables it to
return 1 (fragment gone) in some cases. This optimization involved
removing the btl ownership and callback flags to ensure the fragment
stuck around long enough for its completion flag to be checked. This
works fine for the single-threaded case but not in the multi-threaded
case. It is possible that a fragment will be completed by another
thread while a thread is in mca_btl_ugni_send. This competition can
lead to a leaked fragment, missed callback, or both. To fix the issue
without removing the optimization a reference count has been added to
the fragment. Callbacks and fragment release will not be made until
the fragment reference count has reach 0. The count is incremented
before sending the frag and decremented after the completion flag has
been checked. The fix has been verified to work using a multi-threaded
RMA benchmark with the osc/pt2pt component.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit fixes a race condition that can cause an endpoint to be
added to the wait list multiple times. To fix the issue an additional
check has been added to ensure the endpoint is not on the wait list
after the wait list lock is held. The wait list processing code has
also been updated to keep the wait list lock until all wait listed
endpoints have been handled. This reduces the chance that an endpoint
that is being processed by the wait list code is not re-added to the
list by a competing send.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Three minor updates from the code review of
https://github.com/open-mpi/ompi-release/pull/933:
* Remove an extra blank line a show_help message
* We no longer allow -1 for the MCA param btl_usnic_av_eq_num, so
change the flag to REGINT_GE_ONE
* Change "num_blocks" definition to be in terms of block_len (not
eq_size)
A bunch of empirical testing has shown that increasing the retranmit
timeout from 1ms to 5ms doesn't adversely affect performance, yet
decreases the number of gratuitious retransmissions.
Add endpoints in a blocked manner so that we don't overrun the
fi_av_insert() event queue. Also make the AV EQ length an MCA param,
and report it in mca_btl_base_verbose >=5 output.
Sequence numbers will wrap around; it is not sufficient to check for
(seq-1) -- must use the SEQ_DIFF macro to properly handle the
wraparound.
This bug wasn't serious; it just meant we might retransmit one or two
extra times when retransmits were triggerd and the sequence numbers
wrapped around their sliding windows.
when SMT is enabled, a core must be counted as long as one of its hwthread is allowed
Thanks Ben Menadue for the report.
This fixes a regression from open-mpi/ompi@6d149554a7
The eviction callback, for convenience (and to avoid code
duplication), use to call opal_hotel_checkout(). However,
opal_hotel_checkout() deletes the eviction event -- which is fine to
do when opal_hotel_checkout() is invoked by the application. But when
it's invoked by the same event that it's deleting, it can cause Bad
Things to happen.
For simplicity, instead of invoking opal_hotel_checkout() from the
eviction callback, just duplicate the checkout logic into the eviction
callback function (and skip the delete-the-evict-event part).
For good measure, put a comment in all three places where the checkout
logic occurs (because it's inlined): don't change this logic without
changing all 3 places.
Finally, also add a line in the docs for opal_hotel_init() warning
users from calling opal_hotel_checkout() from their eviction
callback.
`cm_message_event_active == 1` but main thread has already stopped
processing messages and thus we will have the situation where one
message was left unhandled leading to a hang.
setmntent() doesn't support root_fd, but manual parsing of
/proc/mounts is fragile, and actually buggy for very long mount lines
(see open-mpi/hwloc#142 (comment)).
Since we only openat("/proc/mounts") there, just manually concatenate
the fsroot_path and use setmntent().
Thanks to Nathan Hjelm for the report.
(Cherry-picked from open-mpi/hwloc@d2d07b9a22)
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Per open-mpi/ompi#1230, add a comment explaining why we write to a
temporary file and then rename(2) the file, just so that future code
maintainers don't wonder why we do this seemingly-useless step.
Update the configure logic for the new pmix120 component
ckpt
Get the pmix120 component to work - still not really registering or handling notifications, but infrastructure now operates
Cleanup some of the symbol scopes, and provide a more comprehensive rename.h file. Will pretty it up later - let's see how this works
Cleanup the rename files to use the pretty macros
The problem was in mca_btl_openib_proc_create. This function may be called
from several places simultaneously:
* from the main thread when somebody wants to do `MPI_Send()` (for example) for
the first time;
* from udcm if the counterpart peer is trying to connect and `mca_btl_openib_get_ep()`
is called.
In this case one of the threads may add an uninitialized proc structure
to the `mca_btl_openib_component.ib_procs` and the other will read it and
treat as initialized.
This commit turns ib_proc initialization into a single atomic operation.
Move .so version numbers to their appropriate project in the top-level
VERSION file. Also add the project name to all .so version number
names. Remove no-longer-used .so names.
NOTE: Building with external pmix *requires* that you also build with external libevent and hwloc libraries. Detect this at configure and error out with large message if this requirement is violated.
Closes#1204 (replaces it)
Fixes#1064
It was decided some time ago that there is no benefit to using any
per-peer receive queues on infiniband. At the time we decided not to
change the default but that objection has been dropped. This commit
changes the 128 message queue to use SRQ instead of PP. This has no
impact on iWarp which sets the default in a different way.
Closesopen-mpi/ompi#1156
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
When PtlMDBind was removed, the result check was left in which
causes intermittent failures depending on the junk value found in
the 'ret' variable. The commit removes the result check.
Rename the pmix1xx component to pmix111 so it reflects the actual release it includes
Resolve the problem of PMIx being passed a bogus --with-platform argument when configuring the PMIx tarball code. There is no reason we should be passing --with-platform arguments to any internal subdirectory, so just leave that out when constructing the opal_subdir_args variable.
Update the PMIx code and continue attempting to debug direct modex
Fix a problem in the ORTE PMIx server - there was an early intent to optimize the direct modex by fetching data for all procs from the target job on the remote node, instead of fetching the data one proc at a time. However, this was never completely implemented, and so we would hang if we had multiple overlapping requests for data from more than one proc on the node.
Update PMIx to v1.1.2
This commit removes a check that causes mca_base_group_register to
improperly create a new group instead of using an existing group
when the project and framework names are the same. This check was
originally intended to prevent forming groups with names like
ompi_ompi, opal_opal, etc but there is no reason why we shouldn't
allow that.
Fixesopen-mpi/ompi#1155
Signed-off-by: Nathan Hjelm <hjelmn@me.com>
Mofed 2.2 does not have the IBV_EXP_QP_INIT_ATTR_ATOMICS_ARG attribute
flag. Add a check to fix compilation for mofed 2.2. This commit only
fixes complilation with the older mofed. It will not allow an Open MPI
compiled with mofed 2.3 or newer to work on a machine with mofed 2.2.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
There was a bug with the way the cray pmix component
was setting the locality property for ranks on the
same node, etc.
Improve location/syntax of a comment block.
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
If btl-portals4 is configured to use logical mapping of ranks to
physical nodes, then the endpoint must have the rank field set.
This commit fixes a bug that caused the endpoint to have the
nid/pid instead of the rank if the endpoint already exists.
some of the collective modules. Added a new function
opan_datatype_span, to compute the memory span of
count number of datatype, excluding the gaps in the
beginning and at the end. If a memory allocation is
made using the returned value, the gap (also returned)
should be removed from the allocated pointer.
This update adds an additional check (if supported) to see if 8-byte
atomics are supported by the hardware. If 8-byte atomics are not
supported the atomics support is disabled.
This commit also includes some cleanup.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit allows to control output during abnormal oshmem/ompi application
termination.
Fixed issue in backtrace output. HAVE_BACKTRACE was never set so user was limited
in control of this variable.
Two related mca variables are moved to opal layer. Corresponding aliases are
added for ompi and oshmem.
This commit adds support for fetch-and-add and compare-and-swap when
using the mlx5 driver. The support is only enabled if the expanded
verbs interface is detected. This is required because mlx5 HCAs return
the atomic result in network byte order. This support may need to be
tweaked if Mellanox commits their changes into upstream verbs.
Closesopen-mpi/ompi#1077
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
OPAL free lists can be initialized with a fragment size that differs
from the size of objects from a class. This allows the free list code
to support OPAL objects that have flexible array members.
Unfortunately the free list code will throw out the desired length in
some cases. The code in question was committed in
open-mpi/ompi@90fb58de. The side effects of this are varied and can
cause segmentation faults, assert failures, hangs, etc. This commit
adds a check to ensure the requested size is at least as large as the
class size and makes opal_free_list allocations always honor the
requested fragment size (as long as it is larger than the class
size).
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
1. Fix: old v1.6-era code reset the stats-emitting event to fire twice
for each time period.
1. Add the usNIC device name to the output for differentiating the
output in multi-rail scenarios.
This commit revives the component retention functionality that was
removed as part of the component repository rewrite. The new
mca_base_component_repository_retain_component function works by
preventing the dlclosing of a dynamic component until a matching call
to mca_base_component_repository_release is made.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
to continue current default behavior.
Also add an MCA param pmix_base_collect_data to direct that the blocking fence shall return all data to each process. Obviously, this param has no effect if async_
modex is used.
Move the fake usnic IBV provider out of common/verbs and into a new
common/verbs_usnic component that is always statically linked into
libopen-pal. The fake provider is registered with libibverbs at run
time, but there is no *un*register IBV API. Hence, we can't let the
code containing this provider be dlclosed -- which means it needs to
be statically linked into libopen-pal.
Fixesopen-mpi/ompi#1060.
if CFLAGS and/or CPPFLAGS are passed to the ompi configure command line, pmix1xx
configure will not use the correct ones previously passed in the environment
see discussion started at http://www.open-mpi.org/community/lists/devel/2015/10/18159.php
Thanks Siegmar Gross for bringing this to our attention
This commit fixes the following bugs:
- On send failure release newly allocated message.
- In the destructor for udcm_message_sent_t always remove the send
timeout event from the event base. Failure to do this can lead to
memory corruption since the destructor may be called from an event
callback.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This was fixed on my btl 3.0 branch but the changeset got lost in a
rebase. Fixes issues with lock ups when using osc/rdma.
References open-mpi/ompi#1010
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
The mca_base_select function uses returned priorities to select the
best component/module. This priority may be of use to the caller so
pass that information back in an optional argument. If the priority is
not needed pass NULL.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
The handling of RDMA get alignment in ugni BTL for Aries
(cray xc) was wrong, resulting in very poor bandwidth
for ugni BTL on aries.
Verified using osu_bw now gives sensible bandwidth on
Aries.
Fixes#1005
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
This commit removes the service and async event threads from the
openib btl. Both threads are replaced by opal progress thread
support. The run_in_main function is now supported by allocating an
event and adding it to the sync event base. This ensures that the
requested function is called as part of opal_progress.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit adds a access_flags argument to the mpool registration
function. This flag indicates what kind of access is being requested:
local write, remote read, remote write, and remote atomic. The values
of the registration access flags in the btl are tied to the new flags
in the mpool. All mpools have been updated to include the new argument
but only the grdma and udreg mpools have been updated to make use of
the access flags. In both mpools existing registrations are checked
for sufficient access before being returned. If a registration does
not contain sufficient access it is marked as invalid and a new
registration is generated.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Otherwise libpciaccess sends a big error message to stderr:
Error opening /devices/pci@0,0:reg: Permission denied
(cherry picked from commit open-mpi/hwloc@d93c7c0960)
In the default mode of operation, the Portals4 components support
dynamic add_procs().
The Portals4 components have two alternate modes (flow control and
logical-to-physical) that require knowledge of all procs at startup.
In these modes, mtl-portals4 sets the MCA_MTL_BASE_FLAG_REQUIRE_WORLD
flag and btl-portals4 sets the MCA_BTL_FLAGS_SINGLE_ADD_PROCS flag
to tell the PML that we need all the procs in one add_procs() call.
After PMIx integration, the thrid parameter to OPAL_MODEX_RECV() is
opal_process_name_t instead of opal_proc_t. This commit replaces
proc with &proc->proc_name.