This reverts commit 6acebc40a1.
This patch is causing numerous "Socket closed" messages which are
causing most of the failures on Cisco's MTT run. See
https://github.com/open-mpi/ompi/issues/5849 for more information.
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
It is apparently possible for different instances of the same UCT
transport to have different limits (max short put for example). To
account for this we need to store the attributes per TL context not
per TL. This commit fixes the issue.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
While trying to debug #3035, it's not clear whether there is
an issue with the modex data or printing the address list.
Print the number of endpoints on the error, which will help
determine which case is happening to Cisco.
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
When creating TCP BTL modules, print more information about the
module's ethernet association, including the first address associated
with the device, as debug output.
Fix a flipped output string for IPv4 and IPv6 addresses in the
modex send code.
Add the addresses being published in the modex to the debugging
output in modex send, to help match failures in endpoint match.
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
This commit updates the uct btl to change the transports parameter
into a priority list. The dc_mlx5, rc_mlx5, and ud transports to the
priority list. This will give better out of the box performance for
multi-threaded codes beacuse the *_mlx5 transports can avoid the mlx5
lock inside libmlx5_rdmav2.
This commit also fixes a number of leaks and a possible deadlock when
using RDMA.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
The Open MPI code base assumed that asprintf always behaved like
the FreeBSD variant, where ptr is set to NULL on error. However,
the C standard (and Linux) only guarantee that the return code will
be -1 on error and leave ptr undefined. Rather than fix all the
usage in the code, we use opal_asprintf() wrapper instead, which
guarantees the BSD-like behavior of ptr always being set to NULL.
In addition to being correct, this will fix many, many warnings
in the Open MPI code base.
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
This commit works around an Oracle C compiler bug in 5.15 (not sure
when it was introduced). The bug is triggered when we chain
assignments of atomic variables. Ex:
_Atomic intptr x, y;
intptr_t z = 0;
x = y = z;
Will produce a compiler error of the form:
operand cannot have void type: op "="
assignment type mismatch:
long "=" void
To work around the issue we are removing the chain assignment and
setting the head and tail on different lines.
Fixes#5814
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
On some platfoms reading a 64-bit value is non-atomic and it is
possible that the two 32-bit values are read in the wrong order. To
ensure the tag is always read first this commit reads the tag before
reading the full 64-bit value.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Get Brian's patch from #5825 and his log message:
Fix a failure in binding the initiating side of a connection
on MacOS. MacOS doesn't like passing the size of the storage
structure (sockaddr_storage) instead of the expected size of
the structure (sockaddr_in or sockaddr_in6), which was causing
bind() failures. This patch simply changes the structure size
to the expected size.
Add a more clear error message in debug mode.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
Per
https://github.com/open-mpi/ompi/issues/3035#issuecomment-426085673,
it looks like the IP address for a given interface is being stashed in
two places: on the endpoint and on the module.
1. On the endpoint, it is storing the moral equivalent of a
(struct sockaddr_in.sin_addr).
2. On the module, it is storing a full (struct sockaddr_storage).
The call to opal_net_get_hostname() expects a full (struct sockaddr*)
-- not just the stripped-down (struct sockaddr_in.sin_addr). Hence,
when the original code was passing in the endpoint's (struct
sockaddr_in.sin_addr) and opal_net_get_hostname() was treating it
like a (struct sockaddr), hilarity ensued (i.e., we got the wrong
output).
This commit eliminates the call to opal_net_get_hostname() and just
calls inet_ntop() directly to convert the (struct
sockaddr_in.sin_addr) to a string.
NOTE: Per the github comment cited above, there can be a disparity
between the IP address cached on the endpoint vs. the IP address
cached on the module. This only happens with interfaces that have
more than one IP address. This commit does not fix that issue.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
In many cases, this was a simple string replace. In a few places, it
entailed:
1. Updating some comments and removing now-redundant foo[size-1]='\0'
statements.
2. Updating passing (size-1) to (size) (because opal_string_copy()
wants the entire destination buffer length).
This commit actually fixes a bunch of potential (yet quite unlikely)
bugs where we could have ended up with non-null-terminated strings.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Disable async receive for CUDA under OpenIB. While a performance
optimization, it also causes incorrect results for transfers
larger than the GPUDirect RDMA limit. This change has been validated
and approved by Akshay.
References #3972
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
This can be returned when running on QEMU user-mode emulation,
which does not support getsockopt with SO_RCVTIMEO.
Signed-off-by: Michael Kuron <mkuron@icp.uni-stuttgart.de>
This commit updates the entire codebase to use specific opal types for
all atomic variables. This is a change from the prior atomic support
which required the use of the volatile keyword. This is the first step
towards implementing support for C11 atomics as that interface
requires the use of types declared with the _Atomic keyword.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
To ensure fast box entries are complete when processed by the
receiving process the tag must be written last. This includes a zero
header for the next fast box entry (in some cases). This commit fixes
two instances where the tag was written too early. In one case, on
32-bit systems it is possible for the tag part of the header to be
written before the size. The second instance is an ordering issue. The
zero header was being written after the fastbox header.
Fixes#5375, #5638
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Since openib is on its long, slow way out the door, don't let it
complain about not being able to find any NICs at run time.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
This commit fixes a bug when using the UCT btl with the UCX memory
hooks disabled. We were misssing a call to
opal_mem_hooks_unregister_release to remove the btl memory hook
callback.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
bugfix: major: openib send credits returned correctly after a fault for pending frags to dead processes; also tweak the default IB retry timeouts tomake this happen faster
Make it compile in non-debug builds
Mark the IB endpoint as failed when invoking an error; this resolves UDCM connection deadlocks
Changing the default IB retry timeouts is not a good idea.
We'll need to find another way to speedup credit recovery in failure cases.
Remove ULFM specific cases
Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>
When an error is returned by the socket operations, trigger the
appropriate error path in the PML to give an opportunity for
rerouting/error handling.
Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>
The write memory barrier was intended to precede setting a fast-box
header but instead follows it. This commit moves the memory barrier to
the intended location.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
The 2 sided communication support is added for non-tagmatching provider
to take advantage of this BTL and PML OB1. The current state is
"functional" and not optimized for performance.
Two sided support is disabled by default and can be turned on by mca
parameter: "mca_btl_ofi_mode".
Signed-off-by: Thananon Patinyasakdikul <thananon.patinyasakdikul@intel.com>
Due to decreasing support by vendors/other orgs for the OpenIB BTL,
only look for iWarp/RoCE devices by default. Allow IB HCAs
with ports configured for ethernet.
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
OFI BTL uses context for completion but never ask for it in
fi_getinfo(3). This commit makes sure that we always ask for FI_CONTEXT
to eliminate any potential error.
Signed-off-by: Thananon Patinyasakdikul <thananon.patinyasakdikul@intel.com>
This commit fixes two bugs in the RMA/atomic emulation code:
1) Fix a fragment leak when using AMO emulation.
2) Always initialize the single-copy emulation code. This is required
to use the AMO support.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
The descriptor flags field in a fragment were being ready after the
fragment may have been freed. This commit reads the flags before
calling the user callback.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit adds support for atomic operations as well as rdma for
systems without rdma support. This support is implemented using an
internal send tag.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
The current cast is *functional*, but isn't really the way it should
be done. This commit makes the cast the way it should be done.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Fix two facepalms:
1. The "uint32" in the hash map functions refer to the *key* size, not
the *value* size. The values are always 64 bits.
2. Pass the straight value to the "set" functions -- not the pointer
to the value.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
This commit changed the way btl/ofi call progress. Before, we force
progression with every rdma/atomic call. This gives performance boost in
some case and slow down on others. Now we only force progression after
some number of rdma calls which result in better performance overall.
Also added new MCA parameter 'mca_btl_ofi_progress_threshold' to set
the threshold number. The new default is 64.
Also:
Added FI_DELIVERY_COMPLETE to tx_rtx flags to ensure that the completion
is generated after the message has been received on the remote side.
Signed-off-by: Thananon Patinyasakdikul <thananon.patinyasakdikul@intel.com>
This commit improves the injection rate and latency for RDMA
operations. This is done by the following improvements:
- If C11's _Thread_local keyword is available then always use the
same virtual device index for the same thread when using RDMA. If
the keyword is not available then attempt to use any device that
isn't already in use. The binding support is enabled by default but
can be disabled via the btl_ugni_bind_devices MCA variable.
- When posting FMA and RDMA operations always attempt to reap
completions after posting the operation. This allows us to
better balance the work of reaping completions across all
application threads.
- Limit the total number of outstanding BTE transactions. This
fixes a performance bug when using many threads.
- Split out RDMA and local SMSG completion queue sizes. The RDMA
queue size is better tuned for performance with RMA-MT.
- Split out put and get FMA limits. The old btl_ugni_fma_limit MCA
variable is deprecated. The new variable names are:
btl_ugni_fma_put_limit and btl_ugni_fma_get_limit.
- Change how post descriptors are handled. They are no longer
allocated seperately from the RDMA endpoints.
- Some cleanup to move error code out of the critical path.
- Disable the FMA sharing flag on the CDM when we detect that there
should be enough FMA descriptors for the number of virtual devices
we plan will create. If the user sets this flag we will not unset
it. This change should improve the small-message RMA performance by
~ 10%.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit adds a new btl for one-sided and two-sided. This btl
uses the uct layer in OpenUCX. This btl makes use of multiple uct
contexts and per-thread device pinning to provide good performance
when using threads and osc/rdma. This btl has been tested extensively
with osc/rdma and passes all MTT tests on aries and IB hardware.
For now this new component disables itself but can be enabled by
setting the btl_ucx_transports MCA variable with a comma-delimited
list of supported memory domains/transport layers. For example:
--mca btl_uct_memory_domains ib/mlx5_0. The specific transports used
can be selected using --mca btl_uct_transports. The default is to use
any available transport.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
The giant size of the TCP proc struct is causing a problem in some
environments (because it is allocated on the stack), and it was too
big, anyway.
Instead, use a hash map. That way, it starts small and can grow if it
needs to. It also makes no assumptions about the values of the kernel
interface indexes.
Fixes#5292.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
This commit add support for scalable endpoint to enhance multithreaded
application performance. The BTL will detect the support from ofi
provider and will fallback to normal usage of scalable endpoint is not
supported.
NEW MCA parameters:
- mca_btl_ofi_disable_sep: force the btl to not use scalable endpoint.
- mca_btl_ofi_num_contexts_per_module: number of communication context
to create (should be the same as number of thread).
Signed-off-by: Thananon Patinyasakdikul <thananon.patinyasakdikul@intel.com>
FI_MR_UNSPEC is not supposed to be used beyond ofi version 1.5. This
commit replaces FI_MR_UNSPEC with the new FI_MR_BASIC mode bits
(FI_MR_PROV_KEY | FI_MR_ALLOCATED | FI_MR_VIRT_ADDR).
The btl functionality remains the same.
Signed-off-by: Thananon Patinyasakdikul <thananon.patinyasakdikul@intel.com>
OFI sockets provider will sometime return FI_EINTR. Apparently this flag
does not exist in OFI version 1.5. This commit added an ifdef check.
Signed-off-by: Thananon Patinyasakdikul <thananon.patinyasakdikul@intel.com>
This commit fixes problem where users have libfabric version < 1.5
installed and build failed because the new MR modebits does not exist.
Signed-off-by: Thananon Patinyasakdikul <thananon.patinyasakdikul@intel.com>
This commit added new transport layer to be used with osc rdma module.
This BTL provides put, get, atomic and fetch atomic operations. It can
be used with multiple hardware vendors as long as they have their
provider under Libfabric and have the right capabilities.
Signed-off-by: Thananon Patinyasakdikul <thananon.patinyasakdikul@intel.com>
This commit fixes a segfault in btl-portals4 add_procs(). The segfault
occurs if add_procs() is called after a del_procs() call that reduces
the proc count to zero which would cause PT and NI resources to be
freed. This commit resolves the segfault by using a common
initiailization boolean and only freeing module resources in
finalize().
Signed-off-by: Todd Kordenbrock (thkgcode@gmail.com)
Some of the show_help() messages that were added in 40afd525f8 were
really normal / expected behavior (e.g., if 2 peers connect in the TCP
BTL more-or-less simultaneously, one of them will drop the connection
-- no need to show_help() about this; it's expected behavior). Roll
back these messages to be opal_output_verbose() kinds of messages.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
sizeof(addrs[0].addr_inet)==16 (so that it can handle IPv6 addresses),
but the memory that we are copying from (my_ss->sin_addr) is only 4
bytes long. Don't copy beyond the end of that source buffer.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Important note :
According to the man page
"On success, process_vm_readv() returns the number of bytes read and
process_vm_writev() returns the number of bytes written. This return
value may be less than the total number of requested bytes, if a
partial read/write occurred. (Partial transfers apply at the
granularity of iovec elements. These system calls won't perform a
partial transfer that splits a single iovec element.)"
So since we use a single iovec element, the returned size should either
be 0 or size, and the do loop should not be needed here.
We tried on various Linux kernels with size > 2 GB, and surprisingly,
the returned value is always 0x7ffff000 (fwiw, it happens to be the size
of the larger number of pages that fits a signed 32 bits integer).
We do not know whether this is a bug from the kernel, the libc or even
the man page, but for the time being, we do as is process_vm_readv() could
return any value.
Thanks Heiko Bauke for the bug report.
Refs. open-mpi/ompi#4829
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Fix case where the btl_tcp_links MCA parameter is used to create multiple TCP connections between peers.
Three issues were resulting in hangs during large message transfer:
* The 2nd..btl_tcp_link connections were dropped during establishment because the per-process
address check was binary, rather than a count
* The accept handler would not skip a btl module that was already in use, resulting in all
connections for a given address being vectored to a single btl
* Multiple addresses in the same subnet caused connections to be
stalled, as the receiver would always use the same (first) address
found. Binding the outgoing connection solves this issue
* Lastly fix race condition created by connections being started at the exact same time
by accpeting connections not in the closed state, allowing endpoint_accept to resolve
dispute
Signed-off-by: Jordan Cherry <cherryj@amazon.com>
This commit replaces the current VMA tree implementation with one that
uses the new opal_interval_tree_t class. Since the VMA tree lock is no
longer used this commit also updates rcache/grdma and btl/vader to
take better care when searching for existing registrations.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Follow-on to 8097d09858: now that BTL_VERSION is defined in btl.h, be
a little smarter about whether we define it or not.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
This commit adds a new optional function to the BTL module:
btl_flush. This function takes an optional BTL endpoint. When called
this function completes all outstanding RDMA and atomic operations
started prior to the call to btl_flush.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Since open-mpi/ompi@47fd2313ab
the backing file is now in /dev/shm by default. As a consequence,
the backing file name has to include the jobid so more than one job
can run at a time.
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Resolve a race condition between registering for a file to be removed upon termination and actual creation of that file by providing attributes that identify whether the path is a file or directory. This removes the need for PMIx to detect the difference.
Refs #4686
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
This commit moves the backing files to /dev/shm to avoid limitations
that may be set on /tmp. The files are registered with pmix to ensure
they are cleaned up after an erroneous exit.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
(cherry picked from commit 48101278160672317ade352365592f56ef3b8977)
If available, have apps use registration capability to cleanup their session directories. Setup capability for vader to register its shared memory file location - let someone familiar with that code do so.
Final cleanup to track uid/gid, update the opal/pmix API to pass flags for ignore and leave top directory alone
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
There were multiple paths that could lead to a fast box
allocation. One of them made little sense (in-place send) so it has
been removed to allow a rework of the fast-box send function. This
should fix a number of issues with hanging/crashing when using the
vader btl.
References #4260
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit adds support for fetch-and-op atomics. This is needed
because and and or are irreversible operations so there needs to be a
way to get the old value atomically. These are also the only semantics
supported by C11 (there is not atomic_op_fetch, just
atomic_fetch_op). The old op-and-fetch atomics have been defined in
terms of fetch-and-op.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit renames the arithmetic atomic operations in opal to
indicate that they return the new value not the old value. This naming
differentiates these routines from new functions that return the old
value.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit eliminates the old opal_atomic_bool_cmpset functions. They
have been replaced by the opal_atomic_compare_exchange_strong
functions.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit renames the atomic compare-and-swap functions to indicate
the return value. This is in preperation for adding support for a
compare-and-swap that returns the old value. At the same time the
return type has been changed to bool.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Their is racing condition in TCP connection establishment
during simultaneous handshake. This PR handles the fix for
it.
Signed-off-by: Mohan Gandhi <mohgan@amazon.com>
Cisco wrote a bipartite graph solver to properly solve
interface pair selection for usNIC. Using the reachable
framework, the TCP BTL (and possibly the runtime network
code) can use the graph solver to make more optimal pair
selection. Jeff was happy to have the code more broadly
used, but didn't have time to do the move, hence this
commit.
There are a couple of minor changes to the code compared
to the usNIC version. Obviously, the functions have
been renamed to match naming convention for their new
home. Since it's easier to write unit tests for
util/ code, the unit tests have been made first class
tests run at "make check" time. This last bit required
moving some of the definitions into a new header,
bipartite_graph_internal.h, so that they could be
included in both the library code and the test code.
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
This commit adds the code necessary to support forming connections across
subnets. The primary changes are to 1) add the gid to the modex, and 2)
use the gid to create the address handle.
Signed-off-by: Nathan Hjelm <hjelmn@me.com>
Some OSes have hardcoded limits to prevent overflowing over an int32_t.
We can either detect this at configure (which might be a nicer but
incomplete solution), or always force the pipelined protocol over TCP.
As it only covers data larger than 1GB, no performance penalty is to be
expected.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
as the writev and readv support a sum larger than a uint32_t
this version will work. For the other OSes a different patch
is required. This patch is a slight modification of the one
proposed by @ggouaillardet.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
* Resolves#3705
* Components should link against the project level library to better
support `dlopen` with `RTLD_LOCAL`.
* Extend the `mca_FRAMEWORK_COMPONENT_la_LIBADD` in the `Makefile.am`
with the appropriate project level library:
```
MCA components in ompi/
$(top_builddir)/ompi/lib@OMPI_LIBMPI_NAME@.la
MCA components in orte/
$(top_builddir)/orte/lib@ORTE_LIB_PREFIX@open-rte.la
MCA components in opal/
$(top_builddir)/opal/lib@OPAL_LIB_PREFIX@open-pal.la
MCA components in oshmem/
$(top_builddir)/oshmem/liboshmem.la"
```
Note: The changes in this commit were automated by the script in
the commit that proceeds it with the `libadd_mca_comp_update.py`
script. Some components were not included in this change because
they are statically built only.
Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
This commit has two changes
1. Adding magic string during handshake can cause
issue when used with older version of MPI. Hence set
RCVTIMEO paramter to 2 second
2. Using single call during handshake instead of
two calls
Signed-off-by: Mohan Gandhi <mohgan@amazon.com>
As part of improvement towards tcp debugging
we are moving few BTL_ERROR to show_help and also
update the function behaviour of
mca_btl_tcp_endpoint_complete_connect to return
SUCCESS and ERROR cases.
Signed-off-by: Mohan Gandhi <mohgan@amazon.com>
As part of improvement towards handling failure case
in btl tcp we are using magic string to verify mpi
connection. In case if there is mismatch or missing
magic string we can identify that we are trying to
connect with someother process.
Signed-off-by: Mohan Gandhi <mohgan@amazon.com>
Moving non-blocking send/receive function to btl_tcp
will help reusing these function where ever needed.
In this case we plan to reuse receive function to
retrive magic string to validate established connection
is from mpi process.
Signed-off-by: Mohan Gandhi <mohgan@amazon.com>
Not sure how/when this got deleted, but put back the "Cisco usNIC"
line in the transport summary at the end of configure.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
usnic endpoints was always created with default send credit value of 8. This
commit assign the correct number from the hardware instead.
Signed-off-by: Thananon Patinyasakdikul <apatinya@cisco.com>
Update to support passing of HWLOC shmem topology to client procs
Update use of distance API per @bgoglin
Have the openib component lookup its object in the distance matrix
Bring usnic up-to-date
Restore binding for hwloc2
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
If a user explicitly asks for the "sm" BTL, print a show_help message
saying that the SM BTL is dead, and the user should be using "vader".
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
This commit ensures that the pml callback is always made when
sending fragments. This is needed to avoid #3845. Once that is
fixed the #if 0'd code can be restored.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Based on an idea from Brian move the libevent trigger update to a later
stage instead of the generic add/del procs. So, we are doing the
increment/decrement when we register the recv handler for an endpoint,
so basically when we create and connect a socket to a peer. The benefit
is that as long as TCP is not used, there should be no impact on the
performance of other BTLs. The drawback is that the first TCP connection
will be slightly slower, but then once we have a peer connected over
TCP things go back to normal.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
This commit cleans up code in opal to use OPAL_LIST_FOREACH(_SAFE),
OPAL_LIST_DESTRUCT, and OPAL_LIST_RELEASE.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>