1
1

5656 Коммитов

Автор SHA1 Сообщение Дата
William Zhang
a471f8749f reachable: Update documentation on reachable function
We have decided to show interfaces that are identical to itself as
reachable. This is consistent with the previous netmask logic when
determining reachability.

Signed-off-by: William Zhang <wilzhang@amazon.com>
2019-12-02 18:04:40 +00:00
William Zhang
ce40436895 reachable/netlink: Show an interface as reachable to itself
Due to the way netlinks detects reachability, it will not show an
interface as reachable to itself, even if it can pass through a loopback
interface. To maintain similar behavior with netmasks, we display an
interface as reachable to itself.

Signed-off-by: William Zhang <wilzhang@amazon.com>
2019-12-02 18:04:40 +00:00
Joseph Schuchart
ee80babe5c Remove unused opal_shmem_seg_hdr_t to retain alignment
Signed-off-by: Joseph Schuchart <schuchart@hlrs.de>
2019-11-29 08:40:14 +01:00
Brice Goglin
ea80a20e10 hwloc/base: fix opal proc locality wrt to NUMA nodes on hwloc 2.0
Both opal_hwloc_base_get_relative_locality() and _get_locality_string()
iterate over hwloc levels to build the proc locality information.
Unfortunately, NUMA nodes are not in those normal levels anymore since 2.0.
We have to explicitly look a the special NUMA level to get that locality info.

I am factorizing the core of the iterations inside dedicated "_by_depth"
functions and calling them again for the NUMA level at the end of the loops.

Thanks to Hatem Elshazly for reporting the NUMA communicator split failure
at https://www.mail-archive.com/users@lists.open-mpi.org/msg33589.html

It looks like only the opal_hwloc_base_get_locality_string() part is needed
to fix that split, but there's no reason not to fix get_relative_locality()
as well.

Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
2019-11-27 12:41:33 +01:00
Josh Hursey
b50d568004
Merge pull request #7191 from gvallee/returncode_check
Add the missing code to check a return code
2019-11-26 09:02:23 -06:00
Josh Hursey
f828990bcf
Merge pull request #7192 from gvallee/contiaing_typo
Fix typo in comment: contiaing -> containing
2019-11-25 14:40:08 -06:00
Geoffroy Vallee
127573cf44
Fix typo in comment: contiaing -> containing
Signed-off-by: Geoffroy Vallee <geoffroy.vallee@gmail.com>
2019-11-24 11:06:41 -05:00
Geoffroy Vallee
de6f130b4a
Add the missing code to check a return code
Signed-off-by: Geoffroy Vallee <geoffroy.vallee@gmail.com>
2019-11-24 11:04:10 -05:00
Geoffroy Vallee
98de17c6da
Fix a type in comments: insertted -> inserted
Signed-off-by: Geoffroy Vallee <geoffroy.vallee@gmail.com>
2019-11-24 00:36:06 -05:00
Howard Pritchard
7c1aafb6e1 clang: pacify clang wrt ompi use of c11 atomics
The clang compiler (or at least the one used by the Cray CCE 9 and newer)
doesn't like what we're doing passing non _Atomic pointers to C11 atomics.
Fix the ones that keep vader from compiling using Cray CCE 9 and 10.

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2019-11-21 15:14:45 -06:00
Austen Lauria
edcd6d8aeb
Merge pull request #7146 from bgoglin/master
fix typos hlwoc->hwloc
2019-11-13 15:38:00 -05:00
Nathan Hjelm
09dd383f8b
Merge pull request #7108 from devreal/btl-ugni-deadlock
uGNI: Fix potential deadlock when processing outstanding transfers
2019-11-11 10:56:56 -08:00
George Bosilca
3de636dc6f Swap the 2 fields to maintain the size of the struct.
Thanks @devreal for catching this.

Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
2019-11-07 11:25:03 -05:00
George Bosilca
59fb02618e Prevent overflow when dealing with datatype count.
This patch fixes #7147 by preventing overflow when multiplying
the count and the blocklen. The count reflects MPI count and is
therefore bound to the size of an int (it is an uint32_t) while the
blocklen can be merged together to represent the largest contiguous
memory layout and it is therefore promoted to a size_t.

Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
2019-11-07 11:25:03 -05:00
Howard Pritchard
9d345d9aa0 btl/uct: add UCT API version check to configury
related to #7128

The UCX crew is no longer guaranteeing that the UCT API is going to be frozen,
so this is kind of a whack-a-mole problem trying to keep the BTL UCT working
with various changing UCT APIs.

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2019-11-06 14:27:58 -07:00
Brice Goglin
5c6bd7ea4e fix typos hlwoc->hwloc
Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
2019-11-06 10:42:36 +01:00
Nathan Hjelm
a3026c016a btl/uct: fix compilation for UCX 1.7.0
Ref #7128

Signed-off-by: Nathan Hjelm <hjelmn@google.com>
2019-11-05 12:53:26 -08:00
George Bosilca
476562752f
Correctly report TCP connect errors.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
2019-10-31 18:33:15 -04:00
Joseph Schuchart
5471d59af4 Fix OPAL_ALIGN_MIN to work on 32-bit systems
Signed-off-by: Joseph Schuchart <schuchart@hlrs.de>
2019-10-30 10:33:10 +01:00
Joseph Schuchart
c09ca039b4 uGNI: Fix potential deadlock when processing outstanding transfers
Signed-off-by: Joseph Schuchart <schuchart@hlrs.de>
2019-10-26 12:21:17 +02:00
Austen Lauria
aa8be9c12d
Merge pull request #6284 from devreal/ompi-rdma-memalign
Ensure proper alignment of memory provided by MPI
2019-10-25 12:27:58 -04:00
Nathan Hjelm
b1ef5a40fa
Merge pull request #7016 from hjelmn/fix_btl_uct_from_yet_another_unannounced_api_break_in_the_openucx_uct_layer
btl/uct: add support for OpenUCX v1.8 API changes
2019-10-17 06:27:18 -07:00
Jeff Squyres
b6c4d5c118
Merge pull request #7060 from jsquyres/pr/usnic-mca-updates
BTL usnic MCA updates
2019-10-15 10:48:10 -04:00
Stanislav Kirillov
0e0763e006
fix ipv6 btl connection bug
Signed-off-by: Stanislav Kirillov <staskirillof@yandex.ru>
2019-10-10 11:20:37 +00:00
Todd Kordenbrock
f7e74b6a3d btl-portals4: fix a flow control configure bug
This commit fixes a configure bug that caused flow control to be
disabled regardless of the configure options used.

Signed-off-by: Todd Kordenbrock <thkgcode@gmail.com>
2019-10-09 17:12:56 -05:00
Geoff Paulsen
4e1e6f8972
Merge pull request #6993 from awlauria/fix_warnings_master
Fix miscellaneous compiler warnings.
2019-10-09 09:17:02 -05:00
Jeff Squyres
3080033a8c btl/usnic: set retrans_timeout back down to 5ms
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2019-10-08 11:17:54 -07:00
Jeff Squyres
132e4cab3b btl/usnic: set ack_iteration_delay default to 4
It was previously accidentally set to 0.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2019-10-08 11:17:30 -07:00
Jeff Squyres
fe7f772f21 btl/usnic: properly size freelist items
Move the prefix area from the head to the body in relevant size
computations.  This fixes a problem in high traffic situations where
usNIC may have sent from unregistered memory.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2019-10-04 14:40:56 -07:00
Jeff Squyres
27e3040dfe btl/usnic: cap the number of resends per progress iteration
New MCA param: btl_usnic_max_resends_per_iteration.  This is the max
number of resends we'll do in a single pass through usNIC component
progress.  This prevents progress from getting stuck in an endless
loop of retransmissions (i.e., if more retransmissions are triggered
during the sending of retransmissions).  Specifically: we need to
leave the resend loop to allow receives to happen (which may ACK
messages we have sent previously, and therefore cause pending resends
to be moot).

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2019-10-04 13:05:51 -07:00
Jeff Squyres
3cc95d86b2 btl/usnic: increase default retrans_timeout
Significantly increase the default retrans timeout.  If the
retrans timeout is too soon, we can end up in a retransmission storm
where the logic will continually re-transmit the same frames during a
single run through the usNIC progress function (because the timer for
a single frame expires before we have run through re-transmitting all
the frames pending re-transmission).

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2019-10-04 13:05:51 -07:00
Jeff Squyres
968b1a51b5 btl/usnic: clarifications and fixes regarding ACKs
New MCA parameter: btl_usnic_ack_iteration_delay.  Set this to the
number of times through the usNIC component progress function before
sending a standalone ACK (vs. piggy-backing the ACK on any other send
going to the target peer).

Use "ticks" language to clarify that we're really counting the number
of times through the usNIC component DATA_CHANNEL completion check (to
check for incoming messages) -- it has no relation to wall clock time
whatsoever.

Also slightly change the channel-checking scheme in usNIC component
progress: only check the PRIORITY channel once (vs. checking it once,
not finding anything, and then falling through the progress_2() where we
check PRIORITY again and then check the DATA channel).

As before, if our "progress" libevent fires, increment the tick
counter enough to guarantee that all endpoints that need an ACK will
get triggered to send standalone ACKs the next time through progress,
if necessary.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2019-10-04 13:05:51 -07:00
Jeff Squyres
ce2910a28a btl/usnic: s/get_nsec/get_nticks/g
Rename "get_nsec()" to "get_ticks()" to more accurately reflect that
this function has no correlation to wall clock time at all.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2019-10-04 13:05:51 -07:00
Jeff Squyres
f3429d7a44 btl/usnic: pack a wire data struct
Might as well save a few bytes when sending this struct across the
network via the __opal_attribute_packed__ attribute.

That being said, also re-order the elements in this struct so that
there's no holes to begin with.  Do this so that the compiler/runtime
won't effect (slow) unaligned reads/writes because of the
__opal_attribute_packed__ attribute.

The "packed" attribute is really more about defensive programming
(e.g., if we make a mistake and have a hole, "packed" will remove it
for us).

*** Do not bring this commit back to existing/already-released release
branches: it will cause incompatibility, since it effectively changes
the usNIC BTL wire protocol.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2019-10-04 13:05:51 -07:00
Austen Lauria
0d4004cc3c Fix miscellaneous compiler warnings.
Signed-off-by: Austen Lauria <awlauria@us.ibm.com>
2019-10-01 16:27:25 -04:00
Joseph Schuchart
c385c927fb Ensure proper alignment of memory provided by MPI
Signed-off-by: Joseph Schuchart <schuchart@hlrs.de>
2019-10-01 11:54:29 +02:00
Gilles Gouaillardet
1c4a3598d0 pmix/pmix4x: refresh to the latest open PMIx master
refresh to openpmix/openpmix@ea3b29b1a4

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2019-10-01 14:27:22 +09:00
Nathan Hjelm
8473a66466 btl/uct: fix bug when using a transport without zero-copy
This commit fixes a crash that can occur if a transport
is usable but doesn't have zero-copy support. In this
case do not attempt to use zero-copy and set the max
send size off the bcopy limit.

Signed-off-by: Nathan Hjelm <hjelmn@google.com>
2019-09-27 17:26:37 -07:00
Nathan Hjelm
526775dfd7 btl/uct: add support for OpenUCX v1.8 API changes
OpenUCX broke the UCT API again in v1.8. This commit updates
btl/uct to fix compilation with current OpenUCX master
(future v1.8). Further changes will likely be needed for
the final release.

Signed-off-by: Nathan Hjelm <hjelmn@google.com>
2019-09-27 12:34:48 -07:00
Jeff Squyres
8038fac8f9
Merge pull request #6844 from adrianreber/check_for_user_ns
Do not use CMA in user namespaces
2019-09-20 22:10:42 -04:00
William Zhang
ca70b703a1 opal/util: Add function to get vertex data from bipartite graph
Currently, there is no function that allows the user to retrieve the
data they have stored in a vertex easily. Using the internal macros and
knowledge of the structures, the new function will return a pointer to
the user provided vertex data.

Signed-off-by: William Zhang <wilzhang@amazon.com>
2019-09-11 18:28:07 +00:00
George Bosilca
3522916971
Mark predefined empty datatype contiguous.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
2019-09-07 14:40:21 +10:00
Nathan Hjelm
ae91b11de2 btl/vader: when using single-copy emulation fragment large rdma
This commit changes how the single-copy emulation in the vader btl
operates. Before this change the BTL set its put and get limits
based on the max send size. After this change the limits are unset
and the put or get operation is fragmented internally.

References #6568

Signed-off-by: Nathan Hjelm <hjelmn@google.com>
2019-09-05 23:08:53 -07:00
Adrian Reber
fc68d8a90f
Do not use CMA in user namespaces
Trying out to run processes via mpirun in Podman containers has shown
that the CMA btl_vader_single_copy_mechanism does not work when user
namespaces are involved.

Creating containers with Podman requires at least user namespaces to be
able to do unprivileged mounts in a container

Even if running the container with user namespace user ID mappings which
result in the same user ID on the inside and outside of all involved
containers, the check in the kernel to allow ptrace (and thus
process_vm_{read,write}v()), fails if the same IDs are not in the same
user namespace.

One workaround is to specify '--mca btl_vader_single_copy_mechanism none'
and this commit adds code to automatically skip CMA if user namespaces
are detected and fall back to MCA_BTL_VADER_EMUL.

Signed-off-by: Adrian Reber <areber@redhat.com>
2019-09-05 20:15:19 +02:00
Ralph Castain
373e816b37
Ensure buffer_unload leaves the buffer in a clean state
Silence a warning in orte/nidmap

Signed-off-by: Ralph Castain <rhc@pmix.org>
2019-09-04 08:32:27 -07:00
Ralph Castain
9a2902c047
Force use of pmix/preg/native component
Signed-off-by: Ralph Castain <rhc@pmix.org>
2019-09-03 17:54:44 -07:00
Ralph Castain
2ebc1fa1b7
Update to current PMIx 4.0.0 (master)
Signed-off-by: Ralph Castain <rhc@pmix.org>
2019-09-03 14:46:55 -07:00
George Bosilca
41e6f55807
Small optimization on the datatype commit.
This patch fixes the merge of contiguous elements into larger but more
compact datatypes, and allows for contiguous elements to have thir
blocklen increasing instead of the count. The idea is to always maximize
the blocklen, aka. the contiguous part of the datatype.

Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
2019-08-30 19:56:48 -04:00
Howard Pritchard
71e1fad4a9
Merge pull request #6855 from hkuno/hkuno/mmap_loop
Fix mmap infinite recurse in memory patcher
2019-08-26 14:28:53 -06:00
Howard Pritchard
5a3646fd1d
Merge pull request #6884 from guserav/reintroduce-common-ofi
Reintroduce common ofi
2019-08-23 13:12:25 -06:00