1
1

715 Коммитов

Автор SHA1 Сообщение Дата
Igor Ivanov
8b05f308f9 opal/memory: Move Memory Allocation Hooks usage from openib
These changes fix issue https://github.com/open-mpi/ompi/issues/1336

- improve abstractions: opal/memory/linux component should be single place that opeartes with
Memory Allocation Hooks.
- avoid collisions in case dynamic component open/close: it is safe because it is linked statically.
- does not change original behaivour.
2016-02-11 14:46:35 +02:00
Jeff Squyres
8d0a592563 usnic: update a few verbose reachability messages 2016-02-06 03:28:48 -08:00
Jeff Squyres
87dbe6ce01 usnic: add high-verbose reachability messages 2016-02-06 03:28:47 -08:00
Jeff Squyres
dac2fe1589 usnic: ensure to use ntohl() for network-order values 2016-02-06 03:28:47 -08:00
Jeff Squyres
51240394a7 usnic: ensure to init module->av_eq_num 2016-02-06 03:28:47 -08:00
Jeff Squyres
89eea51075 usnic: fix calculation for number of blocks 2016-02-02 16:56:34 -08:00
Nathan Hjelm
cd11fc3081 btl/ugni: fix race condition that causes completions to be dropped
The send code in the ugni btl has an optimization that enables it to
return 1 (fragment gone) in some cases. This optimization involved
removing the btl ownership and callback flags to ensure the fragment
stuck around long enough for its completion flag to be checked. This
works fine for the single-threaded case but not in the multi-threaded
case. It is possible that a fragment will be completed by another
thread while a thread is in mca_btl_ugni_send. This competition can
lead to a leaked fragment, missed callback, or both. To fix the issue
without removing the optimization a reference count has been added to
the fragment. Callbacks and fragment release will not be made until
the fragment reference count has reach 0. The count is incremented
before sending the frag and decremented after the completion flag has
been checked. The fix has been verified to work using a multi-threaded
RMA benchmark with the osc/pt2pt component.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-02-02 12:14:31 -07:00
Nathan Hjelm
14704201e2 btl/ugni: fix race condition when adding endpoint to wait list
This commit fixes a race condition that can cause an endpoint to be
added to the wait list multiple times. To fix the issue an additional
check has been added to ensure the endpoint is not on the wait list
after the wait list lock is held. The wait list processing code has
also been updated to keep the wait list lock until all wait listed
endpoints have been handled. This reduces the chance that an endpoint
that is being processed by the wait list code is not re-added to the
list by a competing send.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2016-02-02 12:13:49 -07:00
Jeff Squyres
9f3ed00125 usnic: minor updates from code review
Three minor updates from the code review of
https://github.com/open-mpi/ompi-release/pull/933:

* Remove an extra blank line a show_help message
* We no longer allow -1 for the MCA param btl_usnic_av_eq_num, so
  change the flag to REGINT_GE_ONE
* Change "num_blocks" definition to be in terms of block_len (not
  eq_size)
2016-02-01 11:14:30 -08:00
Jeff Squyres
c2615a4732 usnic: change retrans timeout to 5ms
A bunch of empirical testing has shown that increasing the retranmit
timeout from 1ms to 5ms doesn't adversely affect performance, yet
decreases the number of gratuitious retransmissions.
2016-01-30 10:49:14 -08:00
Jeff Squyres
797d5026c8 usnic: better av_eq_num default value handling 2016-01-30 10:46:14 -08:00
Jeff Squyres
db825abc00 usnic: don't overrun the fi_av_insert() EQ
Add endpoints in a blocked manner so that we don't overrun the
fi_av_insert() event queue.  Also make the AV EQ length an MCA param,
and report it in mca_btl_base_verbose >=5 output.
2016-01-30 08:33:48 -08:00
Jeff Squyres
d624e0d60f usnic: fix wraparound sequence number issue
Sequence numbers will wrap around; it is not sufficient to check for
(seq-1) -- must use the SEQ_DIFF macro to properly handle the
wraparound.

This bug wasn't serious; it just meant we might retransmit one or two
extra times when retransmits were triggerd and the sequence numbers
wrapped around their sliding windows.
2016-01-30 08:32:13 -08:00
Jeff Squyres
4de4a263f5 usnic: ensure all messages are sent on the data channel
Messages should go on the data channel, even if they're short.  Only
ACKs go on the priority channel.
2016-01-30 08:31:21 -08:00
Jeff Squyres
348ac507c2 usnic: explain why we still have OPAL_HAVE_HWLOC
Put in a comment explaining why btl_usnic_compat.h still defines
OPAL_HAVE_HWLOC, even though master/v2.x no longer does.
2016-01-16 04:11:05 -08:00
Jeff Squyres
0f5fcf9029 usnic: fix common symbol 2016-01-16 03:55:27 -08:00
Jeff Squyres
60ffe713b8 common syms: whitelist bison-generated common symbols
Bison generates some common symbols that we can't do anything about,
so whitelist them.
2016-01-16 03:53:14 -08:00
Artem Polyakov
84e4fb308b Fix race condition in UDCM where service thread sees that
`cm_message_event_active == 1` but main thread has already stopped
processing messages and thus we will have the situation where one
message was left unhandled leading to a hang.
2016-01-08 23:56:21 +06:00
Jeff Squyres
6d073a8da4 btl_sm: add a comment explaining why we rename(2)
Per open-mpi/ompi#1230, add a comment explaining why we write to a
temporary file and then rename(2) the file, just so that future code
maintainers don't wonder why we do this seemingly-useless step.
2016-01-04 14:51:52 -05:00
Artem Polyakov
2abb2972ac Fix Mellanox copyrights with respect to the following PRs:
* https://github.com/open-mpi/ompi/pull/1184
* https://github.com/open-mpi/ompi/pull/1188
* https://github.com/open-mpi/ompi/pull/1197
* https://github.com/open-mpi/ompi/pull/1202
* https://github.com/open-mpi/ompi/pull/1210
* https://github.com/open-mpi/ompi/pull/1216
* https://github.com/open-mpi/ompi/pull/1236
* https://github.com/open-mpi/ompi/pull/1237
* https://github.com/open-mpi/ompi/pull/1248
* https://github.com/open-mpi/ompi/pull/1260
* https://github.com/open-mpi/ompi/pull/1264
2015-12-30 00:12:19 +06:00
Gilles Gouaillardet
fec973efda configury: test portability
replace test ... -o ... with test ... || test ...
and test ... -a ... with test ... && test ...
2015-12-28 13:58:45 +09:00
Nathan Hjelm
700a21022a Merge pull request #1260 from artpol84/openib_proc_account_fix
Openib proc accounting fix
2015-12-27 15:19:52 -07:00
Artem Polyakov
a20826e6b4 Fix vader resource leak.
This nasty bug was nicely masked. It was causing `mca_btl_vader_component.vader_frags_user`
overflow and as the result rear hangs of ompi-test-suite.
2015-12-28 00:41:45 +06:00
Gilles Gouaillardet
2d9aa38e6a btl/openib: fix heterogeneous support 2015-12-25 16:31:35 +09:00
Artem Polyakov
3031affdb7 Fix openib process accounting if procs was dynamically added. 2015-12-24 17:56:35 +06:00
Artem Polyakov
400af6c52d openib addproc improvements:
1. finer grained locks;
2. separate srq creation from cq adjustments.
2015-12-24 17:56:35 +06:00
Artem Polyakov
41c325f15a Shift common code for calculating a port count and btl_rank in openib
into the static function
2015-12-24 17:56:35 +06:00
Gilles Gouaillardet
5fa63f086a btl/tcp: add missing #include <unistd.h>
Thanks Marco Atzeri for contributing the original patch
2015-12-24 14:41:46 +09:00
Gilles Gouaillardet
15ed7ad9f5 btl/sm: add missing #include <unistd.h>
Thanks Marco Atzeri for contributing the original patch
2015-12-24 14:41:41 +09:00
Gilles Gouaillardet
42313acd58 btl/usnic: add missing #include <alloca.h> 2015-12-24 14:33:58 +09:00
Nathan Hjelm
84d890b7e7 Merge pull request #1248 from artpol84/openib_proc_init_race
Openib dynamic add proc race conditions
2015-12-22 21:48:05 -07:00
Artem Polyakov
08ad8357a8 Fix local process accounting in openib when dynamic add_proc is on. 2015-12-22 22:44:46 +06:00
Artem Polyakov
3c2f6d5560 Protect openib_btl->device data with explicit opal_mitex locks. 2015-12-22 18:33:26 +06:00
Gilles Gouaillardet
607d7c7545 btl/sm: rename file after file descriptor has been closed.
Thanks George for spotting this.
2015-12-22 13:56:53 +09:00
Artem Polyakov
e06bffe213 Fix ib_proc locking 2015-12-21 18:52:31 +06:00
Artem Polyakov
3eb4756a17 Force locking regardles to the opal_using_threads() setting. 2015-12-21 18:52:31 +06:00
Artem Polyakov
11b72d9add Make important fields of ib_proc volatile. 2015-12-21 18:52:31 +06:00
Artem Polyakov
86c0c3ec52 Provide additional information: whether ib_proc was newly created or
it was already existing.
2015-12-21 18:52:31 +06:00
Artem Polyakov
9325bd3d69 Protect device initialization 2015-12-21 18:52:31 +06:00
Artem Polyakov
0f77bc7ea7 Perform endpoint initialization atomically. 2015-12-21 18:52:31 +06:00
Artem Polyakov
afaf9c9ea6 Shift ib_proc initialization to the separate function. 2015-12-21 18:52:31 +06:00
Artem Polyakov
3c9fd567b6 Fix openib race condition when direct modex is used.
The problem was in mca_btl_openib_proc_create. This function may be called
from several places simultaneously:
* from the main thread when somebody wants to do `MPI_Send()` (for example) for
the first time;
* from udcm if the counterpart peer is trying to connect and `mca_btl_openib_get_ep()`
is called.

In this case one of the threads may add an uninitialized proc structure
to the `mca_btl_openib_component.ib_procs` and the other will read it and
treat as initialized.

This commit turns ib_proc initialization into a single atomic operation.
2015-12-21 18:52:30 +06:00
Gilles Gouaillardet
db4f483653 btl/sm: fix race condition
write to file and then rename, so when the file is open for read, its content is known to have been written.

Fixes open-mpi/ompi#1230
2015-12-21 16:37:51 +09:00
Nathan Hjelm
e77199fd4f Merge pull request #1235 from ggouaillardet/topic/ibv_exp_fixes
btl/openib: do not mix exp and non exp verbs
2015-12-17 08:36:09 -07:00
Gilles Gouaillardet
994a627f82 btl/openib: do not mix exp and non exp verbs 2015-12-17 16:45:43 +09:00
Artem Polyakov
0951a34e95 Fix openib memory registration limit calculation if cutoff = 0. 2015-12-17 13:45:19 +06:00
Jeff Squyres
2b9341a38a usnic: fix embarrissing typo 2015-12-15 19:01:19 -08:00
Jeff Squyres
944d5061a6 usnic: sendto() can return EPERM if we send too fast
If we send too fast, sendto() can run out of resources and return
EPERM.  So delay a little and try again.
2015-12-15 15:31:29 -08:00
Jeff Squyres
ab1bbca5b9 usnic: improve error message
When sendto() fails, it would be helpful to see the errno value.
2015-12-15 15:04:25 -08:00
Jeff Squyres
c1a6beac8d usnic: fix error message
There were too many "%s" instances.  Re-order the output so that we
show file, line, and then the error message.
2015-12-15 14:48:38 -08:00