1
1
Граф коммитов

325 Коммитов

Автор SHA1 Сообщение Дата
Nathan Hjelm
17b80a987e btl/vader: fix fast box support for 32-bit architectures
On 32-bit architectures loads/stores of fast box headers may take
multiple instructions. This can lead to a data race between the
sender/receiver when reading/writing the sequence number. This can
lead to a situation where the receiver could process incomplete
data. To fix the issue this commit re-orders the fast box header to
put the sequence number and the tag in the same 32-bits to ensure they
are always loaded/stored together.

Fixes #473

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-03-30 16:28:16 -06:00
Jeff Squyres
89e14f5ad6 usnic: fix comment typos 2015-03-27 17:21:53 -07:00
Jeff Squyres
2672d8d26b usnic: update comments explaining fi_av_insert() reaping
Asynchronous fi_av_insert()s are somewhat tricky; update the comments
explaining what is going on (and fix some old/stale comments).
2015-03-27 12:01:59 -07:00
Jeff Squyres
6da77ee940 usnic: only warn about each unreachable endpoint once
fi_av_insert() is invoked with a context containing each endpoint
USNIC_NUM_CHANNELS times.  If the address on that endpoint fails to
resolve / is unreachable / has some error, we'll therefore get
USNIC_NUM_CHANNELS error completions with that same endpoint.  We
therefore only want to warn about the unreachability of (and
OBJ_RELEASE) that endpoint the *first* time.

Fixes CSCut46822.
2015-03-27 12:01:59 -07:00
Jeff Squyres
506431d1b6 usnic: count fi_av_insert() completions properly
num_left represents the number of times we called fi_av_insert(), and
therefore it indicates the number of non-error completions that we
will receive.
2015-03-27 12:01:59 -07:00
Mike Dubman
00784ae3ba btl/openib: fix compiler warning, by HalR 2015-03-13 13:17:23 +02:00
Todd Kordenbrock
9350b06f7d btl-portals4: fix compiler warnings 2015-03-12 20:34:04 -05:00
Todd Kordenbrock
d1656347c8 btl-portals4: implement the BTL 3.0 interface 2015-03-12 14:19:44 -05:00
adrianreber
714d9aa67e Merge pull request #348 from adrianreber/topic/orte_cr_continue_like_restart
Topic/orte cr continue like restart
2015-03-12 14:54:02 +01:00
Nathan Hjelm
ce6caab2a7 Merge pull request #463 from hjelmn/cuda_async
btl/openib: cuda: fix CUDA-aware support with async copy
2015-03-11 09:52:48 -06:00
Jeff Squyres
c61dd4d56f usnic: each err eq entry reports *1* completion
Actually, the return from fi_eq_readerr() only indicates a *single*
error completion (not err_entry.data completions).
2015-03-11 08:07:20 -07:00
Nathan Hjelm
395635f017 Merge pull request #461 from hjelmn/btl_openib_cleanup
btl/openib: remove derived btl segment type
2015-03-11 08:20:41 -06:00
Jeff Squyres
9c926e5e82 usnic: add more commments/explanation about error cases
If we really get a catastrophic error from a libfabric call, don't
bother trying to continue (because data has been corrupted and there's
nothing sane left to do).  Just call opal_btl_usnic_exit() (which
tries to call the PML error callback, but we're so early in the
module_init process that this likely hasn't been setup yet, so the job
will likely abort).
2015-03-11 07:16:28 -07:00
Jeff Squyres
51583789fb usnic: re-indent some show_help code
Nothing too substantial here, but two of the messages moved from
"libfabric API failed" to "internal error during init", just to be a
bit more descriptive.
2015-03-11 07:15:28 -07:00
Jeff Squyres
1b836d784c usnic: subtract number of errored insertions from loop count
When we get errors, the entry.data field tells us how many errors are
being reported.  So decrement the loop count variable by that much.

This fixes CSCut30441.
2015-03-11 07:13:10 -07:00
Adrian Reber
f45dd069bd FT: fix compilation using --with-ft (1/5)
Enabling the FT code breaks compilation (again). This series
tries to fix the compiler errors. This is again only fixing
the compiler errors without any warranty that the result
might actually support FT again.

This first patch moves orte_cr_continue_like_restart from ORTE
to opal_cr_continue_like_restart in OPAL. This only leaves three
calls from OPAL to ORTE in the FT code. As it is not yet 100%
clear how to handle these calls the code orte_sstore.set_attr()
has been #ifdef'd out for now.
2015-03-11 14:23:33 +01:00
Nathan Hjelm
b308afa8fd btl/openib: remove derived btl segment type
The derived segment type (btl_openib_segment_t) was intended to store
the registration info needed for put and get. In BTL 3.0 this is no
longer required. I intended to remove this type as part of
open-mpi/ompi@74f1af4548 .

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-03-10 14:41:15 -06:00
Nathan Hjelm
3d32dbd793 btl/openib: cuda: fix CUDA-aware support with async copy
This commit should resolve an issue seen with CUDA-aware support. The
problem came in with BTL 3.0. Before 3.0 the size of the copy was
stored in the incoming segment's des_remote_count field. This field
does not exist in BTL 3.0 so I stored the value in the
des_segment_count field. This caused problems with the cuda support
code. To fix the issue the endpoint pointer is now stored in the in
fragment's endpoint pointer which free's up the segment's des_cbdata
pointer for storing the transfer size.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-03-10 14:38:12 -06:00
Jeff Squyres
d97551bdb1 usnic: endpoint type hint moved to a sub-struct
Update to match new libfabric API/structure change.
2015-03-10 09:47:41 -07:00
Jeff Squyres
afec1454f5 usnic: only setup the connectivity checker if we have modules
If we ended up with no modules (e.g., all usnic devices were
excluded), there was a race condition in that the connectivity agent
could tear down its local socket before one or more of the local
clients saw it.  Therefore, the local clients would timeout waiting
for the socket to appear.

So move the connectivity checker init later in the bootstrapping
process (it *must* be setup before module_init()), and have it only
invoked if we actually ended up with one or more modules.
2015-03-10 07:43:20 -07:00
Jeff Squyres
06accb721c usnic: ensure to free all resources if no usnic BTLs found
If all usnic devices are excluded, then we need to ensure the error
path includes freeing the filter.

This was Coverity CID 1288085
2015-03-10 07:43:20 -07:00
Jeff Squyres
4b2cba46f4 usnic: fix bootstrap error paths
Fix previously-unfinished error paths during startup/bootstrapping.
Instead of just blindly continuing on when an fi_* function call
fails, opal_show_help and skip that device.

Also, only check the usnic config minimums once.  They're VIC-wide and
won't change on a per-device basis -- we only need to check them once.

Fixes CSCut19179.
2015-03-09 16:57:41 -07:00
Gilles Gouaillardet
521317341d btl/openib: fix a double free
as reported by Coverity with CID 1287033
2015-03-06 14:58:11 +09:00
Gilles Gouaillardet
e1cc931e1b btl/tcp: silence CID 710616 2015-03-05 14:20:08 +09:00
Gilles Gouaillardet
134c866aa9 btl/openib: fix misc memory leaks
as reported by Coverity with CIDs 1269848, 1269852 and 1269862
2015-03-05 14:06:18 +09:00
Gilles Gouaillardet
d1b2f043ff fix misc memory leaks
as already reported by Coverity with CIDs
71818, 71819, 72250, 715767, 1196749 and 1274002
2015-03-05 13:58:05 +09:00
Mike Dubman
171d674ca4 Merge pull request #441 from open-mpi/revert-438-topic/use_opal_common_verbs_want_fork_support
Revert "create the opal_common_verbs_want_fork_support parameter."
2015-03-04 10:13:00 +02:00
Gilles Gouaillardet
f43b5b46ee btl/openib: fix heterogeneous support
Thanks @bosilca for the pointer
2015-03-04 13:53:05 +09:00
Gilles Gouaillardet
81b0444ef2 btl/openib: fix comment syntax, no code change
and silence gcc warning about nested comments
2015-03-04 11:24:14 +09:00
Rolf vandeVaart
edf58eb549 Implement CUDA-aware workaround while fork support worked out 2015-03-03 09:50:01 -05:00
Mike Dubman
98503b56e0 Revert "create the opal_common_verbs_want_fork_support parameter." 2015-03-03 14:28:31 +02:00
Alina Sklarevich
8fe42f1bc1 create the opal_common_verbs_want_fork_support parameter.
call the opal_common_verbs_mca_register function to make sure that
opal_common_verbs_want_fork_support mca parameter is created and therefore
can be used to control the fork support.
2015-03-01 17:40:49 +02:00
Rolf vandeVaart
e48bc77342 Fix the coverity fix 2015-02-27 12:49:44 -05:00
Rolf vandeVaart
cfe91d4d0f Fix compile error for CUDA-aware by using new fork MCA parameter. 2015-02-27 10:00:46 -05:00
Gilles Gouaillardet
f33cd58ee9 btl/tcp: fix misc memory leaks
as reported by Coverity with CIDs 710615, 710616 and 710618
2015-02-27 19:16:22 +09:00
Gilles Gouaillardet
60404d1953 btl/base: fix misc memory leaks
as reported by Coverity as CIDs 71818 71819
2015-02-27 19:06:06 +09:00
George Bosilca
778ba0317e Revert "Minor cleanups."
This reverts commit 3b4da0bda4.
2015-02-26 17:53:58 -05:00
George Bosilca
5c3ce3a737 Merge branch 'master' of github.com:open-mpi/ompi 2015-02-26 17:10:18 -05:00
Nathan Hjelm
855d422e62 Merge pull request #408 from hjelmn/btl_3_0_mod
btl: expose local registration thresholds
2015-02-26 12:57:43 -07:00
Mike Dubman
dbc15009b6 Merge pull request #415 from alinask/topic/fix_fork_support_flow
Fix the calls to ibv_fork_init and remove btl_openib_want_fork_support.
2015-02-26 21:50:11 +02:00
Rolf vandeVaart
bbdcf9ff33 Fix missing cast change from opal_free_list_changes. Fixes warning 2015-02-26 11:24:49 -05:00
Gilles Gouaillardet
b888768ca3 btl/scif: fix a typo
this is likely a typo introduced by open-mpi/ompi@5f1254d710
@hjelmn could you please double check this ?
2015-02-26 13:45:51 +09:00
Nathan Hjelm
8a17e69067 btl/ugni: fix typos introduced by free list update 2015-02-25 12:43:05 -07:00
George Bosilca
f3b58006c8 Merge branch 'master' of github.com:open-mpi/ompi 2015-02-25 12:01:35 -05:00
Jeff Squyres
f3c9354d4b usnic: restore compatibility with the v1.8 branch
Also include two other minor changes:

1. More C99-style member initialization in the component struct
1. Fix the BTL module member initialization to not be redundant
2015-02-25 05:37:51 -08:00
Alina Sklarevich
e4c4e7df5e Fix the calls to ibv_fork_init and remove btl_openib_want_fork_support.
In order to have an effect, ibv_fork_init should be called in the
beginning of the verbs initialization flow - before the calls to the
ibv_create_qp and ibv_create_cq verbs.
These functions are called from the oob/ud code and by the time the
other verbs components (btl openib, pml yalla, ...) call ibv_fork_init,
it's too late. This commit forces the call to ibv_fork_init (if it's
requested) right at the beginning of all the components that are using
verbs.
(ibv_fork_init() can be safely called multiple times)

This commit also removes the btl_openib_want_fork_support mca parameter
and adds a new mca parameter instead - opal_verbs_want_fork_support.
Through this new parameter, fork support may be requested for ALL
components.
The default value for this parameter is set to 1.

Before this commit the btl_openib_want_fork_support parameter didn't
provide fork support for the openib btl if its value was set to 1.
(because when openib called ibv_fork_init, it was already after the
calls to ibv_create_* in oob/ud and thereofre it failed).
2015-02-25 10:58:50 +02:00
Jeff Squyres
a85a392896 Merge pull request #422 from jsquyres/topic/coverity-fixes
Some Coverity fixes
2015-02-24 17:00:10 -05:00
Jeff Squyres
3cd36ab12a openib: fix double free
This was CID 1269989
2015-02-24 15:24:10 -05:00
Jeff Squyres
5894f0c1f2 usnic: update to new mpool API
NOTE: Have not added cross-compatibility with v1.8 branch yet
2015-02-24 10:05:45 -07:00
Nathan Hjelm
5f1254d710 Update code base to use the new opal_free_list_t
Use of the old ompi_free_list_t and ompi_free_list_item_t is
deprecated. These classes will be removed in a future commit.

This commit updates the entire code base to use opal_free_list_t and
opal_free_list_item_t.

Notes:

OMPI_FREE_LIST_*_MT -> opal_free_list_* (uses opal_using_threads ())

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-02-24 10:05:45 -07:00