
358 Commits

Author SHA1 Message Date
Jeff Squyres
9c19dd4e5b usnic: make v1.10 and v2.0 fit in version checking scheme
v1.10 is now in the same compatibility level as v1.7/v1.8 (there
is/will be no v1.9 series).  v2.0 now takes over for what used to be
called v1.9.
2015-05-26 18:42:33 -07:00
Jeff Squyres
95a2b14543 usnic: avoid some compiler warnings
Followup to open-mpi/ompi@65b66ab: if we're not debugging, then #if
out an entire block so that the compiler doesn't warn about variables
that are assigned and not used.
2015-05-26 14:27:26 -07:00
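The pattern this commit describes can be sketched as follows. This is a minimal illustration, not the usnic code: `EXAMPLE_ENABLE_DEBUG` and `example_send()` are hypothetical names chosen for the example.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical debug switch; the real usnic code uses its own macro. */
#ifndef EXAMPLE_ENABLE_DEBUG
#define EXAMPLE_ENABLE_DEBUG 0
#endif

/* Hypothetical function: returns the number of bytes "sent". */
static int example_send(int len)
{
    assert(len >= 0);
#if EXAMPLE_ENABLE_DEBUG
    /* Declared and assigned only in debug builds, so a non-debug build
     * never sees a variable that is set but not used. */
    int debug_len = len;
    fprintf(stderr, "sending %d bytes\n", debug_len);
#endif
    return len;
}
```

Keeping the declaration inside the `#if` block, rather than only the code that reads it, is what silences the "set but not used" warning.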
Jeff Squyres
68c33355c6 Merge pull request #595 from goodell/pr/usnic_getname
usnic: use fi_getname in newer libfabric
2015-05-26 17:20:13 -04:00
Ralph Castain
ce915b5757 Fix a typo that incorrectly set the alignment threshold in the openib BTL.
Thanks to Xavier Besseron for pointing it out
2015-05-25 07:12:57 -07:00
Gilles Gouaillardet
60e4d6c795 btl: add conversion macros for mca_btl_base_segment_t for heterogeneous support 2015-05-22 15:52:32 +09:00
Dave Goodell
65b66ab4ae usnic: use fi_getname in newer libfabric
When using an external libfabric (or really any libfabric newer than
libfabric commit 607e863), we must use fi_getname to determine the local
port of our endpoint.  Without this fix, OMPI will hang endlessly
while retransmitting packets to port 0 on the remote host.
2015-05-21 08:51:03 -07:00
Nathan Hjelm
108f55a963 btl/vader: clean up progress of waiting endpoints
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-05-20 16:14:58 -06:00
Nathan Hjelm
69e70776aa btl/vader: fix double unlock
References #594

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-05-20 14:35:22 -06:00
Howard Pritchard
d9f080b0c7 btl/ugni: silence common symbol squawk
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
2015-05-16 10:23:06 -05:00
George Bosilca
675dccf9d9 Print the port in host byte order. 2015-05-15 00:14:28 -04:00
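As a minimal sketch of what printing a port in host byte order involves (assuming an IPv4 `struct sockaddr_in`; `example_format_port()` is a hypothetical helper, not the function touched by this commit):

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>

/* Hypothetical helper: writes the port of an IPv4 address into buf.
 * sin_port is stored in network byte order, so it must be converted
 * with ntohs() before it is printed as a number. */
static int example_format_port(const struct sockaddr_in *addr,
                               char *buf, size_t len)
{
    assert(addr != NULL && buf != NULL);
    return snprintf(buf, len, "%u", (unsigned int) ntohs(addr->sin_port));
}
```

Without the `ntohs()` call, a little-endian host would print the byte-swapped value instead of the real port.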
Todd Kordenbrock
9df163f116 portals4: use a single Memory Descriptor to cover all of memory
In days past, some implementations of Portals4 could not cover all
of memory with a single Memory Descriptor so multiple large
overlapping Memory Descriptors were created.  Because none of the
current implementations have this limitation (and no future
implementations should either), this commit removes the overlapping
Memory Descriptors code.
2015-05-11 11:49:41 -05:00
Gilles Gouaillardet
c809aace47 initialize common symbols from opal
A few uninitialized common symbols remain:

common symbols generated by flex :
 * opal/util/keyval/keyval_lex.l: opal_util_keyval_yyleng
 * opal/util/keyval/keyval_lex.o: opal_util_keyval_yytext
 * opal/util/show_help_lex.l: opal_show_help_yyleng
 * opal/util/show_help_lex.l: opal_show_help_yytext

common symbol generated by "external" hwloc library:
 * opal/mca/hwloc/hwloc191/hwloc/src/components.o: component_map
2015-05-08 09:48:51 +09:00
Jeff Squyres
676673189b Merge pull request #565 from jsquyres/pr/fake-usnic-ibv-driver
Squelch libibverbs complaints about lack of usnic userspace plugin
2015-05-05 10:27:33 -04:00
Jeff Squyres
f79e137247 Merge pull request #555 from jsquyres/pr/openib-delay-cpc-init
btl openib: only initialize CPCs if there are devices to use
2015-05-05 10:26:55 -04:00
George Bosilca
459e15479f Remove double ; 2015-04-30 14:43:19 -04:00
Jeff Squyres
76222f462e btl openib: if ibv_open_device() returns NULL, it's not supported
When a libibverbs driver returns NULL for its context, it's the Open
MPI libibverbs fake driver.  Hence, this device is simply not
supported -- ignore it.
2015-04-29 18:07:12 -07:00
Jeff Squyres
202f64868c btl openib: only initialize CPCs if there are devices to use
Defer initializing the CPCs until we know that we have devices/ports
to use.  This both prevents some useless work at startup when there
are no devices/ports to use, and also prevents librdmacm complaining
that there are no verbs-capable RDMA devices available (e.g., if a
Cisco usNIC device is present, but does not present a verbs RDMA
interface).
2015-04-29 17:54:11 -07:00
Jeff Squyres
df6f7597a4 btl openib: only initialize CPCs if there are devices to use
Defer initializing the CPCs until we know that we have devices/ports
to use.  This both prevents some useless work at startup when there
are no devices/ports to use, and also prevents librdmacm complaining
that there are no verbs-capable RDMA devices available (e.g., if a
Cisco usNIC device is present, but does not present a verbs RDMA
interface).
2015-04-29 17:52:41 -07:00
Jeff Squyres
a50ad505e7 There were corner cases that allowed max_reg to be uninitialized. Set
a default value so that those corner cases would still have an
initialized value in max_reg.
2015-04-28 14:34:17 -07:00
Rolf vandeVaart
2b99b44a16 Fix error only seen when running with CUDA 5.5 or less. Introduced with BTL 3.0 2015-04-27 17:09:53 -04:00
Jeff Squyres
df5043bc3f usnic: adjust for libfabric API return value change 2015-04-27 10:18:49 -07:00
Nathan Hjelm
b06b8584f7 btl/openib/udcm: OBJ_CONSTRUCT UDCM module member before attempting to initialize
This commit fixes an assert when trying to clean up a module we failed
to initialize. There is no protection around the OBJ_DESTRUCT calls, so
they will always be called; similarly, we should always call
OBJ_CONSTRUCT at init.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-04-22 13:25:12 -06:00
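The construct-before-anything-can-fail rule can be sketched as below. This does not use Open MPI's object system: `example_list_t`, `example_module_t`, and their functions are hypothetical stand-ins for an OBJ_CONSTRUCT/OBJ_DESTRUCT pair.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical object with an explicit constructed flag. */
typedef struct {
    int constructed;
    int *items;
} example_list_t;

static void example_list_construct(example_list_t *list)
{
    list->constructed = 1;
    list->items = NULL;
}

static void example_list_destruct(example_list_t *list)
{
    /* Destructing an object that was never constructed is a bug. */
    assert(list->constructed);
    free(list->items);
    list->items = NULL;
    list->constructed = 0;
}

typedef struct {
    example_list_t pending;
} example_module_t;

static int example_module_init(example_module_t *module, int simulate_failure)
{
    /* Construct members first, before any step that can fail. */
    example_list_construct(&module->pending);
    if (simulate_failure) {
        return -1;
    }
    module->pending.items = calloc(4, sizeof(int));
    return (NULL != module->pending.items) ? 0 : -1;
}

static void example_module_finalize(example_module_t *module)
{
    /* Unconditional, mirroring unguarded destruct calls at cleanup. */
    example_list_destruct(&module->pending);
}
```

Because the member is constructed before the first early return, finalizing a module whose initialization failed no longer trips the assertion.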
Ryan Grant
9ccb932880 Merge pull request #546 from tkordenbrock/topic/btl.logical.mode.init.patch
btl-portals4: delay interface initialization until add_procs()
2015-04-22 11:22:46 -06:00
Todd Kordenbrock
5a0581a761 btl-portals4: delay interface initialization until add_procs()
The Portals4 spec requires that PtlSetMap() be called before any
Portals4 call other than PtlNIInit(), PtlGetMap() and
PtlGetPhysId().  To satisfy this requirement, this commit delays
interface initialization until the end of add_procs(), at which time
PtlSetMap() has been called.
2015-04-21 11:29:07 -05:00
Todd Kordenbrock
b1cef6c3ea btl-portals4: remove unused code path
The Portals4 BTL is registered with the PML as an RDMA BTL, so
prepare_src() is only used in limited cases.  This commit removes
the code path from prepare_src() for unbuffered contiguous buffers
with no PML reserve.  This is now handled in register_mem().
2015-04-21 11:27:35 -05:00
Todd Kordenbrock
bca2522d8b btl-portals4: fix send failure for messages larger than eager_size
In the large message case, the sender issues a PtlMEAppend() in
order to generate events when the receiver issues a PtlGet().  This
commit moves the PtlMEAppend() from mca_btl_portals4_prepare_src()
to mca_btl_portals4_register_mem() which is the way it's done in
BTL 3.0.
2015-04-21 11:27:28 -05:00
Nathan Hjelm
33181b2543 opal: use C99 subobject naming for component initialization
This commit helps future-proof opal components by initializing each
component member by name.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-04-18 10:29:58 -06:00
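For readers unfamiliar with C99 designated initializers, here is a minimal sketch. The `example_component` structure and its fields are hypothetical and far simpler than a real MCA component structure.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical component descriptor; field names are illustrative only. */
struct example_component {
    const char *name;
    int major_version;
    int minor_version;
    int (*open)(void);
    int (*close)(void);
};

static int example_open(void)
{
    return 0;
}

/* Each member is initialized by name, so adding or reordering fields in
 * the struct does not silently shift values into the wrong member.
 * Members that are not named (minor_version, close) are zero-initialized. */
static const struct example_component example = {
    .name = "example",
    .major_version = 1,
    .open = example_open,
};

/* Hypothetical sanity check used only for this illustration. */
static int example_component_is_valid(const struct example_component *c)
{
    assert(c != NULL);
    return c->name != NULL && c->open != NULL;
}
```

With positional initialization, inserting a new field in the middle of the struct would require touching every initializer; with named members, existing initializers keep compiling and keep their meaning.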
Nathan Hjelm
3436f2917d Merge pull request #449 from hjelmn/mca_base_update
mca/base update
2015-04-16 08:41:48 -06:00
Nathan Hjelm
f8158a5ec1 btl/vader: fix deadlock in mca_btl_vader_progress_endpoints
This commit fixes a typo in mca_btl_vader_progress_endpoints where
OPAL_THREAD_LOCK was used when OPAL_THREAD_UNLOCK was intended.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-04-14 09:34:45 -06:00
Nathan Hjelm
a7b0c00ab6 fix memory leaks and valgrind errors
This commit fixes several valgrind errors. Included:

 - installdirs did not correctly reinitialize all pointers to NULL
   at close. This causes valgrind errors on a subsequent call to
   opal_init_tool.

 - several opal strings were leaked by opal_deregister_params which
   was setting them to NULL instead of letting them be freed by the
   MCA variable system.

 - move opal_net_init to AFTER the variable system is initialized and
   opal's MCA variables have been registered. opal_net_init uses a
   variable registered by opal_register_params!

 - do not leak ompi_mpi_main_thread when it is allocated by
   MPI_T_init_thread.

 - do not overwrite ompi_mpi_main_thread if it is already set (by
   MPI_T_init_thread).

 - mca_base_var: read_files was overwriting mca_base_var_file_list
   even if it was non-NULL.

 - mca_base_var: set all file global variables to initial states on
   finalize.

 - btl/vader: decrement enumerator reference count to ensure that it
   is freed.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-04-11 09:28:35 -06:00
Jeff Squyres
d825ec7cc7 usnic: fix the CQ-reading logic for -FI_EAGAIN 2015-04-02 15:56:50 -07:00
Jeff Squyres
26b3c48ccb usnic: update to API change in libfabric 2015-04-01 06:43:08 -07:00
Nathan Hjelm
17b80a987e btl/vader: fix fast box support for 32-bit architectures
On 32-bit architectures, loads/stores of fast box headers may take
multiple instructions. This can cause a data race between the sender
and receiver when reading/writing the sequence number, so the receiver
could process incomplete data. To fix the issue, this commit reorders
the fast box header to put the sequence number and the tag in the same
32 bits, ensuring that they are always loaded/stored together.

Fixes #473

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-03-30 16:28:16 -06:00
Jeff Squyres
89e14f5ad6 usnic: fix comment typos 2015-03-27 17:21:53 -07:00
Jeff Squyres
2672d8d26b usnic: update comments explaining fi_av_insert() reaping
Asynchronous fi_av_insert()s are somewhat tricky; update the comments
explaining what is going on (and fix some old/stale comments).
2015-03-27 12:01:59 -07:00
Jeff Squyres
6da77ee940 usnic: only warn about each unreachable endpoint once
fi_av_insert() is invoked with a context containing each endpoint
USNIC_NUM_CHANNELS times.  If the address on that endpoint fails to
resolve / is unreachable / has some error, we'll therefore get
USNIC_NUM_CHANNELS error completions with that same endpoint.  We
therefore only want to warn about the unreachability of (and
OBJ_RELEASE) that endpoint the *first* time.

Fixes CSCut46822.
2015-03-27 12:01:59 -07:00
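A warn-once guard of the kind described here might look like the sketch below. `example_endpoint_t` and `example_handle_insert_error()` are hypothetical; the real code works on usnic endpoint objects and uses OBJ_RELEASE rather than a bare counter.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical endpoint: a reference count plus a "already reported" flag. */
typedef struct {
    int refcount;
    int unreachable_reported;
} example_endpoint_t;

/* Called once per error completion. The same endpoint may show up in
 * several error completions, so warn and drop the reference only the
 * first time. Returns 1 if a warning was issued, 0 otherwise. */
static int example_handle_insert_error(example_endpoint_t *ep)
{
    assert(ep != NULL);
    if (ep->unreachable_reported) {
        return 0;
    }
    ep->unreachable_reported = 1;
    fprintf(stderr, "warning: endpoint %p is unreachable\n", (void *) ep);
    --ep->refcount;
    return 1;
}
```

Releasing the endpoint on every error completion would drop its reference count multiple times, which is why the guard covers the release as well as the warning.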
Jeff Squyres
506431d1b6 usnic: count fi_av_insert() completions properly
num_left represents the number of times we called fi_av_insert(), and
therefore it indicates the number of non-error completions that we
will receive.
2015-03-27 12:01:59 -07:00
Nathan Hjelm
b68d66bb9b MCA: Add the project/project version to the MCA base component
This commit adds support for project_framework_component_* parameter
matching. This is the first step in allowing the same framework name
in multiple projects. This change also bumps the MCA component version
to 2.1.0.

All master frameworks have been updated to use the new component
versioning macro. An mca.h has been added to each project to add a
project specific versioning macro of the form
PROJECT_MCA_VERSION_2_1_0.

Signed-off-by: Nathan Hjelm <hjelmn@me.com>
2015-03-27 10:59:04 -06:00
Mike Dubman
00784ae3ba btl/openib: fix compiler warning, by HalR 2015-03-13 13:17:23 +02:00
Todd Kordenbrock
9350b06f7d btl-portals4: fix compiler warnings 2015-03-12 20:34:04 -05:00
Todd Kordenbrock
d1656347c8 btl-portals4: implement the BTL 3.0 interface 2015-03-12 14:19:44 -05:00
adrianreber
714d9aa67e Merge pull request #348 from adrianreber/topic/orte_cr_continue_like_restart
Topic/orte cr continue like restart
2015-03-12 14:54:02 +01:00
Nathan Hjelm
ce6caab2a7 Merge pull request #463 from hjelmn/cuda_async
btl/openib: cuda: fix CUDA-aware support with async copy
2015-03-11 09:52:48 -06:00
Jeff Squyres
c61dd4d56f usnic: each err eq entry reports *1* completion
Actually, the return from fi_eq_readerr() only indicates a *single*
error completion (not err_entry.data completions).
2015-03-11 08:07:20 -07:00
Nathan Hjelm
395635f017 Merge pull request #461 from hjelmn/btl_openib_cleanup
btl/openib: remove derived btl segment type
2015-03-11 08:20:41 -06:00
Jeff Squyres
9c926e5e82 usnic: add more comments/explanation about error cases
If we really get a catastrophic error from a libfabric call, don't
bother trying to continue (because data has been corrupted and there's
nothing sane left to do).  Just call opal_btl_usnic_exit() (which
tries to call the PML error callback, but we're so early in the
module_init process that this likely hasn't been set up yet, so the job
will likely abort).
2015-03-11 07:16:28 -07:00
Jeff Squyres
51583789fb usnic: re-indent some show_help code
Nothing too substantial here, but two of the messages moved from
"libfabric API failed" to "internal error during init", just to be a
bit more descriptive.
2015-03-11 07:15:28 -07:00
Jeff Squyres
1b836d784c usnic: subtract number of errored insertions from loop count
When we get errors, the entry.data field tells us how many errors are
being reported.  So decrement the loop count variable by that much.

This fixes CSCut30441.
2015-03-11 07:13:10 -07:00
Adrian Reber
f45dd069bd FT: fix compilation using --with-ft (1/5)
Enabling the FT code breaks compilation (again). This series
tries to fix the compiler errors. This is again only fixing
the compiler errors without any warranty that the result
might actually support FT again.

This first patch moves orte_cr_continue_like_restart from ORTE
to opal_cr_continue_like_restart in OPAL. This only leaves three
calls from OPAL to ORTE in the FT code. As it is not yet 100%
clear how to handle these calls the code orte_sstore.set_attr()
has been #ifdef'd out for now.
2015-03-11 14:23:33 +01:00
Nathan Hjelm
b308afa8fd btl/openib: remove derived btl segment type
The derived segment type (btl_openib_segment_t) was intended to store
the registration info needed for put and get. In BTL 3.0 this is no
longer required. I intended to remove this type as part of
open-mpi/ompi@74f1af4548 .

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2015-03-10 14:41:15 -06:00