Nothing too substantial here, but two of the messages moved from
"libfabric API failed" to "internal error during init", just to be a
bit more descriptive.
When we get errors, the entry.data field tells us how many errors are
being reported. So decrement the loop count variable by that much.
This fixes CSCut30441.
Enabling the FT code breaks compilation (again). This series
tries to fix the compiler errors. This is again only fixing
the compiler errors without any warranty that the result
might actually support FT again.
This first patch moves orte_cr_continue_like_restart from ORTE
to opal_cr_continue_like_restart in OPAL. This only leaves three
calls from OPAL to ORTE in the FT code. As it is not yet 100%
clear how to handle these calls the code orte_sstore.set_attr()
has been #ifdef'd out for now.
The derived segment type (btl_openib_segment_t) was intended to store
the registration info needed for put and get. In BTL 3.0 this is no
longer required. I intended to remove this type as part of
open-mpi/ompi@74f1af4548 .
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit should resolve an issue seen with CUDA-aware support. The
problem came in with BTL 3.0. Before 3.0 the size of the copy was
stored in the incoming segment's des_remote_count field. This field
does not exist in BTL 3.0 so I stored the value in the
des_segment_count field. This caused problems with the cuda support
code. To fix the issue the endpoint pointer is now stored in the in
fragment's endpoint pointer which free's up the segment's des_cbdata
pointer for storing the transfer size.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
If we ended up with no modules (e.g., all usnic devices were
excluded), there was a race condition in that the connectivity agent
could tear down its local socket before one or more of the local
clients saw it. Therefore, the local clients would timeout waiting
for the socket to appear.
So move the connectivity checker init later in the bootstrapping
process (it *must* be setup before module_init()), and have it only
invoked if we actually ended up with one or more modules.
Re-structure the loop looking for duplicates a little so that we only
have a single free of the string that happens regardless of whether we
found a duplicate or not.
This was Coverity CID 1288090
Fix previously-unfinished error paths during startup/bootstrapping.
Instead of just blindly continuing on when an fi_* function call
fails, opal_show_help and skip that device.
Also, only check the usnic config minimums once. They're VIC-wide and
won't change on a per-device basis -- we only need to check them once.
Fixes CSCut19179.
There were a number of bugs in the framework/variable code that
affected deregistration:
- Frameworks could be erroneously closed if seperately registered and
opened then subsequently closed. This was a bug in the original
design which only reference counted opens but not
registrations. This would cause undefined behavior if
MPI_T_finalize actually calls ompi_info_close_components as
intended. Now both registrations and opens are reference counted
and frameworks/components are not torn down until the matching
number of close calls have been made.
- group_find_by_name did not pass the invalidok flags down
to mca_base_var_group_get_internal correctly.
- Group deregistration caused the group to be completely reset. This
does not match the behavior required by MPI_T as it could reduce
the number of variables/subgroups in a group.
This commit also updates MPI_T_finalize to call
ompi_info_close_components as originally intended.
Closes#374
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Embedding libltdl without the use of Libtool bootstrapping has
proven... difficult. Instead, create a new simple "dl" framework. It
only provides 4 functions:
- open a DSO (very similar to lt_dlopenadvise())
- lookup a symbol in a previously-opened DSO (very similar to lt_dlsym())
- close a previously-opened DSO (very similar to lt_dlclose())
- iterate over all files in a directory (very similar to ld_dlforeachfile())
There will be follow-on commits with a simple dlopen-based component
(nowhere near as complete/functional as libltdl, but good enough for
Linux and OS X), and a libltdl-based component for all other
platforms.
The intent is that the dlopen-based component can be built by default
in almost all cases. But if libltdl is available, that component will
be built. End result: we still get DSO-based functionality by default
in (almost?) all cases. Without embedding libltdl. Which is what we
want.
call the opal_common_verbs_mca_register function to make sure that
opal_common_verbs_want_fork_support mca parameter is created and therefore
can be used to control the fork support.