The subarray datatype was not packing/unpacking correctly. This was
leading to wrong results whenever the lb of the subarray datatype was
non-zero.
I tracked the issue to the use of ompi_datatype_create_resized. This
function simply duplicates the old datatype and sets the lb and
extent. This is unfortunately insufficient for the pack/unpack
functions, which use the loop-end first-element offset, NOT the lb.
This offset is 0 in the resized datatype. Once
ompi_datatype_create_resized has been fixed this commit should be
reverted.
Fixes #380.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
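To illustrate what correct packing means when a subarray starts at a
non-zero offset, here is a minimal sketch in plain C, without MPI. The
helper name pack_subarray_2d and its parameters are hypothetical and
are not part of the Open MPI datatype engine; the starting offset only
stands in for the first-element offset mentioned above.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy a sub_rows x sub_cols block that begins at
 * (start_row, start_col) of a row-major int array with `cols` columns
 * into a contiguous buffer. Returns the number of elements packed. */
static size_t pack_subarray_2d(const int *src, size_t cols,
                               size_t start_row, size_t start_col,
                               size_t sub_rows, size_t sub_cols,
                               int *packed)
{
    size_t n = 0;
    for (size_t r = 0; r < sub_rows; ++r) {
        /* The first element of each packed row is offset by the start
         * indices; ignoring them would pack from element 0 instead. */
        const int *row = src + (start_row + r) * cols + start_col;
        memcpy(packed + n, row, sub_cols * sizeof(int));
        n += sub_cols;
    }
    return n;
}
```

Packing with start indices of 0 and with non-zero start indices must
yield different data; a pack routine that always starts at offset 0
returns the former in both cases.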
enumerator
This commit adds a function that will return an integer value for an
info key based on the value returned by a variable enumerator. This
feature should greatly simplify code using the info keys (osc for
example).
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
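The idea of mapping an info key's string value to an integer through an
enumerator can be sketched as follows. This is a simplified stand-in,
not the function added by the commit: struct enum_entry and
info_value_to_int are hypothetical names, and the real code works with
MCA variable enumerators and MPI info objects.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical enumerator entry: a string and the integer it maps to. */
struct enum_entry {
    const char *name;
    int value;
};

/* Hypothetical lookup: returns 0 and stores the integer in *out when
 * info_value matches an entry; returns -1 and leaves *out untouched
 * otherwise. */
static int info_value_to_int(const struct enum_entry *entries, size_t count,
                             const char *info_value, int *out)
{
    if (NULL == info_value || NULL == out) {
        return -1;
    }
    for (size_t i = 0; i < count; ++i) {
        if (0 == strcmp(entries[i].name, info_value)) {
            *out = entries[i].value;
            return 0;
        }
    }
    return -1;
}
```

A caller can then switch on the integer instead of repeating string
comparisons for every key.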
Enabling the FT code breaks compilation (again). This series
tries to fix the compiler errors. This is again only fixing
the compiler errors without any warranty that the result
might actually support FT again.
With the changes introduced in the previous patches in this series
some goto constructs for cleanup are no longer necessary and removed.
Enabling the FT code breaks compilation (again). This series
tries to fix the compiler errors. This is again only fixing
the compiler errors without any warranty that the result
might actually support FT again.
Follow-up to 552c9ca5a0. This patch
applies the changes required by that commit to the FT code.
Enabling the FT code breaks compilation (again). This series
tries to fix the compiler errors. This is again only fixing
the compiler errors without any warranty that the result
might actually support FT again.
The FT code used barrier mechanisms which have been removed
with aec5cd08bd. This patch replaces
all those different barriers with opal_pmix.fence(NULL, 0);
I am not sure this is completely correct but at least a starting
point for a review.
Enabling the FT code breaks compilation (again). This series
tries to fix the compiler errors. This is again only fixing
the compiler errors without any warranty that the result
might actually support FT again.
This first patch moves orte_cr_continue_like_restart from ORTE
to opal_cr_continue_like_restart in OPAL. This only leaves three
calls from OPAL to ORTE in the FT code. As it is not yet 100%
clear how to handle these calls the code orte_sstore.set_attr()
has been #ifdef'd out for now.
This commit should resolve an issue seen with CUDA-aware support. The
problem came in with BTL 3.0. Before 3.0 the size of the copy was
stored in the incoming segment's des_remote_count field. This field
does not exist in BTL 3.0 so I stored the value in the
des_segment_count field. This caused problems with the cuda support
code. To fix the issue the endpoint pointer is now stored in the
fragment's endpoint pointer, which frees up the segment's des_cbdata
pointer for storing the transfer size.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit fixes a bug in osc/pt2pt which causes MPI_Win_unlock_all
to hang. The problem was caused by code refactoring that moved the
unlock of the local process to after the loop that waits for unlock
acks. This will cause the code to loop forever waiting on the self
ack.
Fixes #444
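The ordering problem in the MPI_Win_unlock_all fix can be shown with a
toy model. Everything here is hypothetical (struct win_state, the
progress step, the step limit); it only mirrors the description above:
the self ack cannot arrive until the local process has been unlocked,
so waiting for all acks first never finishes.

```c
#include <stdbool.h>

/* Toy window state: acks still outstanding, including the self ack. */
struct win_state {
    int pending_acks;
    bool self_unlocked;
};

/* Unlocking the local process produces the self ack immediately. */
static void unlock_self(struct win_state *w)
{
    if (!w->self_unlocked) {
        w->self_unlocked = true;
        w->pending_acks--;
    }
}

/* One progress step: at most one remote ack arrives. The self ack
 * never arrives through progress. */
static void progress(struct win_state *w)
{
    int remote_pending = w->pending_acks - (w->self_unlocked ? 0 : 1);
    if (remote_pending > 0) {
        w->pending_acks--;
    }
}

/* Returns true if all acks arrive within max_steps. With
 * unlock_self_first == false the wait loop can never drain the self
 * ack, which models the hang (bounded here so the sketch terminates). */
static bool unlock_all(struct win_state *w, bool unlock_self_first,
                       int max_steps)
{
    if (unlock_self_first) {
        unlock_self(w);
    }
    for (int i = 0; i < max_steps && w->pending_acks > 0; ++i) {
        progress(w);
    }
    return 0 == w->pending_acks;
}
```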
pml_yalla_del_comm may be called after the yalla module is finalized,
which leads to invalid memory access if the mxm context has already
been destroyed at that point.
There were a number of bugs in the framework/variable code that
affected deregistration:
- Frameworks could be erroneously closed if separately registered and
opened then subsequently closed. This was a bug in the original
design which only reference counted opens but not
registrations. This would cause undefined behavior if
MPI_T_finalize actually calls ompi_info_close_components as
intended. Now both registrations and opens are reference counted
and frameworks/components are not torn down until the matching
number of close calls have been made.
- group_find_by_name did not pass the invalidok flags down
to mca_base_var_group_get_internal correctly.
- Group deregistration caused the group to be completely reset. This
does not match the behavior required by MPI_T as it could reduce
the number of variables/subgroups in a group.
This commit also updates MPI_T_finalize to call
ompi_info_close_components as originally intended.
Closes #374
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
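The reference counting described in the first bullet of the
framework/variable commit above can be sketched with a single counter
shared by registrations and opens. This is a simplified model with
hypothetical names (struct framework, framework_register,
framework_open, framework_close), not the MCA base implementation.

```c
#include <stdbool.h>

/* Hypothetical framework record: one counter covers both register and
 * open calls; `active` is false once the framework is torn down. */
struct framework {
    int refcnt;
    bool active;
};

static void framework_register(struct framework *f)
{
    f->refcnt++;
    f->active = true;
}

static void framework_open(struct framework *f)
{
    f->refcnt++;
    f->active = true;
}

/* Tear down only when the number of close calls matches the number of
 * register and open calls. Extra close calls are ignored. */
static void framework_close(struct framework *f)
{
    if (f->refcnt > 0 && 0 == --f->refcnt) {
        f->active = false;
    }
}
```

With only opens counted, the first close after a separate register and
open would tear the framework down while the registration still relied
on it.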
Okay, Coverity seems to get one stuck in a loop whereby
fixing one set of resource allocation problems makes it
start finding more.
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
Note that this commit removes option:lt_dladvise from the various
"info" tools output. This technically breaks our CLI "ABI" because
we're not deprecating it / replacing it with an alias to some other
"info" tool output.
Although the dl/libltdl component contains a "have_lt_dladvise" MCA
var that contains the same information, the "option:lt_dladvise"
output from the various "info" tools is *not* an MCA var, and
therefore we can't alias it. So it just has to die.
Remove some outdated discussion of "color" -- looks like this was a
copy-n-paste from the MPI_Comm_split man page. Also make some minor
updates to some Open MPI-specific key text.
Thanks to @eschnett for raising the issue.
Fixes #437.
From the ROMIO sources:
/* Any MPI implementation that wishes to follow the thread-safety and
error reporting features provided by MPICH must implement these
four functions. Defining these as empty should not change the behavior
of correct programs */
The MPIO_DATATYPE_ISCOMMITTED macro now always sets err_=0.
This is an optimistic approach for Open MPI, but it is likely that other
upper layers have already checked that the datatype was committed.
Not setting err_ is incorrect since it can lead to use of an
uninitialized variable.
Fixes open-mpi/ompi#404
If the free list initialization fails in libnbc_open(),
mca_coll_libnbc_component.active_requests remains uninitialized,
resulting in a crash while closing the component.
Use of the old ompi_free_list_t and ompi_free_list_item_t is
deprecated. These classes will be removed in a future commit.
This commit updates the entire code base to use opal_free_list_t and
opal_free_list_item_t.
Notes:
OMPI_FREE_LIST_*_MT -> opal_free_list_* (uses opal_using_threads ())
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Please verify your components have been updated correctly. Keep in
mind that in terms of threading:
OPAL_FREE_LIST_GET -> opal_free_list_get_st
OPAL_FREE_LIST_RETURN -> opal_free_list_return_st
I used the opal_using_threads() variant anytime it appeared multiple
threads could be operating on the free list. If this is not the case
update to _st. If multiple threads are always in use change to _mt.
This commit adds an owner file in each of the component directories
for each framework. This allows for a simple script to parse
the contents of the files and generate, among other things, tables
to be used on the project's wiki page. Currently there are two
"fields" in the file, an owner and a status. A tool to parse
the files and generate tables for the wiki page will be added
in a subsequent commit.
The RPATH support added a @{libdir} token into
<package>_WRAPPER_EXTRA_LDFLAGS. However, these flags are also
substituted into the pkg-config data files, and they don't understand
the @{foo} notation. So convert @{libdir} into ${libdir}, which
pkg-config *does* understand.
Thanks to Christoph Junghans (@junghans) for notifying us of the issue.
Fixes #406.
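The token conversion described in the RPATH commit above amounts to
rewriting "@{" as "${". The commit does not say how the conversion is
implemented; the following is only a sketch in C with a hypothetical
helper name (at_to_dollar_tokens), and it rewrites every @{...} token
rather than @{libdir} alone.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy src into dst, turning each "@{" into "${".
 * The output has the same length as the input. Returns 0 on success,
 * -1 if dst (of size dst_len) is too small. */
static int at_to_dollar_tokens(const char *src, char *dst, size_t dst_len)
{
    size_t len = strlen(src);
    if (len + 1 > dst_len) {
        return -1;
    }
    for (size_t i = 0; i < len; ++i) {
        /* src[i + 1] is at worst the terminating NUL, so this is safe. */
        dst[i] = ('@' == src[i] && '{' == src[i + 1]) ? '$' : src[i];
    }
    dst[len] = '\0';
    return 0;
}
```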
The fi_fabric function appears to free the provider string passed in
in the fabric_attr. This causes MCA to free an invalid pointer when
the parameter is freed.
References #374
Commit open-mpi/ompi@1a3597aam changed the type of the `convertor`
variable from `ompi_osc_base_convertor_t` (which contained an
`opal_convertor_t`) to an `opal_convertor_t`. Hence, using memchecker
to ensure that the inner convertor of the `ompi_osc_base_convertor_t`
is considered initialized is now unnecessary.
Background: In order to support atomics each btl needs to provide support
for communicating with self unless the btl module can guarantee global
atomicity. Before this commit bml/r2 discarded any BTL with lower
exclusivity than an existing send btl. This would cause the BML to
discard any btl other than self.
The new behavior is as follows:
- If an existing send btl has higher exclusivity then the btl will not be
added to the send btl list for the endpoint.
- If a btl provides RDMA support then it is always added to the rdma btl
list.
- bml_btl weight for send btls is now calculated across all send btls.
- bml_btl weight for rdma btls is now calculated across all rdma btls.
With this change self should still win as the only send btl for loopback
without disqualifying other btls (ugni, openib) for atomic operations.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
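The first two rules of the new bml/r2 behavior listed above can be
sketched as follows. The types and names (struct btl_desc, struct
endpoint_lists, endpoint_add_btl) are hypothetical, the exclusivity
values used with them are illustrative only, and the weight
calculation is not modeled because the commit message does not give
its formula.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_BTLS 8

/* Hypothetical, heavily reduced view of a BTL module. */
struct btl_desc {
    const char *name;
    int exclusivity;
    bool has_rdma;
};

/* Hypothetical per-endpoint send and rdma lists. */
struct endpoint_lists {
    const struct btl_desc *send[MAX_BTLS];
    size_t n_send;
    const struct btl_desc *rdma[MAX_BTLS];
    size_t n_rdma;
};

static void endpoint_add_btl(struct endpoint_lists *ep,
                             const struct btl_desc *btl)
{
    /* Rule 1: skip the send list if an existing send btl has higher
     * exclusivity. */
    bool outranked = false;
    for (size_t i = 0; i < ep->n_send; ++i) {
        if (ep->send[i]->exclusivity > btl->exclusivity) {
            outranked = true;
            break;
        }
    }
    if (!outranked && ep->n_send < MAX_BTLS) {
        ep->send[ep->n_send++] = btl;
    }
    /* Rule 2: a btl with RDMA support always joins the rdma list,
     * regardless of the outcome of rule 1. */
    if (btl->has_rdma && ep->n_rdma < MAX_BTLS) {
        ep->rdma[ep->n_rdma++] = btl;
    }
}
```

Adding a high-exclusivity loopback btl followed by a lower-exclusivity
RDMA-capable btl leaves the first as the only send btl while the second
still appears in the rdma list.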
A little background. Historically ob1 always registered the entire memory
region when the RGET protocol was in use. This changed when Mellanox
added support to fragment RGET using the btl_prepare_dst function. Now
that the BTL layer has changed to split out the limits of get/put there
is explicit fragmentation code in ob1. Before this commit the registration
was still done per RGET fragment.
This commit will attempt to register the entire region before creating
RGET fragments. If the registration is successful then all RGET
fragments will use this registration otherwise they will each attempt
to register their own segment of the receive buffer. If that fails
enough times each fragment will give up and fall back on send/recv.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
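The registration strategy described in the RGET commit above can be
sketched as a planning function. All names here are hypothetical
(register_fn, enum frag_path, plan_rget_fragments, and the stand-in
callbacks); a real registration goes through the BTL, and the number
of attempts before falling back is an assumed parameter.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical registration callback: true if [base, base+len) could
 * be registered. */
typedef bool (*register_fn)(void *base, size_t len);

enum frag_path {
    FRAG_USES_REGION_REG,    /* shares the whole-region registration */
    FRAG_USES_OWN_REG,       /* registered its own segment */
    FRAG_FALLS_BACK_TO_SEND  /* gave up, uses send/recv */
};

/* Stand-in callbacks used only for illustration. */
static bool reg_always(void *base, size_t len) { (void) base; (void) len; return true; }
static bool reg_never(void *base, size_t len) { (void) base; (void) len; return false; }
static bool reg_up_to_4k(void *base, size_t len) { (void) base; return len <= 4096; }

/* Try to register the whole buffer once; otherwise let each fragment
 * try up to max_attempts times before falling back to send/recv.
 * Returns the number of fragments written to paths. */
static size_t plan_rget_fragments(char *buf, size_t len, size_t frag_size,
                                  register_fn reg, int max_attempts,
                                  enum frag_path *paths, size_t max_frags)
{
    if (0 == frag_size) {
        return 0;
    }
    bool whole = reg(buf, len);
    size_t nfrags = 0;
    for (size_t off = 0; off < len && nfrags < max_frags; off += frag_size) {
        size_t frag_len = (len - off < frag_size) ? len - off : frag_size;
        if (whole) {
            paths[nfrags++] = FRAG_USES_REGION_REG;
            continue;
        }
        bool ok = false;
        for (int a = 0; a < max_attempts && !ok; ++a) {
            ok = reg(buf + off, frag_len);
        }
        paths[nfrags++] = ok ? FRAG_USES_OWN_REG : FRAG_FALLS_BACK_TO_SEND;
    }
    return nfrags;
}
```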
Coverity identified that we treated the possibility that one of the
message buffers could be NULL in some places (because strdup() could
fail), but not in others.
So just use stack buffers that will never be NULL.
This was CID 1269914.
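The stack-buffer approach from the Coverity fix above can be sketched
like this. The function name, message format, and buffer sizes are
hypothetical; the point is only that a caller-provided array cannot be
NULL the way a failed strdup() result can.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: format a message into a caller-provided buffer,
 * typically a stack array. Output is truncated if it does not fit.
 * Returns the number of characters stored, excluding the NUL. */
static size_t build_message(char *buf, size_t buf_len,
                            const char *host, int err)
{
    if (0 == buf_len) {
        return 0;
    }
    int n = snprintf(buf, buf_len, "%s: error %d", host, err);
    if (n < 0) {
        buf[0] = '\0';
        return 0;
    }
    return ((size_t) n < buf_len) ? (size_t) n : buf_len - 1;
}
```

A caller declares something like char msg[64] on the stack and passes
msg with sizeof(msg), so no later code path needs a NULL check on the
message buffer.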