Based on some on-list and IM discussion with @hjelmn about
open-mpi/ompi@40b7643119, change the testing to a switch/case. If we
fall into the default case, assert() error (because it's an OMPI
developer programming error).
This commit fixes several vagrind errors. Included:
- installdirs did not correctly reinitialize all pointers to NULL
at close. This causes valgrind errors on a subsequent call to
opal_init_tool.
- several opal strings were leaked by opal_deregister_params which
was setting them to NULL instead of letting them be freed by the
MCA variable system.
- move opal_net_init to AFTER the variable system is initialized and
opal's MCA variables have been registered. opal_net_init uses a
variable registered by opal_register_params!
- do not leak ompi_mpi_main_thread when it is allocated by
MPI_T_init_thread.
- do not overwrite ompi_mpi_main_thread if it is already set (by
MPI_T_init_thread).
- mca_base_var: read_files was overwritting mca_base_var_file_list
even if it was non-NULL.
- mca_base_var: set all file global variables to initial states on
finalize.
- btl/vader: decrement enumerator reference count to ensure that it
is freed.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit fixes the following bugs:
- opal_output_finalize did not properly set internal state. This
caused problems when calling the sequence opal_output_init (),
opal_output_finalize (), opal_output_init ().
- opal_info support called mca_base_open () but never called the
matching mca_base_close (). mca_base_open () and mca_base_close ()
have been updated to use a open count instead of an open flag to
allow mca_base_open to be called through multiple paths (as may be
the case when MPI_T is in use).
- orte_info support did not register opal variables. This can cause
orte-info to not return opal variables.
- opal_info, orte_info, and ompi_info support have been updated to
use a register count.
- When opening the dl framework the reference count was added to
ensure the framework stuck around. The framework being closed
prematurely was a bug in the MCA base that has since been
corrected. The increment (and associated decrement) have been
removed.
- dl/dlopen did not set the value of
mca_dl_dlopen_component.filename_suffixes_mca_storage on each call
to register. Instead the value was set in the component
structure. This caused the value to be lost when re-loading the
component. Fixed by setting the default value in register.
- Reset shmem framework state on close to avoid returning a stale
component after reloading opal/shmem.
- MCA base parameters were not properly deregistered when the MCA
base was closed.
This commit may fix#374.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
The fragment flush code tries to send the active fragment before
sending any queued fragments. This could cause osc messages to arrive
out-of-order at the target (bad). Ensure ordering by alway sending
the active fragment after sending queued fragments.
This commit also fixes a bug when a synchronization message (unlock,
flush, complete) can not be packed at the end of an existing active
fragment. In this case the source process will end up sending 1 more
fragment than claimed in the synchronization message. To fix the issue
a check has been added that fixes the fragment count if this situation
is detected.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This patch tries to do as little as possible in the PML CM blocking
send/receive routines. Basically, avoid creating and filling in an
entire request object. An OMPI-level request is still needed, but we
can create that on the stack instead of going to a free list.
Signed-off-by: Andrew Friedley <andrew.friedley@intel.com>
Signed-off-by: Jithin Jose <jithin.jose@intel.com>
This commit adds support for project_framework_component_* parameter
matching. This is the first step in allowing the same framework name
in multiple projects. This change also bumps the MCA component version
to 2.1.0.
All master frameworks have been updated to use the new component
versioning macro. An mca.h has been added to each project to add a
project specific versioning macro of the form
PROJECT_MCA_VERSION_2_1_0.
Signed-off-by: Nathan Hjelm <hjelmn@me.com>
The subarray datatype was not packing/unpacking correctly. This was
leading to wrong results whenever the lb of the subarray datatype was
non-zero.
I tracked the issue to the use of ompi_datatype_create_resized. This
function simply duplicates the old datatype and sets the lb and
extent. This is unfortunately insufficent for the pack/unpack
functions which use the loop end first element offset NOT the lb. This
offset is 0 in the resized datatype. Once ompi_datatype_create_resized
has been fixed this commit should be reverted.
Fixes#380.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
enumerator
This commit adds a function that will return an integer value for an
info key based on the value returned by a variable enumerator. This
feature should greatly simplify code using the info keys (osc for
example).
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Enabling the FT code breaks compilation (again). This series
tries to fix the compiler errors. This is again only fixing
the compiler errors without any warranty that the result
might actually support FT again.
With the changes introduced in the previous patches in this series
some goto constructs for cleanup are no longer necessary and removed.
Enabling the FT code breaks compilation (again). This series
tries to fix the compiler errors. This is again only fixing
the compiler errors without any warranty that the result
might actually support FT again.
Follow-up of 552c9ca5a0. This patch
implements the necessary changes in mentioned commit in the FT code.
Enabling the FT code breaks compilation (again). This series
tries to fix the compiler errors. This is again only fixing
the compiler errors without any warranty that the result
might actually support FT again.
The FT code used barrier mechanisms which have been removed
with aec5cd08bd. This patch replaces
all those different barriers with opal_pmix.fence(NULL, 0);
I am not sure this is completely correct but at least a starting
point for a review.
Enabling the FT code breaks compilation (again). This series
tries to fix the compiler errors. This is again only fixing
the compiler errors without any warranty that the result
might actually support FT again.
This first patch moves orte_cr_continue_like_restart from ORTE
to opal_cr_continue_like_restart in OPAL. This only leaves three
calls from OPAL to ORTE in the FT code. As it is not yet 100%
clear how to handle these calls the code orte_sstore.set_attr()
has been #ifdef'd out for now.
This commit should resolve an issue seen with CUDA-aware support. The
problem came in with BTL 3.0. Before 3.0 the size of the copy was
stored in the incoming segment's des_remote_count field. This field
does not exist in BTL 3.0 so I stored the value in the
des_segment_count field. This caused problems with the cuda support
code. To fix the issue the endpoint pointer is now stored in the in
fragment's endpoint pointer which free's up the segment's des_cbdata
pointer for storing the transfer size.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit fixes a bug in osc/pt2pt which causes MPI_Win_unlock_all
to hang. The problem was caused by code refactoring that moved the
unlock of the local process to after the loop that waits for unlock
acks. This will cause the code to loop forever waiting on the self
ack.
Fixes#444
pml_yalla_del_comm may be called after yalla module is finalized, which
leads to invalid memory access if mxm context is already destroyed in
this point.
There were a number of bugs in the framework/variable code that
affected deregistration:
- Frameworks could be erroneously closed if seperately registered and
opened then subsequently closed. This was a bug in the original
design which only reference counted opens but not
registrations. This would cause undefined behavior if
MPI_T_finalize actually calls ompi_info_close_components as
intended. Now both registrations and opens are reference counted
and frameworks/components are not torn down until the matching
number of close calls have been made.
- group_find_by_name did not pass the invalidok flags down
to mca_base_var_group_get_internal correctly.
- Group deregistration caused the group to be completely reset. This
does not match the behavior required by MPI_T as it could reduce
the number of variables/subgroups in a group.
This commit also updates MPI_T_finalize to call
ompi_info_close_components as originally intended.
Closes#374
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Okay coverity seems to get one stuck in a loop where
by fixing one set of resource allocation problems, it
starts finding more.
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
Noe that this commit removes option:lt_dladvise from the various
"info" tools output. This technically breaks our CLI "ABI" because
we're not deprecating it / replacing it with an alias to some other
"into" tool output.
Although the dl/libltdl component contains an "have_lt_dladvise" MCA
var that contains the same information, the "option:lt_dladvise"
output from the various "info" tools is *not* an MCA var, and
therefore we can't alias it. So it just has to die.
Remove some outdated discussion of "color" -- looks like this was a
copy-n-paste from the MPI_Comm_split man page. Also make some minor
updates to some Open MPI-specific key text.
Thanks to @eschnett for raising the issue.
Fixes#437.