This commit fixes an assert when trying to clean up a module we failed
to initialize. There is no protection around the OBJ_DESTRUCT calls, so
they are always called. Similarly, we should always call OBJ_CONSTRUCT
at init.
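A minimal sketch of the construct-always pattern described above. The
demo_* names are hypothetical stand-ins, not the real OBJ_CONSTRUCT and
OBJ_DESTRUCT macros or any actual module; the destructor asserts on an
unconstructed object the way a debug build might.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for an OPAL object; tracks construction state. */
typedef struct { int constructed; } demo_obj_t;

static void demo_construct(demo_obj_t *obj) { obj->constructed = 1; }

static void demo_destruct(demo_obj_t *obj)
{
    assert(obj->constructed);   /* fires if construct was skipped */
    obj->constructed = 0;
}

/* Hypothetical module with two embedded objects and one allocation. */
typedef struct {
    demo_obj_t lock;
    demo_obj_t pending;
    void *buffer;
} demo_module_t;

/* Construct every member first, then do the work that can fail. */
static int demo_module_init(demo_module_t *m, int simulate_failure)
{
    demo_construct(&m->lock);
    demo_construct(&m->pending);
    m->buffer = NULL;

    if (simulate_failure) {
        return -1;              /* members are already constructed */
    }
    m->buffer = malloc(64);
    return (NULL != m->buffer) ? 0 : -1;
}

/* Cleanup runs unconditionally, like the unguarded destruct calls. */
static void demo_module_finalize(demo_module_t *m)
{
    free(m->buffer);
    m->buffer = NULL;
    demo_destruct(&m->pending);
    demo_destruct(&m->lock);
}
```

With this ordering, calling demo_module_finalize() after a failed
demo_module_init() does not trip the assert.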
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit also fixes a problem with the lazy opening of topo
components. The topo framework incorrectly: 1) checked if the topo
framework was open by checking the length of the components list, and
2) called the framework open directly instead of using
mca_base_framework_open.
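A rough sketch of why the component-list length is a poor "is open"
test. The demo_* types and functions are hypothetical and only mimic the
idea of going through one generic open entry point that tracks state
itself; they are not the real MCA framework API.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical framework record; not the real mca_base_framework_t. */
typedef struct {
    bool is_open;         /* explicit open state */
    int  num_components;  /* may legitimately be 0 after a successful open */
    int  open_calls;      /* how many times the real open work ran */
} demo_framework_t;

/* Stand-in for the generic open entry point: idempotent by design. */
static int demo_framework_open(demo_framework_t *fw)
{
    if (fw->is_open) {
        return 0;         /* already open, nothing to do */
    }
    fw->open_calls++;
    fw->is_open = true;
    return 0;
}

/* Lazy initialization path. Testing (0 == fw->num_components) here would
 * redo the open on every call whenever no components were found. */
static int demo_lazy_init(demo_framework_t *fw)
{
    return demo_framework_open(fw);
}
```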
Fixes #544
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
The Portals4 spec requires that PtlSetMap() be called before any
Portals4 call other than PtlNIInit(), PtlGetMap() and
PtlGetPhysId(). To satisfy this requirement, this commit delays
interface initialization until the end of add_procs(), at which time
PtlSetMap() has been called.
The Portals4 BTL is registered with the PML as an RDMA BTL, so
prepare_src() is only used in limited cases. This commit removes
the code path from prepare_src() for unbuffered contiguous buffers
with no PML reserve. This is now handled in register_mem().
In the large message case, the sender issues a PtlMEAppend() in
order to generate events when the receiver issues a PtlGet(). This
commit moves the PtlMEAppend() from mca_btl_portals4_prepare_src()
to mca_btl_portals4_register_mem(), which is the way it's done in
BTL 3.0.
This commit is related to an RFC from June 2014. Discussion can be
found at:
http://www.open-mpi.org/community/lists/devel/2014/07/15140.php
The finalize function is set using either the linker option -fini or
__attribute__((destructor)) depending on compiler support. I have
confirmed that this hybrid approach works with all the major
compilers. The attribute is supported by gcc, clang, llvm, xlc, and
icc. The fini function will support pgi. If a compiler/linker
combination supports neither the destructor attribute nor the fini
function, a message will be printed on re-init indicating it is not
supported (an improvement over the current behavior, a SEGV).
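A small sketch of the destructor half of the hybrid approach, assuming a
GNU-compatible compiler. All demo_* names are hypothetical. The -fini
linker fallback is not shown, since it is selected at link time rather
than in the source.

```c
#include <assert.h>

/* Hypothetical state: the class system footprint and the init flag. */
static int demo_class_system_live = 0;
static int demo_initialized = 0;

void demo_library_fini(void);

/* Init brings up the class system if it is not already there. */
static void demo_init(void)
{
    if (0 == demo_class_system_live) {
        demo_class_system_live = 1;
    }
    demo_initialized = 1;
}

/* Finalize deliberately leaves the class system in place so that a later
 * re-init (or a tool interface used before init) still finds it. */
static void demo_finalize(void)
{
    demo_initialized = 0;
}

/* Runs when the library is unloaded or the process exits. */
#if defined(__GNUC__)
__attribute__((destructor))
#endif
void demo_library_fini(void)
{
    demo_class_system_live = 0;
}
```

The design trade-off is the one named in the message: the class system
footprint outlives finalize and is only released by the destructor.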
I moved the following to the destructor function:
- Class system finalize. This solves a bug when MPI_T_finalize is
called before MPI_Init. The only downside to this change is we will
leave the footprint of the opal class system after
MPI_Finalize. This footprint should be relatively small.
This is an alternative to #517 but the two PRs are not
mutually-exclusive (with some modifications). This commit should also
be safe for 1.8.x as it does not change internal or external ABI (#517
changes internal ABI).
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit is a rework of the component repository. The changes
included in this commit are:
- Remove the component dependency code based off .ompi_info
files. This is legacy code dating back 10 years and is no longer
used.
- Move the plugin scanning code to the component repository. New
calls have been added to add new scanning paths, query available
components, and dlopen/load components.
- Pass the framework down to mca_base_component_find/filter. Eventually
the framework structure will be used to further validate components
before they are used.
- Add support to the MCA framework system to disable scanning for
dlopened components on open (support already existed in
register). This is really only relevant to installdirs as it has no
register function and no DSO components.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit fixes a typo in mca_btl_vader_progress_endpoints where
OPAL_THREAD_LOCK was used when OPAL_THREAD_UNLOCK was intended.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit fixes several valgrind errors. Included:
- installdirs did not correctly reinitialize all pointers to NULL
at close. This causes valgrind errors on a subsequent call to
opal_init_tool.
- several opal strings were leaked by opal_deregister_params which
was setting them to NULL instead of letting them be freed by the
MCA variable system.
- move opal_net_init to AFTER the variable system is initialized and
opal's MCA variables have been registered. opal_net_init uses a
variable registered by opal_register_params!
- do not leak ompi_mpi_main_thread when it is allocated by
MPI_T_init_thread.
- do not overwrite ompi_mpi_main_thread if it is already set (by
MPI_T_init_thread).
- mca_base_var: read_files was overwriting mca_base_var_file_list
even if it was non-NULL.
- mca_base_var: set all file global variables to initial states on
finalize.
- btl/vader: decrement enumerator reference count to ensure that it
is freed.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit fixes a bug when opal_error_init is called with the same
values multiple times. If opal_error_init is called too many times it
will start failing with OPAL_ERR_OUT_OF_RESOURCE. To fix the problem,
check whether an existing convertor matches the requested one and, if
so, return that one instead.
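The fix can be sketched roughly as follows, assuming a fixed-size table
of registered convertors. The table size, the demo_* names, and the
error values are hypothetical and are not taken from the OPAL sources.

```c
#include <assert.h>
#include <string.h>

#define DEMO_MAX_CONVERTERS      5     /* hypothetical table size */
#define DEMO_SUCCESS             0
#define DEMO_ERR_OUT_OF_RESOURCE (-2)  /* hypothetical error value */

typedef const char *(*demo_converter_fn_t)(int errnum);

typedef struct {
    int in_use;
    char project[16];
    int err_base;
    int err_max;
    demo_converter_fn_t fn;
} demo_converter_t;

static demo_converter_t demo_converters[DEMO_MAX_CONVERTERS];

static int demo_error_register(const char *project, int err_base,
                               int err_max, demo_converter_fn_t fn)
{
    int i;

    /* First pass: an identical registration reuses its existing slot. */
    for (i = 0; i < DEMO_MAX_CONVERTERS; ++i) {
        demo_converter_t *c = &demo_converters[i];
        if (c->in_use && c->err_base == err_base && c->err_max == err_max &&
            c->fn == fn && 0 == strcmp(c->project, project)) {
            return DEMO_SUCCESS;
        }
    }

    /* Second pass: otherwise claim the first free slot. */
    for (i = 0; i < DEMO_MAX_CONVERTERS; ++i) {
        demo_converter_t *c = &demo_converters[i];
        if (!c->in_use) {
            c->in_use = 1;
            strncpy(c->project, project, sizeof(c->project) - 1);
            c->project[sizeof(c->project) - 1] = '\0';
            c->err_base = err_base;
            c->err_max = err_max;
            c->fn = fn;
            return DEMO_SUCCESS;
        }
    }

    return DEMO_ERR_OUT_OF_RESOURCE;
}

/* Trivial convertor used only to exercise the table. */
static const char *demo_strerror(int errnum)
{
    (void) errnum;
    return "demo error";
}
```

Repeated registration with the same arguments now consumes one slot, so
only distinct registrations can exhaust the table.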
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit fixes the following bugs:
- opal_output_finalize did not properly set internal state. This
caused problems when calling the sequence opal_output_init (),
opal_output_finalize (), opal_output_init ().
- opal_info support called mca_base_open () but never called the
matching mca_base_close (). mca_base_open () and mca_base_close ()
have been updated to use an open count instead of an open flag to
allow mca_base_open to be called through multiple paths (as may be
the case when MPI_T is in use).
- orte_info support did not register opal variables. This can cause
orte-info to not return opal variables.
- opal_info, orte_info, and ompi_info support have been updated to
use a register count.
- When opening the dl framework, a reference count increment was added to
ensure the framework stuck around. The framework being closed
prematurely was a bug in the MCA base that has since been
corrected. The increment (and associated decrement) have been
removed.
- dl/dlopen did not set the value of
mca_dl_dlopen_component.filename_suffixes_mca_storage on each call
to register. Instead the value was set in the component
structure. This caused the value to be lost when re-loading the
component. Fixed by setting the default value in register.
- Reset shmem framework state on close to avoid returning a stale
component after reloading opal/shmem.
- MCA base parameters were not properly deregistered when the MCA
base was closed.
This commit may fix #374.
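The open count mentioned in the list above can be sketched as follows.
The demo_* names are hypothetical; this only illustrates how a counter,
unlike a flag, lets open and close be called through multiple paths.

```c
#include <assert.h>

static int demo_open_count = 0;      /* number of outstanding opens */
static int demo_resources_live = 0;  /* stands in for the base's state */

static int demo_base_open(void)
{
    if (demo_open_count++ > 0) {
        return 0;                    /* already open via another path */
    }
    demo_resources_live = 1;         /* first opener does the real work */
    return 0;
}

static int demo_base_close(void)
{
    assert(demo_open_count > 0);     /* close must match an earlier open */
    if (--demo_open_count > 0) {
        return 0;                    /* someone else still needs the base */
    }
    demo_resources_live = 0;         /* last closer tears everything down */
    return 0;
}
```

With a plain flag, the first close would tear the state down while a
second user still depended on it.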
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
linux: only use the device-tree on Power machines
It's available on ARM but the assumption that cpus' "reg" start at 0
is invalid.
We could make that work but the device-tree doesn't currently
bring anything better than sysfs on ARM, so don't bother for now.
On 32-bit architectures loads/stores of fast box headers may take
multiple instructions. This can lead to a data race between the
sender and receiver when reading or writing the sequence number, in
which case the receiver could process incomplete data. To fix the
issue, this commit reorders the fast box header to put the sequence
number and the tag in the same 32 bits, ensuring they are always
loaded and stored together.
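A simplified sketch of the idea, assuming a header with a 32-bit size
word and a second 32-bit word holding both tag and sequence number. The
layout and the demo_* names are hypothetical and do not reproduce the
actual vader header. A real implementation would also need memory
barriers between the stores, which are omitted here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical header: tag and seq share one aligned 32-bit word. */
typedef struct {
    uint32_t size;      /* payload size, written first */
    uint32_t tag_seq;   /* tag in the high 16 bits, seq in the low 16 */
} demo_fbox_hdr_t;

static uint32_t demo_pack(uint16_t tag, uint16_t seq)
{
    return ((uint32_t) tag << 16) | seq;
}

static uint16_t demo_tag(uint32_t v) { return (uint16_t) (v >> 16); }
static uint16_t demo_seq(uint32_t v) { return (uint16_t) (v & 0xffffu); }

/* Sender: one 32-bit store publishes tag and seq together. */
static void demo_publish(volatile demo_fbox_hdr_t *hdr, uint32_t size,
                         uint16_t tag, uint16_t seq)
{
    hdr->size = size;
    /* a write barrier would go here in real code */
    hdr->tag_seq = demo_pack(tag, seq);
}

/* Receiver: one 32-bit load, so tag and seq come from the same store. */
static int demo_poll(const volatile demo_fbox_hdr_t *hdr,
                     uint16_t expected_seq, uint16_t *tag)
{
    uint32_t v = hdr->tag_seq;
    if (demo_seq(v) != expected_seq) {
        return 0;       /* nothing new yet */
    }
    *tag = demo_tag(v);
    return 1;
}
```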
Fixes #473
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>