This commit makes it possible to control the output produced during abnormal
oshmem/ompi application termination.
Fixed an issue in the backtrace output: HAVE_BACKTRACE was never set, so the
user had only limited control over this variable.
Two related MCA variables are moved to the opal layer, and corresponding
aliases are added for ompi and oshmem.
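For illustration only, the sketch below registers a variable once at the opal
project level and then exposes it under ompi/oshmem aliases through synonyms.
The variable name, description, flags, and info level are assumptions rather
than the exact values used by this commit; the
mca_base_var_register()/mca_base_var_register_synonym() helpers come from
opal/mca/base.

    #include <stdbool.h>
    #include "opal/constants.h"
    #include "opal/mca/base/mca_base_var.h"

    /* Hypothetical variable: the real name, level, and flags may differ. */
    static bool opal_example_print_stack = false;

    static int register_abort_vars(void)
    {
        int var_id;

        /* Register once, at the opal project level. */
        var_id = mca_base_var_register("opal", NULL, NULL, "example_print_stack",
                                       "Print a backtrace on abnormal termination",
                                       MCA_BASE_VAR_TYPE_BOOL, NULL, 0, 0,
                                       OPAL_INFO_LVL_5, MCA_BASE_VAR_SCOPE_READONLY,
                                       &opal_example_print_stack);
        if (0 > var_id) {
            return var_id;
        }

        /* Synonyms so the variable is also reachable under the ompi and
         * oshmem prefixes that users already know. */
        (void) mca_base_var_register_synonym(var_id, "ompi", NULL, NULL,
                                             "example_print_stack", 0);
        (void) mca_base_var_register_synonym(var_id, "oshmem", NULL, NULL,
                                             "example_print_stack", 0);

        return OPAL_SUCCESS;
    }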
There were two bugs in osc/rdma when using threads:
 - Deadlock in ompi_osc_rdma_start_atomic. This occurs because
   ompi_osc_rdma_frag_alloc is called with the module lock held. To fix
   the issue the module lock is now recursive. In the future I will add a
   new lock to protect just the current rdma fragment (see the sketch
   after this list).
 - Do not drop the lock in ompi_osc_rdma_frag_alloc when calling
   ompi_osc_rdma_frag_complete. Not only is it unnecessary, but dropping
   the lock at this point can allow a competing thread to corrupt the
   fragment state.
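As a minimal, standalone illustration of why a recursive lock removes the
deadlock (plain pthreads here, not the OPAL mutex wrappers or the actual
osc/rdma code): with a recursive mutex, the thread that already holds the
module lock can re-acquire it from the fragment-allocation path instead of
blocking on itself.

    #define _XOPEN_SOURCE 700   /* for PTHREAD_MUTEX_RECURSIVE */
    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t lock;

    /* Stand-in for ompi_osc_rdma_frag_alloc(): takes the module lock. */
    static void frag_alloc(void)
    {
        pthread_mutex_lock(&lock);
        puts("allocated a fragment");
        pthread_mutex_unlock(&lock);
    }

    /* Stand-in for ompi_osc_rdma_start_atomic(): calls frag_alloc()
     * while already holding the lock. */
    static void start_atomic(void)
    {
        pthread_mutex_lock(&lock);
        frag_alloc();   /* would deadlock with a non-recursive mutex */
        pthread_mutex_unlock(&lock);
    }

    int main(void)
    {
        pthread_mutexattr_t attr;

        pthread_mutexattr_init(&attr);
        /* The recursive type is what the fix effectively switches to. */
        pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
        pthread_mutex_init(&lock, &attr);

        start_atomic();

        pthread_mutex_destroy(&lock);
        pthread_mutexattr_destroy(&attr);
        return 0;
    }

Build with "cc -pthread" and it runs to completion; with a default
(non-recursive) mutex the second lock acquisition would hang.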
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
We should invoke OBJ_CONSTRUCT/OBJ_DESTRUCT only on regular requests
(which are embedded inside UCX requests) and on the completed request.
Persistent requests are already constructed/destructed by the free list.
This fixes an assertion failure in ompi_request_destruct.
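A toy sketch of the lifecycle rule, using hypothetical types rather than the
real OMPI/UCX structures: the constructor/destructor pair is run only on the
request embedded in the (UCX-style) request, while an object owned by a free
list is left alone because the list already constructed it.

    #include <assert.h>

    /* Hypothetical request type tracking whether its constructor ran,
     * mimicking what an OBJ_CONSTRUCT/OBJ_DESTRUCT pair maintains. */
    typedef struct toy_request {
        int constructed;
    } toy_request_t;

    static void toy_construct(toy_request_t *req)
    {
        assert(!req->constructed);   /* constructing twice is the bug */
        req->constructed = 1;
    }

    static void toy_destruct(toy_request_t *req)
    {
        assert(req->constructed);    /* the assertion that was firing */
        req->constructed = 0;
    }

    /* Persistent requests come from a free list that already ran the
     * constructor when the element was created. */
    static toy_request_t *toy_free_list_get(void)
    {
        static toy_request_t persistent;
        if (!persistent.constructed) {
            toy_construct(&persistent);   /* done once, by the list */
        }
        return &persistent;
    }

    /* A UCX-style request embeds a regular request; only that embedded
     * request must be constructed/destructed by hand. */
    typedef struct toy_ucx_request {
        toy_request_t embedded;
    } toy_ucx_request_t;

    int main(void)
    {
        toy_ucx_request_t ucx_req = { { 0 } };
        toy_construct(&ucx_req.embedded);    /* correct: embedded request */
        toy_destruct(&ucx_req.embedded);

        toy_request_t *persistent = toy_free_list_get();
        (void) persistent;
        /* Do NOT construct/destruct 'persistent' again here: the free
         * list owns its lifecycle, and doing so trips the asserts. */
        return 0;
    }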
Without this modification, gfortran throws the following error
if these variables are used for `MPI_DIST_GRAPH_CREATE_ADJACENT`:
Error: There is no specific subroutine for the generic
'mpi_dist_graph_create_adjacent' at (1)
`MPI_ARGVS_NULL` should be a two-dimensional array.
Without this modification, gfortran throws the following error
if `MPI_ARGVS_NULL` is used for `MPI_COMM_SPAWN_MULTIPLE`:
Error: There is no specific subroutine for the generic
'mpi_comm_spawn_multiple' at (1)
During component finalize, mtl-portals4 would blindly release
resources without testing whether the handles were valid. That used to
be safe, but resource allocation is now delayed until add_procs(). If
mtl-portals4 is deselected, it is finalized without add_procs() ever
being called. This commit ensures that invalid handles are not
released.
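A hedged sketch of the resulting pattern, not the actual mtl-portals4 code:
handles are initialized to the Portals4 PTL_INVALID_HANDLE sentinel when the
component opens, and finalize only releases a handle that add_procs()
actually set up. The structure and function names below are hypothetical;
PTL_INVALID_HANDLE, PtlHandleIsEqual(), and PtlNIFini() are assumed to come
from the usual portals4.h.

    #include <portals4.h>

    /* Hypothetical component state; the real mtl-portals4 fields differ. */
    struct toy_portals4_module {
        ptl_handle_ni_t ni_h;
    };

    static void toy_component_open(struct toy_portals4_module *m)
    {
        /* Nothing is allocated yet: mark the handle as invalid so a
         * later finalize can tell whether add_procs() ever set it. */
        m->ni_h = PTL_INVALID_HANDLE;
    }

    static void toy_component_finalize(struct toy_portals4_module *m)
    {
        /* Only release resources that were really allocated. */
        if (!PtlHandleIsEqual(m->ni_h, PTL_INVALID_HANDLE)) {
            PtlNIFini(m->ni_h);
            m->ni_h = PTL_INVALID_HANDLE;
        }
    }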
Writing to the pml_monitoring_flush variable will set the filename of
the output file.
Stopping a session for pml_monitoring_flush will force the generation
of the monitoring output file (as long as the filename is not NULL).
To reset the monitoring, one has to bind pml_monitoring_flush to a
session.
The collected message counts and sizes can be read using the performance
variables "pml_monitoring_messages_count" and
"pml_monitoring_messages_size".
Per Brice's suggestion, make all data counts and message lengths
uint64_t.
The collective traffic can optionally be counted as a separate entity. The
need for such a PML comes from the fact that the PMPI interface does not
allow us to identify the traffic generated by collective operations.