There are only five places in the non-daemon code paths where opal_hwloc_topology is currently referenced:
* shared memory BTLs (sm, smcuda). I have added a code path to those components that uses the location string
instead of the topology itself, if available, thus avoiding instantiating the topology (see the sketch below)
* openib BTL. This uses the distance matrix. At present, I haven't developed a method
for replacing that reference. Thus, this component will instantiate the topology
* usnic BTL. Uses the distance matrix.
* treematch TOPO component. Does some complex tree-based algorithm, so it will instantiate
the topology
* ess base functions. If a process is direct launched and not bound at launch, this
code attempts to bind it. Thus, procs in this scenario will instantiate the
topology
Note that instantiating the topology on complex chips such as KNL can consume
megabytes of memory.
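A rough sketch of the idea behind the sm/smcuda change, using a made-up location-string format and comparison helper (the real OPAL locality encoding and helper functions differ): peer locality can be decided by comparing strings, so the hwloc topology never has to be loaded.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical locality strings of the form "node0:socket1:core5".
 * This only illustrates deciding peer locality from strings instead of
 * a topology object; it is not the actual OPAL representation. */
static int same_prefix(const char *a, const char *b, int fields)
{
    /* compare the first 'fields' colon-separated components */
    for (int i = 0; i < fields; i++) {
        size_t la = strcspn(a, ":"), lb = strcspn(b, ":");
        if (la != lb || 0 != strncmp(a, b, la)) return 0;
        a += la; b += lb;
        if (*a == ':') a++;
        if (*b == ':') b++;
    }
    return 1;
}

int main(void)
{
    const char *me   = "node0:socket1:core5";   /* this proc's location string */
    const char *peer = "node0:socket1:core7";   /* a peer's location string    */

    /* Same-node and same-socket checks need only the strings, so the
     * potentially multi-megabyte topology is never instantiated. */
    printf("same node:   %s\n", same_prefix(me, peer, 1) ? "yes" : "no");
    printf("same socket: %s\n", same_prefix(me, peer, 2) ? "yes" : "no");
    return 0;
}
```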
Fix pernode binding policy
Properly handle the unbound case
Correct pointer usage
Do not free static error messages!
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
* When using `MPI_Put` with `MPI_Win_lock_all`, a hang is possible since
the `put` is waiting on `eager_send_active` to become `true` but
that variable might not be reset in the case of `MPI_Win_lock_all`,
depending on other incoming events (e.g., `post` or ACKs of lock
requests).
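A minimal sketch of the kind of program that could hit this hang (not the original reproducer), assuming two or more ranks:

```c
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, nprocs, buf = 0;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    MPI_Win_create(&buf, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    /* Passive-target epoch covering all ranks; per the description above,
     * the put below could block waiting for eager_send_active to become
     * true again. */
    MPI_Win_lock_all(0, win);
    if (0 == rank) {
        int one = 1;
        for (int peer = 1; peer < nprocs; peer++) {
            MPI_Put(&one, 1, MPI_INT, peer, 0, 1, MPI_INT, win);
        }
    }
    MPI_Win_unlock_all(win);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```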
Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
* When using `MPI_Lock`/`MPI_Unlock` with `MPI_Get` and non-contiguous
datatypes, it is possible that the unlock finishes too early, before
the data is actually present in the recv buffer.
* We need to wait for the irecv to complete before unlocking the target.
This commit waits for the outgoing fragment counts to become equal
before unlocking.
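A sketch of the access pattern involved, assuming at least two ranks (not the original reproducer): a strided, non-contiguous vector type is fetched with `MPI_Get`, and the fix ensures the unlock does not return before the data has landed in the result buffer.

```c
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, win_buf[8] = {0}, result[8] = {0};
    MPI_Win win;
    MPI_Datatype strided;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* every other int: a non-contiguous datatype */
    MPI_Type_vector(4, 1, 2, MPI_INT, &strided);
    MPI_Type_commit(&strided);

    if (0 == rank) for (int i = 0; i < 8; i++) win_buf[i] = i;

    MPI_Win_create(win_buf, sizeof(win_buf), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    if (1 == rank) {
        MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win);
        MPI_Get(result, 1, strided, 0, 0, 1, strided, win);
        /* unlock must not return before the strided data is in 'result' */
        MPI_Win_unlock(0, win);
    }

    MPI_Win_free(&win);
    MPI_Type_free(&strided);
    MPI_Finalize();
    return 0;
}
```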
Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
* If the user uses PSCW synchronization after a Fence, then the previous
epoch is not reset, which can cause the PSCW exchange to transfer data
before it is ready, leading to wrong answers.
* This commit resets the `eager_send_active` flag in the start call.
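A sketch of the synchronization sequence in question, assuming exactly two ranks (not the original reproducer): a fence epoch followed by a PSCW epoch, where state left over from the fence could previously let the second put begin before the target was ready.

```c
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, buf = 0, val;
    MPI_Win win;
    MPI_Group world_grp, peer_grp;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_group(MPI_COMM_WORLD, &world_grp);

    MPI_Win_create(&buf, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    /* First epoch: fence synchronization */
    MPI_Win_fence(0, win);
    if (0 == rank) { val = 1; MPI_Put(&val, 1, MPI_INT, 1, 0, 1, MPI_INT, win); }
    MPI_Win_fence(0, win);

    /* Second epoch: PSCW between ranks 0 and 1 */
    int peer = (0 == rank) ? 1 : 0;
    MPI_Group_incl(world_grp, 1, &peer, &peer_grp);
    if (0 == rank) {
        MPI_Win_start(peer_grp, 0, win);
        val = 2;
        MPI_Put(&val, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
        MPI_Win_complete(win);
    } else {
        MPI_Win_post(peer_grp, 0, win);
        MPI_Win_wait(win);
    }

    MPI_Group_free(&peer_grp);
    MPI_Group_free(&world_grp);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```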
Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
According to the MPI-3.1 p.52 and p.53 (cited below), a request
created by `MPI_*_INIT` but not yet started by `MPI_START` or
`MPI_STARTALL` is inactive therefore `MPI_WAIT` or its friends
must return immediately if such a request is passed.
The current implementation hangs in `MPI_WAIT` and its friends
in such a case because a persistent request is initialized with
`req_complete = REQUEST_PENDING`. This commit fixes the
initialization.
Also, this commit fixes internal requests used in `MPI_PROBE`
and `MPI_IPROBE`, which were wrongly marked as persistent.
MPI-3.1 p.52:
We shall use the following terminology: A null handle is a handle
with value MPI_REQUEST_NULL. A persistent request and the handle
to it are inactive if the request is not associated with any ongoing
communication (see Section 3.9). A handle is active if it is neither
null nor inactive. An empty status is a status which is set to return
tag = MPI_ANY_TAG, source = MPI_ANY_SOURCE, error = MPI_SUCCESS, and
is also internally configured so that calls to MPI_GET_COUNT,
MPI_GET_ELEMENTS, and MPI_GET_ELEMENTS_X return count = 0 and
MPI_TEST_CANCELLED returns false. We set a status variable to empty
when the value returned by it is not significant. Status is set in
this way so as to prevent errors due to accesses of stale information.
MPI-3.1 p.53:
One is allowed to call MPI_WAIT with a null or inactive request
argument. In this case the operation returns immediately with empty
status.
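For illustration, a minimal sketch (not from the original report) of the case this commit fixes: the first `MPI_Wait` below is applied to an initialized but not-yet-started persistent request and must return immediately with an empty status. Runs on a single rank.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int buf = 42, out;
    MPI_Request req;
    MPI_Status status;

    MPI_Init(&argc, &argv);

    /* Create a persistent send request but do not start it yet. */
    MPI_Send_init(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &req);

    /* The request is inactive: MPI_Wait must return immediately
     * with an empty status (previously this could hang). */
    MPI_Wait(&req, &status);

    /* Now actually use it: start, match it with a receive, and wait. */
    MPI_Start(&req);
    MPI_Recv(&out, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Wait(&req, MPI_STATUS_IGNORE);
    printf("received %d\n", out);

    MPI_Request_free(&req);
    MPI_Finalize();
    return 0;
}
```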
Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
Adds the new API hcoll_context_free, which resolves the issues
observed with the ctx cache and group_destroy_notify.
Signed-off-by: Valentin Petrov <valentinp@mellanox.com>
`struct mca_pml_ob1_comm_proc_t`, which is allocated per
connected rank in a communicator, had two padding gaps after
`expected_sequence` and `send_sequence` caused by alignment.
By changing the order of the members, the size of
`mca_pml_ob1_comm_proc_t` is reduced by 8 bytes on 64-bit
architectures.
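A simplified, hypothetical illustration of this kind of reordering (not the actual `mca_pml_ob1_comm_proc_t` layout): grouping the two small counters together removes the alignment holes that follow them on a 64-bit ABI.

```c
#include <stdint.h>
#include <stdio.h>

/* Before: each 16-bit counter sits before an 8-byte-aligned member,
 * so the compiler inserts padding after each of them. */
struct proc_before {
    void    *ompi_proc;          /* 8 bytes */
    uint16_t expected_sequence;  /* 2 bytes + 6 bytes padding */
    void    *frags_cant_match;   /* 8 bytes */
    uint16_t send_sequence;      /* 2 bytes + 6 bytes padding */
    void    *endpoint;           /* 8 bytes */
};                               /* typically 40 bytes on 64-bit */

/* After: the two counters are adjacent, so only one smaller gap remains. */
struct proc_after {
    void    *ompi_proc;
    void    *frags_cant_match;
    void    *endpoint;
    uint16_t expected_sequence;
    uint16_t send_sequence;      /* 4 bytes trailing padding */
};                               /* typically 32 bytes on 64-bit */

int main(void)
{
    printf("before: %zu bytes, after: %zu bytes\n",
           sizeof(struct proc_before), sizeof(struct proc_after));
    return 0;
}
```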
Signed-off-by: KAWASHIMA Takahiro <t-kawashima@jp.fujitsu.com>
This fixes a bug reported in-house that occurs in this component. It is triggered when the amounts of data assigned to different aggregators differ significantly, leading to different numbers of internal iterations being required to handle them.
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
protect the mca_coll_libnbc_component.active_requests list with
the new mca_coll_libnbc_component.lock mutex.
Thanks Jie Hu for the report
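The general pattern, shown here as a minimal POSIX-threads sketch rather than with the OPAL list and mutex types the component actually uses: every modification of the shared active-requests list happens while holding the component-level lock.

```c
#include <pthread.h>
#include <stdlib.h>

/* Minimal stand-ins for the component state: a singly linked list of
 * outstanding requests plus the mutex that now guards it. */
struct request_item {
    struct request_item *next;
    void *request;
};

struct component_state {
    struct request_item *active_requests;
    pthread_mutex_t lock;
};

static struct component_state comp = {
    .active_requests = NULL,
    .lock = PTHREAD_MUTEX_INITIALIZER,
};

/* Append under the lock; a progress thread and the application thread
 * could otherwise race on the list head. */
static void add_active_request(void *request)
{
    struct request_item *item = malloc(sizeof(*item));
    item->request = request;
    pthread_mutex_lock(&comp.lock);
    item->next = comp.active_requests;
    comp.active_requests = item;
    pthread_mutex_unlock(&comp.lock);
}

int main(void)
{
    int a, b;
    add_active_request(&a);
    add_active_request(&b);
    return 0;
}
```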
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
change the default value of the mca_io_ompio_cycle_buffer_size parameter in order to avoid accidental truncation of a file for very large individual operations.
Thanks to @cniethammer for reporting it.
Signed-off-by: Edgar Gabriel <egabriel@central.uh.edu>
- instead of coll_base_comm_get_reqs(2) for irecv/isend, use only
one request allocated on the stack and do an irecv/send
- instead of ompi_request_wait_all(2), simply use ompi_request_wait
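The same idea expressed with the public MPI API rather than the internal coll/PML calls (names and context here are illustrative only): a pairwise exchange needs just one stack-allocated request and a single wait when the send is blocking.

```c
#include <mpi.h>
#include <stdio.h>

/* Pairwise exchange with 'peer' using one request instead of two.
 * The irecv is posted first to avoid deadlock, then the blocking send
 * runs, and finally the single receive request is waited on. */
static void exchange(int peer, int send_val, int *recv_val, MPI_Comm comm)
{
    MPI_Request req;   /* single request on the stack */

    MPI_Irecv(recv_val, 1, MPI_INT, peer, 0, comm, &req);
    MPI_Send(&send_val, 1, MPI_INT, peer, 0, comm);
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}

int main(int argc, char **argv)
{
    int rank, size, got = -1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size >= 2 && rank < 2) {
        exchange(1 - rank, rank, &got, MPI_COMM_WORLD);
        printf("rank %d received %d\n", rank, got);
    }

    MPI_Finalize();
    return 0;
}
```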
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
this is generally done in mca_pml_ob1_recv_request_free(), but that is not invoked
via mca_pml_ob1_recv(), so do it manually
Thanks to Yvan Fournier for the report
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
* If (legal) non-uniform data type signatures are used in ibcast,
then the chosen algorithm may fail on the request, and in the worst
case it could produce wrong answers (see the sketch below).
* Add an MCA parameter that, by default, protects the user from this
scenario. If the user really wants to use it then they have to
'opt-in' by setting the following parameter to false:
- `-mca coll_libnbc_ibcast_skip_dt_decision f`
* Once the following issues are resolved, this parameter can
be removed.
- https://github.com/open-mpi/ompi/issues/2256
- https://github.com/open-mpi/ompi/issues/1763
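A sketch of the legal but non-uniform signature case in question (illustrative only): every rank passes the same type signature, four ints, to `MPI_Ibcast`, but through different count/datatype combinations, which is what the algorithm-selection logic could mishandle.

```c
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, buf[4] = {0};
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (0 == rank) {
        /* Root describes the buffer as one contiguous type of 4 ints... */
        MPI_Datatype four_ints;
        MPI_Type_contiguous(4, MPI_INT, &four_ints);
        MPI_Type_commit(&four_ints);
        for (int i = 0; i < 4; i++) buf[i] = i;
        MPI_Ibcast(buf, 1, four_ints, 0, MPI_COMM_WORLD, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        MPI_Type_free(&four_ints);
    } else {
        /* ...while the others use count=4 of MPI_INT: the same signature
         * via different datatype arguments. This is legal MPI, but it is
         * the case the new MCA parameter guards against. */
        MPI_Ibcast(buf, 4, MPI_INT, 0, MPI_COMM_WORLD, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}
```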
Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>