Do not allow a combination of transports that is not compliant with the
SHMEM spec. In particular, do not allow mixing hardware and software
atomic ops.
Issue: 4721
Change-Id: Ide382f7510495df3d385f2a5ae5f9def6ef5332c
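For illustration only, a minimal sketch (hypothetical names, not the actual oshmem selection code) of the kind of guard described above: the selected transports must all implement atomic ops the same way, otherwise the combination is rejected as non-compliant.

```c
/* Sketch only: refuse to select a set of transports in which some provide
 * hardware atomics and others fall back to software emulation, since mixing
 * the two is not coherent under the SHMEM spec.  Names are hypothetical. */
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    const char *name;
    bool        hw_atomics;   /* true if atomics run in hardware (NIC/HCA) */
} transport_t;

/* Returns true only if every selected transport agrees on how atomics
 * are implemented. */
static bool transports_atomics_compatible(const transport_t *tr, size_t n)
{
    for (size_t i = 1; i < n; ++i) {
        if (tr[i].hw_atomics != tr[0].hw_atomics) {
            return false;   /* hw and sw atomics mixed: reject the combination */
        }
    }
    return true;
}
```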
Revert "OPAL: drop dead with core on bad flow. rarely happens with helloworld on large scale."
This reverts commit 86f1d5af3ee484f34092ad3f7a645d9a5ccbcb6c.
This will be reconsidered via an RFC, as it represents a significant change in behavior.
Use experimental verbs to allocate memory at a fixed base
virtual address.
The verbs component will disqualify itself if shared_mr is disabled
or not supported and it is impossible to allocate memory
starting at the fixed base virtual address.
The verbs contig-pages allocator did not guarantee a fixed VA; now it does.
(cherry picked from commit fd77ebd4525e9e0c1a3ab1c4966bf31aa45251b4)
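As a rough illustration of what allocating at a fixed base virtual address means, here is a sketch using plain mmap plus the standard ibv_reg_mr; the commit itself relies on the experimental verbs allocator, so treat this only as a concept demo, not the actual code path.

```c
/* Illustration only: reserve memory at a caller-chosen base virtual address
 * and register it with standard verbs.  This is NOT the experimental verbs
 * allocator used by the commit; it only shows what "allocate at a fixed base
 * VA" means. */
#define _GNU_SOURCE              /* MAP_ANONYMOUS / MAP_FIXED_NOREPLACE */
#include <infiniband/verbs.h>
#include <sys/mman.h>
#include <stdio.h>

static struct ibv_mr *reg_at_fixed_va(struct ibv_pd *pd, void *base, size_t len)
{
    /* MAP_FIXED_NOREPLACE (Linux >= 4.17) fails instead of silently
     * clobbering an existing mapping at the requested address. */
    void *addr = mmap(base, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED_NOREPLACE, -1, 0);
    if (MAP_FAILED == addr || addr != base) {
        perror("could not map memory at the fixed base VA");
        return NULL;
    }
    return ibv_reg_mr(pd, addr, len,
                      IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE |
                      IBV_ACCESS_REMOTE_READ);
}
```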
Apply Jeff's comments
Update with Jeff's commits
(cherry picked from commit open-mpi/ompi-release@4dc487fc3d)
Handle the case where the degree of the topology is higher than the communicator size
It is possible to have a topology degree higher than the size of the communicator,
for example with a periodic Cartesian communicator on MPI_COMM_SELF. This would leave
the neighborhood collectives with a request buffer that is too small.
This commit introduces a semantic change:
from now on, c_topo must be set before invoking coll_select
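The corner case is easy to reproduce; the sketch below (a plain MPI program, nothing Open MPI specific) builds a periodic 1-D Cartesian topology on MPI_COMM_SELF, which has a degree of 2 even though the communicator holds a single rank, so neighborhood collectives must size their buffers by the topology degree rather than by the communicator size.

```c
/* Reproduces the corner case: a periodic 1-D Cartesian topology on
 * MPI_COMM_SELF has two neighbors (both are rank 0), i.e. a degree larger
 * than the communicator size of 1. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int dims[1]    = {1};
    int periods[1] = {1};          /* periodic: left/right neighbors exist */
    MPI_Comm cart;
    MPI_Cart_create(MPI_COMM_SELF, 1, dims, periods, 0, &cart);

    int sendval = 42;
    int recvbuf[2];                /* 2 * ndims neighbors, not comm size (1) */
    MPI_Neighbor_allgather(&sendval, 1, MPI_INT, recvbuf, 1, MPI_INT, cart);

    printf("neighbor values: %d %d\n", recvbuf[0], recvbuf[1]);

    MPI_Comm_free(&cart);
    MPI_Finalize();
    return 0;
}
```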
of preregistered buffers
Before this change, eager gets were retried on each pass through the progress
loop. This commit modifies the protocol so that eager gets are only retried
when another eager get has completed. This commit also cleans up some callback
code that is no longer needed.
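A sketch of the new retry policy (hypothetical names, not the actual gni BTL code): a stalled eager get is queued, and the queue is drained from the completion callback of another eager get instead of on every pass through the progress loop.

```c
/* Sketch only: an eager get that cannot be started (e.g. no free
 * pre-registered buffer) is queued, and the queue is drained from the
 * completion callback of another eager get. */
#include <stdbool.h>
#include <stddef.h>

typedef struct eager_get_req eager_get_req_t;
struct eager_get_req {
    eager_get_req_t *next;
    /* descriptor, remote address, local buffer, ... */
};

static eager_get_req_t *pending_gets = NULL;    /* stalled eager gets */

/* Stand-in for posting the RDMA get; returns false when resources
 * (e.g. pre-registered buffers) are exhausted. */
static bool start_eager_get(eager_get_req_t *req)
{
    (void) req;
    return true;
}

static void queue_eager_get(eager_get_req_t *req)
{
    req->next = pending_gets;
    pending_gets = req;
}

/* Completion callback: one eager get finished, so its resources are free
 * again and it is worth retrying exactly one queued get now.  The progress
 * loop no longer walks the pending list at all. */
static void eager_get_complete_cb(void)
{
    if (NULL != pending_gets) {
        eager_get_req_t *req = pending_gets;
        pending_gets = req->next;
        if (!start_eager_get(req)) {
            queue_eager_get(req);   /* still out of resources; wait for the next completion */
        }
    }
}
```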
The GNI_RDMAMODE_FENCE bit was a leftover from
async-progress work and is not needed at this point
in the gni BTL. Removing the bit also allows
the GNI_CDM_MODE_BTE_SINGLE_CHANNEL bit to be dropped
from the GNI_CdmCreate call.
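For context, the mode bits in question are the ones passed to GNI_CdmCreate. The fragment below is only a sketch; the GNI_CdmCreate signature is quoted from memory of the public uGNI header (gni_pub.h) and should be treated as an assumption, not as the BTL code.

```c
/* Context sketch only: the communication-domain mode bits are what gets
 * passed to GNI_CdmCreate.  Signature recalled from the public uGNI header;
 * treat it as an assumption. */
#include "gni_pub.h"

static gni_return_t create_cdm(uint32_t inst_id, uint8_t ptag, uint32_t cookie,
                               uint32_t modes, gni_cdm_handle_t *cdm_handle)
{
    /* GNI_CDM_MODE_BTE_SINGLE_CHANNEL is no longer OR'd into modes: with the
     * GNI_RDMAMODE_FENCE bit gone from the post descriptors, single-channel
     * BTE operation is not required. */
    return GNI_CdmCreate(inst_id, ptag, cookie, modes, cdm_handle);
}
```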
1. It's actually hashing now, whereas the old OPAL hash table was not. Thus, it is a bug fix and, as such, should be included in the 1.8 series.
2. It is dynamic and can grow and shrink the number of buckets in accordance with job size, whereas the old OPAL hash table had a fixed number of buckets, which resulted in poor retrieval performance at large scale.
This scheme has been deployed in the field on very large HP/Mellanox systems and has been demonstrated to significantly decrease job start-up time (~20% improvement) when launching applications directly with srun in SLURM environments. However, neither SLURM nor direct launch is a prerequisite for taking advantage of this change, as any entity that uses OPAL hash table objects can benefit (at least partially) from this contribution.
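A sketch of the sizing idea only (hypothetical helpers, not the OPAL implementation): choose the initial bucket count from the expected job size and grow when the load factor gets too high, so retrieval stays fast at scale.

```c
/* Sketch only: bucket count derived from the expected job size instead of a
 * fixed table, plus a grow trigger so lookups stay O(1) at large scale. */
#include <stddef.h>

static size_t next_pow2(size_t n)
{
    size_t p = 1;
    while (p < n) {
        p <<= 1;
    }
    return p;
}

/* Initial bucket count: roughly one bucket per expected process, rounded to a
 * power of two so the hash can be masked instead of taking a modulo. */
static size_t initial_buckets(size_t expected_num_procs)
{
    return next_pow2(expected_num_procs ? expected_num_procs : 16);
}

/* Grow check, run after each insertion: resize once the load factor
 * exceeds 0.75. */
static int should_grow(size_t num_items, size_t num_buckets)
{
    return num_items * 4 > num_buckets * 3;
}
```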