This commit fixes multiple issues in the bruck's and recursive
doubling grpcomm algorithms. The following changes are included:
- Use the existing bitmap implementation instead of implementing a
new one. There were bugs in the implementation that caused an
overrun of the bitmap array.
- Clean up the algorithms to eliminate errors.
- Send as little extra data as possible in the bruck's
algorithm.
The changes were testest with various numbers of ortes varying from 1
to 4096.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Update the configure logic for the new pmix120 component
ckpt
Get the pmix120 component to work - still not really registering or handling notifications, but infrastructure now operates
Cleanup some of the symbol scopes, and provide a more comprehensive rename.h file. Will pretty it up later - let's see how this works
Cleanup the rename files to use the pretty macros
Turns out that the way the SLURM plm works
is not compatible with the way MPI processes
on Cray XC obtain RDMA credentials to use
the high speed network. Unlike with ALPS,
the mpirun process is on the first compute
node in the job. With the current PLM launch
system, mpirun (HNP daemon) launches the MPI
ranks on that node rather than relying on
srun.
This will probably require a significant amount
of effort to rework to support Native SLURM
on Cray XC's. As a short term alternative,
have the alps plm (which gets selected by default
again on Cray systems regardless of the launch system)
check whether or not srun or alps is being used on the
system. If alps is not being used, print a helpful
message for the user and abort the job launch.
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
Move .so version numbers to their appropriate project in the top-level
VERSION file. Also add the project name to all .so version number
names. Remove no-longer-used .so names.
NOTE: Building with external pmix *requires* that you also build with external libevent and hwloc libraries. Detect this at configure and error out with large message if this requirement is violated.
Closes#1204 (replaces it)
Fixes#1064
Due to user confusion, update the show-help messages displayed when
processor and/or memory binding fails. Thanks to Dave Love
(@loveshack) for the initial suggestion.
Fixesopen-mpi/ompi#1087
Rename the pmix1xx component to pmix111 so it reflects the actual release it includes
Resolve the problem of PMIx being passed a bogus --with-platform argument when configuring the PMIx tarball code. There is no reason we should be passing --with-platform arguments to any internal subdirectory, so just leave that out when constructing the opal_subdir_args variable.
Update the PMIx code and continue attempting to debug direct modex
Fix a problem in the ORTE PMIx server - there was an early intent to optimize the direct modex by fetching data for all procs from the target job on the remote node, instead of fetching the data one proc at a time. However, this was never completely implemented, and so we would hang if we had multiple overlapping requests for data from more than one proc on the node.
Update PMIx to v1.1.2
Cray has added plugins to slurm to support
the Cray programming env (alpslli, cray pmi, etc).
Some of the workarounds needed with plm/alps
to avoid issues with Cray PMI getting mixed up
with orte launch system are also required in
a cray native slurm environment.
Signed-off-by: Howard Pritchard <howardp@lanl.gov>