Thanks to Dave Love and Ashley Pittman for pointing out the problem.
cmr=v1.7.4:reviewer=jsquyres:subject=Fix tool communications with mpirun
This commit was SVN r29959.
The following Trac tickets were found above:
Ticket 3963 --> https://svn.open-mpi.org/trac/ompi/ticket/3963
includes various fixes all over the C/R code which are
hard to group like the other patches.
Changes from V1:
* explain why mca_base_component_distill_checkpoint_ready no longer works
* compare return result of opal functions with OPAL_* values
Changes from V2:
* use orte_rml_oob_ft_event() instead of referencing through the modules
* properly protect variable (thanks to --enable-picky)
This commit was SVN r29922.
* default to bind-to core
* map-by slot if np=2
* map-by socket (balance across sockets on each node) if np > 2
* map-by <obj> will imply rank-by <obj> by default (leave default binding as above)
Fix a bug in the map-by <obj> mapper where we incorrectly compute the #procs to assign if the #slots > #procs
cmr=v1.7.4:reviewer=jsquyres:subject=Update default binding and mapping values
This commit was SVN r29919.
Noticed these as part of #3694: external libevent's don't cause argv.h
to automatically get included.
Refs trac:3694
This commit was SVN r29897.
The following Trac tickets were found above:
Ticket 3694 --> https://svn.open-mpi.org/trac/ompi/ticket/3694
error: void value not ignored as it ought to be
in the C/R code by ignoring the return value of functions which
no longer return a value (only void).
Signed-off-by: Adrian Reber <adrian.reber@hs-esslingen.de>
This commit was SVN r29816.
This isn't being used yet - just enabling Nathan to do what he needs.
***** NOTE: any use of the OMPI_DB_GLOBAL_RANK database key must be protected by #ifdef OMPI_DB_GLOBAL_RANK as not all RTE's will define this key. *****
This commit was SVN r29708.
Resolves problems in loop_spawn where the timer was incorrectly firing and killing the overall job.
cmr=v1.7.4:reviewer=hjelmn
This commit was SVN r29661.
should have been all along and fix one place that uses the file
Update opal_portable_platform.h with changes to mpi_portable_platform.h made
in r29608.
Make mpi_portable_platform.h a symlink to opal_portable_platform.h, so that
they won't get out of sync. I'd like to remove mpi_portable_platform.h, but
we don't automatically add -I${includedir}/openmpi/ to make that sane from
a header include point of view, so that's future work.
This commit was SVN r29618.
The following SVN revision numbers were found above:
r29608 --> open-mpi/ompi@b71bd51cdd
The data for each remote daemon is added later in the daemon callback function. Only the HNP retains info in the hash table.
If it is desirable to have each daemon retain its own coprocessor info, then this must be done in orte/mca/ess/base/ess_base_std_orted.c.
This commit was SVN r29497.
The following SVN revision numbers were found above:
r29489 --> open-mpi/ompi@2e2794fa15
to the hash table.
Tested and working on a system with 2 Xeon Phi co-processors.
cmr=v1.7.4:ticket=3847:reviewer=ompi-rm1.7
This commit was SVN r29489.
The following Trac tickets were found above:
Ticket 3847 --> https://svn.open-mpi.org/trac/ompi/ticket/3847
This change contains a non-mandatory modification
of the MPI-RTE interface. Anyone wishing to support
coprocessors such as the Xeon Phi may wish to add
the required definition and underlying support
****************************************************************
Add locality support for coprocessors such as the Intel Xeon Phi.
Detecting that we are on a coprocessor inside of a host node isn't straightforward. There are no good "hooks" provided for programmatically detecting that "we are on a coprocessor running its own OS", and the ORTE daemon just thinks it is on another node. However, in order to properly use the Phi's public interface for MPI transport, it is necessary that the daemon detect that it is colocated with procs on the host.
So we have to split the locality to separately record "on the same host" vs "on the same board". We already have the board-level locality flag, but not quite enough flexibility to handle this use-case. Thus, do the following:
1. add OPAL_PROC_ON_HOST flag to indicate we share a host, but not necessarily the same board
2. modify OPAL_PROC_ON_NODE to indicate we share both a host AND the same board. Note that we have to modify the OPAL_PROC_ON_LOCAL_NODE macro to explicitly check both conditions
3. add support in opal/mca/hwloc/base/hwloc_base_util.c for the host to check for coprocessors, and for daemons to check to see if they are on a coprocessor. The former is done via hwloc, but support for the latter is not yet provided by hwloc. So the code for detecting we are on a coprocessor currently is Xeon Phi specific - hopefully, we will find more generic methods in the future.
4. modify the orted and the hnp startup so they check for coprocessors and to see if they are on a coprocessor, and have the orteds pass that info back in their callback message. Automatically detect that coprocessors have been found and identify which coprocessors are on which hosts. Note that this algo isn't scalable at the moment - this will hopefully be improved over time.
5. modify the ompi proc locality detection function to look for coprocessor host info IF the OMPI_RTE_HOST_ID database key has been defined. RTE's that choose not to provide this support do not have to do anything - the associated code will simply be ignored.
6. include some cleanup of the hwloc open/close code so it conforms to how we did things in other frameworks (e.g., having a single "frame" file instead of open/close). Also, fix the locality flags - e.g., being on the same node means you must also be on the same cluster/cu, so ensure those flags are also set.
cmr:v1.7.4:reviewer=hjelmn
This commit was SVN r29435.
Fix two problems that surfaced when using direct launch under SLURM:
1. locally store our own data because some BTLs want to retrieve
it during add_procs rather than use what they have internally
2. cleanup MPI_Abort so it correctly passes the error status all
the way down to the actual exit. When someone implemented the
"abort_peers" API, they left out the error status. So we lost
it at that point and *always* exited with a status of 1. This
forces a change to the API to include the status.
cmr:v1.7.3:reviewer=jsquyres:subject=Fix MPI_Abort and modex_recv for direct launch
This commit was SVN r29405.