Refs trac:4117. Please use this commit rather than the patch attached to
the ticket; the patch had a few mistakes in the tweaked wording.
This commit was SVN r30362.
The following SVN revision numbers were found above:
r30298 --> open-mpi/ompi@58479399c3
The following Trac tickets were found above:
Ticket 4117 --> https://svn.open-mpi.org/trac/ompi/ticket/4117
work around buggy NUMA node cpusets (i.e., buggy BIOSs).
Thanks to Jeff Becker for reporting the issue.
Submitted by Brice Goglin, reviewed by Jeff Squyres.
cmr=v1.7.4:reviewer=ompi-rm1.7
This commit was SVN r30306.
Change the logic in bind.c to only include <malloc.h> if we don't have posix_memalign.
In http://www.open-mpi.org/community/lists/devel/2014/01/13619.php,
Paul Hargrove found a compiler warning on OpenBSD where <malloc.h>
exists, but is not intended to be used (and doesn't error out, so
AC_CHECK_HEADERS says its ok).
Reviewed by Brice Goglin.
cmr=v1.7.4:reviewer=ompi-rm1.7
This commit was SVN r30234.
pkg{data,lib,includedir}, use our own ompi{data,lib,includedir}, which is
always set to {datadir,libdir,includedir}/openmpi. This will keep us from
having help files in prefix/share/open-rte when building without Open MPI,
but in prefix/share/openmpi when building with Open MPI.
This commit was SVN r30140.
Fix comm_spawn on a single host - with the new default mapping scheme, we were incorrectly computing the number of procs to put on the node.
Refs trac:4003
This commit was SVN r30033.
The following Trac tickets were found above:
Ticket 4003 --> https://svn.open-mpi.org/trac/ompi/ticket/4003
Reset topology usage for each node as we bind as multiple nodes may be linked to the same topology object. This will need to be revisited for scale as it does take some non-zero time to reset the usage each iteration. However, storing individual topology objects for every node consumes memory, so it's a tradeoff.
cmr=v1.7.4:reviewer=jsquyres:subject=Eliminate excessive binding/memory warnings
This commit was SVN r29978.
* default to bind-to core
* map-by slot if np=2
* map-by socket (balance across sockets on each node) if np > 2
* map-by <obj> will imply rank-by <obj> by default (leave default binding as above)
Fix a bug in the map-by <obj> mapper where we incorrectly compute the #procs to assign if the #slots > #procs
cmr=v1.7.4:reviewer=jsquyres:subject=Update default binding and mapping values
This commit was SVN r29919.
This is helpful in the work for #3694: ensure that many places that
eventually end up in configure don't overly-pollute the global shell
variable space (because debugging accidental shell variable pollution
can be a real pain).
Refs trac:3694
This commit was SVN r29830.
The following Trac tickets were found above:
Ticket 3694 --> https://svn.open-mpi.org/trac/ompi/ticket/3694
* Ensure "cnt" is always initialized
* Ensure we dont' buffer overflow on strncat() -- need to ensure we
account for the terminating \0 character
* hwloc_get_type_depth() returns an int (not unsigned), and
HWLOC_TYPE_DEPTH_UNKNOWN if it's unknown (which is probably <0, but
still, might as well check what the official hwloc docs say to
check for)
cmr=v1.7.4:reviewer=rhc:subject=fix hwloc base compiler warnings
This commit was SVN r29686.
This change contains a non-mandatory modification
of the MPI-RTE interface. Anyone wishing to support
coprocessors such as the Xeon Phi may wish to add
the required definition and underlying support
****************************************************************
Add locality support for coprocessors such as the Intel Xeon Phi.
Detecting that we are on a coprocessor inside of a host node isn't straightforward. There are no good "hooks" provided for programmatically detecting that "we are on a coprocessor running its own OS", and the ORTE daemon just thinks it is on another node. However, in order to properly use the Phi's public interface for MPI transport, it is necessary that the daemon detect that it is colocated with procs on the host.
So we have to split the locality to separately record "on the same host" vs "on the same board". We already have the board-level locality flag, but not quite enough flexibility to handle this use-case. Thus, do the following:
1. add OPAL_PROC_ON_HOST flag to indicate we share a host, but not necessarily the same board
2. modify OPAL_PROC_ON_NODE to indicate we share both a host AND the same board. Note that we have to modify the OPAL_PROC_ON_LOCAL_NODE macro to explicitly check both conditions
3. add support in opal/mca/hwloc/base/hwloc_base_util.c for the host to check for coprocessors, and for daemons to check to see if they are on a coprocessor. The former is done via hwloc, but support for the latter is not yet provided by hwloc. So the code for detecting we are on a coprocessor currently is Xeon Phi specific - hopefully, we will find more generic methods in the future.
4. modify the orted and the hnp startup so they check for coprocessors and to see if they are on a coprocessor, and have the orteds pass that info back in their callback message. Automatically detect that coprocessors have been found and identify which coprocessors are on which hosts. Note that this algo isn't scalable at the moment - this will hopefully be improved over time.
5. modify the ompi proc locality detection function to look for coprocessor host info IF the OMPI_RTE_HOST_ID database key has been defined. RTE's that choose not to provide this support do not have to do anything - the associated code will simply be ignored.
6. include some cleanup of the hwloc open/close code so it conforms to how we did things in other frameworks (e.g., having a single "frame" file instead of open/close). Also, fix the locality flags - e.g., being on the same node means you must also be on the same cluster/cu, so ensure those flags are also set.
cmr:v1.7.4:reviewer=hjelmn
This commit was SVN r29435.
To resolve this situation, add the ability to specify a backend topology file that mpirun shall use for its mapping operations. Create a new "set_topology" function in opal hwloc to support it.
This commit was SVN r28682.
some relevant updates/new functionality in the opal/mca/hwloc and
orte/mca/rmaps bases. This work was mainly developed by Mellanox,
with a bunch of advice from Ralph Castain, and some minor advice from
Brice Goglin and Jeff Squyres.
Even though this is mainly Mellanox's work, Jeff is committing only
for logistical reasons (he holds the hg+svn combo tree, and can
therefore commit it directly back to SVN).
-----
Implemented distance-based mapping algorithm as a new "mindist"
component in the rmaps framework. It allows mapping processes by NUMA
due to PCI locality information as reported by the BIOS - from the
closest to device to furthest.
To use this algorithm, specify:
{{{mpirun --map-by dist:<device_name>}}}
where <device_name> can be mlx5_0, ib0, etc.
There are two modes provided:
1. bynode: load-balancing across nodes
1. byslot: go through slots sequentially (i.e., the first nodes are
more loaded)
These options are regulated by the optional ''span'' modifier; the
command line parameter looks like:
{{{mpirun --map-by dist:<device_name>,span}}}
So, for example, if there are 2 nodes, each with 8 cores, and we'd
like to run 10 processes, the mindist algorithm will place 8 processes
to the first node and 2 to the second by default. But if you want to
place 5 processes to each node, you can add a span modifier in your
command line to do that.
If there are two NUMA nodes on the node, each with 4 cores, and we run
6 processes, the mindist algorithm will try to find the NUMA closest
to the specified device, and if successful, it will place 4 processes
on that NUMA but leaving the remaining two to the next NUMA node.
You can also specify the number of cpus per MPI process. This option
is handled so that we map as many processes to the closest NUMA as we
can (number of available processors at the NUMA divided by number of
cpus per rank) and then go on with the next closest NUMA.
The default binding option for this mapping is bind-to-numa. It works
if you don't specify any binding policy. But if you specified binding
level that was "lower" than NUMA (i.e hwthread, core, socket) it would
bind to whatever level you specify.
This commit was SVN r28552.
added by hwloc's embedding so that it doesn't appear in
libhwloc_embedded.la (and therefore propogate all the way up to
libmpi.la).
Committed upstream in hwloc SVN r5588.
This commit was SVN r28457.
The following SVN revisions from the original message are invalid or
inconsistent and therefore were not cross-referenced:
r5588
Features:
- Support for an override parameter file (openmpi-mca-param-override.conf).
Variable values in this file can not be overridden by any file or environment
value.
- Support for boolean, unsigned, and unsigned long long variables.
- Support for true/false values.
- Support for enumerations on integer variables.
- Support for MPIT scope, verbosity, and binding.
- Support for command line source.
- Support for setting variable source via the environment using
OMPI_MCA_SOURCE_<var name>=source (either command or file:filename)
- Cleaner API.
- Support for variable groups (equivalent to MPIT categories).
Notes:
- Variables must be created with a backing store (char **, int *, or bool *)
that must live at least as long as the variable.
- Creating a variable with the MCA_BASE_VAR_FLAG_SETTABLE enables the use of
mca_base_var_set_value() to change the value.
- String values are duplicated when the variable is registered. It is up to
the caller to free the original value if necessary. The new value will be
freed by the mca_base_var system and must not be freed by the user.
- Variables with constant scope may not be settable.
- Variable groups (and all associated variables) are deregistered when the
component is closed or the component repository item is freed. This
prevents a segmentation fault from accessing a variable after its component
is unloaded.
- After some discussion we decided we should remove the automatic registration
of component priority variables. Few component actually made use of this
feature.
- The enumerator interface was updated to be general enough to handle
future uses of the interface.
- The code to generate ompi_info output has been moved into the MCA variable
system. See mca_base_var_dump().
opal: update core and components to mca_base_var system
orte: update core and components to mca_base_var system
ompi: update core and components to mca_base_var system
This commit also modifies the rmaps framework. The following variables were
moved from ppr and lama: rmaps_base_pernode, rmaps_base_n_pernode,
rmaps_base_n_persocket. Both lama and ppr create synonyms for these variables.
This commit was SVN r28236.
binding. This fix was included in the upstream 1.6 series, but not
the upstream 1.5 series, and was therefore missed when we brought
1.5.2 to OMPI.
This commit was SVN r28212.
The following SVN revision numbers were found above:
r28040 --> open-mpi/ompi@3d44f97572