
Convert all README files to Markdown

A mindless task for a lazy weekend: convert all the README and
README.txt files to Markdown.  Paired with the slow conversion of all
of our man pages to Markdown, this gives a uniform language to the
Open MPI docs.

This commit moved a bunch of copyright headers out of the top-level
README.txt file, so I updated the relevant copyright header years in
the top-level LICENSE file to match what was removed from README.txt.

Additionally, this commit did (very) little to update the actual
content of the README files.  A very small number of updates were made
for topics that I found blatantly obvious while Markdown-izing the
content, but in general, I did not update content during this commit.
For example, there's still quite a bit of text about ORTE that was not
meaningfully updated.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Co-authored-by: Josh Hursey <jhursey@us.ibm.com>
This commit is contained in:
Jeff Squyres 2020-11-08 13:19:39 -05:00
parent 686c2142e2
commit c960d292ec
53 changed files with 4558 additions and 4582 deletions

272
HACKING

@@ -1,272 +0,0 @@
Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
University Research and Technology
Corporation. All rights reserved.
Copyright (c) 2004-2005 The University of Tennessee and The University
of Tennessee Research Foundation. All rights
reserved.
Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
University of Stuttgart. All rights reserved.
Copyright (c) 2004-2005 The Regents of the University of California.
All rights reserved.
Copyright (c) 2008-2020 Cisco Systems, Inc. All rights reserved.
Copyright (c) 2013 Intel, Inc. All rights reserved.
$COPYRIGHT$
Additional copyrights may follow
$HEADER$
Overview
========
This file is here for those who are building/exploring OMPI in its
source code form, most likely through a developer's tree (i.e., a
Git clone).
Developer Builds: Compiler Pickyness by Default
===============================================
If you are building Open MPI from a Git clone (i.e., there is a ".git"
directory in your build tree), the default build includes extra
compiler pickyness, which will result in more compiler warnings than
in non-developer builds. Getting these extra compiler warnings is
helpful to Open MPI developers in making the code base as clean as
possible.
Developers can disable this picky-by-default behavior by using the
--disable-picky configure option. Also note that extra-picky compiles
do *not* happen automatically when you do a VPATH build (e.g., if
".git" is in your source tree, but not in your build tree).
Prior versions of Open MPI would automatically activate a lot of
(performance-reducing) debugging code by default if ".git" was found
in your build tree. This is no longer true. You can manually enable
these (performance-reducing) debugging features in the Open MPI code
base with these configure options:
--enable-debug
--enable-mem-debug
--enable-mem-profile
NOTE: These options are really only relevant to those who are
developing Open MPI itself. They are not generally helpful for
debugging general MPI applications.
Use of GNU Autoconf, Automake, and Libtool (and m4)
===================================================
You need to read/care about this section *ONLY* if you are building
from a developer's tree (i.e., a Git clone of the Open MPI source
tree). If you have an Open MPI distribution tarball, the contents of
this section are optional -- you can (and probably should) skip
reading this section.
If you are building Open MPI from a developer's tree, you must first
install fairly recent versions of the GNU tools Autoconf, Automake,
and Libtool (and possibly GNU m4, because recent versions of Autoconf
have specific GNU m4 version requirements). The specific versions
required depend on if you are using the Git master branch or a release
branch (and which release branch you are using). The specific
versions can be found here:
https://www.open-mpi.org/source/building.php
You can check what versions of the autotools you have installed with
the following:
shell$ m4 --version
shell$ autoconf --version
shell$ automake --version
shell$ libtoolize --version
Required version levels for all the OMPI releases can be found here:
https://www.open-mpi.org/source/building.php
To strengthen the above point: the core Open MPI developers typically
use very, very recent versions of the GNU tools. There are known bugs
in older versions of the GNU tools that Open MPI no longer compensates
for (it seemed senseless to indefinitely support patches for ancient
versions of Autoconf, for example). You *WILL* have problems if you
do not use recent versions of the GNU tools.
If you need newer versions, you are *strongly* encouraged to heed the
following advice:
NOTE: On MacOS/X, the default "libtool" program is different than the
GNU libtool. You must download and install the GNU version
(e.g., via MacPorts, Homebrew, or some other mechanism).
1. Unless your OS distribution has easy-to-use binary installations,
the sources can be can be downloaded from:
ftp://ftp.gnu.org/gnu/autoconf/
ftp://ftp.gnu.org/gnu/automake/
ftp://ftp.gnu.org/gnu/libtool/
and if you need it:
ftp://ftp.gnu.org/gnu/m4/
NOTE: It is certainly easiest to download/build/install all four of
these tools together. But note that Open MPI has no specific m4
requirements; it is only listed here because Autoconf requires
minimum versions of GNU m4. Hence, you may or may not *need* to
actually install a new version of GNU m4. That being said, if you
are confused or don't know, just install the latest GNU m4 with the
rest of the GNU Autotools and everything will work out fine.
2. Build and install the tools in the following order:
2a. m4
2b. Autoconf
2c. Automake
2d. Libtool
3. You MUST install the last three tools (Autoconf, Automake, Libtool)
into the same prefix directory. These three tools are somewhat
inter-related, and if they're going to be used together, they MUST
share a common installation prefix.
You can install m4 anywhere as long as it can be found in the path;
it may be convenient to install it in the same prefix as the other
three. Or you can use any recent-enough m4 that is in your path.
3a. It is *strongly* encouraged that you do not install your new
versions over the OS-installed versions. This could cause
other things on your system to break. Instead, install into
$HOME/local, or /usr/local, or wherever else you tend to
install "local" kinds of software.
3b. In doing so, be sure to prefix your $path with the directory
where they are installed. For example, if you install into
$HOME/local, you may want to edit your shell startup file
(.bashrc, .cshrc, .tcshrc, etc.) to have something like:
# For bash/sh:
export PATH=$HOME/local/bin:$PATH
# For csh/tcsh:
set path = ($HOME/local/bin $path)
3c. Ensure to set your $path *BEFORE* you configure/build/install
the four packages.
4. All four packages require two simple commands to build and
install (where PREFIX is the prefix discussed in 3, above).
shell$ cd <m4 directory>
shell$ ./configure --prefix=PREFIX
shell$ make; make install
--> If you are using the csh or tcsh shells, be sure to run the
"rehash" command after you install each package.
shell$ cd <autoconf directory>
shell$ ./configure --prefix=PREFIX
shell$ make; make install
--> If you are using the csh or tcsh shells, be sure to run the
"rehash" command after you install each package.
shell$ cd <automake directory>
shell$ ./configure --prefix=PREFIX
shell$ make; make install
--> If you are using the csh or tcsh shells, be sure to run the
"rehash" command after you install each package.
shell$ cd <libtool directory>
shell$ ./configure --prefix=PREFIX
shell$ make; make install
--> If you are using the csh or tcsh shells, be sure to run the
"rehash" command after you install each package.
m4, Autoconf and Automake build and install very quickly; Libtool will
take a minute or two.
5. You can now run OMPI's top-level "autogen.pl" script. This script
will invoke the GNU Autoconf, Automake, and Libtool commands in the
proper order and setup to run OMPI's top-level "configure" script.
Running autogen.pl may take a few minutes, depending on your
system. It's not very exciting to watch. :-)
If you have a multi-processor system, enabling the multi-threaded
behavior in Automake 1.11 (or newer) can result in autogen.pl
running faster. Do this by setting the AUTOMAKE_JOBS environment
variable to the number of processors (threads) that you want it to
use before invoking autogen.pl. For example (you can again put
this in your shell startup files):
# For bash/sh:
export AUTOMAKE_JOBS=4
# For csh/tcsh:
set AUTOMAKE_JOBS 4
5a. You generally need to run autogen.pl whenever the top-level
file "configure.ac" changes, or any files in the config/ or
<project>/config/ directories change (these directories are
where a lot of "include" files for OMPI's configure script
live).
5b. You do *NOT* need to re-run autogen.pl if you modify a
Makefile.am.
Use of Flex
===========
Flex is used during the compilation of a developer's checkout (it is
not used to build official distribution tarballs). Other flavors of
lex are *not* supported: given the choice of making parsing code
portable between all flavors of lex and doing more interesting work on
Open MPI, we greatly prefer the latter.
Note that no testing has been performed to see what the minimum
version of Flex is required by Open MPI. We suggest that you use
v2.5.35 at the earliest.
*** NOTE: Windows developer builds of Open MPI *require* Flex version
2.5.35. Specifically, we know that v2.5.35 works and 2.5.4a does not.
We have not tested to figure out exactly what the minimum required
flex version is on Windows; we suggest that you use 2.5.35 at the
earliest. It is for this reason that the
contrib/dist/make_dist_tarball script checks for a Windows-friendly
version of flex before continuing.
For now, Open MPI will allow developer builds with Flex 2.5.4. This
is primarily motivated by the fact that RedHat/Centos 5 ships with
Flex 2.5.4. It is likely that someday Open MPI developer builds will
require Flex version >=2.5.35.
Note that the flex-generated code generates some compiler warnings on
some platforms, but the warnings do not seem to be consistent or
uniform on all platforms, compilers, and flex versions. As such, we
have done little to try to remove those warnings.
If you do not have Flex installed, it can be downloaded from the
following URL:
https://github.com/westes/flex
Use of Pandoc
=============
Similar to prior sections, you need to read/care about this section
*ONLY* if you are building from a developer's tree (i.e., a Git clone
of the Open MPI source tree). If you have an Open MPI distribution
tarball, the contents of this section are optional -- you can (and
probably should) skip reading this section.
The Pandoc tool is used to generate Open MPI's man pages.
Specifically: Open MPI's man pages are written in Markdown; Pandoc is
the tool that converts that Markdown to nroff (i.e., the format of man
pages).
You must have Pandoc >=v1.12 when building Open MPI from a developer's
tree. If configure cannot find Pandoc >=v1.12, it will abort.
If you need to install Pandoc, check your operating system-provided
packages (to include MacOS Homebrew and MacPorts). The Pandoc project
itself also offers binaries for their releases:
https://pandoc.org/

258
HACKING.md (new file)

@@ -0,0 +1,258 @@
# Open MPI Hacking / Developer's Guide
## Overview
This file is here for those who are building/exploring OMPI in its
source code form, most likely through a developer's tree (i.e., a
Git clone).
## Developer Builds: Compiler Pickyness by Default
If you are building Open MPI from a Git clone (i.e., there is a `.git`
directory in your build tree), the default build includes extra
compiler pickyness, which will result in more compiler warnings than
in non-developer builds. Getting these extra compiler warnings is
helpful to Open MPI developers in making the code base as clean as
possible.
Developers can disable this picky-by-default behavior by using the
`--disable-picky` configure option. Also note that extra-picky compiles
do *not* happen automatically when you do a VPATH build (e.g., if
`.git` is in your source tree, but not in your build tree).
Prior versions of Open MPI would automatically activate a lot of
(performance-reducing) debugging code by default if `.git` was found
in your build tree. This is no longer true. You can manually enable
these (performance-reducing) debugging features in the Open MPI code
base with these configure options:
* `--enable-debug`
* `--enable-mem-debug`
* `--enable-mem-profile`
***NOTE:*** These options are really only relevant to those who are
developing Open MPI itself. They are not generally helpful for
debugging general MPI applications.
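For instance, the debugging features can be enabled at configure time like this (the install prefix is just an example):

```sh
# Example developer-tree configure with debugging enabled (prefix is illustrative)
./configure --prefix=$HOME/ompi-debug --enable-debug --enable-mem-debug
```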
## Use of GNU Autoconf, Automake, and Libtool (and m4)
You need to read/care about this section *ONLY* if you are building
from a developer's tree (i.e., a Git clone of the Open MPI source
tree). If you have an Open MPI distribution tarball, the contents of
this section are optional -- you can (and probably should) skip
reading this section.
If you are building Open MPI from a developer's tree, you must first
install fairly recent versions of the GNU tools Autoconf, Automake,
and Libtool (and possibly GNU m4, because recent versions of Autoconf
have specific GNU m4 version requirements). The specific versions
required depend on if you are using the Git master branch or a release
branch (and which release branch you are using). [The specific
versions can be found
here](https://www.open-mpi.org/source/building.php).
You can check what versions of the autotools you have installed with
the following:
```
shell$ m4 --version
shell$ autoconf --version
shell$ automake --version
shell$ libtoolize --version
```
[Required version levels for all the OMPI releases can be found
here](https://www.open-mpi.org/source/building.php).
To strengthen the above point: the core Open MPI developers typically
use very, very recent versions of the GNU tools. There are known bugs
in older versions of the GNU tools that Open MPI no longer compensates
for (it seemed senseless to indefinitely support patches for ancient
versions of Autoconf, for example). You *WILL* have problems if you
do not use recent versions of the GNU tools.
***NOTE:*** On MacOS/X, the default `libtool` program is different
than the GNU libtool. You must download and install the GNU version
(e.g., via MacPorts, Homebrew, or some other mechanism).
If you need newer versions, you are *strongly* encouraged to heed the
following advice:
1. Unless your OS distribution has easy-to-use binary installations,
the sources can be downloaded from:
* https://ftp.gnu.org/gnu/autoconf/
* https://ftp.gnu.org/gnu/automake/
* https://ftp.gnu.org/gnu/libtool/
* And if you need it: https://ftp.gnu.org/gnu/m4/
***NOTE:*** It is certainly easiest to download/build/install all
four of these tools together. But note that Open MPI has no
specific m4 requirements; it is only listed here because Autoconf
requires minimum versions of GNU m4. Hence, you may or may not
*need* to actually install a new version of GNU m4. That being
said, if you are confused or don't know, just install the latest
GNU m4 with the rest of the GNU Autotools and everything will work
out fine.
1. Build and install the tools in the following order:
1. m4
1. Autoconf
1. Automake
1. Libtool
1. You MUST install the last three tools (Autoconf, Automake, Libtool)
into the same prefix directory. These three tools are somewhat
inter-related, and if they're going to be used together, they MUST
share a common installation prefix.
You can install m4 anywhere as long as it can be found in the path;
it may be convenient to install it in the same prefix as the other
three. Or you can use any recent-enough m4 that is in your path.
1. It is *strongly* encouraged that you do not install your new
versions over the OS-installed versions. This could cause
other things on your system to break. Instead, install into
`$HOME/local`, or `/usr/local`, or wherever else you tend to
install "local" kinds of software.
1. In doing so, be sure to prefix your $path with the directory
where they are installed. For example, if you install into
`$HOME/local`, you may want to edit your shell startup file
(`.bashrc`, `.cshrc`, `.tcshrc`, etc.) to have something like:
```sh
# For bash/sh:
export PATH=$HOME/local/bin:$PATH
# For csh/tcsh:
set path = ($HOME/local/bin $path)
```
1. Be sure to set your `$PATH` *BEFORE* you configure/build/install
the four packages.
1. All four packages require two simple commands to build and
install (where PREFIX is the prefix discussed in 3, above).
```
shell$ cd <m4 directory>
shell$ ./configure --prefix=PREFIX
shell$ make; make install
```
***NOTE:*** If you are using the `csh` or `tcsh` shells, be sure to
run the `rehash` command after you install each package.
```
shell$ cd <autoconf directory>
shell$ ./configure --prefix=PREFIX
shell$ make; make install
```
***NOTE:*** If you are using the `csh` or `tcsh` shells, be sure to
run the `rehash` command after you install each package.
```
shell$ cd <automake directory>
shell$ ./configure --prefix=PREFIX
shell$ make; make install
```
***NOTE:*** If you are using the `csh` or `tcsh` shells, be sure to
run the `rehash` command after you install each package.
```
shell$ cd <libtool directory>
shell$ ./configure --prefix=PREFIX
shell$ make; make install
```
***NOTE:*** If you are using the `csh` or `tcsh` shells, be sure to
run the `rehash` command after you install each package.
m4, Autoconf and Automake build and install very quickly; Libtool
will take a minute or two.
1. You can now run OMPI's top-level `autogen.pl` script. This script
will invoke the GNU Autoconf, Automake, and Libtool commands in the
proper order and setup to run OMPI's top-level `configure` script.
Running `autogen.pl` may take a few minutes, depending on your
system. It's not very exciting to watch. :smile:
If you have a multi-processor system, enabling the multi-threaded
behavior in Automake 1.11 (or newer) can result in `autogen.pl`
running faster. Do this by setting the `AUTOMAKE_JOBS` environment
variable to the number of processors (threads) that you want it to
use before invoking `autogen.pl`. For example (you can again put
this in your shell startup files):
```sh
# For bash/sh:
export AUTOMAKE_JOBS=4
# For csh/tcsh:
set AUTOMAKE_JOBS 4
```
1. You generally need to run autogen.pl whenever the top-level file
`configure.ac` changes, or any files in the `config/` or
`<project>/config/` directories change (these directories are
where a lot of "include" files for Open MPI's `configure` script
live).
1. You do *NOT* need to re-run `autogen.pl` if you modify a
`Makefile.am`.
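Putting the preceding steps together, a typical developer-tree build might look like the following sketch (the install prefix and job count are just examples):

```sh
# Typical developer build after installing recent GNU Autotools
./autogen.pl
./configure --prefix=$HOME/ompi-install
make -j 8 all
make install
```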
## Use of Flex
Flex is used during the compilation of a developer's checkout (it is
not used to build official distribution tarballs). Other flavors of
lex are *not* supported: given the choice of making parsing code
portable between all flavors of lex and doing more interesting work on
Open MPI, we greatly prefer the latter.
Note that no testing has been performed to determine the minimum
version of Flex required by Open MPI. We suggest that you use at
least v2.5.35.
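You can check which version of Flex is installed with:

```sh
shell$ flex --version
```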
***NOTE:*** Windows developer builds of Open MPI *require* Flex version
2.5.35. Specifically, we know that v2.5.35 works and 2.5.4a does not.
We have not tested to figure out exactly what the minimum required
flex version is on Windows; we suggest that you use 2.5.35 at the
earliest. It is for this reason that the
`contrib/dist/make_dist_tarball` script checks for a Windows-friendly
version of Flex before continuing.
For now, Open MPI will allow developer builds with Flex 2.5.4. This
is primarily motivated by the fact that RedHat/Centos 5 ships with
Flex 2.5.4. It is likely that someday Open MPI developer builds will
require Flex version >=2.5.35.
Note that the `flex`-generated code generates some compiler warnings
on some platforms, but the warnings do not seem to be consistent or
uniform on all platforms, compilers, and flex versions. As such, we
have done little to try to remove those warnings.
If you do not have Flex installed, see [the Flex Github
repository](https://github.com/westes/flex).
## Use of Pandoc
Similar to prior sections, you need to read/care about this section
*ONLY* if you are building from a developer's tree (i.e., a Git clone
of the Open MPI source tree). If you have an Open MPI distribution
tarball, the contents of this section are optional -- you can (and
probably should) skip reading this section.
The Pandoc tool is used to generate Open MPI's man pages.
Specifically: Open MPI's man pages are written in Markdown; Pandoc is
the tool that converts that Markdown to nroff (i.e., the format of man
pages).
You must have Pandoc >=v1.12 when building Open MPI from a developer's
tree. If configure cannot find Pandoc >=v1.12, it will abort.
If you need to install Pandoc, check your operating system-provided
packages (to include MacOS Homebrew and MacPorts). [The Pandoc
project web site](https://pandoc.org/) itself also offers binaries for
their releases.
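As a rough sketch of the kind of conversion involved (the filenames are hypothetical, and the exact flags used by Open MPI's build may differ):

```sh
# Convert a Markdown man page source to nroff man format (illustrative filenames)
pandoc -s --from=markdown --to=man MPI_Send.3.md -o MPI_Send.3
```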

11
LICENSE

@@ -15,9 +15,9 @@ Copyright (c) 2004-2010 High Performance Computing Center Stuttgart,
University of Stuttgart. All rights reserved.
Copyright (c) 2004-2008 The Regents of the University of California.
All rights reserved.
Copyright (c) 2006-2017 Los Alamos National Security, LLC. All rights
Copyright (c) 2006-2018 Los Alamos National Security, LLC. All rights
reserved.
Copyright (c) 2006-2017 Cisco Systems, Inc. All rights reserved.
Copyright (c) 2006-2020 Cisco Systems, Inc. All rights reserved.
Copyright (c) 2006-2010 Voltaire, Inc. All rights reserved.
Copyright (c) 2006-2017 Sandia National Laboratories. All rights reserved.
Copyright (c) 2006-2010 Sun Microsystems, Inc. All rights reserved.
@@ -25,7 +25,7 @@ Copyright (c) 2006-2010 Sun Microsystems, Inc. All rights reserved.
Copyright (c) 2006-2017 The University of Houston. All rights reserved.
Copyright (c) 2006-2009 Myricom, Inc. All rights reserved.
Copyright (c) 2007-2017 UT-Battelle, LLC. All rights reserved.
Copyright (c) 2007-2017 IBM Corporation. All rights reserved.
Copyright (c) 2007-2020 IBM Corporation. All rights reserved.
Copyright (c) 1998-2005 Forschungszentrum Juelich, Juelich Supercomputing
Centre, Federal Republic of Germany
Copyright (c) 2005-2008 ZIH, TU Dresden, Federal Republic of Germany
@@ -45,7 +45,7 @@ Copyright (c) 2016 ARM, Inc. All rights reserved.
Copyright (c) 2010-2011 Alex Brick <bricka@ccs.neu.edu>. All rights reserved.
Copyright (c) 2012 The University of Wisconsin-La Crosse. All rights
reserved.
Copyright (c) 2013-2016 Intel, Inc. All rights reserved.
Copyright (c) 2013-2020 Intel, Inc. All rights reserved.
Copyright (c) 2011-2017 NVIDIA Corporation. All rights reserved.
Copyright (c) 2016 Broadcom Limited. All rights reserved.
Copyright (c) 2011-2017 Fujitsu Limited. All rights reserved.
@@ -56,7 +56,8 @@ Copyright (c) 2013-2017 Research Organization for Information Science (RIST).
Copyright (c) 2017-2020 Amazon.com, Inc. or its affiliates. All Rights
reserved.
Copyright (c) 2018 DataDirect Networks. All rights reserved.
Copyright (c) 2018-2019 Triad National Security, LLC. All rights reserved.
Copyright (c) 2018-2020 Triad National Security, LLC. All rights reserved.
Copyright (c) 2020 Google, LLC. All rights reserved.
$COPYRIGHT$

Makefile.am

@@ -24,7 +24,7 @@
SUBDIRS = config contrib 3rd-party $(MCA_PROJECT_SUBDIRS) test
DIST_SUBDIRS = config contrib 3rd-party $(MCA_PROJECT_DIST_SUBDIRS) test
EXTRA_DIST = README INSTALL VERSION Doxyfile LICENSE autogen.pl README.JAVA.txt AUTHORS
EXTRA_DIST = README.md INSTALL VERSION Doxyfile LICENSE autogen.pl README.JAVA.md AUTHORS
include examples/Makefile.include

2243
README

The diff for this file is not shown because it is too large.

281
README.JAVA.md (new file)

@@ -0,0 +1,281 @@
# Open MPI Java Bindings
## Important note
JAVA BINDINGS ARE PROVIDED ON A "PROVISIONAL" BASIS - I.E., THEY ARE
NOT PART OF THE CURRENT OR PROPOSED MPI STANDARDS. THUS, INCLUSION OF
JAVA SUPPORT IS NOT REQUIRED BY THE STANDARD. CONTINUED INCLUSION OF
THE JAVA BINDINGS IS CONTINGENT UPON ACTIVE USER INTEREST AND
CONTINUED DEVELOPER SUPPORT.
## Overview
This version of Open MPI provides support for Java-based
MPI applications.
The rest of this document provides step-by-step instructions on
building OMPI with Java bindings, and compiling and running Java-based
MPI applications. Also, part of the functionality is explained with
examples. Further details about the design, implementation and usage
of Java bindings in Open MPI can be found in [1]. The bindings follow
a JNI approach, that is, we do not provide a pure Java implementation
of MPI primitives, but a thin layer on top of the C
implementation. This is the same approach as in mpiJava [2]; in fact,
mpiJava was taken as a starting point for Open MPI Java bindings, but
they were later totally rewritten.
1. O. Vega-Gisbert, J. E. Roman, and J. M. Squyres. "Design and
implementation of Java bindings in Open MPI". Parallel Comput.
59: 1-20 (2016).
2. M. Baker et al. "mpiJava: An object-oriented Java interface to
MPI". In Parallel and Distributed Processing, LNCS vol. 1586,
pp. 748-762, Springer (1999).
## Building Java Bindings
If this software was obtained as a developer-level checkout as opposed
to a tarball, you will need to start your build by running
`./autogen.pl`. This will also require that you have a fairly recent
version of GNU Autotools on your system - see the HACKING.md file for
details.
Java support requires that Open MPI be built at least with shared libraries
(i.e., `--enable-shared`) - any additional options are fine and will not
conflict. Note that this is the default for Open MPI, so you don't
have to explicitly add the option. The Java bindings will build only
if `--enable-mpi-java` is specified, and a JDK is found in a typical
system default location.
If the JDK is not in a place where we automatically find it, you can
specify the location. For example, this is required on the Mac
platform as the JDK headers are located in a non-typical location. Two
options are available for this purpose:
1. `--with-jdk-bindir=<foo>`: the location of `javac` and `javah`
1. `--with-jdk-headers=<bar>`: the directory containing `jni.h`
For simplicity, typical configurations are provided in platform files
under `contrib/platform/hadoop`. These will meet the needs of most
users, or at least provide a starting point for your own custom
configuration.
In summary, therefore, you can configure the system using the
following Java-related options:
```
$ ./configure --with-platform=contrib/platform/hadoop/<your-platform> ...
```
or
```
$ ./configure --enable-mpi-java --with-jdk-bindir=<foo> --with-jdk-headers=<bar> ...
```
or simply
```
$ ./configure --enable-mpi-java ...
```
if JDK is in a "standard" place that we automatically find.
## Running Java Applications
For convenience, the `mpijavac` wrapper compiler has been provided for
compiling Java-based MPI applications. It ensures that all required MPI
libraries and class paths are defined. You can see the actual command
line using the `--showme` option, if you are interested.
Once your application has been compiled, you can run it with the
standard `mpirun` command line:
```
$ mpirun <options> java <your-java-options> <my-app>
```
For convenience, `mpirun` has been updated to detect the `java` command
and ensure that the required MPI libraries and class paths are defined
to support execution. You therefore do _NOT_ need to specify the Java
library path to the MPI installation, nor the MPI classpath. Any class
path definitions required for your application should be specified
either on the command line or via the `CLASSPATH` environment
variable. Note that the local directory will be added to the class
path if nothing is specified.
As always, the `java` executable, all required libraries, and your
application classes must be available on all nodes.
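For example, compiling and launching the `ComputePi` program shown later in this document might look like this (the process count is arbitrary):

```sh
# Compile with the mpijavac wrapper compiler, then launch via mpirun
mpijavac ComputePi.java
mpirun -np 4 java ComputePi
```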
## Basic usage of Java bindings
There is an MPI package that contains all classes of the MPI Java
bindings: `Comm`, `Datatype`, `Request`, etc. These classes have a
direct correspondence with classes defined by the MPI standard. MPI
primitives are just methods included in these classes. The convention
used for naming Java methods and classes is the usual camel-case
convention, e.g., the equivalent of `MPI_File_set_info(fh,info)` is
`fh.setInfo(info)`, where `fh` is an object of the class `File`.
Apart from classes, the MPI package contains predefined public
attributes under a convenience class `MPI`. Examples are the
predefined communicator `MPI.COMM_WORLD` or predefined datatypes such
as `MPI.DOUBLE`. Also, MPI initialization and finalization are methods
of the `MPI` class and must be invoked by all MPI Java
applications. The following example illustrates these concepts:
```java
import mpi.*;
class ComputePi {
public static void main(String args[]) throws MPIException {
MPI.Init(args);
int rank = MPI.COMM_WORLD.getRank(),
size = MPI.COMM_WORLD.getSize(),
nint = 100; // Intervals.
double h = 1.0/(double)nint, sum = 0.0;
for(int i=rank+1; i<=nint; i+=size) {
double x = h * ((double)i - 0.5);
sum += (4.0 / (1.0 + x * x));
}
double sBuf[] = { h * sum },
rBuf[] = new double[1];
MPI.COMM_WORLD.reduce(sBuf, rBuf, 1, MPI.DOUBLE, MPI.SUM, 0);
if(rank == 0) System.out.println("PI: " + rBuf[0]);
MPI.Finalize();
}
}
```
## Exception handling
Java bindings in Open MPI support exception handling. By default, errors
are fatal, but this behavior can be changed. The Java API will throw
exceptions if the MPI.ERRORS_RETURN error handler is set:
```java
MPI.COMM_WORLD.setErrhandler(MPI.ERRORS_RETURN);
```
If you add this statement to your program, it will show the line
where it breaks, instead of just crashing in case of an error.
Error-handling code can be separated from main application code by
means of try-catch blocks, for instance:
```java
try
{
File file = new File(MPI.COMM_SELF, "filename", MPI.MODE_RDONLY);
}
catch(MPIException ex)
{
System.err.println("Error Message: "+ ex.getMessage());
System.err.println(" Error Class: "+ ex.getErrorClass());
ex.printStackTrace();
System.exit(-1);
}
```
## How to specify buffers
In MPI primitives that require a buffer (either send or receive) the
Java API admits a Java array. Since Java arrays can be relocated by
the Java runtime environment, the MPI Java bindings need to make a
copy of the contents of the array to a temporary buffer, then pass the
pointer to this buffer to the underlying C implementation. From the
practical point of view, this implies an overhead associated to all
buffers that are represented by Java arrays. The overhead is small
for small buffers but increases for large arrays.
There is a pool of temporary buffers with a default capacity of 64K.
If a temporary buffer of 64K or less is needed, then the buffer will
be obtained from the pool. But if the buffer is larger, then it will
be necessary to allocate the buffer and free it later.
The default capacity of pool buffers can be modified with an Open MPI
MCA parameter:
```
shell$ mpirun --mca mpi_java_eager size ...
```
Where `size` is the number of bytes, or kilobytes if it ends with 'k',
or megabytes if it ends with 'm'.
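For example, to raise the pool buffer capacity to 4 megabytes for a run (the application name is hypothetical):

```sh
# Increase the Java temporary buffer pool capacity to 4 MB
mpirun --mca mpi_java_eager 4m -np 2 java MyApp
```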
An alternative is to use "direct buffers" provided by standard classes
available in the Java SDK such as `ByteBuffer`. For convenience we
provide a few static methods `new[Type]Buffer` in the `MPI` class to
create direct buffers for a number of basic datatypes. Elements of the
direct buffer can be accessed with methods `put()` and `get()`, and
the number of elements in the buffer can be obtained with the method
`capacity()`. This example illustrates its use:
```java
int myself = MPI.COMM_WORLD.getRank();
int tasks = MPI.COMM_WORLD.getSize();
IntBuffer in = MPI.newIntBuffer(MAXLEN * tasks),
out = MPI.newIntBuffer(MAXLEN);
for(int i = 0; i < MAXLEN; i++)
out.put(i, myself); // fill the buffer with the rank
Request request = MPI.COMM_WORLD.iAllGather(
out, MAXLEN, MPI.INT, in, MAXLEN, MPI.INT);
request.waitFor();
request.free();
for(int i = 0; i < tasks; i++)
{
for(int k = 0; k < MAXLEN; k++)
{
if(in.get(k + i * MAXLEN) != i)
throw new AssertionError("Unexpected value");
}
}
```
Direct buffers are available for: `BYTE`, `CHAR`, `SHORT`, `INT`,
`LONG`, `FLOAT`, and `DOUBLE`. There is no direct buffer for booleans.
Direct buffers are not a replacement for arrays, because they have
higher allocation and deallocation costs than arrays. In some
cases arrays will be a better choice. You can easily convert a
buffer into an array and vice versa.
All non-blocking methods must use direct buffers and only
blocking methods can choose between arrays and direct buffers.
The above example also illustrates that it is necessary to call
the `free()` method on objects whose class implements the `Freeable`
interface. Otherwise a memory leak is produced.
## Specifying offsets in buffers
In a C program, it is common to specify an offset in an array with
`&array[i]` or `array+i`, for instance to send data starting from
a given position in the array. The equivalent form in the Java bindings
is to `slice()` the buffer to start at an offset. Making a `slice()`
on a buffer is only necessary when the offset is not zero. Slices
work for both arrays and direct buffers.
```java
import static mpi.MPI.slice;
// ...
int numbers[] = new int[SIZE];
// ...
MPI.COMM_WORLD.send(slice(numbers, offset), count, MPI.INT, 1, 0);
```
## Questions? Problems?
If you have any problems, or find any bugs, please feel free to report
them to [Open MPI user's mailing
list](https://www.open-mpi.org/community/lists/ompi.php).

README.JAVA.txt

@@ -1,275 +0,0 @@
***************************************************************************
IMPORTANT NOTE
JAVA BINDINGS ARE PROVIDED ON A "PROVISIONAL" BASIS - I.E., THEY ARE
NOT PART OF THE CURRENT OR PROPOSED MPI STANDARDS. THUS, INCLUSION OF
JAVA SUPPORT IS NOT REQUIRED BY THE STANDARD. CONTINUED INCLUSION OF
THE JAVA BINDINGS IS CONTINGENT UPON ACTIVE USER INTEREST AND
CONTINUED DEVELOPER SUPPORT.
***************************************************************************
This version of Open MPI provides support for Java-based
MPI applications.
The rest of this document provides step-by-step instructions on
building OMPI with Java bindings, and compiling and running
Java-based MPI applications. Also, part of the functionality is
explained with examples. Further details about the design,
implementation and usage of Java bindings in Open MPI can be found
in [1]. The bindings follow a JNI approach, that is, we do not
provide a pure Java implementation of MPI primitives, but a thin
layer on top of the C implementation. This is the same approach
as in mpiJava [2]; in fact, mpiJava was taken as a starting point
for Open MPI Java bindings, but they were later totally rewritten.
[1] O. Vega-Gisbert, J. E. Roman, and J. M. Squyres. "Design and
implementation of Java bindings in Open MPI". Parallel Comput.
59: 1-20 (2016).
[2] M. Baker et al. "mpiJava: An object-oriented Java interface to
MPI". In Parallel and Distributed Processing, LNCS vol. 1586,
pp. 748-762, Springer (1999).
============================================================================
Building Java Bindings
If this software was obtained as a developer-level
checkout as opposed to a tarball, you will need to start your build by
running ./autogen.pl. This will also require that you have a fairly
recent version of autotools on your system - see the HACKING file for
details.
Java support requires that Open MPI be built at least with shared libraries
(i.e., --enable-shared) - any additional options are fine and will not
conflict. Note that this is the default for Open MPI, so you don't
have to explicitly add the option. The Java bindings will build only
if --enable-mpi-java is specified, and a JDK is found in a typical
system default location.
If the JDK is not in a place where we automatically find it, you can
specify the location. For example, this is required on the Mac
platform as the JDK headers are located in a non-typical location. Two
options are available for this purpose:
--with-jdk-bindir=<foo> - the location of javac and javah
--with-jdk-headers=<bar> - the directory containing jni.h
For simplicity, typical configurations are provided in platform files
under contrib/platform/hadoop. These will meet the needs of most
users, or at least provide a starting point for your own custom
configuration.
In summary, therefore, you can configure the system using the
following Java-related options:
$ ./configure --with-platform=contrib/platform/hadoop/<your-platform>
...
or
$ ./configure --enable-mpi-java --with-jdk-bindir=<foo>
--with-jdk-headers=<bar> ...
or simply
$ ./configure --enable-mpi-java ...
if JDK is in a "standard" place that we automatically find.
----------------------------------------------------------------------------
Running Java Applications
For convenience, the "mpijavac" wrapper compiler has been provided for
compiling Java-based MPI applications. It ensures that all required MPI
libraries and class paths are defined. You can see the actual command
line using the --showme option, if you are interested.
Once your application has been compiled, you can run it with the
standard "mpirun" command line:
$ mpirun <options> java <your-java-options> <my-app>
For convenience, mpirun has been updated to detect the "java" command
and ensure that the required MPI libraries and class paths are defined
to support execution. You therefore do NOT need to specify the Java
library path to the MPI installation, nor the MPI classpath. Any class
path definitions required for your application should be specified
either on the command line or via the CLASSPATH environmental
variable. Note that the local directory will be added to the class
path if nothing is specified.
As always, the "java" executable, all required libraries, and your
application classes must be available on all nodes.
----------------------------------------------------------------------------
Basic usage of Java bindings
There is an MPI package that contains all classes of the MPI Java
bindings: Comm, Datatype, Request, etc. These classes have a direct
correspondence with classes defined by the MPI standard. MPI primitives
are just methods included in these classes. The convention used for
naming Java methods and classes is the usual camel-case convention,
e.g., the equivalent of MPI_File_set_info(fh,info) is fh.setInfo(info),
where fh is an object of the class File.
Apart from classes, the MPI package contains predefined public attributes
under a convenience class MPI. Examples are the predefined communicator
MPI.COMM_WORLD or predefined datatypes such as MPI.DOUBLE. Also, MPI
initialization and finalization are methods of the MPI class and must
be invoked by all MPI Java applications. The following example illustrates
these concepts:
import mpi.*;
class ComputePi {
public static void main(String args[]) throws MPIException {
MPI.Init(args);
int rank = MPI.COMM_WORLD.getRank(),
size = MPI.COMM_WORLD.getSize(),
nint = 100; // Intervals.
double h = 1.0/(double)nint, sum = 0.0;
for(int i=rank+1; i<=nint; i+=size) {
double x = h * ((double)i - 0.5);
sum += (4.0 / (1.0 + x * x));
}
double sBuf[] = { h * sum },
rBuf[] = new double[1];
MPI.COMM_WORLD.reduce(sBuf, rBuf, 1, MPI.DOUBLE, MPI.SUM, 0);
if(rank == 0) System.out.println("PI: " + rBuf[0]);
MPI.Finalize();
}
}
----------------------------------------------------------------------------
Exception handling
Java bindings in Open MPI support exception handling. By default, errors
are fatal, but this behavior can be changed. The Java API will throw
exceptions if the MPI.ERRORS_RETURN error handler is set:
MPI.COMM_WORLD.setErrhandler(MPI.ERRORS_RETURN);
If you add this statement to your program, it will show the line
where it breaks, instead of just crashing in case of an error.
Error-handling code can be separated from main application code by
means of try-catch blocks, for instance:
try
{
File file = new File(MPI.COMM_SELF, "filename", MPI.MODE_RDONLY);
}
catch(MPIException ex)
{
System.err.println("Error Message: "+ ex.getMessage());
System.err.println(" Error Class: "+ ex.getErrorClass());
ex.printStackTrace();
System.exit(-1);
}
----------------------------------------------------------------------------
How to specify buffers
In MPI primitives that require a buffer (either send or receive) the
Java API admits a Java array. Since Java arrays can be relocated by
the Java runtime environment, the MPI Java bindings need to make a
copy of the contents of the array to a temporary buffer, then pass the
pointer to this buffer to the underlying C implementation. From the
practical point of view, this implies an overhead associated to all
buffers that are represented by Java arrays. The overhead is small
for small buffers but increases for large arrays.
There is a pool of temporary buffers with a default capacity of 64K.
If a temporary buffer of 64K or less is needed, then the buffer will
be obtained from the pool. But if the buffer is larger, then it will
be necessary to allocate the buffer and free it later.
The default capacity of pool buffers can be modified with an 'mca'
parameter:
mpirun --mca mpi_java_eager size ...
Where 'size' is the number of bytes, or kilobytes if it ends with 'k',
or megabytes if it ends with 'm'.
An alternative is to use "direct buffers" provided by standard
classes available in the Java SDK such as ByteBuffer. For convenience
we provide a few static methods "new[Type]Buffer" in the MPI class
to create direct buffers for a number of basic datatypes. Elements
of the direct buffer can be accessed with methods put() and get(),
and the number of elements in the buffer can be obtained with the
method capacity(). This example illustrates its use:
int myself = MPI.COMM_WORLD.getRank();
int tasks = MPI.COMM_WORLD.getSize();
IntBuffer in = MPI.newIntBuffer(MAXLEN * tasks),
out = MPI.newIntBuffer(MAXLEN);
for(int i = 0; i < MAXLEN; i++)
out.put(i, myself); // fill the buffer with the rank
Request request = MPI.COMM_WORLD.iAllGather(
out, MAXLEN, MPI.INT, in, MAXLEN, MPI.INT);
request.waitFor();
request.free();
for(int i = 0; i < tasks; i++)
{
for(int k = 0; k < MAXLEN; k++)
{
if(in.get(k + i * MAXLEN) != i)
throw new AssertionError("Unexpected value");
}
}
Direct buffers are available for: BYTE, CHAR, SHORT, INT, LONG,
FLOAT, and DOUBLE. There is no direct buffer for booleans.
Direct buffers are not a replacement for arrays, because they have
higher allocation and deallocation costs than arrays. In some
cases arrays will be a better choice. You can easily convert a
buffer into an array and vice versa.
All non-blocking methods must use direct buffers and only
blocking methods can choose between arrays and direct buffers.
The above example also illustrates that it is necessary to call
the free() method on objects whose class implements the Freeable
interface. Otherwise a memory leak is produced.
----------------------------------------------------------------------------
Specifying offsets in buffers
In a C program, it is common to specify an offset in a array with
"&array[i]" or "array+i", for instance to send data starting from
a given position in the array. The equivalent form in the Java bindings
is to "slice()" the buffer to start at an offset. Making a "slice()"
on a buffer is only necessary, when the offset is not zero. Slices
work for both arrays and direct buffers.
import static mpi.MPI.slice;
...
int numbers[] = new int[SIZE];
...
MPI.COMM_WORLD.send(slice(numbers, offset), count, MPI.INT, 1, 0);
----------------------------------------------------------------------------
If you have any problems, or find any bugs, please feel free to report
them to Open MPI user's mailing list (see
https://www.open-mpi.org/community/lists/ompi.php).

2191
README.md (new file)

The diff for this file is not shown because it is too large.


@@ -64,7 +64,7 @@ EXTRA_DIST = \
platform/lanl/cray_xc_cle5.2/optimized-common \
platform/lanl/cray_xc_cle5.2/optimized-lustre \
platform/lanl/cray_xc_cle5.2/optimized-lustre.conf \
platform/lanl/toss/README \
platform/lanl/toss/README.md \
platform/lanl/toss/common \
platform/lanl/toss/common-optimized \
platform/lanl/toss/cray-lustre-optimized \


@@ -1,121 +1,108 @@
# Description
2 Feb 2011
Description
===========
This sample "tcp2" BTL component is a simple example of how to build
This sample `tcp2` BTL component is a simple example of how to build
an Open MPI MCA component from outside of the Open MPI source tree.
This is a valuable technique for 3rd parties who want to provide their
own components for Open MPI, but do not want to be in the mainstream
distribution (i.e., their code is not part of the main Open MPI code
base).
NOTE: We do recommend that 3rd party developers investigate using a
DVCS such as Mercurial or Git to keep up with Open MPI
development. Using a DVCS allows you to host your component in
your own copy of the Open MPI source tree, and yet still keep up
with development changes, stable releases, etc.
Previous colloquial knowledge held that building a component from
outside of the Open MPI source tree required configuring Open MPI
--with-devel-headers, and then building and installing it. This
configure switch installs all of OMPI's internal .h files under
$prefix/include/openmpi, and therefore allows 3rd party code to be
`--with-devel-headers`, and then building and installing it. This
configure switch installs all of OMPI's internal `.h` files under
`$prefix/include/openmpi`, and therefore allows 3rd party code to be
compiled outside of the Open MPI tree.
This method definitely works, but is annoying:
* You have to ask users to use this special configure switch.
* Not all users install from source; many get binary packages (e.g.,
RPMs).
* You have to ask users to use this special configure switch.
* Not all users install from source; many get binary packages (e.g.,
RPMs).
This example package shows two ways to build an Open MPI MCA component
from outside the Open MPI source tree:
1. Using the above --with-devel-headers technique
2. Compiling against the Open MPI source tree itself (vs. the
installation tree)
1. Using the above `--with-devel-headers` technique
2. Compiling against the Open MPI source tree itself (vs. the
installation tree)
The user still has to have a source tree, but at least they don't have
to be required to use --with-devel-headers (which most users don't) --
to be required to use `--with-devel-headers` (which most users don't) --
they can likely build off the source tree that they already used.
Example project contents
========================
# Example project contents
The "tcp2" component is a direct copy of the TCP BTL as of January
The `tcp2` component is a direct copy of the TCP BTL as of January
2011 -- it has just been renamed so that it can be built separately
and installed alongside the real TCP BTL component.
Most of the mojo for both methods is handled in the example
components' configure.ac, but the same techniques are applicable
components' `configure.ac`, but the same techniques are applicable
outside of the GNU Auto toolchain.
This sample "tcp2" component has an autogen.sh script that requires
This sample `tcp2` component has an `autogen.sh` script that requires
the normal Autoconf, Automake, and Libtool. It also adds the
following two configure switches:
--with-openmpi-install=DIR
1. `--with-openmpi-install=DIR`:
If provided, `DIR` is an Open MPI installation tree that was
installed `--with-devel-headers`.
If provided, DIR is an Open MPI installation tree that was
installed --with-devel-headers.
This switch uses the installed mpicc --showme:<foo> functionality
to extract the relevant CPPFLAGS, LDFLAGS, and LIBS.
--with-openmpi-source=DIR
If provided, DIR is the source of a configured and built Open MPI
This switch uses the installed `mpicc --showme:<foo>` functionality
to extract the relevant `CPPFLAGS`, `LDFLAGS`, and `LIBS`.
1. `--with-openmpi-source=DIR`:
If provided, `DIR` is the source of a configured and built Open MPI
source tree (corresponding to the version expected by the example
component). The source tree is not required to have been
configured --with-devel-headers.
configured `--with-devel-headers`.
This switch uses the source tree's config.status script to extract
the relevant CPPFLAGS and CFLAGS.
This switch uses the source tree's `config.status` script to
extract the relevant `CPPFLAGS` and `CFLAGS`.
Either one of these two switches must be provided, or appropriate
CPPFLAGS, CFLAGS, LDFLAGS, and/or LIBS must be provided such that
valid Open MPI header and library files can be found and compiled /
linked against, respectively.
`CPPFLAGS`, `CFLAGS`, `LDFLAGS`, and/or `LIBS` must be provided such
that valid Open MPI header and library files can be found and compiled
/ linked against, respectively.
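As a quick way to see what the first switch would extract from an installed Open MPI, the wrapper compiler can be asked to print its flags (the output varies by installation):

```sh
# Inspect the flags the installed Open MPI wrapper compiler would use
mpicc --showme:compile   # preprocessor/compile flags
mpicc --showme:link      # linker flags and libraries
```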
Example use
===========
# Example use
First, download, build, and install Open MPI:
-----
```
$ cd $HOME
$ wget \
https://www.open-mpi.org/software/ompi/vX.Y/downloads/openmpi-X.Y.Z.tar.bz2
[lots of output]
$ wget https://www.open-mpi.org/software/ompi/vX.Y/downloads/openmpi-X.Y.Z.tar.bz2
[...lots of output...]
$ tar jxf openmpi-X.Y.Z.tar.bz2
$ cd openmpi-X.Y.Z
$ ./configure --prefix=/opt/openmpi ...
[lots of output]
[...lots of output...]
$ make -j 4 install
[lots of output]
[...lots of output...]
$ /opt/openmpi/bin/ompi_info | grep btl
MCA btl: self (MCA vA.B, API vM.N, Component vX.Y.Z)
MCA btl: sm (MCA vA.B, API vM.N, Component vX.Y.Z)
MCA btl: tcp (MCA vA.B, API vM.N, Component vX.Y.Z)
[where X.Y.Z, A.B, and M.N are appropriate for your version of Open MPI]
$
-----
```
Notice the installed BTLs from ompi_info.
Notice the installed BTLs from `ompi_info`.
Now cd into this example project and build it, pointing it to the
Now `cd` into this example project and build it, pointing it to the
source directory of the Open MPI that you just built. Note that we
use the same --prefix as when installing Open MPI (so that the built
use the same `--prefix` as when installing Open MPI (so that the built
component will be installed into the Right place):
-----
```
$ cd /path/to/this/sample
$ ./autogen.sh
$ ./configure --prefix=/opt/openmpi --with-openmpi-source=$HOME/openmpi-X.Y.Z
[lots of output]
[...lots of output...]
$ make -j 4 install
[lots of output]
[...lots of output...]
$ /opt/openmpi/bin/ompi_info | grep btl
MCA btl: self (MCA vA.B, API vM.N, Component vX.Y.Z)
MCA btl: sm (MCA vA.B, API vM.N, Component vX.Y.Z)
@@ -123,12 +110,11 @@ $ /opt/openmpi/bin/ompi_info | grep btl
MCA btl: tcp2 (MCA vA.B, API vM.N, Component vX.Y.Z)
[where X.Y.Z, A.B, and M.N are appropriate for your version of Open MPI]
$
-----
```
Notice that the "tcp2" BTL is now installed.
Notice that the `tcp2` BTL is now installed.
Random notes
============
# Random notes
The component in this project is just an example; I whipped it up in
the span of several hours. Your component may be a bit more complex
@@ -139,17 +125,15 @@ what you need.
Changes required to the component to make it build in a standalone
mode:
1. Write your own configure script. This component is just a sample.
You basically need to build against an OMPI install that was
installed --with-devel-headers or a built OMPI source tree. See
./configure --help for details.
2. I also provided a bogus btl_tcp2_config.h (generated by configure).
This file is not included anywhere, but it does provide protection
against re-defined PACKAGE_* macros when running configure, which
is quite annoying.
3. Modify Makefile.am to only build DSOs. I.e., you can optionally
1. Write your own `configure` script. This component is just a
sample. You basically need to build against an OMPI install that
was installed `--with-devel-headers` or a built OMPI source tree.
See `./configure --help` for details.
1. I also provided a bogus `btl_tcp2_config.h` (generated by
`configure`). This file is not included anywhere, but it does
provide protection against re-defined `PACKAGE_*` macros when
running `configure`, which is quite annoying.
1. Modify `Makefile.am` to only build DSOs. I.e., you can optionally
take the static option out since the component can *only* build in
DSO mode when building standalone. That being said, it doesn't
hurt to leave the static builds in -- this would (hypothetically)

105
contrib/dist/linux/README (vendored)

@@ -1,105 +0,0 @@
Copyright (c) 2004-2006 The Trustees of Indiana University and Indiana
University Research and Technology
Corporation. All rights reserved.
Copyright (c) 2004-2006 The University of Tennessee and The University
of Tennessee Research Foundation. All rights
reserved.
Copyright (c) 2004-2006 High Performance Computing Center Stuttgart,
University of Stuttgart. All rights reserved.
Copyright (c) 2004-2006 The Regents of the University of California.
All rights reserved.
Copyright (c) 2006-2016 Cisco Systems, Inc. All rights reserved.
$COPYRIGHT$
Additional copyrights may follow
$HEADER$
===========================================================================
Note that you probably want to download the latest release of the SRPM
for any given Open MPI version. The SRPM release number is the
version after the dash in the SRPM filename. For example,
"openmpi-1.6.3-2.src.rpm" is the 2nd release of the SRPM for Open MPI
v1.6.3. Subsequent releases of SRPMs typically contain bug fixes for
the RPM packaging, but not Open MPI itself.
The buildrpm.sh script takes a single mandatory argument -- a filename
pointing to an Open MPI tarball (may be either .gz or .bz2). It will
create one or more RPMs from this tarball:
1. Source RPM
2. "All in one" RPM, where all of Open MPI is put into a single RPM.
3. "Multiple" RPM, where Open MPI is split into several sub-package
RPMs:
- openmpi-runtime
- openmpi-devel
- openmpi-docs
The folowing arguments could be used to affect script behaviour.
Please, do NOT set the same settings with parameters and config vars.
-b
If you specify this option, only the all-in-one binary RPM will
be built. By default, only the source RPM (SRPM) is built. Other
parameters that affect the all-in-one binary RPM will be ignored
unless this option is specified.
-n name
This option will change the name of the produced RPM to the "name".
It is useful to use with "-o" and "-m" options if you want to have
multiple Open MPI versions installed simultaneously in the same
enviroment. Requires use of option "-b".
-o
With this option the install path of the binary RPM will be changed
to /opt/_NAME_/_VERSION_. Requires use of option "-b".
-m
This option causes the RPM to also install modulefiles
to the location specified in the specfile. Requires use of option "-b".
-i
Also build a debuginfo RPM. By default, the debuginfo RPM is not built.
Requires use of option "-b".
-f lf_location
Include support for Libfabric. "lf_location" is Libfabric install
path. Requires use of option "-b".
-t tm_location
Include support for Torque/PBS Pro. "tm_location" is path of the
Torque/PBS Pro header files. Requires use of option "-b".
-d
Build with debugging support. By default,
the RPM is built without debugging support.
-c parameter
Add custom configure parameter.
-r parameter
Add custom RPM build parameter.
-s
If specified, the script will try to unpack the openmpi.spec
file from the tarball specified on the command line. By default,
the script will look for the specfile in the current directory.
-R directory
Specifies the top level RPM build direcotry.
-h
Prints script usage information.
Target architecture is currently hard-coded in the beginning
of the buildrpm.sh script.
Alternatively, you can build directly from the openmpi.spec spec file
or SRPM directly. Many options can be passed to the building process
via rpmbuild's --define option (there are older versions of rpmbuild
that do not seem to handle --define'd values properly in all cases,
but we generally don't care about those old versions of rpmbuild...).
The available options are described in the comments in the beginning
of the spec file in this directory.

88
contrib/dist/linux/README.md (vendored, new file)

@@ -0,0 +1,88 @@
# Open MPI Linux distribution helpers
Note that you probably want to download the latest release of the SRPM
for any given Open MPI version. The SRPM release number is the
version after the dash in the SRPM filename. For example,
`openmpi-1.6.3-2.src.rpm` is the 2nd release of the SRPM for Open MPI
v1.6.3. Subsequent releases of SRPMs typically contain bug fixes for
the RPM packaging, but not Open MPI itself.
The `buildrpm.sh` script takes a single mandatory argument -- a
filename pointing to an Open MPI tarball (may be either `.gz` or
`.bz2`). It will create one or more RPMs from this tarball:
1. Source RPM
1. "All in one" RPM, where all of Open MPI is put into a single RPM.
1. "Multiple" RPM, where Open MPI is split into several sub-package
RPMs:
* `openmpi-runtime`
* `openmpi-devel`
* `openmpi-docs`
The following arguments can be used to affect the script's behavior.
Please do NOT set the same settings both via command-line parameters and via config vars.
* `-b`:
If you specify this option, only the all-in-one binary RPM will
be built. By default, only the source RPM (SRPM) is built. Other
parameters that affect the all-in-one binary RPM will be ignored
unless this option is specified.
* `-n name`:
This option changes the name of the produced RPM to "name".
It is useful together with the `-o` and `-m` options if you want to have
multiple Open MPI versions installed simultaneously in the same
environment. Requires use of option `-b`.
* `-o`:
With this option the install path of the binary RPM will be changed
to `/opt/_NAME_/_VERSION_`. Requires use of option `-b`.
* `-m`:
This option causes the RPM to also install modulefiles
to the location specified in the specfile. Requires use of option `-b`.
* `-i`:
Also build a debuginfo RPM. By default, the debuginfo RPM is not built.
Requires use of option `-b`.
* `-f lf_location`:
Include support for Libfabric. `lf_location` is the Libfabric install
path. Requires use of option `-b`.
* `-t tm_location`:
Include support for Torque/PBS Pro. `tm_location` is the path to the
Torque/PBS Pro header files. Requires use of option `-b`.
* `-d`:
Build with debugging support. By default,
the RPM is built without debugging support.
* `-c parameter`:
Add custom configure parameter.
* `-r parameter`:
Add custom RPM build parameter.
* `-s`:
If specified, the script will try to unpack the openmpi.spec
file from the tarball specified on the command line. By default,
the script will look for the specfile in the current directory.
* `-R directory`:
Specifies the top-level RPM build directory.
* `-h`:
Prints script usage information.
The target architecture is currently hard-coded at the beginning
of the `buildrpm.sh` script.
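For example, a hypothetical invocation that builds an all-in-one
binary RPM from a tarball, installs under `/opt`, and also installs a
modulefile might look like the following (the tarball filename is only
an illustration):

```
shell$ ./buildrpm.sh -b -o -m openmpi-4.1.0.tar.bz2
```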
Alternatively, you can build directly from the `openmpi.spec` spec
file or the SRPM. Many options can be passed to the build process via
`rpmbuild`'s `--define` option (some older versions of `rpmbuild` do
not handle `--define`'d values properly in all cases, but we generally
don't care about those old versions of `rpmbuild`...). The available
options are described in the comments at the beginning of the spec
file in this directory.
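As a sketch only, a rebuild of the SRPM with a custom configure option
might look like the example below; the `configure_options` macro name
is a placeholder here, so check the comments at the top of
`openmpi.spec` for the names the spec file actually supports:

```
shell$ rpmbuild --rebuild \
    --define 'configure_options --with-slurm' \
    openmpi-4.1.0-1.src.rpm
```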

@ -61,7 +61,7 @@ created.
- copy of toss3-hfi-optimized.conf with the following changes:
- change: comment "Add the interface for out-of-band communication and set
it up" to "Set up the interface for out-of-band communication"
- remove: oob_tcp_if_exclude = ib0
- remove: oob_tcp_if_exclude = ib0
- remove: btl (let Open MPI figure out what best to use for ethernet-
connected hardware)
- remove: btl_openib_want_fork_support (no infiniband)

@ -33,7 +33,7 @@
# Automake).
EXTRA_DIST += \
examples/README \
examples/README.md \
examples/Makefile \
examples/hello_c.c \
examples/hello_mpifh.f \

@ -1,67 +0,0 @@
Copyright (c) 2004-2006 The Trustees of Indiana University and Indiana
University Research and Technology
Corporation. All rights reserved.
Copyright (c) 2006-2012 Cisco Systems, Inc. All rights reserved.
Copyright (c) 2007-2009 Sun Microsystems, Inc. All rights reserved.
Copyright (c) 2010 Oracle and/or its affiliates. All rights reserved.
Copyright (c) 2013 Mellanox Technologies, Inc. All rights reserved.
$COPYRIGHT$
The files in this directory are sample MPI applications provided both
as a trivial primer to MPI as well as simple tests to ensure that your
Open MPI installation is working properly.
If you are looking for a comprehensive MPI tutorial, these samples are
not enough. Excellent MPI tutorials are available here:
http://www.citutor.org/login.php
Get a free account and login; you can then browse to the list of
available courses. Look for the ones with "MPI" in the title.
There are two MPI examples in this directory, each using one of six
different MPI interfaces:
- Hello world
C: hello_c.c
C++: hello_cxx.cc
Fortran mpif.h: hello_mpifh.f
Fortran use mpi: hello_usempi.f90
Fortran use mpi_f08: hello_usempif08.f90
Java: Hello.java
C shmem.h: hello_oshmem_c.c
Fortran shmem.fh: hello_oshmemfh.f90
- Send a trivial message around in a ring
C: ring_c.c
C++: ring_cxx.cc
Fortran mpif.h: ring_mpifh.f
Fortran use mpi: ring_usempi.f90
Fortran use mpi_f08: ring_usempif08.f90
Java: Ring.java
C shmem.h: ring_oshmem_c.c
Fortran shmem.fh: ring_oshmemfh.f90
Additionally, there's one further example application, but this one
only uses the MPI C bindings:
- Test the connectivity between all processes
C: connectivity_c.c
The Makefile in this directory will build as many of the examples as
you have language support for (e.g., if you do not have the Fortran "use
mpi" bindings compiled as part of Open MPI, then those examples will be
skipped).
The Makefile assumes that the wrapper compilers mpicc, mpic++, and
mpifort are in your path.
Although the Makefile is tailored for Open MPI (e.g., it checks the
"ompi_info" command to see if you have support for C++, mpif.h, use
mpi, and use mpi_f08 F90), all of the example programs are pure MPI,
and therefore not specific to Open MPI. Hence, you can use a
different MPI implementation to compile and run these programs if you
wish.
Make today an Open MPI day!

examples/README.md
@ -0,0 +1,66 @@
The files in this directory are sample MPI applications provided both
as a trivial primer to MPI as well as simple tests to ensure that your
Open MPI installation is working properly.
If you are looking for a comprehensive MPI tutorial, these samples are
not enough. [Excellent MPI tutorials are available
here](http://www.citutor.org/login.php).
Get a free account and login; you can then browse to the list of
available courses. Look for the ones with "MPI" in the title.
There are two MPI examples in this directory, each using one of six
different MPI interfaces:
## Hello world
The MPI version of the canonical "hello world" program:
* C: `hello_c.c`
* C++: `hello_cxx.cc`
* Fortran mpif.h: `hello_mpifh.f`
* Fortran use mpi: `hello_usempi.f90`
* Fortran use mpi_f08: `hello_usempif08.f90`
* Java: `Hello.java`
* C shmem.h: `hello_oshmem_c.c`
* Fortran shmem.fh: `hello_oshmemfh.f90`
## Ring
Send a trivial message around in a ring:
* C: `ring_c.c`
* C++: `ring_cxx.cc`
* Fortran mpif.h: `ring_mpifh.f`
* Fortran use mpi: `ring_usempi.f90`
* Fortran use mpi_f08: `ring_usempif08.f90`
* Java: `Ring.java`
* C shmem.h: `ring_oshmem_c.c`
* Fortran shmem.fh: `ring_oshmemfh.f90`
## Connectivity Test
Additionally, there's one further example application, but this one
only uses the MPI C bindings to test the connectivity between all
processes:
* C: `connectivity_c.c`
## Makefile
The `Makefile` in this directory will build as many of the examples as
you have language support for (e.g., if you do not have the Fortran `use
mpi` bindings compiled as part of Open MPI, then those examples will be
skipped).
The `Makefile` assumes that the wrapper compilers `mpicc`, `mpic++`, and
`mpifort` are in your path.
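For example, assuming the wrapper compilers are in your `PATH`, a
typical build-and-run session might look like this (the `-np 4` value
is arbitrary):

```
shell$ make
shell$ mpirun -np 4 ./hello_c
shell$ mpirun -np 4 ./ring_c
```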
Although the `Makefile` is tailored for Open MPI (e.g., it checks the
`ompi_info` command to see if you have support for `mpif.h`, the `mpi`
module, and the `use mpi_f08` module), all of the example programs are
pure MPI, and therefore not specific to Open MPI. Hence, you can use
a different MPI implementation to compile and run these programs if
you wish.
Make today an Open MPI day!

ompi/contrib/README.md
@ -0,0 +1,19 @@
This is the OMPI contrib system. It is (far) less functional and
flexible than the OMPI MCA framework/component system.
Each contrib package must have a `configure.m4`. It may optionally also
have an `autogen.subdirs` file.
If it has a `configure.m4` file, it must specify its own relevant
files to `AC_CONFIG_FILES` to create during `AC_OUTPUT` -- just like
MCA components (at a minimum, usually its own `Makefile`). The
`configure.m4` file will be slurped up into the main `configure`
script, just like other MCA components. Note that there is currently
no "no configure" option for contrib packages -- you *must* have a
`configure.m4` (even if all it does it call `$1`). Feel free to fix
this situation if you want -- it probably won't not be too difficult
to extend `autogen.pl` to support this scenario, similar to how it is
done for MCA components. :smile:
If it has an `autogen.subdirs` file, then it needs to be a
subdirectory that is autogen-able.

@ -1,19 +0,0 @@
This is the OMPI contrib system. It is (far) less functional and
flexible than the OMPI MCA framework/component system.
Each contrib package must have a configure.m4. It may optionally also
have an autogen.subdirs file.
If it has a configure.m4 file, it must specify its own relevant files
to AC_CONFIG_FILES to create during AC_OUTPUT -- just like MCA
components (at a minimum, usually its own Makefile). The configure.m4
file will be slurped up into the main configure script, just like
other MCA components. Note that there is currently no "no configure"
option for contrib packages -- you *must* have a configure.m4 (even if
all it does is call $1). Feel free to fix this situation if you want
-- it probably won't be too difficult to extend autogen.pl to
support this scenario, similar to how it is done for MCA components.
:-)
If it has an autogen.subdirs file, then it needs to be a subdirectory
that is autogen-able.

@ -13,7 +13,7 @@
# $HEADER$
#
EXTRA_DIST = profile2mat.pl aggregate_profile.pl
EXTRA_DIST = profile2mat.pl aggregate_profile.pl README.md
sources = common_monitoring.c common_monitoring_coll.c
headers = common_monitoring.h common_monitoring_coll.h

@ -1,181 +0,0 @@
Copyright (c) 2013-2015 The University of Tennessee and The University
of Tennessee Research Foundation. All rights
reserved.
Copyright (c) 2013-2015 Inria. All rights reserved.
$COPYRIGHT$
Additional copyrights may follow
$HEADER$
===========================================================================
Low level communication monitoring interface in Open MPI
Introduction
------------
This interface traces and monitors all messages sent by MPI before they go to the
communication channels. At that level, all communications are point-to-point:
collectives are already decomposed into send and receive calls.
The monitoring is stored internally by each process and output on stderr at the end of the
application (during MPI_Finalize()).
Enabling the monitoring
-----------------------
To enable the monitoring add --mca pml_monitoring_enable x to the mpirun command line.
If x = 1 it monitors internal and external tags indifferently and aggregates everything.
If x = 2 it monitors internal tags and external tags separately.
If x = 0 the monitoring is disabled.
Other values of x are not supported.
Internal tags are tags < 0. They are used to tag sends and receives coming from
collective operations or from protocol communications.
External tags are tags >= 0. They are used by the application in point-to-point communication.
Therefore, distinguishing external and internal tags helps to distinguish between point-to-point
and other communication (mainly collectives).
Output format
-------------
The output of the monitoring looks like (with --mca pml_monitoring_enable 2):
I 0 1 108 bytes 27 msgs sent
E 0 1 1012 bytes 30 msgs sent
E 0 2 23052 bytes 61 msgs sent
I 1 2 104 bytes 26 msgs sent
I 1 3 208 bytes 52 msgs sent
E 1 0 860 bytes 24 msgs sent
E 1 3 2552 bytes 56 msgs sent
I 2 3 104 bytes 26 msgs sent
E 2 0 22804 bytes 49 msgs sent
E 2 3 860 bytes 24 msgs sent
I 3 0 104 bytes 26 msgs sent
I 3 1 204 bytes 51 msgs sent
E 3 1 2304 bytes 44 msgs sent
E 3 2 860 bytes 24 msgs sent
Where:
- the first column distinguishes internal (I) and external (E) tags.
- the second column is the sender rank
- the third column is the receiver rank
- the fourth column is the number of bytes sent
- the last column is the number of messages.
In this example, process 0 has sent 27 messages to process 1 using point-to-point calls
(108 bytes in total) and 30 messages with collective- and protocol-related communication
(1012 bytes in total).
If the monitoring was called with --mca pml_monitoring_enable 1, everything is aggregated
under the internal tags. With the above example, you have:
I 0 1 1120 bytes 57 msgs sent
I 0 2 23052 bytes 61 msgs sent
I 1 0 860 bytes 24 msgs sent
I 1 2 104 bytes 26 msgs sent
I 1 3 2760 bytes 108 msgs sent
I 2 0 22804 bytes 49 msgs sent
I 2 3 964 bytes 50 msgs sent
I 3 0 104 bytes 26 msgs sent
I 3 1 2508 bytes 95 msgs sent
I 3 2 860 bytes 24 msgs sent
Monitoring phases
-----------------
If one wants to monitor phases of the application, it is possible to flush the monitoring
at the application level. In this case all the monitoring since the last flush is stored
by every process in a file.
An example of how to flush such monitoring is given in test/monitoring/monitoring_test.c
Moreover, all the different flushed phases are aggregated at runtime and output at the end
of the application as described above.
Example
-------
A working example is given in test/monitoring/monitoring_test.c.
It features MPI_COMM_WORLD monitoring, sub-communicator monitoring, collective and
point-to-point communication monitoring, and phase monitoring.
To compile:
> make monitoring_test
Helper scripts
--------------
Two perl scripts are provided in test/monitoring:
- aggregate_profile.pl aggregates the monitoring phases of different processes.
This script aggregates the profiles generated by the flush_monitoring function.
The files need to be in a given format: name_<phase_id>_<process_id>
They are then aggregated by phases.
If you need the profile of all the phases, you can concatenate the different files,
or use the output of the monitoring system produced at MPI_Finalize.
In the example it should be called as:
./aggregate_profile.pl prof/phase to generate
prof/phase_1.prof
prof/phase_2.prof
- profile2mat.pl transforms the monitoring output into a communication matrix.
It takes a profile file and aggregates all the recorded communicators into matrices.
It generates a matrix for the number of messages (msg),
for the total bytes transmitted (size), and
for the average number of bytes per message (avg).
The output matrix is symmetric.
Do not forget to set the execute permission on these scripts.
For instance, the provided examples store phases output in ./prof
If you type:
> mpirun -np 4 --mca pml_monitoring_enable 2 ./monitoring_test
you should have the following output
Proc 3 flushing monitoring to: ./prof/phase_1_3.prof
Proc 0 flushing monitoring to: ./prof/phase_1_0.prof
Proc 2 flushing monitoring to: ./prof/phase_1_2.prof
Proc 1 flushing monitoring to: ./prof/phase_1_1.prof
Proc 1 flushing monitoring to: ./prof/phase_2_1.prof
Proc 3 flushing monitoring to: ./prof/phase_2_3.prof
Proc 0 flushing monitoring to: ./prof/phase_2_0.prof
Proc 2 flushing monitoring to: ./prof/phase_2_2.prof
I 2 3 104 bytes 26 msgs sent
E 2 0 22804 bytes 49 msgs sent
E 2 3 860 bytes 24 msgs sent
I 3 0 104 bytes 26 msgs sent
I 3 1 204 bytes 51 msgs sent
E 3 1 2304 bytes 44 msgs sent
E 3 2 860 bytes 24 msgs sent
I 0 1 108 bytes 27 msgs sent
E 0 1 1012 bytes 30 msgs sent
E 0 2 23052 bytes 61 msgs sent
I 1 2 104 bytes 26 msgs sent
I 1 3 208 bytes 52 msgs sent
E 1 0 860 bytes 24 msgs sent
E 1 3 2552 bytes 56 msgs sent
you can parse the phases with:
> ./aggregate_profile.pl prof/phase
Building prof/phase_1.prof
Building prof/phase_2.prof
And you can build the different communication matrices of phase 1 with:
> ./profile2mat.pl prof/phase_1.prof
prof/phase_1.prof -> all
prof/phase_1_size_all.mat
prof/phase_1_msg_all.mat
prof/phase_1_avg_all.mat
prof/phase_1.prof -> external
prof/phase_1_size_external.mat
prof/phase_1_msg_external.mat
prof/phase_1_avg_external.mat
prof/phase_1.prof -> internal
prof/phase_1_size_internal.mat
prof/phase_1_msg_internal.mat
prof/phase_1_avg_internal.mat
Credit
------
Designed by George Bosilca <bosilca@icl.utk.edu> and
Emmanuel Jeannot <emmanuel.jeannot@inria.fr>

ompi/mca/common/monitoring/README.md
@ -0,0 +1,209 @@
# Open MPI common monitoring module
Copyright (c) 2013-2015 The University of Tennessee and The University
of Tennessee Research Foundation. All rights
reserved.
Copyright (c) 2013-2015 Inria. All rights reserved.
Low level communication monitoring interface in Open MPI
## Introduction
This interface traces and monitors all messages sent by MPI before
they go to the communication channels. At that level, all
communications are point-to-point: collectives are
already decomposed into send and receive calls.
The monitoring is stored internally by each process and output on
stderr at the end of the application (during `MPI_Finalize()`).
## Enabling the monitoring
To enable the monitoring add `--mca pml_monitoring_enable x` to the
`mpirun` command line:
* If x = 1, it monitors internal and external tags indifferently and aggregates everything.
* If x = 2, it monitors internal tags and external tags separately.
* If x = 0, the monitoring is disabled.
* Other values of x are not supported.
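For example (`./my_mpi_app` is a placeholder for your own application):

```
shell$ mpirun -np 4 --mca pml_monitoring_enable 2 ./my_mpi_app
```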
Internal tags are tags < 0. They are used to tag sends and receives
coming from collective operations or from protocol communications.
External tags are tags >= 0. They are used by the application in
point-to-point communication.
Therefore, distinguishing external and internal tags helps to
distinguish between point-to-point and other communication (mainly
collectives).
## Output format
The output of the monitoring looks like (with `--mca
pml_monitoring_enable 2`):
```
I 0 1 108 bytes 27 msgs sent
E 0 1 1012 bytes 30 msgs sent
E 0 2 23052 bytes 61 msgs sent
I 1 2 104 bytes 26 msgs sent
I 1 3 208 bytes 52 msgs sent
E 1 0 860 bytes 24 msgs sent
E 1 3 2552 bytes 56 msgs sent
I 2 3 104 bytes 26 msgs sent
E 2 0 22804 bytes 49 msgs sent
E 2 3 860 bytes 24 msgs sent
I 3 0 104 bytes 26 msgs sent
I 3 1 204 bytes 51 msgs sent
E 3 1 2304 bytes 44 msgs sent
E 3 2 860 bytes 24 msgs sent
```
Where:
1. the first column distinguishes internal (I) and external (E) tags.
1. the second column is the sender rank
1. the third column is the receiver rank
1. the fourth column is the number of bytes sent
1. the last column is the number of messages.
In this example, process 0 has sent 27 messages to process 1 using
point-to-point calls (108 bytes in total) and 30 messages with
collective- and protocol-related communication (1012 bytes in total).
If the monitoring was called with `--mca pml_monitoring_enable 1`,
everything is aggregated under the internal tags. With the above
example, you have:
```
I 0 1 1120 bytes 57 msgs sent
I 0 2 23052 bytes 61 msgs sent
I 1 0 860 bytes 24 msgs sent
I 1 2 104 bytes 26 msgs sent
I 1 3 2760 bytes 108 msgs sent
I 2 0 22804 bytes 49 msgs sent
I 2 3 964 bytes 50 msgs sent
I 3 0 104 bytes 26 msgs sent
I 3 1 2508 bytes 95 msgs sent
I 3 2 860 bytes 24 msgs sent
```
## Monitoring phases
If one wants to monitor phases of the application, it is possible to
flush the monitoring at the application level. In this case all the
monitoring since the last flush is stored by every process in a file.
An example of how to flush such monitoring is given in
`test/monitoring/monitoring_test.c`.
Moreover, all the different flushed phases are aggregated at runtime
and output at the end of the application as described above.
## Example
A working example is given in `test/monitoring/monitoring_test.c`. It
features `MPI_COMM_WORLD` monitoring, sub-communicator monitoring,
collective and point-to-point communication monitoring, and phase
monitoring.
To compile:
```
shell$ make monitoring_test
```
## Helper scripts
Two perl scripts are provided in `test/monitoring`:
1. `aggregate_profile.pl` aggregates the monitoring phases of
different processes. This script aggregates the profiles generated by
the `flush_monitoring` function.
The files need to be in a given format: `name_<phase_id>_<process_id>`.
They are then aggregated by phases.
If you need the profile of all the phases, you can concatenate the
different files, or use the output of the monitoring system produced
at `MPI_Finalize`. In the example it should be called as:
```
./aggregate_profile.pl prof/phase
```
to generate `prof/phase_1.prof` and `prof/phase_2.prof`.
1. `profile2mat.pl` transforms the monitoring output into a
communication matrix. It takes a profile file and aggregates all the
recorded communicators into matrices. It generates a matrix for
the number of messages (msg), for the total bytes transmitted
(size), and for the average number of bytes per message (avg).
The output matrix is symmetric.
For instance, the provided examples store phases output in `./prof`:
```
shell$ mpirun -np 4 --mca pml_monitoring_enable 2 ./monitoring_test
```
This should produce the following output:
```
Proc 3 flushing monitoring to: ./prof/phase_1_3.prof
Proc 0 flushing monitoring to: ./prof/phase_1_0.prof
Proc 2 flushing monitoring to: ./prof/phase_1_2.prof
Proc 1 flushing monitoring to: ./prof/phase_1_1.prof
Proc 1 flushing monitoring to: ./prof/phase_2_1.prof
Proc 3 flushing monitoring to: ./prof/phase_2_3.prof
Proc 0 flushing monitoring to: ./prof/phase_2_0.prof
Proc 2 flushing monitoring to: ./prof/phase_2_2.prof
I 2 3 104 bytes 26 msgs sent
E 2 0 22804 bytes 49 msgs sent
E 2 3 860 bytes 24 msgs sent
I 3 0 104 bytes 26 msgs sent
I 3 1 204 bytes 51 msgs sent
E 3 1 2304 bytes 44 msgs sent
E 3 2 860 bytes 24 msgs sent
I 0 1 108 bytes 27 msgs sent
E 0 1 1012 bytes 30 msgs sent
E 0 2 23052 bytes 61 msgs sent
I 1 2 104 bytes 26 msgs sent
I 1 3 208 bytes 52 msgs sent
E 1 0 860 bytes 24 msgs sent
E 1 3 2552 bytes 56 msgs sent
```
You can then parse the phases with:
```
shell$ ./aggregate_profile.pl prof/phase
Building prof/phase_1.prof
Building prof/phase_2.prof
```
And you can build the different communication matrices of phase 1
with:
```
shell$ ./profile2mat.pl prof/phase_1.prof
prof/phase_1.prof -> all
prof/phase_1_size_all.mat
prof/phase_1_msg_all.mat
prof/phase_1_avg_all.mat
prof/phase_1.prof -> external
prof/phase_1_size_external.mat
prof/phase_1_msg_external.mat
prof/phase_1_avg_external.mat
prof/phase_1.prof -> internal
prof/phase_1_size_internal.mat
prof/phase_1_msg_internal.mat
prof/phase_1_avg_internal.mat
```
## Authors
Designed by George Bosilca <bosilca@icl.utk.edu> and
Emmanuel Jeannot <emmanuel.jeannot@inria.fr>

@ -1,340 +0,0 @@
OFI MTL:
--------
The OFI MTL supports Libfabric (a.k.a. Open Fabrics Interfaces OFI,
https://ofiwg.github.io/libfabric/) tagged APIs (fi_tagged(3)). At
initialization time, the MTL queries libfabric for providers supporting tag matching
(fi_getinfo(3)). Libfabric will return a list of providers that satisfy the requested
capabilities, having the most performant one at the top of the list.
The user may modify the OFI provider selection with mca parameters
mtl_ofi_provider_include or mtl_ofi_provider_exclude.
PROGRESS:
---------
The MTL registers a progress function to opal_progress. There is currently
no support for asynchronous progress. The progress function reads multiple events
from the OFI provider Completion Queue (CQ) per iteration (defaults to 100, can be
modified with the mca mtl_ofi_progress_event_cnt) and iterates until the
completion queue is drained.
COMPLETIONS:
------------
Each operation uses a request type ompi_mtl_ofi_request_t which includes a reference
to an operation specific completion callback, an MPI request, and a context. The
context (fi_context) is used to map completion events with MPI_requests when reading the
CQ.
OFI TAG:
--------
MPI needs to send 96 bits of information per message (32 bits communicator id,
32 bits source rank, 32 bits MPI tag) but OFI only offers 64 bits tags. In
addition, the OFI MTL uses 2 bits of the OFI tag for the synchronous send protocol.
Therefore, there are only 62 bits available in the OFI tag for message usage. The
OFI MTL offers the mtl_ofi_tag_mode mca parameter with 4 modes to address this:
"auto" (Default):
After the OFI provider is selected, a runtime check is performed to assess
FI_REMOTE_CQ_DATA and FI_DIRECTED_RECV support (see fi_tagged(3), fi_msg(2)
and fi_getinfo(3)). If supported, "ofi_tag_full" is used. If not supported,
fall back to "ofi_tag_1".
"ofi_tag_1":
For providers that do not support FI_REMOTE_CQ_DATA, the OFI MTL will
trim the fields (Communicator ID, Source Rank, MPI tag) to make them fit the 62
bits available in the OFI tag. There are two options available with different
number of bits for the Communicator ID and MPI tag fields. This tag distribution
offers: 12 bits for Communicator ID (max Communicator ID 4,095) subject to
provider reserved bits (see mem_tag_format below), 18 bits for Source Rank (max
Source Rank 262,143), 32 bits for MPI tag (max MPI tag is INT_MAX).
"ofi_tag_2":
Same as 2 "ofi_tag_1" but offering a different OFI tag distribution for
applications that may require a greater number of supported Communicators at the
expense of fewer MPI tag bits. This tag distribution offers: 24 bits for
Communicator ID (max Communicator ED 16,777,215. See mem_tag_format below), 18
bits for Source Rank (max Source Rank 262,143), 20 bits for MPI tag (max MPI tag
524,287).
"ofi_tag_full":
For executions that cannot accept trimming source rank or MPI tag, this mode sends
source rank for each message in the CQ DATA. The Source Rank is made available at
the remote process CQ (FI_CQ_FORMAT_TAGGED is used, see fi_cq(3)) at the completion
of the matching receive operation. Since the minimum size for FI_REMOTE_CQ_DATA
is 32 bits, the Source Rank fits with no limitations. The OFI tag is used for the
Communicator id (28 bits, max Communicator ID 268,435,455. See mem_tag_format below),
and the MPI tag (max MPI tag is INT_MAX). If this mode is selected by the user
and FI_REMOTE_CQ_DATA or FI_DIRECTED_RECV are not supported, the execution will abort.
mem_tag_format (fi_endpoint(3))
Some providers can reserve the higher order bits from the OFI tag for internal purposes.
This is signaled in mem_tag_format (see fi_endpoint(3)) by setting higher order bits
to zero. In such cases, the OFI MTL will reduce the number of communicator ids supported
by reducing the bits available for the communicator ID field in the OFI tag.
SCALABLE ENDPOINTS:
-------------------
OFI MTL supports OFI Scalable Endpoints (SEP) feature as a means to improve
multi-threaded application throughput and message rate. Currently the feature
is designed to utilize multiple TX/RX contexts exposed by the OFI provider in
conjunction with a multi-communicator MPI application model. Therefore, new OFI
contexts are created as and when communicators are duplicated in a lazy fashion
instead of creating them all at once during init time and this approach also
favours only creating as many contexts as needed.
1. Multi-communicator model:
With this approach, the MPI application is required to first duplicate
the communicators it wants to use with MPI operations (ideally creating
as many communicators as the number of threads it wants to use to call
into MPI). The duplicated communicators are then used by the
corresponding threads to perform MPI operations. A possible usage
scenario could be in an MPI + OMP application as follows
(example limited to 2 ranks):
MPI_Comm dup_comm[n];
MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
for (i = 0; i < n; i++) {
MPI_Comm_dup(MPI_COMM_WORLD, &dup_comm[i]);
}
if (rank == 0) {
#pragma omp parallel for private(host_sbuf, host_rbuf) num_threads(n)
for (i = 0; i < n ; i++) {
MPI_Send(host_sbuf, MYBUFSIZE, MPI_CHAR,
1, MSG_TAG, dup_comm[i]);
MPI_Recv(host_rbuf, MYBUFSIZE, MPI_CHAR,
1, MSG_TAG, dup_comm[i], &status);
}
} else if (rank == 1) {
#pragma omp parallel for private(status, host_sbuf, host_rbuf) num_threads(n)
for (i = 0; i < n ; i++) {
MPI_Recv(host_rbuf, MYBUFSIZE, MPI_CHAR,
0, MSG_TAG, dup_comm[i], &status);
MPI_Send(host_sbuf, MYBUFSIZE, MPI_CHAR,
0, MSG_TAG, dup_comm[i]);
}
}
2. MCA variables:
To utilize the feature, the following MCA variables need to be set:
mtl_ofi_enable_sep:
This MCA variable needs to be set to enable the use of Scalable Endpoints (SEP)
feature in the OFI MTL. The underlying provider is also checked to ensure the
feature is supported. If the provider chosen does not support it, user needs
to either set this variable to 0 or select a different provider which supports
the feature.
For single-threaded applications one OFI context is sufficient, so OFI SEPs
may not add benefit.
Note that mtl_ofi_thread_grouping (see below) needs to be enabled to use the
different OFI SEP contexts. Otherwise, only one context (ctxt 0) will be used.
Default: 0
Command-line syntax:
"-mca mtl_ofi_enable_sep 1"
mtl_ofi_thread_grouping:
Turn Thread Grouping feature on. This is needed to use the Multi-communicator
model explained above. This means that the OFI MTL will use the communicator
ID to decide the SEP contexts to be used by the thread. In this way, each
thread will have direct access to different OFI resources. If disabled,
only context 0 will be used.
Requires mtl_ofi_enable_sep to be set to 1.
Default: 0
It is not recommended to set the MCA variable for:
- Multi-threaded MPI applications not following multi-communicator approach.
- Applications that have multiple threads using a single communicator as
it may degrade performance.
Command-line syntax:
"-mca mtl_ofi_thread_grouping 1"
mtl_ofi_num_ctxts:
This MCA variable allows user to set the number of OFI SEP contexts the
application expects to use. For multi-threaded applications using Thread
Grouping feature, this number should be set to the number of user threads
that will call into MPI. This variable will only have effect if
mtl_ofi_enable_sep is set to 1.
Default: 1
Command-line syntax:
"-mca mtl_ofi_num_ctxts N" [ N: number of OFI contexts required by
application ]
3. Notes on performance:
- OFI MTL will create as many TX/RX contexts as set by MCA mtl_ofi_num_ctxts.
The number of contexts that can be created is also limited by the underlying
provider as each provider may have different thresholds. Once the threshold
is exceeded, contexts are used in a round-robin fashion which leads to
resource sharing among threads. Therefore locks are required to guard
against race conditions. For performance, it is recommended to have
Number of threads = Number of communicators = Number of contexts
For example, when using PSM2 provider, the number of contexts is dictated
by the Intel Omni-Path HFI1 driver module.
- OPAL layer allows for multiple threads to enter progress simultaneously. To
enable this feature, user needs to set MCA variable
"max_thread_in_progress". When using Thread Grouping feature, it is
recommended to set this MCA parameter to the number of threads expected to
call into MPI as it provides performance benefits.
Command-line syntax:
"-mca opal_max_thread_in_progress N" [ N: number of threads expected to
make MPI calls ]
Default: 1
- For applications using a single thread with multiple communicators and MCA
variable "mtl_ofi_thread_grouping" set to 1, the MTL will use multiple
contexts, but the benefits may be negligible as only one thread is driving
progress.
SPECIALIZED FUNCTIONS:
-------------------
To improve performance when calling message passing APIs in the OFI mtl
specialized functions are generated at compile time that eliminate all the
if conditionals that can be determined at init and don't need to be
queried again during the critical path. These functions are generated by
perl scripts during make which generate functions and symbols for every
combination of flags for each function.
1. ADDING NEW FLAGS FOR SPECIALIZATION OF EXISTING FUNCTION:
To add a new flag to an existing specialized function for handling cases
where different OFI providers may or may not support a particular feature,
then you must follow these steps:
1) Update the "_generic" function in mtl_ofi.h with the new flag and
the if conditionals to read the new value.
2) Update the *.pm file corresponding to the function with the new flag in:
gen_funcs(), gen_*_function(), & gen_*_sym_init()
3) Update mtl_ofi_opt.h with:
The new flag as #define NEW_FLAG_TYPES #NUMBER_OF_STATES
example: #define OFI_CQ_DATA 2 (only has TRUE/FALSE states)
Update the function's types with:
#define OMPI_MTL_OFI_FUNCTION_TYPES [NEW_FLAG_TYPES]
2. ADDING A NEW FUNCTION FOR SPECIALIZATION:
To add a new function to be specialized you must
follow these steps:
1) Create a new mtl_ofi_"function_name"_opt.pm based off opt_common/mtl_ofi_opt.pm.template
2) Add new .pm file to generated_source_modules in Makefile.am
3) Add .c file to generated_sources in Makefile.am named the same as the corresponding .pm file
4) Update existing or create function in mtl_ofi.h to _generic with new flags.
5) Update mtl_ofi_opt.h with:
a) New function types: #define OMPI_MTL_OFI_FUNCTION_TYPES [FLAG_TYPES]
b) Add new function to the struct ompi_mtl_ofi_symtable:
struct ompi_mtl_ofi_symtable {
...
int (*ompi_mtl_ofi_FUNCTION OMPI_MTL_OFI_FUNCTION_TYPES )
}
c) Add new symbol table init function definition:
void ompi_mtl_ofi_FUNCTION_symtable_init(struct ompi_mtl_ofi_symtable* sym_table);
6) Add calls to init the new function in the symbol table and assign the function
pointer to be used based off the flags in mtl_ofi_component.c:
ompi_mtl_ofi_FUNCTION_symtable_init(&ompi_mtl_ofi.sym_table);
ompi_mtl_ofi.base.mtl_FUNCTION =
ompi_mtl_ofi.sym_table.ompi_mtl_ofi_FUNCTION[ompi_mtl_ofi.flag];
3. EXAMPLE SPECIALIZED FILE:
The code below is an example of what is generated by the specialization
scripts for use in the OFI mtl. This code specializes the blocking
send functionality based on FI_REMOTE_CQ_DATA & OFI Scalable Endpoint support
provided by an OFI Provider. Only one function and symbol is used during
runtime based on if FI_REMOTE_CQ_DATA is supported and/or if OFI Scalable
Endpoint support is enabled.
/*
* Copyright (c) 2013-2018 Intel, Inc. All rights reserved
*
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
#include "mtl_ofi.h"
__opal_attribute_always_inline__ static inline int
ompi_mtl_ofi_send_false_false(struct mca_mtl_base_module_t *mtl,
struct ompi_communicator_t *comm,
int dest,
int tag,
struct opal_convertor_t *convertor,
mca_pml_base_send_mode_t mode)
{
const bool OFI_CQ_DATA = false;
const bool OFI_SCEP_EPS = false;
return ompi_mtl_ofi_send_generic(mtl, comm, dest, tag,
convertor, mode,
OFI_CQ_DATA, OFI_SCEP_EPS);
}
__opal_attribute_always_inline__ static inline int
ompi_mtl_ofi_send_false_true(struct mca_mtl_base_module_t *mtl,
struct ompi_communicator_t *comm,
int dest,
int tag,
struct opal_convertor_t *convertor,
mca_pml_base_send_mode_t mode)
{
const bool OFI_CQ_DATA = false;
const bool OFI_SCEP_EPS = true;
return ompi_mtl_ofi_send_generic(mtl, comm, dest, tag,
convertor, mode,
OFI_CQ_DATA, OFI_SCEP_EPS);
}
__opal_attribute_always_inline__ static inline int
ompi_mtl_ofi_send_true_false(struct mca_mtl_base_module_t *mtl,
struct ompi_communicator_t *comm,
int dest,
int tag,
struct opal_convertor_t *convertor,
mca_pml_base_send_mode_t mode)
{
const bool OFI_CQ_DATA = true;
const bool OFI_SCEP_EPS = false;
return ompi_mtl_ofi_send_generic(mtl, comm, dest, tag,
convertor, mode,
OFI_CQ_DATA, OFI_SCEP_EPS);
}
__opal_attribute_always_inline__ static inline int
ompi_mtl_ofi_send_true_true(struct mca_mtl_base_module_t *mtl,
struct ompi_communicator_t *comm,
int dest,
int tag,
struct opal_convertor_t *convertor,
mca_pml_base_send_mode_t mode)
{
const bool OFI_CQ_DATA = true;
const bool OFI_SCEP_EPS = true;
return ompi_mtl_ofi_send_generic(mtl, comm, dest, tag,
convertor, mode,
OFI_CQ_DATA, OFI_SCEP_EPS);
}
void ompi_mtl_ofi_send_symtable_init(struct ompi_mtl_ofi_symtable* sym_table)
{
sym_table->ompi_mtl_ofi_send[false][false]
= ompi_mtl_ofi_send_false_false;
sym_table->ompi_mtl_ofi_send[false][true]
= ompi_mtl_ofi_send_false_true;
sym_table->ompi_mtl_ofi_send[true][false]
= ompi_mtl_ofi_send_true_false;
sym_table->ompi_mtl_ofi_send[true][true]
= ompi_mtl_ofi_send_true_true;
}
###

ompi/mca/mtl/ofi/README.md
@ -0,0 +1,368 @@
# Open MPI OFI MTL
The OFI MTL supports Libfabric (a.k.a., [Open Fabrics Interfaces
OFI](https://ofiwg.github.io/libfabric/)) tagged APIs
(`fi_tagged(3)`). At initialization time, the MTL queries libfabric
for providers supporting tag matching (`fi_getinfo(3)`). Libfabric
will return a list of providers that satisfy the requested
capabilities, having the most performant one at the top of the list.
The user may modify the OFI provider selection with mca parameters
`mtl_ofi_provider_include` or `mtl_ofi_provider_exclude`.
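For example, a sketch of explicitly selecting the OFI MTL (via the
`cm` PML) and restricting it to a single provider might look like the
following; `psm2` and `./my_mpi_app` are placeholders, so use whatever
provider your fabric actually exposes:

```
shell$ mpirun --mca pml cm --mca mtl ofi \
    --mca mtl_ofi_provider_include psm2 \
    -np 4 ./my_mpi_app
```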
## PROGRESS
The MTL registers a progress function to `opal_progress`. There is
currently no support for asynchronous progress. The progress function
reads multiple events from the OFI provider Completion Queue (CQ) per
iteration (defaults to 100, can be modified with the mca
`mtl_ofi_progress_event_cnt`) and iterates until the completion queue is
drained.
## COMPLETIONS
Each operation uses a request type `ompi_mtl_ofi_request_t` which
includes a reference to an operation specific completion callback, an
MPI request, and a context. The context (`fi_context`) is used to map
completion events with `MPI_requests` when reading the CQ.
## OFI TAG
MPI needs to send 96 bits of information per message (32 bits
communicator id, 32 bits source rank, 32 bits MPI tag) but OFI only
offers 64 bits tags. In addition, the OFI MTL uses 2 bits of the OFI
tag for the synchronous send protocol. Therefore, there are only 62
bits available in the OFI tag for message usage. The OFI MTL offers
the `mtl_ofi_tag_mode` mca parameter with 4 modes to address this:
* `auto` (Default):
After the OFI provider is selected, a runtime check is performed to
assess `FI_REMOTE_CQ_DATA` and `FI_DIRECTED_RECV` support (see
`fi_tagged(3)`, `fi_msg(2)` and `fi_getinfo(3)`). If supported,
`ofi_tag_full` is used. If not supported, fall back to `ofi_tag_1`.
* `ofi_tag_1`:
For providers that do not support `FI_REMOTE_CQ_DATA`, the OFI MTL
will trim the fields (Communicator ID, Source Rank, MPI tag) to make
them fit within the 62 bits available in the OFI tag. There are two
options available with different number of bits for the Communicator
ID and MPI tag fields. This tag distribution offers: 12 bits for
Communicator ID (max Communicator ID 4,095) subject to provider
reserved bits (see `mem_tag_format` below), 18 bits for Source Rank
(max Source Rank 262,143), 32 bits for MPI tag (max MPI tag is
`INT_MAX`).
* `ofi_tag_2`:
Same as `ofi_tag_1` but offering a different OFI tag distribution
for applications that may require a greater number of supported
Communicators at the expense of fewer MPI tag bits. This tag
distribution offers: 24 bits for Communicator ID (max Communicator
ID 16,777,215. See `mem_tag_format` below), 18 bits for Source Rank
(max Source Rank 262,143), 20 bits for MPI tag (max MPI tag
524,287).
* `ofi_tag_full`:
For executions that cannot accept trimming source rank or MPI tag,
this mode sends source rank for each message in the CQ DATA. The
Source Rank is made available at the remote process CQ
(`FI_CQ_FORMAT_TAGGED` is used, see `fi_cq(3)`) at the completion of
the matching receive operation. Since the minimum size for
`FI_REMOTE_CQ_DATA` is 32 bits, the Source Rank fits with no
limitations. The OFI tag is used for the Communicator id (28 bits,
max Communicator ID 268,435,455. See `mem_tag_format` below), and
the MPI tag (max MPI tag is `INT_MAX`). If this mode is selected by
the user and `FI_REMOTE_CQ_DATA` or `FI_DIRECTED_RECV` are not
supported, the execution will abort.
* `mem_tag_format` (`fi_endpoint(3)`)
Some providers can reserve the higher order bits from the OFI tag
for internal purposes. This is signaled in `mem_tag_format` (see
`fi_endpoint(3)`) by setting higher order bits to zero. In such
cases, the OFI MTL will reduce the number of communicator ids
supported by reducing the bits available for the communicator ID
field in the OFI tag.
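For instance, to force a specific tag mode instead of the default
`auto` (a sketch only; `./my_mpi_app` is a placeholder):

```
shell$ mpirun --mca mtl_ofi_tag_mode ofi_tag_full -np 4 ./my_mpi_app
```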
## SCALABLE ENDPOINTS
OFI MTL supports OFI Scalable Endpoints (SEP) feature as a means to
improve multi-threaded application throughput and message
rate. Currently the feature is designed to utilize multiple TX/RX
contexts exposed by the OFI provider in conjunction with a
multi-communicator MPI application model. Therefore, new OFI contexts
are created as and when communicators are duplicated in a lazy fashion
instead of creating them all at once during init time and this
approach also favours only creating as many contexts as needed.
1. Multi-communicator model:
With this approach, the MPI application is required to first duplicate
the communicators it wants to use with MPI operations (ideally creating
as many communicators as the number of threads it wants to use to call
into MPI). The duplicated communicators are then used by the
corresponding threads to perform MPI operations. A possible usage
scenario could be in an MPI + OMP application as follows
(example limited to 2 ranks):
```c
MPI_Comm dup_comm[n];
MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
for (i = 0; i < n; i++) {
MPI_Comm_dup(MPI_COMM_WORLD, &dup_comm[i]);
}
if (rank == 0) {
#pragma omp parallel for private(host_sbuf, host_rbuf) num_threads(n)
for (i = 0; i < n ; i++) {
MPI_Send(host_sbuf, MYBUFSIZE, MPI_CHAR,
1, MSG_TAG, dup_comm[i]);
MPI_Recv(host_rbuf, MYBUFSIZE, MPI_CHAR,
1, MSG_TAG, dup_comm[i], &status);
}
} else if (rank == 1) {
#pragma omp parallel for private(status, host_sbuf, host_rbuf) num_threads(n)
for (i = 0; i < n ; i++) {
MPI_Recv(host_rbuf, MYBUFSIZE, MPI_CHAR,
0, MSG_TAG, dup_comm[i], &status);
MPI_Send(host_sbuf, MYBUFSIZE, MPI_CHAR,
0, MSG_TAG, dup_comm[i]);
}
}
```
2. MCA variables:
To utilize the feature, the following MCA variables need to be set:
* `mtl_ofi_enable_sep`:
This MCA variable needs to be set to enable the use of Scalable
Endpoints (SEP) feature in the OFI MTL. The underlying provider
is also checked to ensure the feature is supported. If the
provider chosen does not support it, user needs to either set
this variable to 0 or select a different provider which supports
the feature. For single-threaded applications one OFI context is
sufficient, so OFI SEPs may not add benefit. Note that
`mtl_ofi_thread_grouping` (see below) needs to be enabled to use
the different OFI SEP contexts. Otherwise, only one context (ctxt
0) will be used.
Default: 0
Command-line syntax: `--mca mtl_ofi_enable_sep 1`
* `mtl_ofi_thread_grouping`:
Turn Thread Grouping feature on. This is needed to use the
Multi-communicator model explained above. This means that the OFI
MTL will use the communicator ID to decide the SEP contexts to be
used by the thread. In this way, each thread will have direct
access to different OFI resources. If disabled, only context 0
will be used. Requires `mtl_ofi_enable_sep` to be set to 1.
Default: 0
It is not recommended to set the MCA variable for:
* Multi-threaded MPI applications not following multi-communicator
approach.
* Applications that have multiple threads using a single
communicator as it may degrade performance.
Command-line syntax: `--mca mtl_ofi_thread_grouping 1`
* `mtl_ofi_num_ctxts`:
This MCA variable allows user to set the number of OFI SEP
contexts the application expects to use. For multi-threaded
applications using Thread Grouping feature, this number should be
set to the number of user threads that will call into MPI. This
variable will only have effect if `mtl_ofi_enable_sep` is set to 1.
Default: 1
Command-line syntax: `--mca mtl_ofi_num_ctxts N` (`N`: number of OFI contexts required by application)
3. Notes on performance:
* OFI MTL will create as many TX/RX contexts as set by MCA
mtl_ofi_num_ctxts. The number of contexts that can be created is
also limited by the underlying provider as each provider may have
different thresholds. Once the threshold is exceeded, contexts are
used in a round-robin fashion which leads to resource sharing
among threads. Therefore locks are required to guard against race
conditions. For performance, it is recommended to have
Number of threads = Number of communicators = Number of contexts
For example, when using PSM2 provider, the number of contexts is
dictated by the Intel Omni-Path HFI1 driver module.
* OPAL layer allows for multiple threads to enter progress
simultaneously. To enable this feature, user needs to set MCA
variable `max_thread_in_progress`. When using Thread Grouping
feature, it is recommended to set this MCA parameter to the number
of threads expected to call into MPI as it provides performance
benefits.
Default: 1
Command-line syntax: `--mca opal_max_thread_in_progress N` (`N`: number of threads expected to make MPI calls )
* For applications using a single thread with multiple communicators
and MCA variable `mtl_ofi_thread_grouping` set to 1, the MTL will
use multiple contexts, but the benefits may be negligible as only
one thread is driving progress.
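Putting the above together, a sketch of a command line for a
multi-threaded run that enables SEP and Thread Grouping with 8
contexts (the application name and all counts are placeholders) might
be:

```
shell$ mpirun -np 2 \
    --mca mtl_ofi_enable_sep 1 \
    --mca mtl_ofi_thread_grouping 1 \
    --mca mtl_ofi_num_ctxts 8 \
    --mca opal_max_thread_in_progress 8 \
    ./my_threaded_app
```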
## SPECIALIZED FUNCTIONS
To improve performance when calling message passing APIs in the OFI
MTL, specialized functions are generated at compile time that eliminate
all the `if` conditionals that can be determined at init time and don't
need to be queried again during the critical path. These functions are
generated by perl scripts during `make`, which generate functions and
symbols for every combination of flags for each function.
1. ADDING NEW FLAGS FOR SPECIALIZATION OF EXISTING FUNCTION:
To add a new flag to an existing specialized function for handling
cases where different OFI providers may or may not support a
particular feature, then you must follow these steps:
1. Update the `_generic` function in `mtl_ofi.h` with the new flag
and the if conditionals to read the new value.
1. Update the `*.pm` file corresponding to the function with the
new flag in: `gen_funcs()`, `gen_*_function()`, &
`gen_*_sym_init()`
1. Update `mtl_ofi_opt.h` with:
* The new flag as `#define NEW_FLAG_TYPES #NUMBER_OF_STATES`.
Example: `#define OFI_CQ_DATA 2` (only has TRUE/FALSE states)
* Update the function's types with:
`#define OMPI_MTL_OFI_FUNCTION_TYPES [NEW_FLAG_TYPES]`
1. ADDING A NEW FUNCTION FOR SPECIALIZATION:
To add a new function to be specialized you must
follow these steps:
1. Create a new `mtl_ofi_<function_name>_opt.pm` based off
`opt_common/mtl_ofi_opt.pm.template`
1. Add new `.pm` file to `generated_source_modules` in `Makefile.am`
1. Add `.c` file to `generated_sources` in `Makefile.am` named the
same as the corresponding `.pm` file
1. Update existing or create function in `mtl_ofi.h` to `_generic`
with new flags.
1. Update `mtl_ofi_opt.h` with:
1. New function types: `#define OMPI_MTL_OFI_FUNCTION_TYPES` `[FLAG_TYPES]`
1. Add new function to the `struct ompi_mtl_ofi_symtable`:
```c
struct ompi_mtl_ofi_symtable {
...
int (*ompi_mtl_ofi_FUNCTION OMPI_MTL_OFI_FUNCTION_TYPES )
}
```
1. Add new symbol table init function definition:
```c
void ompi_mtl_ofi_FUNCTION_symtable_init(struct ompi_mtl_ofi_symtable* sym_table);
```
1. Add calls to init the new function in the symbol table and
assign the function pointer to be used based off the flags in
`mtl_ofi_component.c`:
* `ompi_mtl_ofi_FUNCTION_symtable_init(&ompi_mtl_ofi.sym_table);`
* `ompi_mtl_ofi.base.mtl_FUNCTION = ompi_mtl_ofi.sym_table.ompi_mtl_ofi_FUNCTION[ompi_mtl_ofi.flag];`
## EXAMPLE SPECIALIZED FILE
The code below is an example of what is generated by the
specialization scripts for use in the OFI mtl. This code specializes
the blocking send functionality based on `FI_REMOTE_CQ_DATA` & OFI
Scalable Endpoint support provided by an OFI Provider. Only one
function and symbol is used during runtime based on if
`FI_REMOTE_CQ_DATA` is supported and/or if OFI Scalable Endpoint support
is enabled.
```c
/*
* Copyright (c) 2013-2018 Intel, Inc. All rights reserved
*
* $COPYRIGHT$
*
* Additional copyrights may follow
*
* $HEADER$
*/
#include "mtl_ofi.h"
__opal_attribute_always_inline__ static inline int
ompi_mtl_ofi_send_false_false(struct mca_mtl_base_module_t *mtl,
struct ompi_communicator_t *comm,
int dest,
int tag,
struct opal_convertor_t *convertor,
mca_pml_base_send_mode_t mode)
{
const bool OFI_CQ_DATA = false;
const bool OFI_SCEP_EPS = false;
return ompi_mtl_ofi_send_generic(mtl, comm, dest, tag,
convertor, mode,
OFI_CQ_DATA, OFI_SCEP_EPS);
}
__opal_attribute_always_inline__ static inline int
ompi_mtl_ofi_send_false_true(struct mca_mtl_base_module_t *mtl,
struct ompi_communicator_t *comm,
int dest,
int tag,
struct opal_convertor_t *convertor,
mca_pml_base_send_mode_t mode)
{
const bool OFI_CQ_DATA = false;
const bool OFI_SCEP_EPS = true;
return ompi_mtl_ofi_send_generic(mtl, comm, dest, tag,
convertor, mode,
OFI_CQ_DATA, OFI_SCEP_EPS);
}
__opal_attribute_always_inline__ static inline int
ompi_mtl_ofi_send_true_false(struct mca_mtl_base_module_t *mtl,
struct ompi_communicator_t *comm,
int dest,
int tag,
struct opal_convertor_t *convertor,
mca_pml_base_send_mode_t mode)
{
const bool OFI_CQ_DATA = true;
const bool OFI_SCEP_EPS = false;
return ompi_mtl_ofi_send_generic(mtl, comm, dest, tag,
convertor, mode,
OFI_CQ_DATA, OFI_SCEP_EPS);
}
__opal_attribute_always_inline__ static inline int
ompi_mtl_ofi_send_true_true(struct mca_mtl_base_module_t *mtl,
struct ompi_communicator_t *comm,
int dest,
int tag,
struct opal_convertor_t *convertor,
mca_pml_base_send_mode_t mode)
{
const bool OFI_CQ_DATA = true;
const bool OFI_SCEP_EPS = true;
return ompi_mtl_ofi_send_generic(mtl, comm, dest, tag,
convertor, mode,
OFI_CQ_DATA, OFI_SCEP_EPS);
}
void ompi_mtl_ofi_send_symtable_init(struct ompi_mtl_ofi_symtable* sym_table)
{
sym_table->ompi_mtl_ofi_send[false][false]
= ompi_mtl_ofi_send_false_false;
sym_table->ompi_mtl_ofi_send[false][true]
= ompi_mtl_ofi_send_false_true;
sym_table->ompi_mtl_ofi_send[true][false]
= ompi_mtl_ofi_send_true_false;
sym_table->ompi_mtl_ofi_send[true][true]
= ompi_mtl_ofi_send_true_true;
}
```

@ -1,5 +1,3 @@
Copyright 2009 Cisco Systems, Inc. All rights reserved.
This is a simple example op component meant to be a template /
springboard for people to write their own op components. There are
many different ways to write components and modules; this is but one
@ -13,28 +11,26 @@ same end effect. Feel free to customize / simplify / strip out what
you don't need from this example.
This example component supports a fictitious set of hardware that
provides acceleration for the MPI_MAX and MPI_BXOR MPI_Ops. The
provides acceleration for the `MPI_MAX` and `MPI_BXOR` `MPI_Ops`. The
fictitious hardware has multiple versions, too: some versions only
support single precision floating point types for MAX and single
precision integer types for BXOR, whereas later versions support both
single and double precision floating point types for MAX and both
single and double precision integer types for BXOR. Hence, this
example walks through setting up particular MPI_Op function pointers
based on:
support single precision floating point types for `MAX` and single
precision integer types for `BXOR`, whereas later versions support
both single and double precision floating point types for `MAX` and
both single and double precision integer types for `BXOR`. Hence,
this example walks through setting up particular `MPI_Op` function
pointers based on:
a) hardware availability (e.g., does the node where this MPI process
1. hardware availability (e.g., does the node where this MPI process
is running have the relevant hardware/resources?)
b) MPI_Op (e.g., in this example, only MPI_MAX and MPI_BXOR are
1. `MPI_Op` (e.g., in this example, only `MPI_MAX` and `MPI_BXOR` are
supported)
c) datatype (e.g., single/double precision floating point for MAX and
single/double precision integer for BXOR)
1. datatype (e.g., single/double precision floating point for `MAX`
and single/double precision integer for `BXOR`)
Additionally, there are other considerations that should be factored
in at run time. Hardware accelerators are great, but they do induce
overhead -- for example, some accelerator hardware require registered
memory. So even if a particular MPI_Op and datatype are supported, it
memory. So even if a particular `MPI_Op` and datatype are supported, it
may not be worthwhile to use the hardware unless the amount of data to
be processed is "big enough" (meaning that the cost of the
registration and/or copy-in/copy-out is ameliorated) or the memory to
@ -47,57 +43,65 @@ failover strategy is well-supported by the op framework; during the
query process, a component can "stack" itself similar to how POSIX
signal handlers can be stacked. Specifically, op components can cache
other implementations of operation functions for use in the case of
failover. The MAX and BXOR module implementations show one way of
failover. The `MAX` and `BXOR` module implementations show one way of
using this method.
Here's a listing of the files in the example component and what they
do:
- configure.m4: Tests that get slurped into OMPI's top-level configure
script to determine whether this component will be built or not.
- Makefile.am: Automake makefile that builds this component.
- op_example_component.c: The main "component" source file.
- op_example_module.c: The main "module" source file.
- op_example.h: information that is shared between the .c files.
- .ompi_ignore: the presence of this file causes OMPI's autogen.pl to
skip this component in the configure/build/install process (see
- `configure.m4`: Tests that get slurped into OMPI's top-level
`configure` script to determine whether this component will be built
or not.
- `Makefile.am`: Automake makefile that builds this component.
- `op_example_component.c`: The main "component" source file.
- `op_example_module.c`: The main "module" source file.
- `op_example.h`: information that is shared between the `.c` files.
- `.ompi_ignore`: the presence of this file causes OMPI's `autogen.pl`
to skip this component in the configure/build/install process (see
below).
To use this example as a template for your component (assume your new
component is named "foo"):
component is named `foo`):
```
shell$ cd (top_ompi_dir)/ompi/mca/op
shell$ cp -r example foo
shell$ cd foo
```
Remove the .ompi_ignore file (which makes the component "visible" to
all developers) *OR* add an .ompi_unignore file with one username per
line (as reported by `whoami`). OMPI's autogen.pl will skip any
component with a .ompi_ignore file *unless* there is also an
Remove the `.ompi_ignore` file (which makes the component "visible" to
all developers) *OR* add an `.ompi_unignore` file with one username per
line (as reported by `whoami`). OMPI's `autogen.pl` will skip any
component with a `.ompi_ignore` file *unless* there is also an
.ompi_unignore file containing your user ID in it. This is a handy
mechanism to have a component in the tree but have it not built / used
by most other developers:
```
shell$ rm .ompi_ignore
*OR*
shell$ whoami > .ompi_unignore
```
Now rename any file that contains "example" in the filename to have
"foo", instead. For example:
Now rename any file that contains `example` in the filename to have
`foo`, instead. For example:
```
shell$ mv op_example_component.c op_foo_component.c
#...etc.
```
Now edit all the files and s/example/foo/gi. Specifically, replace
all instances of "example" with "foo" in all function names, type
names, header #defines, strings, and global variables.
Now edit all the files and `s/example/foo/gi`. Specifically, replace
all instances of `example` with `foo` in all function names, type
names, header `#defines`, strings, and global variables.
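A minimal sketch of doing that rename from the shell (assuming GNU
`sed`; adjust the file list as needed and double-check the results by
hand):

```
shell$ sed -i 's/example/foo/g; s/EXAMPLE/FOO/g' \
    op_foo.h op_foo_component.c op_foo_module.c Makefile.am configure.m4
```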
Now your component should be fully functional (although entirely
renamed as "foo" instead of "example"). You can go to the top-level
OMPI directory and run "autogen.pl" (which will find your component
and att it to the configure/build process) and then "configure ..."
and "make ..." as normal.
renamed as `foo` instead of `example`). You can go to the top-level
OMPI directory and run `autogen.pl` (which will find your component
and add it to the configure/build process) and then `configure ...`
and `make ...` as normal.
```
shell$ cd (top_ompi_dir)
shell$ ./autogen.pl
# ...lots of output...
@ -107,19 +111,21 @@ shell$ make -j 4 all
# ...lots of output...
shell$ make install
# ...lots of output...
```
After you have installed Open MPI, running `ompi_info` should show
your `foo` component in the output.
```
shell$ ompi_info | grep op:
                 MCA op: foo (MCA v2.0, API v1.0, Component v1.4)
shell$
```
If you do not see your `foo` component, check the above steps, and
check the output of `autogen.pl`, `configure`, and `make` to ensure
that `foo` was found, configured, and built successfully.
Once `ompi_info` sees your component, start editing the `foo`
component files in a meaningful way.

@ -10,3 +10,5 @@
#
SUBDIRS = java c
EXTRA_DIST = README.md

@ -1,26 +1,27 @@
# Open MPI Java bindings
The Java bindings in this directory are not part of the MPI
specification, as noted in the README.JAVA.md file in the root
directory.  That file also contains some information regarding the
installation and use of the Java bindings.  Further details can be
found in the paper [1].
We originally took the code from the mpiJava project [2] as starting point
for our developments, but we have pretty much rewritten 100% of it. The
original copyrights and license terms of mpiJava are listed below.
1. O. Vega-Gisbert, J. E. Roman, and J. M. Squyres. "Design and
   implementation of Java bindings in Open MPI". Parallel Comput.
   59: 1-20 (2016).
1. M. Baker et al. "mpiJava: An object-oriented Java interface to
   MPI". In Parallel and Distributed Processing, LNCS vol. 1586,
   pp. 748-762, Springer (1999).
## Original citation
```
mpiJava - A Java Interface to MPI
---------------------------------
Copyright 2003
(Bugfixes/Additions, CMake based configure/build)
Blasius Czink
HLRS, University of Stuttgart
```
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.

@ -1,4 +1,5 @@
# Symbol conventions for Open MPI extensions
Last updated: January 2015
This README provides some rule-of-thumb guidance for how to name
@ -15,26 +16,22 @@ Generally speaking, there are usually three kinds of extensions:
3. Functionality that is strongly expected to be in an upcoming
version of the MPI specification.
## Case 1
The `OMPI_Paffinity_str()` extension is a good example of this type:
it is solely intended to be for Open MPI.  It will likely never be
pushed to other MPI implementations, and it will likely never be
pushed to the MPI Forum.
It's Open MPI-specific functionality, through and through.
Public symbols of this type of functionality should be named with an
`OMPI_` prefix to emphasize its Open MPI-specific nature.  To be
clear: the `OMPI_` prefix clearly identifies parts of user code that
are relying on Open MPI (and likely need to be surrounded with #if
`OPEN_MPI` blocks, etc.).
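For illustration, here is a minimal sketch of such a guard.  Only the
`OPEN_MPI` macro (defined by Open MPI's `mpi.h`) is assumed to exist;
`OMPI_Foo()` stands in for a hypothetical Open MPI-specific extension
function:
```c
#include <mpi.h>

void maybe_call_ompi_specific_code(void)
{
#if defined(OPEN_MPI)
    /* OMPI_Foo() is a hypothetical OMPI_-prefixed extension call; this
       branch is only compiled when building against Open MPI. */
    OMPI_Foo();
#else
    /* Portable MPI-only fallback goes here. */
#endif
}
```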
## Case 2
The MPI extensions mechanism in Open MPI was designed to help MPI
Forum members prototype new functionality that is intended for the
@ -43,23 +40,21 @@ functionality is not only to be included in the MPI spec, but possibly
also be included in another MPI implementation.
As such, it seems reasonable to prefix public symbols in this type of
functionality with `MPIX_`.  This commonly-used prefix allows the same
symbols to be available in multiple MPI implementations, and therefore
allows user code to easily check for it.  E.g., user apps can check
for the presence of `MPIX_Foo` to know if both Open MPI and Other MPI
support the proposed `MPIX_Foo` functionality.
Of course, when using the `MPIX_` namespace, there is the possibility
of symbol name collisions.  E.g., what if Open MPI has an `MPIX_Foo`
and Other MPI has a *different* `MPIX_Foo`?
While we technically can't prevent such collisions from happening, we
encourage extension authors to avoid such symbol clashes whenever
possible.
## Case 3
It is well-known that the MPI specification (intentionally) takes a
long time to publish. MPI implementers can typically know, with a
@ -72,13 +67,13 @@ functionality early (i.e., before the actual publication of the
corresponding MPI specification document).
Case in point: the non-blocking collective operations that were
included in MPI-3.0 (e.g., `MPI_Ibarrier()`).  It was known for a year
or two before MPI-3.0 was published that these functions would be
included in MPI-3.0.
There is a continual debate among the developer community: when
implementing such functionality, should the symbols be in the MPIX_
namespace or in the `MPI_` namespace?  On one hand, the symbols are not
yet officially standardized -- *they could change* before publication.
On the other hand, developers who participate in the Forum typically
have a good sense for whether symbols are going to change before
@ -89,35 +84,31 @@ before the MPI specification is published. ...and so on.
After much debate: for functionality that has a high degree of
confidence that it will be included in an upcoming spec (e.g., it has
passed at least one vote in the MPI Forum), our conclusion is that it
is OK to use the `MPI_` namespace.
Case in point: Open MPI released non-blocking collectives with the
`MPI_` prefix (not the `MPIX_` prefix) before the MPI-3.0
specification officially standardized these functions.
The rationale was threefold:
1. Let users use the functionality as soon as possible.
1. If OMPI initially creates `MPIX_Foo`, but eventually renames it to
   `MPI_Foo` when the MPI specification is published, then users will
   have to modify their codes to match.  This is an artificial change
   inserted just to be "pure" to the MPI spec (i.e., it's a "lawyer's
   answer").  But since the `MPIX_Foo` -> `MPI_Foo` change is
   inevitable, it just ends up annoying users.
1. Once OMPI introduces `MPIX_` symbols, if we want to *not* annoy
   users, we'll likely have weak symbols / aliased versions of both
   `MPIX_Foo` and `MPI_Foo` once the Foo functionality is included in
   a published MPI specification.  However, when can we delete the
   `MPIX_Foo` symbol?  It becomes a continuing annoyance of backwards
   compatibility that we have to keep carrying forward.
For all these reasons, we believe that it's better to put
expected-upcoming official MPI functionality in the `MPI_` namespace,
not the `MPIX_` namespace.
All that being said, these are rules of thumb. They are not an
official mandate. There may well be cases where there are reasons to

@ -2,7 +2,7 @@
# Copyright (c) 2004-2009 The Trustees of Indiana University and Indiana
# University Research and Technology
# Corporation. All rights reserved.
# Copyright (c) 2010-2012 Cisco Systems, Inc. All rights reserved.
# Copyright (c) 2010-2020 Cisco Systems, Inc. All rights reserved.
# $COPYRIGHT$
#
# Additional copyrights may follow
@ -20,4 +20,4 @@
SUBDIRS = c
EXTRA_DIST = README.txt
EXTRA_DIST = README.md

ompi/mpiext/affinity/README.md (new file)
@ -0,0 +1,30 @@
# Open MPI extension: Affinity
## Copyrights
```
Copyright (c) 2010-2012 Cisco Systems, Inc. All rights reserved.
Copyright (c) 2010 Oracle and/or its affiliates. All rights reserved.
```
## Authors
* Jeff Squyres, 19 April 2010, and 16 April 2012
* Terry Dontje, 18 November 2010
## Description
This extension provides a single new function, `OMPI_Affinity_str()`,
that takes a format value and then provides 3 prettyprint strings as
output:
* `fmt_type`: is an enum that tells `OMPI_Affinity_str()` whether to
use a resource description string or layout string format for
`ompi_bound` and `currently_bound` output strings.
* `ompi_bound`: describes what sockets/cores Open MPI bound this process
to (or indicates that Open MPI did not bind this process).
* `currently_bound`: describes what sockets/cores this process is
currently bound to (or indicates that it is unbound).
* `exists`: describes what processors are available in the current host.
See `OMPI_Affinity_str(3)` for more details.
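Below is a minimal usage sketch based on the description above.  The
exact prototype and the `OMPI_AFFINITY_RSRC_STRING_FMT` /
`OMPI_AFFINITY_STRING_MAX` names are assumed from the
`OMPI_Affinity_str(3)` man page; consult that man page for the
authoritative signature:
```c
#include <stdio.h>
#include <mpi.h>
#include <mpi-ext.h>    /* declares OMPI_Affinity_str() when this extension is built */

int main(int argc, char **argv)
{
    char ompi_bound[OMPI_AFFINITY_STRING_MAX];
    char currently_bound[OMPI_AFFINITY_STRING_MAX];
    char exists[OMPI_AFFINITY_STRING_MAX];

    MPI_Init(&argc, &argv);

    /* Ask for the resource description string output format */
    OMPI_Affinity_str(OMPI_AFFINITY_RSRC_STRING_FMT,
                      ompi_bound, currently_bound, exists);
    printf("ompi_bound: %s\ncurrently_bound: %s\nexists: %s\n",
           ompi_bound, currently_bound, exists);

    MPI_Finalize();
    return 0;
}
```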

@ -1,29 +0,0 @@
# Copyright (c) 2010-2012 Cisco Systems, Inc. All rights reserved.
Copyright (c) 2010 Oracle and/or its affiliates. All rights reserved.
$COPYRIGHT$
Jeff Squyres
19 April 2010, and
16 April 2012
Terry Dontje
18 November 2010
This extension provides a single new function, OMPI_Affinity_str(),
that takes a format value and then provides 3 prettyprint strings as
output:
fmt_type: is an enum that tells OMPI_Affinity_str() whether to use a
resource description string or layout string format for ompi_bound and
currently_bound output strings.
ompi_bound: describes what sockets/cores Open MPI bound this process
to (or indicates that Open MPI did not bind this process).
currently_bound: describes what sockets/cores this process is
currently bound to (or indicates that it is unbound).
exists: describes what processors are available in the current host.
See OMPI_Affinity_str(3) for more details.

@ -21,4 +21,4 @@
SUBDIRS = c
EXTRA_DIST = README.txt
EXTRA_DIST = README.md

ompi/mpiext/cuda/README.md (new file)
@ -0,0 +1,11 @@
# Open MPI extension: Cuda
Copyright (c) 2015 NVIDIA, Inc. All rights reserved.
Author: Rolf vandeVaart
This extension provides a macro for a compile-time check of CUDA-aware
support.  It also provides a function for a run-time check of
CUDA-aware support.
See `MPIX_Query_cuda_support(3)` for more details.
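For example, a user application might combine the compile-time macro
and the run-time function roughly as follows (a sketch; see
`MPIX_Query_cuda_support(3)` for the authoritative API):
```c
#include <stdio.h>
#include <mpi.h>
#include <mpi-ext.h>    /* needed for the MPIX_CUDA_AWARE_SUPPORT macro */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

#if defined(MPIX_CUDA_AWARE_SUPPORT) && MPIX_CUDA_AWARE_SUPPORT
    printf("Compile-time check: built with CUDA-aware support\n");
#else
    printf("Compile-time check: NOT built with CUDA-aware support\n");
#endif

#if defined(MPIX_CUDA_AWARE_SUPPORT)
    /* The library loaded at run time may differ from the one seen at
       compile time, so check again at run time. */
    printf("Run-time check: %s\n",
           MPIX_Query_cuda_support() ? "CUDA-aware" : "not CUDA-aware");
#endif

    MPI_Finalize();
    return 0;
}
```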

@ -1,11 +0,0 @@
# Copyright (c) 2015 NVIDIA, Inc. All rights reserved.
$COPYRIGHT$
Rolf vandeVaart
This extension provides a macro for compile time check of CUDA aware support.
It also provides a function for runtime check of CUDA aware support.
See MPIX_Query_cuda_support(3) for more details.

@ -1,5 +1,5 @@
#
# Copyright (c) 2012 Cisco Systems, Inc. All rights reserved.
# Copyright (c) 2020 Cisco Systems, Inc. All rights reserved.
# $COPYRIGHT$
#
# Additional copyrights may follow
@ -17,4 +17,4 @@
SUBDIRS = c mpif-h use-mpi use-mpi-f08
EXTRA_DIST = README.txt
EXTRA_DIST = README.md

ompi/mpiext/example/README.md (new file)
@ -0,0 +1,148 @@
# Open MPI extension: Example
## Overview
This example MPI extension shows how to make an MPI extension for Open
MPI.
An MPI extension provides new top-level APIs in Open MPI that are
available to user-level applications (vs. adding new code/APIs that is
wholly internal to Open MPI). MPI extensions are generally used to
prototype new MPI APIs, or provide Open MPI-specific APIs to
applications. This example MPI extension provides a new top-level MPI
API named `OMPI_Progress` that is callable in both C and Fortran.
MPI extensions are similar to Open MPI components, but due to
complex ordering requirements for the Fortran-based MPI bindings,
their build order is a little different.
Note that MPI has 4 different sets of bindings (C, Fortran `mpif.h`,
the Fortran `mpi` module, and the Fortran `mpi_f08` module), and Open
MPI extensions allow adding API calls to all 4 of them. Prototypes
for the user-accessible functions/subroutines/constants are included
in the following publicly-available mechanisms:
* C: `mpi-ext.h`
* Fortran mpif.h: `mpif-ext.h`
* Fortran "use mpi": `use mpi_ext`
* Fortran "use mpi_f08": `use mpi_f08_ext`
This example extension defines a new top-level API named
`OMPI_Progress()` in all four binding types, and provides test programs
to call this API in each of the four binding types. Code (and
comments) is worth 1,000 words -- see the code in this example
extension to understand how it works and how the build system builds
and inserts each piece into the publicly-available mechanisms (e.g.,
`mpi-ext.h` and the `mpi_f08_ext` module).
## Comparison to General Open MPI MCA Components
Here's the ways that MPI extensions are similar to Open MPI
components:
1. Extensions have a top-level `configure.m4` with a well-known m4 macro
that is run during Open MPI's configure that determines whether the
component wants to build or not.
Note, however, that unlike components, extensions *must* have a
`configure.m4`. No other method of configuration is supported.
1. Extensions must adhere to normal Automake-based targets. We
strongly suggest that you use `Makefile.am`'s and have the
extension's `configure.m4` `AC_CONFIG_FILE` each `Makefile.am` in
the extension.  Using other build systems may work, but is
untested and unsupported.
1. Extensions create specifically-named libtool convenience archives
(i.e., `*.la` files) that the build system slurps into higher-level
libraries.
Unlike components, however, extensions:
1. Have a bit more rigid directory and file naming scheme.
1. Have up to four different, specifically-named subdirectories (one
for each MPI binding type).
1. Also install some specifically-named header files (for C and the
Fortran `mpif.h` bindings).
Similar to components, an MPI extension's name is determined by its
directory name: `ompi/mpiext/EXTENSION_NAME`
## Extension requirements
### Required: C API
Under this top-level directory, the extension *must* have a directory
named `c` (for the C bindings) that:
1. contains a file named `mpiext_EXTENSION_NAME_c.h`
1. installs `mpiext_EXTENSION_NAME_c.h` to
`$includedir/openmpi/mpiext/EXTENSION_NAME/c`
1. builds a Libtool convenience library named
`libmpiext_EXTENSION_NAME_c.la`
### Optional: `mpif.h` bindings
Optionally, the extension may have a directory named `mpif-h` (for the
Fortran `mpif.h` bindings) that:
1. contains a file named `mpiext_EXTENSION_NAME_mpifh.h`
1. installs `mpiext_EXTENSION_NAME_mpifh.h` to
`$includedir/openmpi/mpiext/EXTENSION_NAME/mpif-h`
1. builds a Libtool convenience library named
`libmpiext_EXTENSION_NAME_mpifh.la`
### Optional: `mpi` module bindings
Optionally, the extension may have a directory named `use-mpi` (for the
Fortran `mpi` module) that:
1. contains a file named `mpiext_EXTENSION_NAME_usempi.h`
***NOTE:*** The MPI extension system does NOT support building an
additional library in the `use-mpi` extension directory. It is
assumed that the `use-mpi` bindings will use the same back-end symbols
as the `mpif.h` bindings, and that the only output product of the
`use-mpi` directory is a file to be included in the `mpi-ext` module
(i.e., strong Fortran prototypes for the functions/global variables in
this extension).
### Optional: `mpi_f08` module bindings
Optionally, the extension may have a directory named `use-mpi-f08` (for
the Fortran `mpi_f08` module) that:
1. contains a file named `mpiext_EXTENSION_NAME_usempif08.h`
1. builds a Libtool convenience library named
`libmpiext_EXTENSION_NAME_usempif08.la`
See the comments in all the header and source files in this tree to
see what each file is for and what should be in each.
## Notes
Note that the build order of MPI extensions is a bit strange.  The
directories in an MPI extension are NOT traversed top-down in
sequential order.  Instead, due to ordering requirements when building
the Fortran module-based interfaces, each subdirectory in an extension
is traversed individually at a different time in the overall Open MPI
build.
As such, `ompi/mpiext/EXTENSION_NAME/Makefile.am` is not traversed
during a normal top-level `make all` target. This `Makefile.am`
exists for two reasons, however:
1. For the convenience of the developer, so that you can issue normal
`make` commands at the top of your extension tree (e.g., `make all`
will still build all bindings in an extension).
1. During a top-level `make dist`, extension directories *are*
traversed top-down in sequence order. Having a top-level
`Makefile.am` in an extension allows `EXTRA_DIST`ing of files, such
as this `README.md` file.
There are reasons for this strange ordering, but suffice it to say
that `make dist` doesn't have the same ordering requirements as
`make all`, and it is therefore able to use a "normal", Automake-usual
top-down sequential directory traversal.
Enjoy!

@ -1,138 +0,0 @@
Copyright (C) 2012 Cisco Systems, Inc. All rights reserved.
$COPYRIGHT$
This example MPI extension shows how to make an MPI extension for Open
MPI.
An MPI extension provides new top-level APIs in Open MPI that are
available to user-level applications (vs. adding new code/APIs that is
wholly internal to Open MPI). MPI extensions are generally used to
prototype new MPI APIs, or provide Open MPI-specific APIs to
applications. This example MPI extension provides a new top-level MPI
API named "OMPI_Progress" that is callable in both C and Fortran.
MPI extensions are similar to Open MPI components, but due to
complex ordering requirements for the Fortran-based MPI bindings,
their build order is a little different.
Note that MPI has 4 different sets of bindings (C, Fortran mpif.h,
Fortran "use mpi", and Fortran "use mpi_f08"), and Open MPI extensions
allow adding API calls to all 4 of them. Prototypes for the
user-accessible functions/subroutines/constants are included in the
following publicly-available mechanisms:
- C: mpi-ext.h
- Fortran mpif.h: mpif-ext.h
- Fortran "use mpi": use mpi_ext
- Fortran "use mpi_f08": use mpi_f08_ext
This example extension defines a new top-level API named
"OMPI_Progress" in all four binding types, and provides test programs
to call this API in each of the four binding types. Code (and
comments) is worth 1,000 words -- see the code in this example
extension to understand how it works and how the build system builds
and inserts each piece into the publicly-available mechansisms (e.g.,
mpi-ext.h and the mpi_f08_ext module).
--------------------------------------------------------------------------------
Here's the ways that MPI extensions are similar to Open MPI
components:
- Extensions have a top-level configure.m4 with a well-known m4 macro
that is run during Open MPI's configure that determines whether the
component wants to build or not.
Note, however, that unlike components, extensions *must* have a
configure.m4. No other method of configuration is supported.
- Extensions must adhere to normal Automake-based targets. We
strongly suggest that you use Makefile.am's and have the extension's
configure.m4 AC_CONFIG_FILE each Makefile.am in the extension.
Using other build systems may work, but are untested and
unsupported.
- Extensions create specifically-named libtool convenience archives
(i.e., *.la files) that the build system slurps into higher-level
libraries.
Unlike components, however, extensions:
- Have a bit more rigid directory and file naming scheme.
- Have up to four different, specifically-named subdirectories (one
for each MPI binding type).
- Also install some specifically-named header files (for C and the
Fortran mpif.h bindings).
Similar to components, an MPI extension's name is determined by its
directory name: ompi/mpiext/<extension name>
Under this top-level directory, the extension *must* have a directory
named "c" (for the C bindings) that:
- contains a file named mpiext_<ext_name>_c.h
- installs mpiext_<ext_name>_c.h to
$includedir/openmpi/mpiext/<ext_name>/c
- builds a Libtool convenience library named libmpiext_<ext_name>_c.la
Optionally, the extension may have a director named "mpif-h" (for the
Fortran mpif.h bindings) that:
- contains a file named mpiext_<ext_name>_mpifh.h
- installs mpiext_<ext_name>_mpih.h to
$includedir/openmpi/mpiext/<ext_name>/mpif-h
- builds a Libtool convenience library named libmpiext_<ext_name>_mpifh.la
Optionally, the extension may have a director named "use-mpi" (for the
Fortran "use mpi" bindings) that:
- contains a file named mpiext_<ext_name>_usempi.h
NOTE: The MPI extension system does NOT support building an additional
library in the use-mpi extension directory. It is assumed that the
use-mpi bindings will use the same back-end symbols as the mpif.h
bindings, and that the only output product of the use-mpi directory is
a file to be included in the mpi-ext module (i.e., strong Fortran
prototypes for the functions/global variables in this extension).
Optionally, the extension may have a director named "use-mpi-f08" (for
the Fortran mpi_f08 bindings) that:
- contains a file named mpiext_<ext_name>_usempif08.h
- builds a Libtool convenience library named
libmpiext_<ext_name>_usempif08.la
See the comments in all the header and source files in this tree to
see what each file is for and what should be in each.
--------------------------------------------------------------------------------
Note that the build order of MPI extensions is a bit strange. The
directories in a MPI extensions are NOT traversed top-down in
sequential order. Instead, due to ordering requirements when building
the Fortran module-based interfaces, each subdirectory in extensions
are traversed individually at different times in the overall Open MPI
build.
As such, ompi/mpiext/<ext_name>/Makefile.am is not traversed during a
normal top-level "make all" target. This Makefile.am exists for two
reasons, however:
1. For the conveneince of the developer, so that you can issue normal
"make" commands at the top of your extension tree (e.g., "make all"
will still build all bindings in an extension).
2. During a top-level "make dist", extension directories *are*
traversed top-down in sequence order. Having a top-level Makefile.am
in an extension allows EXTRA_DISTing of files, such as this README
file.
This are reasons for this strange ordering, but suffice it to say that
"make dist" doesn't have the same ordering requiements as "make all",
and is therefore easier to have a "normal" Automake-usual top-down
sequential directory traversal.
Enjoy!

@ -8,3 +8,5 @@
#
SUBDIRS = c mpif-h use-mpi use-mpi-f08
EXTRA_DIST = README.md

ompi/mpiext/pcollreq/README.md (new file)
@ -0,0 +1,14 @@
# Open MPI extension: pcollreq
Copyright (c) 2018 FUJITSU LIMITED. All rights reserved.
This extension provides the feature of persistent collective
communication operations and persistent neighborhood collective
communication operations, which is planned to be included in the next
MPI Standard after MPI-3.1 as of Nov. 2018.
See `MPIX_Barrier_init(3)` for more details.
The code will be moved to the `ompi/mpi` directory and the `MPIX_`
prefix will be switched to the `MPI_` prefix once the MPI Standard which
includes this feature is published.
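A minimal usage sketch follows.  It assumes `MPIX_Barrier_init()`
takes the same `(comm, info, request)` arguments as the standardized
`MPI_Barrier_init()`; see `MPIX_Barrier_init(3)` for the authoritative
prototype:
```c
#include <mpi.h>
#include <mpi-ext.h>    /* declares the MPIX_ persistent collective prototypes */

int main(int argc, char **argv)
{
    MPI_Request req;

    MPI_Init(&argc, &argv);

    /* Create the persistent barrier once... */
    MPIX_Barrier_init(MPI_COMM_WORLD, MPI_INFO_NULL, &req);

    /* ...then start and complete it as many times as needed. */
    for (int i = 0; i < 10; ++i) {
        MPI_Start(&req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }

    MPI_Request_free(&req);
    MPI_Finalize();
    return 0;
}
```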

@ -1,14 +0,0 @@
Copyright (c) 2018 FUJITSU LIMITED. All rights reserved.
$COPYRIGHT$
This extension provides the feature of persistent collective communication
operations and persistent neighborhood collective communication operations,
which is planned to be included in the next MPI Standard after MPI-3.1 as
of Nov. 2018.
See MPIX_Barrier_init(3) for more details.
The code will be moved to the ompi/mpi directory and the MPIX_ prefix will
be switch to the MPI_ prefix once the MPI Standard which includes this
feature is published.

@ -8,3 +8,5 @@
#
SUBDIRS = c mpif-h use-mpi use-mpi-f08
EXTRA_DIST = README.md

ompi/mpiext/shortfloat/README.md (new file)
@ -0,0 +1,35 @@
# Open MPI extension: shortfloat
Copyright (c) 2018 FUJITSU LIMITED. All rights reserved.
This extension provides additional MPI datatypes `MPIX_SHORT_FLOAT`,
`MPIX_C_SHORT_FLOAT_COMPLEX`, and `MPIX_CXX_SHORT_FLOAT_COMPLEX`,
which were proposed (with the `MPI_` prefix) in June 2017 for
inclusion in the MPI 4.0 standard.  As of February 2019, they have not
been accepted yet.  See
https://github.com/mpi-forum/mpi-issues/issues/65 for more details.
Each MPI datatype corresponds to the C/C++ type `short float`, the C
type `short float _Complex`, and the C++ type `std::complex<short
float>`, respectively.
In addition, this extension provides a datatype `MPIX_C_FLOAT16` for
the C type `_Float16`, which is defined in ISO/IEC JTC 1/SC 22/WG 14
N1945 (ISO/IEC TS 18661-3:2015).  This name and meaning are the same
as those of MPICH.  See https://github.com/pmodels/mpich/pull/3455.
This extension is enabled only if the C compiler supports `short float`
or `_Float16`, or the `--enable-alt-short-float=TYPE` option is passed
to the Open MPI `configure` script.
NOTE: The Clang 6.0.x and 7.0.x compilers support the `_Float16` type
(via software emulation), but require an additional linker flag to
function properly. If you wish to enable Clang 6.0.x or 7.0.x's
software emulation of `_Float16`, use the following CLI options to Open
MPI configure script:
```
./configure \
LDFLAGS=--rtlib=compiler-rt \
--with-wrapper-ldflags=--rtlib=compiler-rt ...
```
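As a usage sketch (assuming a C compiler that supports `_Float16` and
an Open MPI built with this extension; the `MPIX_C_FLOAT16` name comes
from the description above):
```c
#include <mpi.h>
#include <mpi-ext.h>    /* declares the MPIX_ short-float datatypes */

/* Assumes the C compiler supports _Float16 (see the note above). */
int main(int argc, char **argv)
{
    _Float16 buf[4] = {0};
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (0 == rank) {
        MPI_Send(buf, 4, MPIX_C_FLOAT16, 1, 0, MPI_COMM_WORLD);
    } else if (1 == rank) {
        MPI_Recv(buf, 4, MPIX_C_FLOAT16, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}
```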

@ -1,35 +0,0 @@
Copyright (c) 2018 FUJITSU LIMITED. All rights reserved.
$COPYRIGHT$
This extension provides additional MPI datatypes MPIX_SHORT_FLOAT,
MPIX_C_SHORT_FLOAT_COMPLEX, and MPIX_CXX_SHORT_FLOAT_COMPLEX, which
are proposed with the MPI_ prefix in June 2017 for proposal in the
MPI 4.0 standard. As of February 2019, it is not accepted yet.
https://github.com/mpi-forum/mpi-issues/issues/65
Each MPI datatype corresponds to the C/C++ type 'short float', the C type
'short float _Complex', and the C++ type 'std::complex<short float>',
respectively.
In addition, this extension provides a datatype MPIX_C_FLOAT16 for
the C type _Float16, which is defined in ISO/IEC JTC 1/SC 22/WG 14
N1945 (ISO/IEC TS 18661-3:2015). This name and meaning are same as
that of MPICH.
https://github.com/pmodels/mpich/pull/3455
This extension is enabled only if the C compiler supports 'short float'
or '_Float16', or the '--enable-alt-short-float=TYPE' option is passed
to the configure script.
NOTE: The Clang 6.0.x and 7.0.x compilers support the "_Float16" type
(via software emulation), but require an additional linker flag to
function properly. If you wish to enable Clang 6.0.x or 7.0.x's
software emulation of _Float16, use the following CLI options to Open
MPI configure script:
./configure \
LDFLAGS=--rtlib=compiler-rt \
--with-wrapper-ldflags=--rtlib=compiler-rt ...

@ -1,110 +0,0 @@
========================================
Design notes on BTL/OFI
========================================
This is the RDMA only btl based on OFI Libfabric. The goal is to enable RDMA
with multiple vendor hardware through one interface. Most of the operations are
managed by upper layer (osc/rdma). This BTL is mostly doing the low level work.
Tested providers: sockets,psm2,ugni
========================================
Component
This BTL is requesting libfabric version 1.5 API and will not support older versions.
The required capabilities of this BTL is FI_ATOMIC and FI_RMA with the endpoint type
of FI_EP_RDM only. This BTL does NOT support libfabric provider that requires local
memory registration (FI_MR_LOCAL).
BTL/OFI will initialize a module with ONLY the first compatible info returned from OFI.
This means it will rely on OFI provider to do load balancing. The support for multiple
device might be added later.
The BTL creates only one endpoint and one CQ.
========================================
Memory Registration
Open MPI has a system in place to exchange remote address and always use the remote
virtual address to refer to a piece of memory. However, some libfabric providers might
not support the use of virtual address and instead will use zero-based offset addressing.
FI_MR_VIRT_ADDR is the flag that determine this behavior. mca_btl_ofi_reg_mem() handles
this by storing the base address in registration handle in case of the provider does not
support FI_MR_VIRT_ADDR. This base address will be used to calculate the offset later in
RDMA/Atomic operations.
The BTL will try to use the address of registration handle as the key. However, if the
provider supports FI_MR_PROV_KEY, it will use provider provided key. Simply does not care.
The BTL does not register local operand or compare. This is why this BTL does not support
FI_MR_LOCAL and will allocate every buffer before registering. This means FI_MR_ALLOCATED
is supported. So to be explicit.
Supported MR mode bits (will work with or without):
enum:
- FI_MR_BASIC
- FI_MR_SCALABLE
mode bits:
- FI_MR_VIRT_ADDR
- FI_MR_ALLOCATED
- FI_MR_PROV_KEY
The BTL does NOT support (will not work with):
- FI_MR_LOCAL
- FI_MR_MMU_NOTIFY
- FI_MR_RMA_EVENT
- FI_MR_ENDPOINT
Just a reminder, in libfabric API 1.5...
FI_MR_BASIC == (FI_MR_PROV_KEY | FI_MR_ALLOCATED | FI_MR_VIRT_ADDR)
========================================
Completions
Every operation in this BTL is asynchronous. The completion handling will occur in
mca_btl_ofi_component_progress() where we read the CQ with the completion context and
execute the callback functions. The completions are local. No remote completion event is
generated as local completion already guarantee global completion.
The BTL keep tracks of number of outstanding operations and provide flush interface.
========================================
Sockets Provider
Sockets provider is the proof of concept provider for libfabric. It is supposed to support
all the OFI API with emulations. This provider is considered very slow and bound to raise
problems that we might not see from other faster providers.
Known Problems:
- sockets provider uses progress thread and can cause segfault in finalize as we free
the resources while progress thread is still using it. sleep(1) was put in
mca_btl_ofi_componenet_close() for this reason.
- sockets provider deadlock in two-sided mode. Might be something about buffered recv.
(August 2018).
========================================
Scalable Endpoint
This BTL will try to use scalable endpoint to create communication context. This will increase
multithreaded performance for some application. The default number of context created is 1 and
can be tuned VIA MCA parameter "btl_ofi_num_contexts_per_module". It is advised that the number
of context should be equal to number of physical core for optimal performance.
User can disable scalable endpoint by MCA parameter "btl_ofi_disable_sep".
With scalable endpoint disbled, the BTL will alias OFI endpoint to both tx and rx context.
========================================
Two sided communication
Two sided communication is added later on to BTL OFI to enable non tag-matching provider
to be able to use in Open MPI with this BTL. However, the support is only for "functional"
and has not been optimized for performance at this point. (August 2018)

opal/mca/btl/ofi/README.md (new file)
@ -0,0 +1,113 @@
# Design notes on BTL/OFI
This is an RDMA-only BTL based on OFI Libfabric.  The goal is to
enable RDMA with multiple vendors' hardware through one interface.
Most of the operations are managed by the upper layer (osc/rdma); this
BTL mostly does the low-level work.
Tested providers: sockets, psm2, ugni
## Component
This BTL requires the libfabric 1.5 API and does not support older
versions.
The required capabilities of this BTL are `FI_ATOMIC` and `FI_RMA`
with the endpoint type `FI_EP_RDM` only.  This BTL does NOT support
libfabric providers that require local memory registration
(`FI_MR_LOCAL`).
BTL/OFI will initialize a module with ONLY the first compatible info
returned from OFI.  This means it relies on the OFI provider to do
load balancing.  Support for multiple devices might be added later.
The BTL creates only one endpoint and one CQ.
## Memory Registration
Open MPI has a system in place to exchange remote addresses, and it
always uses the remote virtual address to refer to a piece of memory.
However, some libfabric providers might not support the use of virtual
addresses and instead use zero-based offset addressing.
`FI_MR_VIRT_ADDR` is the flag that determines this behavior.
`mca_btl_ofi_reg_mem()` handles this by storing the base address in
the registration handle in case the provider does not support
`FI_MR_VIRT_ADDR`; this base address is used to calculate the offset
later in RDMA/Atomic operations.
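A rough sketch of that address calculation is shown below; the
structure and variable names are illustrative only, not the actual
`mca_btl_ofi` data structures:
```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-in for the remote registration info exchanged by
 * the BTL. */
struct remote_handle {
    uint64_t base;    /* remote base address saved at registration time */
    uint64_t rkey;    /* remote key (provider key if FI_MR_PROV_KEY)    */
};

/* Compute the remote address argument for an RDMA/atomic operation:
 * providers with FI_MR_VIRT_ADDR take the remote virtual address,
 * otherwise they take a zero-based offset into the registered region. */
static uint64_t remote_rdma_addr(const struct remote_handle *h,
                                 uint64_t remote_vaddr,
                                 bool provider_has_virt_addr)
{
    return provider_has_virt_addr ? remote_vaddr : remote_vaddr - h->base;
}
```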
The BTL will try to use the address of the registration handle as the
key.  However, if the provider supports `FI_MR_PROV_KEY`, it will use
the provider-supplied key instead; the BTL does not care which.
The BTL does not register the local operand or compare buffer.  This
is why this BTL does not support `FI_MR_LOCAL` and allocates every
buffer before registering; this means `FI_MR_ALLOCATED` is supported.
To be explicit:
Supported MR mode bits (will work with or without):
* enum:
* `FI_MR_BASIC`
* `FI_MR_SCALABLE`
* mode bits:
* `FI_MR_VIRT_ADDR`
* `FI_MR_ALLOCATED`
* `FI_MR_PROV_KEY`
The BTL does NOT support (will not work with):
* `FI_MR_LOCAL`
* `FI_MR_MMU_NOTIFY`
* `FI_MR_RMA_EVENT`
* `FI_MR_ENDPOINT`
Just a reminder, in libfabric API 1.5...
`FI_MR_BASIC == (FI_MR_PROV_KEY | FI_MR_ALLOCATED | FI_MR_VIRT_ADDR)`
## Completions
Every operation in this BTL is asynchronous.  The completion handling
occurs in `mca_btl_ofi_component_progress()`, where we read the CQ
with the completion context and execute the callback functions.  The
completions are local; no remote completion event is generated, as
local completion already guarantees global completion.
The BTL keeps track of the number of outstanding operations and
provides a flush interface.
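A simplified sketch of that progress path is shown below.  Only
`fi_cq_read()` and `struct fi_cq_entry` are real libfabric API; the
completion type here is an illustrative stand-in for the BTL's own
context structure:
```c
#include <rdma/fabric.h>
#include <rdma/fi_eq.h>

/* Illustrative completion context; the real BTL stores its own type
 * whose callback finishes the operation and may call back into the
 * upper layer. */
struct completion_ctx {
    void (*callback)(struct completion_ctx *ctx);
};

/* Drain the CQ and run the callback attached to each completion. */
static int progress_cq(struct fid_cq *cq)
{
    struct fi_cq_entry entries[16];
    ssize_t n = fi_cq_read(cq, entries, 16);

    if (n > 0) {
        for (ssize_t i = 0; i < n; ++i) {
            struct completion_ctx *ctx = entries[i].op_context;
            ctx->callback(ctx);
        }
        return (int) n;
    }
    return 0;   /* -FI_EAGAIN (empty CQ) and errors are ignored in this sketch */
}
```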
## Sockets Provider
The sockets provider is the proof-of-concept provider for libfabric.
It is supposed to support the entire OFI API via emulation.  This
provider is considered very slow and is bound to raise problems that
we might not see with other, faster providers.
Known problems:
* The sockets provider uses a progress thread and can cause a segfault
  in finalize as we free resources while the progress thread is still
  using them.  A `sleep(1)` was put in `mca_btl_ofi_component_close()`
  for this reason.
* The sockets provider deadlocks in two-sided mode; it might be
  something about buffered recv (August 2018).
## Scalable Endpoint
This BTL will try to use a scalable endpoint to create communication
contexts.  This will increase multithreaded performance for some
applications.  The default number of contexts created is 1 and can be
tuned via the MCA parameter `btl_ofi_num_contexts_per_module`.  For
optimal performance, it is advised that the number of contexts be
equal to the number of physical cores.
The user can disable the scalable endpoint with the MCA parameter
`btl_ofi_disable_sep`.  With the scalable endpoint disabled, the BTL
will alias the OFI endpoint to both the tx and rx contexts.
## Two sided communication
Two-sided communication was added to BTL/OFI later on, to enable
non-tag-matching providers to be used in Open MPI with this BTL.
However, the support is only "functional" and has not been optimized
for performance at this point (August 2018).

@ -1,113 +0,0 @@
Copyright (c) 2013 NVIDIA Corporation. All rights reserved.
August 21, 2013
SMCUDA DESIGN DOCUMENT
This document describes the design and use of the smcuda BTL.
BACKGROUND
The smcuda btl is a copy of the sm btl but with some additional features.
The main extra feature is the ability to make use of the CUDA IPC APIs to
quickly move GPU buffers from one GPU to another. Without this support,
the GPU buffers would all be moved into and then out of host memory.
GENERAL DESIGN
The general design makes use of the large message RDMA RGET support in the
OB1 PML. However, there are some interesting choices to make use of it.
First, we disable any large message RDMA support in the BTL for host
messages. This is done because we need to use the mca_btl_smcuda_get() for
the GPU buffers. This is also done because the upper layers expect there
to be a single mpool but we need one for the GPU memory and one for the
host memory. Since the advantages of using RDMA with host memory is
unclear, we disabled it. This means no KNEM or CMA support built in to the
smcuda BTL.
Also note that we give the smcuda BTL a higher rank than the sm BTL. This
means it will always be selected even if we are doing host only data
transfers. The smcuda BTL is not built if it is not requested via the
--with-cuda flag to the configure line.
Secondly, the smcuda does not make use of the traditional method of
enabling RDMA operations. The traditional method checks for the existence
of an RDMA btl hanging off the endpoint. The smcuda works in conjunction
with the OB1 PML and uses flags that it sends in the BML layer.
OTHER CONSIDERATIONS
CUDA IPC is not necessarily supported by all GPUs on a node. In NUMA
nodes, CUDA IPC may only work between GPUs that are not connected
over the IOH. In addition, we want to check for CUDA IPC support lazily,
when the first GPU access occurs, rather than during MPI_Init() time.
This complicates the design.
INITIALIZATION
When the smcuda BTL initializes, it starts with no support for CUDA IPC.
Upon the first access of a GPU buffer, the smcuda checks which GPU device
it has and sends that to the remote side using a smcuda specific control
message. The other rank receives the message, and checks to see if there
is CUDA IPC support between the two GPUs via a call to
cuDeviceCanAccessPeer(). If it is true, then the smcuda BTL piggy backs on
the PML error handler callback to make a call into the PML and let it know
to enable CUDA IPC. We created a new flag so that the error handler does
the right thing. Large message RDMA is enabled by setting a flag in the
bml->btl_flags field. Control returns to the smcuda BTL where a reply
message is sent so the sending side can set its flag.
At that point, the PML layer starts using the large message RDMA support
in the smcuda BTL. This is done in some special CUDA code in the PML layer.
ESTABLISHING CUDA IPC SUPPORT
A check has been added into both the send and sendi path in the smcuda btl
that checks to see if it should send a request for CUDA IPC setup message.
/* Initiate setting up CUDA IPC support. */
if (mca_common_cuda_enabled && (IPC_INIT == endpoint->ipcstatus)) {
mca_btl_smcuda_send_cuda_ipc_request(btl, endpoint);
}
The first check is to see if the CUDA environment has been initialized. If
not, then presumably we are not sending any GPU buffers yet and there is
nothing to be done. If we are initialized, then check the status of the
CUDA IPC endpoint. If it is in the IPC_INIT stage, then call the function
to send of a control message to the endpoint.
On the receiving side, we first check to see if we are initialized. If
not, then send a message back to the sender saying we are not initialized.
This will cause the sender to reset its state to IPC_INIT so it can try
again on the next send.
I considered putting the receiving side into a new state like IPC_NOTREADY,
and then when it switches to ready, to then sending the ACK to the sender.
The problem with this is that we would need to do these checks during the
progress loop which adds some extra overhead as we would have to check all
endpoints to see if they were ready.
Note that any rank can initiate the setup of CUDA IPC. It is triggered by
whichever side does a send or sendi call of a GPU buffer.
I have the sender attempt 5 times to set up the connection. After that, we
give up. Note that I do not expect many scenarios where the sender has to
resend. It could happen in a race condition where one rank has initialized
its CUDA environment but the other side has not.
There are several states the connections can go through.
IPC_INIT - nothing has happened
IPC_SENT - message has been sent to other side
IPC_ACKING - Received request and figuring out what to send back
IPC_ACKED - IPC ACK sent
IPC_OK - IPC ACK received back
IPC_BAD - Something went wrong, so marking as no IPC support
NOTE ABOUT CUDA IPC AND MEMORY POOLS
The CUDA IPC support works in the following way. A sender makes a call to
cuIpcGetMemHandle() and gets a memory handle for its local memory. The
sender then sends that handle to receiving side. The receiver calls
cuIpcOpenMemHandle() using that handle and gets back an address to the
remote memory. The receiver then calls cuMemcpyAsync() to initiate a
remote read of the GPU data.
The receiver maintains a cache of remote memory that it has handles open on.
This is because a call to cuIpcOpenMemHandle() can be very expensive (90usec) so
we want to avoid it when we can. The cache of remote memory is kept in a memory
pool that is associated with each endpoint. Note that we do not cache the local
memory handles because getting them is very cheap and there is no need.

opal/mca/btl/smcuda/README.md (new file)
@ -0,0 +1,126 @@
# Open MPI SMCUDA design document
Copyright (c) 2013 NVIDIA Corporation. All rights reserved.
August 21, 2013
This document describes the design and use of the `smcuda` BTL.
## BACKGROUND
The `smcuda` btl is a copy of the `sm` btl but with some additional
features. The main extra feature is the ability to make use of the
CUDA IPC APIs to quickly move GPU buffers from one GPU to another.
Without this support, the GPU buffers would all be moved into and then
out of host memory.
## GENERAL DESIGN
The general design makes use of the large message RDMA RGET support in
the OB1 PML. However, there are some interesting choices to make use
of it. First, we disable any large message RDMA support in the BTL
for host messages. This is done because we need to use the
`mca_btl_smcuda_get()` for the GPU buffers. This is also done because
the upper layers expect there to be a single mpool but we need one for
the GPU memory and one for the host memory. Since the advantages of
using RDMA with host memory are unclear, we disabled it.  This means no
KNEM or CMA support built in to the `smcuda` BTL.
Also note that we give the `smcuda` BTL a higher rank than the `sm`
BTL. This means it will always be selected even if we are doing host
only data transfers. The `smcuda` BTL is not built if it is not
requested via the `--with-cuda` flag to the configure line.
Secondly, the `smcuda` does not make use of the traditional method of
enabling RDMA operations. The traditional method checks for the existence
of an RDMA btl hanging off the endpoint. The `smcuda` works in conjunction
with the OB1 PML and uses flags that it sends in the BML layer.
## OTHER CONSIDERATIONS
CUDA IPC is not necessarily supported by all GPUs on a node. In NUMA
nodes, CUDA IPC may only work between GPUs that are not connected
over the IOH. In addition, we want to check for CUDA IPC support lazily,
when the first GPU access occurs, rather than during `MPI_Init()` time.
This complicates the design.
## INITIALIZATION
When the `smcuda` BTL initializes, it starts with no support for CUDA IPC.
Upon the first access of a GPU buffer, the `smcuda` checks which GPU device
it has and sends that to the remote side using a `smcuda` specific control
message. The other rank receives the message, and checks to see if there
is CUDA IPC support between the two GPUs via a call to
`cuDeviceCanAccessPeer()`.  If it is true, then the `smcuda` BTL piggybacks on
the PML error handler callback to make a call into the PML and let it know
to enable CUDA IPC. We created a new flag so that the error handler does
the right thing. Large message RDMA is enabled by setting a flag in the
`bml->btl_flags` field. Control returns to the `smcuda` BTL where a reply
message is sent so the sending side can set its flag.
At that point, the PML layer starts using the large message RDMA
support in the `smcuda` BTL. This is done in some special CUDA code
in the PML layer.
## ESTABLISHING CUDA IPC SUPPORT
A check has been added into both the `send` and `sendi` path in the
`smcuda` btl that checks to see if it should send a request for CUDA
IPC setup message.
```c
/* Initiate setting up CUDA IPC support. */
if (mca_common_cuda_enabled && (IPC_INIT == endpoint->ipcstatus)) {
mca_btl_smcuda_send_cuda_ipc_request(btl, endpoint);
}
```
The first check is to see if the CUDA environment has been
initialized. If not, then presumably we are not sending any GPU
buffers yet and there is nothing to be done. If we are initialized,
then check the status of the CUDA IPC endpoint. If it is in the
IPC_INIT stage, then call the function to send a control message to
the endpoint.
On the receiving side, we first check to see if we are initialized.
If not, then send a message back to the sender saying we are not
initialized. This will cause the sender to reset its state to
IPC_INIT so it can try again on the next send.
I considered putting the receiving side into a new state like
IPC_NOTREADY, and then when it switches to ready, to then sending the
ACK to the sender. The problem with this is that we would need to do
these checks during the progress loop which adds some extra overhead
as we would have to check all endpoints to see if they were ready.
Note that any rank can initiate the setup of CUDA IPC. It is
triggered by whichever side does a send or sendi call of a GPU buffer.
I have the sender attempt 5 times to set up the connection. After
that, we give up. Note that I do not expect many scenarios where the
sender has to resend. It could happen in a race condition where one
rank has initialized its CUDA environment but the other side has not.
There are several states the connections can go through.
1. IPC_INIT - nothing has happened
1. IPC_SENT - message has been sent to other side
1. IPC_ACKING - Received request and figuring out what to send back
1. IPC_ACKED - IPC ACK sent
1. IPC_OK - IPC ACK received back
1. IPC_BAD - Something went wrong, so marking as no IPC support
## NOTE ABOUT CUDA IPC AND MEMORY POOLS
The CUDA IPC support works in the following way. A sender makes a
call to `cuIpcGetMemHandle()` and gets a memory handle for its local
memory.  The sender then sends that handle to the receiving side.  The
receiver calls `cuIpcOpenMemHandle()` using that handle and gets back
an address to the remote memory. The receiver then calls
`cuMemcpyAsync()` to initiate a remote read of the GPU data.
The receiver maintains a cache of remote memory that it has handles
open on. This is because a call to `cuIpcOpenMemHandle()` can be very
expensive (90usec) so we want to avoid it when we can. The cache of
remote memory is kept in a memory pool that is associated with each
endpoint. Note that we do not cache the local memory handles because
getting them is very cheap and there is no need.
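The flow described above boils down to the following sketch (real CUDA
driver API calls, but simplified: error handling, the control-message
exchange, and the per-endpoint handle cache are omitted):
```c
#include <cuda.h>

/* Sender side: produce an IPC handle for a device buffer so it can be
 * shipped to the peer in a control message. */
static CUipcMemHandle export_gpu_buffer(CUdeviceptr dbuf)
{
    CUipcMemHandle handle;
    cuIpcGetMemHandle(&handle, dbuf);            /* error handling omitted */
    return handle;
}

/* Receiver side: map the peer's buffer and read it directly into a
 * local device buffer. */
static void read_peer_buffer(CUipcMemHandle handle, CUdeviceptr local_dst,
                             size_t bytes, CUstream stream)
{
    CUdeviceptr remote_src;

    cuIpcOpenMemHandle(&remote_src, handle,
                       CU_IPC_MEM_LAZY_ENABLE_PEER_ACCESS);
    cuMemcpyAsync(local_dst, remote_src, bytes, stream);
    /* The smcuda BTL caches the opened handle per endpoint rather than
     * calling cuIpcCloseMemHandle() immediately, because opening a
     * handle is expensive. */
}
```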

@ -27,7 +27,7 @@
AM_CPPFLAGS = $(opal_ofi_CPPFLAGS) -DOMPI_LIBMPI_NAME=\"$(OMPI_LIBMPI_NAME)\"
EXTRA_DIST = README.txt README.test
EXTRA_DIST = README.md README.test
dist_opaldata_DATA = \
help-mpi-btl-usnic.txt

opal/mca/btl/usnic/README.md (new file)
@ -0,0 +1,330 @@
# Design notes on usnic BTL
## nomenclature
* fragment - something the PML asks us to send or put, any size
* segment - something we can put on the wire in a single packet
* chunk - a piece of a fragment that fits into one segment
a segment can contain either an entire fragment or a chunk of a fragment
each segment and fragment has an associated descriptor.
Each segment data structure has a block of registered memory associated with
it which matches MTU for that segment
* ACK - acks get special small segments with only enough memory for an ACK
* non-ACK segments always have a parent fragment
* fragments are either large (> MTU) or small (<= MTU)
* a small fragment has a segment descriptor embedded within it since it
always needs exactly one.
* a large fragment has no permanently associated segments, but allocates them
as needed.
## channels
A channel is a queue pair with an associated completion queue
each channel has its own MTU and r/w queue entry counts
There are 2 channels, command and data:
* command queue is generally for higher priority fragments
* data queue is for standard data traffic
* command queue should possibly be called "priority" queue
The command queue is shorter and has a smaller MTU than the data
queue.  This makes the command queue a lot faster than the data queue,
so we hijack it for sending very small fragments (<= tiny_mtu,
currently 768 bytes).
The command queue is used for ACKs and tiny fragments; the data queue
is used for everything else.
PML fragments marked priority should perhaps use command queue
## sending
Normally, all send requests are simply enqueued and then actually posted
to the NIC by the routine `opal_btl_usnic_module_progress_sends()`.
"fastpath" tiny sends are the exception.
Each module maintains a queue of endpoints that are ready to send.
An endpoint is ready to send if all of the following are met:
1. the endpoint has fragments to send
1. the endpoint has send credits
1. the endpoint's send window is "open" (not full of un-ACKed segments)
Each module also maintains a list of segments that need to be retransmitted.
Note that the list of pending retrans is per-module, not per-endpoint.
Send progression first posts any pending retransmissions, always using
the data channel. (reason is that if we start getting heavy
congestion and there are lots of retransmits, it becomes more
important than ever to prioritize ACKs, clogging command channel with
retrans data makes things worse, not better)
Next, progression loops sending segments to the endpoint at the top of
the `endpoints_with_sends` queue. When an endpoint exhausts its send
credits or fills its send window or runs out of segments to send, it
removes itself from the `endpoint_with_sends` list. Any pending ACKs
will be picked up and piggy-backed on these sends.
Finally, any endpoints that still need ACKs whose timer has expired will
be sent explicit ACK packets.
## fragment sending
The middle part of the progression loop handles both small
(single-segment) and large (multi-segment) sends.
For small fragments, the verbs descriptor within the embedded segment
is updated with length, BTL header is updated, then we call
`opal_btl_usnic_endpoint_send_segment()` to send the segment. After
posting, we make a PML callback if needed.
For large fragments, a little more is needed.  Segments from a large
fragment have a slightly larger BTL header which contains a fragment
ID, an offset, and a size.  The fragment ID is allocated when the
first chunk of the fragment is sent.  A segment gets allocated, the
next blob of data is copied into this segment, and the segment is
posted.  If the last chunk of the fragment has been sent, perform the
callback if needed, then remove the fragment from the endpoint send
queue.
## `opal_btl_usnic_endpoint_send_segment()`
This is common posting code for large or small segments. It assigns a
sequence number to a segment, checks for an ACK to piggy-back,
posts the segment to the NIC, and then starts the retransmit timer
by checking the segment into hotel. Send credits are consumed here.
## send dataflow
PML control messages with no user data are sent via:
* `desc = usnic_alloc(size)`
* `usnic_send(desc)`
user messages less than eager limit and 1st part of larger
messages are sent via:
* `desc = usnic_prepare_src(convertor, size)`
* `usnic_send(desc)`
larger msgs:
* `desc = usnic_prepare_src(convertor, size)`
* `usnic_put(desc)`
`usnic_alloc()` currently asserts the length is "small", allocates and
fills in a small fragment. src pointer will point to start of
associated registered mem + sizeof BTL header, and PML will put its
data there.
`usnic_prepare_src()` allocates either a large or small fragment based
on size.  The fragment descriptor is filled in to have 2 SG entries,
the 1st pointing to the place where the PML should construct its
header.  If the data convertor says the data is contiguous, the 2nd SG
entry points to the user buffer; else it is null and sf_convertor is
filled in with the address of the convertor.
### `usnic_send()`
If the fragment being sent is small enough, has contiguous data, and
"very few" command queue send WQEs have been consumed, `usnic_send()`
does a fastpath send. This means it posts the segment immediately to
the NIC with INLINE flag set.
If all of the conditions for fastpath send are not met, and this is a
small fragment, the user data is copied into the associated registered
memory at this time and the SG list in the descriptor is collapsed to
one entry.
After the checks above are done, the fragment is enqueued to be sent
via `opal_btl_usnic_endpoint_enqueue_frag()`
### `usnic_put()`
Do a fast version of what happens in `prepare_src()` (can take shortcuts
because we know it will always be a contiguous buffer / no convertor
needed). PML gives us the destination address, which we save on the
fragment (which is the sentinel value that the underlying engine uses
to know that this is a PUT and not a SEND), and the fragment is
enqueued for processing.
### `opal_btl_usnic_endpoint_enqueue_frag()`
This appends the fragment to the "to be sent" list of the endpoint and
conditionally adds the endpoint to the list of endpoints with data to
send via `opal_btl_usnic_check_rts()`
## receive dataflow
BTL packets has one of 3 types in header: frag, chunk, or ack.
* A frag packet is a full PML fragment.
* A chunk packet is a piece of a fragment that needs to be reassembled.
* An ack packet is header only with a sequence number being ACKed.
* Both frag and chunk packets go through some of the same processing.
* Both may carry piggy-backed ACKs which may need to be processed.
* Both have sequence numbers which must be processed and may result in
dropping the packet and/or queueing an ACK to the sender.
frag packets may be either regular PML fragments or PUT segments. If
the "put_addr" field of the BTL header is set, this is a PUT and the
data is copied directly to the user buffer. If this field is NULL,
the segment is passed up to the PML. The PML is expected to do
everything it needs with this packet in the callback, including
copying data out if needed. Once the callback is complete, the
receive buffer is recycled.
chunk packets are parts of a larger fragment. If an active fragment
receive for the matching fragment ID cannot be found, a new fragment
info descriptor is allocated. If this is not a PUT (`put_addr == NULL`),
we `malloc()` data to reassemble the fragment into. Each
subsequent chunk is copied either into this reassembly buffer or
directly into user memory. When the last chunk of a fragment arrives,
a PML callback is made for non-PUTs, then the fragment info descriptor
is released.
## fast receive optimization
In order to optimize latency of small packets, the component progress
routine implements a fast path for receives. If the first completion
is a receive on the priority queue, then it is handled by a routine
called `opal_btl_usnic_recv_fast()`, which does nothing but validate
that the packet is OK to be received (sequence number OK and not a
DUP) and then deliver it to the PML. The packet is recorded in the
channel structure, and all bookkeeping for the packet is deferred until
the next time `component_progress` is called.
This fast path cannot be taken every time we pass through
`component_progress` because there will be other completions that need
processing, and the receive bookkeeping for one fast receive must be
complete before allowing another fast receive to occur, as only one
recv segment can be saved for deferred processing at a time. This is
handled by maintaining a variable in `opal_btl_usnic_recv_fast()`
called `fastpath_ok`, which is set to false every time the fastpath is
taken. A call into the regular progress routine sets this flag back to
true.
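The gating can be pictured roughly like this (a sketch with stub
helpers; the real flag lives in the module/channel bookkeeping rather
than at file scope):
```c
/* Hypothetical sketch of the receive fastpath gating; stubs stand in
 * for the real completion handling. */
#include <stdbool.h>

static bool first_completion_is_priority_recv(void) { return true; } /* stub */
static int  sketch_recv_fast(void)     { return 1; } /* seq check + PML delivery */
static int  sketch_progress_slow(void) { return 0; } /* full completion handling */

static bool fastpath_ok = true;  /* only one recv may be deferred at a time */

static int sketch_component_progress(void)
{
    if (fastpath_ok && first_completion_is_priority_recv()) {
        fastpath_ok = false;     /* its bookkeeping is deferred...            */
        return sketch_recv_fast();
    }
    fastpath_ok = true;          /* ...and finished on the next, regular pass */
    return sketch_progress_slow();
}
```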
## reliability:
* every packet has a sequence number
* each endpoint has a "send window", currently 4096 entries
* once a segment is sent, it is saved in the window array until an ACK
  is received
* ACKs acknowledge all packets <= the specified sequence number
* the receiver only ACKs a sequence number when all packets up to that
  sequence have arrived
* each packet has a default retransmit timer of 100ms
* a packet is scheduled for retransmission if its timer expires
Once a segment is sent, it always has its retransmit timer started.
This is accomplished by `opal_hotel_checkin()`.
Any time a segment is posted to the NIC for retransmit, it is checked out
of the hotel (timer stopped).
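In code, that bookkeeping looks roughly like the following;
`opal_hotel_checkin()`/`opal_hotel_checkout()` are the OPAL hotel API
(`opal/class/opal_hotel.h`), while the segment type and wrapper
functions here are simplified placeholders:
```c
/* Sketch of retransmit-timer bookkeeping via the OPAL hotel class.  The
 * opal_hotel_* calls are real OPAL API; everything else is simplified. */
#include "opal/class/opal_hotel.h"

struct sketch_segment {
    int hotel_room;   /* room number while awaiting an ACK */
};

/* after (re)posting a segment to the NIC: start its retransmit timer */
static int sketch_start_retrans_timer(opal_hotel_t *hotel,
                                      struct sketch_segment *seg)
{
    return opal_hotel_checkin(hotel, seg, &seg->hotel_room);
}

/* before reposting for retransmit (or on ACK): stop the timer */
static void sketch_stop_retrans_timer(opal_hotel_t *hotel,
                                      struct sketch_segment *seg)
{
    opal_hotel_checkout(hotel, seg->hotel_room);
}
```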
So, a send segment is always in one of 4 states (sketched as an enum
after this list):
* on free list, unallocated
* on endpoint to-send list in the case of segment associated with small fragment
* posted to NIC and in hotel awaiting ACK
* on module re-send list awaiting retransmission
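A hypothetical enum naming those states (the real code does not store
such a field; the state is implicit in which list the segment is on):
```c
/* Hypothetical only: the BTL tracks this implicitly via list membership. */
enum sketch_send_segment_state {
    SKETCH_SEG_FREE,          /* on free list, unallocated                   */
    SKETCH_SEG_TO_SEND,       /* on endpoint to-send list (small fragment)   */
    SKETCH_SEG_AWAITING_ACK,  /* posted to NIC, checked into the hotel       */
    SKETCH_SEG_RETRANS        /* on module re-send list, awaiting retransmit */
};
```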
receiver:
* if a packet with seq >= the expected seq is received, schedule an ACK
  of the largest in-order sequence received, if not already scheduled;
  the default delay is 50us
* if a packet with seq < the expected seq arrives, send an ACK
  immediately, as this indicates a lost ACK
sender:
* a duplicate ACK triggers an immediate retransmit if one is not
  already pending for that segment
## Reordering induced by two queues and piggy-backing:
ACKs can be reordered:
* not an issue at all; old ACKs are simply ignored
Sends can be reordered:
* a small send can jump far ahead of large sends
* a large send followed by lots of small sends could trigger many
  retransmissions of the large send. The small sends would have to be
  paced pretty precisely to keep the command queue empty enough and
  also beat out the large sends. Send credits limit how many large
  sends can be queued on the sender, but there could be many on the
  receiver.
## RDMA emulation
We emulate the RDMA PUT because it's more efficient than regular send:
it allows the receive side to copy directly to the target buffer
(vs. making an intermediate copy out of the bounce buffer).
It would actually be better to morph this PUT into a GET -- GET would
be slightly more efficient. In short, when the target requests the
actual RDMA data, with PUT, the request has to go up to the PML, which
will then invoke PUT on the source's BTL module. With GET, the target
issues the GET, and the source BTL module can reply without needing to
go up the stack to the PML.
Once we start supporting RDMA in hardware:
* we need to provide `module.btl_register_mem` and
`module.btl_deregister_mem` functions (see openib for an example)
* we need to put something meaningful in
`btl_usnic_frag.h:mca_btl_base_registration_handle_t`.
* we need to set `module.btl_registration_handle_size` to
  `sizeof(struct mca_btl_base_registration_handle_t)`.
* `module.btl_put` / `module.btl_get` will receive the
`mca_btl_base_registration_handle_t` from the peer as a cookie.
Also, `module.btl_put` / `module.btl_get` do not need to make
descriptors (this was an optimization added in BTL 3.0). They are now
called with enough information to do whatever they need to do.
`module.btl_put` still makes a descriptor and submits it to the usnic
sending engine so as to utilize a common infrastructure for send and
put.
But it doesn't necessarily have to be that way -- we could optimize
out the use of the descriptors. Have not investigated how easy/hard
that would be.
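Purely as a hypothetical illustration of that handle (the notes above
only say that something meaningful needs to go there; the actual
contents are undecided):
```c
/* Hypothetical sketch only; the real contents of the usnic registration
 * handle in btl_usnic_frag.h are TBD. */
#include <stdint.h>

struct sketch_usnic_registration_handle {
    uint64_t remote_addr;  /* base address of the registered region */
    uint64_t rkey;         /* remote protection key                 */
    uint64_t len;          /* length of the registered region       */
};
/* module.btl_registration_handle_size would then be
 * sizeof(struct sketch_usnic_registration_handle). */
```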
## libfabric abstractions:
* `fi_fabric`: corresponds to a VIC PF
* `fi_domain`: corresponds to a VIC VF
* `fi_endpoint`: resources inside the VIC VF (basically a QP)
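A sketch of how those three objects get opened with standard libfabric
calls; error handling is trimmed, and the API version and provider-name
filter shown here are assumptions for illustration:
```c
/* Sketch: open fabric (VIC PF), domain (VIC VF), and endpoint (QP-like
 * resources in the VF).  Standard libfabric calls; version/provider
 * values are illustrative. */
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_endpoint.h>
#include <string.h>

static int sketch_open_usnic(struct fid_fabric **fabric,  /* VIC PF       */
                             struct fid_domain **domain,  /* VIC VF       */
                             struct fid_ep **ep)          /* VF resources */
{
    struct fi_info *hints = fi_allocinfo(), *info = NULL;
    int ret;

    hints->fabric_attr->prov_name = strdup("usnic");
    ret = fi_getinfo(FI_VERSION(1, 4), NULL, NULL, 0, hints, &info);
    if (0 == ret) ret = fi_fabric(info->fabric_attr, fabric, NULL);
    if (0 == ret) ret = fi_domain(*fabric, info, domain, NULL);
    if (0 == ret) ret = fi_endpoint(*domain, info, ep, NULL);

    fi_freeinfo(hints);
    if (NULL != info) {
        fi_freeinfo(info);
    }
    return ret;
}
```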
## `MPI_THREAD_MULTIPLE` support
In order to make the usnic BTL thread-safe, a mutex lock is used to
protect the critical path, i.e., libfabric routines, bookkeeping, etc.
That lock is `btl_usnic_lock`. It is a RECURSIVE lock, meaning that
the same thread can take the lock again even if it already holds it;
this allows a callback function to post another segment right away
when we know that the current segment completed inline (so we can call
send within send without deadlocking).
These two functions take care of hotel checkin/checkout, so we take
the mutex lock before entering them:
* `opal_btl_usnic_check_rts()`
* `opal_btl_usnic_handle_ack()`
The calls into libfabric routines also have to be protected:
* `opal_btl_usnic_endpoint_send_segment()` (`fi_send`)
* `opal_btl_usnic_recv_call()` (`fi_recvmsg`)
The connectivity client's connection checking
(`opal_btl_usnic_connectivity_ping`) also has to be protected. This
happens only at startup, but the cclient communicates with the cagent
through `opal_fd_read/write()`, and if two or more clients do
`opal_fd_write()` at the same time, the data might be corrupted.
Accordingly, many functions in btl/usnic that call the routines listed
above are protected by the `OPAL_THREAD_LOCK` macro, which is only
active if the user invokes `MPI_Init_thread()` requesting
`MPI_THREAD_MULTIPLE` support.
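A sketch of that locking pattern; `OPAL_THREAD_LOCK`/`OPAL_THREAD_UNLOCK`
and the recursive mutex class are the real OPAL primitives, but the
header paths vary between Open MPI versions and the function body here
is a placeholder:
```c
/* Sketch of the MPI_THREAD_MULTIPLE locking pattern.  Header locations
 * differ across versions (e.g. opal/mca/threads/ on newer trees). */
#include "opal/threads/mutex.h"
#include "opal/threads/thread_usage.h"

static opal_recursive_mutex_t sketch_btl_usnic_lock;

static void sketch_init_lock(void)
{
    /* recursive, so a completion callback may call send again */
    OBJ_CONSTRUCT(&sketch_btl_usnic_lock, opal_recursive_mutex_t);
}

static void sketch_handle_ack(void)
{
    /* these macros are no-ops unless threads are actually in use */
    OPAL_THREAD_LOCK(&sketch_btl_usnic_lock);
    /* ... hotel checkout and send-window bookkeeping would go here ... */
    OPAL_THREAD_UNLOCK(&sketch_btl_usnic_lock);
}
```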

oshmem/mca/memheap/README.md (new file)
@@ -0,0 +1,71 @@
# MEMHEAP infrastructure documentation
Copyright (c) 2013 Mellanox Technologies, Inc.
All rights reserved
The MEMHEAP infrastructure is responsible for managing the symmetric
heap. The framework currently has the following components: buddy and
ptmalloc. The buddy component uses a buddy allocator to manage memory
allocations on the symmetric heap; ptmalloc is an adaptation of
ptmalloc3.
Additional components may be added easily to the framework by defining
the component's and the module's base and extended structures, and
their functionalities.
The buddy allocator has the following data structures:
1. Base component - of type struct mca_memheap_base_component_2_0_0_t
2. Base module - of type struct mca_memheap_base_module_t
3. Buddy component - of type struct mca_memheap_base_component_2_0_0_t
4. Buddy module - of type struct mca_memheap_buddy_module_t extending
the base module (struct mca_memheap_base_module_t)
Each data structure includes the following fields:
1. Base component - memheap_version, memheap_data and memheap_init
2. Base module - Holds pointers to the base component and to the
functions: alloc, free and finalize
3. Buddy component - is a base component.
4. Buddy module - Extends the base module and holds additional data:
the component's priority, the buddy allocator, the maximal order of
the symmetric heap, the symmetric heap, a pointer to the symmetric
heap, and a hashtable maintaining the size of each allocated address
(see the struct sketch after this list).
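A hypothetical sketch of that "buddy module extends base module"
layout; the real definitions live under oshmem/mca/memheap and differ
in detail:
```c
/* Hypothetical sketch of the module-extension pattern; not the actual
 * oshmem/mca/memheap definitions. */
#include <stddef.h>
#include <stdint.h>

typedef struct sketch_memheap_base_module {
    void *(*alloc)(size_t size);
    int   (*free)(void *ptr);
    int   (*finalize)(void);
} sketch_memheap_base_module_t;

typedef struct sketch_memheap_buddy_module {
    sketch_memheap_base_module_t super; /* base module comes first          */
    int      priority;                  /* used for component selection     */
    uint32_t max_order;                 /* log2 of the largest allocation   */
    void    *symmetric_heap;            /* reserved range, same on all PEs  */
    void    *size_hashtable;            /* allocated address -> size        */
} sketch_memheap_buddy_module_t;
```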
If the user implements additional components, the MEMHEAP
infrastructure chooses the component with the maximal priority.
Handling of component opening is done under the base directory, in
three stages:
1. Open all available components. Implemented by memheap_base_open.c
and called from shmem_init.
2. Select the maximal-priority component. This involves initializing
all components and then finalizing all but the chosen component. It is
implemented by memheap_base_select.c and called from shmem_init (see
the selection sketch below).
3. Close the maximal-priority active component. Implemented by
memheap_base_close.c and called from shmem_finalize.
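The selection stage boils down to a max-priority scan, roughly like
this sketch (simplified names; the init/finalize of the losing
components is omitted):
```c
/* Hypothetical sketch of max-priority component selection. */
#include <stddef.h>

struct sketch_component { const char *name; int priority; };

static const struct sketch_component *
sketch_select(const struct sketch_component *avail, size_t n)
{
    const struct sketch_component *best = NULL;
    for (size_t i = 0; i < n; ++i) {
        /* each component would be initialized and queried here; all but
         * the winner are finalized again afterwards */
        if (NULL == best || avail[i].priority > best->priority) {
            best = &avail[i];
        }
    }
    return best;
}
```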
## Buddy Component/Module
Responsible for handling all activities of the symmetric heap. The
supported activities are:
1. buddy_init (initialization)
1. buddy_alloc (allocates a variable on the symmetric heap)
1. buddy_free (frees a variable previously allocated on the symmetric heap)
1. buddy_finalize (finalization)
Data members of the buddy module:
1. priority: the module's priority.
1. buddy allocator: bits, num_free, lock and the maximal order (log2
of the maximal size) of a variable on the symmetric heap. Buddy
Allocator gives the offset in the symmetric heap where a variable
should be allocated.
1. symmetric_heap: a range of reserved addresses (equal in all
executing PE's) dedicated to "shared memory" allocation.
1. symmetric_heap_hashtable (holds the size of each variable allocated
on the symmetric heap; used to free an allocated variable on the
symmetric heap)
test/runtime/README.md (new file)
@@ -0,0 +1,20 @@
The functions in this directory are all intended to test registry
operations against a persistent seed. Thus, they perform a system
init/finalize. The functions in the directory above this one should be
used to test basic registry operations within the replica - they will
isolate the replica so as to avoid the communications issues and the
init/finalize problems in other subsystems that may cause problems
here.
To run these tests, you need to first start a persistent daemon. This
can be done using the command:
```
orted --seed --scope public --persistent
```
The daemon will "daemonize" itself and establish the registry (as well
as other central services) replica, and then return a system
prompt. You can then run any of these functions. If desired, you can
utilize gdb and/or debug options on the persistent orted to
watch/debug replica operations as well.