Convert all README files to Markdown
A mindless task for a lazy weekend: convert all the README and README.txt files to Markdown. Paired with the slow conversion of all of our man pages to Markdown, this gives a uniform language to the Open MPI docs.

This commit moved a bunch of copyright headers out of the top-level README.txt file, so I updated the relevant copyright header years in the top-level LICENSE file to match what was removed from README.txt.

Additionally, this commit did (very) little to update the actual content of the README files. A very small number of updates were made for topics that I found blatantly obvious while Markdown-izing the content, but in general, I did not update content during this commit. For example, there's still quite a bit of text about ORTE that was not meaningfully updated.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Co-authored-by: Josh Hursey <jhursey@us.ibm.com>
This commit is contained in:
parent 686c2142e2
commit c960d292ec
HACKING (272 lines deleted)
@@ -1,272 +0,0 @@

Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
                        University Research and Technology
                        Corporation.  All rights reserved.
Copyright (c) 2004-2005 The University of Tennessee and The University
                        of Tennessee Research Foundation.  All rights
                        reserved.
Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
                        University of Stuttgart.  All rights reserved.
Copyright (c) 2004-2005 The Regents of the University of California.
                        All rights reserved.
Copyright (c) 2008-2020 Cisco Systems, Inc.  All rights reserved.
Copyright (c) 2013      Intel, Inc.  All rights reserved.
$COPYRIGHT$

Additional copyrights may follow

$HEADER$

Overview
========

This file is here for those who are building/exploring OMPI in its
source code form, most likely through a developer's tree (i.e., a
Git clone).


Developer Builds: Compiler Pickyness by Default
===============================================

If you are building Open MPI from a Git clone (i.e., there is a ".git"
directory in your build tree), the default build includes extra
compiler pickyness, which will result in more compiler warnings than
in non-developer builds.  Getting these extra compiler warnings is
helpful to Open MPI developers in making the code base as clean as
possible.

Developers can disable this picky-by-default behavior by using the
--disable-picky configure option.  Also note that extra-picky compiles
do *not* happen automatically when you do a VPATH build (e.g., if
".git" is in your source tree, but not in your build tree).

Prior versions of Open MPI would automatically activate a lot of
(performance-reducing) debugging code by default if ".git" was found
in your build tree.  This is no longer true.  You can manually enable
these (performance-reducing) debugging features in the Open MPI code
base with these configure options:

    --enable-debug
    --enable-mem-debug
    --enable-mem-profile

NOTE: These options are really only relevant to those who are
developing Open MPI itself.  They are not generally helpful for
debugging general MPI applications.


Use of GNU Autoconf, Automake, and Libtool (and m4)
===================================================

You need to read/care about this section *ONLY* if you are building
from a developer's tree (i.e., a Git clone of the Open MPI source
tree).  If you have an Open MPI distribution tarball, the contents of
this section are optional -- you can (and probably should) skip
reading this section.

If you are building Open MPI from a developer's tree, you must first
install fairly recent versions of the GNU tools Autoconf, Automake,
and Libtool (and possibly GNU m4, because recent versions of Autoconf
have specific GNU m4 version requirements).  The specific versions
required depend on if you are using the Git master branch or a release
branch (and which release branch you are using).  The specific
versions can be found here:

    https://www.open-mpi.org/source/building.php

You can check what versions of the autotools you have installed with
the following:

    shell$ m4 --version
    shell$ autoconf --version
    shell$ automake --version
    shell$ libtoolize --version

Required version levels for all the OMPI releases can be found here:

    https://www.open-mpi.org/source/building.php

To strengthen the above point: the core Open MPI developers typically
use very, very recent versions of the GNU tools.  There are known bugs
in older versions of the GNU tools that Open MPI no longer compensates
for (it seemed senseless to indefinitely support patches for ancient
versions of Autoconf, for example).  You *WILL* have problems if you
do not use recent versions of the GNU tools.

If you need newer versions, you are *strongly* encouraged to heed the
following advice:

NOTE: On MacOS/X, the default "libtool" program is different than the
      GNU libtool.  You must download and install the GNU version
      (e.g., via MacPorts, Homebrew, or some other mechanism).

1. Unless your OS distribution has easy-to-use binary installations,
   the sources can be can be downloaded from:

       ftp://ftp.gnu.org/gnu/autoconf/
       ftp://ftp.gnu.org/gnu/automake/
       ftp://ftp.gnu.org/gnu/libtool/
       and if you need it:
       ftp://ftp.gnu.org/gnu/m4/

   NOTE: It is certainly easiest to download/build/install all four of
   these tools together.  But note that Open MPI has no specific m4
   requirements; it is only listed here because Autoconf requires
   minimum versions of GNU m4.  Hence, you may or may not *need* to
   actually install a new version of GNU m4.  That being said, if you
   are confused or don't know, just install the latest GNU m4 with the
   rest of the GNU Autotools and everything will work out fine.

2. Build and install the tools in the following order:

   2a. m4
   2b. Autoconf
   2c. Automake
   2d. Libtool

3. You MUST install the last three tools (Autoconf, Automake, Libtool)
   into the same prefix directory.  These three tools are somewhat
   inter-related, and if they're going to be used together, they MUST
   share a common installation prefix.

   You can install m4 anywhere as long as it can be found in the path;
   it may be convenient to install it in the same prefix as the other
   three.  Or you can use any recent-enough m4 that is in your path.

   3a. It is *strongly* encouraged that you do not install your new
       versions over the OS-installed versions.  This could cause
       other things on your system to break.  Instead, install into
       $HOME/local, or /usr/local, or wherever else you tend to
       install "local" kinds of software.
   3b. In doing so, be sure to prefix your $path with the directory
       where they are installed.  For example, if you install into
       $HOME/local, you may want to edit your shell startup file
       (.bashrc, .cshrc, .tcshrc, etc.) to have something like:

           # For bash/sh:
           export PATH=$HOME/local/bin:$PATH
           # For csh/tcsh:
           set path = ($HOME/local/bin $path)

   3c. Ensure to set your $path *BEFORE* you configure/build/install
       the four packages.

4. All four packages require two simple commands to build and
   install (where PREFIX is the prefix discussed in 3, above).

       shell$ cd <m4 directory>
       shell$ ./configure --prefix=PREFIX
       shell$ make; make install

   --> If you are using the csh or tcsh shells, be sure to run the
       "rehash" command after you install each package.

       shell$ cd <autoconf directory>
       shell$ ./configure --prefix=PREFIX
       shell$ make; make install

   --> If you are using the csh or tcsh shells, be sure to run the
       "rehash" command after you install each package.

       shell$ cd <automake directory>
       shell$ ./configure --prefix=PREFIX
       shell$ make; make install

   --> If you are using the csh or tcsh shells, be sure to run the
       "rehash" command after you install each package.

       shell$ cd <libtool directory>
       shell$ ./configure --prefix=PREFIX
       shell$ make; make install

   --> If you are using the csh or tcsh shells, be sure to run the
       "rehash" command after you install each package.

   m4, Autoconf and Automake build and install very quickly; Libtool will
   take a minute or two.

5. You can now run OMPI's top-level "autogen.pl" script.  This script
   will invoke the GNU Autoconf, Automake, and Libtool commands in the
   proper order and setup to run OMPI's top-level "configure" script.

   Running autogen.pl may take a few minutes, depending on your
   system.  It's not very exciting to watch.  :-)

   If you have a multi-processor system, enabling the multi-threaded
   behavior in Automake 1.11 (or newer) can result in autogen.pl
   running faster.  Do this by setting the AUTOMAKE_JOBS environment
   variable to the number of processors (threads) that you want it to
   use before invoking autogen.pl.  For example (you can again put
   this in your shell startup files):

       # For bash/sh:
       export AUTOMAKE_JOBS=4
       # For csh/tcsh:
       set AUTOMAKE_JOBS 4

   5a. You generally need to run autogen.pl whenever the top-level
       file "configure.ac" changes, or any files in the config/ or
       <project>/config/ directories change (these directories are
       where a lot of "include" files for OMPI's configure script
       live).

   5b. You do *NOT* need to re-run autogen.pl if you modify a
       Makefile.am.


Use of Flex
===========

Flex is used during the compilation of a developer's checkout (it is
not used to build official distribution tarballs).  Other flavors of
lex are *not* supported: given the choice of making parsing code
portable between all flavors of lex and doing more interesting work on
Open MPI, we greatly prefer the latter.

Note that no testing has been performed to see what the minimum
version of Flex is required by Open MPI.  We suggest that you use
v2.5.35 at the earliest.

*** NOTE: Windows developer builds of Open MPI *require* Flex version
2.5.35.  Specifically, we know that v2.5.35 works and 2.5.4a does not.
We have not tested to figure out exactly what the minimum required
flex version is on Windows; we suggest that you use 2.5.35 at the
earliest.  It is for this reason that the
contrib/dist/make_dist_tarball script checks for a Windows-friendly
version of flex before continuing.

For now, Open MPI will allow developer builds with Flex 2.5.4.  This
is primarily motivated by the fact that RedHat/Centos 5 ships with
Flex 2.5.4.  It is likely that someday Open MPI developer builds will
require Flex version >=2.5.35.

Note that the flex-generated code generates some compiler warnings on
some platforms, but the warnings do not seem to be consistent or
uniform on all platforms, compilers, and flex versions.  As such, we
have done little to try to remove those warnings.

If you do not have Flex installed, it can be downloaded from the
following URL:

    https://github.com/westes/flex


Use of Pandoc
=============

Similar to prior sections, you need to read/care about this section
*ONLY* if you are building from a developer's tree (i.e., a Git clone
of the Open MPI source tree).  If you have an Open MPI distribution
tarball, the contents of this section are optional -- you can (and
probably should) skip reading this section.

The Pandoc tool is used to generate Open MPI's man pages.
Specifically: Open MPI's man pages are written in Markdown; Pandoc is
the tool that converts that Markdown to nroff (i.e., the format of man
pages).

You must have Pandoc >=v1.12 when building Open MPI from a developer's
tree.  If configure cannot find Pandoc >=v1.12, it will abort.

If you need to install Pandoc, check your operating system-provided
packages (to include MacOS Homebrew and MacPorts).  The Pandoc project
itself also offers binaries for their releases:

    https://pandoc.org/

HACKING.md (new file, 258 lines)
@@ -0,0 +1,258 @@

# Open MPI Hacking / Developer's Guide

## Overview

This file is here for those who are building/exploring OMPI in its
source code form, most likely through a developer's tree (i.e., a
Git clone).

## Developer Builds: Compiler Pickiness by Default

If you are building Open MPI from a Git clone (i.e., there is a `.git`
directory in your build tree), the default build includes extra
compiler pickiness, which will result in more compiler warnings than
in non-developer builds.  Getting these extra compiler warnings is
helpful to Open MPI developers in making the code base as clean as
possible.

Developers can disable this picky-by-default behavior by using the
`--disable-picky` configure option.  Also note that extra-picky compiles
do *not* happen automatically when you do a VPATH build (e.g., if
`.git` is in your source tree, but not in your build tree).

Prior versions of Open MPI would automatically activate a lot of
(performance-reducing) debugging code by default if `.git` was found
in your build tree.  This is no longer true.  You can manually enable
these (performance-reducing) debugging features in the Open MPI code
base with these configure options:

* `--enable-debug`
* `--enable-mem-debug`
* `--enable-mem-profile`

***NOTE:*** These options are really only relevant to those who are
developing Open MPI itself.  They are not generally helpful for
debugging general MPI applications.
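
For example, a debug-enabled developer build might be configured along
the following lines; this is only an illustrative sketch, and the
installation prefix shown here is an arbitrary example:

```
# Example only: enable Open MPI's internal debugging features
shell$ ./configure --prefix=$HOME/ompi-debug --enable-debug --enable-mem-debug
shell$ make all install
```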

## Use of GNU Autoconf, Automake, and Libtool (and m4)

You need to read/care about this section *ONLY* if you are building
from a developer's tree (i.e., a Git clone of the Open MPI source
tree).  If you have an Open MPI distribution tarball, the contents of
this section are optional -- you can (and probably should) skip
reading this section.

If you are building Open MPI from a developer's tree, you must first
install fairly recent versions of the GNU tools Autoconf, Automake,
and Libtool (and possibly GNU m4, because recent versions of Autoconf
have specific GNU m4 version requirements).  The specific versions
required depend on whether you are using the Git master branch or a
release branch (and which release branch you are using).  [The specific
versions can be found
here](https://www.open-mpi.org/source/building.php).

You can check what versions of the autotools you have installed with
the following:

```
shell$ m4 --version
shell$ autoconf --version
shell$ automake --version
shell$ libtoolize --version
```

[Required version levels for all the OMPI releases can be found
here](https://www.open-mpi.org/source/building.php).

To strengthen the above point: the core Open MPI developers typically
use very, very recent versions of the GNU tools.  There are known bugs
in older versions of the GNU tools that Open MPI no longer compensates
for (it seemed senseless to indefinitely support patches for ancient
versions of Autoconf, for example).  You *WILL* have problems if you
do not use recent versions of the GNU tools.

***NOTE:*** On MacOS/X, the default `libtool` program is different
than the GNU libtool.  You must download and install the GNU version
(e.g., via MacPorts, Homebrew, or some other mechanism).

If you need newer versions, you are *strongly* encouraged to heed the
following advice:

1. Unless your OS distribution has easy-to-use binary installations,
   the sources can be downloaded from:
   * https://ftp.gnu.org/gnu/autoconf/
   * https://ftp.gnu.org/gnu/automake/
   * https://ftp.gnu.org/gnu/libtool/
   * And if you need it: https://ftp.gnu.org/gnu/m4/

   ***NOTE:*** It is certainly easiest to download/build/install all
   four of these tools together.  But note that Open MPI has no
   specific m4 requirements; it is only listed here because Autoconf
   requires minimum versions of GNU m4.  Hence, you may or may not
   *need* to actually install a new version of GNU m4.  That being
   said, if you are confused or don't know, just install the latest
   GNU m4 with the rest of the GNU Autotools and everything will work
   out fine.

1. Build and install the tools in the following order:
   1. m4
   1. Autoconf
   1. Automake
   1. Libtool

1. You MUST install the last three tools (Autoconf, Automake, Libtool)
   into the same prefix directory.  These three tools are somewhat
   inter-related, and if they're going to be used together, they MUST
   share a common installation prefix.

   You can install m4 anywhere as long as it can be found in the path;
   it may be convenient to install it in the same prefix as the other
   three.  Or you can use any recent-enough m4 that is in your path.

   1. It is *strongly* encouraged that you do not install your new
      versions over the OS-installed versions.  This could cause
      other things on your system to break.  Instead, install into
      `$HOME/local`, or `/usr/local`, or wherever else you tend to
      install "local" kinds of software.
   1. In doing so, be sure to prefix your `$PATH` with the directory
      where they are installed.  For example, if you install into
      `$HOME/local`, you may want to edit your shell startup file
      (`.bashrc`, `.cshrc`, `.tcshrc`, etc.) to have something like:

      ```sh
      # For bash/sh:
      export PATH=$HOME/local/bin:$PATH
      # For csh/tcsh:
      set path = ($HOME/local/bin $path)
      ```

   1. Be sure to set your `$PATH` *BEFORE* you configure/build/install
      the four packages.

1. All four packages require two simple commands to build and
   install (where PREFIX is the prefix discussed in 3, above).

   ```
   shell$ cd <m4 directory>
   shell$ ./configure --prefix=PREFIX
   shell$ make; make install
   ```

   ***NOTE:*** If you are using the `csh` or `tcsh` shells, be sure to
   run the `rehash` command after you install each package.

   ```
   shell$ cd <autoconf directory>
   shell$ ./configure --prefix=PREFIX
   shell$ make; make install
   ```

   ***NOTE:*** If you are using the `csh` or `tcsh` shells, be sure to
   run the `rehash` command after you install each package.

   ```
   shell$ cd <automake directory>
   shell$ ./configure --prefix=PREFIX
   shell$ make; make install
   ```

   ***NOTE:*** If you are using the `csh` or `tcsh` shells, be sure to
   run the `rehash` command after you install each package.

   ```
   shell$ cd <libtool directory>
   shell$ ./configure --prefix=PREFIX
   shell$ make; make install
   ```

   ***NOTE:*** If you are using the `csh` or `tcsh` shells, be sure to
   run the `rehash` command after you install each package.

   m4, Autoconf and Automake build and install very quickly; Libtool
   will take a minute or two.

1. You can now run OMPI's top-level `autogen.pl` script.  This script
   will invoke the GNU Autoconf, Automake, and Libtool commands in the
   proper order and setup to run OMPI's top-level `configure` script
   (see the build sketch just after this list).

   Running `autogen.pl` may take a few minutes, depending on your
   system.  It's not very exciting to watch.  :smile:

   If you have a multi-processor system, enabling the multi-threaded
   behavior in Automake 1.11 (or newer) can result in `autogen.pl`
   running faster.  Do this by setting the `AUTOMAKE_JOBS` environment
   variable to the number of processors (threads) that you want it to
   use before invoking `autogen.pl`.  For example (you can again put
   this in your shell startup files):

   ```sh
   # For bash/sh:
   export AUTOMAKE_JOBS=4
   # For csh/tcsh:
   set AUTOMAKE_JOBS 4
   ```

   1. You generally need to run `autogen.pl` whenever the top-level file
      `configure.ac` changes, or any files in the `config/` or
      `<project>/config/` directories change (these directories are
      where a lot of "include" files for Open MPI's `configure` script
      live).
   1. You do *NOT* need to re-run `autogen.pl` if you modify a
      `Makefile.am`.
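
Putting the steps above together, a typical developer-tree build might
look something like the following sketch; the installation prefix and
parallel job count are only examples:

```
# Run from the top of the Git clone
shell$ ./autogen.pl
shell$ ./configure --prefix=$HOME/ompi-install
shell$ make -j 4 all
shell$ make install
```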

## Use of Flex

Flex is used during the compilation of a developer's checkout (it is
not used to build official distribution tarballs).  Other flavors of
lex are *not* supported: given the choice of making parsing code
portable between all flavors of lex and doing more interesting work on
Open MPI, we greatly prefer the latter.

Note that no testing has been performed to determine the minimum
version of Flex required by Open MPI.  We suggest that you use
v2.5.35 at the earliest.

***NOTE:*** Windows developer builds of Open MPI *require* Flex version
2.5.35.  Specifically, we know that v2.5.35 works and 2.5.4a does not.
We have not tested to figure out exactly what the minimum required
Flex version is on Windows; we suggest that you use 2.5.35 at the
earliest.  It is for this reason that the
`contrib/dist/make_dist_tarball` script checks for a Windows-friendly
version of Flex before continuing.

For now, Open MPI will allow developer builds with Flex 2.5.4.  This
is primarily motivated by the fact that RedHat/CentOS 5 ships with
Flex 2.5.4.  It is likely that someday Open MPI developer builds will
require Flex version >=2.5.35.

Note that the `flex`-generated code generates some compiler warnings
on some platforms, but the warnings do not seem to be consistent or
uniform across all platforms, compilers, and Flex versions.  As such, we
have done little to try to remove those warnings.

If you do not have Flex installed, see [the Flex GitHub
repository](https://github.com/westes/flex).

## Use of Pandoc

Similar to prior sections, you need to read/care about this section
*ONLY* if you are building from a developer's tree (i.e., a Git clone
of the Open MPI source tree).  If you have an Open MPI distribution
tarball, the contents of this section are optional -- you can (and
probably should) skip reading this section.

The Pandoc tool is used to generate Open MPI's man pages.
Specifically: Open MPI's man pages are written in Markdown; Pandoc is
the tool that converts that Markdown to nroff (i.e., the format of man
pages).

You must have Pandoc >=v1.12 when building Open MPI from a developer's
tree.  If configure cannot find Pandoc >=v1.12, it will abort.

If you need to install Pandoc, check your operating system-provided
packages (including MacOS Homebrew and MacPorts).  [The Pandoc
project web site](https://pandoc.org/) itself also offers binaries for
their releases.
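
As a rough illustration of the kind of conversion the build system
arranges, converting a single Markdown man page by hand might look
something like this; the file names are placeholders, not actual build
targets:

```
# Example only: standalone Markdown -> nroff man page conversion
shell$ pandoc -s --from=markdown --to=man my_manpage.3.md -o my_manpage.3
```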

LICENSE (11 lines changed)
@@ -15,9 +15,9 @@ Copyright (c) 2004-2010 High Performance Computing Center Stuttgart,
                         University of Stuttgart. All rights reserved.
 Copyright (c) 2004-2008 The Regents of the University of California.
                         All rights reserved.
-Copyright (c) 2006-2017 Los Alamos National Security, LLC. All rights
+Copyright (c) 2006-2018 Los Alamos National Security, LLC. All rights
                         reserved.
-Copyright (c) 2006-2017 Cisco Systems, Inc. All rights reserved.
+Copyright (c) 2006-2020 Cisco Systems, Inc. All rights reserved.
 Copyright (c) 2006-2010 Voltaire, Inc. All rights reserved.
 Copyright (c) 2006-2017 Sandia National Laboratories. All rights reserved.
 Copyright (c) 2006-2010 Sun Microsystems, Inc. All rights reserved.
@@ -25,7 +25,7 @@ Copyright (c) 2006-2010 Sun Microsystems, Inc. All rights reserved.
 Copyright (c) 2006-2017 The University of Houston. All rights reserved.
 Copyright (c) 2006-2009 Myricom, Inc. All rights reserved.
 Copyright (c) 2007-2017 UT-Battelle, LLC. All rights reserved.
-Copyright (c) 2007-2017 IBM Corporation. All rights reserved.
+Copyright (c) 2007-2020 IBM Corporation. All rights reserved.
 Copyright (c) 1998-2005 Forschungszentrum Juelich, Juelich Supercomputing
                         Centre, Federal Republic of Germany
 Copyright (c) 2005-2008 ZIH, TU Dresden, Federal Republic of Germany
@@ -45,7 +45,7 @@ Copyright (c) 2016 ARM, Inc. All rights reserved.
 Copyright (c) 2010-2011 Alex Brick <bricka@ccs.neu.edu>. All rights reserved.
 Copyright (c) 2012      The University of Wisconsin-La Crosse. All rights
                         reserved.
-Copyright (c) 2013-2016 Intel, Inc. All rights reserved.
+Copyright (c) 2013-2020 Intel, Inc. All rights reserved.
 Copyright (c) 2011-2017 NVIDIA Corporation. All rights reserved.
 Copyright (c) 2016      Broadcom Limited. All rights reserved.
 Copyright (c) 2011-2017 Fujitsu Limited. All rights reserved.
@@ -56,7 +56,8 @@ Copyright (c) 2013-2017 Research Organization for Information Science (RIST).
 Copyright (c) 2017-2020 Amazon.com, Inc. or its affiliates. All Rights
                         reserved.
 Copyright (c) 2018      DataDirect Networks. All rights reserved.
-Copyright (c) 2018-2019 Triad National Security, LLC. All rights reserved.
+Copyright (c) 2018-2020 Triad National Security, LLC. All rights reserved.
+Copyright (c) 2020      Google, LLC. All rights reserved.
 
 $COPYRIGHT$
 

@@ -24,7 +24,7 @@
 
 SUBDIRS = config contrib 3rd-party $(MCA_PROJECT_SUBDIRS) test
 DIST_SUBDIRS = config contrib 3rd-party $(MCA_PROJECT_DIST_SUBDIRS) test
-EXTRA_DIST = README INSTALL VERSION Doxyfile LICENSE autogen.pl README.JAVA.txt AUTHORS
+EXTRA_DIST = README.md INSTALL VERSION Doxyfile LICENSE autogen.pl README.JAVA.md AUTHORS
 
 include examples/Makefile.include
 
README (2243 lines deleted)
The diff for this file is not shown because of its large size.

README.JAVA.md (new file, 281 lines)
@@ -0,0 +1,281 @@

# Open MPI Java Bindings

## Important note

JAVA BINDINGS ARE PROVIDED ON A "PROVISIONAL" BASIS - I.E., THEY ARE
NOT PART OF THE CURRENT OR PROPOSED MPI STANDARDS.  THUS, INCLUSION OF
JAVA SUPPORT IS NOT REQUIRED BY THE STANDARD.  CONTINUED INCLUSION OF
THE JAVA BINDINGS IS CONTINGENT UPON ACTIVE USER INTEREST AND
CONTINUED DEVELOPER SUPPORT.

## Overview

This version of Open MPI provides support for Java-based
MPI applications.

The rest of this document provides step-by-step instructions on
building OMPI with Java bindings, and compiling and running Java-based
MPI applications.  Also, part of the functionality is explained with
examples.  Further details about the design, implementation and usage
of Java bindings in Open MPI can be found in [1].  The bindings follow
a JNI approach, that is, we do not provide a pure Java implementation
of MPI primitives, but a thin layer on top of the C
implementation.  This is the same approach as in mpiJava [2]; in fact,
mpiJava was taken as a starting point for Open MPI Java bindings, but
they were later totally rewritten.

1. O. Vega-Gisbert, J. E. Roman, and J. M. Squyres. "Design and
   implementation of Java bindings in Open MPI". Parallel Comput.
   59: 1-20 (2016).
2. M. Baker et al. "mpiJava: An object-oriented Java interface to
   MPI". In Parallel and Distributed Processing, LNCS vol. 1586,
   pp. 748-762, Springer (1999).

## Building Java Bindings

If this software was obtained as a developer-level checkout as opposed
to a tarball, you will need to start your build by running
`./autogen.pl`.  This will also require that you have a fairly recent
version of GNU Autotools on your system - see the HACKING.md file for
details.

Java support requires that Open MPI be built at least with shared libraries
(i.e., `--enable-shared`) - any additional options are fine and will not
conflict.  Note that this is the default for Open MPI, so you don't
have to explicitly add the option.  The Java bindings will build only
if `--enable-mpi-java` is specified, and a JDK is found in a typical
system default location.

If the JDK is not in a place where we automatically find it, you can
specify the location.  For example, this is required on the Mac
platform as the JDK headers are located in a non-typical location.  Two
options are available for this purpose:

1. `--with-jdk-bindir=<foo>`: the location of `javac` and `javah`
1. `--with-jdk-headers=<bar>`: the directory containing `jni.h`

For simplicity, typical configurations are provided in platform files
under `contrib/platform/hadoop`.  These will meet the needs of most
users, or at least provide a starting point for your own custom
configuration.

In summary, therefore, you can configure the system using the
following Java-related options:

```
$ ./configure --with-platform=contrib/platform/hadoop/<your-platform> ...
```

or

```
$ ./configure --enable-mpi-java --with-jdk-bindir=<foo> --with-jdk-headers=<bar> ...
```

or simply

```
$ ./configure --enable-mpi-java ...
```

if the JDK is in a "standard" place that we automatically find.

## Running Java Applications

For convenience, the `mpijavac` wrapper compiler has been provided for
compiling Java-based MPI applications.  It ensures that all required MPI
libraries and class paths are defined.  You can see the actual command
line using the `--showme` option, if you are interested.
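
For example, compiling a simple program might look like the following
sketch; `Hello.java` is just a placeholder source file name:

```
# Compile an MPI-enabled Java source file with the wrapper compiler
shell$ mpijavac Hello.java
# Show the underlying javac command line without running it
shell$ mpijavac Hello.java --showme
```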

Once your application has been compiled, you can run it with the
standard `mpirun` command line:

```
$ mpirun <options> java <your-java-options> <my-app>
```

For convenience, `mpirun` has been updated to detect the `java` command
and ensure that the required MPI libraries and class paths are defined
to support execution.  You therefore do _NOT_ need to specify the Java
library path to the MPI installation, nor the MPI classpath.  Any class
path definitions required for your application should be specified
either on the command line or via the `CLASSPATH` environment
variable.  Note that the local directory will be added to the class
path if nothing is specified.

As always, the `java` executable, all required libraries, and your
application classes must be available on all nodes.

## Basic usage of Java bindings

There is an MPI package that contains all classes of the MPI Java
bindings: `Comm`, `Datatype`, `Request`, etc.  These classes have a
direct correspondence with classes defined by the MPI standard.  MPI
primitives are just methods included in these classes.  The convention
used for naming Java methods and classes is the usual camel-case
convention, e.g., the equivalent of `MPI_File_set_info(fh,info)` is
`fh.setInfo(info)`, where `fh` is an object of the class `File`.

Apart from classes, the MPI package contains predefined public
attributes under a convenience class `MPI`.  Examples are the
predefined communicator `MPI.COMM_WORLD` or predefined datatypes such
as `MPI.DOUBLE`.  Also, MPI initialization and finalization are methods
of the `MPI` class and must be invoked by all MPI Java
applications.  The following example illustrates these concepts:

```java
import mpi.*;

class ComputePi {

    public static void main(String args[]) throws MPIException {

        MPI.Init(args);

        int rank = MPI.COMM_WORLD.getRank(),
            size = MPI.COMM_WORLD.getSize(),
            nint = 100; // Intervals.
        double h = 1.0/(double)nint, sum = 0.0;

        for(int i=rank+1; i<=nint; i+=size) {
            double x = h * ((double)i - 0.5);
            sum += (4.0 / (1.0 + x * x));
        }

        double sBuf[] = { h * sum },
               rBuf[] = new double[1];

        MPI.COMM_WORLD.reduce(sBuf, rBuf, 1, MPI.DOUBLE, MPI.SUM, 0);

        if(rank == 0) System.out.println("PI: " + rBuf[0]);
        MPI.Finalize();
    }
}
```

## Exception handling

Java bindings in Open MPI support exception handling.  By default, errors
are fatal, but this behavior can be changed.  The Java API will throw
exceptions if the `MPI.ERRORS_RETURN` error handler is set:

```java
MPI.COMM_WORLD.setErrhandler(MPI.ERRORS_RETURN);
```

If you add this statement to your program, it will show the line
where it breaks, instead of just crashing in case of an error.
Error-handling code can be separated from main application code by
means of try-catch blocks, for instance:

```java
try
{
    File file = new File(MPI.COMM_SELF, "filename", MPI.MODE_RDONLY);
}
catch(MPIException ex)
{
    System.err.println("Error Message: "+ ex.getMessage());
    System.err.println("  Error Class: "+ ex.getErrorClass());
    ex.printStackTrace();
    System.exit(-1);
}
```

## How to specify buffers

In MPI primitives that require a buffer (either send or receive) the
Java API admits a Java array.  Since Java arrays can be relocated by
the Java runtime environment, the MPI Java bindings need to make a
copy of the contents of the array to a temporary buffer, then pass the
pointer to this buffer to the underlying C implementation.  From the
practical point of view, this implies an overhead associated with all
buffers that are represented by Java arrays.  The overhead is small
for small buffers but increases for large arrays.

There is a pool of temporary buffers with a default capacity of 64K.
If a temporary buffer of 64K or less is needed, then the buffer will
be obtained from the pool.  But if the buffer is larger, then it will
be necessary to allocate the buffer and free it later.

The default capacity of pool buffers can be modified with an Open MPI
MCA parameter:

```
shell$ mpirun --mca mpi_java_eager size ...
```

where `size` is the number of bytes, or kilobytes if it ends with 'k',
or megabytes if it ends with 'm'.
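
For instance, a hypothetical run that raises the pool buffer capacity
to 4 megabytes might look like this; the process count and application
name are placeholders:

```
# Example only: larger temporary-buffer pool for a Java MPI job
shell$ mpirun --mca mpi_java_eager 4m -np 2 java MyMPIApp
```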

An alternative is to use "direct buffers" provided by standard classes
available in the Java SDK such as `ByteBuffer`.  For convenience we
provide a few static methods `new[Type]Buffer` in the `MPI` class to
create direct buffers for a number of basic datatypes.  Elements of the
direct buffer can be accessed with methods `put()` and `get()`, and
the number of elements in the buffer can be obtained with the method
`capacity()`.  This example illustrates its use:

```java
int myself = MPI.COMM_WORLD.getRank();
int tasks = MPI.COMM_WORLD.getSize();

IntBuffer in = MPI.newIntBuffer(MAXLEN * tasks),
          out = MPI.newIntBuffer(MAXLEN);

for(int i = 0; i < MAXLEN; i++)
    out.put(i, myself); // fill the buffer with the rank

Request request = MPI.COMM_WORLD.iAllGather(
                  out, MAXLEN, MPI.INT, in, MAXLEN, MPI.INT);
request.waitFor();
request.free();

for(int i = 0; i < tasks; i++)
{
    for(int k = 0; k < MAXLEN; k++)
    {
        if(in.get(k + i * MAXLEN) != i)
            throw new AssertionError("Unexpected value");
    }
}
```

Direct buffers are available for: `BYTE`, `CHAR`, `SHORT`, `INT`,
`LONG`, `FLOAT`, and `DOUBLE`.  There is no direct buffer for booleans.

Direct buffers are not a replacement for arrays, because they have
higher allocation and deallocation costs than arrays.  In some
cases arrays will be a better choice.  You can easily convert a
buffer into an array and vice versa.

All non-blocking methods must use direct buffers and only
blocking methods can choose between arrays and direct buffers.

The above example also illustrates that it is necessary to call
the `free()` method on objects whose class implements the `Freeable`
interface.  Otherwise a memory leak is produced.

## Specifying offsets in buffers

In a C program, it is common to specify an offset in an array with
`&array[i]` or `array+i`, for instance to send data starting from
a given position in the array.  The equivalent form in the Java bindings
is to `slice()` the buffer to start at an offset.  Making a `slice()`
on a buffer is only necessary when the offset is not zero.  Slices
work for both arrays and direct buffers.

```java
import static mpi.MPI.slice;
// ...
int numbers[] = new int[SIZE];
// ...
MPI.COMM_WORLD.send(slice(numbers, offset), count, MPI.INT, 1, 0);
```

## Questions? Problems?

If you have any problems, or find any bugs, please feel free to report
them to the [Open MPI user's mailing
list](https://www.open-mpi.org/community/lists/ompi.php).

README.JAVA.txt (275 lines deleted)
@@ -1,275 +0,0 @@
***************************************************************************
|
|
||||||
IMPORTANT NOTE
|
|
||||||
|
|
||||||
JAVA BINDINGS ARE PROVIDED ON A "PROVISIONAL" BASIS - I.E., THEY ARE
|
|
||||||
NOT PART OF THE CURRENT OR PROPOSED MPI STANDARDS. THUS, INCLUSION OF
|
|
||||||
JAVA SUPPORT IS NOT REQUIRED BY THE STANDARD. CONTINUED INCLUSION OF
|
|
||||||
THE JAVA BINDINGS IS CONTINGENT UPON ACTIVE USER INTEREST AND
|
|
||||||
CONTINUED DEVELOPER SUPPORT.
|
|
||||||
|
|
||||||
***************************************************************************
|
|
||||||
|
|
||||||
This version of Open MPI provides support for Java-based
|
|
||||||
MPI applications.
|
|
||||||
|
|
||||||
The rest of this document provides step-by-step instructions on
|
|
||||||
building OMPI with Java bindings, and compiling and running
|
|
||||||
Java-based MPI applications. Also, part of the functionality is
|
|
||||||
explained with examples. Further details about the design,
|
|
||||||
implementation and usage of Java bindings in Open MPI can be found
|
|
||||||
in [1]. The bindings follow a JNI approach, that is, we do not
|
|
||||||
provide a pure Java implementation of MPI primitives, but a thin
|
|
||||||
layer on top of the C implementation. This is the same approach
|
|
||||||
as in mpiJava [2]; in fact, mpiJava was taken as a starting point
|
|
||||||
for Open MPI Java bindings, but they were later totally rewritten.
|
|
||||||
|
|
||||||
[1] O. Vega-Gisbert, J. E. Roman, and J. M. Squyres. "Design and
|
|
||||||
implementation of Java bindings in Open MPI". Parallel Comput.
|
|
||||||
59: 1-20 (2016).
|
|
||||||
|
|
||||||
[2] M. Baker et al. "mpiJava: An object-oriented Java interface to
|
|
||||||
MPI". In Parallel and Distributed Processing, LNCS vol. 1586,
|
|
||||||
pp. 748-762, Springer (1999).
|
|
||||||
|
|
||||||
============================================================================
|
|
||||||
|
|
||||||
Building Java Bindings
|
|
||||||
|
|
||||||
If this software was obtained as a developer-level
|
|
||||||
checkout as opposed to a tarball, you will need to start your build by
|
|
||||||
running ./autogen.pl. This will also require that you have a fairly
|
|
||||||
recent version of autotools on your system - see the HACKING file for
|
|
||||||
details.
|
|
||||||
|
|
||||||
Java support requires that Open MPI be built at least with shared libraries
|
|
||||||
(i.e., --enable-shared) - any additional options are fine and will not
|
|
||||||
conflict. Note that this is the default for Open MPI, so you don't
|
|
||||||
have to explicitly add the option. The Java bindings will build only
|
|
||||||
if --enable-mpi-java is specified, and a JDK is found in a typical
|
|
||||||
system default location.
|
|
||||||
|
|
||||||
If the JDK is not in a place where we automatically find it, you can
|
|
||||||
specify the location. For example, this is required on the Mac
|
|
||||||
platform as the JDK headers are located in a non-typical location. Two
|
|
||||||
options are available for this purpose:
|
|
||||||
|
|
||||||
--with-jdk-bindir=<foo> - the location of javac and javah
|
|
||||||
--with-jdk-headers=<bar> - the directory containing jni.h
|
|
||||||
|
|
||||||
For simplicity, typical configurations are provided in platform files
|
|
||||||
under contrib/platform/hadoop. These will meet the needs of most
|
|
||||||
users, or at least provide a starting point for your own custom
|
|
||||||
configuration.
|
|
||||||
|
|
||||||
In summary, therefore, you can configure the system using the
|
|
||||||
following Java-related options:
|
|
||||||
|
|
||||||
$ ./configure --with-platform=contrib/platform/hadoop/<your-platform>
|
|
||||||
...
|
|
||||||
|
|
||||||
or
|
|
||||||
|
|
||||||
$ ./configure --enable-mpi-java --with-jdk-bindir=<foo>
|
|
||||||
--with-jdk-headers=<bar> ...
|
|
||||||
|
|
||||||
or simply
|
|
||||||
|
|
||||||
$ ./configure --enable-mpi-java ...
|
|
||||||
|
|
||||||
if JDK is in a "standard" place that we automatically find.
|
|
||||||
|
|
||||||
----------------------------------------------------------------------------
|
|
||||||
|
|
||||||
Running Java Applications
|
|
||||||
|
|
||||||
For convenience, the "mpijavac" wrapper compiler has been provided for
|
|
||||||
compiling Java-based MPI applications. It ensures that all required MPI
|
|
||||||
libraries and class paths are defined. You can see the actual command
|
|
||||||
line using the --showme option, if you are interested.
|
|
||||||
|
|
||||||
Once your application has been compiled, you can run it with the
|
|
||||||
standard "mpirun" command line:
|
|
||||||
|
|
||||||
$ mpirun <options> java <your-java-options> <my-app>
|
|
||||||
|
|
||||||
For convenience, mpirun has been updated to detect the "java" command
|
|
||||||
and ensure that the required MPI libraries and class paths are defined
|
|
||||||
to support execution. You therefore do NOT need to specify the Java
|
|
||||||
library path to the MPI installation, nor the MPI classpath. Any class
|
|
||||||
path definitions required for your application should be specified
|
|
||||||
either on the command line or via the CLASSPATH environmental
|
|
||||||
variable. Note that the local directory will be added to the class
|
|
||||||
path if nothing is specified.
|
|
||||||
|
|
||||||
As always, the "java" executable, all required libraries, and your
|
|
||||||
application classes must be available on all nodes.
|
|
||||||
|
|
||||||
----------------------------------------------------------------------------
|
|
||||||
|
|
||||||
Basic usage of Java bindings
|
|
||||||
|
|
||||||
There is an MPI package that contains all classes of the MPI Java
|
|
||||||
bindings: Comm, Datatype, Request, etc. These classes have a direct
|
|
||||||
correspondence with classes defined by the MPI standard. MPI primitives
|
|
||||||
are just methods included in these classes. The convention used for
|
|
||||||
naming Java methods and classes is the usual camel-case convention,
|
|
||||||
e.g., the equivalent of MPI_File_set_info(fh,info) is fh.setInfo(info),
|
|
||||||
where fh is an object of the class File.
|
|
||||||
|
|
||||||
Apart from classes, the MPI package contains predefined public attributes
|
|
||||||
under a convenience class MPI. Examples are the predefined communicator
|
|
||||||
MPI.COMM_WORLD or predefined datatypes such as MPI.DOUBLE. Also, MPI
|
|
||||||
initialization and finalization are methods of the MPI class and must
|
|
||||||
be invoked by all MPI Java applications. The following example illustrates
|
|
||||||
these concepts:
|
|
||||||
|
|
||||||
import mpi.*;
|
|
||||||
|
|
||||||
class ComputePi {
|
|
||||||
|
|
||||||
public static void main(String args[]) throws MPIException {
|
|
||||||
|
|
||||||
MPI.Init(args);
|
|
||||||
|
|
||||||
int rank = MPI.COMM_WORLD.getRank(),
|
|
||||||
size = MPI.COMM_WORLD.getSize(),
|
|
||||||
nint = 100; // Intervals.
|
|
||||||
double h = 1.0/(double)nint, sum = 0.0;
|
|
||||||
|
|
||||||
for(int i=rank+1; i<=nint; i+=size) {
|
|
||||||
double x = h * ((double)i - 0.5);
|
|
||||||
sum += (4.0 / (1.0 + x * x));
|
|
||||||
}
|
|
||||||
|
|
||||||
double sBuf[] = { h * sum },
|
|
||||||
rBuf[] = new double[1];
|
|
||||||
|
|
||||||
MPI.COMM_WORLD.reduce(sBuf, rBuf, 1, MPI.DOUBLE, MPI.SUM, 0);
|
|
||||||
|
|
||||||
if(rank == 0) System.out.println("PI: " + rBuf[0]);
|
|
||||||
MPI.Finalize();
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
----------------------------------------------------------------------------
|
|
||||||
|
|
||||||
Exception handling
|
|
||||||
|
|
||||||
Java bindings in Open MPI support exception handling. By default, errors
|
|
||||||
are fatal, but this behavior can be changed. The Java API will throw
|
|
||||||
exceptions if the MPI.ERRORS_RETURN error handler is set:
|
|
||||||
|
|
||||||
MPI.COMM_WORLD.setErrhandler(MPI.ERRORS_RETURN);
|
|
||||||
|
|
||||||
If you add this statement to your program, it will show the line
|
|
||||||
where it breaks, instead of just crashing in case of an error.
|
|
||||||
Error-handling code can be separated from main application code by
|
|
||||||
means of try-catch blocks, for instance:
|
|
||||||
|
|
||||||
try
|
|
||||||
{
|
|
||||||
File file = new File(MPI.COMM_SELF, "filename", MPI.MODE_RDONLY);
|
|
||||||
}
|
|
||||||
catch(MPIException ex)
|
|
||||||
{
|
|
||||||
System.err.println("Error Message: "+ ex.getMessage());
|
|
||||||
System.err.println(" Error Class: "+ ex.getErrorClass());
|
|
||||||
ex.printStackTrace();
|
|
||||||
System.exit(-1);
|
|
||||||
}
|
|
||||||
|
|
||||||
|
|
||||||
----------------------------------------------------------------------------
|
|
||||||
|
|
||||||
How to specify buffers
|
|
||||||
|
|
||||||
In MPI primitives that require a buffer (either send or receive) the
|
|
||||||
Java API admits a Java array. Since Java arrays can be relocated by
|
|
||||||
the Java runtime environment, the MPI Java bindings need to make a
|
|
||||||
copy of the contents of the array to a temporary buffer, then pass the
|
|
||||||
pointer to this buffer to the underlying C implementation. From the
|
|
||||||
practical point of view, this implies an overhead associated to all
|
|
||||||
buffers that are represented by Java arrays. The overhead is small
|
|
||||||
for small buffers but increases for large arrays.
|
|
||||||
|
|
||||||
There is a pool of temporary buffers with a default capacity of 64K.
|
|
||||||
If a temporary buffer of 64K or less is needed, then the buffer will
|
|
||||||
be obtained from the pool. But if the buffer is larger, then it will
|
|
||||||
be necessary to allocate the buffer and free it later.
|
|
||||||
|
|
||||||
The default capacity of pool buffers can be modified with an 'mca'
|
|
||||||
parameter:
|
|
||||||
|
|
||||||
mpirun --mca mpi_java_eager size ...
|
|
||||||
|
|
||||||
Where 'size' is the number of bytes, or kilobytes if it ends with 'k',
|
|
||||||
or megabytes if it ends with 'm'.
|
|
||||||
|
|
||||||
An alternative is to use "direct buffers" provided by standard
classes available in the Java SDK such as ByteBuffer.  For convenience
we provide a few static methods "new[Type]Buffer" in the MPI class
to create direct buffers for a number of basic datatypes.  Elements
of the direct buffer can be accessed with methods put() and get(),
and the number of elements in the buffer can be obtained with the
method capacity().  This example illustrates its use:

    int myself = MPI.COMM_WORLD.getRank();
    int tasks = MPI.COMM_WORLD.getSize();

    IntBuffer in = MPI.newIntBuffer(MAXLEN * tasks),
              out = MPI.newIntBuffer(MAXLEN);

    for(int i = 0; i < MAXLEN; i++)
        out.put(i, myself);  // fill the buffer with the rank

    Request request = MPI.COMM_WORLD.iAllGather(
                      out, MAXLEN, MPI.INT, in, MAXLEN, MPI.INT);
    request.waitFor();
    request.free();

    for(int i = 0; i < tasks; i++)
    {
        for(int k = 0; k < MAXLEN; k++)
        {
            if(in.get(k + i * MAXLEN) != i)
                throw new AssertionError("Unexpected value");
        }
    }

Direct buffers are available for: BYTE, CHAR, SHORT, INT, LONG,
FLOAT, and DOUBLE.  There is no direct buffer for booleans.

Direct buffers are not a replacement for arrays, because they have
higher allocation and deallocation costs than arrays.  In some
cases arrays will be a better choice.  You can easily convert a
buffer into an array and vice versa.

All non-blocking methods must use direct buffers; only blocking
methods can choose between arrays and direct buffers.

The above example also illustrates that it is necessary to call
the free() method on objects whose class implements the Freeable
interface.  Otherwise a memory leak is produced.

----------------------------------------------------------------------------

Specifying offsets in buffers

In a C program, it is common to specify an offset in an array with
"&array[i]" or "array+i", for instance to send data starting from
a given position in the array.  The equivalent form in the Java
bindings is to "slice()" the buffer to start at an offset.  Making a
"slice()" on a buffer is only necessary when the offset is not zero.
Slices work for both arrays and direct buffers.

    import static mpi.MPI.slice;
    ...
    int numbers[] = new int[SIZE];
    ...
    MPI.COMM_WORLD.send(slice(numbers, offset), count, MPI.INT, 1, 0);

----------------------------------------------------------------------------

If you have any problems, or find any bugs, please feel free to report
them to the Open MPI user's mailing list (see
https://www.open-mpi.org/community/lists/ompi.php).

README.md: new file, 2191 lines (file diff not shown because of its size).

@@ -64,7 +64,7 @@ EXTRA_DIST = \
 	platform/lanl/cray_xc_cle5.2/optimized-common \
 	platform/lanl/cray_xc_cle5.2/optimized-lustre \
 	platform/lanl/cray_xc_cle5.2/optimized-lustre.conf \
-	platform/lanl/toss/README \
+	platform/lanl/toss/README.md \
 	platform/lanl/toss/common \
 	platform/lanl/toss/common-optimized \
 	platform/lanl/toss/cray-lustre-optimized \
@@ -1,121 +1,108 @@
+# Description

 2 Feb 2011

-Description
-===========
-
-This sample "tcp2" BTL component is a simple example of how to build
+This sample `tcp2` BTL component is a simple example of how to build
 an Open MPI MCA component from outside of the Open MPI source tree.
 This is a valuable technique for 3rd parties who want to provide their
 own components for Open MPI, but do not want to be in the mainstream
 distribution (i.e., their code is not part of the main Open MPI code
 base).

-NOTE: We do recommend that 3rd party developers investigate using a
-      DVCS such as Mercurial or Git to keep up with Open MPI
-      development.  Using a DVCS allows you to host your component in
-      your own copy of the Open MPI source tree, and yet still keep up
-      with development changes, stable releases, etc.
-
 Previous colloquial knowledge held that building a component from
 outside of the Open MPI source tree required configuring Open MPI
---with-devel-headers, and then building and installing it.  This
-configure switch installs all of OMPI's internal .h files under
-$prefix/include/openmpi, and therefore allows 3rd party code to be
+`--with-devel-headers`, and then building and installing it.  This
+configure switch installs all of OMPI's internal `.h` files under
+`$prefix/include/openmpi`, and therefore allows 3rd party code to be
 compiled outside of the Open MPI tree.

 This method definitely works, but is annoying:

 * You have to ask users to use this special configure switch.
 * Not all users install from source; many get binary packages (e.g.,
   RPMs).

 This example package shows two ways to build an Open MPI MCA component
 from outside the Open MPI source tree:

-1. Using the above --with-devel-headers technique
+1. Using the above `--with-devel-headers` technique
 2. Compiling against the Open MPI source tree itself (vs. the
    installation tree)

 The user still has to have a source tree, but at least they don't have
-to be required to use --with-devel-headers (which most users don't) --
+to be required to use `--with-devel-headers` (which most users don't) --
 they can likely build off the source tree that they already used.

-Example project contents
-========================
+# Example project contents

-The "tcp2" component is a direct copy of the TCP BTL as of January
+The `tcp2` component is a direct copy of the TCP BTL as of January
 2011 -- it has just been renamed so that it can be built separately
 and installed alongside the real TCP BTL component.

 Most of the mojo for both methods is handled in the example
-components' configure.ac, but the same techniques are applicable
+components' `configure.ac`, but the same techniques are applicable
 outside of the GNU Auto toolchain.

-This sample "tcp2" component has an autogen.sh script that requires
+This sample `tcp2` component has an `autogen.sh` script that requires
 the normal Autoconf, Automake, and Libtool.  It also adds the
 following two configure switches:

---with-openmpi-install=DIR
-
-   If provided, DIR is an Open MPI installation tree that was
-   installed --with-devel-headers.
-
-   This switch uses the installed mpicc --showme:<foo> functionality
-   to extract the relevant CPPFLAGS, LDFLAGS, and LIBS.
-
---with-openmpi-source=DIR
-
-   If provided, DIR is the source of a configured and built Open MPI
+1. `--with-openmpi-install=DIR`:
+   If provided, `DIR` is an Open MPI installation tree that was
+   installed `--with-devel-headers`.
+
+   This switch uses the installed `mpicc --showme:<foo>` functionality
+   to extract the relevant `CPPFLAGS`, `LDFLAGS`, and `LIBS`.
+
+1. `--with-openmpi-source=DIR`:
+   If provided, `DIR` is the source of a configured and built Open MPI
    source tree (corresponding to the version expected by the example
   component).  The source tree is not required to have been
-   configured --with-devel-headers.
+   configured `--with-devel-headers`.

-   This switch uses the source tree's config.status script to extract
-   the relevant CPPFLAGS and CFLAGS.
+   This switch uses the source tree's `config.status` script to
+   extract the relevant `CPPFLAGS` and `CFLAGS`.

 Either one of these two switches must be provided, or appropriate
-CPPFLAGS, CFLAGS, LDFLAGS, and/or LIBS must be provided such that
-valid Open MPI header and library files can be found and compiled /
-linked against, respectively.
+`CPPFLAGS`, `CFLAGS`, `LDFLAGS`, and/or `LIBS` must be provided such
+that valid Open MPI header and library files can be found and compiled
+/ linked against, respectively.

-Example use
-===========
+# Example use

 First, download, build, and install Open MPI:

------
+```
 $ cd $HOME
-$ wget \
-    https://www.open-mpi.org/software/ompi/vX.Y/downloads/openmpi-X.Y.Z.tar.bz2
-[lots of output]
+$ wget https://www.open-mpi.org/software/ompi/vX.Y/downloads/openmpi-X.Y.Z.tar.bz2
+[...lots of output...]
 $ tar jxf openmpi-X.Y.Z.tar.bz2
 $ cd openmpi-X.Y.Z
 $ ./configure --prefix=/opt/openmpi ...
-[lots of output]
+[...lots of output...]
 $ make -j 4 install
-[lots of output]
+[...lots of output...]
 $ /opt/openmpi/bin/ompi_info | grep btl
          MCA btl: self (MCA vA.B, API vM.N, Component vX.Y.Z)
          MCA btl: sm (MCA vA.B, API vM.N, Component vX.Y.Z)
          MCA btl: tcp (MCA vA.B, API vM.N, Component vX.Y.Z)
 [where X.Y.Z, A.B, and M.N are appropriate for your version of Open MPI]
 $
------
+```

-Notice the installed BTLs from ompi_info.
+Notice the installed BTLs from `ompi_info`.

-Now cd into this example project and build it, pointing it to the
+Now `cd` into this example project and build it, pointing it to the
 source directory of the Open MPI that you just built.  Note that we
-use the same --prefix as when installing Open MPI (so that the built
+use the same `--prefix` as when installing Open MPI (so that the built
 component will be installed into the Right place):

------
+```
 $ cd /path/to/this/sample
 $ ./autogen.sh
 $ ./configure --prefix=/opt/openmpi --with-openmpi-source=$HOME/openmpi-X.Y.Z
-[lots of output]
+[...lots of output...]
 $ make -j 4 install
-[lots of output]
+[...lots of output...]
 $ /opt/openmpi/bin/ompi_info | grep btl
          MCA btl: self (MCA vA.B, API vM.N, Component vX.Y.Z)
          MCA btl: sm (MCA vA.B, API vM.N, Component vX.Y.Z)
@@ -123,12 +110,11 @@ $ /opt/openmpi/bin/ompi_info | grep btl
          MCA btl: tcp2 (MCA vA.B, API vM.N, Component vX.Y.Z)
 [where X.Y.Z, A.B, and M.N are appropriate for your version of Open MPI]
 $
------
+```

-Notice that the "tcp2" BTL is now installed.
+Notice that the `tcp2` BTL is now installed.

-Random notes
-============
+# Random notes

 The component in this project is just an example; I whipped it up in
 the span of several hours.  Your component may be a bit more complex
@@ -139,17 +125,15 @@ what you need.
 Changes required to the component to make it build in a standalone
 mode:

-1. Write your own configure script.  This component is just a sample.
-   You basically need to build against an OMPI install that was
-   installed --with-devel-headers or a built OMPI source tree.  See
-   ./configure --help for details.
-
-2. I also provided a bogus btl_tcp2_config.h (generated by configure).
-   This file is not included anywhere, but it does provide protection
-   against re-defined PACKAGE_* macros when running configure, which
-   is quite annoying.
-
-3. Modify Makefile.am to only build DSOs.  I.e., you can optionally
+1. Write your own `configure` script.  This component is just a
+   sample.  You basically need to build against an OMPI install that
+   was installed `--with-devel-headers` or a built OMPI source tree.
+   See `./configure --help` for details.
+1. I also provided a bogus `btl_tcp2_config.h` (generated by
+   `configure`).  This file is not included anywhere, but it does
+   provide protection against re-defined `PACKAGE_*` macros when
+   running `configure`, which is quite annoying.
+1. Modify `Makefile.am` to only build DSOs.  I.e., you can optionally
    take the static option out since the component can *only* build in
    DSO mode when building standalone.  That being said, it doesn't
    hurt to leave the static builds in -- this would (hypothetically)
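The `mpicc --showme:<foo>` functionality referred to in the configure
switches above can also be exercised by hand.  A minimal sketch,
assuming an Open MPI installation that was built `--with-devel-headers`
and whose `mpicc` is in `$PATH` (this is an illustration, not the
example component's actual `configure.ac` logic):

```
# Ask the Open MPI wrapper compiler which flags it would use, then
# pass them to a standalone component's configure script.
CPPFLAGS="$(mpicc --showme:compile)"
LIBS="$(mpicc --showme:link)"
./configure --prefix=/opt/openmpi CPPFLAGS="$CPPFLAGS" LIBS="$LIBS"
```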
contrib/dist/linux/README (vendored, 105 lines): deleted
@@ -1,105 +0,0 @@
Copyright (c) 2004-2006 The Trustees of Indiana University and Indiana
                        University Research and Technology
                        Corporation.  All rights reserved.
Copyright (c) 2004-2006 The University of Tennessee and The University
                        of Tennessee Research Foundation.  All rights
                        reserved.
Copyright (c) 2004-2006 High Performance Computing Center Stuttgart,
                        University of Stuttgart.  All rights reserved.
Copyright (c) 2004-2006 The Regents of the University of California.
                        All rights reserved.
Copyright (c) 2006-2016 Cisco Systems, Inc.  All rights reserved.
$COPYRIGHT$

Additional copyrights may follow

$HEADER$

===========================================================================

Note that you probably want to download the latest release of the SRPM
for any given Open MPI version.  The SRPM release number is the
version after the dash in the SRPM filename.  For example,
"openmpi-1.6.3-2.src.rpm" is the 2nd release of the SRPM for Open MPI
v1.6.3.  Subsequent releases of SRPMs typically contain bug fixes for
the RPM packaging, but not Open MPI itself.

The buildrpm.sh script takes a single mandatory argument -- a filename
pointing to an Open MPI tarball (may be either .gz or .bz2).  It will
create one or more RPMs from this tarball:

1. Source RPM
2. "All in one" RPM, where all of Open MPI is put into a single RPM.
3. "Multiple" RPM, where Open MPI is split into several sub-package
   RPMs:
   - openmpi-runtime
   - openmpi-devel
   - openmpi-docs

The following arguments can be used to affect the script's behaviour.
Please do NOT set the same settings with both parameters and config vars.

-b
   If you specify this option, only the all-in-one binary RPM will
   be built.  By default, only the source RPM (SRPM) is built.  Other
   parameters that affect the all-in-one binary RPM will be ignored
   unless this option is specified.

-n name
   This option will change the name of the produced RPM to "name".
   It is useful together with the "-o" and "-m" options if you want to
   have multiple Open MPI versions installed simultaneously in the
   same environment.  Requires use of option "-b".

-o
   With this option the install path of the binary RPM will be changed
   to /opt/_NAME_/_VERSION_.  Requires use of option "-b".

-m
   This option causes the RPM to also install modulefiles to the
   location specified in the specfile.  Requires use of option "-b".

-i
   Also build a debuginfo RPM.  By default, the debuginfo RPM is not
   built.  Requires use of option "-b".

-f lf_location
   Include support for Libfabric.  "lf_location" is the Libfabric
   install path.  Requires use of option "-b".

-t tm_location
   Include support for Torque/PBS Pro.  "tm_location" is the path of
   the Torque/PBS Pro header files.  Requires use of option "-b".

-d
   Build with debugging support.  By default, the RPM is built without
   debugging support.

-c parameter
   Add a custom configure parameter.

-r parameter
   Add a custom RPM build parameter.

-s
   If specified, the script will try to unpack the openmpi.spec
   file from the tarball specified on the command line.  By default,
   the script will look for the specfile in the current directory.

-R directory
   Specifies the top-level RPM build directory.

-h
   Prints script usage information.


Target architecture is currently hard-coded at the beginning
of the buildrpm.sh script.

Alternatively, you can build directly from the openmpi.spec spec file
or from the SRPM.  Many options can be passed to the build process
via rpmbuild's --define option (there are older versions of rpmbuild
that do not seem to handle --define'd values properly in all cases,
but we generally don't care about those old versions of rpmbuild...).
The available options are described in the comments at the beginning
of the spec file in this directory.
contrib/dist/linux/README.md (vendored, new file, 88 lines)
@@ -0,0 +1,88 @@
# Open MPI Linux distribution helpers

Note that you probably want to download the latest release of the SRPM
for any given Open MPI version.  The SRPM release number is the
version after the dash in the SRPM filename.  For example,
`openmpi-1.6.3-2.src.rpm` is the 2nd release of the SRPM for Open MPI
v1.6.3.  Subsequent releases of SRPMs typically contain bug fixes for
the RPM packaging, but not Open MPI itself.

The `buildrpm.sh` script takes a single mandatory argument -- a
filename pointing to an Open MPI tarball (may be either `.gz` or
`.bz2`).  It will create one or more RPMs from this tarball:

1. Source RPM
1. "All in one" RPM, where all of Open MPI is put into a single RPM.
1. "Multiple" RPM, where Open MPI is split into several sub-package
   RPMs:
   * `openmpi-runtime`
   * `openmpi-devel`
   * `openmpi-docs`

The following arguments can be used to affect the script's behaviour
(a combined example is shown after this list).  Please do NOT set the
same settings with both parameters and config vars.

* `-b`:
  If you specify this option, only the all-in-one binary RPM will
  be built.  By default, only the source RPM (SRPM) is built.  Other
  parameters that affect the all-in-one binary RPM will be ignored
  unless this option is specified.

* `-n name`:
  This option will change the name of the produced RPM to "name".
  It is useful together with the `-o` and `-m` options if you want to
  have multiple Open MPI versions installed simultaneously in the same
  environment.  Requires use of option `-b`.

* `-o`:
  With this option the install path of the binary RPM will be changed
  to `/opt/_NAME_/_VERSION_`.  Requires use of option `-b`.

* `-m`:
  This option causes the RPM to also install modulefiles to the
  location specified in the specfile.  Requires use of option `-b`.

* `-i`:
  Also build a debuginfo RPM.  By default, the debuginfo RPM is not
  built.  Requires use of option `-b`.

* `-f lf_location`:
  Include support for Libfabric.  "lf_location" is the Libfabric
  install path.  Requires use of option `-b`.

* `-t tm_location`:
  Include support for Torque/PBS Pro.  "tm_location" is the path of
  the Torque/PBS Pro header files.  Requires use of option `-b`.

* `-d`:
  Build with debugging support.  By default, the RPM is built without
  debugging support.

* `-c parameter`:
  Add a custom configure parameter.

* `-r parameter`:
  Add a custom RPM build parameter.

* `-s`:
  If specified, the script will try to unpack the `openmpi.spec`
  file from the tarball specified on the command line.  By default,
  the script will look for the specfile in the current directory.

* `-R directory`:
  Specifies the top-level RPM build directory.

* `-h`:
  Prints script usage information.

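For illustration only (this exact command is not taken from the
script's own documentation, and the tarball name is a placeholder), a
run that combines several of the options above to build an all-in-one
binary RPM with modulefiles under an `/opt` prefix might look like:

```
shell$ ./buildrpm.sh -b -o -m -n openmpi-custom openmpi-X.Y.Z.tar.bz2
```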

Target architecture is currently hard-coded at the beginning
of the `buildrpm.sh` script.

Alternatively, you can build directly from the `openmpi.spec` spec
file or from the SRPM.  Many options can be passed to the build
process via `rpmbuild`'s `--define` option (there are older versions
of `rpmbuild` that do not seem to handle `--define`'d values properly
in all cases, but we generally don't care about those old versions of
`rpmbuild`...).  The available options are described in the comments
at the beginning of the spec file in this directory.
@@ -61,7 +61,7 @@ created.
 - copy of toss3-hfi-optimized.conf with the following changes:
 - change: comment "Add the interface for out-of-band communication and set
   it up" to "Set up the interface for out-of-band communication"
 - remove: oob_tcp_if_exclude = ib0
 - remove: btl (let Open MPI figure out what best to use for ethernet-
   connected hardware)
 - remove: btl_openib_want_fork_support (no infiniband)
@@ -33,7 +33,7 @@
 # Automake).

 EXTRA_DIST += \
-	examples/README \
+	examples/README.md \
 	examples/Makefile \
 	examples/hello_c.c \
 	examples/hello_mpifh.f \
@@ -1,67 +0,0 @@
Copyright (c) 2004-2006 The Trustees of Indiana University and Indiana
                        University Research and Technology
                        Corporation.  All rights reserved.
Copyright (c) 2006-2012 Cisco Systems, Inc.  All rights reserved.
Copyright (c) 2007-2009 Sun Microsystems, Inc.  All rights reserved.
Copyright (c) 2010      Oracle and/or its affiliates.  All rights reserved.
Copyright (c) 2013      Mellanox Technologies, Inc.  All rights reserved.

$COPYRIGHT$

The files in this directory are sample MPI applications provided both
as a trivial primer to MPI as well as simple tests to ensure that your
Open MPI installation is working properly.

If you are looking for a comprehensive MPI tutorial, these samples are
not enough.  Excellent MPI tutorials are available here:

    http://www.citutor.org/login.php

Get a free account and login; you can then browse to the list of
available courses.  Look for the ones with "MPI" in the title.

There are two MPI examples in this directory, each using one of six
different MPI interfaces:

- Hello world
    C:                   hello_c.c
    C++:                 hello_cxx.cc
    Fortran mpif.h:      hello_mpifh.f
    Fortran use mpi:     hello_usempi.f90
    Fortran use mpi_f08: hello_usempif08.f90
    Java:                Hello.java
    C shmem.h:           hello_oshmem_c.c
    Fortran shmem.fh:    hello_oshmemfh.f90

- Send a trivial message around in a ring
    C:                   ring_c.c
    C++:                 ring_cxx.cc
    Fortran mpif.h:      ring_mpifh.f
    Fortran use mpi:     ring_usempi.f90
    Fortran use mpi_f08: ring_usempif08.f90
    Java:                Ring.java
    C shmem.h:           ring_oshmem_c.c
    Fortran shmem.fh:    ring_oshmemfh.f90

Additionally, there's one further example application, but this one
only uses the MPI C bindings:

- Test the connectivity between all processes
    C:                   connectivity_c.c

The Makefile in this directory will build as many of the examples as
you have language support for (e.g., if you do not have the Fortran
"use mpi" bindings compiled as part of Open MPI, then those examples
will be skipped).

The Makefile assumes that the wrapper compilers mpicc, mpic++, and
mpifort are in your path.

Although the Makefile is tailored for Open MPI (e.g., it checks the
"ompi_info" command to see if you have support for C++, mpif.h, "use
mpi", and "use mpi_f08" F90), all of the example programs are pure
MPI, and therefore not specific to Open MPI.  Hence, you can use a
different MPI implementation to compile and run these programs if you
wish.

Make today an Open MPI day!
examples/README.md (new file, 66 lines)
@@ -0,0 +1,66 @@
The files in this directory are sample MPI applications provided both
as a trivial primer to MPI as well as simple tests to ensure that your
Open MPI installation is working properly.

If you are looking for a comprehensive MPI tutorial, these samples are
not enough.  [Excellent MPI tutorials are available
here](http://www.citutor.org/login.php).

Get a free account and login; you can then browse to the list of
available courses.  Look for the ones with "MPI" in the title.

There are two MPI examples in this directory, each using one of six
different MPI interfaces:

## Hello world

The MPI version of the canonical "hello world" program:

* C: `hello_c.c`
* C++: `hello_cxx.cc`
* Fortran mpif.h: `hello_mpifh.f`
* Fortran use mpi: `hello_usempi.f90`
* Fortran use mpi_f08: `hello_usempif08.f90`
* Java: `Hello.java`
* C shmem.h: `hello_oshmem_c.c`
* Fortran shmem.fh: `hello_oshmemfh.f90`

## Ring

Send a trivial message around in a ring:

* C: `ring_c.c`
* C++: `ring_cxx.cc`
* Fortran mpif.h: `ring_mpifh.f`
* Fortran use mpi: `ring_usempi.f90`
* Fortran use mpi_f08: `ring_usempif08.f90`
* Java: `Ring.java`
* C shmem.h: `ring_oshmem_c.c`
* Fortran shmem.fh: `ring_oshmemfh.f90`

## Connectivity Test

Additionally, there's one further example application, but this one
only uses the MPI C bindings to test the connectivity between all
processes:

* C: `connectivity_c.c`

## Makefile

The `Makefile` in this directory will build as many of the examples as
you have language support for (e.g., if you do not have the Fortran
`use mpi` bindings compiled as part of Open MPI, then those examples
will be skipped).

The `Makefile` assumes that the wrapper compilers `mpicc`, `mpic++`,
and `mpifort` are in your path.

Although the `Makefile` is tailored for Open MPI (e.g., it checks the
|
||||||
|
`ompi_info` command to see if you have support for `mpif.h`, the `mpi`
|
||||||
|
module, and the `use mpi_f08` module), all of the example programs are
|
||||||
|
pure MPI, and therefore not specific to Open MPI. Hence, you can use
|
||||||
|
a different MPI implementation to compile and run these programs if
|
||||||
|
you wish.
|
||||||
|
|
||||||
|
Make today an Open MPI day!
|
ompi/contrib/README.md (new file, 19 lines)
@@ -0,0 +1,19 @@
This is the OMPI contrib system.  It is (far) less functional and
flexible than the OMPI MCA framework/component system.

Each contrib package must have a `configure.m4`.  It may optionally
also have an `autogen.subdirs` file.

If it has a `configure.m4` file, it must specify its own relevant
files to `AC_CONFIG_FILES` to create during `AC_OUTPUT` -- just like
MCA components (at a minimum, usually its own `Makefile`).  The
`configure.m4` file will be slurped up into the main `configure`
script, just like other MCA components.  Note that there is currently
no "no configure" option for contrib packages -- you *must* have a
`configure.m4` (even if all it does is call `$1`).  Feel free to fix
this situation if you want -- it probably won't be too difficult to
extend `autogen.pl` to support this scenario, similar to how it is
done for MCA components.  :smile:

If it has an `autogen.subdirs` file, then it needs to be a
subdirectory that is autogen-able.
@@ -1,19 +0,0 @@
This is the OMPI contrib system.  It is (far) less functional and
flexible than the OMPI MCA framework/component system.

Each contrib package must have a configure.m4.  It may optionally also
have an autogen.subdirs file.

If it has a configure.m4 file, it must specify its own relevant files
to AC_CONFIG_FILES to create during AC_OUTPUT -- just like MCA
components (at a minimum, usually its own Makefile).  The configure.m4
file will be slurped up into the main configure script, just like
other MCA components.  Note that there is currently no "no configure"
option for contrib packages -- you *must* have a configure.m4 (even if
all it does is call $1).  Feel free to fix this situation if you want
-- it probably won't be too difficult to extend autogen.pl to support
this scenario, similar to how it is done for MCA components.
:-)

If it has an autogen.subdirs file, then it needs to be a subdirectory
that is autogen-able.
@@ -13,7 +13,7 @@
 # $HEADER$
 #

-EXTRA_DIST = profile2mat.pl aggregate_profile.pl
+EXTRA_DIST = profile2mat.pl aggregate_profile.pl README.md

 sources = common_monitoring.c common_monitoring_coll.c
 headers = common_monitoring.h common_monitoring_coll.h
@@ -1,181 +0,0 @@

Copyright (c) 2013-2015 The University of Tennessee and The University
                        of Tennessee Research Foundation.  All rights
                        reserved.
Copyright (c) 2013-2015 Inria.  All rights reserved.
$COPYRIGHT$

Additional copyrights may follow

$HEADER$

===========================================================================

Low level communication monitoring interface in Open MPI

Introduction
------------
This interface traces and monitors all messages sent by MPI before
they go to the communication channels.  At that level all
communications are point-to-point communications: collectives are
already decomposed into send and receive calls.

The monitoring is stored internally by each process and output on
stderr at the end of the application (during MPI_Finalize()).


Enabling the monitoring
-----------------------
To enable the monitoring, add --mca pml_monitoring_enable x to the
mpirun command line.
If x = 1 it monitors internal and external tags indifferently and
aggregates everything.
If x = 2 it monitors internal tags and external tags separately.
If x = 0 the monitoring is disabled.
Other values of x are not supported.

Internal tags are tags < 0.  They are used to tag sends and receives
coming from collective operations or from protocol communications.

External tags are tags >= 0.  They are used by the application in
point-to-point communication.

Therefore, distinguishing external and internal tags helps to
distinguish between point-to-point and other communication (mainly
collectives).

Output format
-------------
The output of the monitoring looks like (with --mca pml_monitoring_enable 2):
I 0 1 108 bytes 27 msgs sent
E 0 1 1012 bytes 30 msgs sent
E 0 2 23052 bytes 61 msgs sent
I 1 2 104 bytes 26 msgs sent
I 1 3 208 bytes 52 msgs sent
E 1 0 860 bytes 24 msgs sent
E 1 3 2552 bytes 56 msgs sent
I 2 3 104 bytes 26 msgs sent
E 2 0 22804 bytes 49 msgs sent
E 2 3 860 bytes 24 msgs sent
I 3 0 104 bytes 26 msgs sent
I 3 1 204 bytes 51 msgs sent
E 3 1 2304 bytes 44 msgs sent
E 3 2 860 bytes 24 msgs sent

Where:
- the first column distinguishes internal (I) and external (E) tags.
- the second column is the sender rank
- the third column is the receiver rank
- the fourth column is the number of bytes sent
- the last column is the number of messages.

In this example process 0 has sent 27 messages to process 1 using
point-to-point calls for 108 bytes, and 30 messages with collectives
and protocol-related communication for 1012 bytes to process 1.

If the monitoring was called with --mca pml_monitoring_enable 1,
everything is aggregated under the internal tags.  With the above
example, you have:
I 0 1 1120 bytes 57 msgs sent
I 0 2 23052 bytes 61 msgs sent
I 1 0 860 bytes 24 msgs sent
I 1 2 104 bytes 26 msgs sent
I 1 3 2760 bytes 108 msgs sent
I 2 0 22804 bytes 49 msgs sent
I 2 3 964 bytes 50 msgs sent
I 3 0 104 bytes 26 msgs sent
I 3 1 2508 bytes 95 msgs sent
I 3 2 860 bytes 24 msgs sent

Monitoring phases
-----------------
If one wants to monitor phases of the application, it is possible to
flush the monitoring at the application level.  In this case all the
monitoring since the last flush is stored by every process in a file.

An example of how to flush such monitoring is given in
test/monitoring/monitoring_test.c.

Moreover, all the different flushed phases are aggregated at runtime
and output at the end of the application as described above.

Example
-------
A working example is given in test/monitoring/monitoring_test.c.
It features MPI_COMM_WORLD monitoring, sub-communicator monitoring,
collective and point-to-point communication monitoring, and phase
monitoring.

To compile:
> make monitoring_test

Helper scripts
--------------
Two perl scripts are provided in test/monitoring:

- aggregate_profile.pl aggregates monitoring phases of different
  processes.  This script aggregates the profiles generated by the
  flush_monitoring function.
  The files need to be in the given format: name_<phase_id>_<process_id>
  They are then aggregated by phases.
  If one needs the profile of all the phases, he can concatenate the
  different files, or use the output of the monitoring system done at
  MPI_Finalize.
  In the example it should be called as:
  ./aggregate_profile.pl prof/phase to generate
     prof/phase_1.prof
     prof/phase_2.prof

- profile2mat.pl transforms the monitoring output into a communication
  matrix.  It takes a profile file and aggregates all the recorded
  communicators into matrices.  It generates matrices for the number
  of messages (msg), for the total bytes transmitted (size), and for
  the average number of bytes per message (avg).

  The output matrix is symmetric.

Do not forget to set the execute permission on these scripts.

For instance, the provided examples store phases output in ./prof

If you type:
> mpirun -np 4 --mca pml_monitoring_enable 2 ./monitoring_test
you should have the following output:
Proc 3 flushing monitoring to: ./prof/phase_1_3.prof
Proc 0 flushing monitoring to: ./prof/phase_1_0.prof
Proc 2 flushing monitoring to: ./prof/phase_1_2.prof
Proc 1 flushing monitoring to: ./prof/phase_1_1.prof
Proc 1 flushing monitoring to: ./prof/phase_2_1.prof
Proc 3 flushing monitoring to: ./prof/phase_2_3.prof
Proc 0 flushing monitoring to: ./prof/phase_2_0.prof
Proc 2 flushing monitoring to: ./prof/phase_2_2.prof
I 2 3 104 bytes 26 msgs sent
E 2 0 22804 bytes 49 msgs sent
E 2 3 860 bytes 24 msgs sent
I 3 0 104 bytes 26 msgs sent
I 3 1 204 bytes 51 msgs sent
E 3 1 2304 bytes 44 msgs sent
E 3 2 860 bytes 24 msgs sent
I 0 1 108 bytes 27 msgs sent
E 0 1 1012 bytes 30 msgs sent
E 0 2 23052 bytes 61 msgs sent
I 1 2 104 bytes 26 msgs sent
I 1 3 208 bytes 52 msgs sent
E 1 0 860 bytes 24 msgs sent
E 1 3 2552 bytes 56 msgs sent

You can parse the phases with:
> ./aggregate_profile.pl prof/phase
Building prof/phase_1.prof
Building prof/phase_2.prof

And you can build the different communication matrices of phase 1 with:
> ./profile2mat.pl prof/phase_1.prof
prof/phase_1.prof -> all
prof/phase_1_size_all.mat
prof/phase_1_msg_all.mat
prof/phase_1_avg_all.mat

prof/phase_1.prof -> external
prof/phase_1_size_external.mat
prof/phase_1_msg_external.mat
prof/phase_1_avg_external.mat

prof/phase_1.prof -> internal
prof/phase_1_size_internal.mat
prof/phase_1_msg_internal.mat
prof/phase_1_avg_internal.mat

Credit
------
Designed by George Bosilca <bosilca@icl.utk.edu> and
Emmanuel Jeannot <emmanuel.jeannot@inria.fr>
ompi/mca/common/monitoring/README.md (new file, 209 lines)
@@ -0,0 +1,209 @@
# Open MPI common monitoring module

Copyright (c) 2013-2015 The University of Tennessee and The University
                        of Tennessee Research Foundation.  All rights
                        reserved.
Copyright (c) 2013-2015 Inria.  All rights reserved.

Low level communication monitoring interface in Open MPI

## Introduction

This interface traces and monitors all messages sent by MPI before
they go to the communication channels.  At that level all
communications are point-to-point communications: collectives are
already decomposed into send and receive calls.

The monitoring is stored internally by each process and output on
stderr at the end of the application (during `MPI_Finalize()`).


## Enabling the monitoring

To enable the monitoring, add `--mca pml_monitoring_enable x` to the
`mpirun` command line (a complete command line is shown after this
list):

* If x = 1 it monitors internal and external tags indifferently and
  aggregates everything.
* If x = 2 it monitors internal tags and external tags separately.
* If x = 0 the monitoring is disabled.
* Other values of x are not supported.

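For example, a run of a hypothetical application `./my_app` (the name
is a placeholder) with aggregated monitoring would be launched as:

```
shell$ mpirun -np 4 --mca pml_monitoring_enable 1 ./my_app
```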
Internal tags are tags < 0.  They are used to tag sends and receives
coming from collective operations or from protocol communications.

External tags are tags >= 0.  They are used by the application in
point-to-point communication.

Therefore, distinguishing external and internal tags helps to
distinguish between point-to-point and other communication (mainly
collectives).

## Output format

The output of the monitoring looks like (with `--mca
pml_monitoring_enable 2`):

```
I 0 1 108 bytes 27 msgs sent
E 0 1 1012 bytes 30 msgs sent
E 0 2 23052 bytes 61 msgs sent
I 1 2 104 bytes 26 msgs sent
I 1 3 208 bytes 52 msgs sent
E 1 0 860 bytes 24 msgs sent
E 1 3 2552 bytes 56 msgs sent
I 2 3 104 bytes 26 msgs sent
E 2 0 22804 bytes 49 msgs sent
E 2 3 860 bytes 24 msgs sent
I 3 0 104 bytes 26 msgs sent
I 3 1 204 bytes 51 msgs sent
E 3 1 2304 bytes 44 msgs sent
E 3 2 860 bytes 24 msgs sent
```

Where:

1. the first column distinguishes internal (I) and external (E) tags.
1. the second column is the sender rank
1. the third column is the receiver rank
1. the fourth column is the number of bytes sent
1. the last column is the number of messages.

In this example process 0 has sent 27 messages to process 1 using
point-to-point calls for 108 bytes, and 30 messages with collectives
and protocol-related communication for 1012 bytes to process 1.

If the monitoring was called with `--mca pml_monitoring_enable 1`,
everything is aggregated under the internal tags.  With the above
example, you have:

```
I 0 1 1120 bytes 57 msgs sent
I 0 2 23052 bytes 61 msgs sent
I 1 0 860 bytes 24 msgs sent
I 1 2 104 bytes 26 msgs sent
I 1 3 2760 bytes 108 msgs sent
I 2 0 22804 bytes 49 msgs sent
I 2 3 964 bytes 50 msgs sent
I 3 0 104 bytes 26 msgs sent
I 3 1 2508 bytes 95 msgs sent
I 3 2 860 bytes 24 msgs sent
```

## Monitoring phases

If one wants to monitor phases of the application, it is possible to
flush the monitoring at the application level.  In this case all the
monitoring since the last flush is stored by every process in a file.

An example of how to flush such monitoring is given in
`test/monitoring/monitoring_test.c`.

Moreover, all the different flushed phases are aggregated at runtime
and output at the end of the application as described above.

## Example

A working example is given in `test/monitoring/monitoring_test.c`.  It
features `MPI_COMM_WORLD` monitoring, sub-communicator monitoring,
collective and point-to-point communication monitoring, and phase
monitoring.

To compile:

```
shell$ make monitoring_test
```

## Helper scripts

Two perl scripts are provided in test/monitoring:

1. `aggregate_profile.pl` is for aggregating monitoring phases of
   different processes.  This script aggregates the profiles generated
   by the `flush_monitoring` function.

   The files need to be in the given format: `name_<phase_id>_<process_id>`
   They are then aggregated by phases.
   If one needs the profile of all the phases, he can concatenate the
   different files, or use the output of the monitoring system done at
   `MPI_Finalize`.  In the example it should be called as:
   ```
   ./aggregate_profile.pl prof/phase to generate
      prof/phase_1.prof
      prof/phase_2.prof
   ```

1. `profile2mat.pl` is for transforming the monitoring output into a
   communication matrix.  It takes a profile file and aggregates all
   the recorded communicators into matrices.  It generates matrices
   for the number of messages (msg), for the total bytes transmitted
   (size), and for the average number of bytes per message (avg).

   The output matrix is symmetric.

For instance, the provided examples store phases output in `./prof`:

```
shell$ mpirun -np 4 --mca pml_monitoring_enable 2 ./monitoring_test
```

This should produce the following output:

```
Proc 3 flushing monitoring to: ./prof/phase_1_3.prof
Proc 0 flushing monitoring to: ./prof/phase_1_0.prof
Proc 2 flushing monitoring to: ./prof/phase_1_2.prof
Proc 1 flushing monitoring to: ./prof/phase_1_1.prof
Proc 1 flushing monitoring to: ./prof/phase_2_1.prof
Proc 3 flushing monitoring to: ./prof/phase_2_3.prof
Proc 0 flushing monitoring to: ./prof/phase_2_0.prof
Proc 2 flushing monitoring to: ./prof/phase_2_2.prof
I 2 3 104 bytes 26 msgs sent
E 2 0 22804 bytes 49 msgs sent
E 2 3 860 bytes 24 msgs sent
I 3 0 104 bytes 26 msgs sent
I 3 1 204 bytes 51 msgs sent
E 3 1 2304 bytes 44 msgs sent
E 3 2 860 bytes 24 msgs sent
I 0 1 108 bytes 27 msgs sent
E 0 1 1012 bytes 30 msgs sent
E 0 2 23052 bytes 61 msgs sent
I 1 2 104 bytes 26 msgs sent
I 1 3 208 bytes 52 msgs sent
E 1 0 860 bytes 24 msgs sent
E 1 3 2552 bytes 56 msgs sent
```

You can then parse the phases with:

```
shell$ ./aggregate_profile.pl prof/phase
Building prof/phase_1.prof
Building prof/phase_2.prof
```

And you can build the different communication matrices of phase 1
with:

```
shell$ ./profile2mat.pl prof/phase_1.prof
prof/phase_1.prof -> all
prof/phase_1_size_all.mat
prof/phase_1_msg_all.mat
prof/phase_1_avg_all.mat

prof/phase_1.prof -> external
prof/phase_1_size_external.mat
prof/phase_1_msg_external.mat
prof/phase_1_avg_external.mat

prof/phase_1.prof -> internal
prof/phase_1_size_internal.mat
prof/phase_1_msg_internal.mat
prof/phase_1_avg_internal.mat
```

## Authors

Designed by George Bosilca <bosilca@icl.utk.edu> and
Emmanuel Jeannot <emmanuel.jeannot@inria.fr>
@@ -1,340 +0,0 @@
OFI MTL:
--------
The OFI MTL supports Libfabric (a.k.a. Open Fabrics Interfaces OFI,
https://ofiwg.github.io/libfabric/) tagged APIs (fi_tagged(3)).  At
initialization time, the MTL queries libfabric for providers
supporting tag matching (fi_getinfo(3)).  Libfabric will return a list
of providers that satisfy the requested capabilities, having the most
performant one at the top of the list.  The user may modify the OFI
provider selection with the MCA parameters mtl_ofi_provider_include or
mtl_ofi_provider_exclude.

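As an illustration (the provider name is just an example and depends
on what your libfabric installation actually offers), provider
selection can be restricted on the command line:

    mpirun --mca mtl ofi --mca mtl_ofi_provider_include psm2 ...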
PROGRESS:
|
|
||||||
---------
|
|
||||||
The MTL registers a progress function to opal_progress. There is currently
|
|
||||||
no support for asynchronous progress. The progress function reads multiple events
|
|
||||||
from the OFI provider Completion Queue (CQ) per iteration (defaults to 100, can be
|
|
||||||
modified with the mca mtl_ofi_progress_event_cnt) and iterates until the
|
|
||||||
completion queue is drained.
|
|
||||||
|
|
||||||
COMPLETIONS:
|
|
||||||
------------
|
|
||||||
Each operation uses a request type ompi_mtl_ofi_request_t which includes a reference
|
|
||||||
to an operation specific completion callback, an MPI request, and a context. The
|
|
||||||
context (fi_context) is used to map completion events with MPI_requests when reading the
|
|
||||||
CQ.
|
|
||||||
|
|
||||||
OFI TAG:
|
|
||||||
--------
|
|
||||||
MPI needs to send 96 bits of information per message (32 bits communicator id,
|
|
||||||
32 bits source rank, 32 bits MPI tag) but OFI only offers 64 bits tags. In
|
|
||||||
addition, the OFI MTL uses 2 bits of the OFI tag for the synchronous send protocol.
|
|
||||||
Therefore, there are only 62 bits available in the OFI tag for message usage. The
|
|
||||||
OFI MTL offers the mtl_ofi_tag_mode mca parameter with 4 modes to address this:
|
|
||||||
|
|
||||||
"auto" (Default):
|
|
||||||
After the OFI provider is selected, a runtime check is performed to assess
|
|
||||||
FI_REMOTE_CQ_DATA and FI_DIRECTED_RECV support (see fi_tagged(3), fi_msg(2)
|
|
||||||
and fi_getinfo(3)). If supported, "ofi_tag_full" is used. If not supported,
|
|
||||||
fall back to "ofi_tag_1".
|
|
||||||
|
|
||||||
"ofi_tag_1":
|
|
||||||
For providers that do not support FI_REMOTE_CQ_DATA, the OFI MTL will
|
|
||||||
trim the fields (Communicator ID, Source Rank, MPI tag) to make them fit the 62
|
|
||||||
bits available bit in the OFI tag. There are two options available with different
|
|
||||||
number of bits for the Communicator ID and MPI tag fields. This tag distribution
|
|
||||||
offers: 12 bits for Communicator ID (max Communicator ID 4,095) subject to
|
|
||||||
provider reserved bits (see mem_tag_format below), 18 bits for Source Rank (max
|
|
||||||
Source Rank 262,143), 32 bits for MPI tag (max MPI tag is INT_MAX).
|
|
||||||
|
|
||||||
"ofi_tag_2":
|
|
||||||
Same as 2 "ofi_tag_1" but offering a different OFI tag distribution for
|
|
||||||
applications that may require a greater number of supported Communicators at the
|
|
||||||
expense of fewer MPI tag bits. This tag distribution offers: 24 bits for
|
|
||||||
Communicator ID (max Communicator ED 16,777,215. See mem_tag_format below), 18
|
|
||||||
bits for Source Rank (max Source Rank 262,143), 20 bits for MPI tag (max MPI tag
|
|
||||||
524,287).
|
|
||||||
|
|
||||||
"ofi_tag_full":
|
|
||||||
For executions that cannot accept trimming source rank or MPI tag, this mode sends
|
|
||||||
source rank for each message in the CQ DATA. The Source Rank is made available at
|
|
||||||
the remote process CQ (FI_CQ_FORMAT_TAGGED is used, see fi_cq(3)) at the completion
|
|
||||||
of the matching receive operation. Since the minimum size for FI_REMOTE_CQ_DATA
|
|
||||||
is 32 bits, the Source Rank fits with no limitations. The OFI tag is used for the
|
|
||||||
Communicator id (28 bits, max Communicator ID 268,435,455. See mem_tag_format below),
|
|
||||||
and the MPI tag (max MPI tag is INT_MAX). If this mode is selected by the user
|
|
||||||
and FI_REMOTE_CQ_DATA or FI_DIRECTED_RECV are not supported, the execution will abort.
|
|
||||||
|
|
||||||
mem_tag_format (fi_endpoint(3))
|
|
||||||
Some providers can reserve the higher order bits from the OFI tag for internal purposes.
|
|
||||||
This is signaled in mem_tag_format (see fi_endpoint(3)) by setting higher order bits
|
|
||||||
to zero. In such cases, the OFI MTL will reduce the number of communicator ids supported
|
|
||||||
by reducing the bits available for the communicator ID field in the OFI tag.
|
|
||||||
|
|
||||||
SCALABLE ENDPOINTS:
-------------------

The OFI MTL supports the OFI Scalable Endpoints (SEP) feature as a means to improve
multi-threaded application throughput and message rate. Currently the feature
is designed to utilize multiple TX/RX contexts exposed by the OFI provider in
conjunction with a multi-communicator MPI application model. Therefore, new OFI
contexts are created lazily, as and when communicators are duplicated, instead
of all at once during init time; this approach also favours creating only as
many contexts as needed.

1. Multi-communicator model:
   With this approach, the MPI application is required to first duplicate
   the communicators it wants to use with MPI operations (ideally creating
   as many communicators as the number of threads it wants to use to call
   into MPI). The duplicated communicators are then used by the
   corresponding threads to perform MPI operations. A possible usage
   scenario could be an MPI + OMP application as follows
   (example limited to 2 ranks):

     MPI_Comm dup_comm[n];
     MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
     for (i = 0; i < n; i++) {
         MPI_Comm_dup(MPI_COMM_WORLD, &dup_comm[i]);
     }
     if (rank == 0) {
     #pragma omp parallel for private(host_sbuf, host_rbuf) num_threads(n)
         for (i = 0; i < n ; i++) {
             MPI_Send(host_sbuf, MYBUFSIZE, MPI_CHAR,
                      1, MSG_TAG, dup_comm[i]);
             MPI_Recv(host_rbuf, MYBUFSIZE, MPI_CHAR,
                      1, MSG_TAG, dup_comm[i], &status);
         }
     } else if (rank == 1) {
     #pragma omp parallel for private(status, host_sbuf, host_rbuf) num_threads(n)
         for (i = 0; i < n ; i++) {
             MPI_Recv(host_rbuf, MYBUFSIZE, MPI_CHAR,
                      0, MSG_TAG, dup_comm[i], &status);
             MPI_Send(host_sbuf, MYBUFSIZE, MPI_CHAR,
                      0, MSG_TAG, dup_comm[i]);
         }
     }
2. MCA variables:
   To utilize the feature, the following MCA variables need to be set:

   mtl_ofi_enable_sep:
   This MCA variable needs to be set to enable the use of Scalable Endpoints (SEP)
   feature in the OFI MTL. The underlying provider is also checked to ensure the
   feature is supported. If the provider chosen does not support it, user needs
   to either set this variable to 0 or select a different provider which supports
   the feature.
   For single-threaded applications one OFI context is sufficient, so OFI SEPs
   may not add benefit.
   Note that mtl_ofi_thread_grouping (see below) needs to be enabled to use the
   different OFI SEP contexts. Otherwise, only one context (ctxt 0) will be used.

   Default: 0

   Command-line syntax:
   "-mca mtl_ofi_enable_sep 1"

   mtl_ofi_thread_grouping:
   Turn Thread Grouping feature on. This is needed to use the Multi-communicator
   model explained above. This means that the OFI MTL will use the communicator
   ID to decide the SEP contexts to be used by the thread. In this way, each
   thread will have direct access to different OFI resources. If disabled,
   only context 0 will be used.
   Requires mtl_ofi_enable_sep to be set to 1.

   Default: 0

   It is not recommended to set the MCA variable for:
   - Multi-threaded MPI applications not following multi-communicator approach.
   - Applications that have multiple threads using a single communicator as
     it may degrade performance.

   Command-line syntax:
   "-mca mtl_ofi_thread_grouping 1"

   mtl_ofi_num_ctxts:
   This MCA variable allows user to set the number of OFI SEP contexts the
   application expects to use. For multi-threaded applications using Thread
   Grouping feature, this number should be set to the number of user threads
   that will call into MPI. This variable will only have effect if
   mtl_ofi_enable_sep is set to 1.

   Default: 1

   Command-line syntax:
   "-mca mtl_ofi_num_ctxts N" [ N: number of OFI contexts required by application ]

3. Notes on performance:
   - OFI MTL will create as many TX/RX contexts as set by MCA mtl_ofi_num_ctxts.
     The number of contexts that can be created is also limited by the underlying
     provider as each provider may have different thresholds. Once the threshold
     is exceeded, contexts are used in a round-robin fashion which leads to
     resource sharing among threads. Therefore locks are required to guard
     against race conditions. For performance, it is recommended to have

         Number of threads = Number of communicators = Number of contexts

     For example, when using PSM2 provider, the number of contexts is dictated
     by the Intel Omni-Path HFI1 driver module.

   - OPAL layer allows for multiple threads to enter progress simultaneously. To
     enable this feature, user needs to set MCA variable
     "max_thread_in_progress". When using Thread Grouping feature, it is
     recommended to set this MCA parameter to the number of threads expected to
     call into MPI as it provides performance benefits.

     Command-line syntax:
     "-mca opal_max_thread_in_progress N" [ N: number of threads expected to
     make MPI calls ]

     Default: 1

   - For applications using a single thread with multiple communicators and MCA
     variable "mtl_ofi_thread_grouping" set to 1, the MTL will use multiple
     contexts, but the benefits may be negligible as only one thread is driving
     progress.
SPECIALIZED FUNCTIONS:
----------------------

To improve performance when calling message passing APIs in the OFI mtl
specialized functions are generated at compile time that eliminate all the
if conditionals that can be determined at init and don't need to be
queried again during the critical path. These functions are generated by
perl scripts during make which generate functions and symbols for every
combination of flags for each function.

1. ADDING NEW FLAGS FOR SPECIALIZATION OF EXISTING FUNCTION:
   To add a new flag to an existing specialized function for handling cases
   where different OFI providers may or may not support a particular feature,
   then you must follow these steps:
   1) Update the "_generic" function in mtl_ofi.h with the new flag and
      the if conditionals to read the new value.
   2) Update the *.pm file corresponding to the function with the new flag in:
      gen_funcs(), gen_*_function(), & gen_*_sym_init()
   3) Update mtl_ofi_opt.h with:
      The new flag as #define NEW_FLAG_TYPES #NUMBER_OF_STATES
      example: #define OFI_CQ_DATA 2 (only has TRUE/FALSE states)
      Update the function's types with:
      #define OMPI_MTL_OFI_FUNCTION_TYPES [NEW_FLAG_TYPES]

2. ADDING A NEW FUNCTION FOR SPECIALIZATION:
   To add a new function to be specialized you must follow these steps:
   1) Create a new mtl_ofi_"function_name"_opt.pm based off opt_common/mtl_ofi_opt.pm.template
   2) Add new .pm file to generated_source_modules in Makefile.am
   3) Add .c file to generated_sources in Makefile.am named the same as the corresponding .pm file
   4) Update existing or create function in mtl_ofi.h to _generic with new flags.
   5) Update mtl_ofi_opt.h with:
      a) New function types: #define OMPI_MTL_OFI_FUNCTION_TYPES [FLAG_TYPES]
      b) Add new function to the struct ompi_mtl_ofi_symtable:
             struct ompi_mtl_ofi_symtable {
                 ...
                 int (*ompi_mtl_ofi_FUNCTION OMPI_MTL_OFI_FUNCTION_TYPES )
             }
      c) Add new symbol table init function definition:
             void ompi_mtl_ofi_FUNCTION_symtable_init(struct ompi_mtl_ofi_symtable* sym_table);
   6) Add calls to init the new function in the symbol table and assign the function
      pointer to be used based off the flags in mtl_ofi_component.c:
          ompi_mtl_ofi_FUNCTION_symtable_init(&ompi_mtl_ofi.sym_table);
          ompi_mtl_ofi.base.mtl_FUNCTION =
              ompi_mtl_ofi.sym_table.ompi_mtl_ofi_FUNCTION[ompi_mtl_ofi.flag];

3. EXAMPLE SPECIALIZED FILE:
   The code below is an example of what is generated by the specialization
   scripts for use in the OFI mtl. This code specializes the blocking
   send functionality based on FI_REMOTE_CQ_DATA & OFI Scalable Endpoint support
   provided by an OFI Provider. Only one function and symbol is used during
   runtime based on if FI_REMOTE_CQ_DATA is supported and/or if OFI Scalable
   Endpoint support is enabled.

/*
 * Copyright (c) 2013-2018 Intel, Inc. All rights reserved
 *
 * $COPYRIGHT$
 *
 * Additional copyrights may follow
 *
 * $HEADER$
 */

#include "mtl_ofi.h"

__opal_attribute_always_inline__ static inline int
ompi_mtl_ofi_send_false_false(struct mca_mtl_base_module_t *mtl,
                              struct ompi_communicator_t *comm,
                              int dest,
                              int tag,
                              struct opal_convertor_t *convertor,
                              mca_pml_base_send_mode_t mode)
{
    const bool OFI_CQ_DATA = false;
    const bool OFI_SCEP_EPS = false;

    return ompi_mtl_ofi_send_generic(mtl, comm, dest, tag,
                                     convertor, mode,
                                     OFI_CQ_DATA, OFI_SCEP_EPS);
}

__opal_attribute_always_inline__ static inline int
ompi_mtl_ofi_send_false_true(struct mca_mtl_base_module_t *mtl,
                             struct ompi_communicator_t *comm,
                             int dest,
                             int tag,
                             struct opal_convertor_t *convertor,
                             mca_pml_base_send_mode_t mode)
{
    const bool OFI_CQ_DATA = false;
    const bool OFI_SCEP_EPS = true;

    return ompi_mtl_ofi_send_generic(mtl, comm, dest, tag,
                                     convertor, mode,
                                     OFI_CQ_DATA, OFI_SCEP_EPS);
}

__opal_attribute_always_inline__ static inline int
ompi_mtl_ofi_send_true_false(struct mca_mtl_base_module_t *mtl,
                             struct ompi_communicator_t *comm,
                             int dest,
                             int tag,
                             struct opal_convertor_t *convertor,
                             mca_pml_base_send_mode_t mode)
{
    const bool OFI_CQ_DATA = true;
    const bool OFI_SCEP_EPS = false;

    return ompi_mtl_ofi_send_generic(mtl, comm, dest, tag,
                                     convertor, mode,
                                     OFI_CQ_DATA, OFI_SCEP_EPS);
}

__opal_attribute_always_inline__ static inline int
ompi_mtl_ofi_send_true_true(struct mca_mtl_base_module_t *mtl,
                            struct ompi_communicator_t *comm,
                            int dest,
                            int tag,
                            struct opal_convertor_t *convertor,
                            mca_pml_base_send_mode_t mode)
{
    const bool OFI_CQ_DATA = true;
    const bool OFI_SCEP_EPS = true;

    return ompi_mtl_ofi_send_generic(mtl, comm, dest, tag,
                                     convertor, mode,
                                     OFI_CQ_DATA, OFI_SCEP_EPS);
}

void ompi_mtl_ofi_send_symtable_init(struct ompi_mtl_ofi_symtable* sym_table)
{
    sym_table->ompi_mtl_ofi_send[false][false]
        = ompi_mtl_ofi_send_false_false;

    sym_table->ompi_mtl_ofi_send[false][true]
        = ompi_mtl_ofi_send_false_true;

    sym_table->ompi_mtl_ofi_send[true][false]
        = ompi_mtl_ofi_send_true_false;

    sym_table->ompi_mtl_ofi_send[true][true]
        = ompi_mtl_ofi_send_true_true;
}

###
368 ompi/mca/mtl/ofi/README.md (new regular file)
@@ -0,0 +1,368 @@
# Open MPI OFI MTL

The OFI MTL supports Libfabric (a.k.a., [Open Fabrics Interfaces
OFI](https://ofiwg.github.io/libfabric/)) tagged APIs
(`fi_tagged(3)`). At initialization time, the MTL queries libfabric
for providers supporting tag matching (`fi_getinfo(3)`). Libfabric
will return a list of providers that satisfy the requested
capabilities, with the most performant one at the top of the list.
The user may modify the OFI provider selection with the MCA parameters
`mtl_ofi_provider_include` or `mtl_ofi_provider_exclude`.

## PROGRESS

The MTL registers a progress function to `opal_progress`. There is
currently no support for asynchronous progress. The progress function
reads multiple events from the OFI provider Completion Queue (CQ) per
iteration (defaults to 100, can be modified with the MCA parameter
`mtl_ofi_progress_event_cnt`) and iterates until the completion queue is
drained.

## COMPLETIONS

Each operation uses a request type `ompi_mtl_ofi_request_t` which
includes a reference to an operation-specific completion callback, an
MPI request, and a context. The context (`fi_context`) is used to map
completion events with `MPI_requests` when reading the CQ.

## OFI TAG

MPI needs to send 96 bits of information per message (32 bits
communicator ID, 32 bits source rank, 32 bits MPI tag) but OFI only
offers 64-bit tags. In addition, the OFI MTL uses 2 bits of the OFI
tag for the synchronous send protocol. Therefore, there are only 62
bits available in the OFI tag for message usage. The OFI MTL offers
the `mtl_ofi_tag_mode` MCA parameter with 4 modes to address this:

* `auto` (Default):
  After the OFI provider is selected, a runtime check is performed to
  assess `FI_REMOTE_CQ_DATA` and `FI_DIRECTED_RECV` support (see
  `fi_tagged(3)`, `fi_msg(2)` and `fi_getinfo(3)`). If supported,
  `ofi_tag_full` is used. If not supported, fall back to `ofi_tag_1`.

* `ofi_tag_1`:
  For providers that do not support `FI_REMOTE_CQ_DATA`, the OFI MTL
  will trim the fields (Communicator ID, Source Rank, MPI tag) to make
  them fit the 62 available bits in the OFI tag. There are two
  options available with different numbers of bits for the Communicator
  ID and MPI tag fields. This tag distribution offers: 12 bits for
  Communicator ID (max Communicator ID 4,095) subject to provider
  reserved bits (see `mem_tag_format` below), 18 bits for Source Rank
  (max Source Rank 262,143), 32 bits for MPI tag (max MPI tag is
  `INT_MAX`).

* `ofi_tag_2`:
  Same as `ofi_tag_1` but offering a different OFI tag distribution
  for applications that may require a greater number of supported
  Communicators at the expense of fewer MPI tag bits. This tag
  distribution offers: 24 bits for Communicator ID (max Communicator
  ID 16,777,215. See `mem_tag_format` below), 18 bits for Source Rank
  (max Source Rank 262,143), 20 bits for MPI tag (max MPI tag
  524,287).

* `ofi_tag_full`:
  For executions that cannot accept trimming source rank or MPI tag,
  this mode sends the source rank for each message in the CQ DATA. The
  Source Rank is made available at the remote process CQ
  (`FI_CQ_FORMAT_TAGGED` is used, see `fi_cq(3)`) at the completion of
  the matching receive operation. Since the minimum size for
  `FI_REMOTE_CQ_DATA` is 32 bits, the Source Rank fits with no
  limitations. The OFI tag is used for the Communicator ID (28 bits,
  max Communicator ID 268,435,455. See `mem_tag_format` below), and
  the MPI tag (max MPI tag is `INT_MAX`). If this mode is selected by
  the user and `FI_REMOTE_CQ_DATA` or `FI_DIRECTED_RECV` are not
  supported, the execution will abort.

* `mem_tag_format` (`fi_endpoint(3)`):
  Some providers can reserve the higher order bits from the OFI tag
  for internal purposes. This is signaled in `mem_tag_format` (see
  `fi_endpoint(3)`) by setting higher order bits to zero. In such
  cases, the OFI MTL will reduce the number of communicator IDs
  supported by reducing the bits available for the communicator ID
  field in the OFI tag.
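
As a concrete illustration of the `ofi_tag_1` distribution above, the
sketch below packs the three MPI fields into a 64-bit OFI tag. It is
only for following along with the bit counts in this section: the
shift positions and the helper name `pack_ofi_tag_1` are illustrative
assumptions, not the MTL's actual internal macros.

```c
#include <stdint.h>

/* Illustrative only: 2 bits left for the sync-send protocol, then
 * 12 bits communicator ID, 18 bits source rank, 32 bits MPI tag. */
static inline uint64_t pack_ofi_tag_1(uint32_t cid, uint32_t src, uint32_t mpi_tag)
{
    return (((uint64_t)(cid & 0xFFF))   << 50) |  /* bits 50-61: communicator ID */
           (((uint64_t)(src & 0x3FFFF)) << 32) |  /* bits 32-49: source rank     */
            ((uint64_t)mpi_tag);                  /* bits  0-31: MPI tag         */
}
```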
## SCALABLE ENDPOINTS

The OFI MTL supports the OFI Scalable Endpoints (SEP) feature as a means
to improve multi-threaded application throughput and message
rate. Currently the feature is designed to utilize multiple TX/RX
contexts exposed by the OFI provider in conjunction with a
multi-communicator MPI application model. New OFI contexts are
therefore created lazily, as and when communicators are duplicated,
instead of all at once during init time; this approach also favours
creating only as many contexts as needed.

1. Multi-communicator model:
   With this approach, the MPI application is required to first duplicate
   the communicators it wants to use with MPI operations (ideally creating
   as many communicators as the number of threads it wants to use to call
   into MPI). The duplicated communicators are then used by the
   corresponding threads to perform MPI operations. A possible usage
   scenario could be an MPI + OMP application as follows
   (example limited to 2 ranks):

   ```c
   MPI_Comm dup_comm[n];
   MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
   for (i = 0; i < n; i++) {
       MPI_Comm_dup(MPI_COMM_WORLD, &dup_comm[i]);
   }
   if (rank == 0) {
   #pragma omp parallel for private(host_sbuf, host_rbuf) num_threads(n)
       for (i = 0; i < n ; i++) {
           MPI_Send(host_sbuf, MYBUFSIZE, MPI_CHAR,
                    1, MSG_TAG, dup_comm[i]);
           MPI_Recv(host_rbuf, MYBUFSIZE, MPI_CHAR,
                    1, MSG_TAG, dup_comm[i], &status);
       }
   } else if (rank == 1) {
   #pragma omp parallel for private(status, host_sbuf, host_rbuf) num_threads(n)
       for (i = 0; i < n ; i++) {
           MPI_Recv(host_rbuf, MYBUFSIZE, MPI_CHAR,
                    0, MSG_TAG, dup_comm[i], &status);
           MPI_Send(host_sbuf, MYBUFSIZE, MPI_CHAR,
                    0, MSG_TAG, dup_comm[i]);
       }
   }
   ```
2. MCA variables:
   To utilize the feature, the following MCA variables need to be set:

   * `mtl_ofi_enable_sep`:
     This MCA variable needs to be set to enable the use of Scalable
     Endpoints (SEP) feature in the OFI MTL. The underlying provider
     is also checked to ensure the feature is supported. If the
     provider chosen does not support it, user needs to either set
     this variable to 0 or select a different provider which supports
     the feature. For single-threaded applications one OFI context is
     sufficient, so OFI SEPs may not add benefit. Note that
     `mtl_ofi_thread_grouping` (see below) needs to be enabled to use
     the different OFI SEP contexts. Otherwise, only one context (ctxt
     0) will be used.

     Default: 0

     Command-line syntax: `--mca mtl_ofi_enable_sep 1`

   * `mtl_ofi_thread_grouping`:
     Turn Thread Grouping feature on. This is needed to use the
     Multi-communicator model explained above. This means that the OFI
     MTL will use the communicator ID to decide the SEP contexts to be
     used by the thread. In this way, each thread will have direct
     access to different OFI resources. If disabled, only context 0
     will be used. Requires `mtl_ofi_enable_sep` to be set to 1.

     Default: 0

     It is not recommended to set the MCA variable for:

     * Multi-threaded MPI applications not following multi-communicator
       approach.
     * Applications that have multiple threads using a single
       communicator as it may degrade performance.

     Command-line syntax: `--mca mtl_ofi_thread_grouping 1`

   * `mtl_ofi_num_ctxts`:
     This MCA variable allows user to set the number of OFI SEP
     contexts the application expects to use. For multi-threaded
     applications using Thread Grouping feature, this number should be
     set to the number of user threads that will call into MPI. This
     variable will only have effect if `mtl_ofi_enable_sep` is set to 1.

     Default: 1

     Command-line syntax: `--mca mtl_ofi_num_ctxts N` (`N`: number of
     OFI contexts required by application)

3. Notes on performance:
   * OFI MTL will create as many TX/RX contexts as set by MCA
     `mtl_ofi_num_ctxts`. The number of contexts that can be created is
     also limited by the underlying provider as each provider may have
     different thresholds. Once the threshold is exceeded, contexts are
     used in a round-robin fashion which leads to resource sharing
     among threads. Therefore locks are required to guard against race
     conditions. For performance, it is recommended to have

         Number of threads = Number of communicators = Number of contexts

     For example, when using PSM2 provider, the number of contexts is
     dictated by the Intel Omni-Path HFI1 driver module.

   * OPAL layer allows for multiple threads to enter progress
     simultaneously. To enable this feature, user needs to set MCA
     variable `max_thread_in_progress`. When using Thread Grouping
     feature, it is recommended to set this MCA parameter to the number
     of threads expected to call into MPI as it provides performance
     benefits.

     Default: 1

     Command-line syntax: `--mca opal_max_thread_in_progress N` (`N`:
     number of threads expected to make MPI calls)

   * For applications using a single thread with multiple communicators
     and MCA variable `mtl_ofi_thread_grouping` set to 1, the MTL will
     use multiple contexts, but the benefits may be negligible as only
     one thread is driving progress.
## SPECIALIZED FUNCTIONS

To improve performance when calling message passing APIs in the OFI
MTL, specialized functions are generated at compile time that eliminate
all the if conditionals that can be determined at init and don't need
to be queried again during the critical path. These functions are
generated by perl scripts during make, which generate functions and
symbols for every combination of flags for each function.

1. ADDING NEW FLAGS FOR SPECIALIZATION OF EXISTING FUNCTION:
   To add a new flag to an existing specialized function for handling
   cases where different OFI providers may or may not support a
   particular feature, you must follow these steps:

   1. Update the `_generic` function in `mtl_ofi.h` with the new flag
      and the if conditionals to read the new value.
   1. Update the `*.pm` file corresponding to the function with the
      new flag in: `gen_funcs()`, `gen_*_function()`, &
      `gen_*_sym_init()`
   1. Update `mtl_ofi_opt.h` with:
      * The new flag as `#define NEW_FLAG_TYPES #NUMBER_OF_STATES`.
        Example: `#define OFI_CQ_DATA 2` (only has TRUE/FALSE states)
      * Update the function's types with:
        `#define OMPI_MTL_OFI_FUNCTION_TYPES [NEW_FLAG_TYPES]`

1. ADDING A NEW FUNCTION FOR SPECIALIZATION:
   To add a new function to be specialized, you must follow these
   steps:

   1. Create a new `mtl_ofi_<function_name>_opt.pm` based off
      `opt_common/mtl_ofi_opt.pm.template`
   1. Add the new `.pm` file to `generated_source_modules` in `Makefile.am`
   1. Add a `.c` file to `generated_sources` in `Makefile.am` named the
      same as the corresponding `.pm` file
   1. Update the existing (or create a new) `_generic` function in
      `mtl_ofi.h` with the new flags.
   1. Update `mtl_ofi_opt.h` with:
      1. New function types: `#define OMPI_MTL_OFI_FUNCTION_TYPES [FLAG_TYPES]`
      1. Add the new function to the `struct ompi_mtl_ofi_symtable`:

         ```c
         struct ompi_mtl_ofi_symtable {
             ...
             int (*ompi_mtl_ofi_FUNCTION OMPI_MTL_OFI_FUNCTION_TYPES )
         }
         ```
      1. Add the new symbol table init function definition:

         ```c
         void ompi_mtl_ofi_FUNCTION_symtable_init(struct ompi_mtl_ofi_symtable* sym_table);
         ```
   1. Add calls to init the new function in the symbol table and
      assign the function pointer to be used based off the flags in
      `mtl_ofi_component.c`:
      * `ompi_mtl_ofi_FUNCTION_symtable_init(&ompi_mtl_ofi.sym_table);`
      * `ompi_mtl_ofi.base.mtl_FUNCTION = ompi_mtl_ofi.sym_table.ompi_mtl_ofi_FUNCTION[ompi_mtl_ofi.flag];`
## EXAMPLE SPECIALIZED FILE

The code below is an example of what is generated by the
specialization scripts for use in the OFI mtl. This code specializes
the blocking send functionality based on `FI_REMOTE_CQ_DATA` & OFI
Scalable Endpoint support provided by an OFI Provider. Only one
function and symbol is used during runtime based on if
`FI_REMOTE_CQ_DATA` is supported and/or if OFI Scalable Endpoint support
is enabled.

```c
/*
 * Copyright (c) 2013-2018 Intel, Inc. All rights reserved
 *
 * $COPYRIGHT$
 *
 * Additional copyrights may follow
 *
 * $HEADER$
 */

#include "mtl_ofi.h"

__opal_attribute_always_inline__ static inline int
ompi_mtl_ofi_send_false_false(struct mca_mtl_base_module_t *mtl,
                              struct ompi_communicator_t *comm,
                              int dest,
                              int tag,
                              struct opal_convertor_t *convertor,
                              mca_pml_base_send_mode_t mode)
{
    const bool OFI_CQ_DATA = false;
    const bool OFI_SCEP_EPS = false;

    return ompi_mtl_ofi_send_generic(mtl, comm, dest, tag,
                                     convertor, mode,
                                     OFI_CQ_DATA, OFI_SCEP_EPS);
}

__opal_attribute_always_inline__ static inline int
ompi_mtl_ofi_send_false_true(struct mca_mtl_base_module_t *mtl,
                             struct ompi_communicator_t *comm,
                             int dest,
                             int tag,
                             struct opal_convertor_t *convertor,
                             mca_pml_base_send_mode_t mode)
{
    const bool OFI_CQ_DATA = false;
    const bool OFI_SCEP_EPS = true;

    return ompi_mtl_ofi_send_generic(mtl, comm, dest, tag,
                                     convertor, mode,
                                     OFI_CQ_DATA, OFI_SCEP_EPS);
}

__opal_attribute_always_inline__ static inline int
ompi_mtl_ofi_send_true_false(struct mca_mtl_base_module_t *mtl,
                             struct ompi_communicator_t *comm,
                             int dest,
                             int tag,
                             struct opal_convertor_t *convertor,
                             mca_pml_base_send_mode_t mode)
{
    const bool OFI_CQ_DATA = true;
    const bool OFI_SCEP_EPS = false;

    return ompi_mtl_ofi_send_generic(mtl, comm, dest, tag,
                                     convertor, mode,
                                     OFI_CQ_DATA, OFI_SCEP_EPS);
}

__opal_attribute_always_inline__ static inline int
ompi_mtl_ofi_send_true_true(struct mca_mtl_base_module_t *mtl,
                            struct ompi_communicator_t *comm,
                            int dest,
                            int tag,
                            struct opal_convertor_t *convertor,
                            mca_pml_base_send_mode_t mode)
{
    const bool OFI_CQ_DATA = true;
    const bool OFI_SCEP_EPS = true;

    return ompi_mtl_ofi_send_generic(mtl, comm, dest, tag,
                                     convertor, mode,
                                     OFI_CQ_DATA, OFI_SCEP_EPS);
}

void ompi_mtl_ofi_send_symtable_init(struct ompi_mtl_ofi_symtable* sym_table)
{
    sym_table->ompi_mtl_ofi_send[false][false]
        = ompi_mtl_ofi_send_false_false;

    sym_table->ompi_mtl_ofi_send[false][true]
        = ompi_mtl_ofi_send_false_true;

    sym_table->ompi_mtl_ofi_send[true][false]
        = ompi_mtl_ofi_send_true_false;

    sym_table->ompi_mtl_ofi_send[true][true]
        = ompi_mtl_ofi_send_true_true;
}
```
@@ -1,5 +1,3 @@
This is a simple example op component meant to be a template /
springboard for people to write their own op components. There are
many different ways to write components and modules; this is but one
@@ -13,28 +11,26 @@ same end effect. Feel free to customize / simplify / strip out what
you don't need from this example.

This example component supports a fictitious set of hardware that
provides acceleration for the `MPI_MAX` and `MPI_BXOR` `MPI_Ops`. The
fictitious hardware has multiple versions, too: some versions only
support single precision floating point types for `MAX` and single
precision integer types for `BXOR`, whereas later versions support
both single and double precision floating point types for `MAX` and
both single and double precision integer types for `BXOR`. Hence,
this example walks through setting up particular `MPI_Op` function
pointers based on:

1. hardware availability (e.g., does the node where this MPI process
   is running have the relevant hardware/resources?)
1. `MPI_Op` (e.g., in this example, only `MPI_MAX` and `MPI_BXOR` are
   supported)
1. datatype (e.g., single/double precision floating point for `MAX`
   and single/double precision integer for `BXOR`)

Additionally, there are other considerations that should be factored
in at run time. Hardware accelerators are great, but they do induce
overhead -- for example, some accelerator hardware requires registered
memory. So even if a particular `MPI_Op` and datatype are supported, it
may not be worthwhile to use the hardware unless the amount of data to
be processed is "big enough" (meaning that the cost of the
registration and/or copy-in/copy-out is ameliorated) or the memory to
@@ -47,57 +43,65 @@ failover strategy is well-supported by the op framework; during the
query process, a component can "stack" itself similar to how POSIX
signal handlers can be stacked. Specifically, op components can cache
other implementations of operation functions for use in the case of
failover. The `MAX` and `BXOR` module implementations show one way of
using this method.
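
To make the caching/failover idea concrete, here is a minimal,
self-contained sketch. Every name in it (the struct, the
function-pointer type, `example_hw_available()`) is invented for
illustration and is not the real op framework API; the actual types and
registration flow live in the component/module sources themselves.

```c
/*
 * Hypothetical sketch of the "stack and fail over" idea described above.
 * All names here are invented for illustration; they are NOT the real
 * op framework API -- consult the example component/module sources for
 * the actual types and registration flow.
 */
typedef void (example_max_fn_t)(const float *in, float *inout, int count);

/* Module state: the implementation cached at query time to fall back on. */
struct example_float_max_module {
    example_max_fn_t *fallback_max;
};

/* Stand-in for "is the (fictitious) accelerator usable right now?" */
static int example_hw_available(void) { return 0; }

static void example_hw_float_max(struct example_float_max_module *mod,
                                 const float *in, float *inout, int count)
{
    if (example_hw_available()) {
        /* ...drive the fictitious accelerator here... */
    } else {
        /* Fail over to the implementation cached during the query phase. */
        mod->fallback_max(in, inout, count);
    }
}
```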
Here's a listing of the files in the example component and what they
do:

- `configure.m4`: Tests that get slurped into OMPI's top-level
  `configure` script to determine whether this component will be built
  or not.
- `Makefile.am`: Automake makefile that builds this component.
- `op_example_component.c`: The main "component" source file.
- `op_example_module.c`: The main "module" source file.
- `op_example.h`: information that is shared between the `.c` files.
- `.ompi_ignore`: the presence of this file causes OMPI's `autogen.pl`
  to skip this component in the configure/build/install process (see
  below).

To use this example as a template for your component (assume your new
component is named `foo`):

```
shell$ cd (top_ompi_dir)/ompi/mca/op
shell$ cp -r example foo
shell$ cd foo
```

Remove the `.ompi_ignore` file (which makes the component "visible" to
all developers) *OR* add an `.ompi_unignore` file with one username per
line (as reported by `whoami`). OMPI's `autogen.pl` will skip any
component with a `.ompi_ignore` file *unless* there is also an
`.ompi_unignore` file containing your user ID in it. This is a handy
mechanism to have a component in the tree but have it not built / used
by most other developers:

```
shell$ rm .ompi_ignore
*OR*
shell$ whoami > .ompi_unignore
```

Now rename any file that contains `example` in the filename to have
`foo`, instead. For example:

```
shell$ mv op_example_component.c op_foo_component.c
#...etc.
```

Now edit all the files and `s/example/foo/gi`. Specifically, replace
all instances of `example` with `foo` in all function names, type
names, header `#defines`, strings, and global variables.

Now your component should be fully functional (although entirely
renamed as `foo` instead of `example`). You can go to the top-level
OMPI directory and run `autogen.pl` (which will find your component
and add it to the configure/build process) and then `configure ...`
and `make ...` as normal.

```
shell$ cd (top_ompi_dir)
shell$ ./autogen.pl
# ...lots of output...
@@ -107,19 +111,21 @@ shell$ make -j 4 all
# ...lots of output...
shell$ make install
# ...lots of output...
```

After you have installed Open MPI, running `ompi_info` should show
your `foo` component in the output.

```
shell$ ompi_info | grep op:
     MCA op: example (MCA v2.0, API v1.0, Component v1.4)
     MCA op: foo (MCA v2.0, API v1.0, Component v1.4)
shell$
```

If you do not see your `foo` component, check the above steps, and
check the output of `autogen.pl`, `configure`, and `make` to ensure
that `foo` was found, configured, and built successfully.

Once `ompi_info` sees your component, start editing the `foo`
component files in a meaningful way.
@@ -10,3 +10,5 @@
#

SUBDIRS = java c

EXTRA_DIST = README.md
@@ -1,26 +1,27 @@
# Open MPI Java bindings

Note about the Open MPI Java bindings

The Java bindings in this directory are not part of the MPI
specification, as noted in the README.JAVA.md file in the root
directory. That file also contains some information regarding the
installation and use of the Java bindings. Further details can be
found in the paper [1].

We originally took the code from the mpiJava project [2] as a starting
point for our developments, but we have pretty much rewritten 100% of
it. The original copyrights and license terms of mpiJava are listed
below.

1. O. Vega-Gisbert, J. E. Roman, and J. M. Squyres. "Design and
   implementation of Java bindings in Open MPI". Parallel Comput.
   59: 1-20 (2016).
1. M. Baker et al. "mpiJava: An object-oriented Java interface to
   MPI". In Parallel and Distributed Processing, LNCS vol. 1586,
   pp. 748-762, Springer (1999).

## Original citation

```
mpiJava - A Java Interface to MPI
---------------------------------
Copyright 2003
@@ -39,6 +40,7 @@ original copyrights and license terms of mpiJava are listed below.
(Bugfixes/Additions, CMake based configure/build)
Blasius Czink
HLRS, University of Stuttgart
```

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
@@ -1,4 +1,5 @@
# Symbol conventions for Open MPI extensions

Last updated: January 2015

This README provides some rule-of-thumb guidance for how to name
@@ -15,26 +16,22 @@ Generally speaking, there are usually three kinds of extensions:
3. Functionality that is strongly expected to be in an upcoming
   version of the MPI specification.

## Case 1

The `OMPI_Paffinity_str()` extension is a good example of this type:
it is solely intended to be for Open MPI. It will likely never be
pushed to other MPI implementations, and it will likely never be
pushed to the MPI Forum.

It's Open MPI-specific functionality, through and through.

Public symbols of this type of functionality should be named with an
`OMPI_` prefix to emphasize its Open MPI-specific nature. To be
clear: the `OMPI_` prefix clearly identifies parts of user code that
are relying on Open MPI (and likely need to be surrounded with
`#if OPEN_MPI` blocks, etc.).
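
A minimal sketch of what such a guarded block looks like in user code
follows. The function name and printed strings are made up for
illustration, and the actual `OMPI_`-prefixed call is elided rather
than invented:

```c
/*
 * Minimal sketch of guarding Open MPI-specific code in a user
 * application.  Open MPI's <mpi.h> defines OPEN_MPI, and extension
 * prototypes are made available through <mpi-ext.h>.  The function
 * below is illustrative only.
 */
#include <stdio.h>
#include <mpi.h>
#if defined(OPEN_MPI) && OPEN_MPI
#include <mpi-ext.h>
#endif

void report_open_mpi_specific(void)
{
#if defined(OPEN_MPI) && OPEN_MPI
    /* Safe to call OMPI_-prefixed extension APIs inside this block. */
    printf("Built against Open MPI; OMPI_ extensions may be used.\n");
#else
    printf("Not Open MPI; taking the portable code path.\n");
#endif
}
```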
## Case 2

The MPI extensions mechanism in Open MPI was designed to help MPI
Forum members prototype new functionality that is intended for the
@@ -43,23 +40,21 @@ functionality is not only to be included in the MPI spec, but possibly
also be included in another MPI implementation.

As such, it seems reasonable to prefix public symbols in this type of
functionality with `MPIX_`. This commonly-used prefix allows the same
symbols to be available in multiple MPI implementations, and therefore
allows user code to easily check for it. E.g., user apps can check
for the presence of `MPIX_Foo` to know if both Open MPI and Other MPI
support the proposed `MPIX_Foo` functionality.

Of course, when using the `MPIX_` namespace, there is the possibility of
symbol name collisions. E.g., what if Open MPI has an `MPIX_Foo` and
Other MPI has a *different* `MPIX_Foo`?

While we technically can't prevent such collisions from happening, we
encourage extension authors to avoid such symbol clashes whenever
possible.

## Case 3

It is well-known that the MPI specification (intentionally) takes a
long time to publish. MPI implementers can typically know, with a
@@ -72,13 +67,13 @@ functionality early (i.e., before the actual publication of the
corresponding MPI specification document).

Case in point: the non-blocking collective operations that were
included in MPI-3.0 (e.g., `MPI_Ibarrier()`). It was known for a year
or two before MPI-3.0 was published that these functions would be
included in MPI-3.0.

There is a continual debate among the developer community: when
implementing such functionality, should the symbols be in the MPIX_
namespace or in the `MPI_` namespace? On one hand, the symbols are not
yet officially standardized -- *they could change* before publication.
On the other hand, developers who participate in the Forum typically
have a good sense for whether symbols are going to change before
@@ -89,35 +84,31 @@ before the MPI specification is published. ...and so on.
After much debate: for functionality that has a high degree of
confidence that it will be included in an upcoming spec (e.g., it has
passed at least one vote in the MPI Forum), our conclusion is that it
is OK to use the `MPI_` namespace.

Case in point: Open MPI released non-blocking collectives with the
`MPI_` prefix (not the `MPIX_` prefix) before the MPI-3.0
specification officially standardized these functions.

The rationale was threefold:

1. Let users use the functionality as soon as possible.
1. If OMPI initially creates `MPIX_Foo`, but eventually renames it to
   `MPI_Foo` when the MPI specification is published, then users will
   have to modify their codes to match. This is an artificial change
   inserted just to be "pure" to the MPI spec (i.e., it's a "lawyer's
   answer"). But since the `MPIX_Foo` -> `MPI_Foo` change is
   inevitable, it just ends up annoying users.
1. Once OMPI introduces `MPIX_` symbols, if we want to *not* annoy
   users, we'll likely have weak symbols / aliased versions of both
   `MPIX_Foo` and `MPI_Foo` once the Foo functionality is included in
   a published MPI specification. However, when can we delete the
   `MPIX_Foo` symbol? It becomes a continuing annoyance of backwards
   compatibility that we have to keep carrying forward.

For all these reasons, we believe that it's better to put
expected-upcoming official MPI functionality in the `MPI_` namespace,
not the `MPIX_` namespace.

All that being said, these are rules of thumb. They are not an
official mandate. There may well be cases where there are reasons to
@@ -2,7 +2,7 @@
# Copyright (c) 2004-2009 The Trustees of Indiana University and Indiana
#                         University Research and Technology
#                         Corporation. All rights reserved.
# Copyright (c) 2010-2012 Cisco Systems, Inc. All rights reserved.
# Copyright (c) 2010-2020 Cisco Systems, Inc. All rights reserved.
# $COPYRIGHT$
#
# Additional copyrights may follow
@@ -20,4 +20,4 @@

SUBDIRS = c

EXTRA_DIST = README.txt
EXTRA_DIST = README.md
30 ompi/mpiext/affinity/README.md (new regular file)
@@ -0,0 +1,30 @@
# Open MPI extension: Affinity

## Copyrights

```
Copyright (c) 2010-2012 Cisco Systems, Inc. All rights reserved.
Copyright (c) 2010 Oracle and/or its affiliates. All rights reserved.
```

## Authors

* Jeff Squyres, 19 April 2010, and 16 April 2012
* Terry Dontje, 18 November 2010

## Description

This extension provides a single new function, `OMPI_Affinity_str()`,
that takes a format value and then provides 3 prettyprint strings as
output:

* `fmt_type`: an enum that tells `OMPI_Affinity_str()` whether to
  use a resource description string or layout string format for the
  `ompi_bound` and `currently_bound` output strings.
* `ompi_bound`: describes what sockets/cores Open MPI bound this process
  to (or indicates that Open MPI did not bind this process).
* `currently_bound`: describes what sockets/cores this process is
  currently bound to (or indicates that it is unbound).
* `exists`: describes what processors are available in the current host.

See `OMPI_Affinity_str(3)` for more details.
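
A usage sketch follows. The constant `OMPI_AFFINITY_RSRC_STRING_FMT`
and the buffer size `OMPI_AFFINITY_STRING_MAX` used below are recalled
from the `OMPI_Affinity_str(3)` man page and should be verified there;
treat this as a sketch, not authoritative usage.

```c
/*
 * Sketch only: constant and buffer-size names should be checked against
 * OMPI_Affinity_str(3); the three output strings match the description
 * above (ompi_bound, currently_bound, exists).
 */
#include <stdio.h>
#include <mpi.h>
#if defined(OPEN_MPI) && OPEN_MPI
#include <mpi-ext.h>
#endif

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);
#if defined(OPEN_MPI) && OPEN_MPI
    char ompi_bound[OMPI_AFFINITY_STRING_MAX];
    char currently_bound[OMPI_AFFINITY_STRING_MAX];
    char exists[OMPI_AFFINITY_STRING_MAX];

    OMPI_Affinity_str(OMPI_AFFINITY_RSRC_STRING_FMT,
                      ompi_bound, currently_bound, exists);
    printf("ompi_bound: %s\ncurrently_bound: %s\nexists: %s\n",
           ompi_bound, currently_bound, exists);
#endif
    MPI_Finalize();
    return 0;
}
```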
@@ -1,29 +0,0 @@
# Copyright (c) 2010-2012 Cisco Systems, Inc. All rights reserved.
Copyright (c) 2010 Oracle and/or its affiliates. All rights reserved.

$COPYRIGHT$

Jeff Squyres
19 April 2010, and
16 April 2012

Terry Dontje
18 November 2010

This extension provides a single new function, OMPI_Affinity_str(),
that takes a format value and then provides 3 prettyprint strings as
output:

fmt_type: is an enum that tells OMPI_Affinity_str() whether to use a
resource description string or layout string format for ompi_bound and
currently_bound output strings.

ompi_bound: describes what sockets/cores Open MPI bound this process
to (or indicates that Open MPI did not bind this process).

currently_bound: describes what sockets/cores this process is
currently bound to (or indicates that it is unbound).

exists: describes what processors are available in the current host.

See OMPI_Affinity_str(3) for more details.
@@ -21,4 +21,4 @@

SUBDIRS = c

EXTRA_DIST = README.txt
EXTRA_DIST = README.md
11 ompi/mpiext/cuda/README.md (new regular file)
@@ -0,0 +1,11 @@
# Open MPI extension: Cuda

Copyright (c) 2015 NVIDIA, Inc. All rights reserved.

Author: Rolf vandeVaart

This extension provides a macro for compile time check of CUDA aware
support. It also provides a function for runtime check of CUDA aware
support.

See `MPIX_Query_cuda_support(3)` for more details.
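
A short sketch of the usual check pattern, combining the compile-time
macro and the runtime query (guarded so it also builds against non-Open
MPI implementations):

```c
/*
 * Sketch of combining the compile-time and runtime CUDA-aware checks.
 * <mpi-ext.h> provides the MPIX_CUDA_AWARE_SUPPORT macro and the
 * MPIX_Query_cuda_support() prototype when building with Open MPI.
 */
#include <stdio.h>
#include <mpi.h>
#if defined(OPEN_MPI) && OPEN_MPI
#include <mpi-ext.h>
#endif

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

#if defined(MPIX_CUDA_AWARE_SUPPORT) && MPIX_CUDA_AWARE_SUPPORT
    printf("Compile-time: this Open MPI was built with CUDA-aware support.\n");
    printf("Runtime: CUDA-aware support is %s.\n",
           MPIX_Query_cuda_support() ? "available" : "not available");
#else
    printf("No CUDA-aware support detected at compile time.\n");
#endif

    MPI_Finalize();
    return 0;
}
```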
@@ -1,11 +0,0 @@
# Copyright (c) 2015 NVIDIA, Inc. All rights reserved.

$COPYRIGHT$

Rolf vandeVaart

This extension provides a macro for compile time check of CUDA aware support.
It also provides a function for runtime check of CUDA aware support.

See MPIX_Query_cuda_support(3) for more details.

@@ -1,5 +1,5 @@
 #
-# Copyright (c) 2012 Cisco Systems, Inc. All rights reserved.
+# Copyright (c) 2020 Cisco Systems, Inc. All rights reserved.
 # $COPYRIGHT$
 #
 # Additional copyrights may follow
@@ -17,4 +17,4 @@

 SUBDIRS = c mpif-h use-mpi use-mpi-f08

-EXTRA_DIST = README.txt
+EXTRA_DIST = README.md

148  ompi/mpiext/example/README.md  Normal file
@@ -0,0 +1,148 @@
# Open MPI extension: Example

## Overview

This example MPI extension shows how to make an MPI extension for Open
MPI.

An MPI extension provides new top-level APIs in Open MPI that are
available to user-level applications (vs. adding new code/APIs that are
wholly internal to Open MPI).  MPI extensions are generally used to
prototype new MPI APIs, or to provide Open MPI-specific APIs to
applications.  This example MPI extension provides a new top-level MPI
API named `OMPI_Progress` that is callable in both C and Fortran.

MPI extensions are similar to Open MPI components, but due to
complex ordering requirements for the Fortran-based MPI bindings,
their build order is a little different.

Note that MPI has 4 different sets of bindings (C, Fortran `mpif.h`,
the Fortran `mpi` module, and the Fortran `mpi_f08` module), and Open
MPI extensions allow adding API calls to all 4 of them.  Prototypes
for the user-accessible functions/subroutines/constants are included
in the following publicly-available mechanisms:

* C: `mpi-ext.h`
* Fortran mpif.h: `mpif-ext.h`
* Fortran "use mpi": `use mpi_ext`
* Fortran "use mpi_f08": `use mpi_f08_ext`

This example extension defines a new top-level API named
`OMPI_Progress()` in all four binding types, and provides test programs
to call this API in each of the four binding types.  Code (and
comments) is worth 1,000 words -- see the code in this example
extension to understand how it works and how the build system builds
and inserts each piece into the publicly-available mechanisms (e.g.,
`mpi-ext.h` and the `mpi_f08_ext` module).
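
For illustration only, a sketch of how an application typically guards
its use of an extension.  The `OMPI_HAVE_MPI_EXT_EXAMPLE` macro name
assumes the usual `OMPI_HAVE_MPI_EXT_<NAME>` convention used by
`mpi-ext.h`; verify the exact name in your installed header.

```c
#include <stdio.h>
#include <mpi.h>
#if defined(OPEN_MPI)
#include <mpi-ext.h>   /* prototypes for all built extensions */
#endif

void report_example_extension(void)
{
    /* mpi-ext.h defines one OMPI_HAVE_MPI_EXT_<NAME> macro per built
     * extension; guarding keeps the code portable to builds (and MPI
     * implementations) that lack the extension. */
#if defined(OMPI_HAVE_MPI_EXT_EXAMPLE)
    printf("example extension available (OMPI_Progress is declared in mpi-ext.h)\n");
#else
    printf("example extension not available in this build\n");
#endif
}
```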
## Comparison to General Open MPI MCA Components

Here are the ways that MPI extensions are similar to Open MPI
components:

1. Extensions have a top-level `configure.m4` with a well-known m4 macro
   that is run during Open MPI's configure that determines whether the
   component wants to build or not.

   Note, however, that unlike components, extensions *must* have a
   `configure.m4`.  No other method of configuration is supported.

1. Extensions must adhere to normal Automake-based targets.  We
   strongly suggest that you use `Makefile.am`'s and have the
   extension's `configure.m4` `AC_CONFIG_FILE` each `Makefile.am` in
   the extension.  Using other build systems may work, but they are
   untested and unsupported.

1. Extensions create specifically-named libtool convenience archives
   (i.e., `*.la` files) that the build system slurps into higher-level
   libraries.

Unlike components, however, extensions:

1. Have a somewhat more rigid directory and file naming scheme.
1. Have up to four different, specifically-named subdirectories (one
   for each MPI binding type).
1. Also install some specifically-named header files (for C and the
   Fortran `mpif.h` bindings).

Similar to components, an MPI extension's name is determined by its
directory name: `ompi/mpiext/EXTENSION_NAME`

## Extension requirements

### Required: C API

Under this top-level directory, the extension *must* have a directory
named `c` (for the C bindings) that:

1. contains a file named `mpiext_EXTENSION_NAME_c.h` (a sketch of such
   a header appears after this list)
1. installs `mpiext_EXTENSION_NAME_c.h` to
   `$includedir/openmpi/mpiext/EXTENSION_NAME/c`
1. builds a Libtool convenience library named
   `libmpiext_EXTENSION_NAME_c.la`
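
To make the naming rules above concrete, here is a hypothetical
`mpiext_example_c.h` for an extension named `example`; the prototype
shown is illustrative only (each real extension declares whatever API
it actually provides).

```c
/* ompi/mpiext/example/c/mpiext_example_c.h
 *
 * Hypothetical public header for an extension named "example".
 * It is installed to $includedir/openmpi/mpiext/example/c and is
 * pulled into user code via mpi-ext.h.
 */
#ifndef MPIEXT_EXAMPLE_C_H
#define MPIEXT_EXAMPLE_C_H

/* Illustrative prototype only; the real example extension defines
 * its own OMPI_Progress() signature. */
int OMPI_Progress(void);

#endif /* MPIEXT_EXAMPLE_C_H */
```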
### Optional: `mpif.h` bindings

Optionally, the extension may have a directory named `mpif-h` (for the
Fortran `mpif.h` bindings) that:

1. contains a file named `mpiext_EXTENSION_NAME_mpifh.h`
1. installs `mpiext_EXTENSION_NAME_mpifh.h` to
   `$includedir/openmpi/mpiext/EXTENSION_NAME/mpif-h`
1. builds a Libtool convenience library named
   `libmpiext_EXTENSION_NAME_mpifh.la`

### Optional: `mpi` module bindings

Optionally, the extension may have a directory named `use-mpi` (for the
Fortran `mpi` module) that:

1. contains a file named `mpiext_EXTENSION_NAME_usempi.h`

***NOTE:*** The MPI extension system does NOT support building an
additional library in the `use-mpi` extension directory.  It is
assumed that the `use-mpi` bindings will use the same back-end symbols
as the `mpif.h` bindings, and that the only output product of the
`use-mpi` directory is a file to be included in the `mpi-ext` module
(i.e., strong Fortran prototypes for the functions/global variables in
this extension).

### Optional: `mpi_f08` module bindings

Optionally, the extension may have a directory named `use-mpi-f08` (for
the Fortran `mpi_f08` module) that:

1. contains a file named `mpiext_EXTENSION_NAME_usempif08.h`
1. builds a Libtool convenience library named
   `libmpiext_EXTENSION_NAME_usempif08.la`

See the comments in all the header and source files in this tree to
see what each file is for and what should be in each.

## Notes

Note that the build order of MPI extensions is a bit strange.  The
directories in an MPI extension are NOT traversed top-down in
sequential order.  Instead, due to ordering requirements when building
the Fortran module-based interfaces, each subdirectory in an extension
is traversed individually at different times in the overall Open MPI
build.

As such, `ompi/mpiext/EXTENSION_NAME/Makefile.am` is not traversed
during a normal top-level `make all` target.  This `Makefile.am`
exists for two reasons, however:

1. For the convenience of the developer, so that you can issue normal
   `make` commands at the top of your extension tree (e.g., `make all`
   will still build all bindings in an extension).
1. During a top-level `make dist`, extension directories *are*
   traversed top-down in sequence order.  Having a top-level
   `Makefile.am` in an extension allows `EXTRA_DIST`ing of files, such
   as this `README.md` file.

There are reasons for this strange ordering, but suffice it to say that
`make dist` doesn't have the same ordering requirements as `make all`,
and it is therefore easier to have a "normal" Automake-usual top-down
sequential directory traversal.

Enjoy!

@@ -8,3 +8,5 @@
 #

 SUBDIRS = c mpif-h use-mpi use-mpi-f08
+
+EXTRA_DIST = README.md

14  ompi/mpiext/pcollreq/README.md  Normal file
@@ -0,0 +1,14 @@
# Open MPI extension: pcollreq

Copyright (c) 2018 FUJITSU LIMITED. All rights reserved.

This extension provides persistent collective communication operations
and persistent neighborhood collective communication operations, which
are planned to be included in the next MPI Standard after MPI-3.1 (as
of Nov. 2018).

See `MPIX_Barrier_init(3)` for more details.

The code will be moved to the `ompi/mpi` directory and the `MPIX_`
prefix will be switched to the `MPI_` prefix once the MPI Standard that
includes this feature is published.
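
For illustration, a minimal sketch of a persistent barrier following
the `MPIX_Barrier_init(3)` pattern: initialize once, then start/wait
the same request repeatedly.

```c
#include <mpi.h>
#include <mpi-ext.h>   /* MPIX_* prototypes for built extensions */

/* Run 'iters' iterations, each ending in a persistent barrier. */
int run_iterations(MPI_Comm comm, int iters)
{
    MPI_Request req;
    MPIX_Barrier_init(comm, MPI_INFO_NULL, &req);

    for (int i = 0; i < iters; ++i) {
        /* ... per-iteration work ... */
        MPI_Start(&req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }

    MPI_Request_free(&req);
    return 0;
}
```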

@@ -8,3 +8,5 @@
 #

 SUBDIRS = c mpif-h use-mpi use-mpi-f08
+
+EXTRA_DIST = README.md

35  ompi/mpiext/shortfloat/README.md  Normal file
@@ -0,0 +1,35 @@
# Open MPI extension: shortfloat

Copyright (c) 2018 FUJITSU LIMITED. All rights reserved.

This extension provides additional MPI datatypes `MPIX_SHORT_FLOAT`,
`MPIX_C_SHORT_FLOAT_COMPLEX`, and `MPIX_CXX_SHORT_FLOAT_COMPLEX`,
which were proposed (with the `MPI_` prefix) in June 2017 for inclusion
in the MPI 4.0 standard.  As of February 2019, the proposal has not yet
been accepted.

See https://github.com/mpi-forum/mpi-issues/issues/65 for more details.

Each MPI datatype corresponds to the C/C++ type `short float`, the C
type `short float _Complex`, and the C++ type `std::complex<short
float>`, respectively.

In addition, this extension provides a datatype `MPIX_C_FLOAT16` for
the C type `_Float16`, which is defined in ISO/IEC JTC 1/SC 22/WG 14
N1945 (ISO/IEC TS 18661-3:2015).  This name and meaning are the same as
those of MPICH.  See https://github.com/pmodels/mpich/pull/3455.

This extension is enabled only if the C compiler supports `short float`
or `_Float16`, or the `--enable-alt-short-float=TYPE` option is passed
to the Open MPI `configure` script.

NOTE: The Clang 6.0.x and 7.0.x compilers support the `_Float16` type
(via software emulation), but require an additional linker flag to
function properly.  If you wish to enable Clang 6.0.x or 7.0.x's
software emulation of `_Float16`, use the following CLI options to the
Open MPI configure script:

```
./configure \
    LDFLAGS=--rtlib=compiler-rt \
    --with-wrapper-ldflags=--rtlib=compiler-rt ...
```
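
For illustration, a minimal sketch that uses the `MPIX_C_FLOAT16`
datatype; it assumes the compiler supports `_Float16` and that the
extension was enabled at configure time.

```c
#include <mpi.h>
#include <mpi-ext.h>   /* MPIX_C_FLOAT16 / MPIX_SHORT_FLOAT, when enabled */

/* Exchange a small _Float16 buffer between ranks 0 and 1. */
void exchange_halves(int rank, MPI_Comm comm)
{
    _Float16 buf[4] = {0};

    if (rank == 0) {
        MPI_Send(buf, 4, MPIX_C_FLOAT16, 1, 0, comm);
    } else if (rank == 1) {
        MPI_Recv(buf, 4, MPIX_C_FLOAT16, 0, 0, comm, MPI_STATUS_IGNORE);
    }
}
```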

113  opal/mca/btl/ofi/README.md  Normal file
@@ -0,0 +1,113 @@
# Design notes on BTL/OFI

This is the RDMA-only btl based on OFI Libfabric.  The goal is to
enable RDMA with multiple vendor hardware through one interface.  Most
of the operations are managed by the upper layer (osc/rdma).  This BTL
mostly does the low-level work.

Tested providers: sockets, psm2, ugni

## Component

This BTL requests the libfabric version 1.5 API and will not support
older versions.

The required capabilities of this BTL are `FI_ATOMIC` and `FI_RMA` with
the endpoint type `FI_EP_RDM` only.  This BTL does NOT support
libfabric providers that require local memory registration
(`FI_MR_LOCAL`).

BTL/OFI will initialize a module with ONLY the first compatible info
returned from OFI.  This means it will rely on the OFI provider to do
load balancing.  Support for multiple devices might be added later.

The BTL creates only one endpoint and one CQ.

## Memory Registration

Open MPI has a system in place to exchange remote addresses and always
uses the remote virtual address to refer to a piece of memory.  However,
some libfabric providers might not support the use of virtual addresses
and instead will use zero-based offset addressing.

`FI_MR_VIRT_ADDR` is the flag that determines this
behavior.  `mca_btl_ofi_reg_mem()` handles this by storing the base
address in the registration handle in case the provider does not
support `FI_MR_VIRT_ADDR`.  This base address will be used to calculate
the offset later in RDMA/Atomic operations.

The BTL will try to use the address of the registration handle as the
key.  However, if the provider supports `FI_MR_PROV_KEY`, it will use
the provider-provided key; the BTL simply does not care which.

The BTL does not register local operand or compare buffers.  This is
why this BTL does not support `FI_MR_LOCAL` and will allocate every
buffer before registering.  This means `FI_MR_ALLOCATED` is supported.
So, to be explicit:

Supported MR mode bits (will work with or without):

* enum:
  * `FI_MR_BASIC`
  * `FI_MR_SCALABLE`
* mode bits:
  * `FI_MR_VIRT_ADDR`
  * `FI_MR_ALLOCATED`
  * `FI_MR_PROV_KEY`

The BTL does NOT support (will not work with):

* `FI_MR_LOCAL`
* `FI_MR_MMU_NOTIFY`
* `FI_MR_RMA_EVENT`
* `FI_MR_ENDPOINT`

Just a reminder, in libfabric API 1.5:
`FI_MR_BASIC == (FI_MR_PROV_KEY | FI_MR_ALLOCATED | FI_MR_VIRT_ADDR)`
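
A schematic sketch (not the actual BTL code) of how the stored base
address is used: with `FI_MR_VIRT_ADDR` the peer's virtual address is
passed through unchanged, otherwise it is converted into a zero-based
offset.  The structure and variable names are invented for the sketch.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical registration-handle layout for the sketch. */
struct reg_handle {
    uint64_t base;   /* virtual address at registration time */
    uint64_t rkey;   /* remote key (provider key if FI_MR_PROV_KEY) */
};

/* Address to hand to fi_read()/fi_write() for a remote buffer. */
static uint64_t remote_rdma_addr(bool provider_uses_virt_addr,
                                 uint64_t remote_vaddr,
                                 const struct reg_handle *h)
{
    /* FI_MR_VIRT_ADDR: provider addresses RMA by virtual address.
     * Otherwise: provider expects an offset from the registered base. */
    return provider_uses_virt_addr ? remote_vaddr
                                   : remote_vaddr - h->base;
}
```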

## Completions

Every operation in this BTL is asynchronous.  The completion handling
occurs in `mca_btl_ofi_component_progress()`, where we read the CQ
with the completion context and execute the callback functions.  The
completions are local.  No remote completion event is generated, as
local completion already guarantees global completion.

The BTL keeps track of the number of outstanding operations and
provides a flush interface.
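
A schematic sketch of the progress pattern described above, using the
standard libfabric CQ read call; the callback type and context layout
are invented for the sketch (the real logic lives in
`mca_btl_ofi_component_progress()`).

```c
#include <rdma/fabric.h>
#include <rdma/fi_eq.h>
#include <rdma/fi_errno.h>

/* Hypothetical completion context carried in each posted operation. */
struct ofi_ctx {
    void (*cb)(struct ofi_ctx *ctx);   /* completion callback */
};

/* Drain up to 'max' completions from one CQ and run their callbacks. */
static int progress_cq(struct fid_cq *cq, int max)
{
    struct fi_cq_entry entries[8];
    int handled = 0;

    while (handled < max) {
        ssize_t n = fi_cq_read(cq, entries, 8);
        if (n == -FI_EAGAIN || n <= 0) {
            break;                      /* nothing (more) completed */
        }
        for (ssize_t i = 0; i < n; ++i) {
            struct ofi_ctx *ctx = (struct ofi_ctx *)entries[i].op_context;
            ctx->cb(ctx);               /* local completion == global completion */
        }
        handled += (int)n;
    }
    return handled;
}
```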

## Sockets Provider

The sockets provider is the proof-of-concept provider for libfabric.
It is supposed to support all of the OFI API with emulation.  This
provider is considered very slow and bound to raise problems that we
might not see with other, faster providers.

Known problems:

* The sockets provider uses a progress thread and can cause a segfault
  in finalize, as we free the resources while the progress thread is
  still using them.  `sleep(1)` was put in
  `mca_btl_ofi_component_close()` for this reason.
* The sockets provider deadlocks in two-sided mode.  Might be something
  about buffered recv.  (August 2018)

## Scalable Endpoint

This BTL will try to use a scalable endpoint to create communication
contexts.  This will increase multithreaded performance for some
applications.  The default number of contexts created is 1 and can be
tuned via the MCA parameter `btl_ofi_num_contexts_per_module`.  It is
advised that the number of contexts be equal to the number of physical
cores for optimal performance.

Users can disable the scalable endpoint with the MCA parameter
`btl_ofi_disable_sep`.  With the scalable endpoint disabled, the BTL
will alias the OFI endpoint to both the tx and rx contexts.

## Two-sided communication

Two-sided communication was added to BTL/OFI later on, to allow
non-tag-matching providers to be used in Open MPI with this BTL.
However, the support is only "functional" and has not been optimized
for performance at this point.  (August 2018)

126  opal/mca/btl/smcuda/README.md  Normal file
@@ -0,0 +1,126 @@
# Open MPI SMCUDA design document

Copyright (c) 2013 NVIDIA Corporation. All rights reserved.
August 21, 2013

This document describes the design and use of the `smcuda` BTL.

## BACKGROUND

The `smcuda` btl is a copy of the `sm` btl but with some additional
features.  The main extra feature is the ability to make use of the
CUDA IPC APIs to quickly move GPU buffers from one GPU to another.
Without this support, the GPU buffers would all be moved into and then
out of host memory.

## GENERAL DESIGN

The general design makes use of the large message RDMA RGET support in
the OB1 PML.  However, there are some interesting choices to make use
of it.  First, we disable any large message RDMA support in the BTL
for host messages.  This is done because we need to use
`mca_btl_smcuda_get()` for the GPU buffers.  This is also done because
the upper layers expect there to be a single mpool but we need one for
the GPU memory and one for the host memory.  Since the advantages of
using RDMA with host memory are unclear, we disabled it.  This means no
KNEM or CMA support is built in to the `smcuda` BTL.

Also note that we give the `smcuda` BTL a higher rank than the `sm`
BTL.  This means it will always be selected even if we are doing
host-only data transfers.  The `smcuda` BTL is not built if it is not
requested via the `--with-cuda` flag to the configure line.

Secondly, the `smcuda` does not make use of the traditional method of
enabling RDMA operations.  The traditional method checks for the existence
of an RDMA btl hanging off the endpoint.  The `smcuda` works in conjunction
with the OB1 PML and uses flags that it sends in the BML layer.

## OTHER CONSIDERATIONS

CUDA IPC is not necessarily supported by all GPUs on a node.  In NUMA
nodes, CUDA IPC may only work between GPUs that are not connected
over the IOH.  In addition, we want to check for CUDA IPC support lazily,
when the first GPU access occurs, rather than during `MPI_Init()` time.
This complicates the design.

## INITIALIZATION

When the `smcuda` BTL initializes, it starts with no support for CUDA IPC.
Upon the first access of a GPU buffer, the `smcuda` checks which GPU device
it has and sends that to the remote side using an `smcuda`-specific control
message.  The other rank receives the message, and checks to see if there
is CUDA IPC support between the two GPUs via a call to
`cuDeviceCanAccessPeer()`.  If it is true, then the `smcuda` BTL piggybacks on
the PML error handler callback to make a call into the PML and let it know
to enable CUDA IPC.  We created a new flag so that the error handler does
the right thing.  Large message RDMA is enabled by setting a flag in the
`bml->btl_flags` field.  Control returns to the `smcuda` BTL where a reply
message is sent so the sending side can set its flag.

At that point, the PML layer starts using the large message RDMA
support in the `smcuda` BTL.  This is done in some special CUDA code
in the PML layer.

## ESTABLISHING CUDA IPC SUPPORT

A check has been added into both the `send` and `sendi` paths in the
`smcuda` btl that checks to see if it should send a CUDA IPC setup
request message.

```c
/* Initiate setting up CUDA IPC support. */
if (mca_common_cuda_enabled && (IPC_INIT == endpoint->ipcstatus)) {
    mca_btl_smcuda_send_cuda_ipc_request(btl, endpoint);
}
```

The first check is to see if the CUDA environment has been
initialized.  If not, then presumably we are not sending any GPU
buffers yet and there is nothing to be done.  If we are initialized,
then check the status of the CUDA IPC endpoint.  If it is in the
IPC_INIT stage, then call the function to send a control message to
the endpoint.

On the receiving side, we first check to see if we are initialized.
If not, then send a message back to the sender saying we are not
initialized.  This will cause the sender to reset its state to
IPC_INIT so it can try again on the next send.

I considered putting the receiving side into a new state like
IPC_NOTREADY, and then when it switches to ready, sending the
ACK to the sender.  The problem with this is that we would need to do
these checks during the progress loop, which adds some extra overhead,
as we would have to check all endpoints to see if they were ready.

Note that any rank can initiate the setup of CUDA IPC.  It is
triggered by whichever side does a send or sendi call of a GPU buffer.

I have the sender attempt 5 times to set up the connection.  After
that, we give up.  Note that I do not expect many scenarios where the
sender has to resend.  It could happen in a race condition where one
rank has initialized its CUDA environment but the other side has not.

There are several states the connections can go through (a schematic
enum follows the list):

1. IPC_INIT - nothing has happened
1. IPC_SENT - message has been sent to other side
1. IPC_ACKING - Received request and figuring out what to send back
1. IPC_ACKED - IPC ACK sent
1. IPC_OK - IPC ACK received back
1. IPC_BAD - Something went wrong, so marking as no IPC support
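
For reference, the progression above maps naturally onto a small enum;
the names below mirror the list, though the actual `smcuda` source may
spell them differently.

```c
/* Illustrative only: per-endpoint CUDA IPC negotiation state. */
enum ipc_state {
    IPC_INIT,    /* nothing has happened yet */
    IPC_SENT,    /* request message sent to the other side */
    IPC_ACKING,  /* request received; deciding what to send back */
    IPC_ACKED,   /* IPC ACK sent */
    IPC_OK,      /* IPC ACK received back; CUDA IPC can be used */
    IPC_BAD      /* something went wrong; no IPC support */
};
```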

## NOTE ABOUT CUDA IPC AND MEMORY POOLS

The CUDA IPC support works in the following way.  A sender makes a
call to `cuIpcGetMemHandle()` and gets a memory handle for its local
memory.  The sender then sends that handle to the receiving side.  The
receiver calls `cuIpcOpenMemHandle()` using that handle and gets back
an address to the remote memory.  The receiver then calls
`cuMemcpyAsync()` to initiate a remote read of the GPU data.

The receiver maintains a cache of remote memory that it has handles
open on.  This is because a call to `cuIpcOpenMemHandle()` can be very
expensive (90 usec), so we want to avoid it when we can.  The cache of
remote memory is kept in a memory pool that is associated with each
endpoint.  Note that we do not cache the local memory handles because
getting them is very cheap and there is no need.
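
The driver-API sequence described above, reduced to a sketch; error
handling and the network exchange of the handle are omitted, and the
per-endpoint cache is represented only by a comment.

```c
#include <cuda.h>

/* Sender side: produce an IPC handle for a local GPU buffer.
 * The handle is then shipped to the peer in a control message. */
static CUresult export_gpu_buffer(CUdeviceptr local_buf, CUipcMemHandle *handle)
{
    return cuIpcGetMemHandle(handle, local_buf);   /* cheap; not cached */
}

/* Receiver side: map the peer's buffer and read from it asynchronously.
 * In the real BTL the cuIpcOpenMemHandle() result is cached per endpoint
 * because the call costs on the order of 90 usec. */
static CUresult read_gpu_buffer(CUipcMemHandle handle, CUdeviceptr local_dst,
                                size_t bytes, CUstream stream)
{
    CUdeviceptr remote_src;
    CUresult rc = cuIpcOpenMemHandle(&remote_src, handle,
                                     CU_IPC_MEM_LAZY_ENABLE_PEER_ACCESS);
    if (rc != CUDA_SUCCESS) {
        return rc;
    }
    return cuMemcpyAsync(local_dst, remote_src, bytes, stream);
}
```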

@@ -27,7 +27,7 @@

 AM_CPPFLAGS = $(opal_ofi_CPPFLAGS) -DOMPI_LIBMPI_NAME=\"$(OMPI_LIBMPI_NAME)\"

-EXTRA_DIST = README.txt README.test
+EXTRA_DIST = README.md README.test

 dist_opaldata_DATA = \
        help-mpi-btl-usnic.txt

330  opal/mca/btl/usnic/README.md  Normal file
@@ -0,0 +1,330 @@
# Design notes on usnic BTL

## nomenclature

* fragment - something the PML asks us to send or put, any size
* segment - something we can put on the wire in a single packet
* chunk - a piece of a fragment that fits into one segment

A segment can contain either an entire fragment or a chunk of a fragment.

Each segment and fragment has an associated descriptor.

Each segment data structure has a block of registered memory associated
with it which matches the MTU for that segment.

* ACK - ACKs get special small segments with only enough memory for an ACK
* non-ACK segments always have a parent fragment

* fragments are either large (> MTU) or small (<= MTU)
* a small fragment has a segment descriptor embedded within it since it
  always needs exactly one.
* a large fragment has no permanently associated segments, but allocates
  them as needed.

## channels

A channel is a queue pair with an associated completion queue.
Each channel has its own MTU and r/w queue entry counts.

There are 2 channels, command and data:
* the command queue is generally for higher priority fragments
* the data queue is for standard data traffic
* the command queue should possibly be called the "priority" queue

The command queue is shorter and has a smaller MTU than the data queue.
This makes the command queue a lot faster than the data queue, so we
hijack it for sending very small fragments (<= tiny_mtu, currently 768
bytes).

The command queue is used for ACKs and tiny fragments.
The data queue is used for everything else.

PML fragments marked priority should perhaps use the command queue.

## sending

Normally, all send requests are simply enqueued and then actually posted
to the NIC by the routine `opal_btl_usnic_module_progress_sends()`.
"Fastpath" tiny sends are the exception.

Each module maintains a queue of endpoints that are ready to send.
An endpoint is ready to send if all of the following are met:

1. the endpoint has fragments to send
1. the endpoint has send credits
1. the endpoint's send window is "open" (not full of un-ACKed segments)

Each module also maintains a list of segments that need to be retransmitted.
Note that the list of pending retransmissions is per-module, not per-endpoint.

Send progression first posts any pending retransmissions, always using
the data channel.  (The reason is that if we start getting heavy
congestion and there are lots of retransmits, it becomes more
important than ever to prioritize ACKs; clogging the command channel
with retransmitted data makes things worse, not better.)

Next, progression loops sending segments to the endpoint at the top of
the `endpoints_with_sends` queue.  When an endpoint exhausts its send
credits or fills its send window or runs out of segments to send, it
removes itself from the `endpoint_with_sends` list.  Any pending ACKs
will be picked up and piggy-backed on these sends.

Finally, any endpoints that still need ACKs and whose timer has expired
will be sent explicit ACK packets.

## fragment sending

The middle part of the progression loop handles both small
(single-segment) and large (multi-segment) sends.

For small fragments, the verbs descriptor within the embedded segment
is updated with the length, the BTL header is updated, then we call
`opal_btl_usnic_endpoint_send_segment()` to send the segment.  After
posting, we make a PML callback if needed.

For large fragments, a little more is needed.  Segments from a large
fragment have a slightly larger BTL header which contains a fragment
ID, an offset, and a size.  The fragment ID is allocated when the
first chunk of the fragment is sent.  A segment gets allocated, the
next blob of data is copied into this segment, and the segment is
posted.  If the last chunk of the fragment was sent, we perform the
callback if needed, then remove the fragment from the endpoint send
queue.

## `opal_btl_usnic_endpoint_send_segment()`

This is common posting code for large or small segments.  It assigns a
sequence number to a segment, checks for an ACK to piggy-back,
posts the segment to the NIC, and then starts the retransmit timer
by checking the segment into the hotel.  Send credits are consumed here.

## send dataflow

PML control messages with no user data are sent via:

* `desc = usnic_alloc(size)`
* `usnic_send(desc)`

User messages shorter than the eager limit and the 1st part of larger
messages are sent via:

* `desc = usnic_prepare_src(convertor, size)`
* `usnic_send(desc)`

Larger messages:

* `desc = usnic_prepare_src(convertor, size)`
* `usnic_put(desc)`

`usnic_alloc()` currently asserts the length is "small", then allocates
and fills in a small fragment.  The src pointer will point to the start
of the associated registered memory + sizeof(BTL header), and the PML
will put its data there.

`usnic_prepare_src()` allocates either a large or small fragment based
on size.  The fragment descriptor is filled in to have 2 SG entries,
the 1st pointing to the place where the PML should construct its
header.  If the data convertor says the data is contiguous, the 2nd SG
entry points to the user buffer, else it is null and sf_convertor is
filled in with the address of the convertor.

### `usnic_send()`

If the fragment being sent is small enough, has contiguous data, and
"very few" command queue send WQEs have been consumed, `usnic_send()`
does a fastpath send.  This means it posts the segment immediately to
the NIC with the INLINE flag set.

If all of the conditions for a fastpath send are not met, and this is a
small fragment, the user data is copied into the associated registered
memory at this time and the SG list in the descriptor is collapsed to
one entry.

After the checks above are done, the fragment is enqueued to be sent
via `opal_btl_usnic_endpoint_enqueue_frag()`.

### `usnic_put()`

Do a fast version of what happens in `prepare_src()` (we can take
shortcuts because we know it will always be a contiguous buffer / no
convertor needed).  The PML gives us the destination address, which we
save on the fragment (which is the sentinel value that the underlying
engine uses to know that this is a PUT and not a SEND), and the
fragment is enqueued for processing.

### `opal_btl_usnic_endpoint_enqueue_frag()`

This appends the fragment to the "to be sent" list of the endpoint and
conditionally adds the endpoint to the list of endpoints with data to
send via `opal_btl_usnic_check_rts()`.
## receive dataflow
|
||||||
|
|
||||||
|
BTL packets has one of 3 types in header: frag, chunk, or ack.
|
||||||
|
|
||||||
|
* A frag packet is a full PML fragment.
|
||||||
|
* A chunk packet is a piece of a fragment that needs to be reassembled.
|
||||||
|
* An ack packet is header only with a sequence number being ACKed.
|
||||||
|
|
||||||
|
* Both frag and chunk packets go through some of the same processing.
|
||||||
|
* Both may carry piggy-backed ACKs which may need to be processed.
|
||||||
|
* Both have sequence numbers which must be processed and may result in
|
||||||
|
dropping the packet and/or queueing an ACK to the sender.
|
||||||
|
|
||||||
|
frag packets may be either regular PML fragments or PUT segments. If
|
||||||
|
the "put_addr" field of the BTL header is set, this is a PUT and the
|
||||||
|
data is copied directly to the user buffer. If this field is NULL,
|
||||||
|
the segment is passed up to the PML. The PML is expected to do
|
||||||
|
everything it needs with this packet in the callback, including
|
||||||
|
copying data out if needed. Once the callback is complete, the
|
||||||
|
receive buffer is recycled.
|
||||||
|
|
||||||
|
chunk packets are parts of a larger fragment. If an active fragment
|
||||||
|
receive for the matching fragment ID cannot be found, and new fragment
|
||||||
|
info descriptor is allocated. If this is not a PUT (`put_addr == NULL`),
|
||||||
|
we `malloc()` data to reassemble the fragment into. Each
|
||||||
|
subsequent chunk is copied either into this reassembly buffer or
|
||||||
|
directly into user memory. When the last chunk of a fragment arrives,
|
||||||
|
a PML callback is made for non-PUTs, then the fragment info descriptor
|
||||||
|
is released.
|
||||||
|
|
||||||
|
## fast receive optimization

In order to optimize latency of small packets, the component progress
routine implements a fast path for receives.  If the first completion
is a receive on the priority queue, it is handled by a routine called
`opal_btl_usnic_recv_fast()`, which does nothing but validate that the
packet is OK to be received (sequence number OK and not a DUP) and
then deliver it to the PML.  The packet is recorded in the channel
structure, and all bookkeeping for it is deferred until the next time
`component_progress` is called.
This fast path cannot be taken every time we pass through
`component_progress` because there will be other completions that need
processing, and the receive bookkeeping for one fast receive must be
complete before allowing another fast receive to occur, as only one
recv segment can be saved for deferred processing at a time.  This is
handled by maintaining a variable in `opal_btl_usnic_recv_fast()`
called `fastpath_ok`, which is set to false every time the fastpath is
taken.  A call into the regular progress routine sets this flag back
to true.
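A heavily simplified sketch of that gating, assuming hypothetical
helper functions for the completion check and the full progress path:

```c
#include <stdbool.h>

/* Hypothetical helpers, assumed to exist elsewhere for this sketch. */
bool first_completion_is_priority_recv(void);
int  full_progress(void);

static bool fastpath_ok = true;

static int component_progress_sketch(void)
{
    if (fastpath_ok && first_completion_is_priority_recv()) {
        /* Validate (seq OK, not a DUP), deliver to the PML, and remember
         * the segment; its bookkeeping is deferred to the next call. */
        fastpath_ok = false;
        return 1;
    }

    /* Regular path: finish any deferred bookkeeping, drain completions,
     * and re-arm the fast path. */
    fastpath_ok = true;
    return full_progress();
}
```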
## reliability

* Every packet has a sequence number.
* Each endpoint has a "send window", currently 4096 entries.
* Once a segment is sent, it is saved in the window array until an ACK
  is received.
* ACKs acknowledge all packets <= the specified sequence number.
* The receiver only ACKs a sequence number when all packets up to that
  sequence have arrived.

* Each packet has a default retransmission timer of 100ms.
* A packet will be scheduled for retransmission if its timer expires.
Once a segment is sent, it always has its retransmit timer started;
this is accomplished by `opal_hotel_checkin()`.  Any time a segment is
posted to the NIC for retransmit, it is checked out of the hotel
(timer stopped).  A small sketch of this checkin/checkout pattern
appears after the list below.

So, a send segment is always in one of 4 states:

* on the free list, unallocated
* on the endpoint to-send list (for a segment associated with a small
  fragment)
* posted to the NIC and in the hotel awaiting an ACK
* on the module re-send list awaiting retransmission
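A sketch of how that maps onto the OPAL hotel; the checkin/checkout
signatures below are quoted from memory of `opal/class/opal_hotel.h`
and should be treated as an assumption, and the hotel/segment
variables are stand-ins for the real usnic structures:

```c
#include "opal/constants.h"
#include "opal/class/opal_hotel.h"

/* Sketch only: arm/stop the retransmit timer via the hotel. */
static void arm_retrans_timer(opal_hotel_t *hotel, void *segment, int *room)
{
    /* Checking in starts the eviction (retransmit) timer for this segment. */
    if (OPAL_SUCCESS != opal_hotel_checkin(hotel, segment, room)) {
        /* Hotel full: the segment would have to be handled some other
         * way, e.g. queued for immediate retransmission. */
    }
}

static void stop_retrans_timer(opal_hotel_t *hotel, int room)
{
    /* An ACK arrived, or the segment is being re-posted to the NIC. */
    opal_hotel_checkout(hotel, room);
}
```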
Receiver:

* If a packet with seq >= the expected seq is received, schedule an
  ACK of the largest in-order sequence received, if one is not already
  scheduled.  The default delay is 50us.
* If a packet with seq < the expected seq arrives, we send an ACK
  immediately, as this indicates a lost ACK.

Sender:

* A duplicate ACK triggers an immediate retransmission if one is not
  already pending for that segment.
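The receiver policy above can be summarized in a small sketch
(hypothetical structure and names; sequence-number wraparound is
ignored for brevity):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-endpoint receive state, not the real usnic struct. */
struct rx_endpoint_state {
    uint32_t expected_seq;      /* next in-order sequence we want */
    uint32_t highest_in_order;  /* largest in-order sequence received */
    bool     ack_scheduled;     /* a delayed (~50us) ACK is already pending */
};

/* Returns true if an ACK should be sent immediately. */
static bool note_arrival(struct rx_endpoint_state *ep, uint32_t seq)
{
    if (seq < ep->expected_seq) {
        /* Old packet: our previous ACK was probably lost -- ACK right away. */
        return true;
    }
    if (seq == ep->expected_seq) {
        ep->highest_in_order = seq;
        ep->expected_seq = seq + 1;
    }
    if (!ep->ack_scheduled) {
        ep->ack_scheduled = true;   /* schedule a delayed ACK of highest_in_order */
    }
    return false;
}
```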
## Reordering induced by two queues and piggy-backing

ACKs can be reordered:

* Not an issue at all; old ACKs are simply ignored.

Sends can be reordered:

* A small send can jump far ahead of large sends.
* A large send followed by lots of small sends could trigger many
  retransmissions of the large sends.  The smalls would have to be
  paced pretty precisely to keep the command queue empty enough and
  also beat out the large sends.  Send credits limit how many larges
  can be queued on the sender, but there could be many on the
  receiver.
## RDMA emulation

We emulate the RDMA PUT because it's more efficient than a regular
send: it allows the receiver to copy directly to the target buffer
(vs. making an intermediate copy out of the bounce buffer).

It would actually be better to morph this PUT into a GET -- GET would
be slightly more efficient.  In short, when the target requests the
actual RDMA data, with PUT, the request has to go up to the PML, which
will then invoke PUT on the source's BTL module.  With GET, the target
issues the GET, and the source BTL module can reply without needing to
go up the stack to the PML.
Once we start supporting RDMA in hardware (a hypothetical sketch of
the registration pieces follows this list):

* We need to provide `module.btl_register_mem` and
  `module.btl_deregister_mem` functions (see openib for an example).
* We need to put something meaningful in
  `btl_usnic_frag.h:mca_btl_base_registration_handle_t`.
* We need to set `module.btl_registration_handle_size` to
  `sizeof(struct mca_btl_base_registration_handle_t)`.
* `module.btl_put` / `module.btl_get` will receive the
  `mca_btl_base_registration_handle_t` from the peer as a cookie.
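As an illustration only (the handle's contents below are hypothetical;
the BTL v3.0 interface lets each BTL define them), the registration
pieces could look roughly like this:

```c
#include <stdint.h>

/* Hypothetical contents: what the registration handle could hold once
 * hardware RDMA is supported.  The BTL itself defines this struct and
 * advertises its size via module.btl_registration_handle_size; the
 * field names here are made up. */
struct mca_btl_base_registration_handle_t {
    uint32_t lkey;   /* key the local VIC needs to access the registered region */
    uint32_t rkey;   /* key handed to the peer; arrives back as the put/get cookie */
};

/* At module init (pseudo-steps, since the real module type is not shown
 * here):
 *   module.btl_registration_handle_size =
 *       sizeof(struct mca_btl_base_registration_handle_t);
 *   module.btl_register_mem / module.btl_deregister_mem = usnic-provided
 *       registration functions (see the openib BTL for the expected shape).
 */
```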
Also, `module.btl_put` / `module.btl_get` do not need to make
descriptors (this was an optimization added in BTL 3.0); they are now
called with enough information to do whatever they need to do.
`module.btl_put` still makes a descriptor and submits it to the usnic
sending engine so as to utilize a common infrastructure for send and
put.

But it doesn't necessarily have to be that way -- we could optimize
out the use of the descriptors.  We have not investigated how
easy/hard that would be.
## libfabric abstractions

* `fi_fabric`: corresponds to a VIC PF
* `fi_domain`: corresponds to a VIC VF
* `fi_endpoint`: resources inside the VIC VF (basically a QP)
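For reference, a minimal sketch of how these three objects are opened
with the libfabric API when asking for the usnic provider; this is not
the BTL's actual initialization code, and error handling and most
attributes are omitted:

```c
#include <string.h>
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_endpoint.h>

static int open_usnic_objects(struct fid_fabric **fabric,
                              struct fid_domain **domain,
                              struct fid_ep **ep)
{
    struct fi_info *hints = fi_allocinfo();
    struct fi_info *info = NULL;
    int ret;

    hints->fabric_attr->prov_name = strdup("usnic");
    ret = fi_getinfo(FI_VERSION(1, 1), NULL, NULL, 0, hints, &info);

    if (0 == ret) ret = fi_fabric(info->fabric_attr, fabric, NULL); /* VIC PF  */
    if (0 == ret) ret = fi_domain(*fabric, info, domain, NULL);     /* VIC VF  */
    if (0 == ret) ret = fi_endpoint(*domain, info, ep, NULL);       /* QP-like */

    if (NULL != info) {
        fi_freeinfo(info);
    }
    fi_freeinfo(hints);
    return ret;
}
```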
## `MPI_THREAD_MULTIPLE` support

In order to make the usnic BTL thread-safe, mutex locks are used to
protect the critical path, i.e., libfabric routines, bookkeeping, etc.

The lock in question is `btl_usnic_lock`.  It is a RECURSIVE lock,
meaning that the same thread can take the lock again even if it
already holds it; this allows a callback function to post another
segment right away if we know that the current segment completed
inline (so we can call send within send without deadlocking).
These two functions take care of hotel checkin/checkout, and we have
to protect that part, so we take the mutex lock before entering them:

* `opal_btl_usnic_check_rts()`
* `opal_btl_usnic_handle_ack()`

The calls into libfabric routines have to be protected as well:

* `opal_btl_usnic_endpoint_send_segment()` (`fi_send`)
* `opal_btl_usnic_recv_call()` (`fi_recvmsg`)
The cclient connectivity checking (`opal_btl_usnic_connectivity_ping`)
also has to be protected.  This happens only at the beginning, but the
cclient communicates with the cagent through `opal_fd_read/write()`,
and if two or more clients do `opal_fd_write()` at the same time, the
data might be corrupted.

With this scheme, many functions in btl/usnic that call the functions
listed above are protected by the `OPAL_THREAD_LOCK` macro, which is
only active if the user invoked `MPI_Init_thread()` with
`MPI_THREAD_MULTIPLE` support.
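A sketch of the resulting pattern; the header path and where
`btl_usnic_lock` is declared vary across Open MPI versions, so treat
those details as assumptions:

```c
#include "opal/threads/mutex.h"   /* OPAL_THREAD_LOCK / OPAL_THREAD_UNLOCK */

extern opal_mutex_t btl_usnic_lock;   /* assumed: constructed as a recursive mutex */

static void locked_send_example(void)
{
    OPAL_THREAD_LOCK(&btl_usnic_lock);
    /* ... opal_btl_usnic_endpoint_send_segment(), hotel bookkeeping, ...
     * Because the lock is recursive, a completion callback that fires
     * inline can re-enter and post another segment without deadlocking. */
    OPAL_THREAD_UNLOCK(&btl_usnic_lock);
}
```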
@ -1,383 +0,0 @@
======================================

November 2014 / SC 2014
Update February 2015

The usnic BTL code has been unified across master and the v1.8
branches.

  NOTE: As of May 2018, this is no longer true.  This was generally
  only necessary back when the BTLs were moved from the OMPI layer to
  the OPAL layer.  Now that the BTLs have been down in OPAL for
  several years, this tomfoolery is no longer necessary.  This note
  is kept for historical purposes, just in case someone needs to go
  back and look at the v1.8 series.

That is, you can copy the code from v1.8:ompi/mca/btl/usnic/* to
master:opal/mca/btl/usnic*, and then only have to make 3 changes in
the resulting code in master:

1. Edit Makefile.am: s/ompi/opal/gi
2. Edit configure.m4: s/ompi/opal/gi
   --> EXCEPT for:
       - opal_common_libfabric_* (which will eventually be removed,
         when the embedded libfabric goes away)
       - OPAL_BTL_USNIC_FI_EXT_USNIC_H (which will eventually be
         removed, when the embedded libfabric goes away)
       - OPAL_VAR_SCOPE_*
3. Edit Makefile.am: change -DBTL_IN_OPAL=0 to -DBTL_IN_OPAL=1

*** Note: the BTL_IN_OPAL preprocessor macro is set in Makefile.am
    rather than in btl_usnic_compat.h to avoid all kinds of include
    file dependency issues (i.e., btl_usnic_compat.h would need to be
    included first, but it requires some data structures to be
    defined, which means it either can't be first or we have to
    declare various structs first... just put BTL_IN_OPAL in
    Makefile.am and be happy).

*** Note 2: CARE MUST BE TAKEN WHEN COPYING THE OTHER DIRECTION!  It
    is *not* as simple as a simple s/opal/ompi/gi in configure.m4 and
    Makefile.am.  It certainly can be done, but there are a few
    strings that need to stay "opal" or "OPAL" (e.g., OPAL_HAVE_FOO).
    Hence, the string replace will likely need to be done via manual
    inspection.

Things still to do:

- VF/PF sanity checks in component.c:check_usnic_config() use
  usnic-specific fi_provider info.  The exact mechanism might change
  as provider-specific info is still being discussed upstream.

- component.c:usnic_handle_cq_error is using a USD_* constant from
  usnic_direct.  Need to get that value through libfabric somehow.
oshmem/mca/memheap/README.md (new file)
@ -0,0 +1,71 @@
# MEMHEAP infrastructure documentation

Copyright (c) 2013 Mellanox Technologies, Inc.
All rights reserved

The MEMHEAP infrastructure is responsible for managing the symmetric
heap.  The framework currently has the following components: buddy and
ptmalloc.  Buddy uses a buddy allocator in order to manage the memory
allocations on the symmetric heap; ptmalloc is an adaptation of
ptmalloc3.

Additional components may be added easily to the framework by defining
the component's and the module's base and extended structures, and
their functionalities.
The buddy allocator has the following data structures:

1. Base component - of type `struct mca_memheap_base_component_2_0_0_t`
2. Base module - of type `struct mca_memheap_base_module_t`
3. Buddy component - of type `struct mca_memheap_base_component_2_0_0_t`
4. Buddy module - of type `struct mca_memheap_buddy_module_t`, extending
   the base module (`struct mca_memheap_base_module_t`)
Each data structure includes the following fields:

1. Base component - memheap_version, memheap_data and memheap_init
2. Base module - holds pointers to the base component and to the
   functions: alloc, free and finalize
3. Buddy component - is a base component.
4. Buddy module - extends the base module and holds additional data on
   the component's priority, the buddy allocator, the maximal order of
   the symmetric heap, the symmetric heap itself, a pointer to the
   symmetric heap, and a hashtable maintaining the size of each
   allocated address.
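Purely as an illustration of that shape (member names and signatures
below are hypothetical; see the real definitions under
oshmem/mca/memheap):

```c
#include <stddef.h>

/* Hypothetical sketch of a base module holding its component plus the
 * alloc/free/finalize entry points described above. */
struct mca_memheap_base_component_2_0_0_t;   /* defined by the framework */

typedef struct mca_memheap_base_module_t {
    struct mca_memheap_base_component_2_0_0_t *memheap_component;
    int (*memheap_alloc)(size_t size, void **ptr);
    int (*memheap_free)(void *ptr);
    int (*memheap_finalize)(void);
} mca_memheap_base_module_t;
```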
In the case that the user decides to implement additional components,
the MEMHEAP infrastructure chooses the component with the maximal
priority.  Handling the component opening is done under the base
directory, in three stages:

1. Open all available components.  Implemented by memheap_base_open.c
   and called from shmem_init.
2. Select the maximal priority component.  This procedure involves the
   initialization of all components and then their finalization,
   except for the chosen component.  It is implemented by
   memheap_base_select.c and called from shmem_init.
3. Close the max priority active component.  Implemented by
   memheap_base_close.c and called from shmem_finalize.
## Buddy Component/Module

Responsible for handling the entire set of activities of the symmetric
heap.  The supported activities are:

1. buddy_init (initialization)
1. buddy_alloc (allocates a variable on the symmetric heap)
1. buddy_free (frees a variable previously allocated on the symmetric heap)
1. buddy_finalize (finalization)
Data members of the buddy module:

1. priority: the module's priority.
1. buddy allocator: bits, num_free, lock, and the maximal order (log2
   of the maximal size) of a variable on the symmetric heap.  The
   buddy allocator gives the offset in the symmetric heap where a
   variable should be allocated.
1. symmetric_heap: a range of reserved addresses (equal in all
   executing PEs) dedicated to "shared memory" allocation.
1. symmetric_heap_hashtable: holds the size of an allocated variable
   on the symmetric heap; used to free an allocated variable on the
   symmetric heap.
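A sketch of how those pieces fit together on the allocation path,
using hypothetical helper names for the buddy allocator and the
hashtable:

```c
#include <stddef.h>

/* Hypothetical helpers standing in for the buddy allocator and the
 * hashtable described above. */
size_t buddy_take_block(unsigned int order);
void   hashtable_put(void *addr, size_t size);

/* log2 of the smallest power-of-two block that fits the request */
static unsigned int order_for(size_t size)
{
    unsigned int order = 0;
    while (((size_t)1 << order) < size) {
        order++;
    }
    return order;
}

static void *buddy_alloc_sketch(char *symmetric_heap_base, size_t size)
{
    size_t offset = buddy_take_block(order_for(size)); /* offset into the heap */
    void *addr = symmetric_heap_base + offset;
    hashtable_put(addr, size);   /* remembered so free can recover the size */
    return addr;
}
```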
test/runtime/README.md (new file)
@ -0,0 +1,20 @@
The functions in this directory are all intended to test registry
operations against a persistent seed.  Thus, they perform a system
init/finalize.  The functions in the directory above this one should
be used to test basic registry operations within the replica - they
will isolate the replica so as to avoid the communications issues and
the init/finalize problems in other subsystems that may cause problems
here.

To run these tests, you need to first start a persistent daemon.  This
can be done using the command:

```
orted --seed --scope public --persistent
```
The daemon will "daemonize" itself and establish the registry (as well
as other central services) replica, and then return a system prompt.
You can then run any of these functions.  If desired, you can utilize
gdb and/or debug options on the persistent orted to watch/debug
replica operations as well.